<stereobutter[m]>
Hi folks :) I'm currently toying with gitops-based system upgrades for FCOS, which means I will be calling `rpm-ostree deploy <some_version>`. What I'd like to do is use the graph data provided by cincinnati to validate that the current version is compatible with `some_version`. However, Cincinnati's update graph apparently only contains edges going from a version to the most current version.
<stereobutter[m]>
Say I'm on 32.20200629.3.0 and the new version is 32.20200715.3.0. I assume this is a valid transition (but the edge is omitted from the graph because the number of edges would explode) and is implied by both versions pointing at 32.20201104.3.0?
<stereobutter[m]>
So the graph would look something like the above if one painted in the implicit edges
<lucab>
stereobutter: yes, it's just that the graph is optimized to minimize the amount of updates/reboots
jpn has quit [Ping timeout: 246 seconds]
<stereobutter[m]>
And I assume for downgrading the same rules apply. I can only go back to a version *A* < *B* when *A* -> *B* is a valid upgrade transition?
jpn has joined #fedora-coreos
<lucab>
stereobutter: no, downgrades in general have no guarantees. You cannot invert the direction of the graph and try to infer anything.
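A minimal sketch of the check discussed above, in Python. It assumes the graph has already been fetched and parsed into a dict in the Cincinnati JSON shape (a `nodes` list whose entries carry a `version`, and an `edges` list of `[from_index, to_index]` pairs). The name `is_plausible_upgrade` and the shared-target heuristic are illustrative only and are not part of Zincati or Cincinnati; downgrades are always rejected because the graph says nothing about them.

```python
def parse_version(version):
    """Turn '32.20200629.3.0' into a tuple of ints for ordering."""
    return tuple(int(part) for part in version.split("."))


def is_plausible_upgrade(graph, current, target):
    """Heuristic: accept a direct edge, or an implied edge where both
    versions point at a common later release. Never accept downgrades."""
    index = {node["version"]: i for i, node in enumerate(graph["nodes"])}
    if current not in index or target not in index:
        return False
    if parse_version(current) >= parse_version(target):
        return False  # the graph gives no guarantees in reverse

    src, dst = index[current], index[target]
    edges = {(a, b) for a, b in graph["edges"]}
    if (src, dst) in edges:
        return True

    # Implied edge (assumption from the discussion): both share a target.
    src_targets = {b for a, b in edges if a == src}
    dst_targets = {b for a, b in edges if a == dst}
    return bool(src_targets & dst_targets)


# Hand-written example graph using the versions from the conversation.
example_graph = {
    "nodes": [
        {"version": "32.20200629.3.0"},
        {"version": "32.20200715.3.0"},
        {"version": "32.20201104.3.0"},
    ],
    "edges": [[0, 2], [1, 2]],
}
```

With this example graph, going from 32.20200629.3.0 to 32.20200715.3.0 is accepted as an implied edge, while the reverse direction is rejected.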
<stereobutter[m]>
So basically the only "safe" thing to do is a rollback i.e. I came from A and installed B then B->A is okay
<stereobutter[m]>
?
<stereobutter[m]>
Maybe not "safe" but "sensible"
<lucab>
that usually works in practice, yes, but it has no strict guarantees either. It depends on the specific software (and exact versions) on both A and B
<stereobutter[m]>
BTW, the reason I'd like to deploy versions manually is that some of our workloads use NVIDIA GPUs, and the whole driver setup is a mess (we got it to work) that will, for the foreseeable future, only work reliably with strictly fixed versions (both driver and FCOS). So we'd like to periodically try a new FCOS+driver version and then pin this
<stereobutter[m]>
and roll out exactly this
<lucab>
usual example is: release A contains fooware-1.0 and release B contains fooware-2.0. fooware-2.0 has up-migration logic for data generated by 1.0, but fooware-1.0 has no idea about the layout of migrated data. As soon as release B is booted and the data is migrated, rolling-back to release A may result in troubles for fooware.
jpn has quit [Ping timeout: 256 seconds]
<stereobutter[m]>
Yeah sure, that makes sense. Say I use `rpm-ostree` to overlay a package `Foo`, where A has `FooV1` and B has `FooV2`. Assuming `Foo` does not do anything too exotic upon installation, will this work from FCOS's point of view when I roll back from B to A?
<lucab>
stereobutter: it's a reasonable scenario, and zincati cannot cover all the very-custom strategies. I'd say it makes sense to consume the graph on your own and drive rpm-ostree as you wish
<stereobutter[m]>
(Since we haven't upgraded anything yet we have not been in the situation to perform a rollback yet)
<lucab>
stereobutter: yes, the rollback part for the packages in itself should work. The main caveats are about persisted data, and data-format migrations between `FooV1` and `FooV2`.
<stereobutter[m]>
Nice!
<lucab>
sidenote, that's the reason why we suggest putting `fooware` in a container, so that OS upgrades/rollbacks do not force it to go back and forth between V1<->V2
<stereobutter[m]>
Yeah, the Nvidia driver already runs in a container, but there is still some RPM that currently has to be installed (see https://discussion.fedoraproject.org/t/can-you-run-nvidia-gpu-workloads-on-fcos/35090). I actually had a look at this together with another developer for almost a week before we managed to get everything up and running.
ravanelli has joined #fedora-coreos
<stereobutter[m]>
When working on this I had the feeling that there are probably only a handful to a couple hundred people in existence that actually know how the whole thing actually works
jpn has joined #fedora-coreos
<stereobutter[m]>
> putting fooware in a container, so that OS upgrades/rollbacks do not force it to go back and forth
<stereobutter[m]>
only works if the container works on both versions. In the driver case, the container version must match the OS version
dwalsh_ has joined #fedora-coreos
ravanelli has quit [Ping timeout: 256 seconds]
<stereobutter[m]>
Would it be possible to also add a github release on https://github.com/coreos/fedora-coreos-streams for every new FCOS version? That way one could point e.g. renovatebot at the repo and trigger some action when a new FCOS version is released.
<stereobutter[m]>
alternatively to github releases just plain git tags would also suffice I guess.
<stereobutter[m]>
I had a look at the github actions in the repo but from looking at them I'm not sure what the actual release process is like and when this should run.
dwalsh_ has quit [Remote host closed the connection]
dwalsh_ has joined #fedora-coreos
ravanelli has joined #fedora-coreos
crobinso has joined #fedora-coreos
<lucab>
stereobutter: a better trigger would be "whenever updates/<yourstream.json> changes", as that is really the point when a new release starts to be rolled out
<lucab>
(I don't know if renovatebot can do that, I haven't used it before)
<stereobutter[m]>
as far as I know renovatebot can only track github releases and tags and cannot run https requests to find out if a new version exists
<lucab>
if it can activate on git commits and filter by filepath, that may be enough
<lucab>
we may be already pushing out an event on the fedora message bus, I don't exactly remember
<Sheogorath[m]>
It's currently not possible to check the content of a git repository; renovate only works on metadata. So no, that's not possible right now. One could consider writing a datasource integration: https://github.com/renovatebot/renovate/tree/main/lib/modules/datasource
<ravanelli>
lucab: I have a question for you
<lucab>
ravanelli: sure!
<ravanelli>
lucab: I finished the release process late yesterday, and during the release job, I got: ostree pull --commit-metadata-only fedora:fedora/s390x/coreos/testing
<ravanelli>
Seems we don't have the ostree repo for s390x
<ravanelli>
23:09:39 error: No such branch 'fedora/s390x/coreos/testing' in repository summary
<lucab>
ravanelli: ah uhm, yes the branch is probably missing but I guess the pipeline should try to push/create it the first time. Do you have a link to the job/logs?
<lucab>
ravanelli: ok, I think that the `OSTree Import s390x: Prod Repo` before that had some troubles, or at least it didn't properly set up the branch
<lucab>
I don't have access to the `coreos-ostree-importer` project on the OCP cluster to see if it says something in the logs, I'd wait for jlebon or dustymabe
<lucab>
(looking at the code, it seems to have logic to deal with scratch-new branches, so it may have just hit a bug)
paragan has quit [Ping timeout: 246 seconds]
<ravanelli>
lucab: Thanks for checking it ;)
mheon has joined #fedora-coreos
<jlebon>
ravanelli: try rerunning those, and it should work now
<jlebon>
the bit that failed is the script that waits until the branches are updated
<jlebon>
but it doesn't account for new branches correctly
<jlebon>
i'll see about fixing that, but in the meantime, we can sanity-check that a rerun passes since the summary should've had time to update by now
<lucab>
jlebon: was coreos-ostree-importer happy?
<lucab>
ah, it's only the summary that maybe does not match after the import
<jlebon>
lucab: i didn't check but from the job logs it should be. it knows to create the branch if it doesn't exist
<jlebon>
yeah, and i think there's CDN lag too IIRC
<jlebon>
`ostree remote summary fedora` on my FSB shows both `fedora/s390x/coreos/next` and `fedora/s390x/coreos/testing` are there yup
<dustymabe>
jlebon: ravanelli: all good now?
<ravanelli>
let me rerun that to see
<dustymabe>
hold on one sec
<dustymabe>
we shouldn't need to re-run anything for `testing`/`next` right?
plarsen has joined #fedora-coreos
<ravanelli>
ok
<dustymabe>
the step at the end that failed is just a check/read-only step
plarsen has quit [Remote host closed the connection]
<jlebon>
yeah should be fine now. i checked locally the summary was updated.
<dustymabe>
jlebon: can you check the failure in the re-run she did?
<ravanelli>
I should probably update the time that already passed
<ravanelli>
or Is it not an issue?
plarsen has joined #fedora-coreos
plarsen has quit [Remote host closed the connection]
plarsen has joined #fedora-coreos
<lucab>
ravanelli: not a real issue, the rollouts will just immediately start from something slightly larger than 0% (e.g. 0.5%) instead of starting from 0% and growing linearly to 0.5%
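To illustrate the arithmetic: a sketch of a linear rollout, assuming the rolled-out fraction grows from 0 to 1 over a fixed duration after a start time. The function and parameter names are hypothetical, and the 48-hour duration is only an example value, not a statement about the actual FCOS rollout settings.

```python
def rollout_fraction(start_epoch, duration_seconds, now_epoch):
    """Fraction of nodes (0.0 to 1.0) eligible at `now_epoch` for a
    rollout that grows linearly from `start_epoch` over the duration."""
    if duration_seconds <= 0:
        return 1.0
    elapsed = now_epoch - start_epoch
    # Clamp so times before the start give 0 and after the end give 1.
    return min(1.0, max(0.0, elapsed / duration_seconds))


# Example: a 48-hour rollout whose start time is already 864 seconds
# in the past when the change is merged.
duration = 48 * 60 * 60
already_elapsed = 864
fraction_at_merge = rollout_fraction(0, duration, already_elapsed)
print(f"{fraction_at_merge:.1%}")  # roughly 0.5% instead of 0%
```

A start time in the past only shifts where on the line the rollout begins; the slope is unchanged.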
plarsen has quit [Ping timeout: 255 seconds]
fifofonix has joined #fedora-coreos
<jlebon>
i think ravanelli is afk, so i just merged it to not let it grow larger :)
dustymabe has quit [Quit: WeeChat 3.4]
<ravanelli>
jlebon: thanks =)
dustymabe has joined #fedora-coreos
jcajka has quit [Quit: Leaving]
fifofonix has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<travier[m]>
but I understand that this is still not ideal
<travier[m]>
we also have the repos "dirtying" the config right now for rhcos
<travier[m]>
and the content set file
<jlebon>
yeah, the RHCOS setup is really awkward
<travier[m]>
And I also agree that the SHA will not define anymore which version is built
<travier[m]>
we would have to store the variant name / version somewhere alongside it to keep track of it
<travier[m]>
it will however still correctly reflect the commit used.
<jlebon>
i think we can roll with this for now, though i do wonder if all this would be cleaner if it were separate branches and shared files instead, like FCOS is set up. anyway, we can discuss this more down the line
<travier[m]>
I would have liked having separate branches, but this would have required some CI setup that is complex with Prow in the openshift org
<jlebon>
all, just wanted to mention a few things about the @CoreOS/continuous COPR repo: (1) i've removed el8-based chroots because most upstream projects no longer build as is for various reasons, (2) i've enabled the aarch64, s390x, and ppc64le chroots; this means it's now possible to get multi-arch git main packages! as a result, i've closed https://pagure.io/releng/issue/9801 which pursued doing this in koji via packit.
<bgilbert>
\o/
jpn has quit [Ping timeout: 240 seconds]
crobinso has quit [Remote host closed the connection]
<guesswhat>
I am trying to expose the zincati socket via socat, to scrape it with prometheus, but I cannot get it to work, any ideas? ( /usr/bin/socat tcp-listen:9988,fork,reuseaddr unix-connect:/run/zincati/public/metrics.promsock )
<guesswhat>
curl 127.0.0.1:9988 returns curl: (1) Received HTTP/0.9 when not allowed
<guesswhat>
lucab any ideas? i saw your exporter.. thanks
<dustymabe>
guesswhat: never done it before but I highly doubt the socket is going to speak HTTP (which is what curl is trying to use)
<guesswhat>
but it's working for docker.sock afaik
<guesswhat>
it's working like this: echo -e "GET / HTTP/1.0\r\n" | socat unix-connect:/run/zincati/public/metrics.promsock STDIO
<dustymabe>
guesswhat: maybe i'm wrong then :)
<guesswhat>
it's working even with empty stdin: echo "" | socat unix-connect:/run/zincati/public/metrics.promsock STDIO
<dustymabe>
maybe specify different http versions to curl? --http1.0 --http1.1 etc..
<dustymabe>
--http0.9
<guesswhat>
yeah, it works, but not sure how to use it in prometheus :X
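One possible way to bridge this, sketched in Python. curl's `HTTP/0.9` error means the bytes coming back carry no HTTP status line, so an HTTP scraper cannot parse them directly. The sketch assumes the socket writes its metrics on connect and then closes, as the socat test above suggests; `read_metrics` and `as_http_response` are hypothetical helper names, and a real bridge would still need a small HTTP listener that serves the wrapped response.

```python
import socket


def read_metrics(path):
    """Connect to a Unix stream socket and read until the peer closes."""
    chunks = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def as_http_response(body):
    """Wrap raw metrics bytes in a minimal HTTP/1.0 response so that
    HTTP clients see a status line and headers before the payload."""
    headers = (
        "HTTP/1.0 200 OK\r\n"
        "Content-Type: text/plain; version=0.0.4\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return headers.encode("ascii") + body
```

Usage would look like `as_http_response(read_metrics("/run/zincati/public/metrics.promsock"))`, with the result written back to whatever TCP client connected.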
<dustymabe>
gursewak: jlebon: now that we've shipped the releases this week we should be able to proceed with the `rd.debug` change we discussed last week
<dustymabe>
OR - should we wait until we see the problem again and then enable `rd.debug` ?