dustymabe changed the topic of #fedora-coreos to: Fedora CoreOS :: Find out more at https://getfedora.org/coreos/ :: Logs at https://libera.irclog.whitequark.org/fedora-coreos
<dustymabe> jlebon: can you help me chase down some pipeline failures this morning?
<jlebon> dustymabe: yeah, was looking at some already :)
<dustymabe> travier[m]: can you look at why the postprocess that you did for liblockfile doesn't work?
<dustymabe> jlebon: there's a few issues I see
<dustymabe> 1. boot-mirror test failing intermittently
<dustymabe> 2. testiso failing in 4k/multipath test
<dustymabe> 3. aarch64 upgrade test failing (fails to reboot in 2m)
<dustymabe> 4. s390x can't compose (issue I linked travier[m] to above)
<dustymabe> did you notice all of those?
<jlebon> i hadn't yet, no. i was looking at the kola aws aarch64 test failing due to aws capacity
<jlebon> was digging into the history there on why we're tied to an availability zone at all
<jlebon> lucab: let's backport that lua patch?
<jlebon> ahh ok, walters already backported it to rawhide, so we could just merge it into f36... or cut a new release
<jlebon> dustymabe: do you want to divide it up?
<dustymabe> jlebon: I think we're "tied to an AZ" because we specify a subnet.
<dustymabe> we do have code in there that attempts to find a subnet that has the instance type.
<jlebon> dustymabe: indeed. but i'm not sure why we specify a subnet.
<dustymabe> I think we are trying to make sure we get placed into the correct VPC (that has all of our settings)
<dustymabe> I don't know if there is a way to instead specify a vpc and let AWS do the math on that for us
<dustymabe> might be some details in the git blame
<jlebon> yeah, i got to https://github.com/coreos/coreos-assembler/commit/a12a54949eb716da5e58d79d9c3a767a0760c71e but need to dig into aws api details to understand this more
<dustymabe> jlebon: might be something we can write up an issue for and let someone else dig in to? For that particular problem at least it's transient.
<dustymabe> and explainable
<jlebon> dustymabe: yup agreed. was mostly looking to see if there was a quickfix possible.
<dustymabe> +1 :)
<dustymabe> thank you for digging in, it sure helps when we don't ignore these things
<dustymabe> for the remaining issues
<dustymabe> i'm trying to reproduce 2. locally
<travier[m]> I'll take a look at the liblockfile issue but my afternoon is packed so that might be tomorrow
<dustymabe> travier[m]: +1 - you might be able to get lakshmi to look at it (I don't see her here on IRC but you might be able to reach her internally)
<dustymabe> should be really easy to reproduce!
<jlebon> ok, filed https://github.com/coreos/coreos-assembler/issues/3030 while it's still fresh
<jlebon> i can look at 3
<jlebon> looks like it might be specific to rawhide
<dustymabe> jlebon: this is what happens when rawhide doesn't build for a week :) - we end up with 3+ problems and a larger package set diff to comb through
<jlebon> dustymabe: indeed heh
<jlebon> dustymabe: i'd like https://pagure.io/fedora-infra/ansible/pull-request/1162 to take effect. is it just:
<jlebon> sudo rbac-playbook -l os_control openshift-apps/fedora-coreos-pipeline.yml
<jlebon> ?
<guesswhat> question, what is the source for metadata for stable versions ? https://builds.coreos.fedoraproject.org/release-notes/stable.json this one?
<jlebon> spresti[m]: o/
<guesswhat> i would probably need this https://builds.coreos.fedoraproject.org/streams/stable.json ( it contains artifact metadata, for example AMI name )
<jlebon> guesswhat: ahh, I misunderstood your question. yes, the canonical user-facing API is that one.
<jlebon> i thought maybe you were looking to fix a typo in our release notes :)
<dustymabe> jlebon: I usually just look at this command and remove `-t delete` and use `-l os_control`: https://pagure.io/fork/spresti/fedora-infra/ansible/blob/4b83fe21c2cc5a813aad6e6870fdaa829e15da58/f/playbooks/openshift-apps/fedora-coreos-pipeline.yml#_37
<dustymabe> but yes :)
<guesswhat> i am trying to implement renovate ( https://docs.renovatebot.com/modules/datasource/#aws-machine-image-datasource ) support, but it does not work out of the box, because the AMI name format is shared across stable/next/testing/dev; for example it's trying to bump a version from stable to dev ( fedora-coreos-36.20220806.20.0-x86_64 )
<jlebon> dustymabe: ok that's a helpful tip
<jlebon> dustymabe: maybe we should just add the real command on top too. then we don't have to worry about someone forgetting to remove the `-t delete` when doing this. :)
<dustymabe> jlebon: WFM
<spresti[m]> Thank you guys for getting this working :)
<dustymabe> I think there may be an SOP somewhere for this in the infra docs
<jlebon> spresti[m]: ok, it *should* work now
<spresti[m]> Thank you!!!!
<dustymabe> guesswhat: so you're saying the strategy doesn't work because we have 3 streams all named similarly?
<guesswhat> dustymabe: yes, it's like semver versioning or timestamps
<guesswhat> some classifier in the AMI name would definitely help
<dustymabe> should be able to use something like: fedora-coreos-*.3.*-{x86_64,aarch64} ?
<dustymabe> for stable
<dustymabe> fedora-coreos-*.2.*-{x86_64,aarch64} for testing
<dustymabe> fedora-coreos-*.1.*-{x86_64,aarch64} for next
<guesswhat> yes, but what about the "3" number, is it static and reliable?
<dustymabe> yes `.3.` is for stable
<dustymabe> 2 for testing
<dustymabe> 1 for next
<guesswhat> this may work I guess
<dustymabe> I think aws has a new construct for image families but we haven't looked at them yet
<jlebon> hmm, it'd be better though to be based on the stream metadata instead
<dustymabe> jlebon: right, I don't know how this tool works though (if it's capable of that)
<dustymabe> https://pagure.io/cloud-sig/issue/356 -> looks like it's called AWS systems manager
<dustymabe> ^^ that would solve the issue - but tools would still probably need to be updated to look at the new parameters
<guesswhat> dustymabe: `fedora-coreos-*.3.*-x86_64` works for me, thanks :) didn't know about the 3/2/1 classifiers
<jlebon> i don't think we'll change our versioning schemes anytime soon, but again, I'd prefer users didn't rely on its exact internal details
<dustymabe> I agree.. guesswhat - do you know if there is any way to make that tool query our stream metadata and use that information instead?
<dustymabe> jlebon: do you know if we have any examples of doing that in the docs?
<jlebon> guesswhat: we don't know the details of renovatebot, but if there's an easier way for us to provide the stream metadata information, we may consider it
<dustymabe> oh heck yeah
<dustymabe> nice!
<dustymabe> gursewak: the CI failure in https://jenkins-coreos-ci.apps.ocp.fedoraproject.org/blue/organizations/jenkins/coreos-assembler/detail/PR-3016/9/pipeline - you'll have to rebase that PR on top of latest main branch
<dustymabe> but don't do it yet.. i've got some things I'm going to add to the code review
<spresti[m]> dustymabe: I seem to be running into some build failures
<spresti[m]> https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/15/console, it seems to be something with the k8s cluster? it's timing out :(
<dustymabe> jlebon: ^^
<dustymabe> should we hand edit to get back to 64 connections?
<dustymabe> spresti[m]: looks like `testing` is still running ok so let's let that one finish
<dustymabe> I killed all subordinate (i.e. multi-arch) jobs for `stable` and `next` because we'll have to re-run them
<spresti[m]> Yeah, the testing branch is; the stable one also timed out https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/17/console
<spresti[m]> Thank you
<dustymabe> the `aarch64` `testing` job also timed out, but we can restart that one safely (since we're hopeful the main testing job will succeed)
<dustymabe> and come back to me here with questions
<jlebon> dustymabe, spresti[m]: bumped to 64 manually for now
<dustymabe> jlebon: do we have a general strategy for how to workaround the issue (i.e. the configuration issue that we hit when we implemented the previous fix) in the future?
<jlebon> dustymabe: this is the first time we've tried to override something written by the s2i runner. i think the workaround is shipping a startup script that does it imperatively instead.
<guesswhat> dustymabe: no, it does not support generic http datasource with json/yaml support
<guesswhat> but it does support github releases and tags
<guesswhat> so it's gonna work for 90% of tools versioned via gh
<guesswhat> "tools"
<jlebon> dustymabe: i think we need a FORCE flag on cosa-build
<dustymabe> hmm I thought I had implemented one (at least there is some plumbing for it in cosa itself I think)
<dustymabe> IOW I think we just need to update the pipeline
<dustymabe> pipeline job
<jlebon> you're saying the `cosa remote-build-container` already supports `--force` and we just need to expose it in the job?
<dustymabe> I think so.. looking
<dustymabe> ok.. i was half true
<jlebon> i can do the param piping part fwiw
<dustymabe> ok i'm wrong - think I'm looking at the wrong version of the code
<dustymabe> one sec
<dustymabe> jlebon: I've got some local work for param piping.. if you haven't started. If you have then go for it :)
<dustymabe> LGTM
<dustymabe> at least in this case the underlying work had been done already :)
<spresti[m]> So did we disable the build pipeline?
<jlebon> dustymabe: yeah, nice :)
<dustymabe> I disabled it earlier, just to let the builders focus on prod stream builds
<dustymabe> you can un-disable it
<jlebon> gursewak: once https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build-cosa/9/console is done, you can restart CI on your filesystem PR
<dustymabe> jlebon: well. there is a small disconnect :(
<dustymabe> that build has to finish and the imagestream needs to be updated
<dustymabe> oh hmm. I don't know if I updated the imagestream for coreos-ci
<dustymabe> is there an imagestream?
<dustymabe> i.e. for COSA
<jlebon> no imagestream, no
<dustymabe> stamped
<dustymabe> jlebon: that is interesting.. so in coreos-ci we always hit quay?
<spresti[m]> dustymabe: Kk, thank you.
<jlebon> dustymabe: we always check quay, yes. but pretty sure there's optimizations for reusing downloaded images
<spresti[m]> jlebon: I ended up rebuilding both next and stable. both completed within a min.. that seems too fast, right? https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/18/
<dustymabe> jlebon: like what? I imagine obviously if we hit a builder that already used this quay image then we won't re-download, but if we hit a new builder we'd pull from quay again (not the cluster)
<jlebon> spresti[m]: ahh yup, you'll want to check off FORCE
<jlebon> dustymabe: i'm trying to find a source for this, but I think the scheduler prefers nodes where the image already exists
<dustymabe> if that documentation can be improved please let us know
<dustymabe> jlebon: that's interesting. I didn't know about that
<jlebon> dustymabe: don't believe me yet :)
<dustymabe> if that is true then maybe we should drop the imagestream in the fcos pipeline
<spresti[m]> jlebon: dustymabe thank you
<dustymabe> why would all the others be *Artifact and only that one Artifact?
<jlebon> dustymabe: because that one is not optional
<dustymabe> I wonder how it keeps track of that
<dustymabe> polling the node constantly?
<jlebon> yeah, i'm not sure
<dustymabe> jlebon: no, I'm not worried about the `omitempty`.. worried more about the `*`
<dustymabe> unless that doesn't mean what I think it does
<jlebon> if it were a pointer, then it could be nil
<jlebon> schematyper is trying to enforce optionality, but golang doesn't have Option like rust does, so this is what we're left with :)
<dustymabe> ok.. :)
<dustymabe> does that mean we have to access it differently than all the other artifacts?
<jlebon> we can access it the same way, but unlike the others, we don't have to nil-check first
<dustymabe> k
<dustymabe> gursewak: I think I see a cleaner way for checking in https://github.com/coreos/coreos-assembler/pull/3016#pullrequestreview-1065441853 - let me know what you think.
* dustymabe food
<gursewak> Yep that def looks more comprehensive by having a list of required artifacts for each test and checking all of them in a single function call.
<gursewak> I'll update the PR, thanks :)
<spresti[m]> WOW am i hitting a lot of flakes today? or am I doing something unexpected?
<dustymabe> new failure?
<spresti[m]> and the rerun of [testing] aarch64 also failed
<spresti[m]> I then re-ran it again and I see red progress bars?
<dustymabe> #22 and #23 haven't failed yet? still running?
<spresti[m]> Yeah sorry, I saw the bar change from blue to red, and thought it indicated failure
<dustymabe> #21 - we see this kind of randomly and we haven't figured it out yet: https://github.com/coreos/fedora-coreos-tracker/issues/1233
<dustymabe> would love to have it figured out though
<dustymabe> spresti[m]: can you mention and link to the job that failed in the issue
<spresti[m]> dustymabe: Done
<spresti[m]> 14:04:05 --- FAIL: coreos.ignition.failure (123.56s)
<spresti[m]> 14:04:05 qemufailure.go:43: timed out waiting for initramfs error: context deadline exceeded
<spresti[m]> 13:52:57 --- FAIL: ext.config.root-reprovision.raid1 (278.06s)
<spresti[m]> 13:52:57 harness.go:1488: mach.Start() failed: machine "f373dc23-2fd8-4347-abb1-2559e1d8a12e" failed basic checks: detected failed or stuck systemd units
<spresti[m]> 13:52:57 qemu-system-s390x: -drive if=none,id=mpath10,format=raw,file=nbd:unix:/var/tmp/mantle-qemu230010716/disk1855253409.socket,media=disk,auto-read-only=off,cache=unsafe: Failed to connect to '/var/tmp/mantle-qemu230010716/disk1855253409.socket': No such file or directory
<dustymabe> spresti[m]: will look
<dustymabe> spresti[m]: I think that test is expecting a failure and it's not getting the failure in the appropriate amount of time
<dustymabe> so.. `next` and `testing` are essentially the exact same content set right now. Let's see if `next` passes, which is running right now:
<dustymabe> spresti[m]: ^^
<spresti[m]> dustymabe: ok, yeah
<spresti[m]> sigh, sorry to bug so much. Thank you
<dustymabe> it's normal
<dustymabe> hopefully we can chase down the transient errors we've been seeing
<dustymabe> spresti[m]: that one shows `PASS: coreos.ignition.failure`
<dustymabe> jlebon: I'd like to find an analogue to https://github.com/coreos/coreos-assembler/blob/8b5bec7c836bbea7eb7cd637f9533a7849cb889a/schema/cosa/build.go#L175-L182 that doesn't take the json tag as input (i.e it takes "LiveISO" as input, not "live-iso")
<dustymabe> do you know of any that exists? OR would it be OK for us to create it?
<dustymabe> GetArtifactByStructName() ?
<jlebon> dustymabe: i think that would require golang reflection, which is yucky
<jlebon> bgilbert might have a better idea
<dustymabe> I basically just want to call GetArtifact("LiveIso") and have it give me back the right thing
<gursewak> dustymabe, Also I was wondering whether we can update the error message for GetArtifact() to include the name of the artifact we are looking for?
<bgilbert> dustymabe: where does the string "LiveISO" come from?
<gursewak> Is there a reason as to why we want to use "LiveISO" and not just "live-iso"?
<bgilbert> +1 gursewak
<dustymabe> yeah I mean we can just use the json names, but it seems a bit unclean
<dustymabe> for example - we will be calling a function to do a check on an artifact and then later using the struct name to access it
<dustymabe> which is fine, just felt awkward
<bgilbert> dustymabe: I think I'm missing some context
<dustymabe> yeah.. when gursewak pushes up his code (the missing context) i'll tag you in so you have the full picture
<bgilbert> +1
<bgilbert> guesswhat: to be explicit: the concern is that just because there's a numerically larger image version in AWS, does not mean that anyone should use it
<bgilbert> guesswhat: we want to be able to instruct new launches to use a previous release, if it turns out that there's a regression in the current one
<bgilbert> guesswhat: the stream JSON is the canonical source of info