dustymabe changed the topic of #fedora-coreos to: Fedora CoreOS :: Find out more at https://getfedora.org/coreos/ :: Logs at https://libera.irclog.whitequark.org/fedora-coreos
jlebon has quit [Quit: leaving]
cyberpear has quit [Quit: Connection closed for inactivity]
ravanelli has quit [Remote host closed the connection]
llamma has quit [Ping timeout: 264 seconds]
mnguyen_ has quit [Ping timeout: 252 seconds]
mnguyen_ has joined #fedora-coreos
mnguyen has quit [Ping timeout: 268 seconds]
fifofonix has joined #fedora-coreos
mnguyen has joined #fedora-coreos
plarsen has joined #fedora-coreos
plarsen has quit [Remote host closed the connection]
bgilbert has quit [Ping timeout: 272 seconds]
fifofonix has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
mnguyen_ has quit [Ping timeout: 252 seconds]
mnguyen has quit [Ping timeout: 252 seconds]
mnguyen has joined #fedora-coreos
mnguyen_ has joined #fedora-coreos
paragan has joined #fedora-coreos
misuto has quit [Remote host closed the connection]
misuto has joined #fedora-coreos
ravanelli has joined #fedora-coreos
ravanelli has quit [Ping timeout: 276 seconds]
mnguyen_ has quit [Ping timeout: 245 seconds]
mnguyen has quit [Ping timeout: 252 seconds]
mnguyen has joined #fedora-coreos
mnguyen_ has joined #fedora-coreos
gursewak has quit [Ping timeout: 240 seconds]
gursewak has joined #fedora-coreos
gursewak has quit [Ping timeout: 272 seconds]
jpn has joined #fedora-coreos
ravanelli has joined #fedora-coreos
ravanelli has quit [Ping timeout: 276 seconds]
jcajka has joined #fedora-coreos
jpn has quit [Ping timeout: 268 seconds]
jpn has joined #fedora-coreos
ksinny has joined #fedora-coreos
ksinny has quit [Client Quit]
arnulfo_07 has joined #fedora-coreos
arnulfo_7 has quit [Ping timeout: 245 seconds]
Betal has quit [Quit: WeeChat 3.6]
fifofonix has joined #fedora-coreos
jpn has quit [Ping timeout: 245 seconds]
mnguyen has quit [Ping timeout: 245 seconds]
mnguyen has joined #fedora-coreos
mnguyen_ has quit [Ping timeout: 272 seconds]
mnguyen_ has joined #fedora-coreos
ravanelli has joined #fedora-coreos
rishi```` has quit [Remote host closed the connection]
rishi has joined #fedora-coreos
ravanelli has quit [Remote host closed the connection]
crobinso has joined #fedora-coreos
vgoyal has joined #fedora-coreos
ravanelli has joined #fedora-coreos
jlebon has joined #fedora-coreos
mheon has joined #fedora-coreos
jpn has joined #fedora-coreos
<dustymabe> jlebon: need your expertise over in https://github.com/coreos/fedora-coreos-pipeline/pull/581
bgilbert has joined #fedora-coreos
bgilbert has quit [Remote host closed the connection]
bgilbert has joined #fedora-coreos
jcajka has quit [Quit: Leaving]
bgilbert has quit [Ping timeout: 245 seconds]
<jlebon> PSA: https://github.com/coreos/coreos-ci-lib/pull/112 may affect the performance of CoreOS CI jobs. If needed, feel free to bump the CPU requests of your project up from the default of 2 (example: https://github.com/coreos/coreos-assembler/pull/2975/commits/dea10749ee471f5c56a8bed4ff2984b9bcec0b20).
cyberpear has joined #fedora-coreos
jpn has quit [Ping timeout: 245 seconds]
jpn has joined #fedora-coreos
gursewak has joined #fedora-coreos
gursewak has quit [Remote host closed the connection]
gursewak has joined #fedora-coreos
ravanelli has quit [Remote host closed the connection]
boudinie[m] has quit [Quit: You have been kicked for being idle]
<dustymabe> crobinso: do you know if there are some things that just won't work with binfmt? i.e. https://github.com/containers/podman/pull/12430#issuecomment-1198122042
ravanelli has joined #fedora-coreos
<crobinso> dustymabe: my experience with qemu-user and binfmt is basically non-existent, beyond packaging work. so any runtime questions I can't help with. no one at rh really works on it either so it's probably better to raise in regular qemu support channels
<dustymabe> crobinso: thanks for the context - I appreciate it
paragan has quit [Ping timeout: 252 seconds]
<dustymabe> jlebon ravanelli: need to chat with you briefly when you get a chance
<dustymabe> it's about building cosa from a git ref versus git commit
<ravanelli> dustymabe: I'm free now if you have time
<dustymabe> yep.
rsalveti has quit [Quit: Connection closed for inactivity]
<dustymabe> basically I'm trying to figure out if we'll ever need to be able to build cosa from a specific commit. The `podman build https://github.com/coreos/coreos-assembler.git#main` syntax we are using doesn't support commits specifically
<dustymabe> oh hmm. wait let me check one more time (I was using the short hash)
<dustymabe> ok, yeah, no - it doesn't seem to work
<dustymabe> so basically if we ever need to do a build and it's not latest in a ref we'd need to make a tag
<dustymabe> i think that will probably happen rarely enough that it's ok.
<ravanelli> dustymabe: Yeah, I tried that, but couldn't find a way to use commits either
<dustymabe> alternatively we modify the code to git checkout the git repo first
<dustymabe> which is an easy enough fix
<dustymabe> maybe I should just do that now to prevent race conditions
<dustymabe> i'll do that
<dustymabe> sorry for leading you down a wrong path ravanelli
<ravanelli> dustymabe: The first time I did it that was the path I went, git checkout passing a dir to the build.
<ravanelli> dustymabe: What about asking about it as a feature request for podman in the future?
<ravanelli> dustymabe: ah, that's ok. I didn't even know it existed, good to know anyway. It is strange that commits don't work
<ravanelli> tag works fine too
<dustymabe> I can ask over in podman.. maybe I'm doing something wrong
<ravanelli> in the test I did, it seems to get the commit itself as a file, let's say, and not the commit in the tree.
bgilbert has joined #fedora-coreos
<jlebon> yeah, i think for our sanity we really should make sure we're building the same commit for all arches. so +1 to workaround it for now but in parallel file an RFE with podman.
<jlebon> "workaround it" = use git clone && checkout
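The `git clone && checkout` workaround discussed above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the helper names `cosa_build_commands` and `build_at_commit` and the image tag argument are hypothetical, and it assumes `git` and `podman` are available on the builder.

```python
import subprocess
import tempfile


def cosa_build_commands(repo_url, commit, workdir, image_tag):
    """Return the clone, checkout, and build commands, in order.

    Hypothetical helper: checking out the commit locally means podman
    builds from a directory instead of a remote git URL, so every
    architecture can be pinned to the same commit.
    """
    return [
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", "--detach", commit],
        ["podman", "build", "--tag", image_tag, workdir],
    ]


def build_at_commit(repo_url, commit, image_tag):
    """Run the commands in a throwaway directory, stopping on the first failure."""
    with tempfile.TemporaryDirectory() as workdir:
        for cmd in cosa_build_commands(repo_url, commit, workdir, image_tag):
            subprocess.run(cmd, check=True)
```

Keeping the command list separate from the code that runs it makes it easy to log or review exactly what will be built before anything executes.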
ravanelli has quit [Remote host closed the connection]
Betal has joined #fedora-coreos
jpn has quit [Ping timeout: 245 seconds]
jpn has joined #fedora-coreos
saqali has joined #fedora-coreos
jpn has quit [Ping timeout: 240 seconds]
gursewak has quit [Remote host closed the connection]
gursewak has joined #fedora-coreos
<dustymabe> jlebon: i'm trying to run a cosa build on the staging cluster.. seems that my pods that get scheduled keep cycling.. were you and saqib working on some similar issue today?
<dustymabe> here's what the pods are doing: https://paste.centos.org/view/325feb51
* dustymabe grabs late lunch
<dustymabe> jlebon: if you want a closer look just run the `build-cosa` job in https://jenkins-fedora-coreos-pipeline.apps.ocp.stg.fedoraproject.org/job/build-cosa/
ravanelli has joined #fedora-coreos
<jlebon> dustymabe: not sure what's going on, but looks like a different issue
<jlebon> try respawning jenkins maybe?
crobinso has quit [Quit: Leaving]
<dustymabe> jlebon: that didn't seem to help.. any other ideas?
<dustymabe> I guess I can blow away the whole namespace and start from scratch
<jlebon> dustymabe: yeah... tempting
<jlebon> let me look at the logs
<dustymabe> the whole no route to host thing is interesting
<dustymabe> i.e. it's like the jnlp can't talk back to jenkins itself
<dustymabe> you can see it on the jenkins side too
<dustymabe> `WARNING: No route to host (Host unreachable)`
<dustymabe> `java.io.IOException: https://jenkins-fedora-coreos-pipeline.apps.ocp.stg.fedoraproject.org/ provided port:50000 is not reachable`
<dustymabe> anything you did the other day when you were fiddling with staging that would have caused an issue like this?
<jlebon> i was testing the timeout thing, which did work
<jlebon> i had nuked the PVC to make sure it worked from scratch
<jlebon> but hadn't tested a job run
<jlebon> so it's possible it somehow broke it
<dustymabe> let me nuke/pave and we'll see what we have after that
jpn has joined #fedora-coreos
<jlebon> wait one sec
<dustymabe> k
<jlebon> ok yeah, it totally broke the auto-cloud configuration from the s2i runner script
<jlebon> it seems like casc can't merge into an existing object. it overwrites it.
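The overwrite-versus-merge behaviour jlebon describes can be illustrated with plain dictionaries. This is only a sketch of the general idea, not how the Jenkins Configuration as Code plugin is implemented; the keys `serverCertificate` and `maxRequestsPerHost` are placeholders chosen for the example.

```python
import copy


def overwrite(existing, patch):
    """Top-level replace: any nested object named in patch replaces the old one whole."""
    result = copy.deepcopy(existing)
    result.update(copy.deepcopy(patch))
    return result


def deep_merge(existing, patch):
    """Recursive merge: nested objects are combined key by key."""
    result = copy.deepcopy(existing)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


if __name__ == "__main__":
    # Placeholder data: an auto-generated cloud entry plus a one-field patch.
    existing = {"kubernetes": {"serverCertificate": "PEM-DATA", "maxRequestsPerHost": 32}}
    patch = {"kubernetes": {"maxRequestsPerHost": 64}}
    print(overwrite(existing, patch))   # the certificate field is lost
    print(deep_merge(existing, patch))  # the certificate field survives
```

With overwrite semantics, patching a single field wipes out everything else in the same object, which would match the symptom of an auto-generated cloud configuration ending up with empty fields.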
<dustymabe> the hotfix broke it or the change itself broke it?
<jlebon> will revert for now
<dustymabe> got ya.. sounds like the change itself is broken, but if we hotfix (as we did for our prod instance) it works OK?
<jlebon> yup, indeed
<dustymabe> reviewed.. I guess we can revisit how to apply it properly in git next week?
<jlebon> +1 yeah let's
<dustymabe> ok i'm going to nuke/pave staging
<jlebon> you should be able to just redefine jcasc
<jlebon> and respawn jenkins
<dustymabe> ok
<jlebon> ...maybe :)
Guest24 has joined #fedora-coreos
<dustymabe> will let you know soon
Guest6641 has joined #fedora-coreos
<dustymabe> yep still broken
* dustymabe pulls out the big hammer
<jlebon> +1
Guest24 has quit [Quit: Client closed]
Guest6641 has quit [Quit: Client closed]
quentin96 has joined #fedora-coreos
Guidon has joined #fedora-coreos
<quentin96> Hi Guys
<quentin96> I've got an issue with the systemd network-online.target; I don't see why this target is failing. The issue seems to be caused by NetworkManager-wait-online.service. That unit never starts and I don't know why. The issue is random: some of my AWS instances have it, others don't.
<quentin96> My version is Fedora CoreOS 36.20220716.3.1
ravanelli has quit [Remote host closed the connection]
<dustymabe> quentin96: what is in your Ignition config? Must be a service you are creating that pulls in that target
<quentin96> dustymabe here is my butane code
<quentin96> variant: fcos
<quentin96> version: 1.4.0
<quentin96> systemd:
<quentin96>   units:
<quentin96>     - name: wg-setup@.service
<quentin96>       enabled: false
<quentin96>       contents: |
<quentin96>         [Unit]
<quentin96>         Description=Setup Wireguard %i Configuration
<quentin96>         Before=wg-quick.target
<quentin96>         After=network-online.target
<quentin96>         Requires=network-online.target
<quentin96>         OnSuccess=wg-quick@%i.service
<quentin96>         [Service]
<quentin96>         Type=oneshot
<quentin96>         RemainAfterExit=no
<quentin96>         ExecStart=/usr/local/bin/wg-setup %i
<dustymabe> what does `sudo journalctl -u NetworkManager-wait-online.service` show you ?
<dustymabe> jlebon: I think I may have deployed the new staging instance with your change still in place.. how do I tell in the jenkins interface if that setting is still set?
<quentin96> dustymabe
<quentin96> ```
<quentin96> $ sudo journalctl -u NetworkManager-wait-online.service
<quentin96> -- No entries --
<quentin96> ```
<dustymabe> quentin96: I thought you said it failed
<quentin96> it didn't start
<dustymabe> oh
<quentin96> it's in the dead state, as if it never started
<quentin96> $ systemctl status NetworkManager-wait-online.service
<quentin96> ○ NetworkManager-wait-online.service - Network Manager Wait Online
<quentin96>      Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: enabled)
<quentin96>      Active: inactive (dead)
<dustymabe> I see.. you are creating an instantiatable unit but you don't have any instances of it
<jlebon> dustymabe: check in the cloud configuration page. if e.g. the kubernetes certificate field is empty, it's borked
<dustymabe> it's there, but where is the `maxRequestsPerHost` field that I can see?
<dustymabe> oh I see it now
<quentin96> sorry, I only gave you one part of my Ignition config. but I have this too:
<quentin96>     - name: wg-setup@wg0.service
<dustymabe> it's set to 32
<quentin96>       enabled: true
<dustymabe> quentin96: and what does that unit say? `sudo systemctl status wg-setup@wg0.service`
<jlebon> dustymabe: good. you can update it manually for now
<dustymabe> will do
<quentin96> $ sudo systemctl status wg-setup@wg0.service
<quentin96> ○ wg-setup@wg0.service - Setup Wireguard wg0 Configuration
<quentin96>      Loaded: loaded (/etc/systemd/system/wg-setup@.service; disabled; vendor preset: disabled)
<quentin96>     Drop-In: /etc/systemd/system/wg-setup@wg0.service.d
<quentin96>              └─00-enabled.conf, 10-config.conf
<quentin96>      Active: inactive (dead)
<dustymabe> if you `systemctl enable --now wg-setup@wg0.service` does the service come up fine?
<quentin96> dustymabe I don't understand, because with the EXACT same Ignition config, sometimes the unit starts and sometimes it doesn't. After investigating, I found that it's related to my requirement on `network-online.target`. When my unit doesn't start, it's because `network-online.target` is dead, and when my unit starts up correctly, I see `network-online.target` active.
<dustymabe> quentin96: if you systemctl cat NetworkManager-wait-online.service you'll notice that it's just calling a program that waits 60s for the network to come up and then times out
<dustymabe> so if networking isn't good for 60s it will fail
<dustymabe> what does `sudo journalctl -u NetworkManager-wait-online.service` show you?
jpn has quit [Ping timeout: 245 seconds]
<dustymabe> hmm I guess you showed me that before with systemctl status NetworkManager-wait-online.service
<dustymabe> still. anything from journalctl?
<quentin96> We thought about the same issue, regarding NetworkManager.
<quentin96> Here is the output on a failing server:
<quentin96> $ sudo journalctl -u NetworkManager-wait-online.service
<quentin96> -- No entries --
<quentin96> and here a working server:
<quentin96> $ sudo journalctl -u NetworkManager-wait-online.service
<quentin96> Jul 29 18:11:20 ip-10-12-3-230 systemd[1]: Starting NetworkManager-wait-online.service - Network Manager Wait Online...
<quentin96> Jul 29 18:11:20 ip-10-12-3-230 systemd[1]: Finished NetworkManager-wait-online.service - Network Manager Wait Online.
<dustymabe> right.. so basically different servers, same setup, some of them come up fine, others don't ?
<quentin96> exactly
<jlebon> did you check the journal for ordering cycles?
<quentin96> just to be precise, we use the official AWS AMI (ami-03929f88dfb4b1c1c)
<dustymabe> jlebon: check this out (ignore the fact that it failed at the very end because of a dumb mistake on my end): https://jenkins-fedora-coreos-pipeline.apps.ocp.stg.fedoraproject.org/blue/organizations/jenkins/build-cosa/detail/build-cosa/3/pipeline/89
<jlebon> quentin96: that could be a reason why sometimes systemd drops out network-online.target from the transaction
<dustymabe> jlebon: it pushed to the `main` tag here: https://quay.io/repository/dustymabe/coreos-assembler?tab=tags
<jlebon> dustymabe: that's awesome!
<quentin96> jlebon do you mean `journalctl -p err` ?
<jlebon> dustymabe: shouldn't it also build x86_64?
<jlebon> quentin96: just all of `journalctl -b 0` :)
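Checking the current boot's journal for ordering cycles, as suggested above, can be done by filtering for the relevant messages. The sketch below assumes systemd reports such problems with the phrase "ordering cycle" in the log line; the helper names are hypothetical.

```python
import subprocess

# Assumption: systemd's cycle messages contain this phrase.
CYCLE_MARKER = "ordering cycle"


def find_ordering_cycle_lines(journal_text):
    """Return every journal line that mentions an ordering cycle."""
    return [
        line
        for line in journal_text.splitlines()
        if CYCLE_MARKER in line.lower()
    ]


def current_boot_journal():
    """Fetch the journal for the current boot as text (may need root)."""
    completed = subprocess.run(
        ["journalctl", "-b", "0", "--no-pager"],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


# Usage on an affected machine:
#   for line in find_ordering_cycle_lines(current_boot_journal()):
#       print(line)
```

An empty result only means no cycle was logged during this boot; it does not rule out other reasons for a unit being dropped from the transaction.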
<dustymabe> jlebon: yeah, when I'm not in staging :)
<jlebon> dustymabe: +1
<jlebon> there's `buildImage()` which should help there
<dustymabe> jlebon: oh?
<jlebon> though i think it might be cleaner long term to also use an external executor for x86_64 with this new podman remote world
<dustymabe> oh I see. builds it using a buildconfig?
<dustymabe> yeah I was just planning to use a new x86_64 builder
<jlebon> it'll build it in the same namespace, and then you should be able to `skopeo copy` it to quay too
<jlebon> +1
<jlebon> that'd be nicer yeah
<dustymabe> jlebon: what do you think about building with `--no-cache`?
<jlebon> dustymabe: definitely :)
<dustymabe> part of me thinks we should always `--no-cache` but another part of me thinks of days where we have a lot of commits go into cosa and the big waste --no-cache would be
<jlebon> that should be our default IMO unless we find out it causes serious performance issues :)
<dustymabe> I wish there was some sort of --cache-expire=1d
<jlebon> fair
<dustymabe> will go with --no-cache for now
<jlebon> i guess we could implement that manually by passing `--no-cache` only if the last push was X time ago
<jlebon> +1
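The manual cache-expiry idea (pass `--no-cache` only when the last push is older than some threshold) could look something like this. It is a sketch under assumptions: the function name `cache_flags` and the one-day default are illustrative, and how the last push time is obtained (for example from the registry) is left out.

```python
from datetime import datetime, timedelta, timezone


def cache_flags(last_push, now, max_cache_age=timedelta(days=1)):
    """Return extra `podman build` flags based on the age of the last push.

    If there was no previous push, or it is at least max_cache_age old,
    force a full rebuild; otherwise allow cached layers to be reused.
    """
    if last_push is None or now - last_push >= max_cache_age:
        return ["--no-cache"]
    return []


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(cache_flags(now - timedelta(hours=3), now))  # recent push: reuse cache
    print(cache_flags(now - timedelta(days=2), now))   # stale push: rebuild fully
```

This keeps busy days cheap, since several commits in a row reuse layers, while still guaranteeing a periodic rebuild from scratch.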
<quentin96> jlebon I checked `journalctl -b 0` and didn't find anything
misuto has quit [Remote host closed the connection]
misuto has joined #fedora-coreos
<quentin96> jlebon there's no obvious cycle in those logs
<jlebon> quentin96: hmm sorry, I'm not sure. can you open an issue in https://github.com/coreos/fedora-coreos-tracker with the full Ignition and journal logs in the case where it misbehaves?
<jlebon> this might not be an FCOS issue, but we can start diagnosing there
<dustymabe> also I'm interested to know.. if you start 10 instances fresh in the exact same way.. how many of them succeed and how many fail
jpn has joined #fedora-coreos
<quentin96> jlebon dustymabe thank you so much for your help, I will do that Monday and will open an issue with the full logs and details.
<quentin96> Thanks a lot and have a good weekend!
<jlebon> quentin96: have a good weekend!
Guidon has quit [Ping timeout: 252 seconds]
fifofonix has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
mheon has quit [Ping timeout: 272 seconds]
fifofonix has joined #fedora-coreos
vgoyal has quit [Quit: Leaving]
gursewak has quit [Ping timeout: 272 seconds]
quentin96 has quit [Ping timeout: 252 seconds]
bgilbert has quit [Ping timeout: 245 seconds]
jlebon has quit [Quit: leaving]
jpn has quit [Ping timeout: 252 seconds]
jpn has joined #fedora-coreos
jpn has quit [Ping timeout: 245 seconds]
jpn has joined #fedora-coreos
jpn has quit [Ping timeout: 245 seconds]
jpn has joined #fedora-coreos
jpn has quit [Ping timeout: 245 seconds]
jpn has joined #fedora-coreos
jpn has quit [Ping timeout: 245 seconds]