#fedora-coreos on 2022-07-29 — irc logs at libera.irclog.whitequark.org

2022-05-11 12:42 dustymabe changed the topic of #fedora-coreos to: Fedora CoreOS :: Find out more at https://getfedora.org/coreos/ :: Logs at https://libera.irclog.whitequark.org/fedora-coreos

00:17 jlebon has quit [Quit: leaving]

00:26 cyberpear has quit [Quit: Connection closed for inactivity]

00:27 ravanelli has quit [Remote host closed the connection]

00:34 llamma has quit [Ping timeout: 264 seconds]

00:46 mnguyen_ has quit [Ping timeout: 252 seconds]

00:46 mnguyen_ has joined #fedora-coreos

00:47 mnguyen has quit [Ping timeout: 268 seconds]

00:47 fifofonix has joined #fedora-coreos

00:48 mnguyen has joined #fedora-coreos

01:17 plarsen has joined #fedora-coreos

01:18 plarsen has quit [Remote host closed the connection]

01:46 bgilbert has quit [Ping timeout: 272 seconds]

02:54 fifofonix has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

03:10 mnguyen_ has quit [Ping timeout: 252 seconds]

03:11 mnguyen has quit [Ping timeout: 252 seconds]

03:11 mnguyen has joined #fedora-coreos

03:11 mnguyen_ has joined #fedora-coreos

03:38 paragan has joined #fedora-coreos

04:01 misuto has quit [Remote host closed the connection]

04:04 misuto has joined #fedora-coreos

04:16 ravanelli has joined #fedora-coreos

04:21 ravanelli has quit [Ping timeout: 276 seconds]

05:03 mnguyen_ has quit [Ping timeout: 245 seconds]

05:04 mnguyen has quit [Ping timeout: 252 seconds]

05:04 mnguyen has joined #fedora-coreos

05:04 mnguyen_ has joined #fedora-coreos

06:43 gursewak has quit [Ping timeout: 240 seconds]

06:49 gursewak has joined #fedora-coreos

07:11 gursewak has quit [Ping timeout: 272 seconds]

07:51 jpn has joined #fedora-coreos

07:52 ravanelli has joined #fedora-coreos

07:57 ravanelli has quit [Ping timeout: 276 seconds]

07:58 jcajka has joined #fedora-coreos

08:01 jpn has quit [Ping timeout: 268 seconds]

08:20 jpn has joined #fedora-coreos

09:10 ksinny has joined #fedora-coreos

09:12 ksinny has quit [Client Quit]

09:38 arnulfo_07 has joined #fedora-coreos

09:41 arnulfo_7 has quit [Ping timeout: 245 seconds]

09:49 Betal has quit [Quit: WeeChat 3.6]

10:36 fifofonix has joined #fedora-coreos

10:50 jpn has quit [Ping timeout: 245 seconds]

11:14 mnguyen has quit [Ping timeout: 245 seconds]

11:15 mnguyen has joined #fedora-coreos

11:15 mnguyen_ has quit [Ping timeout: 272 seconds]

11:15 mnguyen_ has joined #fedora-coreos

11:31 ravanelli has joined #fedora-coreos

11:32 rishi```` has quit [Remote host closed the connection]

11:32 rishi has joined #fedora-coreos

11:37 ravanelli has quit [Remote host closed the connection]

11:55 crobinso has joined #fedora-coreos

11:56 vgoyal has joined #fedora-coreos

12:48 ravanelli has joined #fedora-coreos

13:05 jlebon has joined #fedora-coreos

13:12 mheon has joined #fedora-coreos

13:16 <dustymabe> easy review: https://pagure.io/fedora-infra/ansible/pull-request/1156

13:31 jpn has joined #fedora-coreos

13:55 <dustymabe> jlebon: need your expertise over in https://github.com/coreos/fedora-coreos-pipeline/pull/581

14:04 bgilbert has joined #fedora-coreos

14:06 bgilbert has quit [Remote host closed the connection]

14:06 bgilbert has joined #fedora-coreos

14:25 jcajka has quit [Quit: Leaving]

14:35 bgilbert has quit [Ping timeout: 245 seconds]

14:56 <jlebon> PSA: https://github.com/coreos/coreos-ci-lib/pull/112 may affect the performance of CoreOS CI jobs. If needed, feel free to bump the CPU requests of your project up from the default of 2 (example: https://github.com/coreos/coreos-assembler/pull/2975/commits/dea10749ee471f5c56a8bed4ff2984b9bcec0b20).

15:11 cyberpear has joined #fedora-coreos

15:13 jpn has quit [Ping timeout: 245 seconds]

15:24 jpn has joined #fedora-coreos

15:29 gursewak has joined #fedora-coreos

15:40 gursewak has quit [Remote host closed the connection]

15:40 gursewak has joined #fedora-coreos

15:55 ravanelli has quit [Remote host closed the connection]

16:00 boudinie[m] has quit [Quit: You have been kicked for being idle]

16:02 <dustymabe> crobinso: do you know if there are some things that just won't work with binfmt? i.e. https://github.com/containers/podman/pull/12430#issuecomment-1198122042

16:11 ravanelli has joined #fedora-coreos

16:17 <crobinso> dustymabe: my experience with qemu-user and binfmt is basically non-existent, beyond packaging work. so any runtime questions I can't help with. no one at rh really works on it either so it's probably better to raise in regular qemu support channels

16:22 <dustymabe> crobinso: thanks for the context - I appreciate it

16:24 paragan has quit [Ping timeout: 252 seconds]

16:32 <dustymabe> jlebon ravanelli: need to chat with you briefly when you get a chance

16:32 <dustymabe> it's about building cosa from a git ref versus git commit

16:33 <ravanelli> dustymabe: I'm free now if you have time

16:33 <dustymabe> yep.

16:33 rsalveti has quit [Quit: Connection closed for inactivity]

16:34 <dustymabe> basically I'm trying to figure out if we'll ever need to be able to build cosa from a specific commit. The `podman build https://github.com/coreos/coreos-assembler.git#main` syntax we are using doesn't support commits specifically

16:34 <dustymabe> oh hmm. wait let me check one more time (I was using the short hash)

16:35 <dustymabe> ok, yeah, no - it doesn't seem to work

16:37 <dustymabe> so basically if we ever need to do a build and it's not latest in a ref we'd need to make a tag

16:37 <dustymabe> i think that will probably happen rarely enough that it's ok.

16:37 <ravanelli> dustymabe: Yeah, I tried that, but couldn't find a way to use commits either

16:37 <dustymabe> alternatively we modify the code to git checkout the git repo first

16:38 <dustymabe> which is an easy enough fix

16:38 <dustymabe> maybe I should just do that now to prevent race conditions

16:38 <dustymabe> i'll do that

16:39 <dustymabe> sorry for leading you down a wrong path ravanelli

16:40 <ravanelli> dustymabe: The first time I did it that was the path I went, git checkout passing a dir to the build.

16:40 <ravanelli> dustymabe: What about asking about it as a feature request for podman in the future?

16:42 <ravanelli> dustymabe: ah, that's ok. I didn't even know it existed, good to know anyway. It is strange that commits don't work

16:42 <ravanelli> tag works fine too

16:42 <dustymabe> I can ask over in podman.. maybe I'm doing something wrong

16:44 <ravanelli> in the test I did, it seems to treat the commit as a file, let's say, and not as a commit in the tree.

17:09 bgilbert has joined #fedora-coreos

17:10 <jlebon> yeah, i think for our sanity we really should make sure we're building the same commit for all arches. so +1 to workaround it for now but in parallel file an RFE with podman.

17:11 <jlebon> "workaround it" = use git clone && checkout
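A minimal sketch of the clone-and-checkout workaround described above, not the pipeline's actual code. The function name `build_cosa_at_commit`, the `DRY_RUN` switch, the default work directory, and the image tag argument are all hypothetical; only the repository URL comes from the discussion.

```shell
#!/bin/bash
# Hypothetical helper: build coreos-assembler from an exact commit by
# cloning first, instead of relying on `podman build <url>#<ref>`.
# With DRY_RUN=1 the commands are printed rather than executed.
build_cosa_at_commit() {
    local commit="$1" tag="$2"
    local workdir="${3:-cosa-src}"   # assumed scratch directory name
    local repo="https://github.com/coreos/coreos-assembler.git"
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi

    $run git clone "$repo" "$workdir" || return 1
    # Check out the exact commit so every architecture builds the same source.
    $run git -C "$workdir" checkout "$commit" || return 1
    # Build from the local checkout rather than from a remote ref.
    $run podman build -t "$tag" "$workdir"
}
```

Calling it with `DRY_RUN=1` and a placeholder commit prints the three commands without touching the network, which is a cheap way to review what a job would run.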

17:20 ravanelli has quit [Remote host closed the connection]

17:21 Betal has joined #fedora-coreos

17:50 jpn has quit [Ping timeout: 245 seconds]

17:52 jpn has joined #fedora-coreos

18:02 saqali has joined #fedora-coreos

18:08 jpn has quit [Ping timeout: 240 seconds]

18:20 gursewak has quit [Remote host closed the connection]

18:20 gursewak has joined #fedora-coreos

18:44 <dustymabe> jlebon: i'm trying to run a cosa build on the staging cluster.. seems that my pods that get scheduled keep cycling.. were you and saqib working on some similar issue today?

18:46 <dustymabe> here's what the pods are doing: https://paste.centos.org/view/325feb51

18:47 <dustymabe> maybe I'm doing something wrong in my job? https://github.com/dustymabe/fedora-coreos-pipeline/blob/add-build-cosa-job/jobs/build-cosa.Jenkinsfile

18:50 * dustymabe grabs late lunch

18:50 <dustymabe> jlebon: if you want a closer look just run the `build-cosa` job in https://jenkins-fedora-coreos-pipeline.apps.ocp.stg.fedoraproject.org/job/build-cosa/

18:58 ravanelli has joined #fedora-coreos

19:04 <jlebon> dustymabe: not sure what's going on, but looks like a different issue

19:04 <jlebon> try respawning jenkins maybe?

19:11 crobinso has quit [Quit: Leaving]

19:41 <dustymabe> jlebon: that didn't seem to help.. any other ideas?

19:44 <dustymabe> I guess I can blow away the whole namespace and start from scratch

19:45 <jlebon> dustymabe: yeah... tempting

19:45 <jlebon> let me look at the logs

19:45 <dustymabe> the whole no route to host thing is interesting

19:45 <dustymabe> i.e. it's like the jnlp can't talk back to jenkins itself

19:46 <dustymabe> you can see it on the jenkins side too

19:46 <dustymabe> `WARNING: No route to host (Host unreachable)`

19:46 <dustymabe> `java.io.IOException: https://jenkins-fedora-coreos-pipeline.apps.ocp.stg.fedoraproject.org/ provided port:50000 is not reachable`

19:46 <dustymabe> anything you did the other day when you were fiddling with staging that would have caused an issue like this?

19:49 <jlebon> i was testing the timeout thing, which did work

19:50 <jlebon> i had nuked the PVC to make sure it worked from scratch

19:50 <jlebon> but hadn't tested a job run

19:50 <jlebon> so it's possible it somehow broke it

19:50 <dustymabe> let me nuke/pave and we'll see what we have after that

19:50 jpn has joined #fedora-coreos

19:50 <jlebon> wait one sec

19:50 <dustymabe> k

19:59 <jlebon> ok yeah, it totally broke the auto-cloud configuration from the s2i runner script

19:59 <jlebon> it seems like casc can't merge into an existing object. it overwrites it.

19:59 <dustymabe> the hotfix broke it or the change itself broke it

19:59 <jlebon> will revert for now

20:00 <dustymabe> got ya.. sounds like the change itself is broken, but if we hotfix (as we did for our prod instance) it works OK?

20:01 <jlebon> yup, indeed

20:01 <jlebon> https://github.com/coreos/fedora-coreos-pipeline/pull/583

20:01 <dustymabe> reviewed.. I guess we can revisit how to apply it properly in git next week?

20:02 <jlebon> +1 yeah let's

20:02 <dustymabe> ok i'm going to nuke/pave staging

20:02 <jlebon> you should be able to just redefine jcasc

20:02 <jlebon> and respawn jenkins

20:02 <dustymabe> ok

20:02 <jlebon> ...maybe :)

20:03 Guest24 has joined #fedora-coreos

20:03 <dustymabe> will let you know soon

20:03 Guest6641 has joined #fedora-coreos

20:06 <dustymabe> yep still broken

20:07 * dustymabe pulls out the big hammer

20:07 <jlebon> +1

20:09 Guest24 has quit [Quit: Client closed]

20:09 Guest6641 has quit [Quit: Client closed]

20:10 quentin96 has joined #fedora-coreos

20:10 Guidon has joined #fedora-coreos

20:15 <quentin96> Hi Guys

20:15 <quentin96> I've got an issue with the systemd network-online.target: I don't see why this target is failing. The issue seems to be caused by NetworkManager-wait-online.service. That unit never starts and I don't know why. The issue is random: some of my AWS instances have it, others don't.

20:15 <quentin96> My version is Fedora CoreOS 36.20220716.3.1

20:16 ravanelli has quit [Remote host closed the connection]

20:17 <dustymabe> quentin96: what is in your Ignition config? Must be a service you are creating that pulls in that target

20:18 <quentin96> dustymabe here is my butane code

20:18 <quentin96> variant: fcos

20:18 <quentin96> version: 1.4.0

20:18 <quentin96> systemd:

20:18 <quentin96>   units:

20:18 <quentin96>     - name: wg-setup@.service

20:18 <quentin96>       enabled: false

20:18 <quentin96>       contents: |

20:18 <quentin96>         [Unit]

20:18 <quentin96>         Description=Setup Wireguard %i Configuration

20:18 <quentin96>         Before=wg-quick.target

20:18 <quentin96>         After=network-online.target

20:18 <quentin96>         Requires=network-online.target

20:18 <quentin96>         OnSuccess=wg-quick@%i.service

20:18 <quentin96>         [Service]

20:18 <quentin96>         Type=oneshot

20:18 <quentin96>         RemainAfterExit=no

20:18 <quentin96>         ExecStart=/usr/local/bin/wg-setup %i

20:19 <dustymabe> what does `sudo journalctl -u NetworkManager-wait-online.service` show you ?

20:20 <dustymabe> jlebon: I think I may have deployed the new staging instance with your change still in place.. how do I tell in the jenkins interface if that setting is still set?

20:20 <quentin96> dustymabe

20:20 <quentin96> ```

20:20 <quentin96> $ sudo journalctl -u NetworkManager-wait-online.service

20:20 <quentin96> -- No entries --

20:20 <quentin96> ```

20:21 <dustymabe> quentin96: I thought you said it failed

20:21 <quentin96> it didn't start

20:21 <dustymabe> oh

20:21 <quentin96> it's in a dead state, as if it never started

20:21 <quentin96> $ systemctl status NetworkManager-wait-online.service

20:22 <quentin96> ○ NetworkManager-wait-online.service - Network Manager Wait Online

20:22 <quentin96> Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: enabled)

20:22 <quentin96> Active: inactive (dead)

20:22 <dustymabe> I see.. you are creating an instantiatable unit but you don't have any instances of it

20:22 <jlebon> dustymabe: check in the cloud configuration page. if e.g. the kubernetes certificate field is empty, it's borked

20:23 <dustymabe> it's there, but where is the `maxRequestsPerHost` field that I can see?

20:23 <dustymabe> oh I see it now

20:23 <quentin96> sorry, I only gave you one part of my Ignition config. but I have this too:

20:23 <quentin96> - name: wg-setup@wg0.service

20:23 <dustymabe> it's set to 32

20:23 <quentin96> enabled: true

20:24 <dustymabe> quentin96: and what does that unit say? `sudo systemctl status wg-setup@wg0.service`

20:24 <jlebon> dustymabe: good. you can update it manually for now

20:24 <dustymabe> will do

20:24 <quentin96> $ sudo systemctl status wg-setup@wg0.service

20:24 <quentin96> ○ wg-setup@wg0.service - Setup Wireguard wg0 Configuration

20:24 <quentin96> Loaded: loaded (/etc/systemd/system/wg-setup@.service; disabled; vendor preset: disabled)

20:24 <quentin96> Drop-In: /etc/systemd/system/wg-setup@wg0.service.d

20:24 <quentin96> └─00-enabled.conf, 10-config.conf

20:24 <quentin96> Active: inactive (dead)

20:26 <dustymabe> quentin96: hmm. we have a test for this: https://github.com/coreos/fedora-coreos-config/tree/testing-devel/tests/kola/ignition/systemd-enable-units

20:28 <dustymabe> if you `systemctl enable --now wg-setup@wg0.service` does the service come up fine?

20:29 <quentin96> dustymabe I don't understand, because with the EXACT same Ignition config, sometimes it starts the unit and sometimes not. After investigating, I found that it's related to my requirement on `network-online.target`. When my unit doesn't start, it's because `network-online.target` is dead, and when my unit starts up correctly, I see `network-online.target` active.

20:32 <dustymabe> quentin96: if you systemctl cat NetworkManager-wait-online.service you'll notice that it's just calling a program that waits 60s for the network to come up and then times out

20:32 <dustymabe> so if networking isn't good for 60s it will fail

20:32 <dustymabe> what does `sudo journalctl -u NetworkManager-wait-online.service` show you?

20:33 jpn has quit [Ping timeout: 245 seconds]

20:36 <dustymabe> hmm I guess you showed me that before with systemctl status NetworkManager-wait-online.service

20:37 <dustymabe> still. anything from journalctl?

20:37 <quentin96> We thought about the same issue, regarding NetworkManager.

20:37 <quentin96> Here is the output on a failing server:

20:37 <quentin96> $ sudo journalctl -u NetworkManager-wait-online.service

20:37 <quentin96> -- No entries --

20:37 <quentin96> and here a working server:

20:37 <quentin96> $ sudo journalctl -u NetworkManager-wait-online.service

20:37 <quentin96> Jul 29 18:11:20 ip-10-12-3-230 systemd[1]: Starting NetworkManager-wait-online.service - Network Manager Wait Online...

20:37 <quentin96> Jul 29 18:11:20 ip-10-12-3-230 systemd[1]: Finished NetworkManager-wait-online.service - Network Manager Wait Online.

20:38 <dustymabe> right.. so basically different servers, same setup, some of them come up fine, others don't ?

20:38 <quentin96> exactly

20:41 <jlebon> did you check the journal for ordering cycles?

20:42 <quentin96> just to be precise, we use the official AWS AMI (ami-03929f88dfb4b1c1c)

20:42 <dustymabe> jlebon: check this out (ignore the fact that it failed at the very end because of a dumb mistake on my end): https://jenkins-fedora-coreos-pipeline.apps.ocp.stg.fedoraproject.org/blue/organizations/jenkins/build-cosa/detail/build-cosa/3/pipeline/89

20:42 <jlebon> quentin96: that could be a reason why sometimes systemd drops out network-online.target from the transaction

20:43 <dustymabe> jlebon: it pushed to the `main` tag here: https://quay.io/repository/dustymabe/coreos-assembler?tab=tags

20:43 <jlebon> dustymabe: that's awesome!

20:43 <quentin96> jlebon do you mean `journalctl -p err` ?

20:43 <jlebon> dustymabe: shouldn't it also build x86_64?

20:44 <jlebon> quentin96: just all of `journalctl -b 0` :)
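One way to scan a full boot journal for ordering cycles, as a hedged sketch. It assumes systemd reports a broken cycle in lines containing the phrase "ordering cycle"; the exact wording can vary between versions, so an empty result is not proof that no job was dropped. The helper name `find_ordering_cycles` is hypothetical.

```shell
# Hypothetical helper: read journal text on stdin and print only the
# lines that mention an ordering cycle (case-insensitive match).
find_ordering_cycles() {
    grep -i "ordering cycle"
}

# Usage on a real host, for the current boot:
#   journalctl -b 0 --no-pager | find_ordering_cycles
```

If this prints anything, the lines around each match in the full journal usually name the units involved and which job was deleted.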

20:44 <dustymabe> jlebon: yeah, when I'm not in staging :)

20:44 <jlebon> dustymabe: +1

20:44 <jlebon> there's `buildImage()` which should help there

20:44 <dustymabe> jlebon: oh?

20:45 <jlebon> though i think it might be cleaner long term to also use an external executor for x86_64 with this new podman remote world

20:45 <jlebon> yup: https://github.com/coreos/coreos-ci-lib/blob/main/vars/buildImage.groovy

20:46 <dustymabe> oh I see. builds it using a buildconfig?

20:46 <dustymabe> yeah I was just planning to use a new x86_64 builder

20:46 <jlebon> it'll build it in the same namespace, and then you should be able to `skopeo copy` it to quay too

20:46 <jlebon> +1

20:46 <jlebon> that'd be nicer yeah

20:48 <dustymabe> jlebon: what do you think about building with `--no-cache`?

20:49 <jlebon> dustymabe: definitely :)

20:49 <dustymabe> part of me thinks we should always `--no-cache` but another part of me thinks of days where we have a lot of commits go into cosa and the big waste --no-cache would be

20:49 <jlebon> that should be our default IMO unless we find out it causes serious performance issues :)

20:49 <dustymabe> I wish there was some sort of --cache-expire=1d

20:49 <jlebon> fair

20:50 <dustymabe> will go with --no-cache for now

20:50 <jlebon> i guess we could implement that manually by passing `--no-cache` only if the last push was X time ago

20:50 <jlebon> +1
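The manual cache-expiry idea above could be sketched roughly like this. The function name `cache_flag`, its arguments, and the one-day default are assumptions for illustration; how the job learns the time of the last push is left open, since the discussion does not say.

```shell
# Hypothetical helper: print "--no-cache" when the last push is at least
# max_age seconds old, otherwise print nothing (so the cache is reused).
# Arguments: last push time and current time as epoch seconds, then an
# optional max age in seconds (defaults to one day).
cache_flag() {
    local last_push="$1" now="$2" max_age="${3:-86400}"
    if [ $(( now - last_push )) -ge "$max_age" ]; then
        echo "--no-cache"
    fi
}

# Usage sketch (last_push_epoch is assumed to be known to the job):
#   podman build $(cache_flag "$last_push_epoch" "$(date +%s)") -t "$tag" .
```

Leaving the command substitution unquoted in the usage line is deliberate: when the function prints nothing, no empty argument is passed to `podman build`.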

20:54 <quentin96> jlebon I checked `journalctl -b 0` and didn't find anything

20:54 misuto has quit [Remote host closed the connection]

20:55 misuto has joined #fedora-coreos

20:55 <quentin96> jlebon there's no obvious cycle in those logs

20:59 <jlebon> quentin96: hmm sorry, I'm not sure. can you open an issue in https://github.com/coreos/fedora-coreos-tracker with the full Ignition and journal logs in the case where it misbehaves?

20:59 <jlebon> this might not be an FCOS issue, but we can start diagnosing there

20:59 <dustymabe> also I'm interested to know.. if you start 10 instances fresh in the exact same way.. how many of them succeed and how many fail

21:01 jpn has joined #fedora-coreos

21:07 <quentin96> jlebon dustymabe thank you so much for your help, I will do that Monday and will post an issue with the full logs and details.

21:07 <quentin96> Thanks a lot and have a good weekend!

21:08 <jlebon> quentin96: have a good weekend!

21:12 Guidon has quit [Ping timeout: 252 seconds]

21:14 fifofonix has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

21:18 mheon has quit [Ping timeout: 272 seconds]

21:25 fifofonix has joined #fedora-coreos

21:31 vgoyal has quit [Quit: Leaving]

21:32 gursewak has quit [Ping timeout: 272 seconds]

21:33 quentin96 has quit [Ping timeout: 252 seconds]

21:56 bgilbert has quit [Ping timeout: 245 seconds]

22:17 jlebon has quit [Quit: leaving]

22:20 jpn has quit [Ping timeout: 252 seconds]

22:47 jpn has joined #fedora-coreos

22:53 jpn has quit [Ping timeout: 245 seconds]

23:06 jpn has joined #fedora-coreos

23:11 jpn has quit [Ping timeout: 245 seconds]

23:24 jpn has joined #fedora-coreos

23:29 jpn has quit [Ping timeout: 245 seconds]

23:43 jpn has joined #fedora-coreos

23:49 jpn has quit [Ping timeout: 245 seconds]