#fedora-coreos on 2022-07-20 — irc logs at libera.irclog.whitequark.org

2022-05-11 12:42 dustymabe changed the topic of #fedora-coreos to: Fedora CoreOS :: Find out more at https://getfedora.org/coreos/ :: Logs at https://libera.irclog.whitequark.org/fedora-coreos

00:04 crobinso has quit [Ping timeout: 268 seconds]

01:30 daMaestro has joined #fedora-coreos

01:40 mnguyen has joined #fedora-coreos

01:42 mnguyen_ has quit [Ping timeout: 272 seconds]

02:08 paragan has joined #fedora-coreos

02:11 jpn has joined #fedora-coreos

02:16 jpn has quit [Ping timeout: 272 seconds]

02:48 gursewak has quit [Ping timeout: 272 seconds]

03:25 gursewak has joined #fedora-coreos

04:20 gursewak has quit [Ping timeout: 240 seconds]

05:02 jpn has joined #fedora-coreos

05:07 jpn has quit [Ping timeout: 240 seconds]

05:45 jcajka has joined #fedora-coreos

06:38 gursewak has joined #fedora-coreos

06:41 bgilbert has quit [Ping timeout: 268 seconds]

07:11 <dngray[m]> dustymabe: hmm maybe i should use iPXE, so i can just use http

07:11 <dngray[m]> and not have to worry about a tftp server

07:11 <dngray[m]> pretty sure SeaBIOS used in QEMU supports that

07:12 jpn has joined #fedora-coreos

07:12 <dngray[m]> like if i was deploying new, then seems like iPXE would be the way to go

07:17 <dngray[m]> https://andreaskaris.github.io/blog/linux/ipxe-boot-environment/ hmm so maybe fedora LXC can be the iPXE server

07:22 jpn has quit [Ping timeout: 268 seconds]

07:23 jpn has joined #fedora-coreos

07:30 daMaestro has quit [Quit: Leaving]

07:44 jpn has quit [Ping timeout: 240 seconds]

07:55 jpn has joined #fedora-coreos

08:05 jpn has quit [Ping timeout: 260 seconds]

08:25 jpn has joined #fedora-coreos

08:47 ksinny has joined #fedora-coreos

09:00 MHamzahKhan[m] has quit [Quit: You have been kicked for being idle]

09:09 paragan has quit [Read error: Connection reset by peer]

09:39 Betal has quit [Quit: WeeChat 3.6]

10:31 mnguyen has quit [Read error: Connection reset by peer]

10:31 mnguyen has joined #fedora-coreos

11:20 jpn has quit [Ping timeout: 255 seconds]

11:45 u1106 has joined #fedora-coreos

11:45 didib has joined #fedora-coreos

11:50 vgoyal has joined #fedora-coreos

11:50 <u1106> I understand stable was updated yesterday. We have 4 instances running in AWS EC2, 2 of them are hanging now. I saw kernel errors on the serial console. 2 haven't been updated yet and they are working fine. Has anybody else had problems?

11:55 <lucab> u1106: not that I've seen. There was a kernel update in latest stable (5.18.5 -> 5.18.7), but the same kernel has been in the testing stream for some weeks and we didn't see reports. What does it say on the console? Maybe it is something specific to your type of workload that our CI is not hitting?

11:57 <u1106> there were deadlocks reported from RCU

11:57 jpn has joined #fedora-coreos

12:00 <lucab> u1106: let's open a ticket with the full logs then

12:00 nalind has joined #fedora-coreos

12:10 paragan has joined #fedora-coreos

12:21 <u1106> thanks lucab. At the moment I cannot provide you logs. The system is hanging and the serial console did not support copy-paste. I was too slow to take a screenshot

12:22 ksinny has quit [Remote host closed the connection]

12:22 <u1106> I will certainly continue debugging this because we need our instances running. As soon as I manage to grab some logs I will share them.

12:24 <u1106> but the kernel versions you mention matched my observation: 5.18.5 is running just fine, the errors I saw were from 5.18.7 (according to my handwritten notes)

12:32 <dustymabe> u1106++

12:32 <lucab> in case it helps, current testing has a newer one (5.18.11)

12:41 <u1106> ok, that might be an option to test

12:43 <dustymabe> u1106: the AMI can be found at https://builds.coreos.fedoraproject.org/browser?stream=testing-devel&arch=x86_64 (sorry us-east-1 only)

12:44 <u1106> ah, ok. For production we cannot use us-east-1, but for testing that's fine

12:48 jpn has quit [Ping timeout: 255 seconds]

12:51 jpn has joined #fedora-coreos

13:02 crobinso has joined #fedora-coreos

13:09 mheon has joined #fedora-coreos

13:19 plarsen has joined #fedora-coreos

13:35 <dustymabe> u1106: preferably don't use `testing-devel` for production either :)

13:37 <u1106> what could possibly go wrong ;)

14:18 ksinny has joined #fedora-coreos

14:30 didib has quit [Quit: Leaving]

15:17 <lucab> I won't be around for the meeting today

15:25 <jlebon> PSA: the new CI checks "ShellCheck" and "golangci-lint" in coreos/coreos-assembler are now marked as required for `main`.

15:32 cyberpear has quit [Quit: Connection closed for inactivity]

15:44 wolfshappen has quit [Read error: Connection reset by peer]

15:45 wolfshappen has joined #fedora-coreos

16:02 paragan has quit [Quit: Leaving]

16:18 bgilbert has joined #fedora-coreos

16:24 jbrooks has joined #fedora-coreos

16:24 aaradhak has joined #fedora-coreos

16:30 <dustymabe> aaradhak davdunc dustymabe gursewak jaimelm jbrooks jcajka jdoss jlebon jmarrero lorbus miabbott nasirhm ravanelli saqali skunkerk walters

16:30 <dustymabe> FCOS community meeting in #fedora-meeting-1

16:30 <dustymabe> If you don't want to be pinged remove your name from this file: https://github.com/coreos/fedora-coreos-tracker/blob/main/meeting-people.txt

16:50 ksinny has quit [Quit: http://quassel-irc.org - Leaving]

17:00 Betal has joined #fedora-coreos

17:01 jpn has quit [Ping timeout: 268 seconds]

17:07 jcajka has quit [Quit: Leaving]

18:27 cyberpear has joined #fedora-coreos

18:34 jpn has joined #fedora-coreos

18:53 aaradhak has quit [Quit: Connection closed for inactivity]

19:00 jpn has quit [Ping timeout: 260 seconds]

19:13 <anthr76[m]> If I'm not mistaken it looks like a new release should be cut on https://github.com/ostreedev/ostree-rs-ext

19:27 jpn has joined #fedora-coreos

19:49 <dustymabe> cc walters ^^

19:50 <dustymabe> jlebon: i'm going to start churning fedora-coreos-pipeline. will probably be a few fixup PRs that roll in

19:51 <jlebon> dustymabe: ack

19:53 <walters> anthr76[m]: it's there in https://crates.io/crates/ostree-ext/versions but you're right we're not consistently doing GH releases

19:53 <walters> (it's also a tag)

19:56 misuto has quit [Quit: Leaving]

19:59 jpn has quit [Ping timeout: 272 seconds]

20:00 misuto has joined #fedora-coreos

20:05 <dustymabe> walters: mind investigating https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/blue/organizations/jenkins/build/detail/build/916/pipeline for me?

20:10 <walters> fix in https://github.com/coreos/coreos-assembler/pull/3000

20:12 misuto has quit [Remote host closed the connection]

20:15 misuto has joined #fedora-coreos

21:00 <dustymabe> jlebon: hmm. I think I figured it out

21:00 <jlebon> need a stamp from a committer on https://github.com/coreos/fedora-coreos-config/pull/1857

21:01 <jlebon> dustymabe: nice

21:01 <dustymabe> somehow the two runs are stepping on each other

21:01 <dustymabe> i.e. the aarch64 run is using the container ID from s390x

21:02 <dustymabe> jlebon: why is the conditional in that PR unnecessary?

21:02 <jlebon> dustymabe: added a comment :)

21:02 <jlebon> dustymabe: ahhh fun

21:03 nalind has quit [Quit: bye]

21:07 <jlebon> dustymabe: ahh I think I may have found it

21:08 <jlebon> https://github.com/coreos/fedora-coreos-pipeline/pull/570 ?

21:11 jpn has joined #fedora-coreos

21:16 <dustymabe> jlebon: thanks!

21:17 <dustymabe> looks like it's working

21:18 <u1106> dustymabe: lucab: I recovered the journal from that machine that failed after the upgrade today. After the machine had been hanging already for 2.5 hours there was a kernel: BUG: kernel NULL pointer dereference, address: 00000000000000f6

21:19 <u1106> however, to the best of our knowledge at the moment, the root cause of the problems was an OOM

21:20 <u1106> we could reproduce the problem also with 36.20220618.3.1

21:21 <u1106> although those machines had been working without problems for several weeks, when deploying a new one from the old image, today they also ended up in OOM

21:22 jpn has quit [Ping timeout: 276 seconds]

21:24 <u1106> I have the feeling something has changed in the EC2 hypervisor so that we have less available memory than before. Or the newest vulnerability mitigations have changed the timing during boot so much that we ended up in OOM. We were probably really tight already before

21:25 <u1106> At the moment we have both 36.20220618.3.1 and 36.20220703.3.1 working fine after we gave it a bit of swap space

21:26 <dustymabe> u1106: thank you for continuing to give us updates. I look forward to hearing from you more over the coming days

21:26 <dustymabe> might be worth making a discussion forum post to chronicle the problem and your investigation

21:26 <dustymabe> others may be seeing something similar

21:27 <u1106> So I do have the complete log with the null pointer and whatnot problems. But I have the feeling the root cause is not the version update, but that we were really tight on memory before and for some reason it bit us today. So I doubt making a bug report will help anyone.

21:29 <u1106> dustymabe: what forum would that be?

21:49 misuto has quit [Remote host closed the connection]

21:51 misuto has joined #fedora-coreos

22:03 misuto has quit [Remote host closed the connection]

22:03 misuto has joined #fedora-coreos

22:06 misuto has quit [Client Quit]

22:16 misuto has joined #fedora-coreos

22:29 npcomp has quit [Quit: WeeChat 1.9.1]

22:30 npcomp has joined #fedora-coreos

22:34 mheon has quit [Ping timeout: 272 seconds]

22:52 crobinso has quit [Remote host closed the connection]

22:59 <bgilbert> u1106: https://discussion.fedoraproject.org/tag/coreos

23:09 hjst has quit [Quit: Bye.]

23:11 hjst has joined #fedora-coreos

23:54 vgoyal has quit [Quit: Leaving]