dustymabe changed the topic of #fedora-coreos to: Fedora CoreOS :: Find out more at https://getfedora.org/coreos/ :: Logs at https://libera.irclog.whitequark.org/fedora-coreos
crobinso has quit [Ping timeout: 268 seconds]
daMaestro has joined #fedora-coreos
mnguyen has joined #fedora-coreos
mnguyen_ has quit [Ping timeout: 272 seconds]
paragan has joined #fedora-coreos
jpn has joined #fedora-coreos
jpn has quit [Ping timeout: 272 seconds]
gursewak has quit [Ping timeout: 272 seconds]
gursewak has joined #fedora-coreos
gursewak has quit [Ping timeout: 240 seconds]
jpn has joined #fedora-coreos
jpn has quit [Ping timeout: 240 seconds]
jcajka has joined #fedora-coreos
gursewak has joined #fedora-coreos
bgilbert has quit [Ping timeout: 268 seconds]
<dngray[m]> dustymabe: hmm maybe i should use iPXE, so i can just use http
<dngray[m]> and not have to worry about a tftp server
<dngray[m]> pretty sure SeaBIOS used in qemu supports that
jpn has joined #fedora-coreos
<dngray[m]> like if i was deploying new, then seems like iPXE would be the way to go
<dngray[m]> https://andreaskaris.github.io/blog/linux/ipxe-boot-environment/ hmm so maybe fedora LXC can be the iPXE server
jpn has quit [Ping timeout: 268 seconds]
jpn has joined #fedora-coreos
daMaestro has quit [Quit: Leaving]
jpn has quit [Ping timeout: 240 seconds]
jpn has joined #fedora-coreos
jpn has quit [Ping timeout: 260 seconds]
jpn has joined #fedora-coreos
ksinny has joined #fedora-coreos
MHamzahKhan[m] has quit [Quit: You have been kicked for being idle]
paragan has quit [Read error: Connection reset by peer]
Betal has quit [Quit: WeeChat 3.6]
mnguyen has quit [Read error: Connection reset by peer]
mnguyen has joined #fedora-coreos
jpn has quit [Ping timeout: 255 seconds]
u1106 has joined #fedora-coreos
didib has joined #fedora-coreos
vgoyal has joined #fedora-coreos
<u1106> I understand stable was updated yesterday. We have 4 instances running in AWS EC2, 2 of them are hanging now. I saw kernel errors on the serial console. 2 haven't been updated yet and they are working fine. Has anybody else had problems?
<lucab> u1106: not that I've seen. There was a kernel update in latest stable (5.18.5 -> 5.18.7), but the same kernel has been in the testing stream for some weeks and we didn't see reports. What does it say on the console? Maybe it is something specific to your type of workload that our CI is not hitting?
<u1106> there were deadlocks reported from RCU
jpn has joined #fedora-coreos
<lucab> u1106: let's open a ticket with the full logs then
nalind has joined #fedora-coreos
paragan has joined #fedora-coreos
<u1106> thanks lucab. At the moment I cannot provide you logs. The system is hanging and the serial console did not support copy-paste. I was too slow to take a screenshot
ksinny has quit [Remote host closed the connection]
<u1106> I will certainly continue debugging this because we need our instances running. As soon as I manage to grab some logs I will share them.
<u1106> but the kernel versions you mention matched my observation: 5.18.5 is running just fine, the errors I saw were from 5.18.7 (according to my handwritten notes)
<dustymabe> u1106++
<lucab> in case it helps, current testing has a newer one (5.18.11)
<u1106> ok, that might be an option to test
<dustymabe> u1106: the AMI can be found at https://builds.coreos.fedoraproject.org/browser?stream=testing-devel&arch=x86_64 (sorry us-east-1 only)
<u1106> ah, ok. For production we cannot use us-east-1, but for testing that's fine
jpn has quit [Ping timeout: 255 seconds]
jpn has joined #fedora-coreos
crobinso has joined #fedora-coreos
mheon has joined #fedora-coreos
plarsen has joined #fedora-coreos
<dustymabe> u1106: preferably don't use `testing-devel` for production either :)
<u1106> what could possibly go wrong ;)
ksinny has joined #fedora-coreos
didib has quit [Quit: Leaving]
<lucab> I won't be around for the meeting today
<jlebon> PSA: the new CI checks "ShellCheck" and "golangci-lint" in coreos/coreos-assembler are now marked as required for `main`.
cyberpear has quit [Quit: Connection closed for inactivity]
wolfshappen has quit [Read error: Connection reset by peer]
wolfshappen has joined #fedora-coreos
paragan has quit [Quit: Leaving]
bgilbert has joined #fedora-coreos
jbrooks has joined #fedora-coreos
aaradhak has joined #fedora-coreos
<dustymabe> aaradhak davdunc dustymabe gursewak jaimelm jbrooks jcajka jdoss jlebon jmarrero lorbus miabbott nasirhm ravanelli saqali skunkerk walters
<dustymabe> FCOS community meeting in #fedora-meeting-1
<dustymabe> If you don't want to be pinged remove your name from this file: https://github.com/coreos/fedora-coreos-tracker/blob/main/meeting-people.txt
ksinny has quit [Quit: http://quassel-irc.org - Leaving]
Betal has joined #fedora-coreos
jpn has quit [Ping timeout: 268 seconds]
jcajka has quit [Quit: Leaving]
cyberpear has joined #fedora-coreos
jpn has joined #fedora-coreos
aaradhak has quit [Quit: Connection closed for inactivity]
jpn has quit [Ping timeout: 260 seconds]
<anthr76[m]> If I'm not mistaken it looks like a new release should be cut on https://github.com/ostreedev/ostree-rs-ext
jpn has joined #fedora-coreos
<dustymabe> cc walters ^^
<dustymabe> jlebon: i'm going to start churning fedora-coreos-pipeline. will probably be a few fixup PRs that roll in
<jlebon> dustymabe: ack
<walters> anthr76[m]: it's there in https://crates.io/crates/ostree-ext/versions but you're right we're not consistently doing GH releases
<walters> (it's also a tag)
misuto has quit [Quit: Leaving]
jpn has quit [Ping timeout: 272 seconds]
misuto has joined #fedora-coreos
misuto has quit [Remote host closed the connection]
misuto has joined #fedora-coreos
<dustymabe> jlebon: hmm. I think I figured it out
<jlebon> need a stamp from a committer on https://github.com/coreos/fedora-coreos-config/pull/1857
<jlebon> dustymabe: nice
<dustymabe> somehow the two runs are stepping on each other
<dustymabe> i.e. the aarch64 run is using the container ID from s390x
<dustymabe> jlebon: why is the conditional in that PR unnecessary?
<jlebon> dustymabe: added a comment :)
<jlebon> dustymabe: ahhh fun
nalind has quit [Quit: bye]
<jlebon> dustymabe: ahh I think I may have found it
jpn has joined #fedora-coreos
<dustymabe> jlebon: thanks!
<dustymabe> looks like it's working
<u1106> dustymabe: lucab: I recovered the journal from that machine that failed after the upgrade today. After the machine had been hanging already for 2.5 hours there was a kernel: BUG: kernel NULL pointer dereference, address: 00000000000000f6
<u1106> however to the best of our knowledge at the moment the root cause for the problems was an OOM
<u1106> we could reproduce the problem also with 36.20220618.3.1
<u1106> although those machines had been working without problems for several weeks, when deploying a new one from the old image, today they also ended up in OOM
jpn has quit [Ping timeout: 276 seconds]
<u1106> I have the feeling something has changed in the EC2 hypervisor so that we have less available memory than before. Or the newest vulnerability mitigations have changed the timing during boot so much that we ended up in OOM. We were probably really tight already before
<u1106> At the moment we have both 36.20220618.3.1 and 36.20220703.3.1 working fine after we gave it a bit of swap space
<dustymabe> u1106: thank you for continuing to give us updates. I look forward to hearing from you more over the coming days
<dustymabe> might be worth making a discussion forum post to chronicle the problem and your investigation
<dustymabe> others may be seeing something similar
<u1106> So I do have the complete log with null pointer and what not problems. But I have the feeling the root cause is not the version update, but that we were really tight on memory before and for some reason it bit us today. So I doubt making a bug report will help anyone.
<u1106> dustymabe: what forum would that be?
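Aside: the "really tight on memory" hypothesis above could be checked on a running instance with something like the following. This is only a minimal sketch, not something anyone in the channel ran. It assumes the standard Linux /proc/meminfo format; the helper names (parse_meminfo, memory_headroom) and the 100 MiB threshold are hypothetical and chosen for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Values are in kiB for lines that carry a "kB" suffix.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_headroom(text, min_available_kib=100 * 1024):
    """Summarise available memory and swap from meminfo text.

    min_available_kib is an arbitrary illustrative threshold (100 MiB),
    not a value recommended by Fedora CoreOS.
    """
    info = parse_meminfo(text)
    available = info.get("MemAvailable", 0)
    return {
        "available_kib": available,
        "swap_total_kib": info.get("SwapTotal", 0),
        "tight": available < min_available_kib,
    }


if __name__ == "__main__":
    # On a live Linux host, read the real file.
    with open("/proc/meminfo") as f:
        print(memory_headroom(f.read()))
```

Running it before and after adding swap would show whether SwapTotal changed and how much MemAvailable is left once the workload is up.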
misuto has quit [Remote host closed the connection]
misuto has joined #fedora-coreos
misuto has quit [Remote host closed the connection]
misuto has joined #fedora-coreos
misuto has quit [Client Quit]
misuto has joined #fedora-coreos
npcomp has quit [Quit: WeeChat 1.9.1]
npcomp has joined #fedora-coreos
mheon has quit [Ping timeout: 272 seconds]
crobinso has quit [Remote host closed the connection]
hjst has quit [Quit: Bye.]
hjst has joined #fedora-coreos
vgoyal has quit [Quit: Leaving]