MHamzahKhan[m] has quit [Quit: You have been kicked for being idle]
paragan has quit [Read error: Connection reset by peer]
Betal has quit [Quit: WeeChat 3.6]
mnguyen has quit [Read error: Connection reset by peer]
mnguyen has joined #fedora-coreos
jpn has quit [Ping timeout: 255 seconds]
u1106 has joined #fedora-coreos
didib has joined #fedora-coreos
vgoyal has joined #fedora-coreos
<u1106>
I understand stable was updated yesterday. We have 4 instances running in AWS EC2, 2 of them are hanging now. I saw kernel errors on the serial console. 2 haven't been updated yet and they are working fine. Has anybody else had problems?
<lucab>
u1106: not that I've seen. There was a kernel update in latest stable (5.18.5 -> 5.18.7), but the same kernel has been in the testing stream for some weeks and we didn't see reports. What does it say on the console? Maybe it is something specific to your type of workload that our CI is not hitting?
<u1106>
there were deadlocks reported from RCU
jpn has joined #fedora-coreos
<lucab>
u1106: let's open a ticket with the full logs then
nalind has joined #fedora-coreos
paragan has joined #fedora-coreos
<u1106>
thanks lucab. At the moment I cannot provide you with logs. The system is hanging and the serial console did not support copy-paste. I was too slow to take a screenshot
ksinny has quit [Remote host closed the connection]
<u1106>
I will certainly continue debugging this because we need our instances running. As soon as I manage to grab some logs I will share them.
<u1106>
but the kernel versions you mention matched my observation: 5.18.5 is running just fine, the errors I saw were from 5.18.7 (according to my handwritten notes)
<dustymabe>
u1106++
<lucab>
in case it helps, current testing has a newer one (5.18.11)
<u1106>
dustymabe: lucab: I recovered the journal from that machine that failed after the upgrade today. After the machine had been hanging already for 2.5 hours there was a kernel: BUG: kernel NULL pointer dereference, address: 00000000000000f6
<u1106>
however, to the best of our knowledge at the moment, the root cause of the problems was an OOM
<u1106>
we could also reproduce the problem with 36.20220618.3.1
<u1106>
although those machines had been working without problems for several weeks, when deploying a new one from the old image, today they also ended up in OOM
jpn has quit [Ping timeout: 276 seconds]
<u1106>
I have the feeling something has changed in the EC2 hypervisor so that we have less available memory than before. Or the newest vulnerability mitigations have changed the timing during boot so much that we ended up in OOM. We were probably already really tight before
<u1106>
At the moment we have both 36.20220618.3.1 and 36.20220703.3.1 working fine after we gave them a bit of swap space
<dustymabe>
u1106: thank you for continuing to give us updates. I look forward to hearing from you more over the coming days
<dustymabe>
might be worth making a discussion forum post to chronicle the problem and your investigation
<dustymabe>
others may be seeing something similar
<u1106>
So I do have the complete log with the null pointer and other problems. But I have the feeling the root cause is not the version update, but that we were really tight on memory before and for some reason it bit us today. So I doubt making a bug report will help anyone.
<u1106>
dustymabe: what forum would that be?
misuto has quit [Remote host closed the connection]
misuto has joined #fedora-coreos
misuto has quit [Remote host closed the connection]
misuto has joined #fedora-coreos
misuto has quit [Client Quit]
misuto has joined #fedora-coreos
npcomp has quit [Quit: WeeChat 1.9.1]
npcomp has joined #fedora-coreos
mheon has quit [Ping timeout: 272 seconds]
crobinso has quit [Remote host closed the connection]