dustymabe changed the topic of #fedora-coreos to: Fedora CoreOS :: Find out more at https://getfedora.org/coreos/ :: Logs at https://libera.irclog.whitequark.org/fedora-coreos
<dustymabe> hey lucab do you mind looking into this failure: https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/build/874/
<dustymabe> failing in basic checks.. no new packages so must be something changed with f-c-c and your commits are the only thing that have been merged recently
<dustymabe> could also be something in cosa changed
<lucab> dustymabe: ack, just let me login to see that
<dustymabe> lucab++
<dustymabe> if it is something related to your commits I'd be interested to know why CI failed us in this case
<dustymabe> "CI on the PR"
<lucab> like for rawhide, the actual failure is at compose time: https://paste.centos.org/view/3e8989f8
<lucab> this is the same as https://github.com/coreos/rpm-ostree/pull/3424, which we are still somehow tracking as https://github.com/coreos/coreos-assembler/issues/2707 I think
<lucab> I'm not exactly sure what's going wrong with the groupadd, but I don't think it's the PR per se
<lucab> dustymabe: were there two builds going on in parallel or did you re-trigger it?
<dustymabe> lucab: I don't think there were two builds going on at the same time - let me check
<dustymabe> ahh - lucab I imagine since there were two PRs merged to the f-c-c repo then two builds were scheduled
<dustymabe> any queued builds wait (i.e. only one runs at a time for a given stream (in this case `testing-devel`))
<dustymabe> lucab: and in this case the 2nd build is past the kola basic checks
<lucab> ok, then it shouldn't be caused by some lock crossing the builds
<lucab> (and I think groupadd only does locking based on local files)
<dustymabe> yeah, this is a weird one.. some sort of race condition?
<lucab> I think it's some kind of cache mixup or locking/unlocking regression, possibly due to overlays
<dustymabe> we saw this at least one other time recently
<dustymabe> and I see you investigated that one too
<dustymabe> i wish we could just error out at compose time if a scriptlet fails
<dustymabe> for the set of packages we care about maybe we could enforce that and work with the maintainers to make it work
<dustymabe> but that's tangential to the current problem
<dustymabe> other concerning failures I'm seeing from that build:
<dustymabe> [2022-07-11T13:17:48.772Z] rpm.posttrans: ln: failed to create symbolic link '/var/lib/rpm': Read-only file system
<dustymabe> [2022-07-11T13:17:48.772Z] kexec-tools.posttrans: /usr/bin/kdumpctl: line 49: /var/lock/kdump: No such file or directory
<dustymabe> [2022-07-11T13:17:48.772Z] kexec-tools.posttrans: kdump: Create file lock failed
<dustymabe> jlebon: WDYT about getting to a point where we enforce our scriptlets not failing in order to pass a rpm-ostree compose in FCOS?
<dustymabe> we could allowlist failures - but would be nice to evaluate them before allowing them
<walters> dustymabe: note that traditional rpm also has this semantic...it's because post scripts are just regular shell scripts and no one uses `set -e` there
<walters> we could of course inject `set -e`, and if we do that probably want `-o pipefail` too but...the fallout from such a thing is hard to predict
<jlebon> yup exactly. rpm-ostree matches rpm in that respect AFAIK.
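The point about `set -e` and `-o pipefail` can be illustrated with a small sketch. It is not how rpm or rpm-ostree actually run scriptlets; `run_scriptlet` is a hypothetical helper that simply hands a snippet to `bash -c`, assuming `bash` is on the PATH.

```python
import subprocess


def run_scriptlet(body: str, options: str = "") -> int:
    """Run a shell snippet with bash and return its exit status.

    `options` is passed to the `set` builtin first (e.g. "-e" or
    "-e -o pipefail"); an empty string leaves the shell defaults in place.
    This is a hypothetical helper, not an rpm or rpm-ostree API.
    """
    prefix = f"set {options}\n" if options else ""
    return subprocess.run(["bash", "-c", prefix + body]).returncode


# A failing command followed by a succeeding one.
failing_step = "false\ntrue\n"
# A pipeline whose first stage fails and whose last stage succeeds.
failing_pipe = "false | true\n"

if __name__ == "__main__":
    # Default shell behaviour: the exit status is that of the last command.
    print("default, failing step:", run_scriptlet(failing_step))
    # With -e the script stops at the first failing command.
    print("set -e, failing step:", run_scriptlet(failing_step, "-e"))
    # -e alone does not notice a failure early in a pipeline.
    print("set -e, failing pipe:", run_scriptlet(failing_pipe, "-e"))
    # Adding pipefail makes the pipeline itself report the failure.
    print("set -e -o pipefail, failing pipe:",
          run_scriptlet(failing_pipe, "-e -o pipefail"))
```

Without any options the failing step goes unnoticed, which is the "stumble forward" behaviour described in the conversation.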
<jlebon> dustymabe: do we have something filed yet? last time that group lock issue came up, it turned out to be due to something else that went wrong earlier IIRC
<dustymabe> jlebon: right - it's a failure during the compose
<dustymabe> so the lock is left around and baked into the image
<walters> rpm-ostree v2022.11 relnotes https://hackmd.io/fQG88NY5RqaDDZRcZh_wdQ
<dustymabe> I don't think we have anything filed in the tracker repo other than the pieces luca linked to earlier
<dustymabe> lucab: could you file something in the tracker repo?
<dustymabe> walters: correct - we are matching rpm semantics here. I'm wondering if we could add an option to be stricter and then start to try to work on the package set we care about to get into non failing compliance
<dustymabe> i.e. work towards a more sane future at least for ourselves
<dustymabe> some of these failures may be expected, but some of them may be because we are using rpm-ostree and we want to know where we differ if we differ
<dustymabe> IMO
<walters> I am not sure if this was intentional but I believe the reason traditional RPM does not abort on script failure is because for its model of "live, in place updates", you can't really undo and it's not transactional, so the idea is to just stumble forward and hope for the best instead
<walters> but clearly for generating new install roots, we could require strictness
<dustymabe> walters: indeed. This is where we could be better
<dustymabe> and I doubt package maintainers would object too hard to PRs that enhance their scriptlets
<dustymabe> i guess 1st we'd need to support the enhancement in `rpm-ostree` and then we could go from there
<walters> yeah let's track as an rpm-ostree issue? I know we have a lot but some do get fixed 😄
<jlebon> hmm, shouldn't this be a proposed packaging guidelines change instead? it's valid to not use bash strict mode.
<jlebon> and i wouldn't be surprised if many scriptlets are already relying on that fact
<walters> yeah agree
<dustymabe> jlebon: I think there are two ways to approach it
<dustymabe> one is to add a check/allowlist on our side so we can incrementally implement the feature and file/track fixes upstream
<dustymabe> the other is to enforce it at the distro level, trying to get all packages to update
<dustymabe> we should probably do both
<dustymabe> I doubt we'll get RPM to change semantics itself (as colin mentioned earlier, there's nothing rpm can do because it can't roll back)
<walters> librpm could do the same thing in the case where it's constructing a new root, it's the same thing
<dustymabe> walters: correct. I don't know if it would do some sort of detection there or if the user would need to tell it
<dustymabe> I guess that could be hoisted into librpm if we wanted to try to go that path
<lucab> I don't know if it is related to our flake, but I noticed that systemd-sysusers seems to also be buggy on the topic of shadow locking: https://github.com/systemd/systemd/issues/23977
<dustymabe> +1
<dustymabe> lucab: can you open a tracker issue for us so we can track when this happens in our pipeline
<dustymabe> at least when it does happen the tests fail so we won't leak any bad builds
gursewak has joined #fedora-coreos
vgoyal has joined #fedora-coreos
<lucab> dustymabe: ah yes sorry, I had that unfinished in a pending tab, it's https://github.com/coreos/fedora-coreos-tracker/issues/1250
<dustymabe> thanks lucab!
saqali has joined #fedora-coreos
<dustymabe> hey saqali
<saqali> hey
<dustymabe> for the grub password feature - do we have a test for that?
<dustymabe> i guess it's hard to test considering it kind of needs an interactive user to type the password
<saqali> ye we dont really have a test for it
<dustymabe> but we do have docs?
<dustymabe> do we have any docs in our Fedora CoreOS docs?
<dustymabe> I'm not saying we absolutely need it, but we do have a lot of examples there too
<saqali> no we do not - What will that look like?
<dustymabe> probably a new tab under System Configuration
<dustymabe> Configuring the bootloader - or Configuring GRUB
<dustymabe> basically what I'm looking to do is make sure we have manual test coverage for this periodically
<saqali> Hmm ok - I can add an entry
<dustymabe> i.e. we should create a new test case for it
<dustymabe> here's the last time we ran through and identified new test cases to create: https://github.com/coreos/fedora-coreos-tracker/issues/1147
<saqali> I have some ideas for automatic tests, but nothing that will fully test the feature
<dustymabe> but usually what we do in those test cases is link to our documentation
<dustymabe> we could use the link you provided above
<dustymabe> but typically we've linked directly to the FCOS docs (not the subproject docs)
<saqali> right that makes sense
<dustymabe> maybe let's ask bgilbert and jlebon if they think a new toplevel doc would be useful here
<dustymabe> but whether we create the new toplevel doc or not, let's make sure to get a manual test case added for this feature
<dustymabe> there are steps in https://github.com/coreos/fedora-coreos-tracker/issues/1147 that say how to do it
<jlebon> i think an FCOS doc would be nice, including instructions on how to generate the password
<dustymabe> +1
<saqali> +1
<dustymabe> jlebon: does "System Configuration" -> "Configuring the Bootloader" work?
<dustymabe> or prefer "System Configuration" -> "Configuring GRUB" ?
<jlebon> maybe "Setting a GRUB password" ?
<dustymabe> jlebon: is that the only thing we can do?
<jlebon> it's the only thing we should document IMO :)
<dustymabe> I thought we added a feature to make user grub configs possible (i.e. not just setting password)
<dustymabe> be back in 10
<dustymabe> back
<saqali> dustymabe, nope we've limited it so that currently you can only set a password
<dustymabe> saqali: added a review to https://github.com/coreos/fedora-coreos-config/pull/1810
<jlebon> saqali: but note users can write anything they want there by editing the Ignition config directly
<saqali> yep and that is not officially supported
<vgoyal> Trying to use Fedora CoreOS for the first time. Trying to boot an image using qemu. As per the documentation, I'm trying to prepare an Ignition file, setting a password_hash for user "core".
<vgoyal> https://docs.fedoraproject.org/en-US/fedora-coreos/authentication/#:~:text=Fedora%20CoreOS%20ships%20with%20no,password%20for%20a%20local%20user.
justJanne has joined #fedora-coreos
<vgoyal> Problem is, mkpasswd seems to generate different hash for same password.
<vgoyal> podman run -ti --rm quay.io/coreos/mkpasswd --method=yescrypt foo
<vgoyal> $y$j9T$7nncGqUEGgas2BLNFxusp.$ly1lAss9JpeGPm7LpgVsbtb2F078dA6enJegQF23Z2D
<vgoyal> podman run -ti --rm quay.io/coreos/mkpasswd --method=yescrypt foo
<vgoyal> $y$j9T$hJdKZIuXD3SoFBbK8Prmz0$5WAcPyIsGS1JSjEogI5gtjA31OLAOob.xDwpm20S8p6
<justJanne> vgoyal: yes, that’s intended, the password is salted
<vgoyal> so how does verification work when the image boots?
<vgoyal> When I enter the password, I am assuming it will generate a hash and try to match it with the one I passed in the Ignition file. If the generated hash is different every time, how does this hash match?
<justJanne> A salted hash works like this: you generate a random string, the salt. Then you return salt + . + hash(salt + password).
<justJanne> When comparing passwords, you split by . into salt and the salted hash. Then you can compute hash(salt+password) and compare that with the salted hash
<vgoyal> justJanne: aha, thanks for the explanation. So by looking at the hash, it can be figured out what's the salt and use that to create hash again with the password and that time it should result in same hash.
<justJanne> exactly
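The salt-then-hash scheme described above can be sketched in a few lines. This is an illustration only: it uses SHA-256 as a stand-in for the hash function, whereas real password hashing (including the yescrypt method used by `mkpasswd` here) relies on deliberately slow algorithms. The function names `make_hash` and `verify` are made up for the example.

```python
import hashlib
import hmac
import secrets


def make_hash(password: str) -> str:
    """Return 'salt.digest' for the password, using a fresh random salt.

    SHA-256 is an assumption made for brevity; it is NOT what yescrypt does
    and is not suitable for storing real passwords.
    """
    salt = secrets.token_hex(8)  # new random salt on every call
    digest = hashlib.sha256((salt + password).encode()).hexdigest()
    return salt + "." + digest


def verify(password: str, stored: str) -> bool:
    """Recompute the digest with the stored salt and compare."""
    salt, digest = stored.split(".", 1)
    candidate = hashlib.sha256((salt + password).encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(candidate, digest)


if __name__ == "__main__":
    first = make_hash("foo")
    second = make_hash("foo")
    print(first != second)         # same password, different stored strings
    print(verify("foo", first))    # still verifies against either one
    print(verify("foo", second))
    print(verify("bar", first))    # a wrong password does not
```

Because the salt travels with the stored string, two runs over the same password produce different output yet both verify.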
* vgoyal will go through wikipedia page as well.
<justJanne> most password schemes use first a $, then the identifier for the algorithm (2b for bcrypt, y for yescrypt, etc), then another $, the parameters that need to be used to compare the password, another $, the salt, another $ and the hashed password
<justJanne> e.g., $y$j9T$hJdKZIuXD3SoFBbK8Prmz0$5WAcPyIsGS1JSjEogI5gtjA31OLAOob.xDwpm20S8p6 is likely y as algorithm, j9T as parameters, hJdKZIuXD3SoFBbK8Prmz0 as salt and 5WAcPyIsGS1JSjEogI5gtjA31OLAOob.xDwpm20S8p6 as salted hash
<justJanne> the $y$j9T$ part tells the system which algorithm to use for comparing hashes
<vgoyal> got it. So all the information needed to create same hash from password is part of generated hash
<justJanne> exactly :)
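As a rough sketch of that field layout, the two yescrypt strings pasted earlier can be split on `$`. The helper `split_crypt_hash` and the `CryptFields` type are hypothetical, and the four-field assumption only covers strings shaped like the ones in this log; other schemes lay out their fields differently.

```python
from typing import NamedTuple


class CryptFields(NamedTuple):
    algorithm: str
    params: str
    salt: str
    digest: str


def split_crypt_hash(value: str) -> CryptFields:
    """Split a '$id$params$salt$digest' string into its four fields.

    Assumes exactly the layout seen in the yescrypt examples from the log;
    anything else is rejected rather than guessed at.
    """
    parts = value.split("$")
    # A leading '$' yields an empty first element, followed by four fields.
    if len(parts) != 5 or parts[0] != "":
        raise ValueError("unexpected hash layout")
    return CryptFields(*parts[1:])


if __name__ == "__main__":
    first = split_crypt_hash(
        "$y$j9T$7nncGqUEGgas2BLNFxusp.$ly1lAss9JpeGPm7LpgVsbtb2F078dA6enJegQF23Z2D"
    )
    second = split_crypt_hash(
        "$y$j9T$hJdKZIuXD3SoFBbK8Prmz0$5WAcPyIsGS1JSjEogI5gtjA31OLAOob.xDwpm20S8p6"
    )
    # Same algorithm and parameters, different salts, hence different digests.
    print(first.algorithm, first.params, first.salt)
    print(second.algorithm, second.params, second.salt)
```

Both strings share the `y` identifier and `j9T` parameters; only the salt and the resulting digest differ.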
<jlebon> walters: i'm inclined to drop the whole "What's Changed" section. otherwise LGTM!
<dustymabe> justJanne++
<justJanne> So, I’ve got an issue where I feel like I’m doing something so wrong it has to be embarrassing
<justJanne> I’ve now simplified my ignition config down to this: https://gist.github.com/justjanne/f890504e6cea5e363228e02de3f5b913 (generated with butane -sp from a variant: fcos, version: 1.4.0 butane.yaml file)
<justJanne> I’m installing that with quay.io/coreos/coreos-installer:release (running this on bare metal)
<justJanne> the only further change I make is running `rm /mnt/boot/EFI/fedora/BOOTX64.CSV`, to keep fcos from altering the default boot order (as otherwise the pxe-boot for the rescue system doesn’t work anymore)
<justJanne> it boots fine. Everything seems to work, in theory
<justJanne> except, I can’t log in via ssh
<justJanne> all I get is `op=PAM:bad_ident`
<justJanne> ah, with that simplified setup it actually works, let’s see if it still breaks with raid1 booting or if something else was the issue
<jlebon> walters: ok, i stamped a bunch of PRs, but didn't merge them so they get in after the release
tormath1 has quit [Quit: leaving]
<justJanne> without boot_device: mirror: ..., fcos boots in 40 seconds from first install to everything working + ssh login possible
<justJanne> with boot_device: mirror: ..., even after 7 minutes it’s not even listening on port 22
<justJanne> everything else exactly identical
Betal has joined #fedora-coreos
vgoyal has quit [Quit: Leaving]
<dustymabe> justJanne: how large are your disks?
<justJanne> 2x 1TB NVMe, in raid1 as boot disks, plus 2x 4TB HDD, in raid1 as data disks.
<justJanne> I shouldn’t be affected by the MBR/2TB issue
<dustymabe> justJanne: anything happening on the console (serial and/or VGA) of the machine? any errors that you see?
<justJanne> dustymabe: I have neither serial nor VGA access
<dustymabe> well that makes things difficult :)
<justJanne> I usually reboot back into rescue via the API and pick apart the system.journal from the ostree manually
<dustymabe> IOW it's mostly a black box
<justJanne> it’s a dedicated server located remotely for which I’ve got only one API call, which is basically `function reboot(rescue: boolean)`
Betal has quit [Ping timeout: 260 seconds]
<justJanne> I can request VGA console access, but as the hoster has a limited amount of those, that quickly tends to get pricy :)
<dustymabe> justJanne: only thing I can recommend is trying to reproduce this with a similar hardware setup in an environment where you do have access to those basic debugging tools
<justJanne> I’m atm trying to deploy the following ignition file https://gist.github.com/justjanne/f890504e6cea5e363228e02de3f5b913, I’ll wait ~5 minutes, then reboot back to rescue, and then I can give you logs
<justJanne> dustymabe: I’m running fcos in multiple places just fine, it seems like it’s an issue with the raid config, which I can’t easily replicate elsewhere
jpn has quit [Ping timeout: 260 seconds]
<dustymabe> the problem is that if it's an issue in Ignition then there won't be any logs on the disk
<dustymabe> i.e. Ignition fails, system goes to emergency.target, never switches to the real root, journal logs never get persisted to disk
<justJanne> ah the entire filesystem on root is irreparably broken
<justJanne> I don’t know why, but I guess it is
<dustymabe> I mean, if it fails provisioning the disk then that's not surprising I don't guess. We just don't know how it is failing without the console :(
vgoyal has joined #fedora-coreos
<justJanne> I’ve got to be honest, the whole situation with how the ecosystem broke apart after the acquisition is really painful
<justJanne> flatcar boots, but can’t do boot-disk raid. fcos doesn’t always provision despite the config passing all checks, but at least in theory it should work.
jpn has joined #fedora-coreos
<dustymabe> I'm struggling to find a constructive response to that last comment
<dustymabe> software is never going to #justwork all the time. In the cases it doesn't you need debugging tools. The lack of those tools is going to make your life hard if there are ever any unforeseen issues.
<justJanne> I don’t think there’s an actual bug here, I think I just made a stupid mistake, which neither I nor the linters are able to catch
jpn has quit [Ping timeout: 260 seconds]
jpn has joined #fedora-coreos
Betal has joined #fedora-coreos
<dustymabe> also possible..
<dustymabe> only other thing I can think of is for you to run the OS off of one of the disks and then pass through the remaining disks directly to a FCOS VM and watch the whole process on the console of the VM (via your SSH connection to the host)
<dustymabe> i.e. Fedora or Ubuntu or FCOS or whatever host with libvirt/kvm installed then create an FCOS VM with the disks passed through to it where you run the install process (and watch Ignition run/fail)
<justJanne> ooooh I think I remember the issue
<justJanne> it may be an issue I ran into before
<walters> this also relates to https://github.com/coreos/ignition/issues/585
<walters> (just edited https://github.com/coreos/ignition/issues/585#issuecomment-687181337 with an updated strawman)
<justJanne> tl;dr: if you try to mount a new filesystem to a folder that doesn’t exist *yet*, it never actually completes booting or creating that folder. I’ll try further to figure out more of the details
<justJanne> mounting to /var/lib/container-data doesn’t work and fails booting
<justJanne> let’s see what’s the actual reason behind it, and how to work around it
<justJanne> (for reference, this was a relatively standard setup for persistent container data in old containerlinux)
<justJanne> (alternatively /var/lib/data)
<justJanne> jlebon: let me update that with the current iteration
<jlebon> Ignition should be able to cope fine with previously non-existent mount points. definitely we should file an issue if we regressed there.
<jlebon> dustymabe, walters: feels nice to close a 4.5 year old RFE :) (https://github.com/coreos/rpm-ostree/issues/1265)
<dustymabe> jlebon: nice!
<justJanne> dustymabe: I managed to get it down to these two sample cases: https://gist.github.com/justjanne/f890504e6cea5e363228e02de3f5b913
nb[m] has joined #fedora-coreos
nbsadminaccount- has joined #fedora-coreos
jpn has quit [Ping timeout: 240 seconds]
<justJanne> dustymabe: could this be related to https://bugzilla.redhat.com/show_bug.cgi?id=1841049 ?
vgoyal has quit [Quit: Leaving]
<justJanne> Yeah, the issue is one with fstab/mount units, which definitely looks like another selinux issue
<justJanne> I miss the pre-selinux days of containerlinux