<clever>
Qt even has a primitive that can kinda do this, deleteLater, but it's not multi-thread friendly
<heat>
right
<heat>
so it's useless :)
<heat>
the hard part is making it scale
<clever>
deleteLater is mainly so a class method can delete `this` without the destructor then becoming a use-after-free
<heat>
and the parts that mostly make it scale are still under patents
<heat>
or maybe they're not! it's a minefield
<heat>
who knows if you'll blow off your leg
<heat>
better step on it and see what happens
<clever>
and if i just come up with a solution on my own and implement it, i could still get sued??
<heat>
clever, you can totally delete this;
<heat>
yes
<heat>
patents baby
<heat>
well, IANAL but that's my understanding of it
<heat>
...I still don't understand the fucking point of patenting it but fuck IBM anyway
<clever>
heat: i can see 2 main parts of RCU that are costly, 1: the copying, 2: when to do the free
<heat>
yup
<clever>
what if i just put it on a timer, and free the object after 60 seconds? if any irq context is holding a reference that long, youve done something wrong
<heat>
linux has like 3 different RCU implementations
<heat>
that sounds bad
<heat>
you fire off an interrupt for every object?
<heat>
you can also get interrupts in RCU sections
<clever>
the rough view I've seen is that the rx irq on your NIC will then use RCU to parse iptables, and either block or accept the packet
<clever>
and RCU is used, so you can mutate the tables, without blocking all packet rx
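For reference, the pattern clever is describing looks roughly like this in Linux-kernel-style C (rcu_read_lock/rcu_dereference on the read side, copy + rcu_assign_pointer + synchronize_rcu on the write side). `struct ruleset` and `match_rules()` are made-up stand-ins for the real netfilter structures, and this assumes kernel context rather than being a standalone program:

```c
struct ruleset {
	int nrules;
	/* ... rule array ... */
};

static struct ruleset *active_rules;	/* published pointer */

/* rx irq / softirq path: no locks, never blocks */
static bool packet_allowed(const struct sk_buff *skb)
{
	bool ok;

	rcu_read_lock();
	ok = match_rules(rcu_dereference(active_rules), skb);
	rcu_read_unlock();
	return ok;
}

/* control path: copy, modify, publish, then free the old copy only
 * after every reader that might still see it has finished */
static int rules_update(const struct ruleset *next)
{
	struct ruleset *copy = kmemdup(next, sizeof(*next), GFP_KERNEL);
	struct ruleset *old;

	if (!copy)
		return -ENOMEM;
	old = active_rules;
	rcu_assign_pointer(active_rules, copy);
	synchronize_rcu();		/* wait out pre-existing readers */
	kfree(old);
	return 0;
}
```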
xenos1984 has joined #osdev
DanDan has quit [Ping timeout: 252 seconds]
<heat>
sure
<heat>
RCU is super abused all around linux
<heat>
your fd table is entirely made out of RCU
<clever>
ah, hadnt known that
<heat>
you can actually modify it concurrently
<mjg>
no you can't
<heat>
there's no lock around it
<mjg>
bro
<heat>
you can't?
<mjg>
where the fuck you taking this from
<heat>
source?
<clever>
heat: i assume there is a cmpxchg, so concurrent writes don't undo each other
<mjg>
shit scalability of fd allocation is a long standing sore point
<mjg>
posix requires that you hand out lowest fd number possible
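A sketch of why that requirement is awkward to scale: the lowest free fd has to be searched for, typically a first-clear-bit scan over the table's bitmap done under the table lock (hypothetical helper, not any particular kernel's code):

```c
#include <stdint.h>

/* Find the lowest free descriptor in a bitmap where a set bit means
 * "in use". Returns -1 if the table is full. */
static int find_lowest_free_fd(const uint64_t *bitmap, int nwords)
{
	for (int w = 0; w < nwords; w++) {
		uint64_t free_bits = ~bitmap[w];	/* 1 = free slot */
		if (free_bits)
			return w * 64 + __builtin_ctzll(free_bits);
	}
	return -1;
}
```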
[itchyjunk] has quit [Read error: Connection reset by peer]
<heat>
anyway, I'm going to take care of vm technical debt next
<heat>
I'm going to use a radix tree thing, not too unlike page tables, to do this
<heat>
vm_objects have been using a rb tree which is not ideal
<wxwisiasdf>
fair
<wxwisiasdf>
what would you recommend for a memory allocator btw
<heat>
what as in the allocator or an algo?
<wxwisiasdf>
algorithm
<heat>
i'm using slabs
<heat>
it works well
<wxwisiasdf>
okay
<heat>
well, it depends on what allocator you're talking about
<heat>
I have 3 allocators that work exactly like a stack :P
<wxwisiasdf>
well i just asked because when i implement memory allocators i usually just use a freelist
<wxwisiasdf>
not a freelist, just more like linked list
<heat>
plus a simple memory pool and a bootmem allocator
<wxwisiasdf>
oh
<heat>
yeah I have 5 allocators
<heat>
technically 6
<heat>
anyway it depends
<wxwisiasdf>
sounds messy to maintain ig
<wxwisiasdf>
i usually just have 2 allocators: 1 physical, 1 virtual (sometimes)
<heat>
if you want malloc you can do a slabish approach, or a buddy allocator
<heat>
there are multiple approaches
<heat>
more than these, really
<wxwisiasdf>
yeah but i mean on multicore
<wxwisiasdf>
what's a good approach to scalable multicore and smh
<wxwisiasdf>
i always have issues doing smp because i write my kernel expecting just 1 thread, and when i enable smp: boom, it all implodes
<heat>
I have a page allocator, a virtual memory allocator, a vmalloc allocator (uses vm's algorithm more or less, but conceptually different and way simpler), a bootmem allocator (to allocate page allocator structures), a page allocator (for now, simple list of pages) and a memory pool (used for simple, stupid allocations.)
<heat>
I meant slab allocator first
<wxwisiasdf>
fair
vdamewood has joined #osdev
<heat>
anyway, I think that for multithreading as far as I've seen you really need percpu/per thread caching
<heat>
it's generally The Way
<heat>
grabbing the lock = bad
<heat>
and asking a backend for memory is also bad, so avoid giving memory back
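A rough sketch of the caching front end being described, assuming a per-thread magazine in front of a locked backend; `backend_alloc_batch()` is a made-up stand-in for the slab layer, and the free-side overflow path is left out:

```c
#include <stddef.h>
#include <pthread.h>

#define CACHE_MAG 32

struct obj_cache {
	void *mag[CACHE_MAG];	/* locally cached free objects */
	int   n;
};

/* shared, locked backend (e.g. the slab layer) */
static pthread_mutex_t backend_lock = PTHREAD_MUTEX_INITIALIZER;
int backend_alloc_batch(void **out, int want);	/* stub: fills out[], returns count */

static __thread struct obj_cache cache;	/* per-thread here; per-CPU in a kernel */

static void *cached_alloc(void)
{
	if (cache.n == 0) {
		/* slow path: pay for the lock once per batch */
		pthread_mutex_lock(&backend_lock);
		cache.n = backend_alloc_batch(cache.mag, CACHE_MAG / 2);
		pthread_mutex_unlock(&backend_lock);
		if (cache.n == 0)
			return NULL;
	}
	return cache.mag[--cache.n];	/* fast path: no lock at all */
}

static void cached_free(void *p)
{
	if (cache.n < CACHE_MAG) {
		cache.mag[cache.n++] = p;	/* keep it local, avoid the backend */
		return;
	}
	/* overflow: push a batch back to the backend (not shown) */
}
```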
<heat>
if you open this on non-github you'll be able to click through it
<wxwisiasdf>
oh
<heat>
you'll see malloc and __bin_chunk(free) very easily on the graph
<heat>
and most of it is just waiting on a spin lock
<heat>
on 4 THREADS
<wxwisiasdf>
do_syscall_64 moment
<heat>
anyway "enabling SMP and it blowing up" is not because of a lack of efficiency
<heat>
you're probably just missing locks
<wxwisiasdf>
right
<heat>
an SMP kernel is just a really big really multithreaded thing
<heat>
so spinlocks, mutexes, rwlocks everywhere
<heat>
or fancier things in the late game
tarel2 has joined #osdev
<clever>
thats something i still need to investigate on the VPU
<clever>
i can turn on the 2nd core and run code there, but i have yet to enable proper SMP in LK
<clever>
i dont trust the lock primitives yet
<heat>
test em
<clever>
yeah, thats on my todo list
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<clever>
heat: from memory, there are no atomic memory access ops, only ~16 hw spinlocks; reading one will return one value for the winner, and another value for all the losers and all future reads
<clever>
a write will reset it
<clever>
any core can reset it, software rules should limit it to only the lock holder
<heat>
hash your software spinlocks on those 16 hw ones
<clever>
i think the hw spinlock also can't be indexed from a register
<clever>
its basically a cpu register
<clever>
currently, i just have one hw lock, that protects all "atomic" ops
<heat>
that's just needlessly slow
<clever>
yeah
<clever>
it works, but could be improved
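heat's suggestion might look roughly like this; vpu_hw_lock()/vpu_hw_unlock() are hypothetical names for the hardware spinlock accessors, and the hash itself is arbitrary:

```c
#include <stdint.h>

#define HW_LOCKS 16

/* hypothetical accessors for the VPU's 16 hardware spinlocks */
void vpu_hw_lock(unsigned idx);
void vpu_hw_unlock(unsigned idx);

struct spinlock { char unused; };  /* only its address matters here */

static unsigned hw_index_for(const struct spinlock *l)
{
	uintptr_t a = (uintptr_t)l;
	return (unsigned)((a >> 4) ^ (a >> 10)) % HW_LOCKS;  /* cheap hash */
}

/* Caveat: unrelated locks can hash to the same hw lock, so taking two
 * software locks at once risks self-deadlock on a collision. */
static void spin_lock(struct spinlock *l)   { vpu_hw_lock(hw_index_for(l)); }
static void spin_unlock(struct spinlock *l) { vpu_hw_unlock(hw_index_for(l)); }
```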
<heat>
woohoo linux 6.0
<heat>
it's so new kernel.org hasn't updated
<wxwisiasdf>
i am a hurr durr ninja
<wxwisiasdf>
or something it was codenamed
<wxwisiasdf>
linux giving its kernel silly names :P
<bslsk05>
github.com: Linux 3.11 · torvalds/linux@6e46645 · GitHub
<heat>
easily the best name
<klys>
I've been setting up an Epyc 7453 system that takes 3-4 min. to boot. It's been giving me trouble with gpu passthru. nouveau.ko made the screen light up, that's about all I have so far on this gtx 960 card. passthrough works on another box though. using debian.
<heat>
don't use nouveau.ko and try again
<klys>
I will try again another time, as it does take over three minutes to try again.
<heat>
servers do be like that
<heat>
kexec?
<klys>
there's a kexec binary somewhere, and it's like kexec /boot/vmlinuz-4.16.18 root=/dev/sda5 rw blah blah; ?
<klys>
I was using it with my w7pro in virt-manager, and device manager was like "there was a problem with this device. no resources are allocated because there was a problem."
tarel2 has quit [Ping timeout: 252 seconds]
<klys>
one thing I should double check is that it is really using ovmf
<klys>
because I don't see the tianocore logo
<heat>
then it's not
<klys>
seems reasonable, thanks I'll keep that in mind.
<klys>
meanwhile I have the Epyc 7453 box built now, with mobo, transitional 700W power, 4u case (yes the fan is too big), 56 threads, and 256GB A-grade samsung ram with heat spreaders.
<geist>
server stuff can spend a lot of time training the dram on every boot
<geist>
the arm server box i have takes 2-3 minutes to retrain the 16 dimms or so, but at least it dumps a whole log over serial so you can watch it
<geist>
it gives like 10 seconds per dimm, and there are 16 of them, so it kinda makes sense
heat has quit [Ping timeout: 268 seconds]
<mrvn>
Why do some boards do that and others don't? Why isn't the result saved in NVram?
<clever>
mrvn: the rpi4 is one case where it doesnt save it to ram
<clever>
it adds a noticable amount to the boot time
<mrvn>
On a SOC I can kind of see the penny saving attitude. But a server? Where uptime matters?
<clever>
dont reboot your server :P
<clever>
uptime matters!
<mrvn>
Sorry, we can't give you 99.9% uptime, the ram retraining takes too long.
<clever>
ive also heard claims that the training depends on temps
<clever>
if the ram/soc are hot, it needs different training
<clever>
and its constantly fine-tuning it while running
<mrvn>
But then it should retrain at runtime, after the ram has heated up.
<clever>
but if you do a reboot, the ram is already hot
<clever>
and the timings in nvram are wrong
<mrvn>
if you do a reset then it should just keep the settings, doesn't even need NVram. Just have a few bits in the DIMM that say: I've been trained.
<mrvn>
But all this still doesn't answer the original question: Why do some boards take 10s per DIMM to train?
<clever>
yeah, good question
<clever>
and the lack of source makes it near impossible to answer
<klys>
likely because I have 32gb per dimm
<klys>
which all has to squeeze thru a bus
<mrvn>
klys: No. Other boards with the same amount of ram don't take so long.
<klys>
it's ecc if that matters
<mrvn>
I sure hope nobody runs a server without ECC
<klys>
CL22 column access strobe
frkzoid has quit [Read error: Connection reset by peer]
archenoth has joined #osdev
[itchyjunk] has joined #osdev
<geist>
mrvn: so i think a few reasons: one is it *does* take longer on a new/clean boot. in my case it takes over 5 minutes until it's done a training once, so there's some sort of deeper thing there
<geist>
and i think the idea is lots of traces across the board with larger sets of dimms requires more training
<geist>
vs say 2 channels, maybe 4 dimms max
<mxshift>
If you get info about DIMM training over serial, someone left debug turned on which makes it take considerably longer
<mxshift>
Training data is usually cached in the main flash so sometimes naive firmware updates erase it
<geist>
yah like i said there's two levels of training on this machine. one of which takes over 5 minutes, the faster one being about 10 seconds per dimm
<mxshift>
Training data is also invalidated if the DIMM config changes at all. Even shuffling DIMMs will invalidate since the SI channel being measured is the SoC+socket+motherboard+DIMMs
<mxshift>
Normal training is just trying to find an operating point. Debug will often do a 2D sweep of voltage and timing adjustments to give a report of how much margin there is around that operating point
* geist
nods
<geist>
yah the dump out of this machine shows basically that
<mxshift>
I haven't seen training data be invalidated due to temperature. It's plausible. I have seen it be invalidated after some number of boots on the assumption that the channel may age
<mxshift>
I spent way too much time looking at EPYC 7xx3 training during our early bring-up
<mxshift>
You can enable debug or MBIST training on any EPYC 7xx3 system by setting a few APCB tokens to tell the PSP what you want
<mxshift>
Training is usually pretty quick but zeroing ECC DIMMs scales with amount of DRAM installed
[itchyjunk] has quit [Read error: Connection reset by peer]
genpaku has quit [Read error: Connection reset by peer]
genpaku has joined #osdev
k8yun__ has joined #osdev
k8yun_ has quit [Ping timeout: 252 seconds]
k8yun__ has quit [Ping timeout: 268 seconds]
<geist>
oh that's a fair point. does all ECC by definition need to be initialized before you can use it
<geist>
i guess so, right? or some sort of bit that says this is uninitialized
<Mutabah>
At a guess, the first read would return a ECC error if it's not written before
<geist>
yeah, never thought about that but i guess that's true. an ECC machine would probably need to take at least a few more seconds to boot to zero everything out
<Mutabah>
Note: I'm just guessing at how ECC works, but it seems like the easiest way
scoobydoo_ has joined #osdev
scoobydoo has quit [Ping timeout: 252 seconds]
scoobydoo_ is now known as scoobydoo
wxwisiasdf has quit [Ping timeout: 252 seconds]
tarel2 has joined #osdev
heat has joined #osdev
<heat>
morning
<heat>
i was reading that old ass linux 2.4 buddy allocator doc and I realized how they avoid always coalescing and splitting buddies
<heat>
they have a percpu list of pages (order 0) which they allocate from
<heat>
and do exactly the batch thing that you would do in slab allocation
<heat>
this is very interesting and may be worth considering
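A sketch of that scheme with assumed names (not the actual 2.4 code): each CPU keeps a small list of order-0 pages and only calls into the buddy allocator, with its lock and its splitting/coalescing, once per batch:

```c
#include <stddef.h>

#define PCP_BATCH 16

struct page { struct page *next; };

struct pcp_pages {
	struct page *list;	/* order-0 pages cached for this CPU */
	int count;
};

/* stand-in for the real buddy allocator entry point */
struct page *buddy_alloc_order0(void);

static struct page *pcp_alloc(struct pcp_pages *pcp)
{
	if (!pcp->count) {
		/* refill in one batch so the buddy lock and coalescing cost
		 * is paid once per PCP_BATCH allocations, not per page */
		for (int i = 0; i < PCP_BATCH; i++) {
			struct page *p = buddy_alloc_order0();
			if (!p)
				break;
			p->next = pcp->list;
			pcp->list = p;
			pcp->count++;
		}
		if (!pcp->count)
			return NULL;
	}
	struct page *p = pcp->list;
	pcp->list = p->next;
	pcp->count--;
	return p;
}
```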
ThinkT510 has quit [Ping timeout: 246 seconds]
<heat>
btw re memory training, from what I've gathered the new DDR5 machines take a stupid amount of time to train memory
<heat>
like 5 minutes
<heat>
for fucking laptops
<Griwes>
finally, bringing the server experience to the masses
k4m1 has quit [Quit: Lost terminal]
k4m1 has joined #osdev
<kazinsal>
me: "I can't wait to build a new machine to replace my old SB Xeon box that takes 10 minutes to post after a power failure"
<mrvn>
"makes lots of pages dirty"? Firefox has a few gig of ram that's all dirty. The binary is a few MB, of which maybe 100-200 pages are dirtied due to relocations. Is this worth it tp fix?
tarel2 has joined #osdev
<heat>
the point is that it's slow
<heat>
also wasteful, yes
<mrvn>
heat: Relocations is not why firefox is slow to start
<heat>
ok
<heat>
but firefox is an extreme example
zaquest has quit [Remote host closed the connection]
<mrvn>
Also consider this: If you use a good fraction of the functionality then all relocation pages will be used. So you aren't saving time but paying for it at runtime instead of once at load. And with 2 context switches per page unless they do this fully in kernel.
<heat>
Onyx's PIE compiled bash has 2543 relocations
<mrvn>
heat: how many plt pages is that?
zaquest has joined #osdev
<heat>
it's not just plt and got but .data too for instance
<mrvn>
much harder to count in data though.
<heat>
right. the only way would be to stop in the middle of linking and check the dirty pages
<heat>
my /usr/lib/libstdc++.so has _5971_
<mrvn>
and you can fix up a million a second?
<heat>
all of which are relocated at startup
<heat>
can you?
<mrvn>
Not counting the load time a million a second sounds reasonable.
<heat>
touch -> fault in 4KB -> write, but mostly randomish access
<heat>
the load time is the biggie
<mrvn>
debatable. The load happens when you use the program anyway.
<heat>
the point is that it takes time
<mrvn>
heat: aren't relocations sorted linear?
<heat>
no
<mrvn>
well, sort them. That should speed things up when you fix pages sequentially.
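For context, the per-entry work being argued about is essentially this loop, sketched here for R_X86_64_RELATIVE entries of a PIE loaded at `base` (not any particular dynamic linker's code); each store dirties the page it lands on, and sorted entries walk those pages sequentially:

```c
#include <elf.h>
#include <stdint.h>
#include <stddef.h>

static void apply_relative_relocs(uintptr_t base,
                                  const Elf64_Rela *rela, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (ELF64_R_TYPE(rela[i].r_info) != R_X86_64_RELATIVE)
			continue;
		uint64_t *where = (uint64_t *)(base + rela[i].r_offset);
		*where = base + rela[i].r_addend;	/* dirties the page it lands on */
	}
}
```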
<heat>
my clang has 361574 lmao
<mrvn>
For me the interesting number would be how many of those are accessed at runtime and not just for the fixup.
<froggey>
how many unique pages need relocations? if you have 1000 relocations and they're all on the same page, you don't win anything but maybe you do if they're scattered over 1000 pages
<mrvn>
The one thing interesting in the proposal is that the kernel can throw away pages and relocate them again when it later pages in the page again.
<heat>
those are both interesting questions I can't answer
<mrvn>
froggey: For library calls the compiler uses a trampoline and puts all those together so they take very few pages.
<mrvn>
Oh, and those trampolines do the fixup on first use.
<heat>
not true
<heat>
in fact doing fixups on first use isn't a great idea
<heat>
lazy binding is error prone
<heat>
let me call this fun... process just crashed
<mrvn>
heat: why would it crash?
<heat>
out of memory, symbol not found, etc
<mrvn>
I believe the symbols are checked at load, just the address calculation happens later.
<froggey>
iirc hardened binaries that use relro also do relocations up-front
<bslsk05>
wiki.musl-libc.org: musl libc - Functional differences from glibc
<mrvn>
Bit of a contradiction in the proposal: You want to make it faster but then you page-out data and have to relocate it again on page-in. So you will be doing relocations over and over in low mem situations.
tarel221 has joined #osdev
<mrvn>
I wonder if Apple benchmarked the old way, their new idea and using a global base address register.
<mrvn>
For statically linked PIE the latter should solve all problems.
<heat>
that's the worst idea
<heat>
you take away a register completely
tarel2 has quit [Ping timeout: 252 seconds]
<mrvn>
heat: you have 32, a register is cheap :)
<heat>
no you don't
<heat>
you have 15
<mrvn>
heat: ARM had 15, AArch64 has 31. :)
<mrvn>
Even with 15 testing the cost of one reg vs. all the relocations is worth it.
<heat>
x86-64 has 15
<heat>
I don't see how that can be true
<mrvn>
you think measuring the effect is pointless?
<heat>
I would prefer paying a memory cost over taking away a precious register (and breaking the ABI in the meanwhile)
<heat>
I guess you could feasibly use x18 in arm64 to do that kind of fuckery
<heat>
but is it free? probably not
<mrvn>
You are designing a new dynamic linker here. Total ABI breakage.
<heat>
having a relocation aware kernel would not break any ABI
<mrvn>
heat: they changed the relocations data too
MiningMarsh has quit [Read error: Connection reset by peer]
MiningMarsh has joined #osdev
<heat>
they who? apple?
nyah has joined #osdev
<mrvn>
heat: yes
<heat>
you could very much implement this either in the kernel or in a dynamic linker without changing any part of the ABI
terminalpusher has quit [Remote host closed the connection]
terminalpusher has joined #osdev
terminalpusher has quit [Remote host closed the connection]
<heat>
actually looking at actual numbers this seems a bit useless
<mrvn>
heat: 20ms for 390'000 relocations. See what I mean?
<mrvn>
And 6.5MB of memory. Thats less than 0.1% of what chrome uses.
<heat>
it probably depends on the storage medium
<heat>
20ms is not a lot for the hugest program out there
<mrvn>
I assume they didn't measure a cold start from rotating disks.
<mrvn>
8ms seek time kills you
<heat>
i would be kind of concerned for clang though, when compiling .c files or so
<heat>
something fast
<heat>
20ms could be around 10% of the compile time
<mrvn>
I wish my compiles would take 200ms.
<heat>
dont write C++ lmao
<mrvn>
bingo
<heat>
the C parts of my build are stupidly fast
<heat>
the slowest parts usually involve C++ and some header-only library
<mrvn>
totall killer
<heat>
gtest tests are sloooooooow
<heat>
nlohmann is also a POC
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<heat>
my own C++ build takes longer, and even that is a super restrained, C-with-classes-like idiom
<heat>
the cf workers runtime had files which took 40s to compile on top-of-the-line CPUs
<heat>
that whole codebase reeks of people who are not kernel developers :v
<mrvn>
heat: there is another thing to consider. Every time you load a lib you have to relocate it to somewhere more or less random. Wouldn't it make sense to pick some place in the address space and put the library there every time? Then you could share the relocation pages and only have to relocate once.
<heat>
no
<heat>
if someone accidentally leaks an address you have just defeated ASLR for that library, for the whole boot, whole system
<heat>
hell, you could craft your own program that does that, and then pwn sudo
<mrvn>
assuming you have a bug to exploit
<heat>
well, yes
<heat>
ASLR is defense in depth
<heat>
which is why it's important
<heat>
that thing just leads to shitty ASLR (@windows)
<mrvn>
Then you have to pay for it with the extra memory. :)
<heat>
might as well not be there if anyone can leak it
<heat>
yeah
<heat>
it's cheap though
<mrvn>
.oO(my words)
<mrvn>
We could bring back segment registers and give every library a segment. Jipey, no more relocations.
<heat>
i thought of that when you mentioned reserving a register
<heat>
but it doesn't work, you pay for segment addressing
<heat>
it's like an extra cycle
<heat>
(but it would totally work, you have a fully unused %gs in x86_64 userspace)
<mrvn>
Only because they didn't optimize it. More important is probably the prefix byte on the opcode
<heat>
i would think they have optimized it
<heat>
%fs and %gs addressing is heavily used for TLS/percpu
<mrvn>
only once to load the base address into a regular reg
<heat>
no
<mrvn>
once in a while
<heat>
mov %fs:variable, %reg
<heat>
that's how you do things
<mrvn>
heat: you can do that. If you have more variables you load %fs:0 and then use that as base
<heat>
here's some funny codegen for my percpu inline asm
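For illustration, per-CPU access through a segment register is a single instruction; a minimal x86-64 sketch with a made-up fixed offset (real kernels resolve the per-variable offset at link time):

```c
#include <stdint.h>

/* hypothetical layout: the per-CPU area base lives in %gs, and the
 * counter sits at byte offset 0x10 inside it */
#define PCPU_COUNTER_OFF 0x10

static inline uint64_t pcpu_read_counter(void)
{
	uint64_t v;
	/* one mov, no extra register holding the per-CPU base */
	__asm__ volatile("movq %%gs:%c1, %0"
	                 : "=r"(v)
	                 : "i"(PCPU_COUNTER_OFF));
	return v;
}
```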
<geist>
heat i know has taken the plunge and hacked on some non x86 arches, and has seen that the grass is slightly greener on other sides. or at least a more pleasant shade
<geist>
haha
gog has quit [Ping timeout: 268 seconds]
<mats1>
lol
<Griwes>
heat, "I'd rather use zfs" doesn't tell us much, because zfs is delightful :P
<geist>
honestly i'm probably the one person that doesn't belong to the zfs fan club
<geist>
for no particular reason, to be honest, but it seems like it gets just a Little Too Much love
<geist>
kinda like plan9 or something, where it feels a lot like folks just repeat 'omg that is awesome'
<geist>
Griwes: and in no way am i implying that's what you are doing here, just getting it off my chest
<mjg>
zfs is pretty great if you don't look too closely at it
Iris_Persephone has joined #osdev
<Griwes>
Well, I've been running zfs as rootfs on every non trivially rebuildable system I have for years now, and I couldn't be happier
<mjg>
but i admit i don't know how it really compares to btrfs
<Iris_Persephone>
heya guys!
<Griwes>
So at least I put my money where my mouth is :p
<mjg>
crucial selling point for me is that there is no partitioning in a way which limits space
<geist>
yeah i just get a little suspicious of stuff that is universally loved, etc. the more love something gets the more suspicious i am that it's just groupthink
<mjg>
as in if i want to create a separate dataset at /lolo, i can just do it
<mjg>
don't have to carve out any space for it
<Griwes>
geist: yeah that's fair
<mjg>
that turned to shitting on linux real quick
<mjg>
:S
<Iris_Persephone>
My Linux From Scratch didn't end up working out, so I thought I would follow Bare Bones and see where that takes me
<geist>
well, it's educational if nothing else
demindiro has joined #osdev
<geist>
honestly LFS was probably a lot simpler and useful about 10-15 years ago when a basic system was init and a shell, and maybe one or two daemons
<geist>
i *assume* it's moved on to more complex setups now
<zid>
I'd only LFS from stripped down packages built on a gentoo host
<mjg>
what's the point of lfs?
<zid>
no idea
<zid>
edutainment for getting it working I think
<geist>
educational if nothing else. i think i learned a few things the first time i fiddled with it
<mjg>
ye ok
<geist>
exactly
<mjg>
that used to be gentoo back in my day
<demindiro>
Also smaller images I guess
<geist>
you end up with something roughly BSD shaped, or gentoo stage 1/2
<geist>
yah i loved the days back when you actually built the first few stages of gentoo
<zid>
I should scour ebay for rams, speaking of building packages
<geist>
gave you the same dopamine hit as LFS
<zid>
I have some paypal dollars to burn but I refuse to overpay for this old ass ram I want
pretty_dumm_guy has quit [Quit: WeeChat 3.5]
<Iris_Persephone>
I spent... god, it must have been weeks at this point
<Iris_Persephone>
Eventually I decided tracking down all the errors wasn't worth my time
<geist>
yah if it ceases to be edutainment, move on
<geist>
you alas dont get an Achievement badge
<zid>
That's why I never played that game, no achievement tracking
<Iris_Persephone>
Gave me a new appreciation for Unix-likes, though!
<zid>
okay I checked all of ebay.. maybe I should figure out how alerts work
<geist>
oh no. that way leads to madness
<Iris_Persephone>
What are you looking for?
<zid>
2x8GB 1866MHz UDIMMs
<Iris_Persephone>
ah
<zid>
or alternatively, 4x8GB 1600/1866Mhz URDIMMs
<gog>
yes, well-defined interfaces are bad. let's slap everything inside ioctls and make it look like a file
darkstardevx has joined #osdev
<Iris_Persephone>
To be perfectly honest
<heat>
the worse parts are when they go half write/read and then give up and ioctl the shit out of it
<Iris_Persephone>
half of this is just pure spite at Voicemeeter and AutoHotKey
<heat>
THANKFULLY they now understand they can just return a fd from syscalls that are not named open
<heat>
which is why pidfd, etc are decent interfaces
<gog>
yeah the excuse of "it keeps a generic interface for I/O!!" like
<gog>
no
<gog>
it fucken doesn't
<gog>
if you have to use ioctl at all you've negated the benefit
<heat>
ioctls are not bad
<gog>
i suppose not
<gog>
but you still have to use a file descriptor with them
<Iris_Persephone>
I would give my firstborn child to be able to pipe my audio/video/whathaveyou to and from a program as easily as using "tee"
<gog>
so the big abstraction is still there for the benefit of the little one
<heat>
char buffer[200]; sprintf(buffer, "/proc/%d/comm", pid); open(buffer) is horrible
<heat>
dog shit
<heat>
it's also like half the linux libc
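A small sketch of the fd-returning style heat prefers, using pidfd_open (Linux 5.3+); it goes through syscall() on the assumption that the libc wrapper may be missing:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	pid_t pid = getppid();	/* any pid works for the demo */

	/* a handle to the process as an fd, instead of sprintf()ing /proc paths */
	int pidfd = syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	/* the fd becomes readable (POLLIN) once the process exits */
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	poll(&pfd, 1, 0);
	printf("pid %d %s\n", (int)pid,
	       (pfd.revents & POLLIN) ? "has exited" : "is still running");

	close(pidfd);
	return 0;
}
```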
<heat>
Iris_Persephone, not doable
<heat>
audio and video is low latency stuff
<Iris_Persephone>
Yeah, that figures
<heat>
video is bandwidth heavy and needs acceleration
<heat>
it turns out the whole UNIX philosophy thing of making small, composable programs with pipes and whatever the fuck isn't a great idea
<heat>
because it's slow and a poor fit
Celelibi has quit [Ping timeout: 260 seconds]
<heat>
so people just added options to already existing programs
<heat>
like cat -n
<Iris_Persephone>
So, the slowness is *inherent* to the concept of pipes?
<heat>
yes
<demindiro>
no
<heat>
lots of copying and IPC
<heat>
yes
<heat>
absolutely yes
<heat>
you're not getting anything performant out of pipes
<demindiro>
How performant does it need to be?
<demindiro>
Piping works plenty well for lots of stuff
<heat>
on real services? it totally does
<heat>
I've seen people suggest https as httpd | tlsd or whatever
<heat>
ridiculous shit
<CompanionCube>
does anyone really care about st_blocks tho?
<heat>
I do
<heat>
fsync() should make sure metadata is written back
<heat>
also kenton too
<heat>
he has a file hole test which needs it
<Iris_Persephone>
well shit, there goes that idea
<heat>
do things -> fdatasync -> check st_blocks -> expect st_blocks to be valid
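A bare-bones version of that check might look like the following (error handling omitted; what counts as a "valid" st_blocks after fdatasync is exactly what is being argued about):

```c
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[65536];
	memset(buf, 'x', sizeof(buf));

	int fd = open("stblocks-test.tmp", O_CREAT | O_TRUNC | O_RDWR, 0644);
	write(fd, buf, sizeof(buf));
	fdatasync(fd);

	struct stat st;
	fstat(fd, &st);
	/* st_blocks is in 512-byte units; 64 KiB of just-synced data should
	 * show roughly 128 blocks if metadata is kept consistent */
	printf("st_size=%lld st_blocks=%lld\n",
	       (long long)st.st_size, (long long)st.st_blocks);

	close(fd);
	unlink("stblocks-test.tmp");
	return 0;
}
```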
<demindiro>
iris_persephone: if redundant copying is a concern, consider using shared buffers and passing small messages between processes instead.
<heat>
demindiro, and your unix philosophy goes out the door
<demindiro>
No
<heat>
yes
<demindiro>
Shared buffers are very generic
<heat>
it all revolves around pipes
<demindiro>
Message can just be "this offset + length go figure"
<heat>
unix revolves around simple utilities that are easily composable using |
<heat>
on the shell
<demindiro>
And you can do the same with shared buffers
<heat>
it worked great in the 70s
<demindiro>
Just involves a little more setup
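A minimal sketch of the shared-buffer idea using a Linux memfd; here the mapping is simply inherited across fork() and the "small message" is reduced to process exit, where a real design would pass the fd plus offsets over a socket:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	size_t len = 1 << 20;
	int fd = memfd_create("shared-buf", MFD_CLOEXEC);	/* glibc 2.27+ */
	ftruncate(fd, len);
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (fork() == 0) {	/* producer */
		strcpy(buf, "bulk payload lives here, no copy through a pipe");
		_exit(0);
	}
	wait(NULL);		/* consumer */
	printf("consumer sees: %s\n", buf);
	return 0;
}
```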
<heat>
again, unix philosophy goes out the door
<kof123>
eh, if anyone cares: somewhere in the unix interviews there is someone who had a fancier pipe idea that was deemed too complex, would look at that, not very helpful i know :D
<kof123>
so i would lean towards heat: it wasn't original, just based on that
<kof123>
*in the original
<CompanionCube>
does btrfs have less weird semantics
<heat>
idk
<heat>
hope so
<kof123>
i mean, it might be a horrible idea, just if someone was looking into pipes
<kof123>
maybe it was left out for very good reasons
<heat>
"This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."
<kof123>
misspoke, you said everything is a file was not the original, my bad
<heat>
take your data, serialize it to text (but not json because that isn't streamy), send it over a pipe, receive it on the other program, deserialize it, ...
Celelibi has joined #osdev
<heat>
according to this your C compiler would be a 200 line piped up invocation|
<heat>
!
<demindiro>
I guess I don't take the UNIX philosophy as literally, you can do e.g. lz4 blah.txt | pv | lz4 -d
<demindiro>
and lz4 outputs binary data
<Iris_Persephone>
I suppose I don't want to *strictly* follow the Unix philosophy, I just find the concepts useful
smach has joined #osdev
* kof123
prepares to throw gasoline on the fire
<kof123>
piping audio and video...wouldnt that just be remote importing /dev/xyz on remote machines?
<gog>
nothing is a file
* kof123
runs
<gog>
that's my philosophy
<heat>
kof123, very plan9 of you
<kof123>
yes :D
<kof123>
"simpsons did it!" "plan9 did it!" lol
<CompanionCube>
heat: iirc the thing with the metadata is that it doesn't know the on-disk size until the TXG commits. fsync only writes back to the on-disk ZIL.
<heat>
that sounds like fsync and sync are broken
<CompanionCube>
no it's just write-ahead logging as you would find in a database
<CompanionCube>
you can make fsync properly broken by setting sync=disabled
<heat>
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device
<heat>
(or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.
<heat>
after fsync on zfs your st_blocks metadata isn't consistent at all with how things look or will look
<heat>
fsync should be a proper barrier
<heat>
what's fsync for if you can't guarantee consistency on metadata, the inode, and the disk itself?
<heat>
the problem used to be the lack of guarantees on synchronization for other API functions
<bslsk05>
lists.gnu.org: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in
<heat>
now these fancy filesystems can't even guarantee fsync synchronizes?
<heat>
good, fuck it
<gog>
yes
<gog>
move beyond the need for files
<heat>
"until the data are synced"
<heat>
at least it works on fsync
<CompanionCube>
single-level storage when
<heat>
(hopefully)
<Griwes>
is on-disk ZIL lost on reboot?
<Iris_Persephone>
Stupid newbie question: What is an filesystem object, if not a file?
<demindiro>
generic data blob, I suppose
<CompanionCube>
Griwes: no, the one lost on reboot is the in-memory ZIL.
<Griwes>
if no (and I don't think it is), then this behavior fulfills the behavior of fsync you quoted vOv
<Griwes>
so everything is fine then
<heat>
it's not
<clever>
Iris_Persephone: for zfs, there are a lot of fs metadata objects, and block devices (zvol's) are also one big object
<heat>
metadata isn't synced
<clever>
but files are also objects
<heat>
data isn't even fucking sync'd
<CompanionCube>
async writes go to the in-memory ZIL but not the on-disk one
<heat>
how can you call *sync* and have it not be sync
<Griwes>
> so that all changed information can be retrieved even if the system crashes or is rebooted
<heat>
committing something to zfs's magic super duper table of awesome things is not guaranteeing synchronization
<clever>
heat: with zfs, i believe sync will write it to the on-disk ZIL (journal), so it can return faster
<Griwes>
this seems to be the core part of fsync by the quoted definition
<heat>
and that's the fucking problem clever
<heat>
it's sync, not a race
<clever>
heat: but i think the ZIL is basically just a record of VFS calls, write/delete/symlink
<Griwes>
so you are arguing the definition of fsync you quoted is wrong
<heat>
I am
<heat>
fsync and sync and fdatasync need to be *properly done*
<heat>
not POSIX's-vague-definition done
<Griwes>
okay so you're saying the unix (or at least linux) definition of fsync is bad
<heat>
yes!
<Griwes>
stop blaming zfs for that then!
<demindiro>
IIRC macOS even just ignores fsync for the sake of performance
<demindiro>
Which apparently is valid
<heat>
and you'll know that filesystem fuck around with the semantics of fsync
<clever>
demindiro: i have seen macos bugs, where a file that was written via mmap() and not sync'd, reports having holes when you use the seek-hole function
<heat>
this is a contentious subject
<heat>
please, make it right.
<clever>
demindiro: cp then just skips the first 8mb of the file, and doesnt even bother copying it
<Griwes>
I don't have a full opinion on whether the fsync definition is sufficient or not, and I can see merit in an argument that it is wrong
smach has quit [Remote host closed the connection]
<Griwes>
but that doesn't make it the FS' fault
<mjg>
macos and performance?
<heat>
it is
<clever>
demindiro: and by pure chance, i then saw nearly the identical bug on linux+zfs a week later, lol
<heat>
they implement fsync!!!
<Griwes>
would you prefer them to *not* implement fsync?
<heat>
better than pretending it does anything remotely useful
<mjg>
if lolo fsync provides no guarantees i would prefer the software to know
<Griwes>
but it does what it says on the man page lol
<kof123>
i was looking at raidframe to steal ideas... it is way too much for me, but a couple gems in the source: @echo "RAIDFRAME R00LZ D00D" >> /dev/null Set this to 1 if you are the CMU Parallel Data Lab. You probably aren't.
<heat>
it's like ext4's fsync just writing journaling data and fucking off
<Griwes>
yeah? it's implementing a unix FS protocol lol
<Griwes>
blame unix or whoever the heck first defined fsync like this
<CompanionCube>
tbf at least opposing ext4 journaling fsync is being consistent
<heat>
do you think that guarantees consistency?
<Griwes>
again: I see merit in your argument but you are misplacing the blame for the fault in the API that you perceive
<heat>
I expect sane filesystem semantics
<gog>
why
<Griwes>
I expect a driver to implement the API it is asked to implement
<heat>
if they can't provide it, fix it. I called fsync, I want fsync, I want consistency
<Griwes>
and if that API is shit, it's not the driver's fault
<heat>
there already is consistency
<heat>
(ish)
smach has joined #osdev
<heat>
ext4 delays block allocation but st_blocks is always correct
<Griwes>
you called fsync, but you wanted something else than what the manpage for fsync says will happen
<heat>
now, I understand zfs's set of constraints is totally different
<clever>
heat: zfs compresses, so the on-disk size can vary
<heat>
so, make it work on fsync
<clever>
CompanionCube: is the ZIL before or after compression?
<heat>
^^^^
<Griwes>
expecting more than what the manpage says is on you, not on the driver
<heat>
Griwes, the manpage is irrelevant here
<heat>
the manpage is written based on the code and behavior of fsync
<CompanionCube>
clever: that's actually a good question, more so how that interacts with immediate writes which are also i thing i forgot to mention
<Griwes>
heat, okay. what does POSIX say?
<heat>
posix is also irrelevant
<heat>
sorry
<Griwes>
no
<heat>
yes
<heat>
absolutely yes
<Griwes>
because POSIX is the spec that they are implementing
<pitust>
its not for xnu
<heat>
you can't write a whole OS looking only at POSIX
<heat>
The sync() function shall cause all information in memory that updates file systems to be scheduled for writing out to all file systems.
<heat>
The writing, although scheduled, is not necessarily complete upon return from sync().
<heat>
???????????????????????????
<Griwes>
this is time for us to agree to disagree because you have a fundamentally different PoV on what you are supposed to do when implementing an interface with a formal spec than I have and we will not agree on this
<heat>
the kernel doesn't need to "just implement POSIX"
<heat>
it never has, that was never the purpose
<heat>
POSIX was defined by the kernels
<heat>
still is
* Iris_Persephone
pokes her head into the chat
<Iris_Persephone>
is it over yet?
<Griwes>
POSIX is defined by a spec created by the austin group and then ratified by multiple international standards orgs
<demindiro>
Be careful not to get decapitated
<Griwes>
but let's agree to disagree
<heat>
if you don't provide a useful fsync() then you're doing a bad job
<heat>
Griwes, spec that is built by... multiple operating systems in order to form a consensus
<Griwes>
if you expect fsync to do more than is documented as its user, you're doing a bad job
<heat>
there have been huge debates in linux about it
<bslsk05>
utcc.utoronto.ca: Chris's Wiki :: blog/solaris/ZFSZILSafeDirectWrites
<CompanionCube>
clever: and the backlinks are good too
<clever>
basically, the in-memory maps are updated, so the live system is aware of it
<heat>
Griwes, to know how much disk I've used, file holes, etc
<clever>
and during recovery, you read the ZIL and rebuild that in-memory state
<Griwes>
"sync" being "the data is on the permanent storage device so it doesn't get lost" seems like a perfectly fine definition to me
xenos1984 has quit [Read error: Connection reset by peer]
<clever>
CompanionCube: and from what ive see, the ZIL is entirely optional, assuming there is nothing to recover, you could just do every write directly to the pool itself
<Griwes>
heat, why do you need to know _exactly_ how many blocks you've used?
<heat>
Griwes, the metadata doesn't even add up!
<heat>
<heat> Griwes, to know how much disk I've used, file holes, etc
<clever>
the only purpose of the ZIL is to make sync writes finish faster
<heat>
of fucking course
<Griwes>
the first part of your answer is why I'm asking the question
<heat>
it's a performance thing!
<Griwes>
lol
<Griwes>
anyway I'm out
<heat>
it's not a "Lets just do the bare minimum for posix"
<Iris_Persephone>
If you don't like it, just implement it differently in your OS...?
<Iris_Persephone>
Is that not practical?
<heat>
no
<heat>
I'm not realistically daily driving my OS
<CompanionCube>
clever: mhm
<heat>
if you don't fight for change you just get stupid behavior all the way
<Griwes>
(we're at the point of disagreeing what constitutes "stupid behavior")
freakazoid332 has quit [Read error: Connection reset by peer]
<clever>
CompanionCube: i also have theories on how a bootloader can cheat zfs writes, what if i just append an entry to the ZIL?
smach has quit [Ping timeout: 265 seconds]
<CompanionCube>
clever: do you really need to cheat zfs writes when there's enough reserved space to be used
<mrvn>
heat: The number of blocks something uses becomes basically meaningless when you have snapshots (is the space = total / snapshots using the block?) and cow (a freshly edited file can take 3 times as many blocks and go down to 1x as things are flushed).
<clever>
CompanionCube: then i need to update the spacemaps, and all of the indirect blocks in several objects
<demindiro>
ZIL is a separate device no?
<CompanionCube>
demindiro: not per se
<clever>
demindiro: it can either be in the main pool, or its own device
<CompanionCube>
the term 'SLOG' refers to the latter, but ZIL applies to both as well as the in-memory one
<clever>
demindiro: under normal conditions, the ZIL is write-only
<clever>
to make a write() or sync() finish asap, it syncs the data to the ZIL on a disk, but also keeps a copy in ram
<clever>
and then it can update the proper pool at a later time, from the copy in ram
<mrvn>
heat: there is also fdatasync(). Sounds like ZIL makes fsync behave like fdatasync
<heat>
yes
<heat>
the difference between fsync and fdatasync is that metadata is supposed to be written and consistent
<mrvn>
BUT: "Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk."
<heat>
yes, that makes sense
<clever>
from memory, a normal sync in CLI did mask this bug
<mrvn>
You could argue that the block counts etc is part of the directory?
<heat>
could I?
<mrvn>
.oO(Well, you can argue anything)
<mrvn>
The manpage refers to inode(7) for what metadata is flushed by fsync() but I don't know how normative that is.
<clever>
mrvn: for zfs, the blocks are part of the dnode (similar to an inode), and a directory is just a name->dnode-number mapping
<clever>
mrvn: but the ZIL lets you ensure data is secure on disk, without updating the dnode
<mrvn>
clever: Which fits with "transfers all modified in-core data of the file referred to by the file descriptor fd to the disk device"
<clever>
mrvn: the dnode table is also an object in zfs, with its own indirect block tree, and a dnode at its root
<mrvn>
clever: except it's not "all"
<heat>
st_blocks is part of metadata per inode(7) if we want to go down that route
<heat>
exactly.
<mrvn>
clever: heat's argument is that "st_blocks" is not written to disk, nor even reported correctly from ram
<heat>
yup
<mrvn>
heat: but what should st_blocks be? I have a file that takes up 4 blocks so st_blocks is 4. Now I overwrite 1 block so with COW the count goes up to 5? 2 flushes later the old block is removed and the count goes back to 4?
<mrvn>
heat: or with journal and compression: I write a block, the count goes up to 5 to account for the journal block. Then compression takes it down to 2 blocks total and it drops to 2?
<heat>
i don't know
<heat>
that's up for the fs to decide
<heat>
not like I've thought much about that particular question anyway
<mrvn>
heat: well, you seem to disagree with the decision zfs made.
<heat>
no
<heat>
zfs has a decision regarding st_blocks
immibis_ has quit [Ping timeout: 264 seconds]
<heat>
it's pretty clear
<heat>
you do get st_blocks values after everything is flushed per sleep 10
<heat>
the problem is that it's nowhere near consistent
<heat>
and fsync, sync, fdatasync, which should serve as barriers for in-memory and on-disk consistency, don't
<mrvn>
But that's the problem with journaling, COW and compression. The block count changes over time.
Iris_Persephone has quit [Ping timeout: 268 seconds]
<heat>
sure
<mrvn>
fsync() is more about the data being retrievable.
<mrvn>
it wasn't designed with metadata that changes on its own over time.
<heat>
I would say it's about consistency
xenos1984 has joined #osdev
<heat>
"the only purpose of the ZIL is to make sync writes finish faster" <-- having a structure to make syncs faster isn't a particularly good goal
<heat>
if it jeopardizes the consistency
<clever>
heat: all writes to an fs are also serialized within the ZIL
<clever>
because the ZIL is a singly linked list, and each write appends to it
<heat>
sure
<mrvn>
heat: does it violate "so that all changed information can be retrieved even if the system crashes or is rebooted."?
<heat>
I'm not saying your fancy ass filesystem doesn't have top tier consistency in practice
<heat>
no
<heat>
but again, manpages reflect reality, not the opposite
<mrvn>
heat: so your only problem is that the metadata doesn't become stable with fsync() but keeps changing for a while after that.
<heat>
you want to enforce sane, useful semantics for your filesystem so that things Just Work, are safe, make sense
<heat>
it's not "do the bare minimum"
<Griwes>
manpages reflect the behavior a user can depend on
<mjg>
i remember zfs is doing something nasty to keep fsync operational without demolishing perf
<mjg>
and it may be that it is the zil
<Griwes>
a user depending on more than the manpages say they can depend on is a user problem
<heat>
Griwes, I'm talking about changing the implementation :)
<mjg>
i'm only slightly familiar with zfs, mostly from its pessimal multicore code :[
<Griwes>
changing the implementation would not stop meaning you can only depend on what the manpages say you can depend on
<heat>
sure
<Griwes>
you are presenting all this as a bug
<heat>
make everyone do that, then change the man page
<mrvn>
Pol: Who here uses fdatasync() in their code because they only care about data integrity?
<mrvn>
+l
<Griwes>
it's not; it's a feature request
<heat>
it is a bug, there's no consistency in memory
<Griwes>
it's a feature request to widen the guarantees of fsync
<heat>
I'm saying they're not wide enough for normal usage
<Griwes>
and you are asking to widen the guarantees
<heat>
this is all happening because they wanted to make sync faster
<Griwes>
that's a feature request
<heat>
they regressed it
<heat>
it is in fact a bug, a regression. it was fine, now it's not
<heat>
this is something that needs to be fixed and not added
<Griwes>
they regressed the manpage?
<heat>
...
<heat>
i give up
<Griwes>
did the manpage promise you more earlier?
<Griwes>
if it didn't, then this is a feature request
<heat>
it's pointless
<mrvn>
Griwes: that depends on the interpretation
<Griwes>
look, I work on a formal standard
<Griwes>
and I deal with that distinction regularly
<mrvn>
Griwes: it doesn't flush (or even compute in ram) the metadata
<Griwes>
if it wasn't in the spec that you can depend on the behavior you want, it's not a bug
<mrvn>
and it's more the "compute in ram" part heat objects to, I think.
<Griwes>
mrvn, is it required to?
<mrvn>
Griwes: yes
<Griwes>
by?
<heat>
an operating system is not a fucking language
<heat>
you do realize that right?
<mrvn>
Griwes: I only have the manpage open but the specs surely say something similar: "As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7))."
<Griwes>
we're talking about an API that has a spec
<heat>
POSIX specifies the bare minimum
<heat>
the absolutely bare fucking minimum
<heat>
can they just comply to that? oh of fucking course
<heat>
is it useful? absolutely not
<mrvn>
heat: have you actually looked into the specs to see what it says about the inode metadata?
<heat>
it says nothing
<heat>
POSIX doesn't know what an "inode metadata" is
<heat>
POSIX will never make stronger guarantees about fsync, or anything else
<heat>
it's purposefully generic
<heat>
so every god damn UNIX out there can say they're POSIX(r) Compliant(r)(tm)
<mrvn>
heat: surely it must say something about fsync() vs. fdatasync()
Iris_Persephone has joined #osdev
<Griwes>
> This field indicates the number of blocks allocated to the file
<heat>
"The fsync() function shall request that all data "
<Griwes>
okay, are blocks allocated to the file while the write is in ZIL?
<mrvn>
Griwes: arguably.
<heat>
"The fdatasync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state."
<Griwes>
mrvn, but that's just ZIL blocks. the actual blocks haven't been allocated yet
<Griwes>
I'd argue that the only bug is that it returns 1 and not 0
<mrvn>
Griwes: it's space on the disk
<heat>
then that's the wrong fsync behavior
<heat>
why is zfs autonomously saying "fuck standard behavior" and trying to just be faster
<mrvn>
Griwes: same with journaling and COW copies. But then the block count changes over time. So it's not that useful.
<clever>
Griwes: ZIL can store data either in the log itself, or in the main pool, in both cases, there are data blocks assigned to that data, but the free space maps havent been updated
<Griwes>
mrvn, it also says "the unit may differ on a per-filesystem basis" - wonder if that means the filesystem can use two units depending on the file's state
<Griwes>
clever, but arguably those blocks are assigned to the ZIL entry and not to the file itself
<clever>
Griwes: zfs does allow the block-size for every file on an fs to differ
<Griwes>
idk, looks like a feature request to me
<mrvn>
Griwes: try computing st_block on a unionfs where each branch can have a different block size.
<heat>
st_block is in practice in 512byte units
<mrvn>
Personally the st_block has been rather meaningless for decades now.
<heat>
de facto
<heat>
anyway
<heat>
this is not worth debating
<clever>
[root@amd-nixos:~]# stat /nix/var/nix/db/db.sqlite
<clever>
mrvn: yikes, thats out of a total of 1153 L0 blocks, 90% of the blocks are fragmented
<mrvn>
heat: I agree with you that it's bad that st_block is reporting bad data. But fact is st_blocks has been total chaos for decades now. It's basically useless unless you know what FS you are checking.
<Griwes>
well be glad I'm not a zfs dev because I'd just slam your issue as closed, not a bug, rtfm
<heat>
what manual?
<heat>
do I need to be intimately familiar with my filesystem's format??
Iris_Persephone has quit [Ping timeout: 265 seconds]
<Griwes>
the manpages for fsync and inode
<mrvn>
heat: for st_blocks? yes.
<demindiro>
Is the block count even accurate? du -a random_text_file claims it's 5 blocks of 512 bytes, but the file itself is 2162 bytes large.
<dh`>
isn't st_block defined to count in 512-byte blocks? or was that something we did to avoid chaos?
<demindiro>
Which honestly doesn't make much sense to me
<demindiro>
(also I use ashift=12)
<clever>
mrvn: most of the holes on my disk are 8kb in size...
<mrvn>
dh`: no
<clever>
demindiro: check zdb as well, `zdb -ddddddddbbbbbbbb <dataset> <inode#>`
<mrvn>
demindiro: 2162 rounded up to the next full block is 5*512.
<demindiro>
wait
<clever>
demindiro: L0 blocks contain the actual data, L1 blocks are a list of L0 pointers and so on, if a block says gang then its fragmented into more pieces
<demindiro>
uh
<demindiro>
4.5K dux_notes.txt
<demindiro>
hm
<dh`>
for us it is: st_blocks The actual number of blocks allocated for the file in
<dh`>
512-byte units.
<dh`>
makes me wonder if our zfs does it right
<demindiro>
I'm trying to understand how du -ha arrives at 4.5K with 512 * 5
<Griwes>
dh`, are writes in the write journal (ZIL for zfs) "allocated for the file"?
<clever>
dh`: should st_blocks count metadata, like the indirect block tree?
<dh`>
traditionally it does, yes
<mrvn>
Another thing about st_blocks: When a file is small or the tail is small (or has small blocks in the middle) it can be included in the inode (or other metadata). How do you count that?
<clever>
dh`: what about when data blocks are of variable size?
<dh`>
As short symbolic links are stored in
<dh`>
the inode, this number may be zero.
<heat>
mrvn, ext4 fakes 1
<dh`>
clever: the goal is to get du(1) to print useful values, so ultimately it's up to the filesystem to be helpful
<heat>
in practice all relevant st_blocks are defined as 512-byte by the kernel
<clever>
mrvn: zfs has a feature called embedded block pointers, if the entire file data (after compression) is under ~300 bytes, it just shoves the data in where the block#+checksum would have gone
<clever>
[root@amd-nixos:/nix/var/nix/db]# ls -lhs
<heat>
I think most implementations converged into 512
<clever>
dh`: heh, all i did was cp it, and it gained more overhead!
<dh`>
that, however, is not reasonable!
<mrvn>
clever: should you report those 300 bytes as "1 512-byte block"? Or sum up all the data + metadata of the file (some 300+32+something bytes) and round up to 512-byte blocks?
<clever>
mrvn: now it has 32 fragmented L1 blocks, because it has twice as many L0's
<mrvn>
heat: did you see the note in stat "Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call."? So the stat data itself has never been deemed consistent anyway. :)
<dh`>
whose stat is that in?
<mrvn>
clever: Will that change over time as sqlite modifies blocks?
<mrvn>
dh`: man stat on Debian
<mrvn>
man 2 stat
<dh`>
ah, linux
<clever>
mrvn: once a file has 2 blocks, the block size is set in stone, but the CoW nature means the indirect blocks(L1/L2) are re-written on every change
<clever>
mrvn: so if my free space situation improves, the L1 blocks will un-fragment, as they get modified
<dh`>
one of the things I've been debating for years is whether it's acceptable for stat() to succeed and tell you the linkcount is 0
<clever>
but, each file on a filesystem, can have a different block size
<mrvn>
clever: but if you have 8k blocks and sqlite writes 4k changes then your L0 will fragment, or not?
<dh`>
apparently linux thinks so!
<clever>
mrvn: nope, zfs will do a read/modify/write cycle
<clever>
pull 8k off the disk, modify half of it, then write 8k back to disk
<mrvn>
dh`: more importantly what is the link count for a directory? The "normal" behavior lets you know the number of files and subdirs in that dir.
<clever>
same reason you want to align your partitions, so your 4k writes are not straddling half of 2 4k blocks
<dh`>
that's not more important, it's a completely different issue
<dh`>
:-p
<clever>
it still causes a read/modify/write cycle, but twice now!
<mrvn>
clever: so it stores 4k zeroes and 4k data when you have a hole?
<clever>
not sure what it does, if half a L0 is nulls, and compression is off
<clever>
and now that youve reminded me, let me flip compression on
<mrvn>
clever: with compression you get a lot more partially filled blocks.
<bslsk05>
github.com: lk-overlay/zfs.c at master · librerpi/lk-overlay · GitHub
<clever>
currently, its able to parse the vdev labels, find the most recent uberblock, load the MOS dnode, and partially load the root dataset dnode
<clever>
to get directory traversal, i need to implement ZAP parsing
<clever>
a fat-zap is basically a b-tree, hash the filename, use the hash in an index to find the block# with the name->inode entry
<clever>
but if the filenames are small and the total size of the serialized structure is small, it instead becomes a micro-zap, no index, just a single block of name->inode pairs
<clever>
demindiro: only sha256 is validated currently, but zfs uses fletcher4 for metadata, so my more critical reads arent validated
<clever>
and only lz4 decompression is supported
<clever>
mrvn: oh right, and with 4kb blocks, compression is pointless
wxwisiasdf has joined #osdev
<clever>
mrvn: block allocations, are done in units of ashift, 4kb for my current setup
<clever>
so with 4k fs blocks, it takes 4kb of data, compresses it down, then stores it in a 4kb block!
<clever>
with 8kb blocks, you need to halve the size of the data, for compression to actually be a benefit
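The arithmetic clever is describing, as a tiny helper (assumed sizes and rounding, not ZFS's actual policy code): allocations only shrink in ashift-sized units, so compression has to save at least one whole unit to be worth storing:

```c
#include <stdbool.h>
#include <stddef.h>

/* true if storing the compressed form actually uses fewer on-disk units */
static bool compression_saves_space(size_t logical, size_t compressed,
                                    size_t ashift_bytes)
{
	size_t logical_alloc    = (logical    + ashift_bytes - 1) / ashift_bytes;
	size_t compressed_alloc = (compressed + ashift_bytes - 1) / ashift_bytes;
	return compressed_alloc < logical_alloc;
}
/* with 4 KiB records and a 4 KiB ashift this can never return true;
 * with 128 KiB records, shaving a single 4 KiB unit is enough */
```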
gildasio has quit [Remote host closed the connection]
<clever>
yep, i can confirm that in zdb, despite asking for lz4, a large chunk of the blocks arent compressed
<clever>
zfs realized it wasnt a net-gain, so it just stored the uncompressed version instead
gildasio has joined #osdev
<heat>
dh`: now that you're here, mjg told me to ping wrt rename and locks
<heat>
<heat> what do you think of a per-filesystem rename lock?
<dh`>
what about it? rename is not trivial and ~everyone out there has it wrong
<heat>
why wrong?
<dh`>
as far as I know there are only two correct ways to do it: one is with strict two-phase vnode locking and the other is with a per-volume rename lock
<dh`>
so yes
<heat>
linux is seqlocking these days
<heat>
soooooo i guess it seems to work
<heat>
although I don't fully understand how that works
<heat>
how likely is it that openbsd is using a global mutex or something weird like that?
<dh`>
openbsd almost certainly has the completely broken 4.4 code
<heat>
lmao
<dh`>
it is unlikely that they've changed it at all
<heat>
what makes it broken?
<dh`>
the 4.4 code will not deadlock at runtime (though mostly by accident) but it will cheerfully detach and lose sections of the namespace, so you don't realize anything bad happened until you get a fatal fsck failure
<dh`>
and I've seen that failure be unrecoverable by fsck, too (fortunately on a crash machine)
<heat>
ugh
<Iris_Persephone>
I am now leaning towards "implement POSIX as closely as you can, except for the parts where POSIX is stupid"
<dh`>
the 4.4 code just randomly locks and unlocks with no clear pattern
<heat>
rename is not stupid tbf
<heat>
it's just hard to implement
<dh`>
it obviously meant something to someone at some point but it never made any sense to me
<dh`>
there's a POSIX_MISTAKE in rename
<heat>
are you talking about 4.4's path/directory code in general or just rename?
<dh`>
which is: ordinarily if you rename a/b over c/d, it unlinks d and replaces it with a
<dh`>
but if you first hardlink b and d, posix says this does nothing at all rather than unlinking b
<clever>
heat: is this what you were saying? i ran `sync` in the CLI, and the file still claimed to be 512 bytes, then seconds later, 80M!
<dh`>
4.4's pathname and directory code in general is a mess but rename is quite extra
<heat>
clever, ack
<clever>
mrvn: also, because i raised my block size back to 128kb, lz4 doesnt have to work as hard to get savings (i fixed my free space)
<heat>
I should be fancier with rename
<dh`>
anyway, the fundamental nature of the problem is: if you rename a/b/c/d/e to f/g/h/i/j, you need to make sure e isn't an ancestor of j in the tree
<clever>
it still has to shave off at least 4kb, but 4kb off 128kb is far easier
<heat>
I'm doing unlinks + links as the base operations which seems unsafe
<dh`>
this check is nonlocal and so requires nonlocal locking, hence the per-volume lock
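A sketch of that nonlocal check under a per-volume rename lock, with hypothetical in-memory types and a pthread mutex standing in for whatever lock the kernel would use:

```c
#include <stddef.h>
#include <pthread.h>

/* hypothetical in-memory node: just enough to walk ".." upward */
struct vnode {
	struct vnode *parent;
};

struct volume {
	pthread_mutex_t rename_lock;	/* the per-volume rename lock */
};

/* true if `anc` is an ancestor of (or equal to) `node` */
static int is_ancestor(struct vnode *anc, struct vnode *node)
{
	for (struct vnode *v = node; v != NULL; v = v->parent)
		if (v == anc)
			return 1;
	return 0;
}

static int rename_dir(struct volume *vol, struct vnode *src,
                      struct vnode *dst_dir)
{
	int err = 0;

	pthread_mutex_lock(&vol->rename_lock);
	/* no concurrent rename can reshape the tree while we walk up from
	 * dst_dir, so this nonlocal check cannot race */
	if (is_ancestor(src, dst_dir))
		err = -1;	/* would make src a descendant of itself */
	else {
		/* ... lock the two directories in a safe order and relink ... */
	}
	pthread_mutex_unlock(&vol->rename_lock);
	return err;
}
```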
<clever>
heat: in the ZIL, rename is a recordable action, so it can just append that to the ZIL and fix the metadata later
<dh`>
you also need to make sure that you don't violate the locking order rules, which can depend on the location of the subpaths in the tree
<clever>
and things remain atomic and consistent after a power failure
<dh`>
unlink and link is fine for renaming regular files; it's dirs that make trouble
<dh`>
well, fine in the sense that if you crash in the middle you might be left with both names if you don't journal it
<heat>
yrs
<heat>
yea
<clever>
demindiro: but then hardlink counts?
<clever>
dh`: oops, ^
<dh`>
clever: if you don't increment the link count while doing the link/unlink you can detect that a rename was in progress but that violates an intended invariant and can cause you other problems
<clever>
now that i say that, i dont know how zfs manages link counts....
<dh`>
you know how with the original softupdates you're supposed to be able to fsck in the background while the system starts up?
<clever>
ah, in the bonus space on a dnode
<dh`>
never enable that, it's not safe
<dh`>
because of rename!
<heat>
lol
<dh`>
even with softupdates, rename must be fixed up afterwards by fsck or journal replay
<heat>
love me some rename
<dh`>
there's no sequence of writes that can make an operation atomic in two separate places at once
<clever>
dh`: zfs can do that with both the zil and the txg
<clever>
for the zil way, the rename is just appended to the log, on replay you try the rename again
<dh`>
so the failure mode is: rename /tmp/foo /tmp/bar/baz, crash in the middle, go to clean tmp, rm -r goes into /tmp/bar/baz but comes back out into /tmp, so it thinks it's one layer deeper than it is, then it comes out of /tmp and starts erasing the whole system
<clever>
for the txg, every transaction in that group must be committed to disk at once, and due to the cow/immutable nature, that involves creating a new set of state, where the rename has fully happened
<clever>
and that new state doesnt take effect until you update the uberblock, the single root of truth
<dh`>
clever: any reasonable journaling or snapshot-based solution deals with the problem
<clever>
if you get interrupted, the new state basically never existed
<clever>
yeah, this is both journaling and snapshotting
<clever>
the ZIL is a journal, while the cow/immutable nature of the main pool is snapshotting