klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
<heat> it's a multiple-thousand-line project
<clever> QT even has a primitive that can kinda do this, deleteLater, but its not multi-thread friendly
<heat> right
<heat> so it's useless :)
<heat> the hard part is making it scale
<clever> deleteLater is mainly so class method can delete `this` without the destructor then becoming a use-after-free
<heat> and the parts that mostly make it scale are still under patents
<heat> or maybe they're not! it's a minefield
<heat> who knows if you'll blow off your leg
<heat> better step on it and see what happens
<clever> and if i just come up with a solution on my own and implement it, i could still get sued??
<heat> clever, you can totally delete this;
<heat> yes
<heat> patents baby
<heat> well, IANAL but that's my understanding of it
<heat> ...I still don't understand the fucking point of patenting it but fuck IBM anyway
<clever> heat: i can see 2 main parts of RCU that are costly, 1: the copying, 2: when to do the free
<heat> yup
<clever> what if i just put it on a timer, and free the object after 60 seconds? if any irq context is holding a reference that long, youve done something wrong
<heat> linux has like 3 different RCU implementations
<heat> that sounds bad
<heat> you fire off an interrupt for every object?
<heat> you can also get interrupts in RCU sections
<clever> the rough view ive seen, is that the rx irq on your NIC, will then use an RCU to parse iptables, and either block or accept the packet
<clever> and RCU is used, so you can mutate the tables, without blocking all packet rx
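The pattern clever describes (readers never block, a writer copies, edits the copy, then publishes it) can be sketched with C11 atomics. This is a toy illustration only, not Linux's RCU: there is no grace-period machinery, so the old table is leaked instead of freed — the deferred free is exactly the hard part being discussed. The `rules`/`check_packet` names are made up for the sketch:

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

struct rules { int nrules; int verdict; };

static _Atomic(struct rules *) current_rules;

/* Reader (e.g. the rx-irq path): a single atomic load, never blocks. */
int check_packet(void) {
    struct rules *r = atomic_load_explicit(&current_rules, memory_order_acquire);
    return r ? r->verdict : 1; /* no table yet: default accept */
}

/* Writer: copy, modify the copy, publish. The old copy may only be
 * freed once every reader is done with it -- that deferred free is
 * what RCU solves, and it is simply leaked here. */
void update_rules(int new_verdict) {
    struct rules *old = atomic_load_explicit(&current_rules, memory_order_relaxed);
    struct rules *copy = malloc(sizeof(*copy));
    if (old)
        memcpy(copy, old, sizeof(*copy));
    else
        copy->nrules = 0;
    copy->verdict = new_verdict;
    copy->nrules++;
    atomic_store_explicit(&current_rules, copy, memory_order_release);
    /* real RCU would do: synchronize_rcu(); free(old); */
}
```

Readers in flight keep seeing the old table until the store publishes the new one, which is why a rule update never stalls packet rx.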
xenos1984 has joined #osdev
DanDan has quit [Ping timeout: 252 seconds]
<heat> sure
<heat> RCU is super abused all around linux
<heat> your fd table is entirely made out of RCU
<clever> ah, hadnt known that
<heat> you can actually modify it concurrently
<mjg> no you can't
<heat> there's no lock around it
<mjg> bro
<heat> you can't?
<mjg> where the fuck you taking this from
<heat> source?
<clever> heat: i assume there is a cmpxchg, so concurrent writes dont undo eachother
<mjg> shit scalability of fd allocation is a long standing sore point
<mjg> posix requires that you hand out lowest fd number possible
<mjg> which serializes the shit out of it
<bslsk05> ​elixir.bootlin.com: fdtable.h - include/linux/fdtable.h - Linux source code (v6.0) - Bootlin
<heat> I'm seeing some RCU there
<mjg> it is for *lookup*
<mjg> fd -> file translation
<clever> mjg: and so that translation, can happen in parallel to a mutation?
<clever> because the edit is occurring on a copy
<clever> but 2 edits would still serialize
<mjg> yes
<heat> ah ok
<heat> my bad then
<mjg> but most importantly you can have independend lookups not reducing each other's perf
<mjg> independent even
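The split mjg describes — lock-free fd→file lookup, but allocation serialized by POSIX's lowest-free-fd rule — can be sketched like this (a toy model, not Linux's fdtable; a flat array stands in for the real resizable RCU-protected table):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

#define FD_MAX 64
struct file { int refcount; };

static _Atomic(struct file *) fd_table[FD_MAX];
static atomic_flag fd_alloc_lock = ATOMIC_FLAG_INIT;

/* Lookup: one atomic load, so independent lookups don't slow
 * each other down. */
struct file *fd_to_file(int fd) {
    if (fd < 0 || fd >= FD_MAX)
        return NULL;
    return atomic_load_explicit(&fd_table[fd], memory_order_acquire);
}

/* Allocation: POSIX requires handing out the lowest free fd, which
 * forces a scan under one lock -- the serialization mjg is
 * pointing at. */
int fd_alloc(struct file *f) {
    while (atomic_flag_test_and_set_explicit(&fd_alloc_lock, memory_order_acquire))
        ; /* spin */
    for (int fd = 0; fd < FD_MAX; fd++) {
        if (!atomic_load_explicit(&fd_table[fd], memory_order_relaxed)) {
            atomic_store_explicit(&fd_table[fd], f, memory_order_release);
            atomic_flag_clear_explicit(&fd_alloc_lock, memory_order_release);
            return fd;
        }
    }
    atomic_flag_clear_explicit(&fd_alloc_lock, memory_order_release);
    return -1;
}
```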
<clever> another thing i see often in databases is a r/w lock, where multiple readers dont conflict
<clever> but a writer needs exclusive access, not even other readers
<mjg> sounds like the rw lock conflicts
<heat> rw locks are still bad
<mjg> :>
<mjg> heat: whatever rcu you see, it is most likely only for lookup
<clever> yeah, rca avoids the writers blocking readers
<heat> your writers may wait for a long time, and cache lines goes bouncy
<clever> rcu*
<mjg> read locking already serializes lookup perf
<clever> rcu is meant for when your readers must never block, and writes are rare
<mjg> provided this is the standard one-word to bump
frkzoid has quit [Read error: Connection reset by peer]
<mjg> which is why rcu to begin with
<heat> is the fbsd vfs abusing epoch yet?
<heat> or are you not there yet
<mjg> it is abusing smr
<heat> cringe
<heat> rcu-like or riot
<mjg> 's how i got the perf which i already linked
<mjg> the lookup which scales
<heat> i dont understand what smr does
vdamewood has joined #osdev
<clever> hdd smr? or some other smr?
<heat> lmao
<bslsk05> ​reviews.freebsd.org: ⚙ D22586 Implement safe memory reclamation in UMA.
<clever> ah, safe memory reclamation?
<heat> yes
nyah has quit [Ping timeout: 260 seconds]
<mjg> it basically waits for a counter to gtfo
<mjg> except contrary to rcu it does not do it with context switches
<mjg> i don't know what other conceptual differences are there
<mjg> i can tell you it sucks though because it has 0 numa support
<heat> perfect
<mjg> this is fixable, just did not happen
<heat> onyx also has 0 numa support
<mjg> despite me telling the author it will be a problem
<heat> it goes hand in hand
<mjg> oh you can take it then
<heat> i'll wait for the PR
<mjg> i posted one yesterday!
<mjg> i guess github lost it
<mjg> too bad
<heat> oh really?
<heat> i can set up a gerrit dont worry
<heat> I also take email patches
<heat> you can CC tech-kern@netbsd.org
<heat> (It's my email)
<mjg> i did
<mjg> gogole mail must be on a fritz this weekend
<heat> damn gog
<mjg> ye wtf gog
<heat> go send the emails
<mjg> this is why onyx can't have nice things
<mjg> heat: openbsd has its own delayed memory reclamation solution
<mjg> it is... something
<heat> let me guess
<heat> it delays inode reclamation until the end of the universe
<heat> fun fact I almost applied to netbsd's GSoC a year ago
<heat> you could have a prized netbsd developer here
<heat> sadly I got fucking ghosted
<mjg> lol
<mjg> that was... an interesting choice
<mjg> i can tell you OS gsoc tends to be a disaster
<mjg> for example there was a student accepted in freebsd few years back
<mjg> not only was he not even using the system, he did not know how to install it in a vm
<heat> ... I ended up in tianocore
<mjg> so that did not work out
<heat> i'm now a prized tianocore maintainer
<heat> it's like
<heat> the opposite of netbsd
<mjg> not only one arch but *runs* there?
<heat> hrm?
gog has quit [Ping timeout: 268 seconds]
<heat> >install it in a vm
<heat> this is why I would never apply for openbsd
<heat> :)
<heat> I think the project was what, adding PCI support to userspace for rump? something like that
<heat> I did look into freebsd but the projects didn't have much kernel fun
frkzoid has joined #osdev
<bslsk05> ​wiki.netbsd.org: Userland PCI drivers (350h)
<heat> it does sound like great fun, maybe I should take a look again
<mjg> well
<mjg> you should have asked around
<mjg> i can't be fucked to add a gsoc proposal
<mjg> but if someone can code up something in the kernel i may be up for mentoring
<heat> i'm relatively interested :v
<heat> i want something to scratch my "contributing to a real project" itch
<mjg> > "contributing to a real project"
<mjg> > bsd
<mjg> lol
<mjg> just kidding
<heat> or are you
<mjg> im watching wandavision
<mjg> i don't give a crap about super hero shit and don't even know who the characters are
<mjg> i heard the show is quite original and whatnot so gave it a shot, i'm mid episode 2 and so far A+
<heat> i pay for disney+ and I watch like 0 shows and movies
<heat> it's a waste of money but I know i'll want to watch shit from disney+ if I cancel my subscription
<heat> it's how it goes
<heat> I stopped having netflix and then I suddenly got the urge to watch like every dave chapelle special
<mjg> lol
<mjg> for real though, very refreshing
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
aleamb has joined #osdev
dude12312414 has joined #osdev
dude12312414 has quit [Remote host closed the connection]
dude12312414 has joined #osdev
dude12312414 has quit [Remote host closed the connection]
dude12312414 has joined #osdev
wxwisiasdf has joined #osdev
<wxwisiasdf> hello :D
<heat> henl
<heat> o
<wxwisiasdf> heat
<wxwisiasdf> you know i am the person who did tss right
<wxwisiasdf> and also the 286 memes right
<heat> yes
<wxwisiasdf> well, we expand now with bigger memes: kernel made in COBOL
<heat> cheers mate
<heat> whatever makes you happy honey
<wxwisiasdf> COBOL is dynamic like python so it was a hassle to get it working but finally
<wxwisiasdf> uart driver in cobol
<heat> I've progressively been beating openbsd's path walking these past few days
<wxwisiasdf> ?
terrorjack has quit [Quit: The Lounge - https://thelounge.chat]
<heat> i'm now 1.15x faster than openbsd at threads=4 path walking
<heat> sorry
<wxwisiasdf> oh you're optimizing multithreading for your kernel?
<heat> 10.15?
<heat> something stupid like that
<heat> yes
<wxwisiasdf> ohoh, nice
<heat> ah yes, 11.5x
<heat> it turns out openbsd bad onyx good
<heat> subscrarb
<wxwisiasdf> exactly
terrorjack has joined #osdev
<heat> it's still not ideal but
<heat> it is what it is
<wxwisiasdf> i mean if it works for you
<heat> yeah i feel great
<heat> beating 30 years of bsd development feels fucking amazing
<heat> i wonder how it measures up with netbsd
<heat> anyway, now I'm writing a radix tree
<wxwisiasdf> i am curious, what did you do to optimize that made it faster than OpenBSD?
<wxwisiasdf> better mem align?
<wxwisiasdf> lazier context switch?
<heat> oh I was already almost 10x faster but misread the number
<heat> :D
<wxwisiasdf> :p
<heat> I was 1.6M, they were 200k
<wxwisiasdf> in what context, context switches per sec?
<heat> then I added rw spinlocks instead of sleepable rw locks, got 1.8M
<heat> some will-it-scale test of path opens per second I think
<wxwisiasdf> ah
<heat> then I did a bunch of work and disabled ubsan, 200k
<heat> then I did the real work and replaced my shitty musl malloc with a proper slab allocator like a proper unix
<heat> with percpu caches
<wxwisiasdf> haha
<heat> so I'm at 2.3M
<wxwisiasdf> wonderful
<heat> yeah
<heat> malloc locks were like 20-30% of my flamegraphs
<heat> now i can barely see them
<wxwisiasdf> how did you
<wxwisiasdf> benchmark your os????
<heat> i added flamegraphs
<wxwisiasdf> did you do like an xlsx output file via serial for the timings
<wxwisiasdf> oh
<heat> no lmao what
<wxwisiasdf> uh oh :(
<heat> basically you need a timer and a way to get stack traces
<heat> then you symbolize them and send them however you'd like
<heat> I was using qemu's serial yeah
<wxwisiasdf> fair
<wxwisiasdf> and how did you bench openbsd?
<heat> then I was running flamegraph scripts on it
<heat> bench? I used will-it-scale
<wxwisiasdf> okie
[itchyjunk] has quit [Read error: Connection reset by peer]
<heat> anyway, I'm going to take care of vm technical debt next
<heat> I'm going to use a radix tree thing, not too unlike page tables, to do this
<heat> vm_objects have been using a rb tree which is not ideal
<wxwisiasdf> fair
<wxwisiasdf> what would you recommend for a memory allocator btw
<heat> what as in the allocator or an algo?
<wxwisiasdf> algorithm
<heat> i'm using slabs
<heat> it works well
<wxwisiasdf> okay
<heat> well, it depends on what allocator you're talking about
<heat> I have 3 allocators that work exactly like a stack :P
<wxwisiasdf> well i just asked because when i implement memory allocators i usually just use a freelist
<wxwisiasdf> not a freelist, just more like linked list
<heat> plus a simple memory pool and a bootmem allocator
<wxwisiasdf> oh
<heat> yeah I have 5 allocators
<heat> technically 6
<heat> anyway it depends
<wxwisiasdf> sounds messy to maintain ig
<wxwisiasdf> i usually just have 2 allocators: 1 physical, 1 virtual (sometimes)
<heat> if you want malloc you can do a slabish approach, or a buddy allocator
<heat> there are multiple approaches
<heat> more than these, really
<wxwisiasdf> yeah but i mean on multicore
<wxwisiasdf> whats a good approach to scalable multicore and smp
<wxwisiasdf> i always have issues doing smp because i make my kernel be just expecting 1 thread and when i enable smp: boom it all implodes
<heat> I have a page allocator, a virtual memory allocator, a vmalloc allocator (uses vm's algorithm more or less, but conceptually different and way simpler), a bootmem allocator (to allocate page allocator structures), a page allocator (for now, simple list of pages) and a memory pool (used for simple, stupid allocations.)
<heat> I meant slab allocator first
<wxwisiasdf> fair
vdamewood has joined #osdev
<heat> anyway, I think that for multithreading as far as I've seen you really need percpu/per thread caching
<heat> it's generally The Way
<heat> grabbing the lock = bad
<heat> and asking a backend for memory is also bad, so avoid giving memory back
<bslsk05> ​gist.github.com: onyx-open3.svg · GitHub
<heat> this is one of my earlier flamegraphs
<heat> if you open this on non-github you'll be able to click through it
<wxwisiasdf> oh
<heat> you'll see malloc and __bin_chunk(free) very easily on the graph
<heat> and most of it is just waiting on a spin lock
<heat> on 4 THREADS
<wxwisiasdf> do_syscall_64 moment
<heat> anyway "enabling SMP and it blowing up" is not because of a lack of efficiency
<heat> you're probably just missing locks
<wxwisiasdf> right
<heat> an SMP kernel is just a really big really multithreaded thing
<heat> so spinlocks, mutexes, rwlocks everywhere
<heat> or fancier things in the late game
tarel2 has joined #osdev
<clever> thats something i still need to investigate on the VPU
<clever> i can turn on the 2nd core and run code there, but i have yet to enable proper SMP in LK
<clever> i dont trust the lock primitives yet
<heat> test em
<clever> yeah, thats on my todo list
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<clever> heat: from memory, there are no atomic memory access ops, only ~16 hw spinlocks, reading it will return one value for the winner, and another value for all the losers and all future reads
<clever> a write will reset it
<clever> any core can reset it, software rules should limit it to only the lock holder
<heat> hash your software spinlocks on those 16 hw ones
<clever> i think the hw spinlock also cant be indexed from a register
<clever> its basically a cpu register
<clever> currently, i just have one hw lock, that protects all "atomic" ops
<heat> that's just needlessly slow
<clever> yeah
<clever> it works, but could be improved
<heat> woohoo linux 6.0
<heat> it's so new kernel.org hasn't updated
<wxwisiasdf> i am a hurr durr ninja
<wxwisiasdf> or something it was codenamed
<wxwisiasdf> linux giving its kernel silly names :P
<heat> Hurr durr I'ma ninja sloth
<bslsk05> ​github.com: Linux 3.11 · torvalds/linux@6e46645 · GitHub
<heat> easily the best name
<klys> I've been setting up an Epyc 7453 system that takes 3-4 min. to boot. It's been giving me trouble with gpu passthru. nouveau.ko made the screen light up, that's about all I have so far on this gtx 960 card. passthrough works on another box though. using debian.
<heat> don't use nouveau.ko and try again
<klys> I will try again another time, as it does take over three minutes to try again.
<heat> servers do be like that
<heat> kexec?
<klys> there's a kexec binary somewhere, and it's like kexec /boot/vmlinuz-4.16.18 root=/dev/sda5 rw blah blah; ?
<heat> something like it
<heat> --reuse-cmdline
<bslsk05> ​wiki.archlinux.org: Kexec - ArchWiki
<heat> i should kexec more often
<klys> I was using it with my w7pro in virt-manager, and device manager was like "there was a problem with this device. no resources are allocated because there was a problem."
tarel2 has quit [Ping timeout: 252 seconds]
<klys> one thing I should double check is that it is really using ovmf
<klys> because I don't see the tianocore logo
<heat> then it's not
<klys> seems reasonable, thanks I'll keep that in mind.
<klys> meanwhile I have the Epyc 7453 box built now, with mobo, transitional 700W power, 4u case (yes the fan is too big), 56 threads, and 256GB A-grade samsung ram with heat spreaders.
<geist> server stuff can spend a lot of time training the dram on every boot
<geist> the arm server box i have takes 2-3 minutes to retrain the 16 dimms or so, but at least it dumps a whole log over serial so you can watch it
<geist> it gives ilke 10 seconds per dimm, and there are 16 of them, so it kinda makes sense
heat has quit [Ping timeout: 268 seconds]
<mrvn> Why do some boards do that and others don't? Why isn't the result saved in NVram?
<clever> mrvn: the rpi4 is one case where it doesnt save it to ram
<clever> it adds a noticable amount to the boot time
<mrvn> On a SOC I can kind of see the penny saving attitude. But a server? Where uptime matters?
<clever> dont reboot your server :P
<clever> uptime matters!
<mrvn> Sorry, we can't give you 99.9% uptime, the ram retraining takes too long.
<clever> ive also heard claims that the training depends on temps
<clever> if the ram/soc are hot, it needs different training
<clever> and its constantly fine-tuning it while running
<mrvn> But then it should retrain at runtime, after the ram has heated up.
<clever> but if you do a reboot, the ram is already hot
<clever> and the timings in nvram are wrong
<mrvn> if you do a reset then it should just keep the settings, doesn't even need NVram. Just have a few bit in the DIMM that says: I've been trained.
<mrvn> But all this still doesn't answer the original question: Why do some boards take 10s per DIMM to train?
<clever> yeah, good question
<clever> and the lack of source makes it near impossible to answer
<klys> likely because I have 32gb per dimm
<klys> which all has to squeeze thru a bus
<mrvn> klys: No. Other boards with the same amount of ram don't take so long.
<klys> it's ecc if that matters
<mrvn> I sure hope nobody runs a server without ECC
<klys> CL22 column access strobe
frkzoid has quit [Read error: Connection reset by peer]
archenoth has joined #osdev
[itchyjunk] has joined #osdev
<geist> mrvn: so i think a few reasons: one is it *does* take longer on a new/clean boot. in my case it takes over 5 minutes until it's done a training once, so there's some sort of deeper thing there
<geist> and i think the idea is lots of traces across the board with larger sets of dimms requires more training
<geist> vs say 2 channels, maybe 4 dimms max
<mxshift> If you get info about DIMM training over serial, someone left debug turned on which makes it take considerably longer
<mxshift> Training data is usually cached in the main flash so sometimes naive firmware updates erase it
<geist> yah like i said there's two levels of training on this machine. one of which takes over 5 minutes, the faster one being about 10 seconds per dimm
<mxshift> Training data is also invalidated if the DIMM config changes at all. Even shuffling DIMMs will invalidate since the SI channel being measured is the SoC+socket+motherboard+DIMMs
<mxshift> Normal training is just trying to find an operating point. Debug will often do a 2D sweep of voltage and timing adjustments to give a report of how much margin there is around that operating point
* geist nods
<geist> yah the dump out of this machine shows basically that
<mxshift> I haven't seen training data be invalidated due to temperature. It's plausible. I have seen it be invalidated after some number of boots on the assumption that the channel may age
<mxshift> I spent way too much time looking at EPYC 7xx3 training during our early bring-up
<mxshift> You can enable debug or MBIST training on any EPYC 7xx3 system by setting a few APCB tokens to tell the PSP what you want
<mxshift> Training is usually pretty quick but zeroing ECC DIMMs scales with amount of DRAM installed
vin has quit [Remote host closed the connection]
edr has quit [Quit: ZNC 1.6.5 - http://znc.in]
[itchyjunk] has quit [Read error: Connection reset by peer]
genpaku has quit [Read error: Connection reset by peer]
genpaku has joined #osdev
k8yun__ has joined #osdev
k8yun_ has quit [Ping timeout: 252 seconds]
k8yun__ has quit [Ping timeout: 268 seconds]
<geist> oh that's a fair point. does all ECC by definition need to be initialized before you can use it
<geist> i guess so, right? or some sort of bit that says this is uninitialized
<Mutabah> At a guess, the first read would return a ECC error if it's not written before
<geist> yeah, never thought about that but i guess that's true. an ECC machine would probably need to take at least a few more seconds to boot to zero everything out
<Mutabah> Note: I'm just guessing at how ECC works, but it seems like the easiest way
scoobydoo_ has joined #osdev
scoobydoo has quit [Ping timeout: 252 seconds]
scoobydoo_ is now known as scoobydoo
wxwisiasdf has quit [Ping timeout: 252 seconds]
tarel2 has joined #osdev
heat has joined #osdev
<heat> morning
<heat> i was reading that old ass linux 2.4 buddy allocator doc and I realized how they avoid always coalescing and splitting buddies
<heat> they have a percpu list of pages (order 0) which they allocate from
<heat> and do exactly the batch thing that you would do in slab allocation
<heat> this is very interesting and may be worth considering
ThinkT510 has quit [Ping timeout: 246 seconds]
<heat> btw re memory training, from what I've gathered the new DDR5 machines take a stupid amount of time to train memory
<heat> like 5 minutes
<heat> for fucking laptops
<Griwes> finally, bringing the server experience to the masses
k4m1 has quit [Quit: Lost terminal]
k4m1 has joined #osdev
<kazinsal> me: "I can't wait to build a new machine to replace my old SB Xeon box that takes 10 minutes to post after a power failure"
<kazinsal> DDR5 board manufacturers: "haha guess what"
<mjg> :]
terminalpusher has joined #osdev
gog has joined #osdev
ThinkT510 has joined #osdev
gog has quit [Client Quit]
kof123 has joined #osdev
GeDaMo has joined #osdev
Oshawott has joined #osdev
archenoth has quit [Ping timeout: 268 seconds]
SGautam has joined #osdev
xenos1984 has quit [Read error: Connection reset by peer]
terminalpusher has quit [Remote host closed the connection]
terminalpusher has joined #osdev
tarel2 has quit [Ping timeout: 252 seconds]
xenos1984 has joined #osdev
epony has joined #osdev
DanDan has joined #osdev
ThinkT510 has quit [Ping timeout: 268 seconds]
elastic_dog has quit [Ping timeout: 244 seconds]
elastic_dog has joined #osdev
ThinkT510 has joined #osdev
marshmallow has joined #osdev
<mjg> ffs
<mjg> i finished watching wandavision
<mjg> starts super strong, then quality takes a huge hit and it's downhill from there
<GeDaMo> Some of the Marvel stuff from last year was affected by Covid
<GeDaMo> I like Marvel but everything after Endgame has been a bit lackluster
<heat> lmao
<bslsk05> ​github.com: Kernel page-in linking · Issue #584 · rui314/mold · GitHub
<mrvn> "makes lots of pages dirty"? Firefox has a few gig of ram that's all dirty. The binary is a few MB, of which maybe 100-200 pages are dirtied due to relocations. Is this worth it to fix?
tarel2 has joined #osdev
<heat> the point is that it's slow
<heat> also wasteful, yes
<mrvn> heat: Relocations is not why firefox is slow to start
<heat> ok
<heat> but firefox is an extreme example
zaquest has quit [Remote host closed the connection]
<mrvn> Also consider this: If you use a good fraction of the functionality then all relocation pages will be used. So you aren't saving time but paying for it at runtime instead of once at load. And with 2 context switches per page unless they do this fully in kernel.
<heat> Onyx's PIE compiled bash has 2543 relocations
<mrvn> heat: how many plt pages is that?
zaquest has joined #osdev
<heat> it's not just plt and got but .data too for instance
<mrvn> much harder to count in data though.
<heat> right. the only way would be to stop in the middle of linking and check the dirty pages
<heat> my /usr/lib/libstdc++.so has _5971_
<mrvn> and you can fix up a million a second?
<heat> all of which are relocated at startup
<heat> can you?
<mrvn> Not counting the load time a million a second sounds reasonable.
<heat> touch -> fault in 4KB -> write, but mostly randomish access
<heat> the load time is the biggie
<mrvn> debatable. The load happens when you use the program anyway.
<heat> the point is that it takes time
<mrvn> heat: aren't relocations sorted linear?
<heat> no
<mrvn> well, sort them. That should speed things up when you fix pages sequentially.
<heat> my clang has 361574 lmao
<mrvn> For me the interesting number would be how many of those are accessed at runtime and not just for the fixup.
<froggey> how many unique pages need relocations? if you have 1000 relocations and they're all on the same page, you don't win anything but maybe you do if they're scattered over 1000 pages
<mrvn> The one thing interesting in the proposal is that the kernel can throw away pages and relocate them again when it later pages in the page again.
<heat> those are both interesting questions I can't answer
<mrvn> froggey: For library calls the compiler uses a trampoline and puts all those together so they take very few pages.
<mrvn> Oh, and those trampolines do the fixup on first use.
<heat> not true
<heat> in fact doing fixups on first use isn't a great idea
<heat> lazy binding is error prone
<heat> let me call this fun... process just crashed
<mrvn> heat: why would it crash?
<heat> out of memory, symbol not found, etc
<mrvn> I believe the symbols are checked at load, just the address calculation happens later.
<froggey> iirc hardened binaries that use relro also do relocations up-front
<bslsk05> ​wiki.musl-libc.org: musl libc - Functional differences from glibc
<mrvn> Bit of a contradiction in the proposal: You want to make it faster but then you page-out data and have to relocate it again on page-in. So you will be doing relocations over and over in low mem situations.
tarel221 has joined #osdev
<mrvn> I wonder if Apple benchmarked the old way, their new idea and using a global base address register.
<mrvn> For statically linked PIE the latter should solve all problems.
<heat> that's the worst idea
<heat> you take away a register completely
tarel2 has quit [Ping timeout: 252 seconds]
<mrvn> heat: you have 32, a register is cheap :)
<heat> no you don't
<heat> you have 15
<mrvn> heat: ARM had 15, AArch64 has 31. :)
<mrvn> Even with 15 testing the cost of one reg vs. all the relocations is worth it.
<heat> x86-64 has 15
<heat> I don't see how that can be true
<mrvn> you think measuring the effect is pointless?
<heat> I would prefer paying a memory cost over taking away a precious register (and breaking the ABI in the meanwhile)
<heat> I guess you could feasibly use x18 in arm64 to do that kind of fuckery
<heat> but is it free? probably not
<mrvn> You are designing a new dynamic linker here. Total ABI breakage.
<heat> having a relocation aware kernel would not break any ABI
<mrvn> heat: they changed the relocations data too
MiningMarsh has quit [Read error: Connection reset by peer]
MiningMarsh has joined #osdev
<heat> they who? apple?
nyah has joined #osdev
<mrvn> heat: yes
<heat> you could very much implement this either in the kernel or in a dynamic linker without changing any part of the ABI
terminalpusher has quit [Remote host closed the connection]
terminalpusher has joined #osdev
terminalpusher has quit [Remote host closed the connection]
<heat> actually looking at actual numbers this seems a bit useless
<bslsk05> ​chromium.googlesource.com: Chromium Docs - Native Relocations
dude12312414 has joined #osdev
<mrvn> heat: 20ms for 390'000 relocations. See what I mean?
<mrvn> And 6.5MB of memory. Thats less than 0.1% of what chrome uses.
<heat> it probably depends on the storage medium
<heat> 20ms is not a lot for the hugest program out there
<mrvn> I assume they didn't measure a cold start from rotating disks.
<mrvn> 8ms seek time kills you
<heat> i would be kind of concerned for clang though, when compiling .c files or so
<heat> something fast
<heat> 20ms could be around 10% of the compile time
<mrvn> I wish my compiles would take 200ms.
<heat> dont write C++ lmao
<mrvn> bingo
<heat> the C parts of my build are stupidly fast
<heat> the slowest parts usually involve C++ and some header-only library
<mrvn> totall killer
<heat> gtest tests are sloooooooow
<heat> nlohmann is also a POC
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<heat> my own C++'s build takes longer and even that is a super restrained, C with classes-like idiom
<heat> the cf workers runtime had files which took 40s to compile on top of the line CPUs
<heat> that whole codebase reeks of people who are not kernel developers :v
<mrvn> heat: there is another thing to consider. Every time you load a lib you have to relocate it to somewhere more or less random. Wouldn't it make sense to pick some place in the address space and put the library there every time? Then you could share the relocation pages and only have to relocate once.
<heat> no
<heat> if someone accidentally leaks an address you have just defeated ASLR for that library, for the whole boot, whole system
<heat> hell, you could craft your own program that does that, and then pwn sudo
<mrvn> assuming you have a bug to exploit
<heat> well, yes
<heat> ASLR is defense in depth
<heat> which is why it's important
<heat> that thing just leads to shitty ASLR (@windows)
<mrvn> Then you have to pay for it with the extra memory. :)
<heat> might as well not be there if anyone can leak it
<heat> yeah
<heat> it's cheap though
<mrvn> .oO(my words)
<mrvn> We could bring back segment registers and give every library a segment. Yippee, no more relocations.
<heat> i thought of that when you mentioned reserving a register
<heat> but it doesn't work, you pay for segment addressing
<heat> it's like an extra cycle
<heat> (but it would totally work, you have a fully unused %gs in x86_64 userspace)
<mrvn> Only because they didn't optimize it. More important is probably the prefix byte on the opcode
<heat> i would think they have optimized it
<heat> %fs and %gs addressing is heavily used for TLS/percpu
<mrvn> only once to load the base address into a regular reg
<heat> no
<mrvn> once in a while
<heat> mov %fs:variable, %reg
<heat> that's how you do things
<mrvn> heat: you can do that. If you have more variables you load %fs:0 and then use that as base
<heat> here's some funny codegen for my percpu inline asm
<heat> mov %gs:0x7fe8e16c(%rip),%rax # 0x338 <preemption_counter>
<heat> it's paying the cost of %gs: and the cost of rip-relative!
<heat> I haven't found a way to directly encode 0x338 in inline asm
<mrvn> and a 2GB offet?
<heat> yeah
<heat> it ends up working because it's running at -2GB
<heat> so -2GB + a few megs + 0x7fe8e16c = 0x338
<mrvn> why are you accessing 0x338?
<heat> it's a percpu variable
<heat> again
<heat> %gs:0x338
<mrvn> Ahh, yeah.
<heat> i would like a way to just encode 0x338 but I can't figure out what's the needed asm constraint here
<heat> which is *annoying*
<j`ey> write something in C?
<heat> I can't do mov %gs:preemption_counter because of symbol decoration
<heat> (well, not directly in the string)
<j`ey> make it extern C so it wont mangle?
<heat> maybe I could embed that in the DEFINE_PERCPU
<heat> I don't understand why it sometimes does mangling and other times it doesn't
<heat> like _ZL10bga_driver
<mrvn> you forgot extern "C"
<heat> what's _ZL10 for? local symbols maybe?
<heat> no
<heat> I haven't applied extern "C" anywhere
<mrvn> Is that one of the strings the demangling doesn't demangle?
<heat> for instance "thread *thread_queues_head[NUM_PRIO]" gets thread_queues_head
<heat> static struct pci::pci_id pci_bga_devids[] gets _ZL14pci_bga_devids
<heat> I think it really does it for local symbols
<heat> the L must mean local vis
<mrvn> namespace 14?
<heat> no
<heat> 14 = number of chars
<bslsk05> ​itanium-cxx-abi.github.io: Itanium C++ ABI
<mrvn> Z for pointer/size_t size?
<heat> no
<heat> it's just The Prefix
MiningMarsh has quit [Ping timeout: 268 seconds]
<heat> I really like my thunks' symbol names
<heat> _Z21__sys_symlinkat_thunkmmmmmmm
<heat> they make me hungry mmmmmmmmmmmmmmmmmm
FireFly is now known as Luci-ghoule
<heat> this demangles into __sys_symlinkat_thunk(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) if you're wondering
tarel221 has quit [Ping timeout: 252 seconds]
MiningMarsh has joined #osdev
roan has joined #osdev
<heat> mjg, can you confirm that st_blocks works normally in freebsd zfs?
<heat> I've read the code and it does seem to make sense
<heat> but someone was complaining that a file hole test was failing
<mjg> with understanding that st_blocks
<mjg> blkcnt_t st_blocks; /* Number of 512B blocks allocated */
<mjg> then yes
<mjg> this is not the number of blocks configured
<mjg> as in actual block size used by the fs
<mjg> which is funny given that st_blksize is next to it
<bslsk05> ​github.com: capnproto/filesystem-disk-test.c++ at 7819d247c6c24d04226ed1954c43432b7d2e8c52 · capnproto/capnproto · GitHub
<heat> this seems BS
<mjg> there is a delay between writes and getattr being able to see the update
<mjg> i don't know if this is zfs on freebsd specific or zfs in general
<mjg> also would help if they noted the versions at hand

<mjg> will check it out later
<heat> would datasync not supposedly sync that?
<heat> sounds like a broken datasync otherwise
<mjg> do they sync?
<mjg> as i said will take a look later
<heat> yeah
<bslsk05> ​github.com: capnproto/filesystem-disk-unix.c++ at 045d5ff0e50cd044c1f05925789d5c3e46d96d21 · capnproto/capnproto · GitHub
<heat> maybe fdatasync isn't the call here
<heat> probably isn't
<heat> I can see how fdatasync could skip updating the metadata immediately
SGautam has quit [Quit: Connection closed for inactivity]
<mjg> zfs_getattr is quite complicated, with going down to the object layer
<mjg> it is quite sad how inefficient it is, given how often it is being called
<heat> what's freebsd's strace?
<mjg> truss
<heat> bah
<heat> this is so fucking weird
<mjg> > In fact, in benchmarks, Cap'n Proto is INFINITY TIMES faster than Protocol Buffers.
<mjg> i have doubts about this codebase
<heat> the guy wrote protocol buffers
<heat> i guarantee you he knows what he's doing
<heat> anyway, I can repro their issue pretty easily
<heat> for new inodes, st_blocks = 1
<heat> if I don't unlink and open(O_TRUNC), I get a correct block size
<heat> fsync, fdatasync don't seem to make a difference
<heat> s/block size/st_blocks/g
<heat> not even sync(2) makes a difference
<mjg> can you repro the problem, except add sync() + sleep 10
<mjg> before you stat
<heat> what kind of cursed filesystem is this
<heat> that fixes it
<mjg> note sync() returns immediately, so if you stat immediately after you are racing it
<heat> wtf
<heat> ;_;
<mjg> unix
<heat> linux's doesn't
<mjg> According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return
<mjg> before the actual writing is done. However Linux waits for I/O completions, and thus sync() or syncfs()
<mjg> provide the same guarantees as fsync() called on every file in the system or filesystem respectively.
<mjg> indeed it does not
<mjg> so what happens if you fsync?
<heat> jack shit
<mjg> :)
<mjg> i'm gonna take a look later, time for a nap
<heat> ok
<heat> do you want a bug report?
<mjg> if you care to write one make it against openzfs
<mjg> on github
<mjg> preferably with an ez reproducer
dude12312414 has joined #osdev
<mjg> it should be easy to add a flag to make sync() blocking optionally
<heat> does fsync block?
<heat> in paidbsd
<mjg> it is supposed to but i never needed to test
<mjg> what does zfs on linux think about it
<heat> I don't use zfs
<heat> I explicitly reinstalled my freebsd vm on zfs to test it
<heat> I'm a simple man, I use simple filesystems for simple people
k8yun__ has joined #osdev
Raito_Bezarius has quit [Ping timeout: 268 seconds]
<heat> mjg, cc'd you on the issue
<Maja[m]> hey, just curious – do any of your OS's support suspend? 👀
<heat> no
<heat> I've looked at it for a bit but it just doesn't make much sense to support it
<heat> power management is finicky
<Maja[m]> yeah I've mentioned it in passing and a friend of mine has very kindly treated me to a traumatic flashback^W^W infodump
<Maja[m]> I think I'll just try to be able to boot very fast
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<heat> Maja[m], S3 resume is literally just a very fast boot
axis9 has joined #osdev
axis9 has quit [Quit: joins libera]
wxwisiasdf has joined #osdev
<wxwisiasdf> Hello :D
axis9 has joined #osdev
<heat> HELLO
<axis9> hello
frkzoid has joined #osdev
<mats1> herro
<heat> wxwisiasdf, hows the cobol mainframe person
<wxwisiasdf> very good
<wxwisiasdf> i am trying to remove all the C backend from my code so as to leave only COBOL, because COBOL requires heavy runtime support
<wxwisiasdf> eg. moving memory allocator to COBOL
<wxwisiasdf> unfortunately COBOL is dynamic so it's proving to be rather difficult
<heat> good shit
<heat> just kidding, scary ass shit
<heat> i have a question for god
<heat> WHYYYY
<wxwisiasdf> the unfortunate history of heat, who didn't use the superior hardware task switch mechanism
<heat> hardware task switching doesn't exist in x86_64
<heat> but I bet they would forget to swap swapgs there too
<heat> s/swapgs/gs/
<wxwisiasdf> it swaps gs automatically for you
<heat> but in x86_64 that just nulls your segment base
<heat> is it by design?
<wxwisiasdf> no idea ask the intel designers
<wxwisiasdf> "hey intel why you suck so much"
frkzoid has quit [Remote host closed the connection]
<heat> absofuckinglutely
<heat> why dont they just switch the stack on syscall
<heat> why the fuck
<heat> ?????????
<heat> it makes no god damn fucking sense
<heat> every time i think of x86 i get reminded that everything here fucking suckssss
<heat> why did they do a quirky adaptation of segment registers
<heat> and this is not even on Intel
<heat> this is on fucking AMD
<heat> the supposed saviours of x86
<geist> heat: simmer down
<mjg> chill pill would help
immibis_ has quit [Ping timeout: 265 seconds]
<mjg> heat: read a solaris internals to calm yourself
<mjg> *book
immibis_ has joined #osdev
<geist> also if you want to bitch about swapgs or whatnot, that's AMD's doing
<mjg> mismatched swapgs gang unite
<geist> but yes anything that toggles like that is a bad idea anywhere in hardware. architectures, device registers, etc
<geist> any toggling behavior is always a bad idea
<mjg> no no geist i'm sure it made sense! otherwise they would not do it
<mjg> #standarddefense
<geist> well it made sense in that they wanted to add exactly one instruction and had to keep all the old machinery around because x86-32
<geist> but they *could* have done it with two instructions: set gs aux, set gs reg or something
<geist> same mechanism, but not toggling and thus hard to fuck up
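For context on the toggling hazard geist and mjg are describing, a non-runnable sketch of an x86-64 `syscall` entry path; the `SAVED_USER_RSP` and `KERNEL_STACK` percpu slot names are invented for illustration:

```asm
// Sketch (not runnable): syscall switches neither the stack nor %gs;
// the handler has to do both by hand.
syscall_entry:
    swapgs                          // kernel gs base is now live
    mov  %rsp, %gs:SAVED_USER_RSP   // note: still on the *user* stack until here
    mov  %gs:KERNEL_STACK, %rsp
    // ... actual syscall work ...
    swapgs                          // forget this on any one exit path and the
    sysretq                         // next entry's swapgs loads the *user* base
```

Because `swapgs` toggles rather than sets, a single mismatched path silently inverts the kernel/user bases, which is the "mismatched swapgs gang" failure mode.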
<geist> sigh. that fatalbert guy keeps whining at me
<geist> if he could just control his own impulses it would work, but he just cant. it's really pathetic. finally had to hard /ignore him this time
<mjg> discord?
<geist> no, here on irc
<geist> banned him a while back, so about once a week he's been whining at me in privmsg
<geist> i eventually just hard ignored him
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<wxwisiasdf> sheesh
<mjg> solid
gog has joined #osdev
<heat> bonk
<gog> hi
<heat> after a nice nap I can say that 1) I was just pissed at x86 2) x86 really is user-hostile
<heat> >read a solaris internals to calm yourself
<heat> I would rather use zfs
* CompanionCube installs Oracle ZFS on heat
<zid> I was also installed on heat
<zid> while on heat* sorry
<geist> heat: yea but it's job security! once you know x86 you're a god
<geist> well, okay one of many but still.
<geist> Impress Your Friends!
<geist> Become the Life of the Party!
<mjg> you are asking for a meme
<mjg> lemme find th template
<geist> or one of those american psycho memes where they show off their kernel ported to different architectures
<mjg> well i found a photo from your meet up yesterday geist
<bslsk05> ​imgflip.com: They don't know - Imgflip
<geist> heat i know has taken the plunge and hacked on some non x86 arches, and has seen that the grass is slightly greener on other sides. or at least a more pleasant shade
<geist> haha
gog has quit [Ping timeout: 268 seconds]
<mats1> lol
<Griwes> heat, "I'd rather use zfs" doesn't tell us much, because zfs is delightful :P
GeDaMo has quit [Quit: Physics -> Chemistry -> Biology -> Intelligence -> ???]
<geist> honestly i'm probably the one person that doesn't belong to the zfs fan club
<geist> for no particular reason, to be honest, but it seems like it gets just a Little Too Much love
<geist> kinda like plan9 or something, where it feels a lot like folks just repeat 'omg that is awesome'
<geist> Griwes: and in no way am i implying that's what you are doing here, just getting it off my chest
<mjg> zfs is pretty great if you don't look too closely at it
Iris_Persephone has joined #osdev
<Griwes> Well, I've been running zfs as rootfs on every non trivially rebuildable system I have for years now, and I couldn't be happier
<mjg> but i admit i don't know how it really compares to btrfs
<Iris_Persephone> heya guys!
<Griwes> So at least I put my money where my mouth is :p
<mjg> crucial selling point for me is that there is no partitioning in a way which limits space
<geist> yeah i just get a little suspicious of stuff that is universally loved, etc. the more love something gets the more suspicious i am that it's just groupthink
<mjg> as in if i want to create a separate dataset at /lolo, i can just do it
<mjg> don't have to carve out any space for it
<Griwes> geist: yeah that's fair
<mjg> that turned to shitting on linux real quick
<mjg> :S
<Iris_Persephone> My Linux From Scratch didn't end up working out, so I thought I would follow Bare Bones and see where that takes me
<geist> well, it's educational if nothing else
demindiro has joined #osdev
<geist> honestly LFS was probably a lot simpler and more useful about 10-15 years ago when a basic system was init and a shell, and maybe one or two daemons
<geist> i *assume* it's moved on to more complex setups now
<zid> I'd only LFS from stripped down packages built on a gentoo host
<mjg> what's the point of lfs?
<zid> no idea
<zid> edutainment for getting it working I think
<geist> educational if nothing else. i think i learned a few things the first time i fiddled with it
<mjg> ye ok
<geist> exactly
<mjg> that used to be gentoo back in my day
<demindiro> Also smaller images I guess
<geist> you end up with something roughly BSD shaped, or gentoo stage 1/2
<geist> yah i loved the days back when you actually built the first few stages of gentoo
<zid> I should scour ebay for rams, speaking of building packages
<geist> gave you the same dopamine hit as LFS
<zid> I have some paypal dollars to burn but I refuse to overpay for this old ass ram I want
pretty_dumm_guy has quit [Quit: WeeChat 3.5]
<Iris_Persephone> I spent... god, it must have been weeks at this point
<Iris_Persephone> Eventually I decided tracking down all the errors wasn't worth my time
<geist> yah if it ceases to be edutainment, move on
<geist> you alas dont get an Achievement badge
<zid> That's why I never played that game, no achievement tracking
<Iris_Persephone> Gave me a new appreciation for Unix-likes, though!
<zid> okay I checked all of ebay.. maybe I should figure out how alerts work
<geist> oh no. that way leads to madness
<Iris_Persephone> What are you looking for?
<zid> 2x8GB 1866MHz UDIMMs
<Iris_Persephone> ah
<zid> or alternatively, 4x8GB 1600/1866Mhz URDIMMs
<Iris_Persephone> "Everything is a file" is remarkably intuitive, when you wrap your head around it
<Iris_Persephone> zid: "Great price" indeed!
<zid> To be fair the alternative just ends up being "everything is a handle"
<zid> which ultimately works the same, as files also have handles
<demindiro> I think "file" is a misnomer and "object" or "handle" is better
<demindiro> Because every time there's some guy who's like "but network sockets aren't files"
<zid> but unix just pretends various things *are* files, where other things would not
<zid> like, windows will let you open a com port and use read/write on it, but there isn't actually a COM2 `file` sat around anywhere with mknod
<geist> i think a lot of that has to do with whether or not all the things you get handles to are in a single namespace
<geist> that i think is kinda a differentiator
<zid> I think the thing that makes unix unix is pipes
<zid> windows is often remarkably hard to use because it isn't pipey
<zid> you download some random program, some other random program, and end up having to write your own third program to connect them
<zid> along with stuff like autohotkey
<demindiro> I think that's more of a "GUI everything" symptom
<demindiro> Automating GUIs instead of CLIs is harder IMO
Iris_Persephone has quit [Ping timeout: 248 seconds]
Iris_Persephone has joined #osdev
gog has joined #osdev
<Iris_Persephone> Pipes are *really convenient* too, I didn't realize *how* convenient until I started using Mint
<Iris_Persephone> I wonder... can I make an OS that is as "pipey" as possible?
<zid> You should invent a gui pipe system, like the node system in blender et al
wootehfoot has quit [Quit: Leaving]
<Iris_Persephone> I'd have to come up with some sort of standardized interface between programs, but that was a given if I wanted to do anything pipey
<Iris_Persephone> Wait, I just realized something
<Iris_Persephone> Protocols exist
freakazoid332 has joined #osdev
<gog> it's basically drag and drop
<gog> only without the dragging and dropping
elastic_dog has quit [Ping timeout: 264 seconds]
elastic_dog has joined #osdev
Iris_Persephone has quit [Ping timeout: 252 seconds]
<heat> IM BACK
<heat> Griwes, zfs has insane fsync semantics apparently
<heat> which doesn't bode well for the fs
<bslsk05> ​github.com: [FreeBSD] Creating + writing to a file + stat() in quick succession returns bad st_blocks · Issue #13991 · openzfs/zfs · GitHub
<heat> not only on freebsd but also repros on linux
<heat> geist, grass is very much greener
<heat> i dont understand the weird decisions and inconsistencies
<heat> is it to save transistors or something? idk
Iris_Persephone has joined #osdev
<Iris_Persephone> gog: although that begs the question, "if it's so simple why hasn't someone else implemented it?"
<heat> because it's not a great idea
<heat> pipes are slow
<gog> probably because for a GUI application it's not very intuitive
darkstardevx has joined #osdev
<gog> like when do you output to the pope
<gog> pipe
<heat> fuck the pope
<gog> yes
<heat> the unix philosophy is a bad idea
<gog> and how long does an application wait for input from the pipe (i typed pope again wtf)
<heat> everything is a file is the worst one
<gog> heat's been reading the UNIX-Haters handbook
<gog> such a good boy
<heat> everything is a file was also not in the OG UNIX philosophy
<heat> it was retrofitted in
darkstardevx has quit [Remote host closed the connection]
<bslsk05> ​'"What UNIX Cost Us" - Benno Rice (LCA 2020)' by linux.conf.au (00:34:14)
<heat> great watch
<gog> yes, well-defined interfaces are bad. let's slap everything inside ioctls and make it look like a file
darkstardevx has joined #osdev
<Iris_Persephone> To be perfectly honest
<heat> the worst parts are when they go half write/read and then give up and ioctl the shit out of it
<Iris_Persephone> half of this is just pure spite at Voicemeeter and AutoHotKey
<heat> THANKFULLY they now understand they can just return a fd from syscalls that are not named open
<heat> which is why pidfd, etc are decent interfaces
<gog> yeah the excuse of "it keeps a generic interface for I/O!!" like
<gog> no
<gog> it fucken doesn't
<gog> if you have to use ioctl at all you've negated the benefit
<heat> ioctls are not bad
<gog> i suppose not
<gog> but you still have to use a file descriptor with them
<Iris_Persephone> I would give my firstborn child to be able to pipe my audio/video/whathaveyou to and from a program as easily as using "tee"
<gog> so the big abstraction is still there for the benefit of the little one
<heat> char buffer[200]; sprintf(buffer, "/proc/%d/comm", pid); open(buffer) is horrible
<heat> dog shit
<heat> it's also like half the linux libc
<heat> Iris_Persephone, not doable
<heat> audio and video is low latency stuff
<Iris_Persephone> Yeah, that figures
<heat> video is bandwidth heavy and needs acceleration
<heat> it turns out the whole UNIX philosophy thing of making small, composable programs with pipes and whatever the fuck isn't a great idea
<heat> because it's slow and a poor fit
Celelibi has quit [Ping timeout: 260 seconds]
<heat> so people just added options to already existing programs
<heat> like cat -n
<Iris_Persephone> So, the slowness is *inherent* to the concept of pipes?
<heat> yes
<demindiro> no
<heat> lots of copying and IPC
<heat> yes
<heat> absolutely yes
<heat> you're not getting anything performant out of pipes
<demindiro> How performant does it need to be?
<demindiro> Piping works plenty well for lots of stuff
<heat> on real services? it totally does
<heat> I've seen people suggest https as httpd | tlsd or whatever
<heat> ridiculous shit
<CompanionCube> does anyone really care about st_blocks tho?
<heat> I do
<heat> fsync() should make sure metadata is written back
<heat> also kenton too
<heat> he has a file hole test which needs it
<Iris_Persephone> well shit, there goes that idea
<heat> do things -> fdatasync -> check st_blocks -> expect st_blocks to be valid
<demindiro> iris_persephone: if redundant copying is a concern, consider using shared buffers and passing small messages between processes instead.
<heat> demindiro, and your unix philosophy goes out the door
<demindiro> No
<heat> yes
<demindiro> Shared buffers are very generic
<heat> it all revolves around pipes
<demindiro> Message can just be "this offset + length go figure"
<heat> unix revolves around simple utilities that are easily composable using |
<heat> on the shell
<demindiro> And you can do the same with shared buffers
<heat> it worked great in the 70s
<demindiro> Just involves a little more setup
<heat> again, unix philosophy goes out the door
<kof123> eh, if anyone cares somewhere in unix interviews there is someone who had a fancier pipe idea that was deemed too complex, would look at that, not very helpful i know :D
<kof123> so i would lean towards heat it wasn't original, just based on that
<kof123> *in the original
<CompanionCube> does btrfs has less weird semantics
<heat> idk
<heat> hope so
<kof123> i mean, it might be a horrible idea, just if someone was looking into pipes
<kof123> maybe it was left out for very good reasons
<heat> "This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."
<kof123> misspoke, you said everything is a file was not the original, my bad
<heat> take your data, serialize it to text (but not json because that isn't streamy), send it over a pipe, receive it on the other program, deserialize it, ...
Celelibi has joined #osdev
<heat> according to this your C compiler would be a 200 line piped up invocation|
<heat> !
<demindiro> I guess I don't take the UNIX philosophy as literally, you can do e.g. lz4 blah.txt | pv | lz4 -d
<demindiro> and lz4 outputs binary data
<Iris_Persephone> I suppose I don't want to *strictly* follow the Unix philosophy, I just find the concepts useful
smach has joined #osdev
* kof123 prepares to throw gasoline on the fire
<kof123> piping audio and video...wouldnt that just be remote importing /dev/xyz on remote machines?
<gog> nothing is a file
* kof123 runs
<gog> that's my philosophy
<heat> kof123, very plan9 of you
<kof123> yes :D
<kof123> "simpsons did it!" "plan9 did it!" lol
<CompanionCube> heat: iirc the thing with the metadata is that it doesn't know the on-disk size until the TXG commits. fsync only writes back to the on-disk ZIL.
<heat> that sounds like fsync and sync are broken
<CompanionCube> no it's just write-ahead logging as you would find in a database
<CompanionCube> you can make fsync properly broken by setting sync=disabled
<heat> fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device
<heat> (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.
<heat> after fsync of zfs your st_blocks metadata isn't consistent at all with how things look like or will look like
<heat> fsync should be a proper barrier
<heat> what's fsync for if you can't guarantee consistency on metadata, the inode, and the disk itself?
<heat> the problem used to be the lack of guarantees on synchronization for other API functions
<CompanionCube> oh hey, found a btrfs thread on this problem: https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00000.html
<bslsk05> ​lists.gnu.org: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in
<heat> now these fancy filesystems can't even guarantee fsync synchronizes?
<heat> good, fuck it
<gog> yes
<gog> move beyond the need for files
<heat> "until the data are synced"
<heat> at least it works on fsync
<CompanionCube> single-level storage when
<heat> (hopefully)
<Griwes> is on-disk ZIL lost on reboot?
<Iris_Persephone> Stupid newbie question: What is an filesystem object, if not a file?
<demindiro> generic data blob, I suppose
<CompanionCube> Griwes: no, the one lost on reboot is the in-memory ZIL.
<Griwes> if no (and I don't think it is), then this behavior fulfills the behavior of fsync you quoted vOv
<Griwes> so everything is fine then
<heat> it's not
<clever> Iris_Persephone: for zfs, there are a lot of fs metadata objects, and block devices (zvol's) are also one big object
<heat> metadata isn't synced
<clever> but files are also objects
<heat> data isn't even fucking sync'd
<CompanionCube> async writes go to the in-memory ZIL but not the on-disk one
<heat> how can you call *sync* and have it not be sync
<Griwes> > so that all changed information can be retrieved even if the system crashes or is rebooted
<heat> committing something to zfs's magic super duper table of awesome things is not guaranteeing synchronization
<clever> heat: with zfs, i believe sync will write it to the on-disk ZIL (journal), so it can return faster
<Griwes> this seems to be the core part of fsync by the quoted definition
<heat> and that's the fucking problem clever
<heat> it's sync, not a race
<clever> heat: but i think the ZIL is basically just a record of VFS calls, write/delete/symlink
<Griwes> so you are arguing the definition of fsync you quoted is wrong
<heat> I am
<heat> fsync and sync and fdatasync need to be *properly done*
<heat> not POSIX's-vague-definition done
<Griwes> okay so you're saying the unix (or at least linux) definition of fsync is bad
<heat> yes!
<Griwes> stop blaming zfs for that then!
<demindiro> IIRC macOS even just ignores fsync for the sake of performance
<demindiro> Which apparently is valid
<heat> and you'll know that filesystem fuck around with the semantics of fsync
<clever> demindiro: i have seen macos bugs, where a file that was written via mmap() and not sync'd, reports having holes when you use the seek-hole function
<heat> this is a contentious subject
<heat> please, make it right.
<clever> demindiro: cp then just skips the first 8mb of the file, and doesnt even bother copying it
<Griwes> I don't have a full opinion on whether the fsync definition is sufficient or not, and I can see merit in an argument that it is wrong
smach has quit [Remote host closed the connection]
<Griwes> but that doesn't make it the FS' fault
<mjg> macos and performance?
<heat> it is
<clever> demindiro: and by pure chance, i then saw nearly the identical bug on linux+zfs a week later, lol
<heat> they implement fsync!!!
<Griwes> would you prefer them to *not* implement fsync?
<heat> better than pretend it does anything remotely useful
<mjg> if lolo fsync provides no guarantees i would prefer the software to know
<Griwes> but it does what it says on the man page lol
<kof123> i was looking at raidframe to steal ideas...it it way too much for me, but a couple gems in the source: @echo "RAIDFRAME R00LZ D00D" >> /dev/null Set this to 1 if you are the CMU Parallel Data Lab. You probably aren't.
<heat> it's like ext4's fsync just writing journaling data and fucking off
<Griwes> yeah? it's implementing a unix FS protocol lol
<Griwes> blame unix or whoever the heck first defined fsync like this
<CompanionCube> tbf at least opposing ext4 journaling fsync is being consistent
<heat> do you think that guarantees consistency?
<Griwes> again: I see merit in your argument but you are misplacing the blame for the fault in the API that you perceive
<heat> I expect sane filesystem semantics
<gog> why
<Griwes> I expect a driver to implement the API it is asked to implement
<heat> if they can't provide it, fix it. I called fsync, I want fsync, I want consistency
<Griwes> and if that API is shit, it's not the driver's fault
<heat> there already is consistency
<heat> (ish)
smach has joined #osdev
<heat> ext4 delays block allocation but st_blocks is always correct
<Griwes> you called fsync, but you wanted something else than what the manpage for fsync says will happen
<heat> now, I understand zfs's set of constraints is totally different
<clever> heat: zfs compresses, so the on-disk size can vary
<heat> so, make it work on fsync
<clever> CompanionCube: is the ZIL before or after compression?
<heat> ^^^^
<Griwes> expecting more than what the manpage says is on you, not on the driver
<heat> Griwes, the manpage is irrelevant here
<heat> the manpage is written based on the code and behavior of fsync
<CompanionCube> clever: that's actually a good question, more so how that interacts with immediate writes, which are also a thing i forgot to mention
<Griwes> heat, okay. what does POSIX say?
<heat> posix is also irrelevant
<heat> sorry
<Griwes> no
<heat> yes
<heat> absolutely yes
<Griwes> because POSIX is the spec that they are implementing
<pitust> its not for xnu
<heat> you can't write a whole OS looking only at POSIX
<heat> because they are purposefully vague
<CompanionCube> given that https://github.com/openzfs/zfs/issues/8896 exists, seems answer is that compressed ZIL writes aren't a thing
<bslsk05> ​github.com: Request: compress ZIL writes · Issue #8896 · openzfs/zfs · GitHub
<heat> like, look at this shit
<heat> The sync() function shall cause all information in memory that updates file systems to be scheduled for writing out to all file systems.
<heat> The writing, although scheduled, is not necessarily complete upon return from sync().
<heat> ???????????????????????????
<Griwes> this is time for us to agree to disagree because you have a fundamentally different PoV on what you are supposed to do when implementing an interface with a formal spec than I have and we will not agree on this
<heat> the kernel doesn't need to "just implement POSIX"
<heat> it never has, that was never the purpose
<heat> POSIX was defined by the kernels
<heat> still is
* Iris_Persephone pokes her head into the chat
<Iris_Persephone> is it over yet?
<Griwes> POSIX is defined by a spec created by the austin group and then ratified by multiple international standards orgs
<demindiro> Be careful not to get decapitated
<Griwes> but let's agree to disagree
<heat> if you don't provide a useful fsync() then you're doing a bad job
<heat> Griwes, spec that is built by... multiple operating systems in order to form a consensus
<Griwes> if you expect fsync to do more than is documented as its user, you're doing a bad job
<heat> ok
<clever> CompanionCube: my rough understanding, is that the ZIL is basically just a log of these VFS ops: https://github.com/openzfs/zfs/blob/master/include/sys/zil.h#L142-L167
<bslsk05> ​github.com: zfs/zil.h at master · openzfs/zfs · GitHub
<heat> tell me how to do a better job?
<heat> sleep(10)?
<heat> the filesystem doesn't give me the tools to do a better job
<clever> CompanionCube: either with the data inline (in the ZIL itself) or out of line (in the main pool, in "free" space)
<CompanionCube> yep
<CompanionCube> as always, utcc has good stuff on the ZIL.
<Griwes> heat, *why* do you need to know the number of blocks that you've just written?
<Iris_Persephone> I should really read POSIX when I get around to it, huh
<heat> you know why all this sync stuff is contentious shit in filesystems?
<heat> because they're scared people fsync too much and it "hUrTs PeRfOrMaNcE"
<heat> that's literally it
<Griwes> you're not answering the question
<clever> CompanionCube: yep, https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZILSafeDirectWrites this explains why writes to "free" space are safe without updating free-space maps
<heat> there have been huge debates in linux about it
<bslsk05> ​utcc.utoronto.ca: Chris's Wiki :: blog/solaris/ZFSZILSafeDirectWrites
<CompanionCube> clever: and the backlinks are good too
<clever> basically, the in-memory maps are updated, so the live system is aware of it
<heat> Griwes, to know how much disk I've used, file holes, etc
<clever> and during recovery, you read the ZIL and rebuild that in-memory state
<Griwes> "sync" being "the data is on the permanent storage device so it doesn't get lost" seems like a perfectly fine definition to me
xenos1984 has quit [Read error: Connection reset by peer]
<clever> CompanionCube: and from what ive seen, the ZIL is entirely optional, assuming there is nothing to recover, you could just do every write directly to the pool itself
<Griwes> heat, why do you need to know _exactly_ how many blocks you've used?
<heat> Griwes, the metadata doesn't even add up!
<heat> <heat> Griwes, to know how much disk I've used, file holes, etc
<clever> the only purpose of the ZIL is to make sync writes finish faster
<heat> of fucking course
<Griwes> the first part of your answer is why I'm asking the question
<heat> it's a performance thing!
<Griwes> lol
<Griwes> anyway I'm out
<heat> it's not a "Lets just do the bare minimum for posix"
<Iris_Persephone> If you don't like it, just implement it differently in your OS...?
<Iris_Persephone> Is that not practical?
<heat> no
<heat> I'm not realistically daily driving my OS
<CompanionCube> clever: mhm
<heat> if you don't fight for change you just get stupid behavior all the way
<Griwes> (we're at the point of disagreeing what constitutes "stupid behavior")
freakazoid332 has quit [Read error: Connection reset by peer]
<clever> CompanionCube: i also have theories on how a bootloader can cheat zfs writes, what if i just append an entry to the ZIL?
smach has quit [Ping timeout: 265 seconds]
<CompanionCube> clever: do you really need to cheat zfs writes when there's enough reserved space to be used
<mrvn> heat: The number of blocks something uses becomes basically meaningless when you have snapshots (is the space = total / snapshots using the block?) and cow (a freshly edited file can take 3 times as much blocks and go down to 1 times as things are flushed).
<clever> CompanionCube: then i need to update the spacemaps, and all of the indirect blocks in several objects
<demindiro> ZIL is a separate device no?
<CompanionCube> demindiro: not per se
<clever> demindiro: it can either be in the main pool, or its own device
<CompanionCube> the term 'SLOG' refers to the latter, but ZIL applies to both as well as the in-memory one
<clever> demindiro: under normal conditions, the ZIL is write-only
<clever> to make a write() or sync() finish asap, it syncs the data to the ZIL on a disk, but also keeps a copy in ram
<clever> and then it can update the proper pool at a later time, from the copy in ram
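The write path clever describes can be sketched as a toy intent log: a sync write becomes durable as soon as the log append lands, a copy stays in RAM for the later pool update, and the log is only ever read back on crash recovery. All names and layouts here are hypothetical stand-ins, not ZFS's real ZIL format.

```c
/* Toy sketch of a ZIL-style sync-write path (hypothetical structures,
 * not the real on-disk format). */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RECORDS 64
#define REC_SIZE    128

struct log_record {
    long   offset;            /* file offset the write targets */
    size_t len;
    char   data[REC_SIZE];
};

struct intent_log {
    struct log_record rec[MAX_RECORDS];
    int count;                /* append-only: records are never rewritten */
};

static char pool[4096];       /* stand-in for the "real" pool state */
static char ram_copy[4096];   /* in-memory copy, source for the later flush */

/* sync write: durable once the log append "hits disk"; RAM keeps a copy */
void sync_write(struct intent_log *zil, long off, const char *buf, size_t len)
{
    struct log_record *r = &zil->rec[zil->count++];
    r->offset = off;
    r->len = len;
    memcpy(r->data, buf, len);
    memcpy(ram_copy + off, buf, len);   /* normal reads come from RAM */
}

/* later, a background transaction flushes the RAM state to the pool */
void flush_txg(void) { memcpy(pool, ram_copy, sizeof(pool)); }

/* after a crash the log is replayed to rebuild the lost state */
void replay(const struct intent_log *zil)
{
    for (int i = 0; i < zil->count; i++)
        memcpy(pool + zil->rec[i].offset, zil->rec[i].data, zil->rec[i].len);
}
```

Under normal operation only `sync_write` and `flush_txg` run, which is why the real ZIL is write-only outside of recovery.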
<mrvn> heat: there is also fdatasync(). Sounds like the ZIL makes fsync behave like fdatasync
<heat> yes
<heat> the difference between fsync and fdatasync is that metadata is supposed to be written and consistent
<bslsk05> ​github.com: Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency by behlendorf · Pull Request #12724 · openzfs/zfs · GitHub
<mrvn> BUT: "Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk."
<heat> yes, that makes sense
<clever> from memory, a normal sync in CLI did mask this bug
<mrvn> You could argue that the block counts etc is part of the directory?
<heat> could I?
<mrvn> .oO(Well, you can argue anything)
<mrvn> The manpage refers to inode(7) for what metadata is flushed by fsync() but I don't know how normative that is.
<clever> mrvn: for zfs, the blocks are part of the dnode (similar to an inode), and a directory is just a name->dnode-number mapping
<clever> mrvn: but the ZIL lets you ensure data is secure on disk, without updating the dnode
<mrvn> clever: Which fits with "transfers all modified in-core data of the file referred to by the file descriptor fd to the disk device"
<clever> mrvn: the dnode table is also an object in zfs, with its own indirect block tree, and a dnode at its root
<mrvn> clever: except it's not "all"
<heat> st_blocks is part of metadata per inode(7) if we want to go down that route
<heat> exactly.
<mrvn> clever: heats argument is that "st_blocks" is not written to disk, nor even reported correctly from ram
<heat> yup
<mrvn> heat: but what should st_blocks be? I have a file that takes up 4 blocks so st_blocks is 4. Now I overwrite 1 block so with COW the count goes up to 5? 2 flushes later the old block is removed and the count goes back to 4?
<mrvn> heat: or with journal and compression I write a block the count goes up to 5 to account for the journal block. Then copression takes it down to 2 blocks total and it drops to 2?
<heat> i don't know
<heat> that's up for the fs to decide
<heat> not like I've thought much about that particular question anyway
<mrvn> heat: well, you seem to disagree with the decision zfs made.
<heat> no
<heat> zfs has a decision regarding st_blocks
immibis_ has quit [Ping timeout: 264 seconds]
<heat> it's pretty clear
<heat> you do get st_blocks values after everything is flushed per sleep 10
<heat> the problem is that it's nowhere near consistent
<heat> and fsync, sync, fdatasync, which should serve as barriers for in-memory and on-disk consistency, don't
<mrvn> But that's the problem with journaling, COW and compression. The block count changes over time.
Iris_Persephone has quit [Ping timeout: 268 seconds]
<heat> sure
<mrvn> fsync() is more about the data being retrievable.
<mrvn> it wasn't designed with metadata that changes on its own over time.
<heat> I would say it's about consistency
xenos1984 has joined #osdev
<heat> "the only purpose of the ZIL is to make sync writes finish faster" <-- having a structure to make syncs faster isn't a particularly good goal
<heat> if it jeopardizes the consistency
<clever> heat: all writes to an fs are also serialized within the ZIL
<clever> because the ZIL is a singly linked list, and each write appends to it
<heat> sure
<mrvn> heat: does it violate "so that all changed information can be retrieved even if the system crashes or is rebooted."?
<heat> I'm not saying your fancy ass filesystem doesn't have top tier consistency in practice
<heat> no
<heat> but again, manpages reflect reality, not the opposite
<mrvn> heat: so your only problem is that the metadata doesn't become stable with fsync() but keeps changing for a while after that.
<heat> you want to enforce sane, useful semantics for your filesystem so that things Just Work, are safe, make sense
<heat> it's not "do the bare minimum"
<Griwes> manpages reflect the behavior a user can depend on
<mjg> i remember zfs is doing something nasty to keep fsync operational without demolishing perf
<mjg> and it may be that it is the zil
<Griwes> a user depending on more than the manpages say they can depend on is a user problem
<heat> Griwes, I'm talking about changing the implementation :)
<mjg> i'm only slightly familiar with zfs, mostly from its pessimal multicore code :[
<Griwes> changing the implementation would not stop meaning you can only depend on what the manpages say you can depend on
<heat> sure
<Griwes> you are presenting all this as a bug
<heat> make everyone do that, then change the man page
<mrvn> Poll: Who here uses fdatasync() in their code because they only care about data integrity?
<Griwes> it's not; it's a feature request
<heat> it is a bug, there's no consistency in memory
<Griwes> it's a feature request to widen the guarantees of fsync
<heat> I'm saying they're not wide enough for normal usage
<Griwes> and you are asking to widen the guarantees
<heat> this is all happening because they wanted to make sync faster
<Griwes> that's a feature request
<heat> they regressed it
<heat> it is in fact a bug, a regression. it was fine, now it's not
<heat> this is something that needs to be fixed and not added
<Griwes> they regressed the manpage?
<heat> ...
<heat> i give up
<Griwes> did the manpage promise you more earlier?
<Griwes> if it didn't, then this is a feature request
<heat> it's pointless
<mrvn> Griwes: that depends on the interpretation
<Griwes> look, I work on a formal standard
<Griwes> and I deal with that distinction regularly
<mrvn> Griwes: it doesn't flush (or even compute in ram) the metadata
<Griwes> if it wasn't in the spec that you can depend on the behavior you want, it's not a bug
<mrvn> and it's more the "compute in ram" part heat objects to, I think.
<Griwes> mrvn, is it required to?
<mrvn> Griwes: yes
<Griwes> by?
<heat> an operating system is not a fucking language
<heat> you do realize that right?
<mrvn> Griwes: I only have the manpage open but the specs surely say something similar: "As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7))."
<Griwes> we're talking about an API that has a spec
<heat> POSIX specifies the bare minimum
<heat> the absolutely bare fucking minimum
<heat> can they just comply to that? oh of fucking course
<heat> is it useful? absolutely not
<mrvn> heat: have you actually looked into the specs to see what it says about the inode metadata?
<heat> it says nothing
<heat> POSIX doesn't know what an "inode metadata" is
<heat> POSIX will never make stronger guarantees about fsync, or anything else
<heat> it's purposefully generic
<heat> so every god damn UNIX out there can say they're POSIX(r) Compliant(r)(tm)
<mrvn> heat: surely it must say something about fsync() vs. fdatasync()
Iris_Persephone has joined #osdev
<Griwes> > This field indicates the number of blocks allocated to the file
<heat> "The fsync() function shall request that all data "
<Griwes> okay, are blocks allocated to the file while the write is in ZIL?
<mrvn> Griwes: arguably.
<heat> "The fdatasync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state."
<Griwes> mrvn, but that's just ZIL blocks. the actual blocks haven't been allocated yet
<Griwes> I'd argue that the only bug is that it returns 1 and not 0
<mrvn> Griwes: it's space on the disk
<heat> then that's the wrong fsync behavior
<heat> why is zfs autonomously saying "fuck standard behavior" and trying to just be faster
<mrvn> Griwes: same with journaling and COW copies. But then the block count changes over time. So it's not that useful.
<clever> Griwes: ZIL can store data either in the log itself, or in the main pool, in both cases, there are data blocks assigned to that data, but the free space maps havent been updated
<Griwes> mrvn, it also says "the unit may differ on a per-filesystem basis" - wonder if that means the filesystem can use two units depending on the file's state
<Griwes> clever, but arguably those blocks are assigned to the ZIL entry and not to the file itself
<clever> Griwes: zfs does allow the block-size for every file on an fs to differ
<Griwes> idk, looks like a feature request to me
<mrvn> Griwes: try computing st_block on a unionfs where each branch can have a different block size.
<heat> st_block is in practice in 512byte units
<mrvn> Personally the st_block has been rather meaningless for decades now.
<heat> de facto
<heat> anyway
<heat> this is not worth debating
<clever> [root@amd-nixos:~]# stat /nix/var/nix/db/db.sqlite
<mrvn> heat: it's rather random content on NFS
<clever> Size: 139017216 Blocks: 309489 IO Block: 131072 regular file
<heat> mrvn, I would expect it to reflect the source file's blocks
<mrvn> clever: 1MB block size for the FS?
<heat> no
<clever> mrvn: looking...
<heat> 131072
<heat> 128KiB
<clever> yep, 128kb for that file
<Griwes> heat, the manpage says it can be FS defined vOv
<Griwes> user error, skill issue, file a feature request
<mrvn> heat: so for /nfs/foo it's 5 1MB blocks and /nfs/other/mount/bla it's 100 4k blocks?
<heat> i seriously can't anymore
<heat> it's like you're purposefully trying to piss me off
<clever> mrvn: but due to free space fragmentation, 1049 of the 128kb blocks, are fragmented, so they are stored as 3+ blocks each on-disk
<Griwes> that's probably slightly too sassy and I apologize but I've dealt with too many feature requests disguised as bug reports to not be that
<heat> broken behavior, developers' skill issue, bug
<clever> mrvn: yikes, thats out of a total of 1153 L0 blocks, 90% of the blocks are fragmented
<mrvn> heat: I agree with you that it's bad that st_block is reporting bad data. But fact is st_blocks has been total chaos for decades now. It's basically useless unless you know what FS you are checking.
<Griwes> well be glad I'm not a zfs dev because I'd just slam your issue as closed, not a bug, rtfm
<heat> what manual?
<heat> do I need to be intimately familiar with my filesystem's format??
Iris_Persephone has quit [Ping timeout: 265 seconds]
<Griwes> the manpages for fsync and inode
<mrvn> heat: for st_blocks? yes.
<demindiro> Is the block count even accurate? du -a random_text_file claims it's 5 blocks of 512 bytes, but the file itself is 2162 bytes large.
<dh`> isn't st_block defined to count in 512-byte blocks? or was that something we did to avoid chaos?
<demindiro> Which honestly doesn't make much sense to me
<demindiro> (also I use ashift=12)
<clever> mrvn: most of the holes on my disk are 8kb in size...
<mrvn> dh`: no
<clever> demindiro: check zdb as well, `zdb -ddddddddbbbbbbbb <dataset> <inode#>`
<mrvn> demindiro: 2162 rounded up to the next full block is 5*512.
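mrvn's arithmetic above is just the conventional st_blocks rounding: count in 512-byte units, rounded up. A minimal sketch (real filesystems may also count metadata, holes, or post-compression size, which is the whole debate here):

```c
/* st_blocks as du sees it: 512-byte units, rounded up.  Sketch only;
 * filesystems are free to report something else entirely. */
#include <assert.h>

long st_blocks_for(long size_bytes)
{
    return (size_bytes + 511) / 512;   /* round up to 512-byte units */
}
```

So a 2162-byte file reports 5 blocks even though it fits in 4.25.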
<demindiro> wait
<clever> demindiro: L0 blocks contain the actual data, L1 blocks are a list of L0 pointers and so on, if a block says gang then its fragmented into more pieces
<demindiro> uh
<demindiro> 4.5K    dux_notes.txt
<demindiro> hm
<dh`> for us it is: st_blocks The actual number of blocks allocated for the file in
<dh`> 512-byte units.
<dh`> makes me wonder if our zfs does it right
<demindiro> I'm trying to understand how du -ha arrives at 4.5K with 512 * 5
<Griwes> dh`, are writes in the write journal (ZIL for zfs) "allocated for the file"?
<clever> dh`: should st_blocks count metadata, like the indirect block tree?
<dh`> traditionally it does, yes
<mrvn> Another thing about st_blocks: When a file is small or the tail is small (or has small blocks in the middle) it can be included in the inode (or other metadata). How do you count that?
<clever> dh`: what about when data blocks are of variable size?
<dh`> As short symbolic links are stored in
<dh`> the inode, this number may be zero.
<heat> mrvn, ext4 fakes 1
<dh`> clever: the goal is to get du(1) to print useful values, so ultimately it's up to the filesystem to be helpful
<heat> in practice all relevant st_blocks are defined as 512-byte by the kernel
<clever> mrvn: zfs has a feature called embedded block pointers, if the entire file data (after compression) is under ~300 bytes, it just shoves the data in where the block#+checksum would have gone
<clever> [root@amd-nixos:/nix/var/nix/db]# ls -lhs
<mrvn> clever: one of the extrem cases
<clever> 152M -rw-r--r-- 1 root root 133M Sep 30 19:28 db.sqlite
<clever> due to gang blocks, this file has ~20mb of overhead
<dh`> seems perfectly reasonable
<heat> >isn't st_block defined to count in 512-byte blocks? or was that something we did to avoid chaos?
<clever> 183M -rw-r--r-- 1 root root 133M Oct 3 19:19 db.sqlite.2
<heat> POSIX doesn't specify anything
<heat> I think most implementations converged into 512
<clever> dh`: heh, all i did was cp it, and it gained more overhead!
<dh`> that, however, is not reasonable!
<mrvn> clever: should you report those 300 bytes as "1 512byte block"? Or sum up all the data + metadata of the file (some 300+32+something bytes) and round up to 512-byte blocks?
<clever> mrvn: good question
<clever> 135M -rw-r--r-- 1 root root 133M Oct 3 19:20 db.sqlite.2
<clever> dh`: changed my block size to 8kb, and now its only 2mb of overhead
<clever> but it still wound up with 16 fragmented L1 blocks
<mrvn> clever: why 8k and not 4k?
<clever> indirect blocks dont respect my block size
<clever> that could also be done
<clever> 137M -rw-r--r-- 1 root root 133M Oct 3 19:22 db.sqlite.2
<clever> mrvn: now it has 32 fragmented L1 blocks, because it has twice as many L0's
<mrvn> heat: did you see the note in stat "Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call."? So the stat data itself has never been deemed consistent anyway. :)
<dh`> whose stat is that in?
<mrvn> clever: Will that change over time as sqlite modifies blocks?
<mrvn> dh`: man stat on Debian
<mrvn> man 2 stat
<dh`> ah, linux
<clever> mrvn: once a file has 2 blocks, the block size is set in stone, but the CoW nature means the indirect blocks(L1/L2) are re-written on every change
<clever> mrvn: so if my free space situation improves, the L1 blocks will un-fragment, as they get modified
<dh`> one of the things I've been debating for years is whether it's acceptable for stat() to succeed and tell you the linkcount is 0
<clever> but, each file on a filesystem, can have a different block size
<mrvn> clever: but if you have 8k blocks and sqlite writes 4k changes then your L0 will fragment, or not?
<dh`> apparently linux thinks so!
<clever> mrvn: nope, zfs will do a read/modify/write cycle
<clever> pull 8k off the disk, modify half of it, then write 8k back to disk
<mrvn> dh`: more importantly what is the link count for a directory? The "normal" behavior lets you know the number of files and subdirs in that dir.
<clever> same reason you want to align your partitions, so your 4k writes are not straddling half of 2 4k blocks
<dh`> that's not more important, it's a completely different issue
<dh`> :-p
<clever> it still causes a read/modify/write cycle, but twice now!
<mrvn> clever: so it stores 4k zeroes and 4k data when you have a hole?
<clever> not sure what it does, if half a L0 is nulls, and compression is off
<clever> and now that youve reminded me, let me flip compression on
<mrvn> clever: with compression you get a lot more partially filled blocks.
Iris_Persephone has joined #osdev
<clever> 119M -rw-r--r-- 1 root root 133M Oct 3 19:28 db.sqlite.2
<clever> with lz4
<clever> and also, my L1 blocks are no longer fragmented
<demindiro> ZFS has ZLE compression which strips zeroes, runs before LZ4 IIRC
<demindiro> Ah half, ignore me
<clever> demindiro: ive yet to notice that, while implementing my own zfs driver
<clever> part of why i'm writing a driver, i'll discover half truths and out of date docs
<clever> if i dont parse it exactly right, garbage comes out!
<demindiro> Do you have a public repo somewhere?
<bslsk05> ​github.com: lk-overlay/zfs.c at master · librerpi/lk-overlay · GitHub
<clever> currently, its able to parse the vdev labels, find the most recent uberblock, load the MOS dnode, and partially load the root dataset dnode
<clever> to get directory traversal, i need to implement ZAP parsing
<clever> a fat-zap is basically a b-tree, hash the filename, use the hash in an index to find the block# with the name->inode entry
<clever> but if the filenames are small and the total size of the serialized structure is small, it instead becomes a micro-zap, no index, just a single block of name->inode pairs
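The micro-zap case clever describes is simple enough to sketch: no hash index, just one block holding a flat array of name -> object-number pairs, scanned linearly. The struct layout and names below are hypothetical, not the real on-disk mzap format.

```c
/* Toy micro-zap lookup: a single block of name -> dnode-number pairs
 * (hypothetical layout, not ZFS's actual mzap_phys_t). */
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct mzap_ent {
    char     name[50];   /* micro-zaps only hold short names */
    uint64_t obj;        /* dnode ("inode") number */
};

/* returns 0 if not found (0 is never a valid object number here) */
uint64_t mzap_lookup(const struct mzap_ent *ents, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(ents[i].name, name) == 0)
            return ents[i].obj;
    return 0;
}
```

Once the table outgrows a block or a name gets too long, the fat-zap's hash-indexed layout takes over.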
<clever> demindiro: only sha256 is validated currently, but zfs uses fletcher4 for metadata, so my more critical reads arent validated
<clever> and only lz4 decompression is supported
<clever> mrvn: oh right, and with 4kb blocks, compression is pointless
wxwisiasdf has joined #osdev
<clever> mrvn: block allocations, are done in units of ashift, 4kb for my current setup
<clever> so with 4k fs blocks, it takes 4kb of data, compresses it down, then stores it in a 4kb block!
<clever> with 8kb blocks, you need to halve the size of the data, for compression to actually be a benefit
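The rule clever just stated falls out of allocation granularity: space is handed out in units of 1 << ashift, so compression only pays off when the compressed size needs fewer whole units than the logical size. A minimal sketch of that decision (function names are my own):

```c
/* Compression only wins when it frees at least one whole allocation
 * unit of (1 << ashift) bytes.  Sketch of the decision, not ZFS code. */
#include <assert.h>
#include <stdint.h>

/* round a byte count up to allocation units of (1 << ashift) */
static uint64_t alloc_units(uint64_t bytes, int ashift)
{
    uint64_t unit = 1ULL << ashift;
    return (bytes + unit - 1) / unit;
}

/* store the compressed copy only if it allocates fewer units */
int compression_wins(uint64_t lsize, uint64_t psize, int ashift)
{
    return alloc_units(psize, ashift) < alloc_units(lsize, ashift);
}
```

With ashift=12, a 4 KiB block compressed to 3000 bytes still occupies one 4 KiB unit, so it is stored uncompressed; an 8 KiB block must compress to 4 KiB or less to save anything.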
gildasio has quit [Remote host closed the connection]
<clever> yep, i can confirm that in zdb, despite asking for lz4, a large chunk of the blocks arent compressed
<clever> zfs realized it wasnt a net-gain, so it just stored the uncompressed version instead
gildasio has joined #osdev
<heat> dh`: now that you're here, mjg told me to ping wrt rename and locks
<heat> <heat> what do you think of a per-filesystem rename lock?
<dh`> what about it? rename is not trivial and ~everyone out there has it wrong
<heat> why wrong?
<dh`> as far as I know there are only two correct ways to do it: one is with strict two-phase vnode locking and the other is with a per-volume rename lock
<dh`> so yes
<heat> linux is seqlocking these days
<heat> soooooo i guess it seems to work
<heat> although I don't fully understand how that works
<heat> how likely is it that openbsd is using a global mutex or something weird like that?
<dh`> openbsd almost certainly has the completely broken 4.4 code
<heat> lmao
<dh`> it is unlikely that they've changed it at all
<heat> what makes it broken?
<dh`> the 4.4 code will not deadlock at runtime (though mostly by accident) but it will cheerfully detach and lose sections of the namespace, so you don't realize anything bad happened until you get a fatal fsck failure
<dh`> and I've seen that failure be unrecoverable by fsck, too (fortunately on a crash machine)
<heat> ugh
<Iris_Persephone> I am now leaning towards "implement POSIX as closely as you can, except for the parts where POSIX is stupid"
<dh`> the 4.4 code just randomly locks and unlocks with no clear pattern
<heat> rename is not stupid tbf
<heat> it's just hard to implement
<dh`> it obviously meant something to someone at some point but it never made any sense to me
<dh`> there's a POSIX_MISTAKE in rename
<heat> are you talking about 4.4's path/directory code in general or just rename?
<dh`> which is: ordinarily if you rename a/b over c/d, it unlinks d and replaces it with b
<dh`> but if you first hardlink b and d, posix says this does nothing at all rather than unlinking b
<clever> 512 -rw-r--r-- 1 root root 126M Oct 3 19:45 db.sqlite.2
<clever> 80M -rw-r--r-- 1 root root 126M Oct 3 19:45 db.sqlite.2
<dh`> heat: rename in ffs
<heat> ah
<clever> heat: is this what you were saying? i ran `sync` in the CLI, and the file still claimed to be 512 bytes, then seconds later, 80M!
<dh`> 4.4's pathname and directory code in general is a mess but rename is quite extra
<heat> clever, ack
<clever> mrvn: also, because i raised my block size back to 128kb, lz4 doesnt have to work as hard to get savings (i fixed my free space)
<heat> I should be fancier with rename
<dh`> anyway, the fundamental nature of the problem is: if you rename a/b/c/d/e to f/g/h/i/j, you need to make sure e isn't an ancestor of j in the tree
<clever> it still has to shave off at least 4kb, but 4kb off 128kb is far easier
<heat> I'm doing unlinks + links as the base operations which seems unsafe
<dh`> this check is nonlocal and so requires nonlocal locking, hence the per-volume lock
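The nonlocal check dh` describes reduces to a walk from the destination's parent up to the root, failing if the source directory appears on that path. A toy in-memory version (a real filesystem holds the per-volume rename lock across this walk so the answer can't change underneath it):

```c
/* Rename loop check: would moving directory `src` under `dst_parent`
 * detach a subtree into itself?  Toy in-memory tree, not fs code. */
#include <assert.h>
#include <stddef.h>

struct dir {
    struct dir *parent;   /* NULL at the root */
};

/* nonzero if src is dst_parent itself or an ancestor of it */
int is_ancestor(const struct dir *src, const struct dir *dst_parent)
{
    for (const struct dir *d = dst_parent; d != NULL; d = d->parent)
        if (d == src)
            return 1;
    return 0;
}
```

The walk touches an unbounded set of directories between the destination and the root, which is exactly why per-vnode locks alone can't make it race-free.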
<heat> yea
<clever> heat: in the ZIL, rename is a recordable action, so it can just append that to the ZIL and fix the metadata later
<dh`> you also need to make sure that you don't violate the locking order rules, which can depend on the location of the subpaths in the tree
<clever> and things remain atomic and consistent after a power failure
<dh`> unlink and link is fine for renaming regular files; it's dirs that make trouble
<dh`> well, fine in the sense that if you crash in the middle you might be left with both names if you don't journal it
<heat> yea
<clever> demindiro: but then hardlink counts?
<clever> dh`: oops, ^
<dh`> clever: if you don't increment the link count while doing the link/unlink you can detect that a rename was in progress but that violates an intended invariant and can make you other problems
<clever> now that i say that, i dont know how zfs manages link counts....
<dh`> you know how the original softupdates you're supposed to be able to fsck in the background while the system starts up?
<clever> ah, in the bonus space on a dnode
<dh`> never enable that, it's not safe
<dh`> because of rename!
<heat> lol
<dh`> even with softupdates, rename must be fixed up afterwards by fsck or journal replay
<heat> love me some rename
<dh`> there's no sequence of writes that can make an operation atomic in two separate places at once
<clever> dh`: zfs can do that with both the zil and the txg
<clever> for the zil way, the rename is just appended to the log, on replay you try the rename again
<dh`> so the failure mode is: rename /tmp/foo /tmp/bar/baz, crash in the middle, go to clean tmp, rm -r goes into /tmp/bar/baz but comes back out into /tmp, so it thinks it's one layer deeper than it is, then it comes out of /tmp and starts erasing the whole system
<clever> for the txg, every transaction in that group must be commited to disk at once, and due to the cow/immutable nature, that involves creating a new set of state, where the rename has fully happened
<clever> and that new state doesnt take effect until you update the uberblock, the single root of truth
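The txg commit clever describes boils down to: build the new copy-on-write state off to the side, then make it live with a single root-pointer update. A crash before that update leaves the old state fully intact. A minimal sketch, with the uberblock reduced to one pointer:

```c
/* CoW commit sketch: nothing "happens" until the single root pointer
 * (standing in for the uberblock) is swapped.  Hypothetical names. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct state { char data[64]; };

struct state *root;   /* stand-in for the uberblock's root pointer */

/* build the new state as a copy; the old state stays untouched */
struct state *begin_txg(void)
{
    struct state *next = malloc(sizeof *next);
    memcpy(next, root, sizeof *next);   /* CoW: start from a copy */
    return next;
}

/* the commit is one pointer update: atomic from a reader's view */
void commit_txg(struct state *next) { root = next; }
```

Because every change in the group lands in `next` before the swap, a rename recorded this way is either fully visible or never happened.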
<dh`> clever: any reasonable journaling or snapshot-based solution deals with the problem
<clever> if you get interrupted, the new state basically never existed
<clever> yeah, this is both journaling and snapshotting
<clever> the ZIL is a journal, while the cow/immutable nature of the main pool is snapshotting
<gog> moo
wxwisiasdf has quit [Ping timeout: 260 seconds]
wxwisiasdf has joined #osdev
nyah has quit [Quit: leaving]
<wxwisiasdf> cobol is amazing
<bslsk05> ​en.wikipedia.org: Apple M1 - Wikipedia
<clever> wxwisiasdf: https://i.imgur.com/H1lkYxC.png
<wxwisiasdf> haha
gxt has quit [Remote host closed the connection]
Iris_Persephone has quit [Ping timeout: 252 seconds]
gxt has joined #osdev
<clever> epony: what do you have to screw up, that a simple usb-c hub can brick the system??
<heat> bus mastering?
<clever> heat: how?
<heat> idk
<heat> just throwing out some ideas
ThinkT510 has quit [Ping timeout: 268 seconds]
ThinkT510 has joined #osdev
Iris_Persephone has joined #osdev
<epony> clever, probably CPU debugging exposed on the USB port
<clever> epony: but why was that on the charging C port, couldnt it have been on any other C port?