<clever>
Qt even has a primitive that can kinda do this, deleteLater, but it's not multi-thread friendly
<heat>
right
<heat>
so it's useless :)
<heat>
the hard part is making it scale
<clever>
deleteLater is mainly so a class method can delete `this` without the destructor then becoming a use-after-free
<heat>
and the parts that mostly make it scale are still under patents
<heat>
or maybe they're not! it's a minefield
<heat>
who knows if you'll blow off your leg
<heat>
better step on it and see what happens
<clever>
and if i just come up with a solution on my own and implement it, i could still get sued??
<heat>
clever, you can totally delete this;
<heat>
yes
<heat>
patents baby
<heat>
well, IANAL but that's my understanding of it
<heat>
...I still don't understand the fucking point of patenting it but fuck IBM anyway
<clever>
heat: i can see 2 main parts of RCU that are costly, 1: the copying, 2: when to do the free
<heat>
yup
<clever>
what if i just put it on a timer, and free the object after 60 seconds? if any irq context is holding a reference that long, youve done something wrong
<heat>
linux has like 3 different RCU implementations
<heat>
that sounds bad
<heat>
you fire off an interrupt for every object?
<heat>
you can also get interrupts in RCU sections
<clever>
the rough view I've seen is that the rx irq on your NIC will then use RCU to parse iptables, and either block or accept the packet
<clever>
and RCU is used, so you can mutate the tables, without blocking all packet rx
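For reference, the pattern clever is describing looks roughly like this in Linux-kernel-style C (rcu_read_lock/rcu_dereference on the read side, copy + rcu_assign_pointer + synchronize_rcu on the write side). `struct ruleset` and `match_rules()` are made-up stand-ins for the real netfilter structures, and this assumes kernel context rather than being a standalone program:

```c
struct ruleset {
	int nrules;
	/* ... rule array ... */
};

static struct ruleset *active_rules;	/* published pointer */

/* rx irq / softirq path: no locks, never blocks */
static bool packet_allowed(const struct sk_buff *skb)
{
	bool ok;

	rcu_read_lock();
	ok = match_rules(rcu_dereference(active_rules), skb);
	rcu_read_unlock();
	return ok;
}

/* control path: copy, modify, publish, then free the old copy only
 * after every reader that might still see it has finished */
static int rules_update(const struct ruleset *next)
{
	struct ruleset *copy = kmemdup(next, sizeof(*next), GFP_KERNEL);
	struct ruleset *old;

	if (!copy)
		return -ENOMEM;
	old = active_rules;
	rcu_assign_pointer(active_rules, copy);
	synchronize_rcu();		/* wait out pre-existing readers */
	kfree(old);
	return 0;
}
```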
xenos1984 has joined #osdev
DanDan has quit [Ping timeout: 252 seconds]
<heat>
sure
<heat>
RCU is super abused all around linux
<heat>
your fd table is entirely made out of RCU
<clever>
ah, hadnt known that
<heat>
you can actually modify it concurrently
<mjg>
no you can't
<heat>
there's no lock around it
<mjg>
bro
<heat>
you can't?
<mjg>
where the fuck you taking this from
<heat>
source?
<clever>
heat: i assume there is a cmpxchg, so concurrent writes don't undo each other
<mjg>
shit scalability of fd allocation is a long standing sore point
<mjg>
posix requires that you hand out lowest fd number possible
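A sketch of why that requirement is awkward to scale: the lowest free fd has to be searched for, typically a first-clear-bit scan over the table's bitmap done under the table lock (hypothetical helper, not any particular kernel's code):

```c
#include <stdint.h>

/* Find the lowest free descriptor in a bitmap where a set bit means
 * "in use". Returns -1 if the table is full. */
static int find_lowest_free_fd(const uint64_t *bitmap, int nwords)
{
	for (int w = 0; w < nwords; w++) {
		uint64_t free_bits = ~bitmap[w];	/* 1 = free slot */
		if (free_bits)
			return w * 64 + __builtin_ctzll(free_bits);
	}
	return -1;
}
```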
[itchyjunk] has quit [Read error: Connection reset by peer]
<heat>
anyway, I'm going to take care of vm technical debt next
<heat>
I'm going to use a radix tree thing, not too unlike page tables, to do this
<heat>
vm_objects have been using a rb tree which is not ideal
<wxwisiasdf>
fair
<wxwisiasdf>
what would you recommend for a memory allocator btw
<heat>
what as in the allocator or an algo?
<wxwisiasdf>
algorithm
<heat>
i'm using slabs
<heat>
it works well
<wxwisiasdf>
okay
<heat>
well, it depends on what allocator you're talking about
<heat>
I have 3 allocators that work exactly like a stack :P
<wxwisiasdf>
well i just asked because when i implement memory allocators i usually just use a freelist
<wxwisiasdf>
not a freelist, just more like linked list
<heat>
plus a simple memory pool and a bootmem allocator
<wxwisiasdf>
oh
<heat>
yeah I have 5 allocators
<heat>
technically 6
<heat>
anyway it depends
<wxwisiasdf>
sounds messy to maintain ig
<wxwisiasdf>
i usually just have 2 allocators: 1 physical, 1 virtual (sometimes)
<heat>
if you want malloc you can do a slabish approach, or a buddy allocator
<heat>
there are multiple approaches
<heat>
more than these, really
<wxwisiasdf>
yeah but i mean on multicore
<wxwisiasdf>
what's a good approach to scalable multicore and smh
<wxwisiasdf>
i always have issues doing smp because i write my kernel expecting just 1 thread, and when i enable smp: boom, it all implodes
<heat>
I have a page allocator, a virtual memory allocator, a vmalloc allocator (uses vm's algorithm more or less, but conceptually different and way simpler), a bootmem allocator (to allocate page allocator structures), a page allocator (for now, simple list of pages) and a memory pool (used for simple, stupid allocations.)
<heat>
I meant slab allocator first
<wxwisiasdf>
fair
vdamewood has joined #osdev
<heat>
anyway, I think that for multithreading as far as I've seen you really need percpu/per thread caching
<heat>
it's generally The Way
<heat>
grabbing the lock = bad
<heat>
and asking a backend for memory is also bad, so avoid giving memory back
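A rough sketch of the caching front end being described, assuming a per-thread magazine in front of a locked backend; `backend_alloc_batch()` is a made-up stand-in for the slab layer, and the free-side overflow path is left out:

```c
#include <stddef.h>
#include <pthread.h>

#define CACHE_MAG 32

struct obj_cache {
	void *mag[CACHE_MAG];	/* locally cached free objects */
	int   n;
};

/* shared, locked backend (e.g. the slab layer) */
static pthread_mutex_t backend_lock = PTHREAD_MUTEX_INITIALIZER;
int backend_alloc_batch(void **out, int want);	/* stub: fills out[], returns count */

static __thread struct obj_cache cache;	/* per-thread here; per-CPU in a kernel */

static void *cached_alloc(void)
{
	if (cache.n == 0) {
		/* slow path: pay for the lock once per batch */
		pthread_mutex_lock(&backend_lock);
		cache.n = backend_alloc_batch(cache.mag, CACHE_MAG / 2);
		pthread_mutex_unlock(&backend_lock);
		if (cache.n == 0)
			return NULL;
	}
	return cache.mag[--cache.n];	/* fast path: no lock at all */
}

static void cached_free(void *p)
{
	if (cache.n < CACHE_MAG) {
		cache.mag[cache.n++] = p;	/* keep it local, avoid the backend */
		return;
	}
	/* overflow: push a batch back to the backend (not shown) */
}
```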
<heat>
if you open this on non-github you'll be able to click through it
<wxwisiasdf>
oh
<heat>
you'll see malloc and __bin_chunk(free) very easily on the graph
<heat>
and most of it is just waiting on a spin lock
<heat>
on 4 THREADS
<wxwisiasdf>
do_syscall_64 moment
<heat>
anyway "enabling SMP and it blowing up" is not because of a lack of efficiency
<heat>
you're probably just missing locks
<wxwisiasdf>
right
<heat>
an SMP kernel is just a really big really multithreaded thing
<heat>
so spinlocks, mutexes, rwlocks everywhere
<heat>
or fancier things in the late game
tarel2 has joined #osdev
<clever>
thats something i still need to investigate on the VPU
<clever>
i can turn on the 2nd core and run code there, but i have yet to enable proper SMP in LK
<clever>
i dont trust the lock primitives yet
<heat>
test em
<clever>
yeah, thats on my todo list
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<clever>
heat: from memory, there are no atomic memory access ops, only ~16 hw spinlocks; reading one will return one value for the winner, and another value for all the losers and all future reads
<clever>
a write will reset it
<clever>
any core can reset it, software rules should limit it to only the lock holder
<heat>
hash your software spinlocks on those 16 hw ones
<clever>
i think the hw spinlock also can't be indexed from a register
<clever>
its basically a cpu register
<clever>
currently, i just have one hw lock, that protects all "atomic" ops
<heat>
that's just needlessly slow
<clever>
yeah
<clever>
it works, but could be improved
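heat's suggestion might look roughly like this; vpu_hw_lock()/vpu_hw_unlock() are hypothetical names for the hardware spinlock accessors, and the hash itself is arbitrary:

```c
#include <stdint.h>

#define HW_LOCKS 16

/* hypothetical accessors for the VPU's 16 hardware spinlocks */
void vpu_hw_lock(unsigned idx);
void vpu_hw_unlock(unsigned idx);

struct spinlock { char unused; };  /* only its address matters here */

static unsigned hw_index_for(const struct spinlock *l)
{
	uintptr_t a = (uintptr_t)l;
	return (unsigned)((a >> 4) ^ (a >> 10)) % HW_LOCKS;  /* cheap hash */
}

/* Caveat: unrelated locks can hash to the same hw lock, so taking two
 * software locks at once risks self-deadlock on a collision. */
static void spin_lock(struct spinlock *l)   { vpu_hw_lock(hw_index_for(l)); }
static void spin_unlock(struct spinlock *l) { vpu_hw_unlock(hw_index_for(l)); }
```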
<heat>
woohoo linux 6.0
<heat>
it's so new kernel.org hasn't updated
<wxwisiasdf>
i am a hurr durr ninja
<wxwisiasdf>
or something it was codenamed
<wxwisiasdf>
linux giving its kernel silly names :P
<bslsk05>
github.com: Linux 3.11 · torvalds/linux@6e46645 · GitHub
<heat>
easily the best name
<klys>
I've been setting up an Epyc 7453 system that takes 3-4 min. to boot. It's been giving me trouble with gpu passthru. nouveau.ko made the screen light up, that's about all I have so far on this gtx 960 card. passthrough works on another box though. using debian.
<heat>
don't use nouveau.ko and try again
<klys>
I will try again another time, as it does take over three minutes to try again.
<heat>
servers do be like that
<heat>
kexec?
<klys>
there's a kexec binary somewhere, and it's like kexec /boot/vmlinuz-4.16.18 root=/dev/sda5 rw blah blah; ?
<klys>
I was using it with my w7pro in virt-manager, and device manager was like "there was a problem with this device. no resources are allocated because there was a problem."
tarel2 has quit [Ping timeout: 252 seconds]
<klys>
one thing I should double check is that it is really using ovmf
<klys>
because I don't see the tianocore logo
<heat>
then it's not
<klys>
seems reasonable, thanks I'll keep that in mind.
<klys>
meanwhile I have the Epyc 7453 box built now, with mobo, transitional 700W power, 4u case (yes the fan is too big), 56 threads, and 256GB A-grade samsung ram with heat spreaders.
<geist>
server stuff can spend a lot of time training the dram on every boot
<geist>
the arm server box i have takes 2-3 minutes to retrain the 16 dimms or so, but at least it dumps a whole log over serial so you can watch it
<geist>
it gives like 10 seconds per dimm, and there are 16 of them, so it kinda makes sense
heat has quit [Ping timeout: 268 seconds]
<mrvn>
Why do some boards do that and others don't? Why isn't the result saved in NVram?
<clever>
mrvn: the rpi4 is one case where it doesnt save it to ram
<clever>
it adds a noticable amount to the boot time
<mrvn>
On a SOC I can kind of see the penny saving attitude. But a server? Where uptime matters?
<clever>
dont reboot your server :P
<clever>
uptime matters!
<mrvn>
Sorry, we can't give you 99.9% uptime, the ram retraining takes too long.
<clever>
ive also heard claims that the training depends on temps
<clever>
if the ram/soc are hot, it needs different training
<clever>
and its constantly fine-tuning it while running
<mrvn>
But then it should retrain at runtime, after the ram has heated up.
<clever>
but if you do a reboot, the ram is already hot
<clever>
and the timings in nvram are wrong
<mrvn>
if you do a reset then it should just keep the settings, doesn't even need NVram. Just have a few bits in the DIMM that say: I've been trained.
<mrvn>
But all this still doesn't answer the original question: Why do some boards take 10s per DIMM to train?
<clever>
yeah, good question
<clever>
and the lack of source makes it near impossible to answer
<klys>
likely because I have 32gb per dimm
<klys>
which all has to squeeze thru a bus
<mrvn>
klys: No. Other boards with the same amount of ram don't take so long.
<klys>
it's ecc if that matters
<mrvn>
I sure hope nobody runs a server without ECC
<klys>
CL22 column access strobe
frkzoid has quit [Read error: Connection reset by peer]
archenoth has joined #osdev
[itchyjunk] has joined #osdev
<geist>
mrvn: so i think a few reasons: one is it *does* take longer on a new/clean boot. in my case it takes over 5 minutes until it's done a training once, so there's some sort of deeper thing there
<geist>
and i think the idea is lots of traces across the board with larger sets of dimms requires more training
<geist>
vs say 2 channels, maybe 4 dimms max
<mxshift>
If you get info about DIMM training over serial, someone left debug turned on which makes it take considerably longer
<mxshift>
Training data is usually cached in the main flash so sometimes naive firmware updates erase it
<geist>
yah like i said there's two levels of training on this machine. one of which takes over 5 minutes, the faster one being about 10 seconds per dimm
<mxshift>
Training data is also invalidated if the DIMM config changes at all. Even shuffling DIMMs will invalidate since the SI channel being measured is the SoC+socket+motherboard+DIMMs
<mxshift>
Normal training is just trying to find an operating point. Debug will often do a 2D sweep of voltage and timing adjustments to give a report of how much margin there is around that operating point
* geist
nods
<geist>
yah the dump out of this machine shows basically that
<mxshift>
I haven't seen training data be invalidated due to temperature. It's plausible. I have seen it be invalidated after some number of boots on the assumption that the channel may age
<mxshift>
I spent way too much time looking at EPYC 7xx3 training during our early bring-up
<mxshift>
You can enable debug or MBIST training on any EPYC 7xx3 system by setting a few APCB tokens to tell the PSP what you want
<mxshift>
Training is usually pretty quick but zeroing ECC DIMMs scales with amount of DRAM installed
[itchyjunk] has quit [Read error: Connection reset by peer]
genpaku has quit [Read error: Connection reset by peer]
genpaku has joined #osdev
k8yun__ has joined #osdev
k8yun_ has quit [Ping timeout: 252 seconds]
k8yun__ has quit [Ping timeout: 268 seconds]
<geist>
oh that's a fair point. does all ECC by definition need to be initialized before you can use it
<geist>
i guess so, right? or some sort of bit that says this is uninitialized
<Mutabah>
At a guess, the first read would return a ECC error if it's not written before
<geist>
yeah, never thought about that but i guess that's true. an ECC machine would probably need to take at least a few more seconds to boot to zero everything out
<Mutabah>
Note: I'm just guessing at how ECC works, but it seems like the easiest way
scoobydoo_ has joined #osdev
scoobydoo has quit [Ping timeout: 252 seconds]
scoobydoo_ is now known as scoobydoo
wxwisiasdf has quit [Ping timeout: 252 seconds]
tarel2 has joined #osdev
heat has joined #osdev
<heat>
morning
<heat>
i was reading that old ass linux 2.4 buddy allocator doc and I realized how they avoid always coalescing and splitting buddies
<heat>
they have a percpu list of pages (order 0) which they allocate from
<heat>
and do exactly the batch thing that you would do in slab allocation
<heat>
this is very interesting and may be worth considering
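A sketch of that scheme with assumed names (not the actual 2.4 code): each CPU keeps a small list of order-0 pages and only calls into the buddy allocator, with its lock and its splitting/coalescing, once per batch:

```c
#include <stddef.h>

#define PCP_BATCH 16

struct page { struct page *next; };

struct pcp_pages {
	struct page *list;	/* order-0 pages cached for this CPU */
	int count;
};

/* stand-in for the real buddy allocator entry point */
struct page *buddy_alloc_order0(void);

static struct page *pcp_alloc(struct pcp_pages *pcp)
{
	if (!pcp->count) {
		/* refill in one batch so the buddy lock and coalescing cost
		 * is paid once per PCP_BATCH allocations, not per page */
		for (int i = 0; i < PCP_BATCH; i++) {
			struct page *p = buddy_alloc_order0();
			if (!p)
				break;
			p->next = pcp->list;
			pcp->list = p;
			pcp->count++;
		}
		if (!pcp->count)
			return NULL;
	}
	struct page *p = pcp->list;
	pcp->list = p->next;
	pcp->count--;
	return p;
}
```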
ThinkT510 has quit [Ping timeout: 246 seconds]
<heat>
btw re memory training, from what I've gathered the new DDR5 machines take a stupid amount of time to train memory
<heat>
like 5 minutes
<heat>
for fucking laptops
<Griwes>
finally, bringing the server experience to the masses
k4m1 has quit [Quit: Lost terminal]
k4m1 has joined #osdev
<kazinsal>
me: "I can't wait to build a new machine to replace my old SB Xeon box that takes 10 minutes to post after a power failure"
<mrvn>
"makes lots of pages dirty"? Firefox has a few gig of ram that's all dirty. The binary is a few MB, of which maybe 100-200 pages are dirtied due to relocations. Is this worth it tp fix?
tarel2 has joined #osdev
<heat>
the point is that it's slow
<heat>
also wasteful, yes
<mrvn>
heat: Relocations is not why firefox is slow to start
<heat>
ok
<heat>
but firefox is an extreme example
zaquest has quit [Remote host closed the connection]
<mrvn>
Also consider this: If you use a good fraction of the functionality then all relocation pages will be used. So you aren't saving time but paying for it at runtime instead of once at load. And with 2 context switches per page unless they do this fully in kernel.
<heat>
Onyx's PIE compiled bash has 2543 relocations
<mrvn>
heat: how many plt pages is that?
zaquest has joined #osdev
<heat>
it's not just plt and got but .data too for instance
<mrvn>
much harder to count in data though.
<heat>
right. the only way would be to stop in the middle of linking and check the dirty pages
<heat>
my /usr/lib/libstdc++.so has _5971_
<mrvn>
and you can fix up a million a second?
<heat>
all of which are relocated at startup
<heat>
can you?
<mrvn>
Not counting the load time a million a second sounds reasonable.
<heat>
touch -> fault in 4KB -> write, but mostly randomish access
<heat>
the load time is the biggie
<mrvn>
debatable. The load happens when you use the program anyway.
<heat>
the point is that it takes time
<mrvn>
heat: aren't relocations sorted linear?
<heat>
no
<mrvn>
well, sort them. That should speed things up when you fix pages sequentially.
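For context, the per-entry work being argued about is essentially this loop, sketched here for R_X86_64_RELATIVE entries of a PIE loaded at `base` (not any particular dynamic linker's code); each store dirties the page it lands on, and sorted entries walk those pages sequentially:

```c
#include <elf.h>
#include <stdint.h>
#include <stddef.h>

static void apply_relative_relocs(uintptr_t base,
                                  const Elf64_Rela *rela, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (ELF64_R_TYPE(rela[i].r_info) != R_X86_64_RELATIVE)
			continue;
		uint64_t *where = (uint64_t *)(base + rela[i].r_offset);
		*where = base + rela[i].r_addend;	/* dirties the page it lands on */
	}
}
```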
<heat>
my clang has 361574 lmao
<mrvn>
For me the interesting number would be how many of those are accessed at runtime and not just for the fixup.
<froggey>
how many unique pages need relocations? if you have 1000 relocations and they're all on the same page, you don't win anything but maybe you do if they're scattered over 1000 pages
<mrvn>
The one thing interesting in the proposal is that the kernel can throw away pages and relocate them again when it later pages in the page again.
<heat>
those are both interesting questions I can't answer
<mrvn>
froggey: For library calls the compiler uses a trampoline and puts all those together so they take very few pages.
<mrvn>
Oh, and those trampolines do the fixup on first use.
<heat>
not true
<heat>
in fact doing fixups on first use isn't a great idea
<heat>
lazy binding is error prone
<heat>
let me call this fun... process just crashed
<mrvn>
heat: why would it crash?
<heat>
out of memory, symbol not found, etc
<mrvn>
I believe the symbols are checked at load, just the address calculation happens later.
<froggey>
iirc hardened binaries that use relro also do relocations up-front
<bslsk05>
wiki.musl-libc.org: musl libc - Functional differences from glibc
<mrvn>
Bit of a contradiction in the proposal: You want to make it faster but then you page-out data and have to relocate it again on page-in. So you will be doing relocations over and over in low mem situations.
tarel221 has joined #osdev
<mrvn>
I wonder if Apple benchmarked the old way, their new idea and using a global base address register.
<mrvn>
For statically linked PIE the latter should solve all problems.
<heat>
that's the worst idea
<heat>
you take away a register completely
tarel2 has quit [Ping timeout: 252 seconds]
<mrvn>
heat: you have 32, a register is cheap :)
<heat>
no you don't
<heat>
you have 15
<mrvn>
heat: ARM had 15, AArch64 has 31. :)
<mrvn>
Even with 15 testing the cost of one reg vs. all the relocations is worth it.
<heat>
x86-64 has 15
<heat>
I don't see how that can be true
<mrvn>
you think measuring the effect is pointless?
<heat>
I would prefer paying a memory cost over taking away a precious register (and breaking the ABI in the meanwhile)
<heat>
I guess you could feasibly use x18 in arm64 to do that kind of fuckery
<heat>
but is it free? probably not
<mrvn>
You are designing a new dynamic linker here. Total ABI breakage.
<heat>
having a relocation aware kernel would not break any ABI
<mrvn>
heat: they changed the relocations data too
MiningMarsh has quit [Read error: Connection reset by peer]
MiningMarsh has joined #osdev
<heat>
they who? apple?
nyah has joined #osdev
<mrvn>
heat: yes
<heat>
you could very much implement this either in the kernel or in a dynamic linker without changing any part of the ABI
terminalpusher has quit [Remote host closed the connection]
terminalpusher has joined #osdev
terminalpusher has quit [Remote host closed the connection]
<heat>
actually looking at actual numbers this seems a bit useless
<mrvn>
heat: 20ms for 390'000 relocations. See what I mean?
<mrvn>
And 6.5MB of memory. Thats less than 0.1% of what chrome uses.
<heat>
it probably depends on the storage medium
<heat>
20ms is not a lot for the hugest program out there
<mrvn>
I assume they didn't measure a cold start from rotating disks.
<mrvn>
8ms seek time kills you
<heat>
i would be kind of concerned for clang though, when compiling .c files or so
<heat>
something fast
<heat>
20ms could be around 10% of the compile time
<mrvn>
I wish my compiles would take 200ms.
<heat>
dont write C++ lmao
<mrvn>
bingo
<heat>
the C parts of my build are stupidly fast
<heat>
the slowest parts usually involve C++ and some header-only library
<mrvn>
totall killer
<heat>
gtest tests are sloooooooow
<heat>
nlohmann is also a POC
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<heat>
my own C++ build takes longer, and even that is a super restrained, C-with-classes-like idiom
<heat>
the cf workers runtime had files which took 40s to compile on top-of-the-line CPUs
<heat>
that whole codebase reeks of people who are not kernel developers :v
<mrvn>
heat: there is another thing to consider. Every time you load a lib you have to relocate it to somewhere more or less random. Wouldn't it make sense to pick some place in the address space and put the library there every time? Then you could share the relocation pages and only have to relocate once.
<heat>
no
<heat>
if someone accidentally leaks an address you have just defeated ASLR for that library, for the whole boot, whole system
<heat>
hell, you could craft your own program that does that, and then pwn sudo
<mrvn>
assuming you have a bug to exploit
<heat>
well, yes
<heat>
ASLR is defense in depth
<heat>
which is why it's important
<heat>
that thing just leads to shitty ASLR (@windows)
<mrvn>
Then you have to pay for it with the extra memory. :)
<heat>
might as well not be there if anyone can leak it
<heat>
yeah
<heat>
it's cheap though
<mrvn>
.oO(my words)
<mrvn>
We could bring back segment registers and give every library a segment. Jipey, no more relocations.
<heat>
i thought of that when you mentioned reserving a register
<heat>
but it doesn't work, you pay for segment addressing
<heat>
it's like an extra cycle
<heat>
(but it would totally work, you have a fully unused %gs in x86_64 userspace)
<mrvn>
Only because they didn't optimize it. More important is probably the prefix byte on the opcode
<heat>
i would think they have optimized it
<heat>
%fs and %gs addressing is heavily used for TLS/percpu
<mrvn>
only once to load the base address into a regular reg
<heat>
no
<mrvn>
once in a while
<heat>
mov %fs:variable, %reg
<heat>
that's how you do things
<mrvn>
heat: you can do that. If you have more variables you load %fs:0 and then use that as base
<heat>
here's some funny codegen for my percpu inline asm
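For illustration, per-CPU access through a segment register is a single instruction; a minimal x86-64 sketch with a made-up fixed offset (real kernels resolve the per-variable offset at link time):

```c
#include <stdint.h>

/* hypothetical layout: the per-CPU area base lives in %gs, and the
 * counter sits at byte offset 0x10 inside it */
#define PCPU_COUNTER_OFF 0x10

static inline uint64_t pcpu_read_counter(void)
{
	uint64_t v;
	/* one mov, no extra register holding the per-CPU base */
	__asm__ volatile("movq %%gs:%c1, %0"
	                 : "=r"(v)
	                 : "i"(PCPU_COUNTER_OFF));
	return v;
}
```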
<geist>
heat i know has taken the plunge and hacked on some non x86 arches, and has seen that the grass is slightly greener on other sides. or at least a more pleasant shade
<geist>
haha
gog has quit [Ping timeout: 268 seconds]
<mats1>
lol
<Griwes>
heat, "I'd rather use zfs" doesn't tell us much, because zfs is delightful :P
<geist>
honestly i'm probably the one person that doesn't belong to the zfs fan club
<geist>
for no particular reason, to be honest, but it seems like it gets just a Little Too Much love
<geist>
kinda like plan9 or something, where it feels a lot like folks just repeat 'omg that is awesome'
<geist>
Griwes: and in no way am i implying that's what you are doing here, just getting it off my chest
<mjg>
zfs is pretty great if you don't look too closely at it
Iris_Persephone has joined #osdev
<Griwes>
Well, I've been running zfs as rootfs on every non trivially rebuildable system I have for years now, and I couldn't be happier
<mjg>
but i admit i don't know how it really compares to btrfs
<Iris_Persephone>
heya guys!
<Griwes>
So at least I put my money where my mouth is :p
<mjg>
crucial selling point for me is that there is no partitioning in a way which limits space
<geist>
yeah i just get a little suspicious of stuff that is universally loved, etc. the more love something gets the more suspicious i am that it's just groupthink
<mjg>
as in if i want to create a separate dataset at /lolo, i can just do it
<mjg>
don't have to carve out any space for it
<Griwes>
geist: yeah that's fair
<mjg>
that turned to shitting on linux real quick
<mjg>
:S
<Iris_Persephone>
My Linux From Scratch didn't end up working out, so I thought I would follow Bare Bones and see where that takes me
<geist>
well, it's educational if nothing else
demindiro has joined #osdev
<geist>
honestly LFS was probably a lot simpler and useful about 10-15 years ago when a basic system was init and a shell, and maybe one or two daemons
<geist>
i *assume* it's moved on to more complex setups now
<zid>
I'd only LFS from stripped down packages built on a gentoo host
<mjg>
what's the point of lfs?
<zid>
no idea
<zid>
edutainment for getting it working I think
<geist>
educational if nothing else. i think i learned a few things the first time i fiddled with it
<mjg>
ye ok
<geist>
exactly
<mjg>
that used to be gentoo back in my day
<demindiro>
Also smaller images I guess
<geist>
you end up with something roughly BSD shaped, or gentoo stage 1/2
<geist>
yah i loved the days back when you actually built the first few stages of gentoo
<zid>
I should scour ebay for rams, speaking of building packages
<geist>
gave you the same dopamine hit as LFS
<zid>
I have some paypal dollars to burn but I refuse to overpay for this old ass ram I want
pretty_dumm_guy has quit [Quit: WeeChat 3.5]
<Iris_Persephone>
I spent... god, it must have been weeks at this point
<Iris_Persephone>
Eventually I decided tracking down all the errors wasn't worth my time
<geist>
yah if it ceases to be edutainment, move on
<geist>
you alas dont get an Achievement badge
<zid>
That's why I never played that game, no achievement tracking
<Iris_Persephone>
Gave me a new appreciation for Unix-likes, though!
<zid>
okay I checked all of ebay.. maybe I should figure out how alerts work
<geist>
oh no. that way leads to madness
<Iris_Persephone>
What are you looking for?
<zid>
2x8GB 1866MHz UDIMMs
<Iris_Persephone>
ah
<zid>
or alternatively, 4x8GB 1600/1866Mhz URDIMMs
<gog>
yes, well-defined interfaces are bad. let's slap everything inside ioctls and make it look like a file
darkstardevx has joined #osdev
<Iris_Persephone>
To be perfectly honest
<heat>
the worse parts are when they go half write/read and then give up and ioctl the shit out of it
<Iris_Persephone>
half of this is just pure spite at Voicemeeter and AutoHotKey
<heat>
THANKFULLY they now understand they can just return a fd from syscalls that are not named open
<heat>
which is why pidfd, etc are decent interfaces
<gog>
yeah the excuse of "it keeps a generic interface for I/O!!" like
<gog>
no
<gog>
it fucken doesn't
<gog>
if you have to use ioctl at all you've negated the benefit
<heat>
ioctls are not bad
<gog>
i suppose not
<gog>
but you still have to use a file descriptor with them
<Iris_Persephone>
I would give my firstborn child to be able to pipe my audio/video/whathaveyou to and from a program as easily as using "tee"
<gog>
so the big abstraction is still there for the benefit of the little one
<heat>
char buffer[200]; sprintf(buffer, "/proc/%d/comm", pid); open(buffer) is horrible
<heat>
dog shit
<heat>
it's also like half the linux libc
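A small sketch of the fd-returning style heat prefers, using pidfd_open (Linux 5.3+); it goes through syscall() on the assumption that the libc wrapper may be missing:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	pid_t pid = getppid();	/* any pid works for the demo */

	/* a handle to the process as an fd, instead of sprintf()ing /proc paths */
	int pidfd = syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	/* the fd becomes readable (POLLIN) once the process exits */
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	poll(&pfd, 1, 0);
	printf("pid %d %s\n", (int)pid,
	       (pfd.revents & POLLIN) ? "has exited" : "is still running");

	close(pidfd);
	return 0;
}
```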
<heat>
Iris_Persephone, not doable
<heat>
audio and video is low latency stuff
<Iris_Persephone>
Yeah, that figures
<heat>
video is bandwidth heavy and needs acceleration
<heat>
it turns out the whole UNIX philosophy thing of making small, composable programs with pipes and whatever the fuck isn't a great idea
<heat>
because it's slow and a poor fit
Celelibi has quit [Ping timeout: 260 seconds]
<heat>
so people just added options to already existing programs
<heat>
like cat -n
<Iris_Persephone>
So, the slowness is *inherent* to the concept of pipes?
<heat>
yes
<demindiro>
no
<heat>
lots of copying and IPC
<heat>
yes
<heat>
absolutely yes
<heat>
you're not getting anything performant out of pipes
<demindiro>
How performant does it need to be?
<demindiro>
Piping works plenty well for lots of stuff
<heat>
on real services? it totally does
<heat>
I've seen people suggest https as httpd | tlsd or whatever
<heat>
ridiculous shit
<CompanionCube>
does anyone really care about st_blocks tho?
<heat>
I do
<heat>
fsync() should make sure metadata is written back
<heat>
also kenton too
<heat>
he has a file hole test which needs it
<Iris_Persephone>
well shit, there goes that idea
<heat>
do things -> fdatasync -> check st_blocks -> expect st_blocks to be valid
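A bare-bones version of that check might look like the following (error handling omitted; what counts as a "valid" st_blocks after fdatasync is exactly what is being argued about):

```c
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[65536];
	memset(buf, 'x', sizeof(buf));

	int fd = open("stblocks-test.tmp", O_CREAT | O_TRUNC | O_RDWR, 0644);
	write(fd, buf, sizeof(buf));
	fdatasync(fd);

	struct stat st;
	fstat(fd, &st);
	/* st_blocks is in 512-byte units; 64 KiB of just-synced data should
	 * show roughly 128 blocks if metadata is kept consistent */
	printf("st_size=%lld st_blocks=%lld\n",
	       (long long)st.st_size, (long long)st.st_blocks);

	close(fd);
	unlink("stblocks-test.tmp");
	return 0;
}
```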
<demindiro>
iris_persephone: if redundant copying is a concern, consider using shared buffers and passing small messages between processes instead.
<heat>
demindiro, and your unix philosophy goes out the door
<demindiro>
No
<heat>
yes
<demindiro>
Shared buffers are very generic
<heat>
it all revolves around pipes
<demindiro>
Message can just be "this offset + length go figure"
<heat>
unix revolves around simple utilities that are easily composable using |
<heat>
on the shell
<demindiro>
And you can do the same with shared buffers
<heat>
it worked great in the 70s
<demindiro>
Just involves a little more setup
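A minimal sketch of the shared-buffer idea using a Linux memfd; here the mapping is simply inherited across fork() and the "small message" is reduced to process exit, where a real design would pass the fd plus offsets over a socket:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	size_t len = 1 << 20;
	int fd = memfd_create("shared-buf", MFD_CLOEXEC);	/* glibc 2.27+ */
	ftruncate(fd, len);
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (fork() == 0) {	/* producer */
		strcpy(buf, "bulk payload lives here, no copy through a pipe");
		_exit(0);
	}
	wait(NULL);		/* consumer */
	printf("consumer sees: %s\n", buf);
	return 0;
}
```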
<heat>
again, unix philosophy goes out the door
<kof123>
eh, if anyone cares: somewhere in the unix interviews there is someone who had a fancier pipe idea that was deemed too complex, would look at that, not very helpful i know :D
<kof123>
so i would lean towards heat: it wasn't original, just based on that
<kof123>
*in the original
<CompanionCube>
does btrfs have less weird semantics
<heat>
idk
<heat>
hope so
<kof123>
i mean, it might be a horrible idea, just if someone was looking into pipes
<kof123>
maybe it was left out for very good reasons
<heat>
"This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."
<kof123>
misspoke, you said everything is a file was not the original, my bad
<heat>
take your data, serialize it to text (but not json because that isn't streamy), send it over a pipe, receive it on the other program, deserialize it, ...
Celelibi has joined #osdev
<heat>
according to this your C compiler would be a 200 line piped up invocation|
<heat>
!
<demindiro>
I guess I don't take the UNIX philosophy as literally, you can do e.g. lz4 blah.txt | pv | lz4 -d
<demindiro>
and lz4 outputs binary data
<Iris_Persephone>
I suppose I don't want to *strictly* follow the Unix philosophy, I just find the concepts useful
smach has joined #osdev
* kof123
prepares to throw gasoline on the fire
<kof123>
piping audio and video...wouldnt that just be remote importing /dev/xyz on remote machines?
<gog>
nothing is a file
* kof123
runs
<gog>
that's my philosophy
<heat>
kof123, very plan9 of you
<kof123>
yes :D
<kof123>
"simpsons did it!" "plan9 did it!" lol
<CompanionCube>
heat: iirc the thing with the metadata is that it doesn't know the on-disk size until the TXG commits. fsync only writes back to the on-disk ZIL.
<heat>
that sounds like fsync and sync are broken
<CompanionCube>
no it's just write-ahead logging as you would find in a database
<CompanionCube>
you can make fsync properly broken by setting sync=disabled
<heat>
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device
<heat>
(or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.
<heat>
after fsync on zfs your st_blocks metadata isn't consistent at all with how things look or will look
<heat>
fsync should be a proper barrier
<heat>
what's fsync for if you can't guarantee consistency on metadata, the inode, and the disk itself?
<heat>
the problem used to be the lack of guarantees on synchronization for other API functions
<bslsk05>
lists.gnu.org: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in
<heat>
now these fancy filesystems can't even guarantee fsync synchronizes?
<heat>
good, fuck it
<gog>
yes
<gog>
move beyond the need for files
<heat>
"until the data are synced"
<heat>
at least it works on fsync
<CompanionCube>
single-level storage when
<heat>
(hopefully)
<Griwes>
is on-disk ZIL lost on reboot?
<Iris_Persephone>
Stupid newbie question: What is an filesystem object, if not a file?
<demindiro>
generic data blob, I suppose
<CompanionCube>
Griwes: no, the one lost on reboot is the in-memory ZIL.
<Griwes>
if no (and I don't think it is), then this behavior fulfills the behavior of fsync you quoted vOv
<Griwes>
so everything is fine then
<heat>
it's not
<clever>
Iris_Persephone: for zfs, there are a lot of fs metadata objects, and block devices (zvol's) are also one big object
<heat>
metadata isn't synced
<clever>
but files are also objects
<heat>
data isn't even fucking sync'd
<CompanionCube>
async writes go to the in-memory ZIL but not the on-disk one
<heat>
how can you call *sync* and have it not be sync
<Griwes>
> so that all changed information can be retrieved even if the system crashes or is rebooted
<heat>
committing something to zfs's magic super duper table of awesome things is not guaranteeing synchronization
<clever>
heat: with zfs, i believe sync will write it to the on-disk ZIL (journal), so it can return faster
<Griwes>
this seems to be the core part of fsync by the quoted definition
<heat>
and that's the fucking problem clever
<heat>
it's sync, not a race
<clever>
heat: but i think the ZIL is basically just a record of VFS calls, write/delete/symlink
<Griwes>
so you are arguing the definition of fsync you quoted is wrong
<heat>
I am
<heat>
fsync and sync and fdatasync need to be *properly done*
<heat>
not POSIX's-vague-definition done
<Griwes>
okay so you're saying the unix (or at least linux) definition of fsync is bad
<heat>
yes!
<Griwes>
stop blaming zfs for that then!
<demindiro>
IIRC macOS even just ignores fsync for the sake of performance
<demindiro>
Which apparently is valid
<heat>
and you'll know that filesystem fuck around with the semantics of fsync
<clever>
demindiro: i have seen macos bugs, where a file that was written via mmap() and not sync'd, reports having holes when you use the seek-hole function
<heat>
this is a contentious subject
<heat>
please, make it right.
<clever>
demindiro: cp then just skips the first 8mb of the file, and doesnt even bother copying it
<Griwes>
I don't have a full opinion on whether the fsync definition is sufficient or not, and I can see merit in an argument that it is wrong
smach has quit [Remote host closed the connection]
<Griwes>
but that doesn't make it the FS' fault
<mjg>
macos and performance?
<heat>
it is
<clever>
demindiro: and by pure chance, i then saw nearly the identical bug on linux+zfs a week later, lol
<heat>
they implement fsync!!!
<Griwes>
would you prefer them to *not* implement fsync?
<heat>
better than pretending it does anything remotely useful
<mjg>
if lolo fsync provides no guarantees i would prefer the software to know
<Griwes>
but it does what it says on the man page lol
<kof123>
i was looking at raidframe to steal ideas... it is way too much for me, but a couple gems in the source: @echo "RAIDFRAME R00LZ D00D" >> /dev/null Set this to 1 if you are the CMU Parallel Data Lab. You probably aren't.
<heat>
it's like ext4's fsync just writing journaling data and fucking off
<Griwes>
yeah? it's implementing a unix FS protocol lol
<Griwes>
blame unix or whoever the heck first defined fsync like this
<CompanionCube>
tbf at least opposing ext4 journaling fsync is being consistent
<heat>
do you think that guarantees consistency?
<Griwes>
again: I see merit in your argument but you are misplacing the blame for the fault in the API that you perceive
<heat>
I expect sane filesystem semantics
<gog>
why
<Griwes>
I expect a driver to implement the API it is asked to implement
<heat>
if they can't provide it, fix it. I called fsync, I want fsync, I want consistency
<Griwes>
and if that API is shit, it's not the driver's fault
<heat>
there already is consistency
<heat>
(ish)
smach has joined #osdev
<heat>
ext4 delays block allocation but st_blocks is always correct
<Griwes>
you called fsync, but you wanted something else than what the manpage for fsync says will happen
<heat>
now, I understand zfs's set of constraints is totally different
<clever>
heat: zfs compresses, so the on-disk size can vary
<heat>
so, make it work on fsync
<clever>
CompanionCube: is the ZIL before or after compression?
<heat>
^^^^
<Griwes>
expecting more than what the manpage says is on you, not on the driver
<heat>
Griwes, the manpage is irrelevant here
<heat>
the manpage is written based on the code and behavior of fsync
<CompanionCube>
clever: that's actually a good question, more so how that interacts with immediate writes which are also i thing i forgot to mention
<Griwes>
heat, okay. what does POSIX say?
<heat>
posix is also irrelevant
<heat>
sorry
<Griwes>
no
<heat>
yes
<heat>
absolutely yes
<Griwes>
because POSIX is the spec that they are implementing
<pitust>
its not for xnu
<heat>
you can't write a whole OS looking only at POSIX
<heat>
The sync() function shall cause all information in memory that updates file systems to be scheduled for writing out to all file systems.
<heat>
The writing, although scheduled, is not necessarily complete upon return from sync().
<heat>
???????????????????????????
<Griwes>
this is time for us to agree to disagree because you have a fundamentally different PoV on what you are supposed to do when implementing an interface with a formal spec than I have and we will not agree on this
<heat>
the kernel doesn't need to "just implement POSIX"
<heat>
it never has, that was never the purpose
<heat>
POSIX was defined by the kernels
<heat>
still is
* Iris_Persephone
pokes her head into the chat
<Iris_Persephone>
is it over yet?
<Griwes>
POSIX is defined by a spec created by the austin group and then ratified by multiple international standards orgs
<demindiro>
Be careful not to get decapitated
<Griwes>
but let's agree to disagree
<heat>
if you don't provide a useful fsync() then you're doing a bad job
<heat>
Griwes, spec that is built by... multiple operating systems in order to form a consensus
<Griwes>
if you expect fsync to do more than is documented as its user, you're doing a bad job
<heat>
there have been huge debates in linux about it
<bslsk05>
utcc.utoronto.ca: Chris's Wiki :: blog/solaris/ZFSZILSafeDirectWrites
<CompanionCube>
clever: and the backlinks are good too
<clever>
basically, the in-memory maps are updated, so the live system is aware of it
<heat>
Griwes, to know how much disk I've used, file holes, etc
<clever>
and during recovery, you read the ZIL and rebuild that in-memory state
<Griwes>
"sync" being "the data is on the permanent storage device so it doesn't get lost" seems like a perfectly fine definition to me
xenos1984 has quit [Read error: Connection reset by peer]
<clever>
CompanionCube: and from what ive see, the ZIL is entirely optional, assuming there is nothing to recover, you could just do every write directly to the pool itself
<Griwes>
heat, why do you need to know _exactly_ how many blocks you've used?
<heat>
Griwes, the metadata doesn't even add up!
<heat>
<heat> Griwes, to know how much disk I've used, file holes, etc
<clever>
the only purpose of the ZIL is to make sync writes finish faster
<heat>
of fucking course
<Griwes>
the first part of your answer is why I'm asking the question
<heat>
it's a performance thing!
<Griwes>
lol
<Griwes>
anyway I'm out
<heat>
it's not a "Lets just do the bare minimum for posix"
<Iris_Persephone>
If you don't like it, just implement it differently in your OS...?
<Iris_Persephone>
Is that not practical?
<heat>
no
<heat>
I'm not realistically daily driving my OS
<CompanionCube>
clever: mhm
<heat>
if you don't fight for change you just get stupid behavior all the way
<Griwes>
(we're at the point of disagreeing what constitutes "stupid behavior")
freakazoid332 has quit [Read error: Connection reset by peer]
<clever>
CompanionCube: i also have theories on how a bootloader can cheat zfs writes, what if i just append an entry to the ZIL?
smach has quit [Ping timeout: 265 seconds]
<CompanionCube>
clever: do you really need to cheat zfs writes when there's enough reserved space to be used
<mrvn>
heat: The number of blocks something uses becomes basically meaningless when you have snapshots (is the space = total / snapshots using the block?) and cow (a freshly edited file can take 3 times as many blocks and go down to 1x as things are flushed).
<clever>
CompanionCube: then i need to update the spacemaps, and all of the indirect blocks in several objects
<demindiro>
ZIL is a separate device no?
<CompanionCube>
demindiro: not per se
<clever>
demindiro: it can either be in the main pool, or its own device
<CompanionCube>
the term 'SLOG' refers to the latter, but ZIL applies to both as well as the in-memory one
<clever>
demindiro: under normal conditions, the ZIL is write-only
<clever>
to make a write() or sync() finish asap, it syncs the data to the ZIL on a disk, but also keeps a copy in ram
<clever>
and then it can update the proper pool at a later time, from the copy in ram
<mrvn>
heat: there is also fdatasync(). Sounds like ZIL makes fsync behave like fdatasync
<heat>
yes
<heat>
the difference between fsync and fdatasync is that metadata is supposed to be written and consistent
<mrvn>
BUT: "Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk."
<heat>
yes, that makes sense
<clever>
from memory, a normal sync in CLI did mask this bug
<mrvn>
You could argue that the block counts etc is part of the directory?
<heat>
could I?
<mrvn>
.oO(Well, you can argue anything)
<mrvn>
The manpage refers to inode(7) for what metadata is flushed by fsync() but I don't know how normative that is.
<clever>
mrvn: for zfs, the blocks are part of the dnode (similar to an inode), and a directory is just a name->dnode-number mapping
<clever>
mrvn: but the ZIL lets you ensure data is secure on disk, without updating the dnode
<mrvn>
clever: Which fits with "transfers all modified in-core data of the file referred to by the file descriptor fd to the disk device"
<clever>
mrvn: the dnode table is also an object in zfs, with its own indirect block tree, and a dnode at its root
<mrvn>
clever: except it's not "all"
<heat>
st_blocks is part of metadata per inode(7) if we want to go down that route
<heat>
exactly.
<mrvn>
clever: heat's argument is that "st_blocks" is not written to disk, nor even reported correctly from ram
<heat>
yup
<mrvn>
heat: but what should st_blocks be? I have a file that takes up 4 blocks so st_blocks is 4. Now I overwrite 1 block so with COW the count goes up to 5? 2 flushes later the old block is removed and the count goes back to 4?
<mrvn>
heat: or with journal and compression: I write a block, the count goes up to 5 to account for the journal block. Then compression takes it down to 2 blocks total and it drops to 2?
<heat>
i don't know
<heat>
that's up for the fs to decide
<heat>
not like I've thought much about that particular question anyway
<mrvn>
heat: well, you seem to disagree with the decision zfs made.
<heat>
no
<heat>
zfs has a decision regarding st_blocks
immibis_ has quit [Ping timeout: 264 seconds]
<heat>
it's pretty clear
<heat>
you do get st_blocks values after everything is flushed per sleep 10
<heat>
the problem is that it's nowhere near consistent
<heat>
and fsync, sync, fdatasync, which should serve as barriers for in-memory and on-disk consistency, don't
<mrvn>
But that's the problem with journaling, COW and compression. The block count changes over time.
Iris_Persephone has quit [Ping timeout: 268 seconds]
<heat>
sure
<mrvn>
fsync() is more about the data being retrievable.
<mrvn>
it wasn't designed with metadata that changes on its own over time.
<heat>
I would say it's about consistency
xenos1984 has joined #osdev
<heat>
"the only purpose of the ZIL is to make sync writes finish faster" <-- having a structure to make syncs faster isn't a particularly good goal
<heat>
if it jeopardizes the consistency
<clever>
heat: all writes to an fs are also serialized within the ZIL
<clever>
because the ZIL is a singly linked list, and each write appends to it
<heat>
sure
<mrvn>
heat: does it violate "so that all changed information can be retrieved even if the system crashes or is rebooted."?
<heat>
I'm not saying your fancy ass filesystem doesn't have top tier consistency in practice
<heat>
no
<heat>
but again, manpages reflect reality, not the opposite
<mrvn>
heat: so your only problem is that the metadata doesn't become stable with fsync() but keeps changing for a while after that.
<heat>
you want to enforce sane, useful semantics for your filesystem so that things Just Work, are safe, make sense
<heat>
it's not "do the bare minimum"
<Griwes>
manpages reflect the behavior a user can depend on
<mjg>
i remember zfs is doing something nasty to keep fsync operational without demolishing perf
<mjg>
and it may be that it is the zil
<Griwes>
a user depending on more than the manpages say they can depend on is a user problem
<heat>
Griwes, I'm talking about changing the implementation :)
<mjg>
i'm only slightly familiar with zfs, mostly from its pessimal multicore code :[
<Griwes>
changing the implementation would not stop meaning you can only depend on what the manpages say you can depend on
<heat>
sure
<Griwes>
you are presenting all this as a bug
<heat>
make everyone do that, then change the man page
<mrvn>
Pol: Who here uses fdatasync() in their code because they only care about data integrity?
<mrvn>
+l
<Griwes>
it's not; it's a feature request
<heat>
it is a bug, there's no consistency in memory
<Griwes>
it's a feature request to widen the guarantees of fsync
<heat>
I'm saying they're not wide enough for normal usage
<Griwes>
and you are asking to widen the guarantees
<heat>
this is all happening because they wanted to make sync faster
<Griwes>
that's a feature request
<heat>
they regressed it
<heat>
it is in fact a bug, a regression. it was fine, now it's not
<heat>
this is something that needs to be fixed and not added
<Griwes>
they regressed the manpage?
<heat>
...
<heat>
i give up
<Griwes>
did the manpage promise you more earlier?
<Griwes>
if it didn't, then this is a feature request
<heat>
it's pointless
<mrvn>
Griwes: that depends on the interpretation
<Griwes>
look, I work on a formal standard
<Griwes>
and I deal with that distinction regularly
<mrvn>
Griwes: it doesn't flush (or even compute in ram) the metadata
<Griwes>
if it wasn't in the spec that you can depend on the behavior you want, it's not a bug
<mrvn>
and it's more the "compute in ram" part heat objects to, I think.
<Griwes>
mrvn, is it required to?
<mrvn>
Griwes: yes
<Griwes>
by?
<heat>
an operating system is not a fucking language
<heat>
you do realize that right?
<mrvn>
Griwes: I only have the manpage open but the specs surely say something similar: "As well as flushing the file data, fsync() also flushes the metadata information associated with the file (see inode(7))."
<Griwes>
we're talking about an API that has a spec
<heat>
POSIX specifies the bare minimum
<heat>
the absolutely bare fucking minimum
<heat>
can they just comply to that? oh of fucking course
<heat>
is it useful? absolutely not
<mrvn>
heat: have you actually looked into the specs to see what it says about the inode metadata?
<heat>
it says nothing
<heat>
POSIX doesn't know what an "inode metadata" is
<heat>
POSIX will never make stronger guarantees about fsync, or anything else
<heat>
it's purposefully generic
<heat>
so every god damn UNIX out there can say they're POSIX(r) Compliant(r)(tm)
<mrvn>
heat: surely it must say something about fsync() vs. fdatasync()
Iris_Persephone has joined #osdev
<Griwes>
> This field indicates the number of blocks allocated to the file
<heat>
"The fsync() function shall request that all data "
<Griwes>
okay, are blocks allocated to the file while the write is in ZIL?
<mrvn>
Griwes: arguably.
<heat>
"The fdatasync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state."
<Griwes>
mrvn, but that's just ZIL blocks. the actual blocks haven't been allocated yet
<Griwes>
I'd argue that the only bug is that it returns 1 and not 0
<mrvn>
Griwes: it's space on the disk
<heat>
then that's the wrong fsync behavior
<heat>
why is zfs autonomously saying "fuck standard behavior" and trying to just be faster
<mrvn>
Griwes: same with journaling and COW copies. But then the block count changes over time. So it's not that useful.
<clever>
Griwes: ZIL can store data either in the log itself, or in the main pool, in both cases, there are data blocks assigned to that data, but the free space maps havent been updated
<Griwes>
mrvn, it also says "the unit may differ on a per-filesystem basis" - wonder if that means the filesystem can use two units depending on the file's state
<Griwes>
clever, but arguably those blocks are assigned to the ZIL entry and not to the file itself
<clever>
Griwes: zfs does allow the block-size for every file on an fs to differ
<Griwes>
idk, looks like a feature request to me
<mrvn>
Griwes: try computing st_block on a unionfs where each branch can have a different block size.
<heat>
st_block is in practice in 512byte units
<mrvn>
Personally the st_block has been rather meaningless for decades now.
<heat>
de facto
<heat>
anyway
<heat>
this is not worth debating
<clever>
[root@amd-nixos:~]# stat /nix/var/nix/db/db.sqlite
<clever>
mrvn: yikes, thats out of a total of 1153 L0 blocks, 90% of the blocks are fragmented
<mrvn>
heat: I agree with you that it's bad that st_block is reporting bad data. But fact is st_blocks has been total chaos for decades now. It's basically useless unless you know what FS you are checking.
<Griwes>
well be glad I'm not a zfs dev because I'd just slam your issue as closed, not a bug, rtfm
<heat>
what manual?
<heat>
do I need to be intimately familiar with my filesystem's format??
Iris_Persephone has quit [Ping timeout: 265 seconds]
<Griwes>
the manpages for fsync and inode
<mrvn>
heat: for st_blocks? yes.
<demindiro>
Is the block count even accurate? du -a random_text_file claims it's 5 blocks of 512 bytes, but the file itself is 2162 bytes large.
<dh`>
isn't st_block defined to count in 512-byte blocks? or was that something we did to avoid chaos?
<demindiro>
Which honestly doesn't make much sense to me
<demindiro>
(also I use ashift=12)
<clever>
mrvn: most of the holes on my disk are 8kb in size...
<mrvn>
dh`: no
<clever>
demindiro: check zdb as well, `zdb -ddddddddbbbbbbbb <dataset> <inode#>`
<mrvn>
demindiro: 2162 rounded up to the next full block is 5*512.
<demindiro>
wait
<clever>
demindiro: L0 blocks contain the actual data, L1 blocks are a list of L0 pointers and so on, if a block says gang then its fragmented into more pieces
<demindiro>
uh
<demindiro>
4.5K dux_notes.txt
<demindiro>
hm
<dh`>
for us it is: st_blocks The actual number of blocks allocated for the file in
<dh`>
512-byte units.
<dh`>
makes me wonder if our zfs does it right
<demindiro>
I'm trying to understand how du -ha arrives at 4.5K with 512 * 5
<Griwes>
dh`, are writes in the write journal (ZIL for zfs) "allocated for the file"?
<clever>
dh`: should st_blocks count metadata, like the indirect block tree?
<dh`>
traditionally it does, yes
<mrvn>
Another thing about st_blocks: When a file is small or the tail is small (or has small blocks in the middle) it can be included in the inode (or other metadata). How do you count that?
<clever>
dh`: what about when data blocks are of variable size?
<dh`>
As short symbolic links are stored in
<dh`>
the inode, this number may be zero.
<heat>
mrvn, ext4 fakes 1
<dh`>
clever: the goal is to get du(1) to print useful values, so ultimately it's up to the filesystem to be helpful
<heat>
in practice all relevant st_blocks are defined as 512-byte by the kernel
<clever>
mrvn: zfs has a feature called embedded block pointers, if the entire file data (after compression) is under ~300 bytes, it just shoves the data in where the block#+checksum would have gone
<clever>
[root@amd-nixos:/nix/var/nix/db]# ls -lhs
<heat>
I think most implementations converged into 512
<clever>
dh`: heh, all i did was cp it, and it gained more overhead!
<dh`>
that, however, is not reasonable!
<mrvn>
clever: should you report those 300 bytes as "1 512-byte block"? Or sum up all the data + metadata of the file (some 300+32+something bytes) and round up to 512-byte blocks?
<clever>
mrvn: now it has 32 fragmented L1 blocks, because it has twice as many L0's
<mrvn>
heat: did you see the note in stat "Note: for performance and simplicity reasons, different fields in the stat structure may contain state information from different moments during the execution of the system call."? So the stat data itself has never been deemed consistent anyway. :)
<dh`>
whose stat is that in?
<mrvn>
clever: Will that change over time as sqlite modifies blocks?
<mrvn>
dh`: man stat on Debian
<mrvn>
man 2 stat
<dh`>
ah, linux
<clever>
mrvn: once a file has 2 blocks, the block size is set in stone, but the CoW nature means the indirect blocks(L1/L2) are re-written on every change
<clever>
mrvn: so if my free space situation improves, the L1 blocks will un-fragment, as they get modified
<dh`>
one of the things I've been debating for years is whether it's acceptable for stat() to succeed and tell you the linkcount is 0
<clever>
but, each file on a filesystem, can have a different block size
<mrvn>
clever: but if you have 8k blocks and sqlite writes 4k changes then your L0 will fragment, or not?
<dh`>
apparently linux thinks so!
<clever>
mrvn: nope, zfs will do a read/modify/write cycle
<clever>
pull 8k off the disk, modify half of it, then write 8k back to disk
<mrvn>
dh`: more importantly what is the link count for a directory? The "normal" behavior lets you know the number of files and subdirs in that dir.
<clever>
same reason you want to align your partitions, so your 4k writes are not straddling half of 2 4k blocks
<dh`>
that's not more important, it's a completely different issue
<dh`>
:-p
<clever>
it still causes a read/modify/write cycle, but twice now!
<mrvn>
clever: so it stores 4k zeroes and 4k data when you have a hole?
<clever>
not sure what it does, if half a L0 is nulls, and compression is off
<clever>
and now that youve reminded me, let me flip compression on
<mrvn>
clever: with compression you get a lot more partially filled blocks.
<bslsk05>
github.com: lk-overlay/zfs.c at master · librerpi/lk-overlay · GitHub
<clever>
currently, its able to parse the vdev labels, find the most recent uberblock, load the MOS dnode, and partially load the root dataset dnode
<clever>
to get directory traversal, i need to implement ZAP parsing
<clever>
a fat-zap is basically a b-tree, hash the filename, use the hash in an index to find the block# with the name->inode entry
<clever>
but if the filenames are small and the total size of the serialized structure is small, it instead becomes a micro-zap, no index, just a single block of name->inode pairs
<clever>
demindiro: only sha256 is validated currently, but zfs uses fletcher4 for metadata, so my more critical reads arent validated
<clever>
and only lz4 decompression is supported
<clever>
mrvn: oh right, and with 4kb blocks, compression is pointless
wxwisiasdf has joined #osdev
<clever>
mrvn: block allocations, are done in units of ashift, 4kb for my current setup
<clever>
so with 4k fs blocks, it takes 4kb of data, compresses it down, then stores it in a 4kb block!
<clever>
with 8kb blocks, you need to halve the size of the data, for compression to actually be a benefit
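The arithmetic clever is describing, as a tiny helper (assumed sizes and rounding, not ZFS's actual policy code): allocations only shrink in ashift-sized units, so compression has to save at least one whole unit to be worth storing:

```c
#include <stdbool.h>
#include <stddef.h>

/* true if storing the compressed form actually uses fewer on-disk units */
static bool compression_saves_space(size_t logical, size_t compressed,
                                    size_t ashift_bytes)
{
	size_t logical_alloc    = (logical    + ashift_bytes - 1) / ashift_bytes;
	size_t compressed_alloc = (compressed + ashift_bytes - 1) / ashift_bytes;
	return compressed_alloc < logical_alloc;
}
/* with 4 KiB records and a 4 KiB ashift this can never return true;
 * with 128 KiB records, shaving a single 4 KiB unit is enough */
```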
gildasio has quit [Remote host closed the connection]
<clever>
yep, i can confirm that in zdb, despite asking for lz4, a large chunk of the blocks arent compressed
<clever>
zfs realized it wasnt a net-gain, so it just stored the uncompressed version instead
gildasio has joined #osdev
<heat>
dh`: now that you're here, mjg told me to ping wrt rename and locks
<heat>
<heat> what do you think of a per-filesystem rename lock?
<dh`>
what about it? rename is not trivial and ~everyone out there has it wrong
<heat>
why wrong?
<dh`>
as far as I know there are only two correct ways to do it: one is with strict two-phase vnode locking and the other is with a per-volume rename lock
<dh`>
so yes
<heat>
linux is seqlocking these days
<heat>
soooooo i guess it seems to work
<heat>
although I don't fully understand how that works
<heat>
how likely is it that openbsd is using a global mutex or something weird like that?
<dh`>
openbsd almost certainly has the completely broken 4.4 code
<heat>
lmao
<dh`>
it is unlikely that they've changed it at all
<heat>
what makes it broken?
<dh`>
the 4.4 code will not deadlock at runtime (though mostly by accident) but it will cheerfully detach and lose sections of the namespace, so you don't realize anything bad happened until you get a fatal fsck failure
<dh`>
and I've seen that failure be unrecoverable by fsck, too (fortunately on a crash machine)
<heat>
ugh
<Iris_Persephone>
I am now leaning towards "implement POSIX as closely as you can, except for the parts where POSIX is stupid"
<dh`>
the 4.4 code just randomly locks and unlocks with no clear pattern
<heat>
rename is not stupid tbf
<heat>
it's just hard to implement
<dh`>
it obviously meant something to someone at some point but it never made any sense to me
<dh`>
there's a POSIX_MISTAKE in rename
<heat>
are you talking about 4.4's path/directory code in general or just rename?
<dh`>
which is: ordinarily if you rename a/b over c/d, it unlinks d and replaces it with a
<dh`>
but if you first hardlink b and d, posix says this does nothing at all rather than unlinking b
<clever>
heat: is this what you were saying? i ran `sync` in the CLI, and the file still claimed to be 512 bytes, then seconds later, 80M!
<dh`>
4.4's pathname and directory code in general is a mess but rename is quite extra
<heat>
clever, ack
<clever>
mrvn: also, because i raised my block size back to 128kb, lz4 doesnt have to work as hard to get savings (i fixed my free space)
<heat>
I should be fancier with rename
<dh`>
anyway, the fundamental nature of the problem is: if you rename a/b/c/d/e to f/g/h/i/j, you need to make sure e isn't an ancestor of j in the tree
<clever>
it still has to shave off at least 4kb, but 4kb off 128kb is far easier
<heat>
I'm doing unlinks + links as the base operations which seems unsafe
<dh`>
this check is nonlocal and so requires nonlocal locking, hence the per-volume lock
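A sketch of that nonlocal check under a per-volume rename lock, with hypothetical in-memory types and a pthread mutex standing in for whatever lock the kernel would use:

```c
#include <stddef.h>
#include <pthread.h>

/* hypothetical in-memory node: just enough to walk ".." upward */
struct vnode {
	struct vnode *parent;
};

struct volume {
	pthread_mutex_t rename_lock;	/* the per-volume rename lock */
};

/* true if `anc` is an ancestor of (or equal to) `node` */
static int is_ancestor(struct vnode *anc, struct vnode *node)
{
	for (struct vnode *v = node; v != NULL; v = v->parent)
		if (v == anc)
			return 1;
	return 0;
}

static int rename_dir(struct volume *vol, struct vnode *src,
                      struct vnode *dst_dir)
{
	int err = 0;

	pthread_mutex_lock(&vol->rename_lock);
	/* no concurrent rename can reshape the tree while we walk up from
	 * dst_dir, so this nonlocal check cannot race */
	if (is_ancestor(src, dst_dir))
		err = -1;	/* would make src a descendant of itself */
	else {
		/* ... lock the two directories in a safe order and relink ... */
	}
	pthread_mutex_unlock(&vol->rename_lock);
	return err;
}
```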
<clever>
heat: in the ZIL, rename is a recordable action, so it can just append that to the ZIL and fix the metadata later
<dh`>
you also need to make sure that you don't violate the locking order rules, which can depend on the location of the subpaths in the tree
<clever>
and things remain atomic and consistent after a power failure
<dh`>
unlink and link is fine for renaming regular files; it's dirs that make trouble
<dh`>
well, fine in the sense that if you crash in the middle you might be left with both names if you don't journal it
<heat>
yrs
<heat>
yea
<clever>
demindiro: but then hardlink counts?
<clever>
dh`: oops, ^
<dh`>
clever: if you don't increment the link count while doing the link/unlink you can detect that a rename was in progress but that violates an intended invariant and can cause you other problems
<clever>
now that i say that, i dont know how zfs manages link counts....
<dh`>
you know how with the original softupdates you're supposed to be able to fsck in the background while the system starts up?
<clever>
ah, in the bonus space on a dnode
<dh`>
never enable that, it's not safe
<dh`>
because of rename!
<heat>
lol
<dh`>
even with softupdates, rename must be fixed up afterwards by fsck or journal replay
<heat>
love me some rename
<dh`>
there's no sequence of writes that can make an operation atomic in two separate places at once
<clever>
dh`: zfs can do that with both the zil and the txg
<clever>
for the zil way, the rename is just appended to the log, on replay you try the rename again
<dh`>
so the failure mode is: rename /tmp/foo /tmp/bar/baz, crash in the middle, go to clean tmp, rm -r goes into /tmp/bar/baz but comes back out into /tmp, so it thinks it's one layer deeper than it is, then it comes out of /tmp and starts erasing the whole system
<clever>
for the txg, every transaction in that group must be committed to disk at once, and due to the cow/immutable nature, that involves creating a new set of state, where the rename has fully happened
<clever>
and that new state doesnt take effect until you update the uberblock, the single root of truth
<dh`>
clever: any reasonable journaling or snapshot-based solution deals with the problem
<clever>
if you get interrupted, the new state basically never existed
<clever>
yeah, this is both journaling and snapshotting
<clever>
the ZIL is a journal, while the cow/immutable nature of the main pool is snapshotting