<geist>
Clockface: note that when you say 'any cpu made in the last 7 years', x86 and everything not x86 are fairly different stories when talking about mode switches
<geist>
x86s are a lot faster than they used to be, but mode switching overhead is somewhat higher in general than, say, modern arm and riscv cores
<geist>
per clock at least
<geist>
there's just a lot more work to do on x86, and though they're fast for what they have to do, the microcode still has to do a lot
<mrvn>
geist: Do you know how many clock cycles an FIQ takes on ARM or IRQ on ARM64?
mctpyt has joined #osdev
<geist>
well depends highly on the core and i think there's an implicit memory barrier so it may end up stalling a bit
<geist>
but in general it's like 10s of cycles
<geist>
i have measured in the low ns on a reasonably modern core running at a few Ghz
<geist>
on OOO cores there is an implicit pipeline stall though, but taking an IRQ out of WFI might be faster
<mrvn>
no need to finish the pipeline from WFI state
<geist>
right
<geist>
it *could* start actually implementing the front part of the interrupt (saving the state, etc) at the time the WFI starts
<mrvn>
right, that would be smart
<geist>
i dont think it can on modern armv8 because there's a lot of other things that can happen in a WFI but the idea is something like that
<geist>
i know of some 8 bit micros that did the same thing. apparently even the original 8086 does an interrupt start at HLT
<geist>
it just pauses until the actual event happens
<geist>
since all the state written out to the stack is the same if it is written now or in the future
<mrvn>
it probably runs that so the pipeline gets flushed and such.
<mrvn>
the stack should be observable by other cores.
* geist
nods
<geist>
and possibly the mapping could be changed out from underneath it by another core modifying the page tables
<geist>
so possibly that optimization is no longer feasible
<geist>
also could trap into SMM, etc. but original 8086 didn't have this problem
<mrvn>
Don't think that would work as the stack should be in the TLB.
<mrvn>
You would have to manipulate it so the stack is right at a page border so the next write does a new page walk etc
<mrvn>
or INVLPG the stack before HLT
* geist
nods
<geist>
or at least it's within the architectural spec because you didn't invlpg it
<geist>
ie, it conforms to the model of what you should observe
<mrvn>
it could also just prepare the write in the write back buffer or something.
mctpyt has quit [Ping timeout: 252 seconds]
wand has quit [Remote host closed the connection]
[itchyjunk] has quit [Ping timeout: 255 seconds]
[itchyjunk] has joined #osdev
wand has joined #osdev
vin has joined #osdev
<vin>
/join #fosdem2023:fosdem.org
<vin>
Oops wrong window!
slidercrank has quit [Ping timeout: 248 seconds]
gildasio1 has quit [Remote host closed the connection]
gildasio1 has joined #osdev
spikeheron has quit [Quit: WeeChat 3.8]
immibis_ has quit [Ping timeout: 246 seconds]
immibis_ has joined #osdev
vin has quit [Ping timeout: 252 seconds]
dutch has joined #osdev
joe9 has joined #osdev
joe9 has quit [Quit: leaving]
joe9 has joined #osdev
wootehfoot has quit [Ping timeout: 252 seconds]
joe9 has quit [Quit: leaving]
gog has quit [Ping timeout: 248 seconds]
joe9 has joined #osdev
heat has quit [Ping timeout: 260 seconds]
dude12312414 has quit [Remote host closed the connection]
unimplemented has quit [Read error: Connection reset by peer]
Brnocrist has joined #osdev
Brnocrist has quit [Ping timeout: 252 seconds]
Brnocrist has joined #osdev
unimplemented has joined #osdev
<gog>
yes
<gog>
i figured out how not to do that
<gog>
now when i render a glyph it's to a bitmap that is currently on the stack but in the future i'm going to cache them
<gog>
this might be a premature optimization
<gog>
but it feels better than writing directly to the framebuffer
sortie has quit [Quit: Leaving]
Turn_Left has joined #osdev
sortie has joined #osdev
Left_Turn has quit [Ping timeout: 252 seconds]
<clever>
gog: in some situations, it may help performance to memcpy an entire line of the glyph into the framebuffer at once, depending on how the caches and bus are configured
<gog>
YES
<gog>
yes
<gog>
that's what i'm doing now
<gog>
although i'm not sure how much it matters in an emulator
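A minimal sketch of the row-at-a-time copy discussed above, assuming a linear 32-bpp framebuffer and a glyph that has already been rendered into a small bitmap. The function name `blit_glyph` and its parameters are hypothetical, not taken from gog's code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a pre-rendered glyph bitmap into a linear 32-bpp framebuffer,
 * one full glyph row per memcpy. `pitch` is the framebuffer stride in
 * bytes; all names here are illustrative assumptions. */
static void blit_glyph(uint8_t *fb, size_t pitch,
                       size_t x, size_t y,
                       const uint32_t *glyph, size_t gw, size_t gh)
{
    for (size_t row = 0; row < gh; row++) {
        uint8_t *dst = fb + (y + row) * pitch + x * sizeof(uint32_t);
        memcpy(dst, glyph + row * gw, gw * sizeof(uint32_t));
    }
}
```

Whether this beats per-pixel stores depends on the memory type and bus of the real hardware, which an emulator may not model.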
<Ermine>
gog: may I pet you
<gog>
yes
* Ermine
pets gog
* gog
prrr
* Ermine
awwws
leon has joined #osdev
tejr has joined #osdev
nikolar has joined #osdev
yyp has joined #osdev
MrPortmaster has joined #osdev
Brnocrist has quit [Ping timeout: 260 seconds]
<Ermine>
We're GOAT
<MrPortmaster>
FOSDEM gang
unimplemented has quit [Read error: Connection reset by peer]
<heat>
i would like to do that but I realize it's impossible or maybe non-obvious for a terminal
<heat>
since your bg/fg may change
<gog>
true
<gog>
i might render them without color and use them as masks in the case of hardware that supports mask sprites
<moon-child>
hmmm I guess you could change the colour inline
<moon-child>
though that gets dicey as soon as you add aa
<gog>
but then you're basically just re-rendering the glyph each time anyway
<moon-child>
well, it's less memory traffic
<moon-child>
(see, this is why I don't like software rendering)
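A rough sketch of the "render without color, use as a mask" idea in software, assuming a cached 1-bit-per-pixel glyph with rows packed MSB-first. The layout and the name `draw_masked_glyph` are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Expand a cached 1-bit-per-pixel glyph mask into a 32-bpp framebuffer,
 * choosing fg or bg per pixel. Mask rows are packed MSB-first, one
 * byte per 8 pixels; that layout is an assumption for this sketch. */
static void draw_masked_glyph(uint32_t *fb, size_t stride_px,
                              size_t x, size_t y,
                              const uint8_t *mask, size_t gw, size_t gh,
                              uint32_t fg, uint32_t bg)
{
    size_t bytes_per_row = (gw + 7) / 8;
    for (size_t row = 0; row < gh; row++) {
        uint32_t *dst = fb + (y + row) * stride_px + x;
        const uint8_t *src = mask + row * bytes_per_row;
        for (size_t col = 0; col < gw; col++) {
            int set = (src[col / 8] >> (7 - (col % 8))) & 1;
            dst[col] = set ? fg : bg;
        }
    }
}
```

The cached mask stays valid when the terminal's fg/bg changes, at the cost of touching every pixel again on each draw, which is the trade-off gog and moon-child are weighing.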
<gog>
yeh
<Ermine>
ddevault: I forgot what was on that slide, but does helios provide any API to map virtual addresses to physical? It's useful for DMA implementation
<heat>
moon-child, no AA, this is bitmap mr moon
<moon-child>
yeah, hence 'as soon as you add'
<moon-child>
gog: proggy
<heat>
"never"
<heat>
there you go
<moon-child>
:<
<gog>
goggers is proggers
<heat>
doing AA inside the fucking kernel sounds depressing
<gog>
this is eventually going to move outside the kernel
<gog>
i just need something to show me text rn
<gog>
so i don't have to watch the serial console for everything
<heat>
no unix??
<moon-child>
didn't windows do text rendering inside the kernel
<gog>
yes
<moon-child>
and then had a fucktonne of vulnerabilities
<moon-child>
lol
<gog>
i think the TTF engine in kernel still exists?
<Ermine>
gog: what about vesa/gop?
<gog>
Ermine: i'm using GOP here
<gog>
or rather, passing the pointer and pixel format and pitch forward to the kernel
<Ermine>
ah wait, it doesn't have text mode
<gog>
yeah
<Ermine>
okay, okay..
<gog>
so i'm forced to implement a halfass framebuffer console
<heat>
right-wing gog using gop
<gog>
what
<gog>
oh
<gog>
lol
<Ermine>
heat: you mean the party?
<gog>
yes he's making a pun
<gog>
he's not funny
<moon-child>
it's only not funny because you explained the joke
<heat>
i am hilarious
<ddevault>
Ermine: yes
Brnocrist has quit [Ping timeout: 265 seconds]
<Ermine>
ok, maybe will take a look at it
<Ermine>
well, it's too different to compare with minix imo
<zid>
I like my half assed framebuffer console
<zid>
I'm just not sure when I should update it
<heat>
now
<zid>
okay we're on frame 1 then
<phr3ak>
anyone have experience how gd32 is compatible with stm32f4xx? I want to patch an stm32f429 bootloader to work on gd32f450.
Brnocrist has joined #osdev
unimplemented has quit [Read error: Connection reset by peer]
<x8dcc>
my framebuffer console rocks
<zid>
does yours have ayame bg
<x8dcc>
mine has gimp peppers on top
<zid>
yea mine is way better then
<x8dcc>
because I can initialize it with position and dimension sheeeesh
<x8dcc>
zid: probably. I have had a look at your os btw
<zid>
'os' :D
<x8dcc>
it has been useful, but I have lots of questions
<zid>
I bet
<zid>
I have lots of questions and I wrote it
<clever>
x8dcc: my framebuffer console is using sprites to do scrolling with zero copying, and its actually too fast, i need to throttle it on vsync or it glitches out
<x8dcc>
not related to boros, its just that I am not sure what should I focus on
<x8dcc>
clever: what do you mean by zero copying?
<zid>
it's composited
<zid>
the gpu does the scroll by moving the geometry around
<clever>
yep
<zid>
rather than moving pixels around in a big texture
<zid>
with memcpy
<clever>
exactly
<zid>
(aka blitting)
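clever's console does this with hardware sprites on the Pi. A CPU-only analogue of the same idea, sketched under the assumption of a plain text-cell console, is to keep rows in a ring and scroll by moving an index instead of moving row contents. The struct and function names are hypothetical.

```c
#include <stddef.h>

#define CON_ROWS 4   /* tiny values so the sketch is easy to follow */
#define CON_COLS 8

/* Text console whose rows live in a ring: scrolling advances `top`
 * instead of moving row contents with memmove. */
struct ring_console {
    char cells[CON_ROWS][CON_COLS];
    size_t top;   /* index of the row currently shown at screen row 0 */
};

/* Map a screen row to its backing row in the ring. */
static char *con_row(struct ring_console *c, size_t screen_row)
{
    return c->cells[(c->top + screen_row) % CON_ROWS];
}

/* Scroll up one line: the old top row becomes the new, blank bottom row. */
static void con_scroll(struct ring_console *c)
{
    char *recycled = c->cells[c->top];
    for (size_t i = 0; i < CON_COLS; i++)
        recycled[i] = ' ';
    c->top = (c->top + 1) % CON_ROWS;
}
```

Only one row is touched per scroll. Something still has to present the rows in the right order, which is the part the GPU compositor handles in clever's setup.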
<x8dcc>
oh, I don't mess with gpu
<zid>
he's on a pi so he gets to cheat
<x8dcc>
I mess with the peppers if you know what I mean
<nikolar>
well that doesn't count
<zid>
x8dcc: Are you going to ask any of those questions
<clever>
nikolar: i still had to init the gpu from scratch!
<clever>
id say that counts :P
<moon-child>
didn't you have to reverse engineer it
<nikolar>
let's see you write an intel/amd driver huh
Brnocrist has quit [Ping timeout: 252 seconds]
<clever>
moon-child: i had to read the linux drivers and write my own docs from that
<x8dcc>
zid: well more than a specific question is just that I don't know what should I do next
<zid>
oh boring
<clever>
moon-child: and decompile some firmware for the missing init steps
<nikolar>
clever: i'll give you half points :P
<x8dcc>
I made a simple shell, which is called by the kernel
<x8dcc>
not a module or anything, uses libk too
<x8dcc>
and I would like to make a proper userspace and what not, but I have to do some stuff before
<clever>
nikolar: i still dont have hdmi working, i think the PLL is in another power domain, and linux cant bring that online
<nikolar>
really
<nikolar>
got to love embedded platforms
<clever>
linux relies on the closed firmware to bring the PLL online
<x8dcc>
I have been thinking about doing something related to ext2 or fat, which is one thing I am not sure about
<clever>
so the instant you replace that firmware, the PLL just stays dead
<x8dcc>
but I don't have paging yet, and I don't know if I really need it for now
<nikolar>
clever: kind of leaves you with no options
<nikolar>
wasn't there a project to rewrite raspi firmware as open source
<clever>
nikolar: i still have 2 options, DPI (which can convert to vga) and composite (ntsc/pal)
<moon-child>
is it the kind of thing where you need a key to bring it up?
<moon-child>
or just no one did the legwork yet
<clever>
nikolar: thats what ive been working on
<nikolar>
oh nice
<clever>
moon-child: i think its just an undocumented control register needing a bit flipped
<clever>
but the problem is which register? which bit?
<x8dcc>
so processes don't write to other processes memory?
<heat>
2) process isolation 3) with virtual addressing you stop needing contiguous physical memory
<heat>
1) is mprotect, 2) is processes don't write to other processes' memory
<x8dcc>
heat: yeah, for example with process isolation, I was not sure if I needed it before multitasking
Brnocrist has joined #osdev
<nikolar>
clever: lol
<nikolar>
love the commodore colourscheme too
<clever>
x8dcc: the mmu also helps with load-address, a lot of binaries break if you load it to the wrong addr in ram
<clever>
you can do relocation patching, but that can be tricky
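A minimal sketch of what relocation patching can look like without an MMU, assuming the simplest possible scheme: a flat table of offsets, each pointing at a pointer-sized absolute address that was computed for the link-time base. The table format and the name `apply_relocs` are invented for this example and do not match any particular object format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Apply "relative" style relocations: every listed offset holds an
 * absolute address computed for `link_base`; rebase it to `load_base`.
 * The flat offset-table format is made up for this sketch. */
static void apply_relocs(uint8_t *image, size_t image_size,
                         const size_t *reloc_offsets, size_t nrelocs,
                         uintptr_t link_base, uintptr_t load_base)
{
    uintptr_t delta = load_base - link_base;
    for (size_t i = 0; i < nrelocs; i++) {
        size_t off = reloc_offsets[i];
        if (off > image_size || image_size - off < sizeof(uintptr_t))
            continue;               /* skip entries outside the image */
        uintptr_t value;
        memcpy(&value, image + off, sizeof value);  /* unaligned-safe */
        value += delta;
        memcpy(image + off, &value, sizeof value);
    }
}
```

The tricky part clever alludes to is that real binaries need the toolchain to emit such a table in the first place, and real formats have several relocation kinds rather than one.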
<clever>
nikolar: the background border even works nearly identically to how it does on the c64, the 2d core on the rpi has a background fill option, so you can change that whole border with a single register write
<x8dcc>
clever: yeah, I had a look at stuff related to processes and I saw where they were loading them, and I felt like that could be a problem
<clever>
you can even change it in hsync, like on the c64, and profile code with that
<gog>
can qemu vga accelerate
<gog>
i'm not finding a lot in the way of docs here
<heat>
wdym
<Ermine>
gog: wsym
<Ermine>
wdym
<heat>
Ermine, omg twiiiiiiiiiiinssss
<gog>
like can i put sprites in it and tell it to move the sprites around without doing it in "software"
<gog>
i know it's all software and it's not really accelerated, just more of a shortcut for the next things i need to do
<clever>
gog: the mouse cursor is one form of that
<zid>
yea vga is dumb as hell
<zid>
when qemu-voodoo2
<x8dcc>
heat: so for example now I am using a heap allocator I made for alloc and all that, so now I would have to change that to allocate with paging? this is the kind of stuff that confuses me
<gog>
yes
<heat>
gog, not in vga, but qxl and vmware and virtio-gpu support that
<gog>
x8dcc: you can build allocators atop allocators
<heat>
^^
<gog>
ok i'll look at virtio-gpu
<clever>
gog: i think most dumb desktop gpu's are limited to 2 sprite-like overlays, a small one for the cursor, and a large one for accelerated video playback, plus the main framebuffer
<clever>
and the rest all goes into the 3d half of things
<x8dcc>
gog: not sure I understood that :/
<gog>
x8dcc: your heap allocator is going to be atop something like SLAB which is atop a page frame allocator
<gog>
it's allocators all the way down
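A toy illustration of the layering gog describes, reduced to two layers: a page frame allocator at the bottom and a small-object allocator that refills itself from it. This is a sketch under simplifying assumptions (a static pool, a bump allocator instead of a real slab, no freeing), and every name in it is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define NUM_PAGES 8u

/* Bottom layer: a toy page frame allocator over a static pool. */
static _Alignas(16) uint8_t pool[NUM_PAGES][PAGE_SIZE];
static size_t next_free_page;

static void *page_alloc(void)
{
    if (next_free_page == NUM_PAGES)
        return NULL;
    return pool[next_free_page++];
}

/* Upper layer: a bump allocator for small objects that refills itself
 * from the page allocator. Nothing is ever freed in this sketch. */
struct bump_heap {
    uint8_t *cur;   /* next free byte in the current page */
    size_t   left;  /* bytes remaining in the current page */
};

static void *heap_alloc(struct bump_heap *h, size_t size)
{
    size = (size + 15u) & ~(size_t)15u;      /* keep 16-byte alignment */
    if (size == 0 || size > PAGE_SIZE)
        return NULL;                         /* too big for this toy */
    if (size > h->left) {                    /* page exhausted: get another */
        h->cur = page_alloc();
        if (!h->cur)
            return NULL;
        h->left = PAGE_SIZE;
    }
    void *p = h->cur;
    h->cur += size;
    h->left -= size;
    return p;
}
```

With paging enabled, only `page_alloc` has to change (it would hand out mapped pages); the layer above keeps the same interface, which is the point of stacking them.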
<moon-child>
gpu should have more options for latency oriented stuff
<moon-child>
the pipelines are too deep!
<gog>
not that any of this is really necessary i just don't want to deal with moving huge bitmaps around in my cheap framebuffer console code
<clever>
the video surface tends to be chroma-key'd into the main framebuffer, and accepts yuv
<Ermine>
If you want to try virgl, idk if it stable right now
<bslsk05>
'Chaos, 13 sprites randomly bouncing around' by michael bishop (00:00:12)
<heat>
the rest honestly kind of sucks
<Ermine>
heat: At least I didn't manage to make it work in my virtual machines
<moon-child>
zid: what's the diff?
<Ermine>
(or is it supposed to be laggy?)
<heat>
Ermine, did you do gl=on
<Ermine>
Is it qemu option? Idk then, I use libvirt
<heat>
yes, qemu option
<heat>
libvirt is libcringe
<Ermine>
I should sit down and learn how to use qemu directly, I know
<nikolar>
clever: suspiciously, my cpu runs at like 60% of a single core even though it should have hardware acceleration for video decoding
<clever>
nikolar: use top to find out which pid is to blame, then `perf top -p <PID>` to see what function is to blame
<heat>
hardware video accel sucks in linux
<heat>
like most things suck in linux
<clever>
heat: a common issue i see on the rpi, is that xorg wants rgb images, so even if the gpu can accept a yuv frame, you're wasting cpu converting yuv->rgb, to pass it off to xorg
<clever>
similar issues may occur on other hardware
<clever>
because xorg is equally dumb on all platforms
<zid>
if your gpu only does yuv I feel bad for it
<clever>
more, that the gpu can accept both rgb and yuv, so you can just feed the yuv frames in as a composition layer
<clever>
but xorg doesnt allow that, and demands you convert to rgb first
<Ermine>
clever: is it problem on wayland? If the client gives you rgb buffer you still need to convert it
<zid>
that seems totally fine
<zid>
the alternative is that it settled on yuv, and fuck that
<clever>
Ermine: i know less about wayland, but one of the wayland backends uses kms for composition, and that could fully hw accelerate things
<nikolar>
clever: libavcodec.so.59.37.100, don't have debug symbols though
<clever>
but kms is limited to 32 surfaces on linux i think
<clever>
nikolar: definitely sounds like its not hw accel
<nikolar>
yeah 60% is suspicious as i said
<nikolar>
didn't expect i'd go down the hardware acceleration rabbit hole today
<clever>
if i perf top a chrome renderer process for YT, but its paused, its peaking at only 4% cpu in a single function, some js stuff
<clever>
if i then play the video, some changes to the top functions, but nothing over 4%
<bslsk05>
www.cpubenchmark.net: AMD Ryzen 5 5500U vs AMD FX-8350 Eight-Core [cpubenchmark.net] by PassMark Software
<moon-child>
cus it's waking up, checking time, and at the same time sampling that in the profile
<nikolar>
moon-child: yeah guess it could be
<clever>
nikolar: your cpu is about 2.1x faster than mine, and uses 10% of the wattage, wut?
<nikolar>
why is it higher on a more demanding video
<nikolar>
clever: laptop cpu from 2021 ¯\_(ツ)_/¯
<clever>
nikolar: amd cpu from when people thought amd couldnt recover and would go under, lol
<clever>
thats just how bad mine is :P
<nikolar>
clever: lol yeah
<zid>
power usage scales with like, voltage squared, and they like to drive desktop chips at higher power levels for more cycles
<nikolar>
amd made massive jumps in the past few years
<zid>
If I want 5GHz on my cpu rather than 4GHz I have to take it from 120W to 200W :P
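The "voltage squared" remark matches the usual first-order model of CMOS dynamic power, P proportional to C·V²·f. A small sketch of that model, ignoring leakage and assuming constant switched capacitance; the function name is made up and the example numbers below are illustrative, not measurements from zid's machine.

```c
/* First-order CMOS dynamic power: P is proportional to C * V^2 * f.
 * Returns how much dynamic power grows going from (v0, f0) to (v1, f1),
 * ignoring leakage and assuming constant switched capacitance. */
static double dynamic_power_ratio(double v0, double f0, double v1, double f1)
{
    return (v1 * v1 * f1) / (v0 * v0 * f0);
}
```

For example, `dynamic_power_ratio(1.0, 4.0, 1.2, 5.0)` comes out at about 1.8: a 25% clock bump that needs 20% more voltage costs roughly 80% more dynamic power under this model.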
<nikolar>
not to mention smaller process
<nikolar>
zid: moar power
<zid>
need more cooling for that though
<clever>
that reminds me, in the past, i could set off the UPS overload alarm, just by maxing all 8 cores
<zid>
pair of £5 fans not cutting it for 5GHz
<nikolar>
clever: get a better ups :P
<clever>
nikolar: it was running half my monitors, and the desktop
<clever>
and it didnt warn, if i turned a monitor off
<nikolar>
lol that would do it
Brnocrist has quit [Ping timeout: 255 seconds]
<clever>
but it has since died, i forget why, and been replaced with one that doesnt complain, but does die after just 10mins of outage
<clever>
and its only running 1 monitor!
<nikolar>
does it at least warn you to shut off your computer in time
<clever>
not really
<clever>
no low-battery alarm, just *click*, its dead
<heat>
the slander!
<heat>
the fx 8350 is the greatest heater that amd ever made
<nikolar>
doesn't sound particularly useful as ups
<zid>
fx 8350 actually really good cpu
<zid>
it had the single core world record for like 15 years
<nikolar>
heat: does it compete with pentium 4s though
<clever>
nikolar: it covers short blips, and gives me time to do a proper shutdown, but not much more
<nikolar>
but you might not even notice the power was out in time
<nikolar>
if there's no indication
<moon-child>
oh yeah
<moon-child>
zid: I got your favourite cpu
<zid>
w-2195?
<moon-child>
2295
<zid>
w-1390p?
<zid>
2295 is sick af
<moon-child>
it doesn't actually clock all the way up though
<zid>
1390p > 2295 though
<clever>
nikolar: outages are very noticeable, half my monitors die, and the fridge makes a sudden clunk noise, then the whole house goes silent
<moon-child>
idk why. But it goes up to 3.7ghz, and maintains that on all cores, so I'm happy
xenos1984 has quit [Read error: Connection reset by peer]
<moon-child>
zid: :<
<zid>
why did you break your 2295
<nikolar>
clever: well if it works ¯\_(ツ)_/¯
<moon-child>
zid: that literally has two memory channels
<moon-child>
and 8 cores
<zid>
yea but it's so fast
<moon-child>
wtf is the point of a cpu with only 8 cores and 2 memory channels
<zid>
5.3Ghz out of the box
<moon-child>
can't compute anything with that
<zid>
2295 is for webserver
<zid>
well, workstation
<zid>
1390p for desktop
<zid>
and what was the webserver one like.. 3175? I don't remember
<zid>
2225 also very good
<zid>
W-3375 is the webserver
<moon-child>
2225 has 4 cores
<moon-child>
that's even less than 1390p
<zid>
4.6GHz and quad though
<zid>
and 1/3rd the price
<zid>
and actually exists, afaik, 1390p is paper launch
<moon-child>
only 1/3?
<heat>
xeon weirdos
<heat>
use normal CPUs
<moon-child>
but fast
<moon-child>
and ecc
<zid>
and 2011 is fully unlocked :D
<clever>
nikolar: but recently, a big storm has blown thru, it got down to -25c, heavy wind, and 7 interrupts to the power in the last week!
<zid>
why the 2011 xeons are unlocked nobody knows
<zid>
but they am
<nikolar>
clever: that's a lot
<zid>
They're all K or X cpus as intel would name them
<moon-child>
if I don't ecc, I might end up with the wrong bits
<nikolar>
where does it get to -25c lol
<moon-child>
WRONG BITS!
<zid>
1620 is a K, 1650 is an X
<zid>
and they're $20
<clever>
nikolar: some were short 20min outages, enough to kill the ups, some were planned 4h outages, and some were just a blip, under 1 second, enough to reboot everything
<moon-child>
zid: I think there was some stuff in bios where I could adjust cpu voltage
<nikolar>
clever: under 1s?
<zid>
yea I always turn mine down usually
<clever>
nikolar: NB canada
<nikolar>
isn't that why you have ups
<nikolar>
clever: ah makes sense
<zid>
My cpu out of the box idles at 1.1V but it's stable at 0.8V
<clever>
nikolar: yep, but the UPS only covers 1 machine, i have another 2 rooms with computers
<zid>
saves like 40W
<clever>
so i would need 2 more UPS's to cover everything
<zid>
just need to up the load-line so that it doesn't die at full turbo from not enough wolts
<zid>
It's a stupid skinned window that isn't screenshotting properly
<clever>
zid: that sounds similar to the voltage/freq ratios in arm/rpi, where you need to raise the voltage before raising the clock
<zid>
I have it scale from 0.8V to 1.3V between 800MHz and 4.4GHz
<clever>
thats also something i have not yet tried messing with on the open firmware
<clever>
and i suspect its part of my instability
<clever>
the voltage is whatever the reset-default is, and i'm probably clocking it too much for that
<zid>
I can also do all this same shit on my dram ofc
<zid>
all per channel
<clever>
yep, the rpi has 3 separate voltage settings for the dram
<nikolar>
clever: or you could lock it to the lowest frequency :P
<clever>
phy, io, and core
<zid>
I can make different channels run on different PWM frequencies and stuff, it's silly :D
<clever>
dang!
<clever>
ive only got 1 dram channel, so not much to play with there
Brnocrist has joined #osdev
<zid>
I have four but they're paired sometimes for some options
<zid>
slot AB and slot CD
<zid>
but I used paired kits so it doesn't impact me
<zid>
imagine being able to find 4 of the same dimm
<clever>
my desktop came with a pair of ram sticks, and i later bought a second pair to upgrade it
<clever>
but that 2nd pair has since turned faulty
<zid>
I found out about a nice cheap dimm *after* I already bought different ones, like within a week, kinda sad
<zid>
so now I'd need 4 new ones again, fuck that
<zid>
16GB 1066MHz urdimms exist and aren't anywhere near as expensive as I would have guessed, considering the price of 8GB 933MHz urdimms was teetering on ancient relic sealed inside a tomb prices
<zid>
I found a cheap kit of the latter and thought I did really well
<nikolar>
zid: what do you need 16gb 1066mhz ram sticks for
<zid>
so that I can have lots of fast ram?
<zid>
surely that's better than 4GB of 800MHz ram
<nikolar>
well i wouldn't call 1066 fast
xenos1984 has joined #osdev
<zid>
1066 is very fast
<heat>
ddr3 moment
<zid>
I legit see people recommending single channel for ddr4 cus it trains so badly :D
<zid>
you have to back off all the timings to make two dimms of it work in the same postcode
<mrvn>
With ram why aren't we driving 4 DIMMs in parallel instead of just 2?
<mrvn>
or have DIMMs with twice the number of chips on it and have the controller on the DIMM alternate them?
<moon-child>
at a guess: cost ineffective vs capacity
<moon-child>
servers have a shitton of channels _and_ a shitton of dimms
<mrvn>
But instead of driving 8 chips at 1066MHz you could drive 16 chips at 533MHz needing less power and producing less heat and requiring less tolerances on the chips.
<dinkelhacker>
clever: The pi sets all the interrupts to group 1 in the armstubs. Qemu seems to do the same. Do you know if this is part of a protocol? I checked the linux boot protocol and haven't found a hard requirement for that.
<mrvn>
dinkelhacker: isn't that so a non-smp kernel can boot?
<mrvn>
an armv6 kernel won't know how to set the group
<dinkelhacker>
Hm.. I don't see what it has to do with smp? You have to be in secure world to do that. And according to the linux boot protocol the kernel can be booted in EL2 or non secure EL1. So it kinda makes sense to do that. I was just wondering if that is something you can rely on when targeting platforms that also support linux.
<mrvn>
don't rely on anything.
<dinkelhacker>
well some things you have to rely on^^
<clever>
dinkelhacker: i think its also, because once you drop to NS mode, you cant change the groups, and the armstub doesnt have any EL3 handlers
<clever>
so once you go to NS mode, secure mode is permanently lost
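For reference, a sketch of what "set all the interrupts to group 1" amounts to on a GICv2-style distributor, where each `GICD_IGROUPRn` register carries one bit per interrupt ID and a set bit means Group 1. The function name is hypothetical and the caller is assumed to pass a pointer to `GICD_IGROUPR0` (distributor base + 0x080); this is not copied from the Pi armstub.

```c
#include <stdint.h>

/* GICv2 distributor: GICD_IGROUPRn holds one bit per interrupt ID,
 * 32 IDs per register, and a set bit places that interrupt in Group 1.
 * `igroupr` is assumed to point at GICD_IGROUPR0 (distributor base +
 * 0x080); on real hardware this has to run in the secure world. */
static void gic_all_to_group1(volatile uint32_t *igroupr, unsigned num_irqs)
{
    unsigned nregs = (num_irqs + 31u) / 32u;
    for (unsigned i = 0; i < nregs; i++)
        igroupr[i] = 0xFFFFFFFFu;
}
```

Doing this before dropping to non-secure state is what lets a non-secure kernel see and manage those interrupts at all, which fits clever's point that nothing can fix it up afterwards.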
<dinkelhacker>
Is EL2 always non secure?
<clever>
EL2/hypervisor can be both secure and non-secure
<clever>
its up to EL3 to decide which one it will be when EL3 does eret
<dinkelhacker>
Ok. So it could just boot in secure el2
<clever>
and only by trapping back to EL3 (such as the smc opcode) can you change that
<clever>
dinkelhacker: i think some things expect NS mode, and will malfunction in S mode
<dinkelhacker>
I did originally boot the kernel in el3 and did all that myself. But maybe I can get rid of that stuff and just expect it to be done right until I'll stumble on something where it is not done that way.
<clever>
i think EL3 doesnt support the mmu?
<clever>
and EL2 can lack a high/low split in its MMU, so supporting a userland is tricky
<clever>
linux+kvm runs in EL1, but has a stub in EL2 for accessing protected registers
<dinkelhacker>
I think it does.. but anyway I was just setting up the gic and dropping to el1
<clever>
and any time linux+kvm context switches to a guest, it has to go EL1(host)->EL2->EL1(guest)
<clever>
which is a performance cost
<clever>
there is an EL2 extension, that allows EL2 to act more like EL1
<clever>
all of the EL1_ registers become aliases, pointing to the respective EL2 version, and EL2 gains a high/low split
<nikolar>
clever: apparently i can't make firefox use hardware acceleration
<clever>
so a relatively unmodified "EL1" linux can run in EL2 instead, and then it becomes EL2(host)->EL1(guest)
<clever>
saving a step
<dinkelhacker>
nice
<clever>
the rpi lacks that EL2 extension
<gog>
mew
<nikolar>
mew
xenos1984 has quit [Ping timeout: 246 seconds]
xenos1984 has joined #osdev
<moon-child>
mew
<nikolar>
clever: apparently it's better to watch a youtube video through mpv than firefox
<nikolar>
for acceleration
<clever>
nikolar: there is also software like freetube i think, which can do similar
fedorafan has quit [Ping timeout: 252 seconds]
<nikolar>
i just can't get ff to do hardware acceleration
foudfou has quit [Ping timeout: 255 seconds]
<clever>
nikolar: i have heard that chrome does do hw accel
fedorafan has joined #osdev
nyah has joined #osdev
<nikolar>
i don't know if i need to enable something but it's at like 100% on one core
foudfou has joined #osdev
<GeDaMo>
nikolar: in Firefox, try Help > More Troubleshooting Information, there's a graphics section there
<nikolar>
i followed instructions on arch wiki
<mrvn>
oh no, a follower. :)
srjek|home has joined #osdev
srjek has quit [Ping timeout: 252 seconds]
<nikolar>
mrvn: kek
epony has joined #osdev
wootehfoot has joined #osdev
xenos1984 has quit [Ping timeout: 260 seconds]
Turn_Left has joined #osdev
srjek has joined #osdev
xenos1984 has joined #osdev
<dinkelhacker>
my dtb is 1MB but only has data until ~0x2200 rest seems to be zeroed. Does anybody know why?
<mrvn>
how do you know it's 1MB?
<dinkelhacker>
Well it's the size of the file and also if I look into the header
srjek|home has quit [Ping timeout: 260 seconds]
Left_Turn has quit [Ping timeout: 252 seconds]
<dinkelhacker>
according to the fdt spec the second uint32 is the length of the dtb which is 0x1000 in my case. After swizzling that it's 0x100000
bgs has quit [Remote host closed the connection]
<heat>
probably just padding
<zid>
or their build tool doesn't fill it in automatically so they just said 1MB should be plenty for forever
<dinkelhacker>
Ok yeah it's the one qemu generates. Probably that is one size fits all
wootehfoot has quit [Read error: Connection reset by peer]
<mrvn>
swizzling?
<mrvn>
Is 0x1000 maybe the length rounded up to the next page?
joe9 has quit [Quit: leaving]
<dinkelhacker>
mrvn, changing endianness. 0x1000 le is 0x100000 be
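The confusion above comes from the flattened devicetree header being big-endian: the magic sits at offset 0 and `totalsize` at offset 4. A small sketch that reads the field byte by byte so host endianness does not matter; `be32_load` and `fdt_total_size` are names chosen for this example, not libfdt functions.

```c
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC 0xd00dfeedu

/* Read a big-endian 32-bit value regardless of host endianness. */
static uint32_t be32_load(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return the totalsize field of a flattened devicetree blob, or 0 if
 * the buffer is too short or the magic does not match. */
static uint32_t fdt_total_size(const uint8_t *blob, size_t len)
{
    if (len < 8 || be32_load(blob) != FDT_MAGIC)
        return 0;
    return be32_load(blob + 4);
}
```

The bytes `00 10 00 00` read this way give 0x100000 (1 MiB), while a naive little-endian load of the same bytes gives 0x1000.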
troseman has joined #osdev
micttyl has joined #osdev
<geist>
my guess is they just padded it out
<geist>
i've seen that when doing dumpdtb with qemu
<geist>
it's the space reserved for the largest possible dtb they could ever have
<clever>
that reminds me, i padded my dtb out to 16mb a few months ago, that entirely broke linux
<clever>
had to reduce it down to a more sane value
<mrvn>
what did you put in there?
<geist>
i suppose one could stick entire binaries in it, or other binary data
<geist>
images, etc
<mrvn>
is there a entry for that in the specs?
<geist>
the specs are pretty ad hoc in places
GeDaMo has quit [Quit: That's it, you people have stood in my way long enough! I'm going to clown college!]
<clever>
mrvn: there is a size field near the start, that tells you how big of an area the dtb covers, but thats unrelated to where the end marker actually lives
<clever>
its more of a safety/helper, so you can just memcpy the whole thing without parsing it
<clever>
and if you go past the end, something has gone wrong
<mrvn>
obviously.
<clever>
and libfdt also uses that size to enforce not writing past the end of an allocated buffer
<clever>
so i just whacked it to 16mb, the size of my buffer
<mrvn>
have you checked why linux blows up?
fedorafan has joined #osdev
<clever>
mrvn: not fully, i assume there is some max size its willing to reserve and copy/parse
<clever>
libfdt also has a shrink function, that will parse the tree, find out the minimum size, update the length field, and report that size
<clever>
that entirely resolved the issue
<mrvn>
I could understand it running out of memory for page tables to map it or something.
micttyl has quit [Quit: leaving]
<clever>
yeah, there might be an area of early ram, that it assumes is just available, enough to get the mmu online, and parse the dtb later
<clever>
and it just blindly respects that size field
<geist>
yah libfdt is probably almost more useful in scenarios where you're actually building and modifying the DTB
<geist>
ie, a boot loader
<geist>
more than half its routines are modifying things
<mrvn>
geist: it's huge for just parsing the dtb.
<clever>
yep, thats exactly what i was doing as well, loading an existing dtb file, expanding it, and then modifying some bits
<geist>
yah but most of that strips out if you only call a few routines
<heat>
huge?
<heat>
it's pretty small
<geist>
most of the routines are leaf nodes, so it collapses nicely if you're using linker gc
<mrvn>
indeed.
<geist>
i've looked into it for LK and it works nicely
<clever>
libfdt also managed to trigger a compiler bug in the VPU fork of gcc
<mrvn>
ouch
<geist>
heat: oh BTW did you get what i discovered about the mem reserve stuff?
<clever>
the VPU has a `switch r0` opcode, where it will then expect an array of 8bit pc-relative offsets to directly follow the opcode
<clever>
so switch-case blocks can be easily handled
<heat>
geist, yep
<clever>
gcc decided to reuse the link register, and do `switch lr`
<clever>
and apparently, that doesnt work, and the cpu malfunctions, going down the wrong branch
<geist>
heat: yah didn't know if you had read that or not. interesting at least
<mrvn>
clever: that sounds like someone screwed up the specs for the switch opcode.
<clever>
yeah, they assumed lr was like any other general-purpose reg
<mrvn>
ARM has a bunch of special cases for regs too.
<mrvn>
sp and pc mostly.
<clever>
yeah, the pipeline can make pc very different from how you expect it
<geist>
which they removed in arm64 precisely because they wanted no special cases
<mrvn>
geist: you mean they remove PC from being a general register, right?
<geist>
and SP
<mrvn>
that part I hate.
<mrvn>
such a waste with stackless languages.
<geist>
do you hate it like it hurts you or you hate it in that you find it distasteful?
<clever>
how do you manage a framepointer then?, what can SP do?
<geist>
ah. well you can use regular registers as stacks
<clever>
can add/sub/mov still interact with SP?
<mrvn>
clever: You have special opcodes that work with the SP. Like push/pop.
<geist>
it's primarily because things like 'sp must be aligned on 16 bytes' and 'sp is banked' that makes sense to remove it out of the regular register file
<geist>
so there's no special case
<clever>
mrvn: but what if you want to create a 256 byte hole in the stack?
<clever>
for a char[256] local var
<geist>
you can add/sub from it
<mrvn>
clever: addsp #256
<clever>
ah, a dedicated opcode, that solves add/sub
<mrvn>
clever: it's just not the regular add reg, reg, #imm encoding.
vin has joined #osdev
<clever>
what about frame pointers, can mov still read/write sp?
<mrvn>
clever: why would you need a frame pointer?
<clever>
or context switching, having to swap stacks
<clever>
just the first example that came to mind, on copying sp
<clever>
context switching is a much more useful case
<mrvn>
clever: there surely also is a mov sp, reg and mov reg, sp opcode.
<mrvn>
clever: or you have a link/unlink opcode like m68k has. unlink loads the old SP from the stack
<mrvn>
.oO(Or did it just add to the SP? something like that that removes one stack frame)
<clever>
i recently found that the centurion's CPU6 call/return opcodes, are a funky hybrid
<clever>
it will push X onto the stack, store the return addr into X, and set PC to the function being called
<geist>
note one use of fp thats more or less mandatory: alloca
<clever>
so X behaves like a link register, but it also auto-saved the old X to the stack
<mrvn>
geist: only for variable-sized alloca
<geist>
at the minimum you have to create some sort of register thing to save some sort of anchor to the stack to restore it
<geist>
yes. that's what i said
<clever>
but where things get really funky, is that a lot of centurion code, expects immediates after the call opcode
<clever>
as-in, you put immediates into your .text, after the call opcode, and the called function will increment the link-reg as it reads them
<clever>
and if it doesnt consume the right number of arguments, it returns to non-opcode data
<clever>
its kind of treating call like an opcode that can consume 20+ immediates
vin has quit []
<mrvn>
geist: alloca really screws you as a compiler builder. It's so rarely used and yet you have to support it in all your codegen.
<nikolar>
Centurion as in the 70s minicomputer?
<geist>
yup
<geist>
yup to mrvn that is
<clever>
nikolar: yep
<nikolar>
Surprisingly complicated for a cpu built out of logic chips
<mrvn>
Lazy compilers just have a frame pointer. But what a waste of a register on those old cpus that have so few (*cough* x86 *cough*)
<clever>
nikolar: i think this was a work-around, for having so few registers, and rather than push constants onto the stack, then pop them back out, it just put the constants inline in .text
<mrvn>
clever: was .text even read-only?
<nikolar>
Yeah fair
<clever>
mrvn: variables went via the fancy double-indirect load opcode, you would push a constant, that points to a global pointer
<clever>
so instead of printf("%d", foo);, it was more like bar = &foo; printf("%d", bar);
<clever>
and then bar is statically allocated
<clever>
nikolar: another crazy thing i discovered recently, is that there is a dedicated opcode, and a large chunk of microcode, for binary relocation
<nikolar>
So argument passing was done through static addresses basically
<nikolar>
Not through registers or stack?
<geist>
note that it wasn't really successful as a minicomputer
<clever>
nikolar: yep, except every function and syscall does arguments differently!
<geist>
so it's also entirely possible it's generally not a good example of a cpu architecture
<geist>
which are interesting in their own right
<clever>
nikolar: much like the amiga, you need to know exactly where the function is expecting its args
<geist>
ie, AT&T 3b2, etc
<nikolar>
No common abi then
<nikolar>
Very interesting
<clever>
geist: some recently discovered docs say they sold i think 10k units
<nikolar>
There's a lot
<nikolar>
s/there/that
<geist>
not really
<nikolar>
Well for 70s and minicomputers
<geist>
i think it was early 80s though right?
<geist>
ie, a bit late to the game
<mrvn>
clever: it's easy for C code. ints in Dx and pointer in Ax. But the register allocation for functions is part of the FFI interface.
<clever>
mrvn: yep, there are clear rules, but when using gcc you have to define those in the .h file i think
<nikolar>
geist: ah you're right
<nikolar>
sort of 70s tech in early 80s
<geist>
yeah, kinda. i suspect it was mostly a fairly cheapish back office kinda thing
<bslsk05>
github.com: Timeline · Nakazoto/CenturionComputer Wiki · GitHub
<clever>
> Also, honestly, reading through it - car crashes, hotel fires, counterfeiting, in-fighting with EDS, law suits out the wazoo... I'd totally watch that Netflix series
<mrvn>
clever: they had like 20 compilers to build AmigaOS. it's a wonder how it all works together.
<clever>
looks like the first centurion was made somewhere in 1973
<nikolar>
mrvn: that's a bit ridiculous
<nikolar>
clever: very intriguing
<clever>
the way amiga handles relocations, seems to focus on a single register for all library state
<clever>
base - offset, is a function pointer table, so you can call any exported functions
<clever>
and base + offset is local vars, and then code
<bslsk05>
'Emulating the Vacuum Tube Computer on the Centurion Minicomputer' by Usagi Electric (00:23:07)
<mrvn>
clever: yes, A6 is the base register for library calls. The "this" pointer.
<clever>
ive yet to find any similar api in centurion, but the relocation is pretty fancy
<clever>
all executables, are made up of a series of records
<clever>
each record has a type code, an 8bit length, a 16bit addr, a payload, and a checksum
<clever>
type-0 records, just write up to 120 bytes to base+addr, you specify the base when loading
<clever>
type-1 records, have a main addr, and a list of addresses
<clever>
for each addr in the list, you read from base+addr (it got relocated), add base+mainaddr, and then write it back
<mrvn>
clever: It might help to think of library calls as IPC. There is only one copy of the library in memory and all programs share the same address for it. The library gets initialized once and finalized once when the last program using it closes.
<clever>
so its basically just an array of all constants, and you then just offset them
<clever>
but the crazy part, is that this patching is done by microcode in the cpu
<clever>
you just call a single CVX opcode, give it the base addr, and the start of a record, and it will execute that entire record
<clever>
either copying 120 bytes, or modifying up to 60 16bit ints in ram
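The two record types described above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the CPU6 microcode: the big-endian word order, the already-parsed record fields, and the name `apply_record` are all assumptions, and the header parsing and checksum are left out.

```python
import struct


def apply_record(mem, base, rec_type, addr, payload):
    """Apply one already-parsed load record to mem (a bytearray).

    Assumptions (not from the discussion): 16-bit words are big-endian,
    and the record header and checksum were handled by the caller.
    """
    if rec_type == 0:
        # Type 0: copy the payload verbatim to base + addr.
        start = base + addr
        mem[start:start + len(payload)] = payload
    elif rec_type == 1:
        # Type 1: the payload is a list of 16-bit addresses to patch.
        # Each listed word has (base + main addr) added to it.
        delta = base + addr
        for (fix_addr,) in struct.iter_unpack(">H", payload):
            loc = base + fix_addr
            (old,) = struct.unpack_from(">H", mem, loc)
            struct.pack_into(">H", mem, loc, (old + delta) & 0xFFFF)
    else:
        raise ValueError("unknown record type")


mem = bytearray(64)
base = 0x10
# Load three bytes at base+0, then relocate the 16-bit word at base+0.
apply_record(mem, base, 0, 0x0000, bytes([0x00, 0x04, 0xAA]))
apply_record(mem, base, 1, 0x0002, struct.pack(">H", 0x0000))
```

After the second call the word at `base` has gone from 0x0004 to 0x0016, that is, the original value plus base (0x10) plus the main address (0x02).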
<nikolar>
The power of making custom hardware
<nikolar>
And microcode :)
<clever>
nikolar: but i also found an inefficiency, these records cant span across sectors on the hdd, so there is often 20-90 bytes wasted at the end of each sector
<clever>
because a "copy 120 bytes" record didnt fit
<nikolar>
The least they could do is change to 140b or something
<clever>
but... if you just split that into "copy 90 bytes" and "copy 30 bytes"
<clever>
then you can fill the sector up
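A rough sketch of the packing idea, in Python. The 400-byte sector and 120-byte payload limit come from the discussion; the 8-byte per-record overhead is an assumption (chosen so a full record is 128 bytes), and `pack_records` is a hypothetical helper, not anything from the Centurion tooling.

```python
SECTOR_SIZE = 400     # bytes of data per sector (from the discussion)
MAX_PAYLOAD = 120     # payload limit per record (from the discussion)
RECORD_OVERHEAD = 8   # assumption: header + checksum, so 120 + 8 = 128


def pack_records(total_bytes, split_to_fit):
    """Return a list of sectors, each a list of record payload sizes."""
    sectors, current, free = [], [], SECTOR_SIZE
    remaining = total_bytes
    while remaining > 0:
        want = min(MAX_PAYLOAD, remaining)
        room = free - RECORD_OVERHEAD
        if room >= want:
            size = want
        elif split_to_fit and room > 0:
            # Emit a short record that exactly fills the sector.
            size = room
        else:
            # Record does not fit: leave the tail unused, start a new sector.
            sectors.append(current)
            current, free = [], SECTOR_SIZE
            continue
        current.append(size)
        free -= size + RECORD_OVERHEAD
        remaining -= size
    if current:
        sectors.append(current)
    return sectors


naive = pack_records(480, split_to_fit=False)
filled = pack_records(480, split_to_fit=True)
```

With these assumed numbers the naive packing leaves 16 bytes unused at the end of the first sector, while the split variant adds a short 8-byte record there and carries the rest into the next sector.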
<nikolar>
Makes sense
<clever>
nikolar: i think that 120 limit, comes from the FS being heavily record based, and the record length limit is ~128 bytes, and with the headers on a record, 120 gets to ~128
<clever>
everything, even text files, are just a series of records
<clever>
text files, each line is a separate record in the file
<nikolar>
Interesting
<nikolar>
So no lines longer than 120?
<clever>
i think so
<mrvn>
but back then everything after column 65 was comment.
<mrvn>
nikolar: the punch card only has 120 columns
<nikolar>
mrvn: wasn't it 80
<mrvn>
that wouldn't match with 120 byte records
<clever>
i'm thinking it was just a char[128] buffer, plus some overheads
<clever>
so 120 bytes of payload
<nikolar>
interesting
<nikolar>
didn't it have really weird hard drive format
<clever>
all sectors are 400 bytes
<clever>
i think thats a hold-over from when it was tape based
<nikolar>
yeah that
<mrvn>
So you get 3 records per block and some padding?
<clever>
mrvn: yep
<nikolar>
also doesn't it have some chksum bytes per hd block too
<clever>
yeah
<mrvn>
Those people. tss. Only 2 ascii chars for the year and then they waste tons per block because the blocksize isn't a multiple of records.
<bslsk05>
github.com: CDC Hawk Drive · Nakazoto/CenturionComputer Wiki · GitHub
<mrvn>
16 byte checksum?
<clever>
nikolar: each sector on disk has a sync, 16bit sector addr (docs say a sector checksum, but this doesnt), a gap, a second sync, 400 bytes of data, and a 16bit checksum
<clever>
mrvn: 16bit
<nikolar>
That's a lot of metadata
<mrvn>
400 bytes of data is 3 records with 16 bytes left over
<clever>
during writes, the controller will read that sync+addr, to confirm its on the right sector, then switch to write mode
<clever>
and the gap before the 2nd sync, gives it time to switch modes
<mrvn>
I wonder what happens when one of the address bits flips. The controller sees it's at the wrong position so it seeks?
<nikolar>
Very interesting
vin has joined #osdev
<clever>
mrvn: the sector-address is only written during a format, but if it gets the wrong addr during any operation, it will return an error code
<clever>
its hard-sectored, so there is a wheel on the hdd, giving an index pulse 16 times per rotation
<clever>
and it knows exactly where on the platter each sector should begin
<nikolar>
so why is address necessary on the disk
<mrvn>
So it's more a "hey, the motor is broken" error when the address is wrong.
<clever>
nikolar: to detect errors, like the seek head being on the totally wrong track
<geist>
heh this centurion history wiki page is some salacious stuff
<clever>
you have a clock pulse running at 2.5mhz, and a "1" is encoded by having a second pulse between the clocks
<clever>
while a "0" is just the clocks alone
<clever>
each sync pattern, is 87 "0"'s, and then a single "1" bit
vin has quit [Client Quit]
<clever>
because that 2.5mhz clock wont be in phase with the reader, and it has to re-align itself
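The encoding described here (a clock pulse in every bit cell, plus an extra pulse between clocks for a "1") can be modelled as two half-cells per bit. The sketch below is a toy model under that assumption, with hypothetical function names; it shows why a long run of zeros followed by a single one lets a reader that starts at an unknown phase find the bit boundaries.

```python
def fm_encode(bits):
    """Encode bits as half-cells: (clock pulse, data pulse) per bit."""
    cells = []
    for bit in bits:
        cells.append(1)                 # clock pulse, always present
        cells.append(1 if bit else 0)   # extra pulse only for a "1"
    return cells


def sync_pattern():
    """87 zero bits followed by a single one bit, as described above."""
    return [0] * 87 + [1]


def locate_data_start(cells):
    """Find the first two adjacent pulses; data starts right after them.

    During the run of zeros pulses arrive only every other half-cell, so
    two pulses in a row can only be the clock and data pulse of the "1".
    """
    for i in range(1, len(cells)):
        if cells[i - 1] == 1 and cells[i] == 1:
            return i + 1
    return None


def fm_decode(cells):
    """Decode half-cells that start on a clock pulse."""
    return [cells[i + 1] for i in range(0, len(cells) - 1, 2)]


stream = fm_encode(sync_pattern() + [1, 0, 1, 1])
# Drop the first half-cell to simulate a reader that is out of phase.
shifted = stream[1:]
start = locate_data_start(shifted)
decoded = fm_decode(shifted[start:])
```

The reader never needs to know whether it began on a clock or a data half-cell; the pair of adjacent pulses at the end of the sync fixes the phase.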
<clever>
https://i.imgur.com/FCcVMib.png the hard-sectoring wheel, each ring has a different number of sectors/track, and you move the sensor (top of frame) to select a sector/track setting
<clever>
the hdd then counts that, and outputs a 4bit sector-number on the control ribbon
<clever>
and the 2 quick pulses (bottom of frame) is the other index mark, to signal 1 complete rotation
immibis_ has joined #osdev
vin has joined #osdev
<vin>
Hi, does anyone know how postgres or other row major databases store records on disk. Especially, when the records are of variable length and the entries of the record themselves can be of different sizes.
<vin>
Is there a metadata maintained, that says at what offset new records begin and at what offset different entries in the record exists
<clever>
vin: sqlite has good docs on that
Burgundy has joined #osdev
<mrvn>
vin: DBs really hate variable sized records
srjek has quit [Ping timeout: 248 seconds]
<nikolar>
Sqlite kind of suggest using sqlite databases as a replacement for tarballs lol
<nikolar>
i remember that it was explicitly mentioned to set 4kb pages for zfs
<clever>
nikolar: zfs treats all files as having 128kb records by default
<nikolar>
because using smaller is bad on ssds and they report having smaller pages for compatibility reasons
<nikolar>
clever: i am talking about hardware pages mostly
<clever>
ah, yeah, zfs calls that ashift, fat calls it cluster size
<mrvn>
nikolar: no, ssds report a block size and erasure size. Block size is pretty much irrelevant.
<clever>
the smallest unit you can use on-disk
<nikolar>
clever: exactly, ashift=12
<clever>
ssd's might internally operate on 4kb blocks, and when you do a 512b write, it has to RMW the whole 4k block
<nikolar>
ie 4kb
<vin>
Doing small writes is bad on SSD due to endurance, so usually they're write combined. That said there are new media types beyond flash that don't have this kind of bad endurance property
<clever>
so telling zfs to do 4k writes avoids the RM and just does W
<nikolar>
wonder why intel killed optane
<nikolar>
seemed to be doing pretty well
<mrvn>
clever: worse. they have to read 4k and copy them somewhere else because they can't erase 4k
<clever>
nikolar: i heard that ntfs wasnt designed for a read-cache, and bodging it in caused major cpu overhead
<vin>
nikolar: I think CXL is the answer
<nikolar>
cxl?
<clever>
nikolar: so intel then vendor-locked the optane drivers to certain intel cpu's, that were known to be able to handle it
<nikolar>
clever: kek ms doing ms things
<clever>
zfs on the other hand was designed to allow this kind of thing
<nikolar>
clever: zfs really is the last word in filesystems lol
<nikolar>
latest?
<clever>
nikolar: with that sqlite union, the first thing it does is open both tables, and open 1 index, it then scans over validpaths (just iterating over every single row)
<nikolar>
and then searches for those ids in the other table?
<clever>
nikolar: i think address 7-13, fetches the deriver (join column) from that table, and then does an index based search on the 2nd table
<nikolar>
yeah that's what i said, poorly
<clever>
SeekGE's p1 is 2, that is the handle from the 3rd OpenRead, the index
<nikolar>
that's actually exactly what i thought databases do
<clever>
the main trick with sqlite, is that the `ResultRow` opcode, pauses the virtual machine, and returns back to the caller
<clever>
then you can access the current result, and call `sqlite3_step()` again, to run the VM until it either has another ResultRow, or a Halt
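The bytecode being discussed can be inspected from Python's standard `sqlite3` module, which is used here only as a convenient way to poke at SQLite. The table `t` and its contents are made up for the example. `EXPLAIN` returns one row per VDBE instruction (addr, opcode, p1 to p5, comment), and fetching rows one at a time from a cursor is roughly the Python-level view of stepping the VM from one ResultRow to the next.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO t (name) VALUES (?)", [("a",), ("b",), ("c",)])

# One row per VM instruction; column 1 is the opcode name.
program = conn.execute("EXPLAIN SELECT name FROM t ORDER BY id").fetchall()
opcodes = [row[1] for row in program]

# Pull rows one at a time instead of all at once.
cur = conn.execute("SELECT name FROM t ORDER BY id")
first = cur.fetchone()
rest = cur.fetchall()

for addr, opcode, *params in program:
    print(addr, opcode, params[:3])
```

The exact instruction listing varies between SQLite versions, so the printed program is best read as an example rather than a fixed reference.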
<clever>
network based engines like mysql/postgresql, tend to gather many rows into a buffer, and then spit them over the network in bulk
<nikolar>
Is limit done in the VM or in the c code
<clever>
addr 1 initializes a local var to 5, DecrJumpZero will decrement it, and maybe jump
<nikolar>
oh so it's nothing special
<nikolar>
just a loop
<clever>
kinda, its reusing the same array the result is in
<clever>
and ResultRow offsets what you're looking at
<clever>
the results are now in 2-10, not 1-9
<clever>
so r[n] is a variable size array, that the bytecode can just store anything it wants into
<clever>
Integer can put constants in, DecrJumpZero can manipulate it, Column copies from a record to r[n], ResultRow passes a slice of the array back as a result
<clever>
nikolar: that loop is also there if you lack the limit clause, so the only thing DecrJumpZero is doing, is aborting when the specified count is reached
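The loop described above can be mimicked with a toy register VM written as a Python generator, where each `yield` stands in for the VM pausing at ResultRow and returning to the caller. The opcode names are taken from the discussion, but the operand layout here is simplified and hypothetical (real SQLite instructions carry more operands), and `run` is not part of any SQLite API.

```python
def run(program, table):
    """Toy VM: yields at every ResultRow, resumes when asked for more."""
    r = {}          # register file, like r[n] in the discussion
    cursor = 0
    pc = 0
    while True:
        op, p1, p2 = program[pc]
        pc += 1
        if op == "Integer":          # r[p2] = p1
            r[p2] = p1
        elif op == "Rewind":         # start the scan; jump to p2 if empty
            cursor = 0
            if not table:
                pc = p2
        elif op == "Column":         # r[p2] = column p1 of the current row
            r[p2] = table[cursor][p1]
        elif op == "ResultRow":      # hand r[p1 .. p1+p2-1] to the caller
            yield tuple(r[i] for i in range(p1, p1 + p2))
        elif op == "DecrJumpZero":   # r[p1] -= 1; jump to p2 on zero
            r[p1] -= 1
            if r[p1] == 0:
                pc = p2
        elif op == "Next":           # advance; jump to p2 if a row remains
            cursor += 1
            if cursor < len(table):
                pc = p2
        elif op == "Halt":
            return


# Roughly "SELECT name FROM t LIMIT 2": r[1] is the counter, r[2] the result.
program = [
    ("Integer", 2, 1),
    ("Rewind", 0, 6),
    ("Column", 0, 2),
    ("ResultRow", 2, 1),
    ("DecrJumpZero", 1, 6),
    ("Next", 0, 2),
    ("Halt", 0, 0),
]
table = [("a",), ("b",), ("c",)]
rows = list(run(program, table))
```

Dropping the `Integer` and `DecrJumpZero` instructions (and retargeting the jumps) leaves the same scan loop without the limit, which matches the point that the limit only adds an early exit.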
<nikolar>
interesting
<nikolar>
how is offset implemented
<clever>
compare gistfile5 and gistfile2
<nikolar>
does it have to go through all rows until it gets to the offset
<nikolar>
clever: got it
<heat>
geist, i guess the mem reserve mechanism fdt supports is kind of limited in platform description purposes
<heat>
no way to fit i.e a phandle or name
<geist>
yah since it's in the header it's presumably some old mechanism that's basically deprecated by modern usages
Burgundy has left #osdev [#osdev]
<geist>
would be easy for a piece of dumb firmware to fill it out, etc
<clever>
yeah, ive thought that memreserve was meant more for a dumb loader, that cant understand fdt
<geist>
anyway, i dont even know how you fill it out using dts, since it doesn't seem to be expressed anywhere in the source
<clever>
a special tag at the top, /memreserve i think
<clever>
it should be in the dtc docs
<nikolar>
so basically it has to go through all rows
<clever>
but it can do that at the b-tree level, i assume
<clever>
so its skipping thru leaves in the b-tree
<nikolar>
if it has a where clause too, it couldn't skip i imagine
<heat>
nikolar, zfs is the last word in garbage software
<nikolar>
heat: why the hate lol
<heat>
i call it how I see it
<heat>
zfs is the most stupidly complex filesystem and filesystem driver
<heat>
with very glaring flaws
<nikolar>
It does its job ¯\_ (ツ) _/¯
<mrvn>
heat: is there anything better though?
<nikolar>
At least it never corrupted my drive like btrfs
<heat>
ext4, btrfs
<heat>
ntfs is decent too
<clever>
nikolar: ive had btrfs go read-only, because a disk took too long to respond
<heat>
i think xfs was or is nice
<mrvn>
crap, more crap, bad
<heat>
bull💩
<nikolar>
honestly, the only fs i would trust for long term storage is zfs
<clever>
heat: any fs ontop of mdadm, cant deal with re-reading the other half of the mirror when corruption is found, because the fs and raid are too isolated
<heat>
ext4 is the most reliable, simple, performing fs
<nikolar>
ext4 is reliable, until the bitrot sets in
<mrvn>
heat: and doesn't have checksums everywhere
<nikolar>
i have no issues with ext4, and it's great given how simple it is
<nikolar>
but zfs is just something else
<heat>
zfs has a nasty codebase, horrific design (2 journals or whatever the fuck that was?), unbounded complexity
<heat>
bunch of edgecases
<clever>
heat: one journal per filesystem, multiple filesystems in a pool
<nikolar>
wouldn't know about the codebase
<heat>
clever, sorry, whatever the hell ZIL is
<heat>
it had two
<clever>
the ZIL is very weird, compared to say ext4
<nikolar>
and the point of having two is that you always have one valid
<nikolar>
no matter what happens
<nikolar>
so you can't corrupt the drive
<heat>
riiiight
<clever>
ext4 journal, i believe is a backup of metadata its actively modifying
<heat>
what if we had some sort of, idk, thing where you could clone data?
<heat>
or a way to log the operations you do on a filesystem
<mrvn>
clever: by default the journal is only metadata
<clever>
ive not seen anything about the ZIL being duplicated
<clever>
mrvn: and when in data mode, does it write all data twice? to the journal, then the fs?
<heat>
clever, no, it's about you committing shit to the ZIL, and then needing to recommit to the actual data store
<heat>
it's so fucking idiotic
<clever>
heat: ah, thats only done for small writes
<clever>
large writes dont do that
<heat>
yes, and it breaks normal filesystem semantics
<heat>
I took the time to submit a bug report and I had 0 engagement from anyone in the zfs team
<clever>
to do with filesize?
<heat>
block count
<clever>
ah yeah, that one
<heat>
and *fsync()* does not actually sync
<heat>
it's somehow much worse
<clever>
i believe the ext3/4 journal, is purely an overlay, so reading block 42 gives you what the journal has, rather than the real block 42
<clever>
because all metadata is edited in place, and thats the only way to not corrupt things
<mrvn>
clever: that's kind of the point of the journal
<clever>
but CoW FS's like zfs, avoid that problem entirely
<clever>
the old metadata isnt modified
<clever>
it just makes new metadata in free space
<mrvn>
except for the ZIL
<nikolar>
i'd be happy to switch when we get a filesystem that can do what zfs does, and is simpler
<clever>
the ZIL is a shortcut, to commit things to disk without having to re-write the entire metadata tree
<nikolar>
but at the moment, the zfs is probably the best we have
<mrvn>
clever: you mean a journal? *wonder*
<clever>
each ZIL block, is pointing to a future ZIL block, that is free at the time you pointed to it
<clever>
and the ZIL doesnt store blocks being written, but is more of a record of write() syscalls
<heat>
if by "do what zfs does" you mean store files and directories, may I introduce you to ext4?
<clever>
so you can replay the last few actions, and rebuild the dirty state
<nikolar>
no i meant snapshots, functional raid, clones, etc
<mrvn>
clever: for me the point of a COW filesystem was to get away from journaling.
<heat>
easy, btrfs
<mrvn>
heat: btrfs doesn't perform on nearly full filesystems. So basically never.
<clever>
mrvn: if you set sync=disabled, then zfs basically never touches the zil, and all writes go thru the CoW tree
<clever>
but i think it just ignores sync() entirely as well
<clever>
so you can lose the last few seconds-minutes of changes
<mrvn>
clever: unusable
<clever>
but they will at least be in-order and atomic, i believe
<nikolar>
well the first time i tried using btrfs, it ate my data, so no thanks :)
<nikolar>
not yet at least
<clever>
i'm fuzzy on the exact semantics, need to study it more
<clever>
mrvn: zfs also performs poorly on a full disk, i think most FS's do
<nikolar>
yeah most do
<mrvn>
clever: scales much better till you get real close to 100%.
<mrvn>
and I haven't run into an ENOSPC on "rm file" yet.
<nikolar>
kek
<clever>
for zfs specifically, the disk is broken into groups (same as ext), and each group has its own free-space histogram and list
<clever>
by default, if it cant find a big enough hole in the loaded groups, it just gives up and fragments the record
<mrvn>
if you get down to the last 8 groups or so zfs slows down drastically.
<nikolar>
does it never compact the existing data
<clever>
zfs.zfs_metaslab_try_hard_before_gang=1 forces it to scan every group in the pool, but that slows it down more
<mrvn>
nikolar: no, that would possibly corrupt that data
<clever>
nikolar: nope, once a record is fragmented, its stuck that way, until you delete it
<mrvn>
nikolar: it has a defrag that will copy fragmented data to new groups.
<clever>
and fragmented blocks are a huge waste of space, ive seen a file take up double its size
<nikolar>
i know that lfs compacted the existing data when it was running low
<nikolar>
and zfs took a lot of ideas from it
<nikolar>
don't know how exactly zfs does everything though
<clever>
this is where ext2/3, ext4, and zfs differ some
<mrvn>
you can also resilve the FS if you change the stripe count or such.
<mrvn>
resilver
<clever>
ext2/3 stored a file as a big array of clusters, you pick a cluster size at format time, and then it has to store the array of cluster#'s in the fs indirection tree
<clever>
and if a file isnt fragmented, you waste space storing every number from 10 to 30 in a block
slidercrank has quit [Ping timeout: 246 seconds]
<nikolar>
yeah which makes it a bad idea to resize the fs
<mrvn>
clever: that overhead is really irrelevant.
<clever>
ext4 switches to extent trees, where it just says block X of the file, starts at block Y on the disk, and is Z blocks long
<mrvn>
extents are more about speed than disk space.
<clever>
so the larger your fragments, the less space you waste on metadata
<clever>
yeah, extents let you handle massive files and small blocks, without worry
<mrvn>
faster to find the block for an offset, easier to do a sequential read for contiguous data
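A small Python sketch of the extent idea: each entry says "file block X starts at disk block Y and runs for Z blocks", and a lookup is a binary search over the start offsets. The tuple layout and the name `lookup` are illustrative assumptions; the real ext4 on-disk extent tree is more involved.

```python
from bisect import bisect_right

# Each extent: (first file block, first disk block, length in blocks).
extents = [
    (0, 1000, 10),
    (10, 5000, 20),
    (40, 200, 5),    # file blocks 30..39 are a hole
]


def lookup(extents, file_block):
    """Map a file block number to a disk block, or None for a hole."""
    starts = [e[0] for e in extents]
    i = bisect_right(starts, file_block) - 1
    if i < 0:
        return None
    file_start, disk_start, length = extents[i]
    if file_block >= file_start + length:
        return None
    return disk_start + (file_block - file_start)


disk_block = lookup(extents, 15)
```

Three entries describe 35 mapped blocks here; a per-block pointer array would need 35 entries for the same file, and a sequential read of one extent is a single contiguous range on disk.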
<clever>
zfs instead has records, a file is broken up into blocks of say 128kb
<clever>
that block is then compressed, and rounded up to 2^ashift (4kb for example)
<clever>
and then that chunk is put onto the disk, and a pointer to it is stored in the indirection tree
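The "compress, then round up to 2^ashift" step can be illustrated with a little arithmetic. In this sketch `zlib` is only a stand-in compressor (an assumption, not what ZFS uses by default), and `round_up_to_ashift` and `allocated_size` are hypothetical helper names.

```python
import zlib


def round_up_to_ashift(size, ashift=12):
    """Round size up to a multiple of 2**ashift (4096 for ashift=12)."""
    unit = 1 << ashift
    return -(-size // unit) * unit


def allocated_size(record, ashift=12):
    """Bytes a record would occupy on disk in this simplified model."""
    compressed = zlib.compress(record)   # stand-in compressor
    # Simplification: keep the record uncompressed if that is smaller.
    stored = min(len(compressed), len(record))
    return round_up_to_ashift(stored, ashift)


record = b"x" * (128 * 1024)        # a highly compressible 128kb record
on_disk = allocated_size(record)    # shrinks to a single 4kb unit
```

This is the sense in which a 128kb record can occupy far less than 128kb on disk, unlike simply choosing a 128kb block size.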
<nikolar>
*optionally compressed and encrypted
<clever>
yep
<mrvn>
zfs can't really do extents because of compression and checksumming
<clever>
but its different from just setting the ext block size to 128kb, because a block can occupy less than 128kb
<clever>
and also the whole record system
<clever>
its basically 2 different block systems, stacked
<mrvn>
encryption is fooie in zfs though
heat has quit [Remote host closed the connection]
<clever>
the file is made of a series of 128kb blocks, but those may be compressed, and then laid down on a 4kb block disk
Burgundy has joined #osdev
<clever>
but the file block size can be changed fs-wide
<clever>
mrvn: as for speed, ext2/3 was nice, in that you could pre-compute your path down the indirection tree, read block X, look at offset X1, read the block from there, look at offset Y1
<clever>
and then just fire off a chain of IO's, but its sequential, so you had to wait for each single read
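The "pre-compute your path down the indirection tree" point can be sketched as pure arithmetic. The layout assumed here is the classic ext2 one (12 direct pointers, then single, double and triple indirect blocks) with 1024 pointers per block, which corresponds to 4kb blocks and 4-byte block numbers; `block_path` is a hypothetical helper name.

```python
def block_path(n, ptrs_per_block=1024, direct=12):
    """Return (kind, indexes) needed to reach file block n.

    indexes lists the slot to read at each level, top of the tree first.
    """
    if n < direct:
        return ("direct", [n])
    n -= direct
    span = ptrs_per_block          # blocks covered by the current level
    for level, kind in enumerate(("single", "double", "triple"), start=1):
        if n < span:
            idx = []
            for _ in range(level):
                idx.append(n % ptrs_per_block)
                n //= ptrs_per_block
            return (kind, idx[::-1])
        n -= span
        span *= ptrs_per_block
    raise ValueError("block number too large")


path = block_path(12 + 1024 + 1025)
```

The whole path is known before any IO is issued, but each read still depends on the block number found by the previous one, which is the sequential wait mentioned above.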
<nikolar>
mrvn: fooie?
<mrvn>
nikolar: has flaws
<clever>
ext4 is more cpu intensive, and you have to parse the whole block, before you know which block is next, but there is likely fewer IO in total
<nikolar>
i am aware
<nikolar>
i know that metadata is never encrypted
<nikolar>
natively
<mrvn>
clever: you can parse a whole lot of extents in the time it takes a spinning disk to read a block.
<clever>
yeah, and thats where extent trees win
<mrvn>
nikolar: they also messed up the HMAC or something. Can't remember the details but you can edit the disk and it won't notice.
<nikolar>
at least you can't read the data
<clever>
mrvn: i dont see how thats possible, given all the checksums going on
<mrvn>
clever: you replay another block which has correct checksums inside
<nikolar>
the encryption layer wouldn't notice i imagine
<nikolar>
not the rest
<clever>
mrvn: the block holding that checksum, is also checksummed
<clever>
its checksums all the way to the root!
<clever>
so you would need a zfs aware tool to modify the whole tree
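A minimal Python sketch of "checksums all the way to the root", using a two-level tree: the indirect block is just the concatenated hashes of the data blocks, and the root holds the hash of the indirect block. SHA-256 and the function names are assumptions for illustration; this is not the ZFS block-pointer format.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def build(leaves):
    """Return (root_hash, indirect_block) for a list of data blocks."""
    indirect = b"".join(h(leaf) for leaf in leaves)
    return h(indirect), indirect


def verify(root, indirect, leaves):
    """Check the indirect block against the root, then each leaf."""
    if h(indirect) != root or len(indirect) != 32 * len(leaves):
        return False
    return all(
        indirect[i * 32:(i + 1) * 32] == h(leaf)
        for i, leaf in enumerate(leaves)
    )


leaves = [b"block 0", b"block 1"]
root, indirect = build(leaves)

# Tamper with a leaf and patch the indirect block to match; the root
# still catches it unless the whole chain up to the root is rewritten.
evil = [b"block 0", b"EVIL"]
_, patched_indirect = build(evil)
```

Plain hashes like this need no key, so a tool that understands the format can rebuild the chain up to the root, which is the gap a keyed MAC would close.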
<mrvn>
clever: obviously
<clever>
but yeah, i can see the issue
<clever>
zfs lets you send an encrypted dataset to an untrusted party that lacks the keys
<clever>
and in that state, its just a series of numbered files, each made up of an array of <=128kb objects
<mrvn>
but when I encrypt my FS I would expect that nobody can read my data but also that nobody can modify my data
<clever>
and using zfs tooling, you can replace any record or delete any file
<clever>
you just wont know what you're messing with, because the directory tree is also encrypted
<clever>
i can see why you would want hmac ontop of checksum
<mrvn>
clever: I want the checksum connected to the encryption. Not separate
<mrvn>
if someone can alter the checksum without the encryption key then they can swap out encrypted blocks.
<clever>
yeah, that as well
<clever>
for that, you may want an hmac on the root of the dataset
<clever>
hmmm, but with how indirect blocks work, that would be invalidated upon zfs send
<clever>
it would need to be a custom hmac of just the hmac's
<clever>
rather than the metadata
<clever>
a "block pointer" in zfs, is a 128 byte object, that holds the hash of the data its pointing to, and metadata about which compression and hash algo was used, if its encrypted, if its fragmented, if its under dedup, the size before/after compression, and up to 3 pointers of where it is on-disk (certain important metadata is stored 2 or 3 times)
<clever>
but encrypted records, reuse one of those pointers for crypto state, so it can only store 2 copies
<clever>
an indirect block, is then just a big array of those, compressed, and stored as another record, with a new block-pointer pointing to it
<clever>
so its much more like ext2/3, where you know the depth of the tree, and which index to use at each layer
<mrvn>
clever: holes reduce the tree
<clever>
yeah, a sparse hole, would be all nulls in the indirect block, which then compresses down
<clever>
and depending on the size of the file, entire branches of the indirect tree may be missing
<clever>
size of the hole*
<mrvn>
"Skip intro", the best invention since binge watching episodes.