<bslsk05>
github.com: rpi-open-firmware/sdram.c at master · librerpi/rpi-open-firmware · GitHub
<clever>
for the ddr2, its just a bunch of magic register writes, and some of the numbers make some sense
<clever>
for the bcm2711 ddr4, you need to copy several 20kb blobs into the ddr4 controller first
<clever>
along with register writes
<immibis>
firmware blobs are debatably maybe not actually a problem
<immibis>
unless the licence says they are
<clever>
immibis: yeah, thats a second issue towards what heat asked, i may not be allowed to re-distribute modified blobs that include tianocore
<heat>
blobs of what?
<immibis>
was it accolade vs sega where a court said that if you require a blob for your system to work, you can't enforce the trademark on that blob? If the court is feeling particularly anti-nefarious maybe they would rule the same on copyright, but I doubt it, since that would make oligarchs less rich
<heat>
why do you need to modify them?
<immibis>
Blobs of bytes
<heat>
ty
<heat>
blobs of mrc code?
<heat>
or a blob of tianocore?
<clever>
one sec
<zid>
nintendo's attempt was to put the nintendo logo onto the cart, then hash the cart header
<clever>
-rw-r--r-- 1 clever users 47K Feb 8 18:16 bootcode.bin
<zid>
which I always found kind of genius
<clever>
heat: this blob runs on the VPU, it deals with initializing the ddr4, and loading the other ddr4 blobs
<clever>
-rw-r--r-- 1 clever users 4.1K Feb 8 18:16 mcb.bin
<clever>
-rw-r--r-- 1 clever users 14K Feb 8 18:16 memsys00.bin
<clever>
there are 8 memsys files, and all of these are used in bringing ddr4 online
<clever>
-rw-r--r-- 1 clever users 234K Feb 8 18:20 bootmain.elf
<clever>
after the ddr4 is online, bootcode.bin runs bootmain.elf, and the sha256 of bootmain.elf is held inside bootcode.bin
<clever>
so, if i want to replace bootmain, i have to fix the hash in bootcode.bin (and resign it)
<clever>
and now i must ship a modified blob
<clever>
and what does the license say about that?
<immibis>
I don't think it was Nintendo in the court case. Btw the Nintendo one isn't hashed.
<bslsk05>
github.com: rpi-eeprom/LICENSE at master · raspberrypi/rpi-eeprom · GitHub
<clever>
> Redistribution and use in binary form, without modification, are permitted provided that the following conditions are met:
<clever>
heat: this implies that modifying the pieeprom.bin files in any way is not allowed, but there seems to be a problem here, because the repo includes a python script for modifying a .txt file inside the .bin
<clever>
and sending those modified files is a form of support you can do
<immibis>
Asking for support isn't necessarily considered distribution
<immibis>
Here we see the annoying part of an adversarial legal system: the oligarch will assert all the rights they want, regardless of which ones they are actually entitled to. The only way to find out they asserted a right they don't have is to try violating it and risk prison time.
<clever>
immibis: yeah, there are 2 different levels, 1: using an official python script to embed a .txt into a .bin, and sending the result to a user who cant figure that out
<clever>
and 2: replacing the code inside the .bin, to make it do something entirely different from normal
<clever>
technically both are modifying the .bin file
<immibis>
You didn't modify the file, you just constructed a new file containing unmodified parts of the original :)
<clever>
and how much can i play that card? :P
<heat>
yeah erm sounds like you're screwed
<clever>
heat: i could just provide a script that modifies things for the end-user
<clever>
its not "Redistribution" if you never distribute it
<heat>
i guess you could pull that card
<clever>
but then the end-user also needs the signing keys
<heat>
it'd be great if you could find your way around that
<clever>
you can read the keys from a custom start4.elf
<heat>
make bootmain.elf jump to your code
<clever>
bootmain.elf's job is to load start4.elf from a supported media: SD/USB/TFTP/NVME/HTTPS
<immibis>
You can play that card until they sue you, which is most likely eternity. But it's not guaranteed to be eternity.
<heat>
i'm sure you could find a way to accidentally corrupt things
<clever>
if i keep the official bootmain.elf, then i can only boot from one of the above sources
<clever>
if i use a custom bootmain.elf, then i can boot from anything i want
<immibis>
Observe how intellectual property laws oppress the common man
<heat>
you're making this into a class thing
<heat>
it's not a class thing
<clever>
heat: yeah, ive not been looking for buffer overflow exploits in depth, more just how the code is meant to function
<heat>
ddr4 controller people write MRC code, you get the blob because they're scared to share what's there
<immibis>
heat: every thing is a class thing in the end. If only because it creates its own classes
<immibis>
what does MRC stand for?
<heat>
you get the blob with the condition that you don't modify it if redistributing, end users get it too
<heat>
memory reference code
<clever>
-rw-r--r-- 1 clever users 61K May 5 2020 bootcode.bin
<heat>
intel term for ddr2/3/4/5 training code
<clever>
heat: the older bootcode.bin, did both dram init, and loading of start4.elf from sd/usb/tftp/nvme
<clever>
so its clearly being built by RPF, and is a mix of ddr4 source, and bootloader source
<clever>
the design of the code also matches the older pre-ddr4 firmware
<clever>
so it seems like RPF got ddr4 driver source, and integrated it into the existing bootloader
<heat>
you sure?
<heat>
i bet it's just blobs in blobs
<clever>
the bootcode.bin has no real blobs hidden in it
<clever>
its all VPU asm, strings, and small binary constants (like gpt uuid's)
<heat>
you said it loaded more blobs though?
<clever>
those are separate files, clearly tagged with a size&name
<immibis>
How is VPU assembly not a blob?
<heat>
i bet that's where the secret sauce is
<clever>
and just get copied to a dedicated area and the ddr4 deals with it
<clever>
immibis: because i can decompile it, so its effectively source
<immibis>
so a blob is anything you can't decompile? Strange definition
<clever>
well, more, that there are no non-vpu blobs mixed in with the vpu asm
<clever>
so i can look at any given byte-range, and tell you what its doing
<clever>
the vpu asm is still a blob, but its a blob i can understand
<heat>
in intel platforms you get the Intel FSP and you integrate it with your platform code to do magical things, like setting up Top Secret platform stuff and train memory
<clever>
but the mcb.bin/memsys0{0-7}.bin are unknown blobs
<heat>
vendors don't get to see the FSP
<clever>
heat: yeah, the VPU kind of acts like the FSP, but the entire bootcode.bin+bootmain.elf+start4.elf takes place on the VPU, and start4.elf must live on SD/USB/TFTP/NVME
<heat>
the term Memory reference code comes from a time where they actually needed to share the code, but things were much stricter and no one could see it
<clever>
and if you want to boot from something else, you cant
<heat>
s/no one/only intel partners/
<heat>
these days, it's just a blob in your blob
<heat>
defined API, you call it, it works
<heat>
there's even a dispatch mode where it's integrated seemlessly with the rest of your UEFI PEI
<heat>
could you not see where the memory training function starts and call that?
<clever>
dont really have that option on the rpi, the 47kb of VPU asm expects to be loaded to a specific addr, and i only have 128kb of ram to work with, and my own blob is loaded to the same addr!
<clever>
heat: i could, but id have to manually figure out all of the relocation patching by hand, or move my own binary out of the way and load it to the right addr
<heat>
well yes, but such is life :)
<clever>
or just leave the bootcode.bin as the 1st stage, and have it run my code after raminit
<clever>
thats far simpler
<clever>
for that, i just have to replace bootmain.elf, and patch the sha256 inside bootcode.bin, and re-sign it
<heat>
but that's not legally viable
<clever>
the relocation and run the blob thing, has 2 different routes
<clever>
1: my initial bootcode.bin then relocates/loads to +47kb, to create a 47kb hole at the load addr, loads the original bootcode.bin, and calls into it, then continues to boot
<clever>
2: my initial bootcode.bin relocates/loads to +47kb, loads the original bootcode.bin, does the sha256 patching, and then jumps to its entry-point, it then does raminit and runs bootmain.elf
<clever>
1 would break with every update, and i have to find the right function
<clever>
2 is just patching it at runtime, and thats far simpler
<clever>
in theory, i could write a loading stub in asm (very small), put it at the top of 128kb, then that copies bootcode.bin from spi->L2, patches, and runs it
<clever>
then bootcode.bin is very small, and just memcpy's that stub out of the way, and runs it
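(A hypothetical C sketch of the runtime sha256-patch step clever describes; the assumption that the stock hash appears verbatim in the image, and the function name, are placeholders rather than the firmware's actual layout:)

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Find the embedded sha256 of the stock bootmain.elf inside a
     * bootcode.bin image already loaded in RAM, and overwrite it with
     * the hash of the replacement bootmain.elf. */
    static int patch_embedded_hash(uint8_t *image, size_t len,
                                   const uint8_t old_hash[32],
                                   const uint8_t new_hash[32])
    {
        for (size_t i = 0; i + 32 <= len; i++) {
            if (memcmp(image + i, old_hash, 32) == 0) {
                memcpy(image + i, new_hash, 32);  /* patch in place */
                return 0;
            }
        }
        return -1;  /* hash not found: layout differs from assumption */
    }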
<zid>
immibis: nintendo logo is hashed, the gb just doesn't actually *care* about the hash
<clever>
and thats more in the realm of the signing on the .bin file
<clever>
from the factory, the .bin file (on sd or spi) must be signed with an hmac-sha1 key
<clever>
and the only changes you can do, is to make it more restricted, enabling RSA with an unknown private key
<clever>
barring exploits/bugs in the boot rom, which i havent found yet on the 2711
<clever>
the only exploit ive found was a timing exploit on the 2835's sig checking, but that doesnt even check sigs on the pi0/pi1
<clever>
and a minor bug in the gpio boot mode stuff, but nothing that could be exploited
<immibis>
and yet you wrote your own boot rom. How does that not make security irrelevant?
<immibis>
well, not rom
<clever>
immibis: the entire pi0-pi3 lineup has signature checks disabled by default, so it will just run any bootcode.bin it finds on the SD card
<clever>
immibis: the bcm2711 has hmac-sha1 checks enabled by default, but the start4.elf it loads later isnt verified, and can dump enough data to reconstruct the key
<immibis>
By the way I worked at a place where we had the source code for the DRAM training in modified u-boot. It doesn't make it any better
<clever>
immibis: my goal is less about changing the dram init, and more about changing what happens after dram init
<immibis>
(We didn't need that source code, but we had it as part of the board support package. We also had a register listing of the entire SoC - still not useful as half of them only had bit names and no descriptions)
<bslsk05>
github.com: rpi-open-firmware/arm_control.h at master · librerpi/rpi-open-firmware · GitHub
<immibis>
I think they just hack the hardware and software together until it works, then ship it. The reason for no documentation is because they just ask the hardware people
<clever>
immibis: addr, bit field masks, but this file is in a different format from normal
<bslsk05>
en.wikipedia.org: List of game engines - Wikipedia
<GeDaMo>
I have experience with forgetting stuff :|
<mjg>
unity
<mjg>
only found it because i remembered slender: 8 pages uses it
<mjg>
never used it personally, but i hear it is pretty good
<zid>
heh
<zid>
mjg, like half of all games on steam use it
<zid>
it's like saying "I think I heard of this OS.. micro shaft? winders?"
<mjg>
zid: :)
<mjg>
zid: the newest game i played was released in 2008
<mjg>
zid: so there is that
<mjg>
or slightly later, but the point remains
<jafarlihi>
I'm trying to do FreeBSD kernel development but can't figure out getting IDE right for things like autocomplete. Is vim with CoC and bear best option? How do you set up things like CLion for autocomplete? What do you use?
<clever>
jafarlihi: i tend to just use vim + youcompleteme, YCM auto-completes any keyword in any currently open file, so it just magically works, if the .h is open in another :tabe
<jafarlihi>
clever: So I need to have the struct definition .h open in another tab to get autocomplete on fields? yuck
<bauen1>
I use neovim + clangd-13 lsp to get autocomplete for my C / C++ projects, but I'm not sure how well that would work with the FreeBSD kernel
<clever>
jafarlihi: or another .c file thats using it
<clever>
jafarlihi: YCM probably also has better options, i just never bothered to look into how they get setup
<jafarlihi>
/exit
<jafarlihi>
q
jafarlihi has quit [Quit: WeeChat 3.6]
<heat>
he's on freebsd now huh
<heat>
3rd kernel in like 1 month
<mjg>
chad kernel dev
<gog>
dang i gotta step it up
<gog>
i dev on 0 kernels
<gog>
i started to have ideas last night but fell asleep instead
<gog>
probably will do the same tonight
<heat>
hi vincent van gog
<heat>
I take patches
<junon>
If spinlocks in the kernel are so bad because of interrupts (specifically NMIs) and reentrancy then how do you even begin to achieve a shared resource system?
<heat>
they're not
<heat>
spinlocks are great in the kernel
<heat>
use them as widely as possible for smaller locks
<heat>
for larger locks (that can sleep), use mutexes, rwlocks, etc
<junon>
How do you handle cases where a spinlock needs to be locked in both normal code as well as during an interrupt? Can spinlocks be reentrant?
<junon>
Otherwise you risk deadlocking, no?
<junon>
Or you have to guarantee that you don't try to lock a spinlock from an NMI and just mask interrupts in the critical section
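(The usual answer to junon's question, sketched in C: take the lock with local interrupts masked so the interrupt path can't deadlock against its own cpu. The spin_lock_irqsave/irqrestore names mirror Linux; arch_irq_save/arch_irq_restore are assumed per-arch hooks, and NMIs, which can't be masked, simply must not take ordinary spinlocks:)

    #include <stdatomic.h>

    /* Assumed arch hooks: mask local interrupts and return the old
     * state; restore that state later. */
    unsigned long arch_irq_save(void);
    void arch_irq_restore(unsigned long flags);

    typedef struct { atomic_flag locked; } irq_spinlock_t;

    /* Acquire with interrupts masked: a (maskable) interrupt can no
     * longer arrive on this cpu while we hold the lock, so handler vs.
     * holder deadlock is impossible. */
    static inline unsigned long spin_lock_irqsave(irq_spinlock_t *l)
    {
        unsigned long flags = arch_irq_save();
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire))
            ;  /* spin */
        return flags;
    }

    static inline void spin_unlock_irqrestore(irq_spinlock_t *l,
                                              unsigned long flags)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
        arch_irq_restore(flags);  /* only unmask if it was unmasked */
    }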
<bslsk05>
github.com: Onyx/spinlock.cpp at master · heatd/Onyx · GitHub
<mjg>
heat: reading from the lock first is pessimal compared to just blindly doing cmpxchg the first time around
<zid>
yea that sounds slow in the fast case for no reason
<mjg>
it slows down contended case as well
<mjg>
at least on amd64
<clever>
now that i think about it, i think even just a normal load/store to the same cache line will contend with mutexes on arm
<mjg>
while (__atomic_load_n(&lock->lock, __ATOMIC_RELAXED) != 0)
<mjg>
clipper: did you mean expected_val?
<mjg>
clever: that should be a problem virtually everywhere due to coherency protocols
<clever>
yep
<clever>
but on arm, its somewhat worse
<clever>
arm has no atomic operations
<mjg>
or at least i'm unaware of anyone doing better than cache line
<clever>
on arm, you do load-exclusive, modify the data in a reg, then a conditional store-exclusive
<clever>
if you still had exclusive ownership of the cacheline, the store happens
<clever>
but if somebody stole the cacheline, the atomic fails, and you have to repeat that
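(What that ll/sc sequence looks like on AArch64 without LSE; a minimal sketch modeled on the usual kernel-style pattern, illustrative rather than production code:)

    #include <stdint.h>

    /* Atomic add via load-exclusive/store-exclusive: stxr writes 0 to
     * `fail` only if we still held exclusive ownership of the line;
     * otherwise the whole sequence repeats, as clever describes. */
    static inline void atomic_add_llsc(uint32_t *p, uint32_t n)
    {
        uint32_t val, fail;
        asm volatile(
            "1: ldxr  %w0, %2\n"        /* load-exclusive */
            "   add   %w0, %w0, %w3\n"  /* modify in a register */
            "   stxr  %w1, %w0, %2\n"   /* conditional store-exclusive */
            "   cbnz  %w1, 1b\n"        /* lost the line: retry */
            : "=&r"(val), "=&r"(fail), "+Q"(*p)
            : "r"(n));
    }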
<mjg>
unless you have LSE
<clever>
LSE?
<mjg>
large system extensions
<mjg>
most notably adds compare-and-set
<mjg>
basically you no longer have to ll/sc
<clever>
ahh
<clever>
what i was thinking though, is where x86/lse vs arm differ, if you load during a cmpxchg, then one or the other stalls, while the L1 cache line bounces about
<clever>
but on arm, if you load during a ll/sc, the atomic fails, and that whole block of opcodes has to repeat
<clever>
so it can possibly be more costly, likely the driving force behind LSE
<mjg>
lse definitely degrades less, but i don't remember numbers
<clever>
mjg: yeah, with true atomic ops, the cpu can atomically do the entire operation in basically 1 clock cycle, once it claims the L1
<clever>
vs the arm style, where it needs many clocks, and can possibly lose the L1 line and have to restart
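(For contrast, a one-liner showing the LSE difference; with GCC/clang and -march=armv8.1-a this typically compiles to a single ldadd, while without LSE it becomes an ldxr/stxr loop like the sketch above:)

    #include <stdint.h>

    static inline void atomic_add_any(uint32_t *p, uint32_t n)
    {
        /* single ldadd with LSE; ll/sc retry loop without it */
        __atomic_fetch_add(p, n, __ATOMIC_RELAXED);
    }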
<heat>
mjg, what's your suggestion? cmpxchg and then spin?
<heat>
thanks for code reviewing my shit tho
<heat>
much appreciated
<heat>
I've got 120k more lines for you to review
<mjg>
the most simplistic lock is this: if (cmpxchg(...)) return YAY; do { spin(); } while (atomic_read(&lock) != 0);
<mjg>
or so
<mjg>
and loop back that is
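(mjg's sketch spelled out as a C test-and-test-and-set lock: one cmpxchg up front, then spin on plain loads until the lock looks free before trying again:)

    #include <stdatomic.h>

    typedef struct { atomic_int lock; } tts_lock_t;

    static void tts_lock(tts_lock_t *l)
    {
        for (;;) {
            int expected = 0;
            if (atomic_compare_exchange_strong_explicit(&l->lock,
                    &expected, 1,
                    memory_order_acquire, memory_order_relaxed))
                return;  /* YAY */
            /* contended: spin on reads only, don't hammer cmpxchg */
            while (atomic_load_explicit(&l->lock,
                                        memory_order_relaxed) != 0)
                ;
        }
    }

    static void tts_unlock(tts_lock_t *l)
    {
        atomic_store_explicit(&l->lock, 0, memory_order_release);
    }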
<geist>
yeah that'll work pretty well too, i'd start with that
<geist>
it doesn't scale too well, but it's also implementable on pretty much any arch, even one with simply a swap instruction
<heat>
geist, wdym doesn't scale too well?
<heat>
are you talking about spinning there vs mcs locks and whatnot?
<geist>
yeah
<geist>
but it works pretty well
<heat>
AIUI it's not that simple for some reason
<geist>
yah what you have there should be probably okay for x86. for ARM it's a bit more complicated because of wfe/sev/atomic interactions
<geist>
but arm has the example version of that up somewhere
<heat>
linux has a configure option between normal spinning spinlocks and queued spinlocks (MCS locks)
<heat>
so it's not like MCS is always faster I guess? or at least there's a drawback
<geist>
yah. personally i'd write the spinlock in hand asm for each arch, or have a generic one (like that) and then have a per arch version
<geist>
yah there are tradeoffs, i think for small number of cores a spinlock with a pause/wfe is probably more ideal
<geist>
if nothing else because it uses less space, etc
<heat>
how much do you gain from micro-optimizing this?
<geist>
define 'micro optimize'
<heat>
hand-writing it
<geist>
well, for example, on arm you'd actually get the ability to stop the cores from spinning, which i think is a fairly major power/emulation win
<geist>
since the cores will basically mwait/monitor while the lock is held
<mjg>
mcs is not always faster
<mjg>
in fact mcs tends to be fucking *terrible*
<geist>
indeed. that's why i haven't just jumped into rewriting all of the zircon ones until we really have time to figure out what to switch to
<geist>
it's a compromise
<mjg>
what mcs guarantees is fairness
<geist>
right
<heat>
fucking terrible?
<mjg>
but fairness can demolish performance
<heat>
why?
<heat>
SMT?
<mjg>
no
<geist>
this is where i'd encourage you to go look at how MCS works
<heat>
i have
<geist>
it's clever and interesting, but you can probably see why it's far more complicated and would, if nothing else, have bad cache coherency issues etc
<mjg>
key to slowdown when things are contested is that there are cachelines bouncing back and forth
<geist>
OTOH i haven't looked at it in some times
<geist>
yah
<geist>
(the cacheline stuff)
<mjg>
if the lock is 100% fair, they bounce more than with a greedier lock
<mjg>
this is especially visible if you have a multisocket system
<mjg>
and all the cpus are pounding on the same lock
<mjg>
perf is atrocious, but nobody is highly favored
<mjg>
so it is a pragmatic choice when you can't have HUGE outliers
<mjg>
with a greedy lock you may find someone is starved to death, so to speak
<bslsk05>
fuchsia.googlesource.com: zircon/kernel/arch/arm64/spinlock.cc - fuchsia - Git at Google
<geist>
i honslty dont know how to do a WFE style lock with LSE
<mjg>
that said, for most systems, a perfectly OK (not perfect!) lock would just use backoff
<heat>
i'm not at a point where I can read that
<heat>
but looks cute
<mjg>
the standard approach is to increase spin times
<mjg>
1 spin, 2, 4, and so on up to a predefined limit
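(That backoff scheme as a sketch; the 1024 cap and the x86 pause are arbitrary placeholder choices:)

    #include <stdatomic.h>

    static void spin_lock_backoff(atomic_int *lock)
    {
        int spins = 1;
        int expected = 0;
        while (!atomic_compare_exchange_weak_explicit(lock, &expected, 1,
                   memory_order_acquire, memory_order_relaxed)) {
            for (int i = 0; i < spins; i++)
                __builtin_ia32_pause();  /* x86; a wfe-style hint elsewhere */
            if (spins < 1024)
                spins *= 2;              /* 1, 2, 4, ... up to the limit */
            expected = 0;                /* cmpxchg rewrote it on failure */
        }
    }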
<geist>
heat: yeah i was just pointing out that a standard arch-neutral C version with intrinsics is not a perfectly optimal solution
<geist>
i generally prefer to write that stuff in hand asm so that the compiler doesn't one day decide to do something stupid and tank the implementation
<mjg>
:)
<mjg>
heat: as for the paste https://godbolt.org/z/GMra5ed7M, if you want "optimal", you would have a fast path which just takes the lock and falls back to a func call if that fails
<mjg>
compilers like to do nasty stuff when faced with a loop
<geist>
in that link above the real key is between line 22 and 20. if it looks at the old value and sees that it's already acquired, it immediately loops back to a WFE. the cpu will halt until it senses that it lost the exclusivity of the cache line
<heat>
mjg, I tried to add an explicit cmpxchg before it but clang just makes it part of the loop
<mjg>
heat: because you need a different func
<heat>
gcc explicitly creates a fast path there, as you'd like
<geist>
another cpu writing to a cache line that a cpu thinks it has an exclusive lock on is an implicit 'event' in arm world
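(The wfe pattern geist describes, sketched from memory after ARM's published example, so treat it as illustrative: ldaxr arms the exclusive monitor, and the unlocker's store to the line is the implicit event that wakes the wfe:)

    #include <stdint.h>

    static inline void wfe_spin_lock(volatile uint32_t *lock)
    {
        uint32_t tmp;
        asm volatile(
            "   sevl\n"               /* local event: first wfe falls through */
            "1: wfe\n"                /* halt until the line changes hands */
            "2: ldaxr %w0, %1\n"      /* load-acquire exclusive */
            "   cbnz  %w0, 1b\n"      /* still held: back to sleep */
            "   stxr  %w0, %w2, %1\n" /* try to take it */
            "   cbnz  %w0, 2b\n"      /* lost the line: reload */
            : "=&r"(tmp), "+Q"(*lock)
            : "r"(1u)
            : "memory");
    }

    static inline void wfe_spin_unlock(volatile uint32_t *lock)
    {
        /* store-release; also the event that wakes waiting cpus */
        asm volatile("stlr wzr, %0" : "=Q"(*lock) : : "memory");
    }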
<mjg>
with attribute(noinline)
<mjg>
ye i mostly deal with clang and it keeps messing it up
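(mjg's fast-path/slow-path split, sketched: the noinline attribute keeps the compiler from folding the single fast-path cmpxchg into the spin loop:)

    #include <stdatomic.h>

    __attribute__((noinline))
    static void lock_slow(atomic_int *lock)
    {
        int expected;
        do {
            while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
                ;  /* wait until it at least looks free */
            expected = 0;
        } while (!atomic_compare_exchange_weak_explicit(lock, &expected, 1,
                     memory_order_acquire, memory_order_relaxed));
    }

    static inline void lock_fast(atomic_int *lock)
    {
        int expected = 0;
        /* fast path: one cmpxchg inline, a call only on contention */
        if (!atomic_compare_exchange_strong_explicit(lock, &expected, 1,
                 memory_order_acquire, memory_order_relaxed))
            lock_slow(lock);
    }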
<mjg>
geist: do you have a bench for that wfe thing?
<geist>
it's not a bench it's the fact that the cpu stops spinning
<geist>
and thus uses dramatically less power
<mjg>
but how does it affect performance
<geist>
it may even be slower (though i'm fairly certain thats not the case)
<geist>
i have no idea, but it's not even on the table to *not* do that
<geist>
though i'm fairly certain it is pretty fast
<mjg>
i would definitely try to quantify at least
<geist>
oh sure, it's just been a few years since i looked at it
<mjg>
point being *tiny* contention is everywhere
<mjg>
and you don't want to pay big time for being unlucky
<geist>
but it's probably faster even because the other cores stop spinning on it, and bouncing the cahe line around
<geist>
but like i said the power savings are basically mandatory
<geist>
it's the equivalent of leaving out the pause instruction on these things on x86
<mjg>
well you may still get them if the lock is contested for more than one or two spins, so to speak
<geist>
you can do it, and it might even be faster, but you simply dont (unless there's an errata)
<mjg>
not having pause will definitely make it *slower*
<mjg>
as you keep fucking up the line
<mjg>
even if you switch to pure reads
<geist>
yah. what i dont understand is how to do the WFE trick on an LSE implementation
<mjg>
well not the line itself but ownership
<geist>
it may well be that you always use the ll/sc variants for spinlocks, because they're more flexible
<geist>
linux is no help because they just use MCS locks on arm64 now, so there's no example to look at
<geist>
and last i checked the BSDs all still use something simple like this
<geist>
using ll/sc
<mjg>
btw, did you mention last time that zircon has reasonably optimized string primitives? memset, memcpy and so on
<mjg>
for arm64?
<geist>
we use the canonical ones that ARM publishes
<mjg>
did you just take cortex strings lib?
<geist>
linux uses them too
<geist>
cortex strings lib
<mjg>
ye that's what freebsd is using
<mjg>
i wonder if that stuff is truly the best you can do (without doing different tradeoffs)
<mjg>
anyhow, re mcs vs wfe, have you tried asking arm? :)
<geist>
have not. tis the fun part of work: juggling 27 things
<geist>
i tend to try to focus on one or two things at a time
<mjg>
(:
<geist>
or at least it's the only way i've found i can reasonably expect to make forward progress
<geist>
so one day i or someone on the team will look at it, and then we'll see
<mjg>
ye i just made a bunch of hacks to freebsd (not committed yet) to unfuck make
<mjg>
talking about "focus"
<heat>
make deserves to be fucked
<heat>
like everyone
<geist>
the biggest problem is we have a few spinlocks that are a source of contention, so really the first strategy is to break them up a bit
<geist>
then optimize the spinlock
<mjg>
this reminds me, do you have anything rcu-like?
<geist>
no
<mjg>
now that is going to put a damper on these efforts
<geist>
there is/was a hard rule to avoid anything rcu like
<mjg>
any plans to get one?
<geist>
like, dont even look at it, because patents
<mjg>
why?
<mjg>
ok
<mjg>
that's fair
<geist>
i think it's all expired, but i dont know because i haven't looked
<mjg>
some of it is, enough to make something which basically works
<heat>
rcu-like is not rcu
<mjg>
at a tolerable scale
<heat>
I think mjg meant like EBR
<geist>
sure, but IANAL
<mjg>
i'm saying sooner than later you will need it
<heat>
epoch(9) in your local freebsd manpages
<geist>
yes, but that doesn't describe the implementation
<heat>
if you're not using freebsd, you're a regular, normal human being
<mjg>
performance aside, it helps avoid deadlocks
<mjg>
as in gets rid of some of the possible lock orderings by letting you not take one to begin with
<geist>
indeed. that being said the zircon kernel doesn't have as many locks, etc. i think we prefer to measure and go in with eyes open on what we're doing
<geist>
find bottlenecks, fix them, reorganize to avoid them, etc
<mjg>
now i'm curious how do you do path lookups
<geist>
basically at level 1 of optimization right now, haven't gotten into really fancy data structures yet
<geist>
path lookups?
<heat>
mjg, not in the kernel
<mjg>
oh?
<heat>
THIS IS A MICROKERNEL, BABY
<mjg>
open("/foo/bar/baz", ...);
<heat>
EVERYTHING IS IN USER SPACEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
<geist>
user space
<mjg>
huh
<geist>
and i dont know how they do it
<mjg>
you got some plan9 devs over there?
<geist>
(it's probably slow as hell)
<heat>
yes
<heat>
they do
<mjg>
i'm not talking about pike :p
<geist>
uh no. rob pike works at google, but he doesn't work on fuchsia
<heat>
I'm guessing a good chunk of fuchsia isn't very scalable because it's not a server OS, for now at least
<geist>
exactly
<geist>
that's why it's not *that* big of a deal yet that the spinlocks dont scale, etc.
<geist>
the bigger problem is fairness because of big.LITTLE
<heat>
geist, dhobsd was around 9front
<heat>
which is totally plan9
<geist>
sure
<geist>
anyway, yeah part of zircon is just avoiding a lot of the complex data structures because there's just less code in the kernel anyway
<geist>
by far the most complex part of the kernel is the VM i'd say
<mjg>
does the kernel do networking or is that also outsourced
<geist>
user space
<geist>
no fs, no networking, no drivers (except the handful needed to drive the cpu)
<geist>
it's an async ipc design, so closer to mach than L4
<heat>
>less harmful alternatives
<geist>
extremely capability based, so no concept of user or permissions or whatnot. just handles to kernel objects with rights per handle
<heat>
>sed
<heat>
fuck yeah baby
<heat>
erm
<heat>
s/sed/ed/g
<heat>
fuck yeah baby
<mrvn>
mjg: files are in a graph and a path lookup walks a path through the graph.
<gog>
yes
<geist>
no doy
<mjg>
harmful:
<mjg>
> FreeBSD, NetBSD, Solaris.
<mjg>
less harmful:
<mjg>
OpenBSD
<mjg>
(:
<mrvn>
mjg: I'm still not sure RCU is all that useful because it makes delete almost impossible, and if you do have write contention your task will just get stuck doing the same thing over and over because its update fails.
<mjg>
?
<mjg>
deleting stuff is pretty trivial
<heat>
openbsd just discredits them entirely
<mjg>
say you keep a linked list of rcu-protected objects
<mrvn>
mjg: you can only delete stuff once all readers are done with it. That's the hard part in RCU and where all the patents are.
<heat>
"or best of all: don't use HTTP."
<mjg>
in that case i have to ask what do you mean by 'delete'
<heat>
what do they want? gopher?
<mjg>
remove from the list
<mjg>
or actually free
<mrvn>
mjg: delete node
<mjg>
delete in c++ parlance as in free?
<mrvn>
c++ delete, free
<mjg>
ye, there is crappery around it, but nothing fundamentally hard to get it to work to begin with
<mjg>
really this starts getting difficult when you want this to work at scale
<mrvn>
And if you aren't working at scale there is no problem with locks to begin with
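(The shape of the delete being argued about, using Linux RCU API names; a sketch assuming a global obj_lock spinlock and the usual <linux/list.h>/<linux/rcupdate.h> machinery. Unlinking is the easy part mjg means by trivial; the deferred free is the part mrvn means by 'delete':)

    struct obj {
        struct list_head node;
        int key;
    };

    /* writers still serialize on a lock; only readers are lock-free */
    void obj_remove(struct obj *o)
    {
        spin_lock(&obj_lock);
        list_del_rcu(&o->node);   /* unlink; readers may still see o */
        spin_unlock(&obj_lock);

        synchronize_rcu();        /* wait out every pre-existing reader */
        kfree(o);                 /* now nothing can still hold o */
    }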
<mjg>
"scale" for me is definitely way above 100 threads
<mjg>
and even below that avoiding locks is huge
<mjg>
as certain objects are heavily shared, see the path lookup stuff i mentioned the last time
<mjg>
plus you get a perf win single-threaded because you are doing fewer atomics
<heat>
the fd table
<heat>
you could get away with a rw lock but then accept4(), et al becomes slower
<mrvn>
mjg: I think that's more a problem of how you do the locking than of locking itself. Like my file graph at the moment has multiple locks. One that covers all the linkage and everything you need to walk the graph and one to modify objects without affecting the linkage. So I can happily write to files all I want, updating their size, cache references, ... without ever blocking the path walking.
<raggi>
Go's spinning in the scheduler and rwlocks will show up on profiles under some conditions and isn't super healthy, but until the os provides a cooperative way to avoid that without a syscall there's not much that can be done. A simple solution would be a thing in the vdso that helps you decide if you have enough quanta left that spinning makes sense or if you should just yield instead. It wouldn't look a whole lot more complex than gettimeofday optimizations, it
<raggi>
could even share a page with it
<mrvn>
raggi: isn't that what futex is about?
<mjg>
mrvn: to recall from last time, both freebsd and linux have fully scalable lookups where terminal path component is different
<mrvn>
mjg: both bsd and linux come from a design with a single global lock. Don't think they have it split up enough yet.
<mjg>
mrvn: for example foo/bar/baz/${threadid}. doing that will *never* bounce cachelines
<mjg>
mrvn: what is your code going to do
<mjg>
2 threads, one wants foo/bar/baz/thread1, another one foo/bar/baz/thread2
<mjg>
there will be literally 0 bouncing of anything on freebsd and linux
<raggi>
mrvn: if futex was suitable, we'd be able to persuade people to stop spinning in userspace ;)
<mjg>
in your code i presume they will compete for the same lock(s)
<mrvn>
mjg: On my os: They ask the filesystem service to open "threadX" for the handle "foo/bar/baz". So there is no refcount bouncing.
<mjg>
raggi: what you really want is info if the lock owner is on cpu imo. this info is what's used for adaptive spinning everywhere that i have seen
<mrvn>
raggi: futex spins until it doesn't make sense
<mjg>
raggi: .. in kernels
<mjg>
mrvn: well let me restate, how does it scale
<raggi>
mjg: yeah, essentially we need to expose more scheduler information to userspace in order to get userspace to contend with it less
<mjg>
mrvn: i can boot a 96-way box, have 96 threads opening foo/bar/baz/${mythreadnumber}
<mjg>
mrvn: and have it scale perfectly
<mjg>
(well modulo some uarch issues)
<mjg>
but no SMP problems
<mrvn>
mjg: creating the files will go down the crapper
<mjg>
how is something like this going to look on your kernel
<mrvn>
mjg: you end up with 96 cores spinning on foo/bar/baz/ updating it over and over and failing.
<mjg>
you mean in your kenrel or bsd/linux
<mrvn>
mjg: on bsd/linux
<mjg>
also i did not ask about file creation
<mjg>
just *opening* a file, which presumably already exists
<mjg>
then see above
<mjg>
as for perf of parallel creation in the same dir, we can talk in a minute
<mjg>
let's sort out this bit first
<mrvn>
In my kernel it will end up being sequential with all 96 cores doing an "atomic incr".
<mjg>
which part. opening a file which already exists or creating a new one
<mrvn>
opening an existing file
<mjg>
well that's terrible from perf standpoint
<mjg>
and wont be an issue on linux nor freebsd
<mjg>
in fact i can slap a quick bench right now
<heat>
slaappppppppppppp it
<mrvn>
mjg: will it? Is that actually something you will do over and over? Sounds more like something you do once on program start and then run with it for a long time.
<mjg>
i refer you once more to the -j 104 bzImage test
<mjg>
granted it was not 'open' a lot, but 'stat'
<mjg>
and then
<mjg>
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
<mjg>
after: 147.36s user 313.40s system 3216% cpu 14.326 total
<mjg>
slight correction: i still did not commit unslowing of the open files counter (as in for file descriptors), so open per se does run into a bottleneck, just not related to path lookup
<mjg>
as for pure path lookup, here are access() calls to different files within /tmp
<mrvn>
mjg: But now look at the case of creating or deleting a file.
<mjg>
./access1_processes -t 1
<mjg>
min:4139933 max:4139933 total:4139933
<mjg>
min:4139242 max:4139242 total:4139242
<mjg>
min:1618870 max:1690272 total:172582177
<mjg>
min:1618546 max:1693889 total:172728838
<mjg>
i would say scales pretty nicely
<mrvn>
mjg: so your open performs like mine. every core needs an atomic inc of the open counter.
<mjg>
so yes, parallel creation of files in the same directory runs into a lock
<mjg>
what
<mjg>
man
<mjg>
you are missing the point
<mjg>
the open at hand does not suffer any scalability problems from path lookup, which i'm talking about
<mjg>
there is a bottleneck outside of it, which i'm going to fix later
<mrvn>
mjg: If you unslow the open THEN you are better
<mjg>
but the question is whether delayed memory reclamation is of any use
<mjg>
not whether i got rid of the global open file counter
<mjg>
and for that, see access benchmark above for an example
<mrvn>
mjg: there are 2 problems with create/delete. One is the memory reclamation. The other is contention on the update.
<mrvn>
RCUs are really really bad if you get update contention.
<mjg>
they may slow things down, yes
<mjg>
have you tried benchmarking your code against linux fwiw?
<heat>
benchmarking against linux means taking a big L
<mjg>
i have some recollection they started adding support for parallel file addition
<mrvn>
they try to update, fail, retry, fail, retry, fail. Every loop only one core succeeds and all others spin doing the same work over and over.
<mrvn>
mjg: That's O(N^2)
<heat>
mjg, I think the rename lock is now a seqlock
<mjg>
what loop?
<mrvn>
mjg: the update process
<mjg>
be more specific please
<mjg>
update of what, global rcu state?
<mjg>
so that grace periods can move forward?
<mrvn>
mjg: when you want to create a file the core goes to the directory node, copies it, modifies it and then tries to compare&exchange it atomically. If the compare fails it has to do it all over again.
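(A sketch of the copy/modify/compare&exchange loop mrvn describes, with hypothetical structures, showing where the retry storm comes from: one winner per round, everyone else redoes the copy:)

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    struct dirnode { int nentries; /* ... entries ... */ };

    static _Atomic(struct dirnode *) dir_root;

    static bool dir_add_entry(void)
    {
        for (;;) {
            struct dirnode *old = atomic_load(&dir_root);
            struct dirnode *new = malloc(sizeof(*new));
            if (!new)
                return false;
            memcpy(new, old, sizeof(*new));  /* copy the node */
            new->nentries++;                 /* modify the copy */
            if (atomic_compare_exchange_strong(&dir_root, &old, new))
                return true;  /* won this round; old must still be
                                 RCU-freed once readers are done */
            free(new);        /* lost the race: redo all the work */
        }
    }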
<mjg>
well it has been some time since i had a look at what linux is doing there, but i'm pretty sure they take a lock
<mjg>
or at least they used to
<mjg>
on the inode
<mrvn>
mjg: but then it isn't RCU anymore.
<mjg>
*changing* stuff is not
<mjg>
they take the lock to change stuff
<heat>
dentry code isn't fully RCU
<mrvn>
mjg: but I think you have to. RCU is good for lots of readers but writing goes bad quickly.
<heat>
it's best case RCU, worst case you take locks
<heat>
no it doesn't
<heat>
that's why RCU is a thing
<heat>
rw locks aren't good enough
<mjg>
rcu does suck if your workload mainly modifies stuff
<mjg>
but that is NOT the case for path lookups
<mrvn>
heat: I consider O(N^2) really bad if N is 1000
<mjg>
or several other cases where rcu is heavily used
<mrvn>
mjg: sure. there are plenty of cases where RCU will work. So far I seem to have just avoided them.
<mjg>
well you got path lookups, so that's already one
<mjg>
:)
<mrvn>
mjg: except I don't tend to walk paths but have more of an openat() design
<heat>
linux's process/pid data structures are also fully rcu
<mjg>
heat: interestingly the task stuff does not scale
<mjg>
heat: most notably the tasklist_lock is heavily overused
<mrvn>
fully RCU? tasklist_lock? only one of those can be true
<mjg>
task stuff is not fully rcu, but various components are
<heat>
maybe not fully yeah
<mjg>
for example you can look up a task with rcu
<heat>
my bad
<mjg>
and perm check against it
<mjg>
what does not scale for shit on linux is task creation + destruction
<mjg>
part of the problem is that they never cleaned up the old hack with threads pretending to be processes
<mrvn>
mjg: One thing to know about my kernel is that everything uses message passing. So processes and drivers have a mailbox. So 96 cores trying to open a file will hammer the vfs mailbox trying to add messages to the mailbox. So you have your big serializer there that RCU can't fix.
<mjg>
so for example a thread being created (or dying) in a process serializes against creation of a new process (or its destruction)
<mrvn>
mjg: to scale I will have to implement some per-core mailbox design.
<mrvn>
mjg: thread and process are basically the same. Just different flags for what namespaces to create. :)
<mjg>
to make it worse, even a toy bench where you fork n workers, each of which spawns and kills a thread in a loop
<mjg>
runs into dramatic contention
<wxwisiasdf>
Good morning
<mrvn>
mjg: that's why you don't do that. :)
<mjg>
well real apps do create and kill tons of threads
<mjg>
and you write a microbenchmark like that to get an idea where it is slow
<mjg>
for example clang is notorious for it
<mrvn>
mjg: have they never heard of thread pools?
<mjg>
this is what they do, except the entire thing is short lived
<mjg>
they check how many cpus they can run on, spawn this many threads blindly
<mjg>
and exit shortly after
<mrvn>
Oh you mean, clang starts, creates 96 threads, reads a header and dies?
<mjg>
does the work with however many threads can do the work
<mjg>
and then exits
<mjg>
afair it is mostly their linker
<mjg>
now imagine building many different programs at the same time
<wxwisiasdf>
solution is OpenTBB :-)
<mrvn>
Yeah. but to be fair, linking is usually not a short job.
<mjg>
ye i'm pretty sure they don't have any win from spawning all these threads for most programs
<mjg>
not everything is chrome
<mjg>
i created a ticket about it, but as you can imagine there was no response
<mjg>
not even a "fuck off"
<mrvn>
mjg: They could check the input and estimate the workload and then create threads when needed.
<mjg>
this is what i suggested
<mjg>
in fact i may end up implementing something like this
<mjg>
good news is that i have an ok base of software to test it against
<mrvn>
Could be a big win for unit tests. They are probably the smallest link jobs you get.
<mjg>
namely the entire freebsd ports collection
<mrvn>
Does clang have an option to limit thread counts?
<mjg>
progs of all shapes and sizes
<wxwisiasdf>
The best way to minimize that is to pass all the files at once to clang so it reuses threads and doesn't repeatedly spawn stuff
<mjg>
not explicitly, the best you can do is use taskset or compatible
<wxwisiasdf>
but then you get absurd build times and something failing means you have to do it all again so uh
<mjg>
if it finds out it can only run on n cpus, it is going to spawn n threads
<mrvn>
If I build with -j I don't want the compiler to use any threads.
<mrvn>
Only at the end when I link with LTO do I want to use all cores.
<wxwisiasdf>
mold + clang perhaps?
<mrvn>
(but then that lacks integration with make)
<mrvn>
mjg: Would be cool if clang could use the GNU make jobserver.
<mjg>
ye i'm done with ranting about make for the week
<mjg>
(bsd make mostly :>)
<mjg>
and i know it is only monday
<mrvn>
mjg: what? No complaints about "make -j4" all mixing up the output from the processes?
<mjg>
make -sss
<mrvn>
GNU make has an option now (--output-sync) to buffer the output from a target and then atomically print it. The only problem with that is that it always buffers the output till the end. Now you have 2 interactive jobs waiting for input.
<mrvn>
it should show the output from one target and buffer the rest.
<mjg>
i don't watch any output modulo errors, so i'm mostly fine here
<mjg>
see -s
<mjg>
what i like about the linux build process is that they default to simply printing CC foo.c
<mjg>
as opposed to the full blurb
<mrvn>
I'm building netboot images for a cluster and one step of it is running kvm, running test cases in kvm and recording what binaries get used. That can take a while and all that time make shows nothing.
<mrvn>
And I have a flag to go interactive on error. Can't use that with -j.
<mrvn>
So you start the build with -j 8, it churns a bit and then you have 8 kvms running. 20 minutes later it stops with an error because one of the targets had an error right at the start but was blocked from printing it. :((
<mjg>
uh
<mrvn>
or it printed it but you didn't see it scroll by
<mjg>
well bsd make will tell you stuff failed *somewhere*
<mrvn>
sure, but it will still run those 8 other targets, or 7 since one failed.
<mjg>
but then you will have tons of other make processes claiming something failed, so they are exiting
<mjg>
even with -j 40 that's enough for the error to go way out
<mjg>
it got to a point where i filter the output with awk
<mrvn>
mjg: I kind of would like a make browser. Sort all the make output into a tree following how each target triggers the next and let me collapse or expand the branches.
<mjg>
ye sounds nice
<mrvn>
Show successful targets in green, running in yellow, failed in red.
<mjg>
existing CI suites can somewhat do it fwiw
<mrvn>
future targets (if it can predict them) in black
<mjg>
... not that i can honestly recommend one
<mrvn>
mjg: I think my kernel makefile is rather nicely done. By default it just shows "CC foo/bar/baz.cc", "AS boot.S" or "LD kernel.img". One line per file. If you want details you have to set VERBOSE.
<mjg>
ye that's the linux way
<mjg>
i like it
<mjg>
but normally i build the entire thing in 30s
<mjg>
(freebsd i mean)
<mjg>
so i don't want anything scrolling by
<mrvn>
it sucks when the slowest thing in your build is printing the commands
<mrvn>
Do you use ccache?
<mjg>
no
<mjg>
i don't trust it and have no use for it
<mjg>
(not ccache specific, i barely trust anything claiming to do incremental builds)
<mrvn>
it's not incremental. It just caches the whole output of a command
<mjg>
i got screwed over by .o files not recompiling after header changes
<wxwisiasdf>
ccache?
<mrvn>
If you have something that takes longer than 30s to build it's a huge speedup.
<mjg>
i'm sure it mostly works for people, good for them
<mrvn>
mjg: that has nothing to do with ccache.
<mjg>
but it has to be able to detect that 'the same file compiled' actually needs to be compiled
<mrvn>
mjg: ccache hashes the preprocessor output, compiler inode, ...
<mjg>
well if it is reliable, that's nice
<mrvn>
haven't seen it fail yet
<mjg>
if i had use for this kind of a tool i would look into it
<mjg>
fortunately see above
<mrvn>
might get your 30s compile time down to 10s :)
<mjg>
30s is from scratch
<mjg>
i mostly modify .c files and that i trust bmake to pick up for an incremental build
<mjg>
in which case it is literally 2-3s
<mjg>
i'm good here
<mrvn>
.oO(i barely trust anything claiming to do incremental builds)
<mjg>
i don't for .h files
<mjg>
so i do a fresh build each time i modify one
<mrvn>
mjg: for incremental builds ccache is totally useless. But if you "make clean" after modifying a header it would help.
<mjg>
i can live with the 30s
<mrvn>
In my case it helps hugely when building packages because they always make clean.
<mrvn>
I can just "git-build-dpkg" and it's like I do an incremental build.
<mjg>
you mean .debs?
<mjg>
i can see how that would be of use, sure