klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
<rwxr-xr-x> i see
<\Test_User> ld is typically the linker you'll use
<zid> and to produce object files from your assembler
<zid> Makefile that runs nasm file.asm -o file.o -felf32 over each .asm file it sees, then runs ld file1.o file2.o file3.o -o test.elf ^
<heat> ld is like 3 linkers
<zid> along with some linkerscript voodoo to make it org the code to 0x7C00 and make sure boot.o's code ends up first
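A minimal sketch of the kind of linker script zid is describing (file, section, and symbol names are assumptions, not his actual script):

    /* link.ld -- boot sector org'd to 0x7C00, boot.o's code first */
    OUTPUT_FORMAT(elf32-i386)
    ENTRY(start)
    SECTIONS
    {
        . = 0x7C00;            /* "org" the code to 0x7C00 */
        .text : {
            boot.o(.text)      /* make sure boot.o's code ends up first */
            *(.text)
            *(.rodata)
        }
    }

Linked with something like "ld -melf_i386 -T link.ld file1.o file2.o -o test.elf", then "objcopy -O binary test.elf test.bin" to get the flat image the BIOS actually loads.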
<\Test_User> > -o test.elf
<zid> yea, fite me irl
<\Test_User> definitely shouldn't be an elf file :P
<rwxr-xr-x> cool shit
<rwxr-xr-x> so much to learn
<zid> \Test_User: look at the commit
<rwxr-xr-x> almost overwhelming
<\Test_User> zid: it's a flat binary though, no?
<zid> What is the "it"
<gog> hi
<zid> hmm my linker script is wrong
<\Test_User> the output from ld, that you're putting into test.elf
<zid> no, it's an -melf_i386
<\Test_User> ahhh
<\Test_User> I see
<zid> 4288 bytes, test.bin, I fucked something up here..
<zid> oh it has oodles of padding
<\Test_User> any particular reason for elf_i386 when it's a mix of 16-bit and 32-bit code?
<zid> because there's no such thing as a 16bit elf?
<zid> and it has 32bit code in it
<zid> but who cares, it's just a carrier format so I can objdump it
<zid> objcopy*
<\Test_User> fair
<\Test_User> and it actually works right trying to use ld to match 32-bit addressing with 16-bit code?
<zid> calls/jumps are relative on x86, either way
<zid> and use the same instructions
<zid> the linker can't tell the difference
<zid> if I wanted to do 'dw some_32bit_symbol' i'd be in trouble, yea
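A hedged NASM illustration of the point (labels invented): relative control flow survives the 32-bit link, absolute data references are where it would break.

    bits 16
    start:
        call stage2              ; rel16: encoded as an offset, not an address,
        jmp  $                   ; so the elf_i386 link never has to fit one
    stage2:
        ret
        ; dw some_32bit_symbol   ; absolute address in a 16-bit field:
                                 ; the case zid says would be trouble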
<\Test_User> ah
<zid> pushed out the fixed version
<zid> properly works now (I forgot to remove the times 510-($-$$) db 0 / dw 0xAA55 part)
<heat> gog
<zid> which made the broken linker script look more correct than it was
<zid> heat: Are you beating my highscore yet? I'm up to 14af
<heat> no
<heat> im not having your malicious js running in the background
<\Test_User> does it error properly if you use more than 510 bytes of code or just pad itself again
<zid> I think it might? . = 0x200; is in there
vdamewood has joined #osdev
<zid> or it might just allocate around it and give you a bigger image idk :P
<zid> yep
<\Test_User> ld:link.ld:10 cannot move location counter backwards (from 0000000000007f29 to 0000000000007dfe) seems good
<zid> cannot move location coun-
<zid> linekrs are fnu
<\Test_User> cryptic error but not sure if much better could be done
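For reference, the check that produces that error is just an absolute assignment to the location counter; GNU ld's ASSERT can give a friendlier diagnostic. A sketch (the addresses match the error in the log; the message string is invented):

    .text : { boot.o(.text) *(.text) }
    ASSERT(. <= 0x7DFE, "boot sector code exceeds 510 bytes")  /* clearer message */
    . = 0x7DFE;                 /* the hard check: assigning backwards is the
                                   "cannot move location counter backwards" error */
    .sig : { SHORT(0xAA55); }   /* boot signature fills bytes 510-511 */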
<zid> nah cryptic is make
<zid> I swapped a $ and ( and it complained 5 lines later, where I next used the symbol, about invalid whitespace or something dumb
<zid> gog: I just let the cat in, I think she had a message for you or something idk? She said Mrroworrrrw
<\Test_User> lol fair
<gog> zid: cool tell her i said "bbbbreeew"
<zid> gog: Any tips for not religiously typing objdump every time I want to dump a section, instead of objcopy?
<zid> *wall of usage text* *oh I used the wrong one again didn't I*
<zid> Thank god objdump -j is nonsense rather than "delete all your shit"
<gog> alias them
<zid> like sl?
<gog> alias copysection="objcopy -j"
<zid> I'd still type objdump
<gog> retrain yourself
<\Test_User> alias objdump='kill -9 1'
<gog> give yourself a little treat every time you do it right
<zid> sl works better
<gog> pat yourself and call yourself a good boy
<zid> every time you type sl instead of ls, you have to wait for a steam locomotive to go past
<klange> set up a build pipeline that doesn't suck and never have to remember any of this shit ever again
<klange> i should package sl
<zid> I need an inverse alias
<zid> so that objdump -j runs sl
<klange> unfortunately it's a curses app and I haven't actually packaged curses in years
<klange> i should write my own curses
<heat> write your own curses?
<heat> you are clinically insane
rwxr-xr-x has quit [Remote host closed the connection]
<zid> I write my own curses all the time
<zid> by which I mean I am too lazy to use ncurses and just put some escapes into my printfs
<klange> You should see my editor.
<gog> i can write my own curses
<gog> fuck
<gog> shit
<gog> see
<gog> easy
<zid> those are mine
<zid> get your own curses
<zid> with áéúóí
<gog> helvitis
<Matt|home> evening.
<Matt|home> i seem to have some mental block.. or learning disability perhaps, with this topic. i've taken to drawing shitty little diagrams on a whiteboard which were promptly mocked by my more IT experienced relatives
<Matt|home> im going to switch to crayon instead i think.
<zid> I thought you had actual brain damage, re a discussion in asm
<zid> or was that a different matt
<Matt|home> a different person. i haven't talked in that channel in 2+ years.
<Matt|home> but thank you.
<zid> just mistaken identity
<gog> hi
<Matt|home> o\
<klange> I suggest paper and pencil, less easy for others to look at.
<klange> Actually I suggest paper and erasable ink pens, but that's splitting hairs.
gog has quit [Ping timeout: 268 seconds]
<heat> gog
<heat> gog
<heat> gog
<zid> it's 1am heat
<zid> keep it in your pants
<heat> bazinga
<heat> oh wait wrong person answered
<heat> fuck
<heat> TIL -mrelax-cmpxchg-loop
<zid> what's that one do
<heat> Relax cmpxchg loop by emitting an early load and compare before cmpxchg, execute pause if load value is not expected. This reduces excessive cachline bouncing when and works for all atomic logic fetch builtins that generates compare and swap loop.
<heat> it even has a typo woah
<zid> get you some cachlines
<zid> want me to ping jwakely?
<zid> I've had him fix typos before
<heat> lol
<zid> punged
<zid> doing well today, bug in qemu, bug in gcc
<heat> LITERALLY BROKEN
<zid> critical 13.0 stopping bug
<klange> this is why I only use outdated versions of gcc
<klange> definitely not because I'm too lazy to port my target support patches forward and rebuild, no sirree...
MiningMarsh has quit [Read error: Connection reset by peer]
<klange> You would think I'd have Toaru support mainlined by now in gcc+binutils...
mavhq has quit [Ping timeout: 252 seconds]
mavhq has joined #osdev
<klange> if no one else has bothered to try to upstream my patches, then I don't deserve to have them upstreamed, simple as that
<kof123> Matt|home: what is the topic that causes such a grave situation as busting out the crayons that professionals use?
MiningMarsh has joined #osdev
<geist> heat: huh interesting
<geist> would be nice if it was more opt in than that maybe
<geist> ie, a separate builtin to access the relaxed version
<heat> i feel the same way
<heat> the only way this works is if you were either completely careless in your cmpxchg usage, or if you do a proper test-and-test-and-set everywhere, in which case you don't need to pass -mrelax-cmpxchg-loop
<geist> and for something like your spinlock implementation you should either reimplement it per arch or already do the right thing
<heat> if you have test-and-test-and-set everywhere except, say, your spinlocks, your spinlocks get pessimized
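What heat means by a proper test-and-test-and-set, hand-rolled as a C sketch (names invented; roughly the shape the flag tries to retrofit onto plain atomic builtins):

    #include <stdatomic.h>
    #include <immintrin.h>          /* _mm_pause() */

    static void spin_lock(atomic_int *lock)
    {
        for (;;) {
            int expected = 0;
            /* the expensive lock cmpxchg, attempted optimistically first */
            if (atomic_compare_exchange_weak(lock, &expected, 1))
                return;
            /* the "test" part: spin on plain loads until it looks free,
               pausing so we don't hammer the owner's cache line */
            while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
                _mm_pause();
        }
    }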
<geist> might be interesting to see what it does with ARM
<geist> or is it x86 specific?
<heat> x86 as far as I can see
<heat> I didn't check the aarch64 options
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<geist> i guess it having the instruction in the name pretty much means it's x86
<zid> it's on the x86 manual page
<zid> which doesn't stop something similar existing for other arches of course, but maybe then it'd be on the generic page instead
<Matt|home> kof123 : paging, reading up on it now.
<geist> yah lots of other arches tend to call the same thing 'cas' for compare and swap
<geist> armv8.1 included. 68k called it cas too
<bslsk05> ​godbolt.org: Compiler Explorer
<geist> ah thought the switch was in gcc 13+
<heat> this doesn't seem entirely like a good idea
<heat> unless that specific cmpxchg gets a lot of contention
<heat> your fast path takes a load and a cmpxchg
<Matt|home> is there a modern CPU architecture that doesn't use virtual addressing/have an MMU? apart from the very small ones like what arduino uses
<Matt|home> i mean like an actual desktop-usage computer
<Matt|home> or is it ubiquitous
<geist> Matt|home: no. not for desktop or server class stuff
<heat> i dont think so
<Mutabah> Desktop? Not that I know of
<geist> it's ubiquitous
<Matt|home> do they all function the same?
<Mutabah> There are the ARM -M variants
<heat> define same
<geist> not precisely, but they arrive at the same thing
rwxr-xr-x has joined #osdev
<Matt|home> i.e. the kernel sets up the page tables and that's all you have to worry about?
<geist> no they dont work the same
<Matt|home> kk
<Matt|home> so x86 is different than the others
<heat> I T A N I U M
<geist> but they have the same result: a translation of fixed size pages from virtual to physical addresses
<heat> no x86 is very similar
<geist> Matt|home: yah that's not the same thing as x86 being different from the others
<Mutabah> x86/ARM are pretty similar in implementation
<geist> more like there are various patterns that some implementations follow. x86, arm, riscv are fairly similar
<Mutabah> (lots of little differences, but the same broad approach)
<geist> there are other strategies that other arches take.
<Mutabah> Compare to PPC where it's a software-managed TLB (iirc)
<heat> ITANIU
<heat> M
<geist> no, PPC uses a hash table, but there are software-managed things
<Matt|home> let me rephrase the question: if you were designing a kernel for two different architectures, is the leap between setting up paging very difficult to jump across or is it similar enough that it won't add six hours of reading material
vdamewood has joined #osdev
<klange> most of my mmu code between aarch64 and x86-64 is the same; enough that I should really consolidate it
<geist> Matt|home: very different. basically you need to abstract the whole mmu into architecturally dependent code
<Matt|home> ok
<Matt|home> thank you
<geist> and then abstract it out
<CompanionCube> geist: i believe newer ppc has an additional radix mode
<heat> see bsd pmap
<geist> but your questions are in the right direction: ignoring the details of how each arch does the translation, what is the set of features the translations support
<geist> that's what you design your overall api for, and your virtual memory system around
<geist> and in that case it's pretty much standardized on basically the same set of features, plus or minus some
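One possible shape for that arch-independent boundary, as a C sketch (function and type names invented; real kernels differ in the details):

    #include <stdint.h>
    #include <stddef.h>

    typedef uintptr_t vaddr_t;
    typedef uint64_t  paddr_t;
    typedef struct arch_aspace arch_aspace_t;  /* opaque per-arch page tables */

    /* the generic VM only ever talks to the MMU through calls like these */
    int  arch_mmu_map(arch_aspace_t *as, vaddr_t va, paddr_t pa,
                      size_t count, unsigned flags);
    int  arch_mmu_unmap(arch_aspace_t *as, vaddr_t va, size_t count);
    int  arch_mmu_protect(arch_aspace_t *as, vaddr_t va, size_t count,
                          unsigned flags);
    int  arch_mmu_query(arch_aspace_t *as, vaddr_t va,
                        paddr_t *pa, unsigned *flags);
    void arch_mmu_context_switch(arch_aspace_t *as);  /* CR3/TTBR/SATP load */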
<heat> fwiw linux always assumes there are page tables, even when there's no such thing
<geist> yes and linux is the weird outlier
<geist> that's highly page table centric and if you're not using page tables? (ie ppc) then too bad for you
<heat> i'm fairly sure they still maintain page tables
<heat> (on non pt architectures)
<geist> they do: you have to maintain it because it's what the upper VM does, but then at the arch level you end up translating it
<geist> into the PPC hash table, or in the case of the SW TLB you probably take a fault and then walk the upper level page table with software
<bslsk05> ​twitter: <itanium_guy> Once you think you understand all about MMUs because you could figure out how self-mapping works... move to the next level: ␤ On Itanium, the VHPT Walker memory accesses (the ones that fetch PTEs) themselves go through the MMU... meaning walking is done with Virtual addresses. 🤪
<heat> what
<geist> problem with trying to unify page table logic between x86, arm, and riscv is they seem to be pretty close, but the subtle details matter
<geist> so you possibly end up with huge piles of conditionals
<geist> heat: yep! i thought you had dug into it to figure that part out?
<heat> i don't remember this bit specifically
<geist> VHPT on ia64 is a weird beast. far as i know, aside from maybe VAX, there are no arches that do virtual page tables like that
<heat> i assume you need to insert the tlb entry for the vhpt manually
<geist> i think you do
<geist> or... you must always be able to handle a software tlb fault
<zid> wow, tick-tock-clock is now 0 A presses
<geist> and you can use that to insert the root of the VHPT
<geist> i think the idea is the sw tlb fault doesn't happen that often, so it's an acceptable outcome
<geist> i *think* in practice linux at least ended up deciding to not use the VHPT but it uses the Other Method, which i forget what it's called, but it's functionally similar to POWER/PPC
heat has quit [Remote host closed the connection]
<geist> but i also think you can configure it per each of the 8 regions (bits [63:61] of the virtual address) so you can choose where to use either method. i think
heat has joined #osdev
<heat> we didn't deserve itanium
<geist> note VAX actually does something kinda like this too: virtual space is statically carved up into 0-2GB (user) and 2GB+ is kernel. the kernel has a linear page table (just one flat thing, one entry per page) that lives in physical space
<geist> and each process has its own linear page table (0 ... length of page table) that lives in kernel space
<geist> so as the cpu is fetching the TLB for user pages it actually reads through the kernels virtual address space
<geist> so you can build sparse user page tables that way
<heat> isn't this very inefficient?
<geist> probably!
<heat> even just wrt memory usage
rwxr-xr-x has quit [Read error: Connection reset by peer]
<geist> not really. remember this was the era when processes acted more in the sbrk() style. started from around 0 and grew upwards
<geist> so since when you loaded the user page table you set the base address + length, you only allocated as much table as you needed, and grew it over time
<geist> and/or used the kernel page tables to provide a large sled of zero pages off the end of user
<geist> it is a kinda interesting problem to solve
<heat> good point
<heat> although they came up with mmap during VAX
<geist> also it was in an era when the cpu literally had 4MB *max* memory
<geist> though later vaxen got up to 128MB or so
<heat> hmmmmm
<geist> so i think the vast majority of processes were very small, and clustered around 0
<geist> i think the big mistake they made with VAX was page size was 512 bytes, which i think quickly became too small
<heat> through my shitty calculations I'm getting 2MiB just for the kernel's page table
<geist> but hey, 1977. was pretty sophisticated at the time
<geist> heat: depends on how big you made the kernel. you also sized the kernel based on how much space you think you needed
<geist> iirc in netbsd it's some percentage of total system memory
<geist> but functionally POWER/PPC has that problem too in a different way: you burn N bytes of physical memory to store the one page table, which must be physically contiguous at boot
<geist> so in that sense intel's radix tree when they added page tables was one of the few things they didn't outright copy from VAX. OTOH i dont think they invented it. other arches were doing that style too
<geist> between like 1977 (vax) and 1985 (i386)
<Matt|home> "in x86 the kernel will load itself somewhere near the beginning of memory but it'll map itself closer to the end of virtual space" <-- is there a reason for this? seems more convenient to have a 1 to 1 mapping imo
<heat> i guess doing something like this for the kernel would be pretty smart?
<heat> sparse page table covering the whole range
<heat> Matt|home, yes, abi stability
<Matt|home> what's that mean, just cuz everyone else does it you should too?
<heat> if you map yourself at 1MiB and a user program wants to load itself at 3MiB, it means that you can't grow over 2MiB or you broke the ABI
<geist> Matt|home: well more like they usually do it for a reason. so you figure out what the reason is
<geist> many times it's convenience and speed, or sometimes it's less obvious until later
<geist> but yes putting the kernel 'up high' on a 32bit system is basically standardized, for the reason heat described
[itchyjunk] has quit [Ping timeout: 260 seconds]
[_] has joined #osdev
* Matt|home slams his head on the desk
* Matt|home slams his head on the desk
* Matt|home slams his head on the desk
* heat slams his desk on the hand
Phytolizer has joined #osdev
Phytolizer has quit [Client Quit]
Phytolizer has joined #osdev
* klange smacks Matt|home around a bit with a large trout.
<Matt|home> thank you..
* Matt|home sticks a sharp pencil in his eye and pushes.. okay.
<Matt|home> "each process is given it's own individual page table" <--true or false..
Phytolizer has quit [Client Quit]
<heat> semi-true
<heat> some processes share page tables sometimes
<heat> but yes, 99% of processes have separate page tables
<Matt|home> e
<Matt|home> page directory
<Matt|home> directory not table
<Matt|home> see im already fucking stupid
<heat> (linux, and most other operating systems, have the possibility to create processes which share more or less stuff, one of those being the address space. but in practice most processes don't share anything)
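On Linux the knob heat is describing is clone(2)'s flag set; a minimal sketch (error handling elided):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdlib.h>

    static int worker(void *arg) { return 0; }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        /* CLONE_VM: the child shares our address space (thread-style);
           without it the child gets a copy-on-write duplicate (fork-style) */
        clone(worker, stack + 64 * 1024, CLONE_VM | SIGCHLD, NULL);
        return 0;
    }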
<heat> hm?
<Matt|home> ... if i paypal you fifty dollars will you walk me through this until i understand it..
<heat> i dont want $50
<Mutabah> Just keep asking here, people will help
<heat> mostly because we don't use USD here
<Matt|home> i assure you that wherever on earth you are 50 USD is worth more than 50 of whatever currency you use..
<Matt|home> thanks inflation..
<Matt|home> okay look.. so here's what im sussing out.
<Matt|home> for x86.. each process is given it's own unique page directory, according to my notes (presumably im talking about linux). so if linux has a hard limit on the number of processes, which iirc is 512, that means there are 512 unique page directory entries correct?
<heat> sure thing but we still don't use usd here
<heat> linux doesn't have a hard limit on the number of processes
<heat> no
<Matt|home> i can link you to a paper i am currently reading that says otherwise
<heat> if you have 512 processes, you'll have roughly 512 page directories
<heat> sure
<zid> unique bottom halves, like weird minotaurs
<Matt|home> https://tldp.org/LDP/tlk/kernel/processes.html <-- let me find the exact sentence
<bslsk05> ​tldp.org <no title>
<klange> That document is from 1999.
<heat> btw euro are more valuable than usd
<Matt|home> This means that the maximum number of processes in the system is limited by the size of the task vector; by default it has 512 entries. As processes are created, a new task_struct is allocated from system memory and added into the task vector. To make it easy to find, the current, running, process is pointed to by the current pointer.
<heat> this is not true
<Matt|home> great..
<Matt|home> more confusion..
<klange> That is very ancient information.
<heat> 1996-1999 David A Rusling
<kazinsal> holy moly that's an old piece of paper
<Matt|home> yeah it's a little frustrating when every document im trying to read up on a subject is apparently subject to being out of date with incorrect information :\
xenos1984 has quit [Read error: Connection reset by peer]
<kazinsal> "Linux is a moving target; this book is based upon the current, stable, 2.0.33 sources as those are what most individuals and companies are now using."
<heat> please ask questions
<kazinsal> mah goodness
<heat> LMAO
<Mutabah> Try the osdev wiki
<heat> i don't recommend the osdev wiki
<Mutabah> Also, linux does some questionable things in the name of speed
<heat> but it's still better than this
<Matt|home> yeah i give up. my brain just doesn't work. im going for a walk
<Mutabah> And if you want a tutorial - https://os.phil-opp.com/
<bslsk05> ​os.phil-opp.com: Writing an OS in Rust
<heat> okay
<Mutabah> Sure it's rust - but it's the newest/best OSDev tutorial out there
<kazinsal> the chapter on paging in that one is one of the better explanations I've seen of x86-64 paging
<heat> i should write a tutorial some day
<klange> I should write a book in the style of Tanenbaum's Minix book.
<heat> commentary on UNIX v6 but it's about Onyx
<heat> ... that would actually be an interesting book lol
xenos1984 has joined #osdev
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
[_] has quit [Read error: Connection reset by peer]
<zid> What do you think of my processes as centaurs theory though
<kof123> (hold on zid, i was already typing) i tend to practice that the way to eat elephants is one elephant at a time. if i dont understand something, working on another component is acceptable, just so long as some progress is being made on one of the elephants
<kof123> short answer yes
<kof123> long answer: take that "kabbalah of os" graphic or whatever it was, and make a picture puzzle book lol
<heat> zid, not horrible
genpaku has quit [Remote host closed the connection]
genpaku has joined #osdev
MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]
srjek|home has quit [Ping timeout: 256 seconds]
MiningMarsh has joined #osdev
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
bradd has joined #osdev
chartreuse has quit [Ping timeout: 260 seconds]
heat has quit [Ping timeout: 260 seconds]
bgs has joined #osdev
MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
bgs has quit [Remote host closed the connection]
eroux has quit [Ping timeout: 248 seconds]
MiningMarsh has joined #osdev
eroux has joined #osdev
Burgundy has joined #osdev
jjuran has quit [Ping timeout: 260 seconds]
jjuran has joined #osdev
jjuran has quit [Quit: Killing Colloquy first, before it kills me…]
bauen1 has quit [Ping timeout: 268 seconds]
jjuran has joined #osdev
Burgundy has quit [Ping timeout: 268 seconds]
<mrvn> Matt|home: Beware that linux didn't have threads. So when they added threads they did that by allowing processes to share an address space (and other namespaces). So it's all a bit confusing.
<mrvn> There are also schemes (not linux but generally) to (re)create page tables as needed. You have a fixed number of pages for page tables and you create them from the address space objects when processes page fault in an LRU fashion. Similar to swapping memory in/out you swap page tables.
bauen1 has joined #osdev
<Matt|home> thanks. i guess i just have trouble with abstract stuff
<Mutabah> Matt|home: https://os.phil-opp.com/paging-introduction/ - This is an excellent introduction to x86_64 paging
<bslsk05> ​os.phil-opp.com: Introduction to Paging | Writing an OS in Rust
<Matt|home> bookmarked thanks
<Mutabah> and to the concept in general
bauen1 has quit [Ping timeout: 260 seconds]
bauen1 has joined #osdev
joe9 has quit [Ping timeout: 260 seconds]
joe9 has joined #osdev
Burgundy has joined #osdev
bauen1 has quit [Ping timeout: 240 seconds]
bauen1 has joined #osdev
Burgundy has quit [Ping timeout: 240 seconds]
vdamewood has joined #osdev
vinleod has joined #osdev
SGautam has joined #osdev
vinleod is now known as vdamewood
vdamewood has quit [Killed (tantalum.libera.chat (Nickname regained by services))]
MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]
bauen1 has quit [Ping timeout: 268 seconds]
bauen1 has joined #osdev
bauen1 has quit [Ping timeout: 256 seconds]
bauen1 has joined #osdev
MiningMarsh has joined #osdev
bauen1 has quit [Ping timeout: 268 seconds]
bauen1 has joined #osdev
GeDaMo has joined #osdev
bauen1 has quit [Ping timeout: 256 seconds]
bauen1 has joined #osdev
SGautam has quit [Quit: Connection closed for inactivity]
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
gildasio has quit [Ping timeout: 255 seconds]
gxt has quit [Ping timeout: 255 seconds]
gxt has joined #osdev
bauen1 has quit [Ping timeout: 268 seconds]
bauen1 has joined #osdev
Burgundy has joined #osdev
jafarlihi has joined #osdev
<jafarlihi> How are you supposed to use gprof with app that gets OOMd? It never saves the gmon.out since it gets killed before finishing
<Mutabah> Solve the OOM first?
<jafarlihi> I need the profile info to know what to solve
<Mutabah> "cleanly" quit early?
<jafarlihi> Oh yeah, that might work, thanks
<kazinsal> if you know you get 5 minutes in before dying of OOM, quit cleanly after 4 minutes and check the memory info
bradd has quit [Ping timeout: 268 seconds]
bauen1 has quit [Ping timeout: 260 seconds]
bauen1 has joined #osdev
epony has quit [Quit: QUIT]
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
jafarlihi has quit [Quit: WeeChat 3.7.1]
epony has joined #osdev
bauen1 has quit [Ping timeout: 256 seconds]
bauen1 has joined #osdev
gxt has quit [Ping timeout: 255 seconds]
<mrvn> does valgrind survive an OOM kill?
bauen1 has quit [Ping timeout: 260 seconds]
gxt has joined #osdev
MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]
<zid> oh heat isn't here
<bslsk05> ​gcc.gnu.org: 107676 – Nonsensical docs for -mrelax-cmpxchg-loop
<mjg> lol title
<mjg> also this is bullshit
<mjg> pre-load *reduces* performance in the face of multiple threads doing the work
<mjg> and the intel doc linked in the commit does not corroborate what this person is saying
<mjg> while (__atomic_sub_fetch(&lock, 1, __ATOMIC_ACQUIRE) < 0) {
<mjg> do _mm_pause(); while (__atomic_load_n(&lock, __ATOMIC_ACQUIRE) != 1);
<mjg> }
<\Test_User> if you disable memory overcommit it can't be oom-killed
<mjg> it literally performs the atomic fucking op upfront
<zid> what is mm_pause btw
<mjg> presumably the pause instruction
<mjg> ye, it is
<bslsk05> ​www.felixcloutier.com: PAUSE — Spin Loop Hint
<mjg> oh?
<mjg> welp man
<mjg> not a smp person, are you
<zid> I'm not dumb enough to think I know enough to write my own high performance intrinsics
<mjg> the description above misses big part of the point of pause
<mjg> and that is to hopefully chill for long enough(tm) for whoever owns the lock to release it
<mjg> as in you try to mess with that cpu as little as possible
<zid> the one on the outside is to unpessimise?
<zid> try it, if it fails, do the pause+loop
<mjg> what
<mjg> in the linked bugzilla? that example is *wrong*
<zid> the one you just said
<zid> there is no one linked in the bugzilla
<mjg> the one i pasted does not have pause "outside" of anything
<zid> oh that's a clickable git SHA1, til
<zid> I didn't say pause outside
<mjg> > 14:42 < zid> the one on the outside is to unpessimise?
<mjg> wha'ts outside
<mjg> the atomic then?
<zid> yours is cmpxchg(); while(1){ pause(); cmpxchg(); }
<zid> roughly
<zid> which seems to my naive face like it might be better than while(1){ pause(); cmpxchg(); } in the case where the lock isn't contested
<zid> aka unpessimise
<mjg> for single-threaded, it is WAY faster to start with cmpxchg
<zid> yes, I assumed so
<mjg> but it also happens to be faster for the multithreaded case
<zid> yes, I assumed so
<mjg> ultimately perf is mostly affected by what you do should the initial attempt fail
<zid> This seems like a lot of words to say "yes"
<mjg> bare minimum is to pause + re-read and check if it appears free
<mjg> a notch higher than that is to speculatively back off
<mjg> the most common idea is to do exponential with an upper limit
<mjg> so 1 pause, then 2, 4 and so on
<mjg> this runs into problems of its own but it tends to be good enough(tm) at small (say < 50-ish threads) scale
<mjg> problems being potential starvation
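A C sketch of that capped exponential backoff (the cap of 64 and the _mm_pause unit are arbitrary choices, not mjg's numbers):

    #include <immintrin.h>

    static void backoff(unsigned *spins)     /* caller starts *spins at 1 */
    {
        for (unsigned i = 0; i < *spins; i++)
            _mm_pause();
        if (*spins < 64)                     /* upper limit, per the text */
            *spins <<= 1;                    /* 1 pause, then 2, 4, ... */
    }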
MiningMarsh has joined #osdev
<mjg> where is that fucking paper
bauen1 has joined #osdev
<mjg> zid: fundamental concepts remain applicable today
<zid> no
<zid> wtf
<zid> I just asked if it unpessimised because this isn't my field, and I trusted you
<zid> but you gave me 4 pages of bullshit waffle instead, I am not your bro heat
<mjg> :[
Burgundy has left #osdev [#osdev]
<mjg> ok, how about this then
<mjg> don't trust the gcc doc
<mjg> kthxitsall
<zid> I am not trying to trust it
<zid> I am trying to write a fucking description of it
<zid> "Unpessimise cmpxcgh loops by hoisting one outside of the loop, and insert a pause inside the loop to relax cpu power use" or something idk
<mrvn> That loop could also be written as while(1){ if (cmpxchg()) break; pause(); }
<mjg> that remains highly pessimal
<mrvn> it's all the same code just written differently.
<zid> as in, it's worse than not doing it
<zid> or as in, you're a pro and you can do better
<bslsk05> ​gcc.gnu.org: gcc.gnu.org Git - gcc.git/commit
<mjg> performs a pre-read
<mrvn> What some people do is to do a non-atomic check first
<mjg> and another read after the pause
<mrvn> mjg: the pasted code doesn't do a read, it does a write
<mjg> movl (%rdi), %ecx
<mjg> ..
<mjg> lock cmpxchgl %edx, (%rdi)
<mjg> it totally pre-reads the target
bauen1 has quit [Ping timeout: 268 seconds]
<mrvn> I'm talking about 14:38 < mjg> while (__atomic_sub_fetch(&lock, 1, __ATOMIC_ACQUIRE) < 0) {
<mjg> ok
<mjg> i was talking about the commit
<mjg> i did note the code from the paper which the comment references *does not do* what the commit claims
<mrvn> ok, so what are you trying to say? The commit is good? The commit references a bad pdf?
<mjg> the pdf is fine, at least the part referenced in the commit
<mjg> the commit *contradicts* the pdf and is wrong
<mrvn> mjg: Not really, if the pasted code was from the commit.
<zid> as in the broken chinese contradicts it, or the actual change to gcc contradicts it and the option is worse and useless?
<mrvn> from the pdf I mean
<mrvn> gcc does:
<mrvn> while (__atomic_sub_fetch(&lock, 1, __ATOMIC_ACQUIRE) < 0) { }
<mrvn> the pdf adds the pause loop
<zid> 'no' what?
<mjg> gcc rolls with a load
<mjg> movl (%rdi), %ecx
<zid> oh that was probably to mrvn
<mjg> > To relax above loop, GCC should first emit a normal load, check and jump to
<mjg> .L2 if cmpxchgl may fail.
<mrvn> mjg: that isn't the problem. It always does a lock cmpxchgl
<mjg> it only does after the initial load
<zid> [14:00] <mjg> the pdf is fine, at least the part referenced in the commit
<zid> [14:00] <mjg> the commit *contradicts* the pdf and is wrong
<zid> [14:01] <zid> as in the broken chinese contradicts it, or the actual change to gcc contradicts it and the option is worse and useless?
<mrvn> mjg: yes, but it always does it
<mjg> which *hurts* performance
bauen1 has joined #osdev
<mrvn> mjg: the original gcc code always does a "lock cmpxchgl", the pdf adds a "pause" loop when cmpxchgl fails, the patch in the commit adds the pause loop but also only does a "lock cmpxchgl" when the read says it will succeed
<mrvn> So to me it seems it is one step better than the pdf.
<mjg> i keep saying this load from the get go is pessimal
<mjg> before anything happens
<mjg> and is not what the pdf recommends either
* zid refuses to get drawn in
<mjg> it is pessimal *both* single and multithreaded
<zid> as in the broken chinese contradicts it, or the actual change to gcc contradicts it and the option is worse and useless?
<mjg> the option as implemented is bad, but should it move to the standard model a la what's seen in the pdf
<mjg> it would be fine
<mrvn> mjg: haeh? How would you avoid the read? The code does "x |= 1;". That's a read-modify-write. No way to not read.
<mjg> as is it may happen to help or slow things down
<mjg> movl %eax, %edx
<mjg> orl %esi, %edx
<mjg> the found value is *not* used when computing the vlaue to be set
<mjg> lock cmpxchgl %edx, (%rdi)
<mjg> uh, did not paste the initial read: movl (%rdi), %ecx
<mrvn> mjg: oh, in the patched code. That seems wrong.
<mrvn> cmpl %eax, %ecx <---- there it is used
<mrvn> but %eax doesn't seem to be initialized by that code.
<mjg> yes, it is used to skip cmpxchg
<mjg> i assumed %eax was initialized elsewhere
<mrvn> In the original (bad) gcc code %eax was initialized
<mjg> will have to write a toy sample later
gxt has quit [Ping timeout: 255 seconds]
<mrvn> "movl v(%rip), %eax" seem to have been lost
<mjg> oh, it is atomic_fetch_or et al, so they can't bts
<mjg> now i'm curious what clang is doing
<mrvn> yes: or, xor, and, nand as atomic read-modify-write. Can't avoid the read.
<mjg> will have to get back to it in 1h or so
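For context, a C rendition of the loop under discussion: x86 has no single instruction that ORs memory and hands back the whole old value (hence "they can't bts"), so __atomic_fetch_or has to be a cmpxchg loop, and the contested option adds a pause/re-read inside it. Sketch only; the real codegen is the asm quoted above.

    static unsigned fetch_or(unsigned *p, unsigned bit)
    {
        unsigned old = __atomic_load_n(p, __ATOMIC_RELAXED);  /* seed expected */
        while (!__atomic_compare_exchange_n(p, &old, old | bit, 1,
                                            __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
            ;   /* on failure, old is refreshed from memory and we retry;
                   -mrelax-cmpxchg-loop would pause and re-check here */
        return old;
    }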
gxt has joined #osdev
<zid> Thanks for the.. help? I guess? *cries*
bauen1 has quit [Ping timeout: 260 seconds]
bauen1 has joined #osdev
<mrvn> mjg: what I find more worrisome is that the patched code as shown is an infinite loop. If cmpxchgl returns "ne" then it tries again. Otherwise it does "rep nop" and tries again. Nowhere does it exit the loop.
<mjg> ye the code as pasted is definitely not what is normally generated
<mjg> i would hope anyway
<mrvn> I think the shown code is just a badly stitched together fragment of the generated code.
<mrvn> Should never pass the test cases as shown.
<mjg> agreed
<mrvn> check what gcc and clang actually generate as code now.
<mjg> 15:11 < mjg> now i'm curious what clang is doing
<mjg> 15:11 < mjg> will have to get back to it in 1h or so
<mrvn> "We are undercover tactical nuns."
<sbalmos> in the right hands, those wooden rulers can be deadly
bgs has joined #osdev
rwxr-xr-x has joined #osdev
<mjg> so, despite the name of the gcc opt it is apparently only of significance for the loops explicitly implemented by the compiler for atomic_fetch_* primitives
<mjg> for that particular usecase the spinlock-related reasoning does not apply
<mjg> failing cmpxchg grants you exclusive access to the cache line
<mjg> if the only thing you want to do is to slap some value into it and gtfo, pause *reduces* performance
<mjg> so tl;dr it is wrong, but for a different reason than i inititally thought
dennis95 has quit [Ping timeout: 255 seconds]
<mjg> as in the doc talks about spinlocks and i blindly assumed the patched code is spinlock-y in nature
<mjg> which it is not
<mrvn> huh? pause makes it so other threads can make progress while you are stuck in a loop
<mrvn> If you have contention the cache line will bounce around cores all the time using up 99.99% of the bandwidth
<mjg> dude
epony has quit [Ping timeout: 268 seconds]
<mjg> for a case like the above, where you want to slap a bit into it and leave, it is pause which *adds* bouncing
<mjg> because
<mjg> > 16:15 < mjg> failing cmpxchg grants you exclusive access to the cache line
<mjg> which you can immediately take advantage of
<mjg> if you pause and there is other traffic, you just lost the E status
<mrvn> you mean when you just want to set a bit you don't want to lose the cache line before you retry?
<mrvn> ok, that makes sense.
<mrvn> But consider the case of 64 cores all wanting to set a bit. They will all be failing and retrying N^2 times
<mrvn> what you would want is a random amount of delay between retries
<mjg> this is true on arch
<mjg> erm arm64
<mjg> it is not true on amd64
<mrvn> mjg: if 2 cores do cmpxchg don't you lose the E status to the second core immediately?
<mrvn> or does speculative execution happen and the pipeline sees you do the cmpxchg on the same address again and keeps the cache line locked?
<mrvn> It sounds like you are banking on the latter and then each core would succeed on the second try.
<mjg> you keep losing it as others cmpxchg to some degree, but there is apparent optimization concerning this in uarchs
<mjg> you get yourself a real-world loop which has to - say - inc/dec by 1 as long as before/after is not 0
<mjg> you slap pause() into it and performance goes down
MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]
<mrvn> With pause() the cache line definitely bounces.
dude12312414 has joined #osdev
<mrvn> but in a bandwidth friendly way. :)
<mrvn> I could definitely agree that the decision whether to pause or not pause is something that depends on more context. The lowlevel op can be used many ways and only some benefit from pause while others get hurt by it.
<mrvn> worse if you contradict the uarch optimization
<mjg> last time i benchmarked this specifically was with 80 threads
<mjg> but i don't remember the numbers, apart from a win from NOT pausing
<mrvn> the good example for pause is a lock where the pause lets other threads do their work and release the lock.
<mjg> that's right
<mjg> i confused myself with the gcc committer referencing a spinlock implementation when doing something unrelated
<mrvn> Or in general pause helps if other threads aren't doing cmpxchg at the same time.
<mrvn> slows down the cmpxchg but speeds up everything else.
<mrvn> But as with so many micro benchmarks the effect is probably less than the noise. You can certainly find enough examples for it slowing things and speeding up things.
<mrvn> At least the gcc thing is an option. You can use it where it helps and otherwise just not give the option.
<mjg> the effect is very real when the target object is refcounted in this manner for example and slapped a lot
<mjg> as in you will see it in real workloads
<mrvn> refcounted? If the uarch holds the E status then pause would slow things down.
<mjg> reference counted
<mrvn> the way I can see pause helping is when the refcount is increased and then decreased before the pause in the other thread completes, and therefore prevents the cache line from bouncing. You would have to hammer the refcount for that to happen. On the other hand if you have many threads accessing the refcount at the same time (e.g. woken up by a condition) but then leaving it alone then pause should hurt.
<mjg> it always hurts for this case on amd64
dennis95 has joined #osdev
<mrvn> I'm happy that I don't have threads or shared memory in my kernel and this scenario basically can't come up. At least not outside the microkernel IPC mechanism.
srjek|home has joined #osdev
nur has joined #osdev
MiningMarsh has joined #osdev
Terlisimo has quit [Quit: Connection reset by beer]
Terlisimo has joined #osdev
bgs has quit [Remote host closed the connection]
LostFrog is now known as PapaFrog
rwxr-xr-x has quit [Remote host closed the connection]
epony has joined #osdev
Dyskos has joined #osdev
heat has joined #osdev
<heat> linux just got rid of the red black tree in mm_struct
<heat> it's now a maple tree
<mrvn> What's a maple tree? Google only finds nature links.
<mrvn> (and a canadian flag)
<bslsk05> ​lwn.net: Introducing maple trees [LWN.net]
<bslsk05> ​'The Linux Maple Tree - Matthew Wilcox, Oracle' by The Linux Foundation (00:39:02)
<mrvn> reads like it's a B-tree with locks for inner nodes and RCU for leaves.
<mrvn> So theoretically nothing new but lots of fiddly bits to implement the API needed for VMAs.
<mrvn> interesting mix of locked and lockless if I'm reading it right.
netbsduser has joined #osdev
xenos1984 has quit [Ping timeout: 256 seconds]
xenos1984 has joined #osdev
dude12312414 has quit [Remote host closed the connection]
dude12312414 has joined #osdev
<vdamewood> Maple trees are for saps.
<zid> Still waiting on Elm Tree
<vdamewood> What a nightmare.
<sbalmos> at least it's not a sweet gum tree with those spiky ball seeds
Dyyskos has joined #osdev
Dyskos has quit [Ping timeout: 260 seconds]
gog has joined #osdev
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
Dyyskos has quit [Quit: Leaving]
poyking16 has joined #osdev
gog has quit [Quit: byee]
xenos1984 has quit [Ping timeout: 260 seconds]
gog has joined #osdev
eroux has quit [Ping timeout: 260 seconds]
eroux has joined #osdev
poyking16 has quit [Quit: WeeChat 3.6]
xenos1984 has joined #osdev
<geist> always thought that b-tree like structures should be more used in data structures like that
<mrvn> having nodes be a cache line or two makes a lot of sense.
<mrvn> 4 sounds like a lot. But you have to balance locking with cache lines I guess.
<gog> this reminds me, i was going to improve my virtual memory allocator
<gog> not with a b-tree, but it's a tree
<gog> just like everything else in my "kernel" it has an inconsistent interface
<GeDaMo> Consistently inconsistent :P
<gog> yes
<gog> that's me in a nutshell
<gog> but also i want to hide the implementation a little better
<mrvn> The address space allocator or the mapping code?
<gog> allocator
<gog> the mapping code needs help too
<gog> it's all a big mess
<gog> my life is too so
<mrvn> gog: I'm still just searching the page table. B-tree enough?
<gog> i use an rbtree with a key of (base, length)
<gog> so it's fairly fast to see if an address is allocated
<gog> the comparator just checks if there's any overlap
<gog> if there is, then it checks the successor node
<gog> there's some low-hanging optimizations in there too
<gog> probably another tree with the same keys of free ranges
<gog> i think geist said something about that months ago when i was initially implementing it
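A sketch of the comparator gog describes (struct and field names assumed): two ranges compare "equal" exactly when they overlap, so an ordinary rbtree lookup of (base, length) lands on any colliding region.

    #include <stdint.h>
    #include <stddef.h>

    struct vm_region {
        uintptr_t base;
        size_t    len;
        /* rb-tree linkage, flags, ... */
    };

    static int region_cmp(const struct vm_region *a, const struct vm_region *b)
    {
        if (a->base + a->len <= b->base) return -1;  /* a entirely below b */
        if (b->base + b->len <= a->base) return  1;  /* a entirely above b */
        return 0;                                    /* overlap: "equal" */
    }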
<mrvn> I have the address past the last allocation and then I just check the page table for the next gap of N+1 pages.
<gog> that works too, but when the address space gets fragmented it'll slow down over time
<gog> but you can then just keep lists of free regions of 2^n pages
<mrvn> Not really. In 64bit the size of the address space is so much bigger than the ram that barely anything will be used.
<gog> oh yeah
<geist> well, not so fast. you can pretty easily fragment the crap out of a 64bit aspace
<mrvn> First time you would even notice fragmentation is when you used up 512 exabytes of ram. That's a lot of alloc and free.
<geist> especially when, at least given current hardware, it's really more like 47 or 48
<geist> also consider: ASLR
<gog> yes
<gog> i was about to point out most 64 bit impls are 48 bits or fewer
<mrvn> geist: You can have a ton of 4k pages allocated but the remaining free space will be magnitudes larger.
<geist> no doy. but that's not fragmentation
<mrvn> If your pages are close together an alloc might have to skip over all of them but then it will reach a huge hole for the next few million allocs. If the pages are further apart it will often find a hole while skipping. So I think it balances.
<mrvn> Note: I don't have mmap so you can't spam the address space with lots of mapped pages.
<mrvn> geist: Would you make allocations return randomized addresses instead of going round-robin?
<geist> depends. personally i'd like the latter, even if doing ASLR (i think that's functionally what linux, etc does)
<geist> but in fuhsia we basically go full on random
<geist> so the aspace over time gets completely shotgunned with random allocations
<geist> for better or worse
<mrvn> The biggest drawback I've noticed is that my page tables can become big. You can leave 1 page allocated in a leaf and it will keep all 4 levels of the page tables locked in memory.
<gog> and that's when a fast way to look up (base, length) pairs comes in handy
<gog> if you've got holes everywhere in the aspace and you're just picking one at random
<mrvn> So 8-12k overhead per allocated page.
<geist> yup
<geist> though it's not exactly 8-12k per page
<geist> but more like 8-12k per allocation, since neighboring pages mostly take advantage of the same page tables, etc
<mrvn> gog: you can have as many holes as allocations. With 8 GB ram that's max 2 million holes in a 47bit address space.
<geist> but yeah 64bit is so much nicer here. lots less stuff to worry about re running out of space
<mrvn> geist: the IPC moves pages between address spaces so I get a ton of single pages moving around. If you have some allocations that you keep in between then you could easily end up with one page per leaf. But that's the worst case.
* geist nods
<geist> you could i guess reserve a chunk of the aspace for incoming IPC buffers and then maybe mitigate it a bit
<geist> ie, this 512GB or 1GB rgion is where they come in
<mrvn> But my intention is that the libc would allocate 2MB for the heap and not single pages. And 2MB allocs can skip ahead in the page table and use a level 3 entry for a big page.
<mrvn> I also have a used_for flag in alloc, like GFX, IPC, ... so I can indeed reserve chunks for different use cases.
* geist nods
<mrvn> I've added that for the RPi so gfx memory uses pages below 1GB that the VC can access.
<mrvn> reusing addresses for IPC could also avoid having to allocate page tables and to invalidate page table walks.
<mrvn> On the other hand I like having use-after-ipc to just fail because the page will be unmapped.
cyao has joined #osdev
<heat> geist, it's not only the fact that it's a btree. being able to use RCU is a big win
<heat> i've seen the mmap rwlock get hugged to death in a big server with >1000 threads
<heat> doing cat /proc/<pid>/maps would literally hang for seconds
<cyao> Hello, how do you implement the FILE type? is just a char* enough?
<heat> no
<heat> char * to what?
<cyao> To the file
<heat> what file?
<heat> what is a file and how is it a string
<cyao> like im just wanting to access a few small files
<cyao> read them from disk
<cyao> and access them
<gog> FILE is a complex structure
<heat> define file and how would it be a string
srjek|home has quit [Ping timeout: 268 seconds]
<cyao> so is just reading them and putting them in memory, then giving the function that needs the file the pointer to the memory good?
<gog> that's not really how it works
<heat> well, i mean
<heat> technically?
<gog> ok
<cyao> umm a file like a text file, plain text
<heat> are you doing this in the kernel?
<gog> so small files that are not sparse
<cyao> yes :P
<gog> you can just mmap those in one chunk into the address space
<heat> usually you create a VFS and read through that
<cyao> just trying to achieve file reading
<gog> and then FILE just contains some pointers and implementation details
<heat> kernels don't have FILE in the C standard library sense
<gog> yes
<heat> see linux's struct file for an equivalent-ish
<cyao> you have a link to linux's file?
<cyao> couldn't quite find it
<heat> elixir.bootlin.com
<heat> search there
<cyao> i only found the one for aarch
<cyao> okk thx
<heat> the correct UNIX-like VFS approach is to have struct file, which represents a file descriptor, struct inode, which represents a filesystem inode (where you do reads), and struct dentry, which represents a directory entry (or a directory itself in case the inode is a directory)
<heat> or <insert BSD struct names>
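A hedged C sketch of that trio (field names loosely after Linux, heavily trimmed; not any real kernel's definitions):

    #include <stdint.h>
    #include <sys/types.h>

    struct inode {                /* the filesystem object: reads/writes land here */
        uint64_t ino;
        ssize_t (*read)(struct inode *, void *, size_t, off_t);
        ssize_t (*write)(struct inode *, const void *, size_t, off_t);
        void    *fs_private;      /* per-filesystem data */
    };

    struct dentry {               /* a name -> inode edge in the tree */
        char           name[256];
        struct inode  *inode;
        struct dentry *parent;
    };

    struct file {                 /* one per open file descriptor */
        struct dentry *dentry;
        off_t          offset;    /* seek position lives here, not in the inode */
        unsigned       flags;     /* O_RDONLY and friends */
    };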
<cyao> I just searched in the site, and it told me that the file def is just like this: typedef struct FILE { char dummy[1]; } FILE;
<cyao> Am I looking at the right one?
<heat> i told you, it's struct file, not FILE
<bslsk05> ​elixir.bootlin.com: fs.h - include/linux/fs.h - Linux source code (v6.0.8) - Bootlin
<heat> thank you goggers
<gog> all caps is the stdio library, part of the c standard library
<gog> yw heaty
<heat> gog
<heat> gog
<heat> gog
<gog> yes my son
<cyao> Ahh thanks a lot!
<heat> BAZINGA
<heat> *laugh track*
<gog> bazooper
GeDaMo has quit [Quit: I'm going to show these people something you don't want them to see. I'm going to show them a world without you.]
<geist> also that whole FILE is just a char[1] is not correct at all
<geist> i have no idea where you found that, but that's not what a FILE is. but FILE in particular is a structure used in user space libcs to abstract a handle to a kernel file
<geist> so it really doesn't mean much in kernel space anyway
<gog> i think that's just the opaque pointer impl
<gog> the actual structure depends on the system
<gog> if they were looking at libc or smth
<gog> i'd have to open it up but it's almost dinner time
<geist> ah yeah probably true
<bslsk05> ​elixir.bootlin.com: stdio.h - tools/include/nolibc/stdio.h - Linux source code (v6.0.8) - Bootlin
<cyao> dunno why it's there
<geist> yeah that's just some user space wrapper thing
<geist> i do, but frankly it's not worth talking about
<geist> since it's just some minutiae around opaque user space handles, etc
<gog> yeah don't worry about stdio.h
<geist> anyway it's not what you're looking for
<cyao> okk
* gog waves her hand "this is not the struct you're looking for"
<cyao> woah how do you do the actions like this?
<cyao> im new to this irc thiggy
<cyao> *thingy
<\Test_User> /me <msg>; probably, client-dependent but practically always that
* cyao hello
<cyao> ahh thx
cyao has quit [Changing host]
cyao has joined #osdev
cyao has quit []
<heat> geist, you do?
<heat> oh wait, I see what it is, in context
<heat> just a dummy struct
<heat> and cuz empty structs are undefined in C, they add a dummy member
<mrvn> heat: and char dummy[] or char dummy[0] has compiler problems too on odd unix systems
<mrvn> Although I'm not sure if linux was ever able to compile with anything that's not gcc anyway.
<bslsk05> ​elixir.bootlin.com: compiler-intel.h - include/linux/compiler-intel.h - Linux source code (v6.0.8) - Bootlin
<heat> no idea if this works though
<heat> also clang if you count it as not gcc :)
<mrvn> heat: I don't. it implements (all) the gcc extensions.
<mrvn> (not for this case)
<heat> not quite all the gcc extensions
<heat> there has been some effort to get clangbuiltlinux (see clangbuiltlinux)
<mrvn> not really relevant for that FILE dummy struct
<mrvn> that's pre c99 stuff
<heat> most of linux still compiles in gnu89 mode
<heat> (although they pass -std=gnu11 iirc now)
<heat> the linter will hurt you if you mix declarations with code
<gog> clangux
<heat> bazingux
chartreuse has joined #osdev
<mrvn> Poll: "Jason X": Horror or Commedy?
<gog> heat: do you program visual entertainments
<gog> mrvn: both
<gog> i haven't actually seen it
<gog> if it's anything like jason goes to hell then it's both
<gog> over-the-top horror is also hilarious
<heat> what's a visual entertainment and why does that sound dirty
<gog> heat: bazinga
<heat> 😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂
<heat> 😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂
<zid> nice black boxes
nickster has joined #osdev
<zid> gog when are you teaching me how to write an OS
<zid> mj only cares about smp locking prims, heat only cares about football and emoji
<heat> i also care about smp locking primitives
<zid> only enough to stop mj raging
<heat> it's not just football and emoji
puck has quit [Excess Flood]
<gog> zid: i don't know how to write an os
<gog> i can write pieces of a kernel
puck has joined #osdev
<zid> that's fine, that's what I mean
<gog> oh
<zid> OS ends at kernel
<zid> I either need to write, a vfs, or I need to write an allocator
<gog> well step right up young man and welcome to gog's academy of partially-completed ideas
<zid> pick one
<zid> if you pick the former you have to write it without the latter ofc
<gog> well
<gog> i was just gonna say
<gog> how can you have a vfs if you don't have any ~~beans~~ allocator
<gog> i guess you could just statically allocate structures and only do operations one at a time and have to repopulate it every time you need to make a transaction from memory to fs
<gog> that sounds error-prone and also awesome
<gog> and vice versa
<gog> awesome and error-prone
<heat> write a slab allocator
<zid> eh just have struct blah n[MAX_FDS];
<gog> write a slub allocator
<zid> okay tell me
<heat> do you want the theory behind slab
<mjg> bonwick moment!
<gog> write a slob allocator
<heat> slob is shit and getting removed
<zid> I thought we already settled on slub
<gog> dang how is it going to allocate me then
<heat> slub is like slab but each partial slab is percpu
<mjg> write slyb
<gog> what if gog was one of us, just a slob like one of us
<zid> if I were some idiot you'd all be fawning over me trying to teach me shit
<zid> clearly you respect me too much to actually tell me things
<heat> hi idiot, im heat
<zid> you just wanna meme about sl?b
<gog> what am i supposed to teach you that you don't already know
<zid> I don't know anything about allocator design
<gog> don't you have a more advanced experiment than i do
<mjg> excrement
<zid> my kernel's 'allocator' so far is just a free list of pages
<gog> so is mine basically
<zid> so you can alloc I guess, you just can't free or do allocations that aren't 4k aligned
<gog> yeah
<gog> sounds familiar
<heat> ok so basically slab works in this way: you have caches, a cache being a collection of slabs (we'll get there) plus a ctor and dtor (optional, linux has stripped those out)
<mjg> don't let anyone gatekeep the term allocator!
<gog> if it allocates it's an allocator
<gog> doesn't have to free
<heat> a cache also has a name and an object size (so you create a cache for inodes, a cache for dentries, a cache for each kmalloc size class, etc)
<heat> a slab is a collection of objects, usually PAGE_SIZE'd, sometimes not (if the objects are individually too big, you allocate a higher order slab)
<zid> pool allocator with multiple pools okay
<gog> billiards allocator
<heat> slabs can be free (no objects being used), partial (some objs being used), free (no objs being used)
<heat> erm
<heat> *full (all objs being used)
<heat> basically on alloc you try a partial slab (if there is one), allocate
<heat> if you don't have partial slabs, you allocate a new slab and just grab an object from there
<zid> so your description of how an allocator works is step 1. Draw some rough lines. step 2. ??? step 3. Allocate.
<heat> each slab has a free list
<heat> so allocating is trivial
<heat> the advantages of this slab thing is that you have a lock per object and you *never* return slabs back to the page allocator unless you really need to (or if you got some heuristic to minimize slab memory usage)
<heat> s/object/object cache/
<heat> if you go down the shitty ctors-and-dtors route you also get some theoretical benefits when constructing objects but I personally believe it's a stupid weird myth and it doesn't gain you much
<zid> gog: This is why I asked you
<heat> such that linux has done away with those
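The walkthrough above, condensed into a C sketch (names invented; locking and slab reclaim omitted; the in-object freelist pointer is the layout trick heat mentions further down):

    #include <stddef.h>

    struct slab {
        struct slab *next;        /* link on the cache's partial/full lists */
        void        *freelist;    /* each free object's first word points at
                                     the next free object */
        unsigned     inuse;       /* live objects in this slab */
    };

    struct kmem_cache {
        const char  *name;        /* "inode", "dentry", "kmalloc-64", ... */
        size_t       objsize;
        unsigned     objs_per_slab;
        struct slab *partial;     /* slabs with some free objects */
        struct slab *full;        /* slabs with none */
    };

    struct slab *slab_create(struct kmem_cache *c);  /* assumed helper: grabs a
                                                        page and carves it up */

    void *kmem_cache_alloc(struct kmem_cache *c)
    {
        struct slab *s = c->partial;
        if (!s)                               /* no partial slab: make one */
            s = c->partial = slab_create(c);
        void *obj = s->freelist;              /* pop the head: trivial */
        s->freelist = *(void **)obj;
        if (++s->inuse == c->objs_per_slab) { /* became full, move it over */
            c->partial = s->next;
            s->next = c->full;
            c->full = s;
        }
        return obj;
    }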
<dminuoso> I think the thing that is often missed with slab is the rationale. There's usually way too much talk about finer implementation details, and lack of the rough idea..
<zid> I don't give a fuck about any of this
<mrvn> your slab can also have a move callback if objects are movable. Then you can defragment slabs to free more
<heat> ...
<zid> dminuoso: exactly
<heat> great waste of my time, thanks
<heat> dminuoso, I could go on about "caches" and constructing/destroying but I don't believe in any of that
<mrvn> dminuoso: slabs have 3 benefits: 1) fast and simple, 2) efficient because it's equal-size objects, 3) cache locality.
<clever> mrvn: palmos and the rpi firmware have a relocatable heap that isnt slab based, and doesnt use a callback either
<heat> I just believe in less lock contention, which is what you essentially get
<zid> "explain calculus" heat: "Well the derivative of speed is acceleration" dmi: "Calculating the area of a shape by cutting it into tiny slices and summing them"
<zid> guess which one is implementable
<mrvn> clever: most objects can't just be moved without fixing some pointers to them
<clever> mrvn: yeah, thats why the relocatable heap doesnt use pointers but handles, and you must lock the object to get its current addr
<zid> "ctors gives a benefit supposedly but linux got rid of them" is not
<heat> zid, I literally explained to you how a slab allocator works from top to bottom
<heat> what else do you want?
<mrvn> clever: that's just an object store, not a heap
<zid> ah so you're not even aware of the problem, no malice then
<clever> mrvn: once an object is locked, you get a physical address and it is contiguous in the physical space, that sounds heap-y to me?
<heat> what do you want to know about ctors?
<heat> time has proven them to be a shitty idea
<zid> Nothing, the question is why you even mentioned them
<zid> if you weren't going to explain them
<zid> you don't think you should so you didn't
<mrvn> clever: and once you put it back it can move around.
<zid> but what i want is to implement this, not to read a treatise on which parts are 'good'
<dminuoso> The gist of slab allocator is simple: rather than allocating pages on demand, you preallocate, and instead of unmapping you mark as free. This speeds up allocation in that you already have page table mappings. Furthermore, depending on what object type you want to store, you might group allocations together for a particular object type (say you need a THING_T often, so maybe you have a region in
<dminuoso> which you have a bunch of THING_T preallocated and perhaps even initialized, such that an allocation is just handing you a pointer - without finding a free page, mapping it, and initializing it.
<clever> mrvn: yep, once you unlock it (refcnt based), the kernel can move it to defrag the heap
<clever> mrvn: but i also recently learned, the rpi has a special flag when you unlock, to say the contents are not actually of importance
<clever> so it can skip the actual memcpy when moving
<zid> dminuoso: Nice. What's the strategy for actually allocating?
<dminuoso> zid: whatever you please, really
<mrvn> clever: so more a free than put back
<dminuoso> It's largely irrelevant
<zid> I still need an implementation of that though
<clever> mrvn: but the space is still reserved, and you will instantly get a range again next time you lock
<gog> zid: i respect you enough to know that you're not actually asking me for help
<zid> It's presumably fairly important that the actual you know, allocation bit of it works nicely, before you add a veneer to it
<heat> dminuoso, except that has proven to be largely irrelevant
<mrvn> zid: usually free THINGs are in a linked list and you just allocate by taking the head.
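Here is that free list as code, including the layout trick heat mentions further down: a free slot stores the next pointer inside the object's own storage, which a preconstructed (ctor'd) object would forbid. A sketch with invented names:

```c
#include <stddef.h>

struct thing { int data[8]; };

union slot {
    union slot  *next;   /* meaningful only while the slot is free */
    struct thing obj;    /* meaningful only while allocated */
};

static union slot *freelist;

static struct thing *alloc_thing(void)
{
    union slot *s = freelist;
    if (!s)
        return NULL;     /* out of slots: a real slab would grow here */
    freelist = s->next;  /* pop the head: O(1), no searching */
    return &s->obj;
}

static void free_thing(struct thing *t)
{
    union slot *s = (union slot *)t;
    s->next = freelist;  /* push back on the head */
    freelist = s;
}
```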
<heat> the caching part of the "object-caching kernel memory allocator" has been almost completely removed
<heat> what truly matters is the allocation algorithm - which I described in detail per bonwick 94
<mrvn> heat: how is ctor/dtor a bad idea?
<heat> you can't be sure in which context ctor() or dtor() is being called, and with which locks
<dminuoso> heat: Oh okay. I think there was a misunderstanding on my part. I read "actually allocating" as "page allocation"
<dminuoso> And I meant how that is done is largely irrelevant, and if it is relevant you will know and know how to address it
<dminuoso> (e.g. do you need physically contiguous pages or not)
<heat> it also stops you from optimizing the slab layout by sticking the *next inside the actual object
<heat> dminuoso, sure. although that's also not quite irrelevant
<dminuoso> heat: well its easily replaceable and improvable
<heat> getting ptr_to_slab in an efficient way is a good idea
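One common way to make ptr_to_slab cheap: give each slab natural alignment and put its metadata at the start, so the slab header is recoverable by masking the object address. A sketch; the names and the one-page slab size are assumptions (Linux gets the same answer by looking the page up in its page array instead, if memory serves):

```c
#include <stdint.h>

#define SLAB_SIZE 4096u   /* one page; also the slab's alignment */

struct slab {
    void *freelist;       /* plus object size, counts, ... */
};

static struct slab *ptr_to_slab(void *obj)
{
    /* works because every slab starts on a SLAB_SIZE boundary */
    return (struct slab *)((uintptr_t)obj & ~(uintptr_t)(SLAB_SIZE - 1));
}
```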
<mrvn> heat: depend on when you call ctor/dtor. If you do it on alloc/free, i.e. new/delete calls, then it's perfectly fine
<heat> mrvn, no, these ctor/dtor are called at slab allocation/destruction time
<mrvn> heat: when you pre-initialize then you are right about the next pointer
<dminuoso> I think there's something to be said about which parts are easy to refactor down the road, just to get a working implementation faster.
<heat> sure
<heat> but this aspect is relatively important for allocator performance
<heat> although after the vmem paper [bonwick 2001] it stops being important because the hotpath shifts significantly
<heat> well, stops being /super/ important
<mrvn> heat: calling ctor/dtor at slab construction time is a security problem too, you leak information across alloc/free
<heat> sure
<heat> (although that's a problem for every allocator)
<dminuoso> well, sometimes you explicitly want to cache it
<heat> I find this ctor pattern to be highly pessimal anyway
<dminuoso> so its more subtle to understand which cacheable construction artifacts are security sensitive and which ones are not
<heat> if you want to add all these cached objects to a list, you can't just lock once, add everything, unlock
<heat> you need to relock for every ctor() call
<dminuoso> other allocators could get away by just zeroing out everything, so your only worry is whether that happens at all
<heat> the original big linux issue with dtor is that you can't reliably know in which context you're calling it
<heat> you could call it from IRQ context, you could call it as part of normal irqs-enabled preemption-enabled operation
<heat> my big issue with it is that not only do I not have objects that require such expensive initialization/destruction, it stops me from being able to reuse freed object space, and doesn't play well with C++ object lifetimes at all, and the dtor issue too
<heat> you'll also realistically only have 2 or 3 caches in the kernel, and those should get hand optimized
<mrvn> If you have an object that needs an expensive ctor call to initialize then why don't you need that after free and allocating it again?
<mrvn> as for the context in which you call it: how is that a problem? Don't alloc/free objects in an IRQ that are not IRQ-safe, and vice versa.
<heat> because per bonwick objects that get freed are still in a valid state
<heat> how is that a problem? because dtor isn't called at free() time, but at page_free() time
<mrvn> the problem is that the state they are in is pretty random. So any user needs to re-init the object to get a consistent state.
<heat> if you're allocating in a particularly complicated context (irqs off, preemption off, whatever), dtor() can be called whenever
<heat> if you're running out of memory for instance
<heat> sure, it's pretty random, which is why you need to be careful not to
<bslsk05> grok.dragonflybsd.org: inode.c (revision 2e488f13) - OpenGrok cross reference for /linux/fs/inode.c
<heat> it's this kind of dubious usage that tells me ctor() isn't a good idea
<mrvn> I find state that survives across free/alloc rather odd
<heat> well, that's the gist of the "caching allocator" part
<mrvn> don't you call page_free in free when you have a fully unused slab and enough free objects?
<heat> no, you call page_free when memory gets tight
<mrvn> that would imply you have a list of all slabs sorted by some metric to find the best one to free from, and such.
<heat> you essentially gather empty slabs in your cache for $indefinite amount of time, per bonwick
<heat> ofc in the real world things are a bit different
<heat> meh, that's optional
<mrvn> I don't cache free objects. I cache stuff with information in it. :)
<dminuoso> If empty slabs are going to be the reason you are running out of memory, I think you have a very different problem
<mrvn> dminuoso: if your slab never shrinks on free then it will be the cause of oom
<heat> yeah it's a valid issue
<dminuoso> I guess it is a situational problem, but a very sporadic housecleaning will take care of it.
<mrvn> dminuoso: think inode/dentry cache. That gets huge till you run out of memory.
<heat> yeah, which is why you shrink on OOM
<dminuoso> It might not run oom on the basis of never shrinking, though.
<dminuoso> Getting OOM is if you continously allocate new slabs
<dminuoso> But merely not freeing wont continuously increase memory pressure
<mrvn> But on OOM you then need to free inodes/dentries so the slab can shrink at all
<heat> one can imagine particularly big slabs, like a kmalloc-32MB slab
<dminuoso> mrvn: right
<dminuoso> at the OOM threshold the small latency is fine anyway, chances are even slab freeing is not going to be enough anyway
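The shrink-on-OOM pass being discussed, sketched in C: empty slabs sit in the cache indefinitely and are only handed back to the page allocator under memory pressure, never from the free() path. All names are invented, and the slab header is assumed to live at the start of the page it describes:

```c
#include <stddef.h>

struct slab {
    struct slab *next;
    unsigned     in_use;     /* objects allocated from this slab */
};

struct cache {
    struct slab *empty;      /* slabs with in_use == 0, kept around */
};

extern void page_free(void *page);   /* your page allocator's free */

/* Called from the OOM / memory-pressure path. */
static size_t cache_shrink(struct cache *c)
{
    size_t freed = 0;
    while (c->empty) {
        struct slab *s = c->empty;
        c->empty = s->next;
        page_free(s);        /* header sits at the page start */
        freed++;
    }
    return freed;
}
```

Note this only reclaims slabs that are already empty; as mrvn points out next for inodes/dentries, the objects themselves have to be freed first before the slab can shrink at all.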
<mrvn> heat: is there a point of having slabs for objects > page size?
<dminuoso> it will at best only temporarily give you some breathing room
<heat> mrvn, yeah
<clever> mrvn: it reduces fragmentation, so you dont wind up with a pair of 16mb holes, and no 32mb hole
<heat> kmalloc is a big usage of that
<clever> the same thing the rpi's relocatable heap solves, but the rpi just moves things around after the fact
<mrvn> clever: irrelevant for 64bit consumer systems
<heat> slab is pretty much based on the fact that if you need it once, you'll probably need it again
<clever> mrvn: i think it depends on how you manage memory, if you map the entire physical ram to a contiguous range of virtual memory, then your hole fragmentation carries over
<clever> but if you dynamically change the kernel paging tables, you can assemble all of the holes on demand with the mmu
<mrvn> clever: that's then not using virtual memory in the kernel. Linux used to do that. Big problem.
<mrvn> You have an MMU. Use it.
<clever> ah, that explains why i thought linux did that
<clever> and 32bit linux with LPAE kinda needed to switch over
<clever> since you couldnt fit all ram
<heat> linux still does that
<heat> using virtual memory is slow
<heat> it allocates on top of the direct mapping
<clever> i can also see it being beneficial to map things twice
<clever> if you want a physically contiguous chunk of ram, allocate the pages with the PMM, and use the physical window in the virtual space
<clever> if you dont care about the physical view, allocate random pages with the PMM, map them somewhere, and its assembled
<clever> and adjust the protection bits as needed, so the mmu still does its job
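The "physical window" clever mentions, as code: with all of RAM mapped at one fixed virtual offset, converting between the two views is just arithmetic. The base address below is an arbitrary example, not any particular kernel's layout:

```c
#include <stdint.h>

#define DIRECT_MAP_BASE 0xffff800000000000ull   /* example offset */

static inline void *phys_to_virt(uint64_t paddr)
{
    return (void *)(DIRECT_MAP_BASE + paddr);
}

static inline uint64_t virt_to_phys(void *vaddr)
{
    return (uint64_t)(uintptr_t)vaddr - DIRECT_MAP_BASE;
}
```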
<heat> vmalloc allocates actually virtual memory, but you should only use it for big allocations, it's pretty expensive and the allocation sizes get page aligned
<clever> yeah, thats what i was thinking of
<heat> (and not quite virtual, everything is mapped and pinned)
<heat> you also have kvmalloc which tries kmalloc and falls back on vmalloc
<clever> ah, nice
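A sketch of the kvmalloc pattern heat describes: try the cheap physically-contiguous path first, and fall back to virtually-contiguous mappings for large or failed allocations. kmalloc/vmalloc here stand in for whatever your kernel provides, and the cutoff is an arbitrary example:

```c
#include <stddef.h>

extern void *kmalloc(size_t size);   /* fast, physically contiguous */
extern void *vmalloc(size_t size);   /* page-aligned, page tables touched */

#define KMALLOC_MAX (128u * 1024u)   /* example cutoff */

static void *kvmalloc_sketch(size_t size)
{
    if (size <= KMALLOC_MAX) {
        void *p = kmalloc(size);
        if (p)
            return p;
    }
    /* expensive path: per-page mappings get set up */
    return vmalloc(size);
}
```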
<mrvn> heat: if it's small you use a slab.
<heat> you may not be able to know if it's small though
<heat> imagine std::vector
<mrvn> heat: a vector is big and you probably want to reserve lots of space for it to grow.
<heat> a vector isn't necessarily big, and you don't want to reserve lots of space but rather 2^order (where order gets incremented when you run out of space)
<mrvn> heat: reserve, not allocate.
<heat> and you can't reserve $lots because, again, there's no virtual memory or on demand faulting
<mrvn> heat: a linux problem.
<mrvn> I have no problem reserving a big chunk of virtual address space and then later mapping physical pages for it as needed.
<clever> the last time i actively wrote linux drivers, it was for a 3d core
<clever> most of the allocations were small and fixed size
<clever> but the framebuffer and textures were large
<clever> and all of them had to be physically contiguous
<heat> this is not a linux problem
<heat> it's a $every_os problem
<mrvn> clever: so you should probably have a slab for the small stuff and kvmalloc the big stuff. But with physically contiguous it's a bit different.
<heat> on demand faulting isn't something you can just do
<bslsk05> ​github.com: v3d2/v3d2.c at master · cleverca22/v3d2 · GitHub
<mrvn> heat: I didn't say "on demand faulting". A std::vector knows when it resizes and can explicitly map pages as needed.
<mrvn> heat: you just don't want to move objects in the kernel so you need enough virtual address space for the vector to grow into.
<clever> mrvn: looks like i was using dma_alloc_coherent() and remap_pfn_range() to map it into userland
<heat> and how much would you reserve?
<heat> you can totally move objects, that's a non-issue
<clever> mrvn: if i was to rewrite it nowadays, i would use the new dmabuf framework to handle that half of the job
<heat> it will always be an amortized O(1) push_back
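heat's 2^order growth in miniature: capacity doubles whenever it runs out, so n pushes copy at most about 2n elements, which is the amortized O(1) push_back. A plain-C, userspace sketch with invented names; realloc is fine here only because int is trivially copyable, which is exactly the std::vector caveat above:

```c
#include <stdlib.h>

struct vec {
    int   *data;
    size_t len, cap;
};

static int vec_push(struct vec *v, int x)
{
    if (v->len == v->cap) {
        size_t ncap = v->cap ? v->cap * 2 : 1;   /* bump the order */
        int *nd = realloc(v->data, ncap * sizeof(*nd));
        if (!nd)
            return -1;
        v->data = nd;        /* note: the buffer may have moved */
        v->cap  = ncap;
    }
    v->data[v->len++] = x;
    return 0;
}
```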
<mrvn> heat: depends on the context. $memory can't be too wrong. As for moving, that invalidates all iterators (and pointers), and you either have to copy (which means using twice the old size in memory) or remap pages. Both are rather costly.
<heat> remapping doesn't work in standard C++
<mrvn> Again something you can't do in 32bit. In 64bit reserving 8GB address space on an 8GB ram RPi is no problem.
<mrvn> heat: depends on whether your object is trivially copyable.
<heat> in fact, userspace std::vector can't use realloc for this reason
<mrvn> heat: it can for trivially copyable.
<heat> I don't know if there's an optimization for that
<mrvn> not that I know of.
<heat> IIRC there wasn't last i checked
<heat> a solution for this would be to add a relocate() method
<mrvn> I'm missing a realloc_but_fail_if_you_have_to_move()
<mrvn> I guess in userspace the chance that realloc doesn't have to move is near zero unless you shrink.
<mrvn> heat: does std::vector realloc on shrink?
<heat> hmmmmmmmmmmmmmmmmmmmmmm, idk
<mrvn> I don't think C guarantees realloc + shrink won't copy.
<heat> i highly doubt C gives any guarantees on realloc at all
<mrvn> heat: it guarantees that the memory block from 0 to min(old_size, new_size) remains the same.
<heat> sure, except that
<mrvn> I would assume though that most OSes implement shrinking in realloc to not move the data.
<mrvn> But nothing about that in the manpage so I assume POSIX guarantees nothing there.
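What C does and doesn't promise about realloc, per the exchange above, in a small runnable sketch: the first min(old, new) bytes survive a successful resize, but even a shrink is allowed to move the block:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p = malloc(64);
    if (!p)
        return 1;
    strcpy(p, "contents survive the resize");

    uintptr_t old = (uintptr_t)p;      /* saved before p is invalidated */
    char *q = realloc(p, 32);          /* shrink: may or may not move */
    if (!q) {
        free(p);
        return 1;
    }

    printf("%s (moved: %s)\n", q, (uintptr_t)q == old ? "no" : "yes");
    free(q);
    return 0;
}
```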
<klange> That cyao person showed up in my channel at 1am asking similar questions, and then got mad when I wasn't there.
<heat> timezone moment
<gog> hi
scoobydoo_ has joined #osdev
scoobydoo has quit [Ping timeout: 260 seconds]
scoobydoo_ is now known as scoobydoo
hmmmm has quit [Ping timeout: 252 seconds]
srjek|home has joined #osdev
Celelibi has quit [Ping timeout: 246 seconds]
<heat>  ______________
<heat> |  fuck you    |
<heat>  \______ _____/
<heat>        \/
<heat>         .--.
<heat>        |o_o |
<heat>        |:_/ |
<heat>       //   \ \
<heat>      (|     | )
<heat>     /`\_   _/`\
<heat>     \___)=(___/
Burgundy has joined #osdev
<gog> :(
LittleFox has quit [Quit: ZNC 1.8.2+deb2+b1 - https://znc.in]
LittleFox has joined #osdev
<heat> tux has grown to be a bit of an asshole
<geist> awww that's mean
<heat> but he's smiling and everything
<geist> reminds me, the guy that drew the tux logo was at my university at the time
<geist> i remember it being somewhat of a deal
<Mutabah> heat: ... why'd you post that?
<bslsk05> www.reddit.com: [OC] jfchmotfsdynfetch - The MOST minimal fetch tool that fetches precisely NO information about your PC : linux
<heat> i find it very cute
<zid> wtf is up with my interwebs
<zid> did you do this heat
<heat> yes