<nikolapdp>
heat: how do you do physical memory allocation
Neo has quit [Ping timeout: 260 seconds]
<heat_>
imagine not having tab completion
heat_ is now known as heat
<nikolapdp>
kek
<heat>
nikolapdp, very generic question, please explain
<nikolapdp>
like how do you keep track of what physical pages you have allocated or not
<nikolapdp>
do you use a slab, or bitmap or whatever
<zid>
bitmap is life
<zid>
bitmap is love
<nikolapdp>
sure is zid
<zid>
bitmap of bitmaps
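For reference, one way to read "bitmap of bitmaps" is a frame bitmap plus a summary word saying which bitmap words are already full. This is only a minimal sketch, not anyone's actual allocator from the channel: the names `used`, `full`, `frame_alloc` and `frame_free` are made up, the sizes are arbitrary, and `__builtin_ctzll` assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

#define NR_WORDS  64                 /* arbitrary size for the sketch */
#define NR_FRAMES (NR_WORDS * 64)

static uint64_t used[NR_WORDS];      /* bit set = frame is allocated */
static uint64_t full;                /* bit i set = used[i] has no free frame */

/* Returns a free frame number, or -1 if every frame is taken. */
static long frame_alloc(void)
{
    if (full == UINT64_MAX)
        return -1;

    /* The summary word skips full words without scanning them. */
    unsigned w = (unsigned)__builtin_ctzll(~full);
    unsigned b = (unsigned)__builtin_ctzll(~used[w]);

    used[w] |= 1ull << b;
    if (used[w] == UINT64_MAX)
        full |= 1ull << w;

    return (long)w * 64 + b;
}

static void frame_free(long frame)
{
    assert(frame >= 0 && frame < NR_FRAMES);

    used[frame / 64] &= ~(1ull << (frame % 64));
    full &= ~(1ull << (frame / 64));  /* the word has a free bit again */
}
```

Allocation here always hands out the lowest free frame, so a freed frame is reused before any higher one.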
<heat>
i have two physical memory allocators
<heat>
my bootmem allocator works pre-buddy, it's basically a list of available ranges and reserved ranges, and you carve out memory from the available ranges
<heat>
it's a very simple thing
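A rough sketch of the "carve memory out of available ranges" idea follows. It is not heat's code: `struct range`, `bootmem_add` and `bootmem_alloc` are hypothetical names, the reserved-range list is left out, and alignment padding is simply thrown away to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 16                /* arbitrary limit for the sketch */

struct range {
    uint64_t start;
    uint64_t size;
};

static struct range avail[MAX_RANGES];
static size_t nr_avail;

/* Register a usable physical range (e.g. from the firmware memory map). */
static void bootmem_add(uint64_t start, uint64_t size)
{
    assert(nr_avail < MAX_RANGES);
    avail[nr_avail++] = (struct range){ start, size };
}

/*
 * Carve `size` bytes, aligned to `align` (a power of two), from the first
 * range that fits. Returns 0 on failure, so address 0 is never handed out.
 */
static uint64_t bootmem_alloc(uint64_t size, uint64_t align)
{
    for (size_t i = 0; i < nr_avail; i++) {
        struct range *r = &avail[i];
        uint64_t base = (r->start + align - 1) & ~(align - 1);
        uint64_t pad = base - r->start;

        if (pad > r->size || r->size - pad < size)
            continue;

        /* Shrink the range from the front; the padding is wasted. */
        r->start = base + size;
        r->size -= pad + size;
        return base;
    }
    return 0;
}
```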
<zid>
boros does a linked list cus it's boring and trivial
<heat>
my actual page allocator (when memory is "properly up" and I have struct page available) is a buddy allocator
<nikolapdp>
makes sense
<heat>
pages in the buddy allocator get marked PAGE_FLAG_BUDDY, the order is also stashed in the struct page; these two things help me coalesce pages
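To illustrate how a buddy flag plus a stashed order help with coalescing, here is a minimal sketch of the free path. `PAGE_FLAG_BUDDY` is the name from the message above; everything else (`struct page` layout, `pages[]`, `buddy_free`, `MAX_ORDER`) is assumed for illustration, and free-list handling is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER       4
#define NR_PAGES        (1u << MAX_ORDER)
#define PAGE_FLAG_BUDDY 1u

struct page {
    unsigned flags;
    unsigned order;                  /* valid only while PAGE_FLAG_BUDDY is set */
};

static struct page pages[NR_PAGES];

/*
 * Free the block of 2^order pages starting at pfn, merging with its buddy
 * for as long as the buddy is free and has the same order.
 * Returns the order of the resulting free block.
 */
static unsigned buddy_free(size_t pfn, unsigned order)
{
    assert((pfn & (((size_t)1 << order) - 1)) == 0);

    while (order < MAX_ORDER) {
        size_t buddy = pfn ^ ((size_t)1 << order);
        struct page *b = &pages[buddy];

        /* The flag says "free in the buddy allocator", the order says how big. */
        if (!(b->flags & PAGE_FLAG_BUDDY) || b->order != order)
            break;

        b->flags &= ~PAGE_FLAG_BUDDY;    /* buddy would leave its free list here */
        pfn &= ~((size_t)1 << order);    /* merged block starts at the lower half */
        order++;
    }

    pages[pfn].flags |= PAGE_FLAG_BUDDY;
    pages[pfn].order = order;
    return order;
}
```

Without the stashed order, a free buddy page could belong to a smaller or larger block and the merge check would not be safe.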
<heat>
then as a kind of "separate layer but not really" i have a percpu cache of order-0 pages
<heat>
oh, note that my buddy allocator is zone-based
<heat>
and technically-but-not-actually NUMA-node-based
<heat>
say, a node has memory for DMA32 and NORMAL (> 4GB)
zetef has quit [Remote host closed the connection]
<zid>
nikolapdp: When are you adding me a proper allocator to boros?
<heat>
what zone you prefer/use entirely depends on flags you pass the allocator
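A small sketch of flag-driven zone selection, under assumptions: the flag name `PAGE_ALLOC_DMA32`, the helpers, and the "fall back to a lower zone when the preferred one is empty" policy are not from the conversation. The fallback is borrowed from how Linux-style zoned allocators commonly behave.

```c
#include <assert.h>

enum zone { ZONE_DMA32, ZONE_NORMAL, NR_ZONES };

#define PAGE_ALLOC_DMA32 (1u << 0)   /* hypothetical: caller needs memory below 4GB */

static unsigned long zone_free_pages[NR_ZONES];

static void zone_add_pages(enum zone z, unsigned long n)
{
    assert(z < NR_ZONES);
    zone_free_pages[z] += n;
}

/*
 * Pick the zone to allocate from. Start at the highest zone the flags allow
 * and fall back to lower ones (assumed policy). Returns -1 if nothing fits.
 */
static int pick_zone(unsigned flags)
{
    int z = (flags & PAGE_ALLOC_DMA32) ? ZONE_DMA32 : ZONE_NORMAL;

    for (; z >= 0; z--) {
        if (zone_free_pages[z] > 0)
            return z;
    }
    return -1;
}
```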
<nikolapdp>
zid: when i am done writing my own os
<heat>
it also has some initial support for kicking off page reclamation
<zid>
you mean, after you're done reading honzuki
<nikolapdp>
people can do two things
<nikolapdp>
heat why is it numa but not really
<heat>
in practice besides the basic LRU shit (which i *still* don't have) i need memory compaction in order to reliably be able to get higher order pages
joe9 has quit [Quit: leaving]
<heat>
it's numa but not really because although I do have the beginnings of a struct page_node for each NUMA node, i don't instantiate any and the alloc_page() interface does not support specifying numa nodes
<heat>
nor is slab numa-aware, nor is anything else
<heat>
and i cant be arsed because i don't have numa hardware, so it'd be pretty hard to test nonetheless
<heat>
even if i tried to add numa
<nikolapdp>
lol fair enough
Neo has joined #osdev
jack_rabbit has quit [Read error: Connection reset by peer]
jack_rabbit has joined #osdev
Shaddox404 is now known as Shaddox_AFK
<heat>
geist, have you seen Svvptc?
<heat>
it works around the need for the "redundant" sfence.vma when mapping in a page fault
<heat>
it makes stores to PTEs that set V happen-before an sret or mret
<geist>
i'm not sure i've ever built it either, need to do it
<geist>
also hmm, which one is svvptc....
<geist>
ah yeah, dunno. no haven't fiddled with it
<heat>
yeah its very new
<heat>
it wasn't ratified yet tho
gareppa has joined #osdev
<geist>
does remind me i should look at the oh what is it extension (i have a whole spreadsheet at work with a list of extensions, but i'm on my personal computer right now)
<geist>
it's the one that lets you split MMU flushes into separate flush and sync instructions
<geist>
ie, like arm
<geist>
that extension is starting to show up on things
<heat>
the Owhatisit extension?
<geist>
Svinval
<geist>
qemu will emulate it but i'm sure it makes no difference at all, probably treats the sync as a nop
<heat>
i have no idea how one is supposed to support all these extensions and differing code paths
gareppa has quit [Client Quit]
<heat>
this looks like opengl extension hell, but architecture
<geist>
well, in general you start adding global bools and either test at the place, start code patching, or have different virtual functions
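Two of those options (a global bool tested at the call site, and an indirect call picked once at boot) could look roughly like this. It is a sketch with made-up names (`have_svinval`, `tlb_flush_page`, `cpu_features_init`), and the functions only report which path they would take instead of issuing real instructions; code patching is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum flush_path { FLUSH_SFENCE, FLUSH_SVINVAL };

/* Hypothetical flag, set once at boot from whatever the platform reports. */
static bool have_svinval;

/* Option 1: test the global bool at the place it matters. */
static enum flush_path tlb_flush_page_bool(void)
{
    if (have_svinval)
        return FLUSH_SVINVAL;        /* would issue the split invalidate + sync sequence */
    return FLUSH_SFENCE;             /* would issue a plain sfence.vma */
}

/* Option 2: choose an implementation once, then call through a pointer. */
static enum flush_path flush_sfence(void)  { return FLUSH_SFENCE; }
static enum flush_path flush_svinval(void) { return FLUSH_SVINVAL; }

static enum flush_path (*tlb_flush_page)(void) = flush_sfence;

static void cpu_features_init(bool svinval_present)
{
    have_svinval = svinval_present;
    tlb_flush_page = svinval_present ? flush_svinval : flush_sfence;
    assert(tlb_flush_page != NULL);
}
```

The bool costs a branch at every call site; the pointer costs an indirect call, which is part of why kernels sometimes patch the code at boot instead.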
<geist>
at this point it's nothing like supporting a bunch of stuff on arm64 post v8.0
<heat>
is arm64 worse?
<geist>
well, now that it's up through 8.7 and whatnot there are a *ton* of details that you may want to conditionalize on in the kernel
<geist>
behavioral stuff
<geist>
feature bits that change this or that
<geist>
it's the behavioral ones i find to be more annoying, where based on feature X if you set bit Y now you need to do sequence Z instead of W
<geist>
though as is usual most are optional, so you can pick and choose
<geist>
like say you dont need to use x2apic vs apic, that kinda stuff
<heat>
i guess with riscv it's *usually* like that too
<heat>
except with Zicbom, that's really annoying
<geist>
part of the problem is so far a few of the extensions are not opt out. like if this is present you must deal with it
<geist>
there's a new extension for precisely that problem actually, but i haven't seen it in place yet
<geist>
lets you turn features off
<heat>
the riscv platform spec(i think?) says something like "cache coherency is not a problem and is expected for UNIX kernels. if cache maintenance is required, there will be an extension present for it"
<geist>
whereas x86 and arm are worried enough about forward compatibility that they almost always hide new things behind some sort of opt in bit
<heat>
which sounds /ok/, but you don't know if there's a cache extension present, except if you support it
<heat>
and if you don't... silent breakage all around
<geist>
yah
<heat>
e.g there's Zicbom, but there's also a Theadcmo or something like that
<geist>
stuff like whether or not the cpu writes back to the A/D bit: you cannot opt out of that
<geist>
it either does it or not, and you must deal with both paths
<geist>
that one's the most annoying to me personally so far. i'd just as soon have it fall back to exceptions and then if needed write code to scan later, but in this case you dont have a choice
<heat>
i had to deal with the zicbom problem personally, and it was the most annoying shit ever
<heat>
because the EDK2 people want to half ass it and deal with the real problems later
<heat>
and i don't quite understand the device <-> cache coherency problem well enough to really be an authority on it
<geist>
yah
Shaddox404 is now known as Shaddox_AFK
<geist>
yah added zicbom to zircon recently
Shaddox_AFK is now known as Shaddox404
<geist>
what's making the extension explosion not get out of hand is the RVA stuff which defines these baselines and mandatory extensions
<geist>
so for the most part if you follow along there and pick up the mandatory bits as the RVAs roll forward. RV... uh, what is the A
<geist>
oh profiles. A is for application stuff i think
<bslsk05>
github.com: riscv-profiles/rvm23-profile.adoc at main · riscv/riscv-profiles · GitHub
<geist>
not sure it's ratified yet, but there is a microcontroller version
antranigv has joined #osdev
<heat>
actually, now that you're here geist: when do you need to maintain cache coherency explicitly?
<heat>
i know there's a device tree property for it
Left_Turn has joined #osdev
<geist>
between cpus or between cpus and devices?
<heat>
does it depend on the device? the platform? both? the architecture? all of em?
<heat>
cpus and devices
<geist>
yes
<geist>
i dont actually know if there's a device tree thing that says if it's coherent or not
<geist>
so for example the sifive hifive and visionfive class socs *are* coherent, which is why there really isn't any cache flushing you have to do
<heat>
there's dma-coherent and dma-noncoherent
<geist>
theres some sort of front port AXI bus that if you run your bus mastering dma device through it, the cpu gets to snoop the transfers
<geist>
yah and it makes it dma coherent, and thus you dont really need to manually flush anything. basically like x86
<geist>
note this is independent of i&d cache coherency. riscv and arm (and most other arches) you have to manually sync data there, but that's known
<heat>
where's it stated "this architecture is coherent by default"
<geist>
it does not
<geist>
it quite explicitly does not state it at all
<heat>
because the device tree spec states:
<heat>
"For architectures which are by default non-coherent for I/O, the dma-coherent property is used ..."
<heat>
and vice-versa for the dma-noncoherent
<heat>
so... how tf do you guess?
<heat>
i'm assuming the device tree spec reflects reality
<geist>
right. it quite possibly is Just Known, or it may be stated that you must assume it's non coherent unless specified elsewhere
<geist>
depends. which spec are you reading? if it's the original spec it probably hasn't been updated in 20 years
netbsduser has quit [Remote host closed the connection]
<geist>
but if you read the arm and riscv spec it may be stated somewhere that it's non coherent by default
<geist>
i just cant tell you if/where that is
netbsduser has joined #osdev
<geist>
however since i know it is that way because that's how it is, i dont particularly need to find it
<geist>
i think what makes it more confusing is except for very high end server chips, any given ARM device is almost certainly non-dma-coherent, so it's sort of the default state: non coherent unless proven otherwise
<geist>
and if you over flush stuff you're just wasting time, but it's otherwise harmless
<geist>
on riscv it seems a lot of the initial cpu clusters (by sifive in general) *are* fully coherent, so it means a lot of initial code can forget about it, and then as more cores come out that are not, it gets much more messy
<geist>
i've never heard of one that doesnt, so i can't say why
<geist>
possible there were some existing cores that dont, so this is a recommendation to try to claim it back
<geist>
this profile stuff seems to be a real attempt across the riscv world to make some order out of chaos
<geist>
to make things a little more confusing, it's possible for machine mode (SBI in particular) to trap and emulate instructions transparently for you, so they may, for example, just nop something it doesn't understand
<geist>
in that case it's not the cpus fault, but a firmware issue. from the app developer point of view it may appear as if nothing was raised
<geist>
not saying thats the thing, but possible something like that exists somewhere and this is an attempt to 'please dont do that again'
gbowne1 has joined #osdev
<geist>
i'd like to tell you some of the real world riscv mess i've had to deal with over the last 6 months but i can't, but precisely this sort of nonsense does exist right now
<geist>
but usual 'bring up on <thing> which is weird and nonstandard'
<kof123>
or all the feature bits sounds like "The Thirty-Million Line Problem" .....which he argued e.g. for x86 (expansion hardware-wise, not just cpu), that was what brought innovation... this is just to say, because it is new stuff, the dust hasn't settled yet?
<geist>
yeah but it also treats the page table as a first class citizen
<geist>
as an upper data structure, much to the chagrin to any arch that doesn't match that model
<geist>
it's a fundamental design decision that has immense ramifications
gog has quit [Quit: byee]
* heat
nods
<heat>
i.. i don't know, this is hard to think about
<heat>
i feel like the linux page table model came up as an accident, but a happy accident because by exposing them as a first-class citizen it ends up allowing for really fast "hacks" so to speak
<heat>
it was just a "look haha my hobby system can map pages" that evolved into pgd/p4d/pud/pmd/pte go brrrrrrr
antranigv has joined #osdev
<geist>
yah
<geist>
but then it also is only a win on architectures where it lines up. ie, x86
<heat>
linux generally throws away every other kernel's very pretty abstractions that map out very nicely on a whiteboard
<heat>
and it ends up winning out because of that
<nikolapdp>
geist where does it not line up
<heat>
sun engineering ethos vs LINUX HACKER GPL!!!!
<nikolapdp>
kek
<geist>
nikolapdp: POWER/PPC comes to mind. or itanium
<heat>
ppc, itanium, sparc
<geist>
or arches that take explicit TLB misses, or even arm32
<nortti>
0
<heat>
zero
<geist>
iirc arm32 has some funny dual page table thing, where for every high level page table there's a second one
<nikolapdp>
so a bunch
<heat>
yes
<geist>
yah but, if you notice all the modern ones basically copy x86
<geist>
because they know where the bread is buttered
<geist>
(not that x86 invented that strategy of page table)
<heat>
fwiw windows also follows this idea somehow
<geist>
yeah
<geist>
prototype page tables, etc
<CompanionCube>
the new POWER versions have more conventional page tables, don't they?
<heat>
gosh linux was a fucking accident wasn't it
<nikolapdp>
absolutely
<nikolapdp>
just at the right place at the right time
<heat>
the UNIX people generally have some sort of disdain for linux's abstractions
<heat>
i guess this is what dave cutler talked about all along
<heat>
the UNIX phds
<heat>
vs the Linux... unemployed BSc's?
bauen1 has joined #osdev
<heat>
vs the OpenVMS demigods of course
<nikolapdp>
and who won out :)
<heat>
IBM AIX
Matt|home has quit [Quit: Leaving]
<nikolapdp>
SOLARIS
<heat>
it's remarkably funny to read the svr4 internals book and see them justify the vnode as the end-all be-all of VFS's everywhere, but then when it comes to block devices and other special files, the vnode shits itself and needs a separate special filesystem to proxy
<nikolapdp>
kek
<heat>
whereas the linux jank has 3 separate structs with 3 separate method table structs
<heat>
but everything Just Works(tm)
<nikolapdp>
good enough(tm) always wins
<heat>
yeah, the jank is there for a reason
<nikolapdp>
if it works it ain't stupid i guess
<heat>
there's a lot of stupid stuff *and* stuff that seems stupid but isn't
<heat>
like struct page is really stupid and amazingly overloaded, but it's also the smallest of all the struct pages in UNIX
<nikolapdp>
lol
<heat>
there's a really great hairy trick in struct page: the mapcount field is biased to -1 (so 0 maps = -1 in mapcount)
<heat>
this means that it's trivial and OPTIMAL to detect state transitions between mapped and unmapped
<heat>
unmapped -> mapped = overflow to 0, mapped -> unmapped = underflow to 0xffffffff
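The biased counter can be sketched with C11 atomics as below. This is an illustration of the trick, not Linux's actual code: the `struct page` here has a single field and the helper names are invented. The point is that "result is zero" after an increment and "result is negative" after a decrement are exactly the two state transitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct page {
    atomic_int mapcount;             /* number of mappings minus one; -1 = unmapped */
};

static void page_init(struct page *p)
{
    atomic_init(&p->mapcount, -1);
}

/* Returns true if this was the unmapped -> mapped transition (-1 -> 0). */
static bool page_add_map(struct page *p)
{
    return atomic_fetch_add(&p->mapcount, 1) + 1 == 0;
}

/* Returns true if this was the mapped -> unmapped transition (0 -> -1). */
static bool page_remove_map(struct page *p)
{
    return atomic_fetch_sub(&p->mapcount, 1) - 1 < 0;
}
```

On most CPUs the atomic add already sets the zero and sign flags, so the transition check needs no separate compare against the old value.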
<nikolar>
Interesting
<nikolar>
And very hacky
<heat>
this is not a story the sun engineering department would tell you