<nick64>
A few examples of when CPL changes: 1. when you make a syscall, 2. when a syscall returns. Similarly, when does IOPL change?
<Mutabah>
It _can_ change on any of those - anything that can load EFLAGS could change it
<Mutabah>
(although, I suspect that if the modifying code is not CPL=0, it'll GPF if the bits change)
<nick64>
I don't think I was able to convey what I am trying to ask here
scaleww has joined #osdev
<nick64>
What is a functional use case where it changes
<Mutabah>
The simple answer is "whenever you [the kernel developer] want to change it"
<froggey>
are you asking when you might want to change it?
<nick64>
froggey: exactly
<nick64>
What is an example scenario where it changes in the real world
<Mutabah>
"basically never" is my impression
<froggey>
never
<froggey>
right
<nick64>
Why is it a thing then to be able to change it?
krychu has quit [Quit: ZNC 1.8.2+deb1+bionic2 - https://znc.in]
<Mutabah>
There's a lot of legacy stuff in x86
<froggey>
linux had an iopl function for changing it per-process. xorg uses it to get access to hardware io ports
<Mutabah>
that said - I think there's a syscall on linux that allows you to modify the IOPL such that userland (for a single process, or even thread) can use the IO instructions
<Mutabah>
I don't know off the top of my head how it interacts with the IO permissions bitmap
<froggey>
but because iopl=3 also enables use of the cli/sti/hlt instructions, the implementation has changed to use the tss io bitmap instead
<Mutabah>
^ that makes sense (that the IOPB passing is ORed with the IOPL passing)
<Mutabah>
Much finer grained
<froggey>
yeah
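For concreteness, a minimal userspace sketch of the two Linux mechanisms froggey mentions (the coarse iopl() vs the per-port TSS bitmap via ioperm()); it needs root or CAP_SYS_RAWIO at runtime, and COM1 at 0x3f8 is just an example port:

    /* Minimal sketch: how a privileged Linux process asks for port access.
     * Requires root or CAP_SYS_RAWIO; port 0x3f8 (COM1) is just an example. */
    #include <stdio.h>
    #include <sys/io.h>     /* iopl(), ioperm(), inb() (x86 only) */

    int main(void)
    {
        /* Coarse: raise IOPL for the whole process (historically this also
         * unmasked cli/sti/hlt, which is why the bitmap approach won). */
        if (iopl(3) == 0) {
            unsigned char lsr = inb(0x3f8 + 5);
            printf("COM1 LSR = %#x\n", lsr);
            iopl(0);
            return 0;
        }

        /* Finer grained: ask for just the 8 COM1 ports via the TSS I/O bitmap. */
        if (ioperm(0x3f8, 8, 1) == 0) {
            unsigned char lsr = inb(0x3f8 + 5);
            printf("COM1 LSR = %#x\n", lsr);
            ioperm(0x3f8, 8, 0);
            return 0;
        }

        perror("iopl/ioperm");
        return 1;
    }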
* nick64
is processing
<froggey>
it's a weird legacy feature basically, like rings 1 & 2
<nick64>
I see
<Mutabah>
For reference - Intel's manuals have a decent description in the EFLAGS register section
<nick64>
Yeah I was reading that earlier and couldn't wrap my head around why that is required
<nick64>
Like, if there is a way for userland to request its IOPL to be downgraded to match the CPL, why have IOPL checks in the first place? If the kernel does some sort of privilege check (maybe the UID of the user thread?) during the IOPL downgrade, then it could perhaps do the same check at IO port access time itself, and not maintain an IOPL at all
<Mutabah>
There's a lot in x86 that is left over from experimentation in the transition to 32-bit
<nick64>
Yeah, when you mentioned it is only for legacy, I abandoned that train of thought
k8yun has quit [Quit: Leaving]
elastic_dog has quit [Ping timeout: 244 seconds]
elastic_dog has joined #osdev
roan has joined #osdev
elastic_dog has quit [Ping timeout: 260 seconds]
<nick64>
Correcting a flaw in my thought there. It would be the CPU that does the access checks and not the kernel, and CPU is not UID aware
<Mutabah>
Yep
<Mutabah>
The CPU only knows about the special CPU registers. The kernel checks the UID/permissions before doing something in a syscall
<nick64>
Upon further reading, it looks like CAP_SYS_RAWIO is the deciding factor used by the kernel
<Mutabah>
so the `iopl` syscall will check permissions before actually setting IOPL
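Roughly what that kernel-side check looks like; this is an illustrative sketch, not Linux's actual implementation (current_iopl() and set_saved_user_eflags_iopl() are made-up helper names):

    /* Rough sketch of the kernel side: the permission check is a capability
     * check done by the kernel; the CPU only ever sees the resulting IOPL
     * bits in the saved EFLAGS.  Helper names here are invented. */
    long sys_iopl(unsigned int level)
    {
        if (level > 3)
            return -EINVAL;
        /* raising IOPL needs privilege; lowering it does not */
        if (level > current_iopl() && !capable(CAP_SYS_RAWIO))
            return -EPERM;
        /* patch the IOPL field (bits 12-13) of the saved user EFLAGS so it
         * takes effect when the syscall returns to userspace */
        set_saved_user_eflags_iopl(level);
        return 0;
    }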
epony has quit [Remote host closed the connection]
epony has joined #osdev
elastic_dog has joined #osdev
elderK has joined #osdev
zaquest has quit [Remote host closed the connection]
zaquest has joined #osdev
scaleww has quit [Quit: Leaving]
<ddevault>
✓ multithreading
<ddevault>
complete userspace implementation clocks in at about 250 lines of code
gxt has quit [Ping timeout: 258 seconds]
freakazoid333 has quit [Ping timeout: 244 seconds]
gxt has joined #osdev
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
frkzoid has joined #osdev
lkurusa has joined #osdev
elastic_dog has quit [Ping timeout: 246 seconds]
bauen1 has quit [Ping timeout: 244 seconds]
bauen1 has joined #osdev
frkzoid has quit [Ping timeout: 250 seconds]
elastic_dog has joined #osdev
elderK has quit [Quit: Connection closed for inactivity]
<ddevault>
anyone have any experience reading power details (e.g. battery capacity) from ACPI?
<ddevault>
ah, found the relevant parts of the spec
Starfoxxes has quit [Ping timeout: 265 seconds]
Starfoxxes has joined #osdev
gildasio has quit [Remote host closed the connection]
xenos1984 has quit [Read error: Connection reset by peer]
gildasio has joined #osdev
freakazoid332 has joined #osdev
terminalpusher has joined #osdev
puck has quit [Excess Flood]
puck has joined #osdev
xenos1984 has joined #osdev
puck has quit [Excess Flood]
puck has joined #osdev
heat has joined #osdev
<heat>
all your operating systems are shit
<heat>
mine is the best
<heat>
and this is straight fax
<Ermine>
no u
<heat>
shut up poopy head
<zid>
fax is dead, long live chain emails
<zid>
if you don't forward this irc message to 20 people, you owe me $10
<heat>
zid, send me your credit card info pls
<heat>
it's for a survey
<heat>
i give u 10 doller after yes?
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
<zid>
z*dsoft+p_ypal@gmail.com
<zid>
cvv is 386
<zid>
ccv?
<heat>
I know you're lying to me
<heat>
who's this z*d fella and why are you giving me his/her cc info
<zid>
if the $10 ends up with the wrong person I will forgive you
<zid>
I am very benevolent
<mats1>
order some ribs
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
xenos1984 has quit [Ping timeout: 250 seconds]
xenos1984 has joined #osdev
vin has left #osdev [WeeChat 2.8]
vin has joined #osdev
<heat>
clang is flattening a 400 line slab allocator into like 5 functions
<heat>
it's pretty funny
maxdev has joined #osdev
<heat>
mjg, btw is vmem freebsd's page allocator? I couldn't tell from the book
<heat>
it seemed like it allocated virtual memory but I got the impression page allocation was kind of attached to it
<heat>
a really nice optimization I made on my slab allocator is that I scratched having a ctor because it's semi-useless
<heat>
and most importantly, I can now fit bufctls in the unused buffers
<heat>
and having read their solution to the "object to slab" problem, I much prefer going down 4 or 5 page tables
<heat>
I'm willing to bet their stupid hash table is way slower (and probably requires locking)
<heat>
i thought about it for a bit and realized I don't really even need locking since the mapping is live
<heat>
(if it isn't and we got a bad pointer, it's a whole other question that I'm not ready to solve. maybe some sanity checking along the way would be a decent idea)
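The "bufctls in the unused buffers" trick heat describes is the classic one of threading the free list through the free objects themselves; a minimal sketch with made-up names, which only works if nothing (e.g. a ctor) expects a freed object to keep valid contents:

    #include <stddef.h>

    /* Minimal sketch of embedding the free-list link ("bufctl") inside the
     * free objects themselves.  Names invented for illustration. */
    struct slab {
        void  *free_list;   /* first free object in this slab, or NULL */
        size_t objsize;     /* must be >= sizeof(void *) for this trick */
        /* ... */
    };

    static void *slab_alloc_obj(struct slab *s)
    {
        void *obj = s->free_list;
        if (obj)
            s->free_list = *(void **)obj;  /* next link lives in the object */
        return obj;
    }

    static void slab_free_obj(struct slab *s, void *obj)
    {
        *(void **)obj = s->free_list;      /* reuse the object's first word */
        s->free_list = obj;
    }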
terminalpusher has quit [Remote host closed the connection]
xenos1984 has quit [Ping timeout: 250 seconds]
opal has quit [Remote host closed the connection]
opal has joined #osdev
kof123 has quit [Remote host closed the connection]
xenos1984 has joined #osdev
gxt has quit [Remote host closed the connection]
<mjg>
heat: i don't see how not having a constructor optimizes anything
<mjg>
heat: codepath at hand should be very rarely used
<mjg>
heat: anyway vmem is sometimes used, but it is slow as fuck
<heat>
you can't store your bufctl inside the object itself if you have a ctor initializing everything when allocating slabs
<mjg>
heat: i don't know if that's an artifact of how it was implemented in freebsd
<mrvn>
you always have to initialize objects. if the slab doesn't have a ctor then that just means you have to call the ctor by hand.
<moon-child>
heat: every time you protect a hash table with a lock, cliff click sheds a tear
<heat>
s/cliff click sheds a tear/mjg rants/
<heat>
:P
<mjg>
... or if vmem is indeed that slow
<mjg>
i tried to use it once and it was a disaster
gxt has joined #osdev
<heat>
but *what is* vmem?
<heat>
is it a vmalloc() kind of thing?
<mjg>
no
<mjg>
vmem promises fast range allocation
lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]
<mjg>
range of numbers from x to y, typically a virtual address
<mjg>
but you can use it for other stuff, bonwick mentions pid allocation for instance
lkurusa has joined #osdev
<mjg>
[which solaris does not use it for btw :>]
<mjg>
core idea is the same as with per-cpu slabs
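For reference, the shape of the arena API the vmem paper describes is roughly the following; these signatures are illustrative, not Solaris's or FreeBSD's exact ones:

    #include <stddef.h>

    /* Rough shape of a vmem-style arena: it allocates integer ranges
     * (virtual addresses, PIDs, ...), not backing memory. */
    typedef unsigned long vmem_addr_t;

    struct vmem;  /* an arena covering [base, base + size) */

    struct vmem *vmem_create(const char *name, vmem_addr_t base, size_t size,
                             size_t quantum /* smallest unit, e.g. PAGE_SIZE */);

    /* returns 0 and stores the start of a free [*out, *out + size) range */
    int  vmem_alloc(struct vmem *arena, size_t size, vmem_addr_t *out);
    void vmem_free(struct vmem *arena, vmem_addr_t addr, size_t size);

    /* example use: carving out kernel virtual address space
     *   struct vmem *kva = vmem_create("kva", KVA_BASE, KVA_SIZE, PAGE_SIZE);
     *   vmem_addr_t va;
     *   if (vmem_alloc(kva, 16 * PAGE_SIZE, &va) == 0)
     *       map_pages(va, ...);
     */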
<maxdev>
i hate signals and i refuse to further deal with their implementation
<heat>
ugh
<mrvn>
maxdev: I've avoided them so far too
<heat>
the fucking book is so vague when it comes to this
<mjg>
maxdev: signals hate you
<mjg>
heat: :))
<heat>
fuck you mckusick
<mjg>
i'll relay the message
<maxdev>
i have them implemented, they cause more trouble than good, i'm removing them
<mjg>
anyhow read the vmem paper by bonwick
<heat>
i have not
<mjg>
i noticed
<mjg>
i'm sayin you should do it
<heat>
probably
<mrvn>
mjg: I have a pointer to the end of the last allocation and any allocation just checks the next N+1 pages are free.
<mjg>
just remember to not buy into the hype :>
<heat>
but I don't think I'll go for vmem in any case
<heat>
or percpu vmem areas
<mrvn>
(using the processes page table)
<heat>
that sounds... not too useful
<mjg>
heat: not saying you should, but it is one of the papers i consider i must read if you do osdev
<heat>
interestin
<heat>
thanks
<mjg>
and it's not an endorsement of bonwick, just he happened to write about it :)
<heat>
I also want to read uvm and ubc one of these days
<mjg>
whatever stuff you read make sure to not buy into the success story
<heat>
lmao
<heat>
why, because it's netbsd?
<mjg>
most papers claim whatever they describe is best shit ever
<heat>
:P
<mjg>
not talking about uvm specifically
<mjg>
i read some vfs papers and rest assured, LOL does not begin to describe it
<mrvn>
mjg: most publications don't accept failures.
<mjg>
all while beinig presented like a turing award-worthy endeavor
<mrvn>
"We tried this and see how bad this was." **rejected**
<heat>
where are all the papers about freebsd and linux?
<heat>
i have not seen one
<heat>
at least for linux
<heat>
these big classics are solaris and netbsd
wootehfoot has joined #osdev
lkurusa has quit [Ping timeout: 250 seconds]
<mjg>
heat: just google around, i don'th ave anything handy
<mjg>
there is defo a paper about vfs for freebsd speed up
<mjg>
which basically replaces one WTF with another
<mjg>
an excellent lollery material is the solaris internals book
<bslsk05>
github.com: x86/mm: Set TLB flush tunable to sane value (33) · torvalds/linux@a510247 · GitHub
<heat>
mjg, is that wrt BSD people or does it also include linux
<mjg>
literally everyone
<mjg>
you even used to have devs from one project show up on lists of another project and straight up lie
<mjg>
pretty funny
<mjg>
my favourite saga concerns syscall performance (way pre cpu mitigations), where the bsd land claimed syscall perf over there is f4st3rz and overall better implemented
<heat>
everyone grew soft and boring
<heat>
except theo
<mjg>
right on man
<heat>
where's the drama, the lies, the flamewars?
<heat>
i think theo would lie and say his performance is even worse because m'safeteh
<heat>
(disregard that detail where I can't beat openbsd)
<mjg>
want some juicy drama read up on grsecurity vs openbsd
<mrvn>
And I would expect that "33 pages" limit to be CPU specific.
<mrvn>
s/limit/break even point/
opal has quit [Remote host closed the connection]
opal has joined #osdev
gxt has joined #osdev
freakazoid332 has quit [Remote host closed the connection]
freakazoid332 has joined #osdev
<geist>
oh huh the 33 thing is pretty reasoned (at least for the time)
<geist>
mrvn: see the commit message heat linked before
<geist>
Though of course that’s for a specific workload (compiling the kernel) but interesting nonetheless. Actually not a bad idea running some tests to figure out what the average flush batch is
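The gist of the tunable in that commit, as a hedged sketch (names invented; the real logic and tunable live in Linux's x86 mm code): below the ceiling invalidate page by page, above it do one full flush.

    /* Below some ceiling it is cheaper to INVLPG each page, above it a full
     * TLB flush wins.  33 was the value picked in the linked commit; the
     * names here are illustrative, not Linux's. */
    static unsigned long tlb_flush_ceiling = 33;

    static void flush_tlb_range(unsigned long start, unsigned long end)
    {
        unsigned long npages = (end - start) >> PAGE_SHIFT;

        if (npages > tlb_flush_ceiling) {
            flush_tlb_all();                    /* one full flush */
        } else {
            for (unsigned long va = start; va < end; va += PAGE_SIZE)
                invlpg(va);                     /* per-page invalidation */
        }
    }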
<mjg>
it is kind of funny though that the canonical kernel bencmark is kernel build
<mjg>
... which is not at all what people are using the kernel for
<geist>
Indeed. I do worry that that would tend to optimize the whole kernel for very specific workloads. Ie, lots of processes lots of forking
<geist>
Vs say a lot of large long lived heavily multithreaded processes
<geist>
That seems to be somewhat of the opposite pattern
<geist>
Or even medium or small sized processes with lots of threads. That’s basically where fuchsia is at right now
<mjg>
i'm happy it's not a microbenchmark at least
<mjg>
:)
<geist>
Does remind me, now that zen3 has been out a while, has linux finally gotten support for its fancy TLB shoot down mechanism?
<mjg>
running real workloads is hard(tm)
<geist>
I was surprised that AMD didn’t dump it in instantly
<mjg>
well, i'm afraid amd is notorious for not optimizing linux
<mjg>
(not saying they never do it)
<geist>
I worry that it’s actually not a good win, or it is a win on desktopy stuff, but since servers are where a lot of the devs are they aren’t interested in it
<geist>
There are definitely things like the clzero thing being bad on multi socket machines ( i think that was you reporting that) but it’s probably fine on desktop stuff, etc
<mjg>
i have not seen any numbers from it
<geist>
But since the monies are all in server land…
<mjg>
clzero is doing non-temporal stores
<mjg>
which in *certain* cases, multisocket or not, is terrible
<geist>
Sure, the issue was cross socket bandwidth, etc
<mjg>
and is extra bad on multisocket
<geist>
Ah. What’s interesting is that’s the default way to do it on ARM
<mjg>
is it?
<geist>
Absolutely. Clzero is clearly one of the AMD features that was picked up from K12
<geist>
Seems that at least half of these new AMD fancy things are 1:1 an ARM feature
<geist>
(The tlb shoot down is the same thing too)
<mjg>
are you in position to count cache misses with this vs regular zeroing?
<mjg>
when doing the famed kernel build :p
<geist>
On what hardware?
<mjg>
arm
<geist>
Sure, but my point is the clzero arm equivalent (`dc zva`) is used *everywhere*
<geist>
Like the default memset implementation uses it
<mjg>
at least 2 years ago nt stores for clear_page still resulted in *more* cache misses (aka slower)
<mjg>
maybe i misunderstood what you wrote here
<mjg>
does arm64 clear_page use nt stores?
<geist>
Basically if memset is asked to write zeros there’s a ‘fast path’ that just blats out 64 bytes at a time with the instruction
<mjg>
nt stores for big enough(tm) memset and memcpy is preferable, since you are busting the cache anyway
<geist>
`dc zva` is an instruction that blats out a zeroed cache line. It is defined as being non temporal (or at least cache skipping)
<mjg>
but that's way past 4KB
<geist>
Though the arch manual says it doesn’t always have to be, etc etc. ie, it doesn’t cache allocate
<geist>
But yeah i do wonder if perf is dropped on the floor because of overuse of that in general purpose stuff
<geist>
Though it's possible arm is more clever about it in general, since they advocate using it more or less at every possible place
wootehfoot has quit [Read error: Connection reset by peer]
<geist>
Yep. Dc zva
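For concreteness, the `dc zva` fast path geist describes looks roughly like this; it assumes a 64-byte zero block (the real size comes from DCZID_EL0) and a block-aligned buffer, with the edges handled elsewhere:

    #include <stddef.h>

    /* Sketch of the arm64 `dc zva` zeroing fast path.  Assumes a 64-byte
     * zero block and block-aligned dst; real memset implementations read
     * DCZID_EL0 and handle unaligned head/tail separately. */
    static void zero_blocks(void *dst, size_t len)
    {
        char *p = dst;
        for (; len >= 64; p += 64, len -= 64)
            __asm__ volatile("dc zva, %0" :: "r"(p) : "memory");
    }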
<mjg>
so
<mjg>
on amd64 nt stores *in this routine* result in more cache misses (== slowdown)
<mjg>
and i would be seriously surprised if it was different for arm64
<geist>
Also the other path (the one that doesn’t use dc zva) is using the `stnp` instruction which is also non temporal
<mjg>
heh
<mjg>
well, in principle, regular zeroing may be so slow that you come out ahead even with cache misses
<mjg>
the point though is that you are inducing more traffic to ram, which can't be good
<gog>
hi
<geist>
Yah that’s my general thought. And again I think in the case of zeroing pages that’s different case from just general usage in user space in memset
<geist>
Which ARM also generally does
<mjg>
but perhaps arm is special casing this somehow and if the line is already in cache it does not get evicted?
<geist>
Oh I’m fairly certain that’s precisely the case.
<mjg>
geist: i have no issues with nt stores in memset past a certain size
<geist>
That’s what i mean about it being more clever
<geist>
As in the non temporal part is a strong hint
<mjg>
ye, that would make 100% sense
<mjg>
... and it's not what happens on amd64
<mjg>
:>
<geist>
All non temporals or just clzero?
<mjg>
all nt
* geist
pets gog
<mjg>
unless i missed some special case
* gog
prrr
<geist>
So all nts in x86 are defined as explicitly writing back and evicting any cache lines that intersect?
<mjg>
they invalidate the lines
<mjg>
and then you go to ram
<mjg>
i did bench a clzero-based clear_page equivalent for freebsd. system time went down (faster zeroing), user time went up (cache misses)
<geist>
Got it
<mjg>
total real time basically the same
<mjg>
so a net loss overall from more ram traffic
<geist>
Yeah, same amount of work is done, just accounted differently
<geist>
Also i guess a lot of this depends on if zeroing is done on free or alloc. On free it seems that dumping the cache as a result would be more okay than on the alloc path
<mjg>
... which will cause slowdowns when more cpus want to access it
<geist>
since on the alloc path you might want to touch the page soon anyway
<mjg>
i don't understand why you would zero on free
<geist>
Security, etc
<mjg>
user pages?
<mjg>
i don't mean explicit_memset here
<mjg>
:)
<geist>
User pages what?
<mjg>
in this case we are talking about pages being reused?
<mjg>
clear_page when you get it or unmap
<geist>
Yes
<mjg>
right, so i don't see the point
<geist>
Of what? I’m missing context here
<mjg>
well in principle there may be a kernel memory disclosure bug
<mjg>
... of zeroing pages when freeing them instead of when they get faulted
<geist>
Got it. Yeah i dont particularly see the point *except* from a security point of view some folks get antsy with user data sitting around in it
<mjg>
i found a bug in linux once which would indeed be able to dump it
<geist>
But yeah it has the whole ‘have to keep queues of freed and freed-and-zeroed pages’ etc floating around
<mjg>
security aside, zeroing on alloc is a clear cut win from perf standpoint
<geist>
Or at least it’d be a lot more complicated than to just treat freed pages as yolo
<geist>
Agreed.
<geist>
One thing i would like is when running a VM host with a lot of tenants, it’d be lovely if they zeroed their pages on free
<geist>
Since the VM host can dedupe them, however that is really what balloon memory reclaimation is for anyway
<geist>
Since that generally pressures the clients to trim their file cache or whatnot
<mjg>
i don't know if zeroing is needed
<mjg>
i would expect you could tell the hvm that this is unused now
<geist>
I’ve forced it on my VM by doing the whole ‘fill a tmpfs file with zeros, delete it’ method
<mjg>
then it can optimize bulk zeroing as needed
<geist>
Yeah having some sort of memory based TRIM like call would be nice
<mjg>
and have the guests vmexit if they want said pages
<geist>
Agreed, but there doesn’t seem to be anything like that in any of the VMs i know about
<mjg>
not much of a vm guy myself
saltd has joined #osdev
<geist>
I think it’s generally the status quo to just let the guests use their allotted mem and be done with it
<mjg>
fucking guests man
<geist>
And if you really want some sort of overcommit and the guests are okay with it (ie, it’s your box) youcan use balloon memory scheme for that
<mjg>
there are systems i'm not gonna name which just eat cpu while idle
<mjg>
and not some 1%
<geist>
But i suspect that’s not en vogue because most VM guests in most situations are on something like AWS or Azure or GCE where there’s no real reason for a guest to ‘play nice’with the rest of the machine
<geist>
They’re paying for N GB of ram, they can use it
<geist>
But i have no idea, I’m not a vm person either, so i dont know where it’s at
rwb has quit [Ping timeout: 260 seconds]
<mjg>
that's precisely the environment where i expect vendors to try to squeeze more free ram
<geist>
But i have a personal box running 10 or so qemu instances that i somewhat overcommit, and generally rely on page deduping and swap to work
<geist>
Anyway gotta go. Ferry is landing
gildasio has quit [Write error: Connection reset by peer]
gildasio has joined #osdev
<mrvn>
I think you have to look at why you get cache misses or lack thereof.
freakazoid332 has quit [Ping timeout: 244 seconds]
<mrvn>
One pattern I've seen is: struct Foo foo; memset(&foo, 0, sizeof(foo)); foo.x = 1;
<mjg>
the extra misses are from accessing now evicted lines
<mrvn>
If memset bypasses the cache then you write 0 to memory, get a cache miss and then write 1.
<mjg>
alloc page, do work, free page, alloc page -- you are back to the same page, fully cached
<mjg>
if you now zero it with nt stores you have to read it back from memory
<mjg>
and for workloads like building the kernel this happens a lot
<mrvn>
mjg: I would expect a nt store of cached data to update the cache.
rwb has joined #osdev
<mrvn>
mjg: have you actually tested this? Does building a kernel frequently free a page and then alloc it again? I would think the libc just increases the heap till the compile step is done and never frees anything.
<mrvn>
mjg: or did you mean libc malloc() reusing freed memory over and over?
<mjg>
i did test it, just like other people did
<mjg>
it's not necessarily literally the same page, but it's still something you have in llc
<mrvn>
mjg: and you get tons of sbrk() calls or munmap/mmap?
<bslsk05>
cgit.freebsd.org: src - FreeBSD source tree
<mjg>
page copy instead of page zero, but same concept applies
<mjg>
google around, you will find linux people doing more extensive tests and reaching the same conclusion
Ali_A has joined #osdev
<mrvn>
mjg: pagecopy is far from freeing and allocating the same page though.
<mjg>
i noted it's not necessarily literally the same page, just something you still have in llc
<mrvn>
mjg: even that isn't happening there.
<mjg>
i used 'the same page' example for easier illustration of what's going on
<mjg>
it is
<mjg>
buildkernel is full of short lived processes
<mjg>
so pages keep getting reused
<mrvn>
what I see there is that copying the page will prime the cache so accessing the page after the copy generates cache hits.
<mjg>
found it
<mjg>
there was a time where freebsd did not do numa
<mjg>
and i had a 2 socket box
<mjg>
make -j 40 buildkernel:
<mjg>
nt stores: 1726.98s user 554.49s system 1841% cpu 2:03.87 total
<mjg>
rep movsq: 1683.30s user 550.70s system 1876% cpu 1:59.08 total
<mjg>
for pagezero
<mrvn>
The interesting part in your url I find is that there is 0 change in the runtime. ~25% less cache misses and 0 change in speed.
<mjg>
that's because at the time the kernel was incredibly slow in general
<mjg>
lemme try to find an example
<mjg>
some syscalls are now 3x the speed
<heat>
how many openbsds was freebsd at that time
<mrvn>
I also don't get the original code. WTF is it doing there? It copies a page in blocks of 64 byte and then loops some more blocks of 32byte?
<saltd>
fuck your friends, we are going home
<saltd>
but
<mjg>
sigh i don't have numbers that old
<mrvn>
Because a page isn't a multiple of 64 and surely there must be multiple blocks of 32 byte at the end of the page that aren't a 64byte block?
<saltd>
wrnh chann
<saltd>
o ops
<mrvn>
oh wait, the first loop just prefetches a page and then it copies in a second loop, right?
<heat>
yes
<mrvn>
Wasn't the point of non temporal not to trash the cache? How does prefecthing work there?
<heat>
you're prefetching the source
<heat>
prefetch source -> normal load from cache -> nt store to dest page
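That pattern, as a sketch with SSE intrinsics rather than the actual FreeBSD assembly routine: prefetch the source with NTA hints, then stream the (cache-bypassing) stores to the destination page.

    /* Sketch of the prefetch-then-stream page copy pattern being discussed
     * (SSE intrinsics for illustration, not the real FreeBSD asm). */
    #include <emmintrin.h>  /* _mm_prefetch, _mm_stream_si128, _mm_sfence */

    static void pagecopy_nt(void *dst, const void *src)
    {
        const char *s = src;
        char *d = dst;

        /* pass 1: pull the source into the cache with NTA prefetch hints */
        for (int off = 0; off < 4096; off += 64)
            _mm_prefetch(s + off, _MM_HINT_NTA);

        /* pass 2: regular loads from the (now cached) source, streaming
         * non-temporal stores to the destination page */
        for (int off = 0; off < 4096; off += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + off));
            _mm_stream_si128((__m128i *)(d + off), v);
        }
        _mm_sfence();  /* order the weakly-ordered NT stores */
    }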
epony has quit [Ping timeout: 252 seconds]
<mrvn>
urgs, NT stores are weakly ordered but prefetchNTA is fully coherent. So maybe that first loop just makes sure the memory is in sync.
frkzoid has joined #osdev
gorgonical has joined #osdev
<gorgonical>
I am losing my mind with this assembler
<mrvn>
heat: the prefetchnta seems to do something different on every cpu.
<heat>
gorgonical, drop nasm
<heat>
use gas
<gorgonical>
I am using gas unfortunately
<heat>
gas is good
<heat>
what's your problem and why are you wrong
<mrvn>
gas or gcc -S?
<gorgonical>
I am writing a forth implementation in risc-v and can't abuse resetting variables to do word linking for me
epony has joined #osdev
<gorgonical>
"redefined symbols can't be used in reloc"
<heat>
dude
<mrvn>
.oO(Tell us your problem and we will tell you why you are wrong)
<heat>
you're writing riscv assembly?
<gorgonical>
yes
<gorgonical>
I acquired a taste for it this summer while porting a kernel
<heat>
I would've blown my brains out at line 10
<mrvn>
gorgonical: have you tried writing it in C and looking at the assembly output?
<heat>
wait
<heat>
variable?
<heat>
what's a variable'
<gorgonical>
I know what I want it to do, actually. I just can't tell if the assembler directive/macro system is good enough
<mrvn>
gorgonical: if you use numbers for symbols you can reuse them
<gorgonical>
basically each word needs a pointer to the previous one. .int link holds that. The regular forth impl resets link each time you define a word
<gorgonical>
.int link
<gorgonical>
.set link,\current_word or so
gxt has quit [Ping timeout: 258 seconds]
<mrvn>
gorgonical: so a compile time register tracking the last allocation?
<gorgonical>
yes done all by the macro system of the assembler
<mrvn>
gorgonical: can you push/pop variables?
<gorgonical>
I'm like 80% sure I can do this the manual way by tracking what the last word was and just inserting it manually but I don't really want to do that lol
<gorgonical>
mrvn: Not that I'm aware. But I'm not that versed in gas directives/macros
rwb is now known as rb
gxt has joined #osdev
<gorgonical>
by tracking I mean "remembering myself"
<mrvn>
gorgonical: I tried doing macros and structures in gas but support for that is horrible. Under AmigaOS I had a Devpac assembler where you could basically define structs like C code and it would define offsets for all the members that you could use.
<mrvn>
It had an actually usable macro language.
epony has quit [Ping timeout: 252 seconds]
<gorgonical>
Yeah the gas macro language is ugly. I'm almost thinking it would be preferable to use the C preprocessor
<heat>
why not both
<heat>
I use both
<mrvn>
gorgonical: that's what I always do. Just to be able to "#include" alone.
<heat>
there's certain stuff you can't do in Cpp macros though
<heat>
like instruction sequences, since there's no newlines in CPP
<mrvn>
heat: op; op2; op3;?
<heat>
does that work in gas?
<heat>
I thought that was an inline assembly thing
<mrvn>
not sure
<gorgonical>
I can almost get away with what I want by using the \@ symbol that counts the number of times any macro has execd. But I need to insert link\@ and set link(\@+1) and that doesn't work
epony has joined #osdev
<gorgonical>
And because of the way gas macros work you are limited in what you can concatenate to make a symbol name, apparently
<mrvn>
<source>: Warning: end of file not at end of a line; newline inserted :)
<gorgonical>
I think the right (read: only) thing I can do without introducing the cpp is to manually specify what word came previously. When defining words interactively some word in memory is used to remember the latest thing we defined. That's what link does at compile-time
<gorgonical>
I think
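For context, the structure the macro is trying to build at assemble time is just a singly linked dictionary; a rough C equivalent (names invented), with `latest` playing the role of the compile-time `link` variable:

    /* What the assembler macro is effectively laying out with .int/.set:
     * each dictionary entry starts with a pointer to the previously
     * defined word.  Illustrative C only, names invented. */
    struct word {
        struct word  *link;       /* previous word, NULL for the first one */
        unsigned char namelen;
        char          name[31];
        void        (*code)(void);
    };

    static struct word *latest;   /* the compile-time "link" variable */

    static void define_word(struct word *w)
    {
        w->link = latest;         /* point at the previous definition */
        latest  = w;              /* and become the new head of the chain */
    }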
saltd has quit [Remote host closed the connection]
<gog>
you can use the c preprocessor for assembler files
<gog>
just use .S
<gog>
gcc will do the right thing with them
<gog>
clang too
<gorgonical>
Yeah I just don't know exactly how to get the result I want with cpp at the moment
<gog>
ah hm
<gog>
that is tricky
<gorgonical>
Maybe like a macro that spits out the right link variable name would work
<gorgonical>
At the moment gas is complaining that gp and t0 aren't valid risc-v registers, which I have never seen before
<gorgonical>
oh possibly I am a dumbdumb
<gorgonical>
yes I think I understand lol
<gorgonical>
Yeah I forgot the memory syntax and used j where I should have used jr
saltd has joined #osdev
<gog>
oops
<heat>
apparently there's a bug on every gcc since 4.8
<heat>
it's still hella simple, no percpu magic yet
<heat>
the cache's nr_objects and active_objects aren't touched because I quickly realized I'll need to add them to the percpu context
<heat>
as to avoid the atomic add
<heat>
yay for scalable counters I guess
<heat>
...actually it's under a lock right now
<heat>
meh, don't care
<geist>
Word. I’ll take a look tonight, am curious
<mjg>
c->alignment = ALIGN_TO(c->alignment, 16);
<mjg>
you want 8 bytes bro
<mjg>
no simd
<geist>
Depends on the ABI. If the ABI demands it you shouldn’t futz with that
<geist>
And you can try to override it but that’s a good way to get to buggy compiler land
<heat>
mjg: I like having extra bits in the address to stuff things in
<heat>
but the alignment stuff isn't finished. as in, totally not finished
<heat>
barely started
<mjg>
well then don't get an 8 byte slab at least :p
<geist>
Cmpxgh16b (and the arm equivalent) i think is a case where the 16b comes back
<mjg>
i would whack null checking in kfree and demand non-null pointers
<geist>
I believe we bumped into that in fuchsia. May not be x86, but arm may have the requirement for doubleword atomics
<mjg>
for the special cases you can guarantee aligment
<mjg>
i'm saying 8-byte sized (and smaller!) allocs do happen
<geist>
Sure but problem is the compiler will assume that allocations are aligned that way, so will align their structures to assume as much
<heat>
I originally tried to provide the same alignment guarantees as the original paper (as in, everything is $objsize aligned)
<heat>
but aligning non-power-of-2 sizes is not trivial
<heat>
the current code will also fail for > 4KB alignments
<heat>
again, that area is super WIP
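Presumably the quoted ALIGN_TO is the usual round-up helper, which only works for power-of-two alignments; that mask trick is exactly why the non-power-of-2 case heat mentions is awkward. A sketch, not Onyx's actual definition:

    /* Classic round-up-to-a-power-of-two-boundary helper (sketch): */
    #define ALIGN_TO(x, a)      (((x) + ((a) - 1)) & ~((a) - 1))

    /* non-power-of-2 equivalent, much less pleasant on a hot path: */
    #define ALIGN_TO_ANY(x, a)  ((((x) + (a) - 1) / (a)) * (a))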
<mjg>
i think you are giving yourself more work to do by not providing per-cpu support from the get go
<heat>
yea, I'm on it
<mjg>
key point being that with it you will only call down to slab for a full batch of objects
<heat>
but why more work?
<mjg>
or to return one
<mjg>
... batch
<mjg>
no 1 obj at a time fuckery
<heat>
I don't like that
<heat>
i'm scared i'll just get objects stuck in the percpu queues forever
<mjg>
it does not have to happen
<heat>
and yes, I understand that's how fbsd rolls
<heat>
i don't think linux's design does that? but I may be super wrong
<mjg>
but if you cache something per-cpu, you already are susceptible to it
<heat>
sure, but the problem is probably less serious there
<mjg>
?
<heat>
it also avoids fucking with other cpus' queues
<mjg>
let's try an example
<mjg>
in caes described in the original paper you have a "magazine" with n object cached
<mjg>
what exactly are you planning to do in your case?
<heat>
frees go to the percpu cache, first allocations hit the slabs
<mjg>
how many objects are you wiling to store in the percpu cache
<heat>
unknown
<mjg>
... and how are you going to avoid "stuck forever" problem
<heat>
the idea for now will probably be to flush everything at purge time
<heat>
as in, use the pcpu cache as a literal cache
<mjg>
that's pretty weird imo and will definitely negatively affect your perf
<heat>
it is?
<mjg>
i would say, if you insist on returning per-cpu cache, and are handling the fast path with preemption disabled
<mjg>
you can just ipi into it and tell the cpu to give everything back
<mjg>
should you run into memory shortage
<mjg>
otherwise bugger off
<heat>
how do you even put it there without a lock?
<mjg>
where
<mjg>
oh back to slab?
<heat>
no
<heat>
in the magazines
<mjg>
i think there is a miscommunication here, so let me pseudo code
<mjg>
you get to the fast path, find you got nothing, you go to the slab layer, take whatever locks you need and grab 1 magazine of objects
<heat>
oh, for you only?
<mjg>
yea
<mjg>
on free you get to the fast path, if there is free space in the magazine for the obj, you put it there
<mjg>
now i remember why bonwick wanted 2 magazines
<mjg>
if the magazine is full, you return it, again taking locks
<mjg>
for the case where you want to return shit on demand, even if the cpu is chillin
<mjg>
you can ipi, check it is not doing an alloc, and tell it to set the magazines aside
<mjg>
e.g., you pass it a pointer to set to
<mjg>
then on return you got the magazine and the cpu no longer does
<mjg>
should you find it is mid-alloc, you cpu_relax and try again
<mjg>
as preemption is disabled it is an invariant it will eventually finish
<mjg>
in principle by the time you get there it may be doing another alloc and if that's something you are worried about i have a hack which sorts it out
<heat>
I see
<mjg>
remember to get 2 magazines tho
<mjg>
you don't want a spot like this:
<mjg>
you have a full magazine, free() comes your way
<heat>
i was readin the codez and I think I was being conservative
<mjg>
there is no place to put the obj, so return the magazine
<mjg>
now you got 1 obj cached
gog has quit [Quit: byee]
<heat>
I dont understand the 2 magazine shizzle btw
<heat>
but that's not too important
<heat>
surely
<mjg>
well i gave you the corner case
<mjg>
you may end up with 1 cached object and having to take another trip to slab
<heat>
why would I end up with 1 object in the mag instead of $MAG_SIZE
<mjg>
you have 1 magazine, it is full
<mjg>
free comes your way
<heat>
yes
<mjg>
what are you planning to do
<heat>
so dump 1 object, not the whole mag
<mjg>
another free comes your way
<mjg>
what are you planning to do
<heat>
dump another object
<mjg>
so you gonna keep taking 1 obj trip to slab now?
<mjg>
as long as the above degenerate case happens?
<heat>
in fact, here's what the linuz does
<heat>
half the entries go to the global cache
<heat>
i like this solution over your 2 magazine galaxy brain solution
<mjg>
it's not mine but bonwicks
<heat>
i bet bonwick designed doors
<mjg>
also half is quite a difference from 1 by 1
<heat>
yes
<mjg>
here is a trivial caes where 1 by 1 fucking demolishes perf
<mjg>
you rm -r a dir tree
<heat>
i corrected my design by N / 2 - 1
<mjg>
ok
<geist>
Yeah get it working first, recognize that optimizations will come, and then move on
<geist>
The much more important thing is to get the rest of the system thinking in terms of slab allocators
<heat>
yeah
<heat>
it works-ish right now
<mjg>
makes you wonder, is slab really this good of an idea
<mjg>
or do we all suck
<heat>
last
<geist>
Well, i have thought that thought too
<geist>
Ie are slabs the end all of everything for kernels
<geist>
It *seems* to generally fit the model in that you probably have a fairly limited set of objects in the overall system and thus it benefits well
<geist>
Plus it can sit more or less directly on top of the PMM, etc
<heat>
slabs are just object pools with a particular internal design
<mjg>
one immediate problem with slab, which is not inherent to it, but is present
<mjg>
is that you are going to have allocs with drastically different life spans
<heat>
nt heavily does object pools I think
<mjg>
which slab does not handle
<mjg>
and you can keep entire pages hostage because of it
<geist>
Yah there is a fair amount of loss due to external fragmentation?
<geist>
Or would that be internal fragmentation. Either way, exactly
<mjg>
but then is this a slab problem or a consumer problem
<heat>
bonwick's paper says 7% fragmentation over 1 week
<mjg>
on his sparc desktop?
<heat>
yes
<geist>
Well a mixed heap has at least more of a shot of mixing up allocations within a page
<heat>
on his sparc desktop
<mjg>
that's representative... of his sparc desktop
<geist>
So *probably* on the average would have less unallocated space in pages
<mjg>
i don't have any oither numbers, to be fair :-P
<heat>
when does google release the server fleet's allocation profiles
<geist>
Almost by definition you have a one page overhead per object type, but then lots of binning heaps have similar stuff
<heat>
it's what they test tcmalloc on
<geist>
Sure, but even that is the fleet. Big server stuff is really a different optimization than desktops/embedded/etc
<heat>
yeah
gxt has quit [Ping timeout: 258 seconds]
<mjg>
i think a sparc desktop from the late 90s is the middle ground here!
SpikeHeron has quit [Quit: WeeChat 3.6]
<geist>
This is a somewhat recurring problem in fuchsia, where we have server class algorithms/data structures/common knowledge applied to something that isn’t
<heat>
but like, are you noticing 160k opens per second vs 200k opens per second in desktop/embedded?
<heat>
if you're mjg and want to make fun of me, yes
<mjg>
i would notice if you could make this many!
<geist>
As is generally the case, having something working is infinitely more valuable than nothing working but an optimal design in the works
<geist>
But it also depends on what makes you happy
<geist>
Do you like making things go or do you like optimizing things. Both answers are totally valid
<geist>
I’m much more in the former camp
<heat>
yes
<mjg>
latter camp unite
* mjg
is gonna sell merch
<heat>
no
<heat>
all camp
<mjg>
OH
<geist>
OPTIMIZERS RULE
<mjg>
did i mention that a certain perf baseline is part of correctness?
<geist>
I think true engineering is designing with both in mind. Ie, getting something functional soon but make sure future designs are available
<mjg>
i'm happy for you that your bubble sort works a-ok
<mjg>
but i can't use it bro
gxt has joined #osdev
<geist>
Trying to paint the room such that when you do fill it in, you're not in a corner
<mjg>
heat: i think it would be in good taste to add a note that slab was inspired by the paper
<mjg>
heat: at the top of the file
<mjg>
also you may want to add a sun-related pun to onyx now
<heat>
i made up slabs
<heat>
my name is jeff bonwick
<heat>
pleasure to meet you
<mjg>
joel spolsky
<mjg>
sup man
<mjg>
you should post on stackoverflow
<heat>
totally should matthew
<heat>
loved your firmware talk!
<mjg>
"interestingly" the english version of my name *is* matthew
<heat>
guzik is polandian for garrett
<mjg>
central european!
<heat>
check
<heat>
from the check republic
<mjg>
anyhow openbsd has per-cpu slabs man, you should catch up
<mjg>
basically i think you could learn a lot by reading their code
<heat>
must. not. lose. against. theo.
<mjg>
few years back openbsd added a new syscall which was printing some crap
<mjg>
they directly derefed user memory in it
<mjg>
s3cur1ty
<heat>
are ctors for slabs that beneficial that you're willing to throw away memory for them?
<mjg>
what do you mean throw away
<heat>
you can't put your bufctl inside the free object
<heat>
you effectively can put less objects in a slab
<mjg>
just don't add them for now
<mjg>
if you have something really heavyweight to do when creating the object, that's a candidate
<mjg>
few stores don't count
<heat>
right
<mjg>
i expect you wont have any use for constructors for quite some time
<heat>
but circling back to your vnode example with the lru
<heat>
you're always paying the cost
<mjg>
what do you mean by always here
<mjg>
and cost in terms of memory or cpu time
<heat>
cpu and memory
<mjg>
memory would be used anyway
<heat>
although maybe you guys found a better strategy for bufctls
<mjg>
and it *saves* cpu time
<mjg>
:>
<heat>
why?
<mjg>
consider n threads all allocating a vnode at the same time
<heat>
I can call 16 ctors() at once, or 1 ctor at a time
<mjg>
if there is no ctor, adding to global lru has to happen at said alloc time
<mjg>
so you got n threads contending to do it
<mjg>
and then you have to whack it from the list if the file goes away
<heat>
sure but that's the best case scenario right?
<mjg>
sounds like a bad case? :)
<heat>
if you have contention, you dun goofed
<mjg>
thanks to the ctor i don't
<heat>
yeah but what if someone outside the allocation function locks it for any reason
<heat>
traversal or wtv
<mjg>
that's part of the point. ctor() stuff is incredibly rarely called compared to alloc/free
<heat>
hrm
<heat>
riiiight
<mjg>
which also means i normally avoid locking
<mjg>
if ctor happens a lot, you are caching it wrong or are suffering turbo memory pressure
<heat>
but I'm not entirely convinced given that dtors are kind of a bad idea
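The ctor/LRU argument above in code form, as a rough sketch with invented names: with a ctor, the global lock-protected LRU insertion happens once when the object is first constructed for the slab (rare), not on every allocation (hot), so the alloc/free fast path stays off the shared lock.

    static int vnode_ctor(void *obj)
    {
        struct vnode *vp = obj;

        init_lock(&vp->lock);
        lock(&global_vnode_lru_lock);
        list_insert_tail(&global_vnode_lru, &vp->lru_node);   /* rare */
        unlock(&global_vnode_lru_lock);
        return 0;
    }

    struct vnode *vnode_alloc(void)
    {
        /* hot path: per-CPU cache hit, no global lock; the object is
         * already on the LRU from when it was constructed */
        struct vnode *vp = cache_alloc(vnode_cache);
        if (vp)
            reset_per_alloc_fields(vp);
        return vp;
    }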