klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
SpikeHeron has quit [Quit: WeeChat 3.6]
<heat> is the tlb in qemu tcg a thing or what?
<heat> can I theoretically skip flushing the TLB?
<heat> is that a no-op?
<geist> i think it has some sort of cache, yes
<mrvn> heat: m[10, 20] isn't the "," operator though but some new evil hack to make that a n-ary []
<heat> I know
<mrvn> Makes me wonder how they make that work without breaking existing , operators inside []
<mrvn> vec[size = max(size, i), i] = val;
* mrvn ducks behind this horrible construct but can't think of something else that could use , inside []
\Test_User has quit [Read error: Connection reset by peer]
SpikeHeron has joined #osdev
gildasio has quit [Remote host closed the connection]
<heat> I've been unfucking my riscv port
gildasio has joined #osdev
<heat> it's pretty unfucked by now, but gcc is still acting up
<heat> I need some real CI for this...
nyah has quit [Ping timeout: 252 seconds]
<heat> I need interrupts
<heat> I do not have interrupts because I didn't bother writing a PLIC driver
Oshawott has quit [Ping timeout: 248 seconds]
<moon-child> mrvn: if it's in a statement context, there's no reason not to put the computation by itself
<moon-child> if it's in an expression context, you can pull it out: (size = max(size, i)), (vec[i] = val)
Gooberpatrol_66 has joined #osdev
<heat> ok so
<heat> what's the difference between a M-mode enable and an S-mode enable in the PLIC?
\Test_User has joined #osdev
<heat> hrm, it seems that it may be a machine detail
<heat> the generic PLIC spec seems to be specced for 1024 interrupts
<heat> the actual sifive cores seem to only have 512 and have that M-mode vs S-mode thing
<heat> the logic to just figure out what goes where is horrific
<heat> cheers, they fucked up the irq controller
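A rough sketch of the "what goes where" arithmetic heat is describing. It assumes the register layout of the generic RISC-V PLIC spec (enable bits at 0x2000 with a 0x80-byte stride per context, threshold/claim at 0x200000 with a 0x1000-byte stride) and, as a loudly labelled assumption, QEMU `virt`-style context numbering where hart h owns M-mode context 2h and S-mode context 2h+1. Real SoCs can number contexts differently, so the device tree stays authoritative. All function names here are hypothetical.

```c
#include <stdint.h>

/* Offsets from the PLIC base, per the generic RISC-V PLIC layout. */
#define PLIC_ENABLE_BASE     0x002000u
#define PLIC_ENABLE_STRIDE   0x80u      /* 1024 sources / 8 bits per byte */
#define PLIC_CONTEXT_BASE    0x200000u
#define PLIC_CONTEXT_STRIDE  0x1000u

/* ASSUMPTION: QEMU virt-style numbering, two contexts per hart
   (even = M-mode, odd = S-mode). Not true of every SoC. */
static inline uint32_t plic_context(uint32_t hart, int s_mode)
{
    return 2u * hart + (s_mode ? 1u : 0u);
}

/* Offset of the 32-bit enable word that holds the bit for `irq`. */
static inline uint32_t plic_enable_offset(uint32_t context, uint32_t irq)
{
    return PLIC_ENABLE_BASE + context * PLIC_ENABLE_STRIDE + (irq / 32u) * 4u;
}

/* Bit within that enable word. */
static inline uint32_t plic_enable_bit(uint32_t irq)
{
    return 1u << (irq % 32u);
}

/* Offset of the claim/complete register for a context. */
static inline uint32_t plic_claim_offset(uint32_t context)
{
    return PLIC_CONTEXT_BASE + context * PLIC_CONTEXT_STRIDE + 4u;
}
```

Under these assumptions the M-mode and S-mode enables are the same register block repeated per context, so enabling IRQ 10 for S-mode on hart 0 means setting `plic_enable_bit(10)` in the word at `plic_enable_offset(plic_context(0, 1), 10)`.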
heat_ has joined #osdev
heat has quit [Read error: Connection reset by peer]
heat_ is now known as heat
<mrvn> moon-child: that isn't the point. Currently that calls operator[](size_t), in the future that might call operator[](size_t, size_t).
MelMalik is now known as AmyMalik
<heat> mrvn, I can imagine the behavior switching through -std=c++23
<mrvn> It should be that if operator[](size_t, size_t) is defined, that one is preferred. Otherwise the old , operator should take hold.
<klange> Python has some similar challenges with commas - e.g., where is a comma an operator that makes a tuple? Had a lot of "fun" with that in Kuroko's compiler...
<mrvn> (a,b) vs. a,b
<mrvn> f(a, b) vs. f((a, b))
archenoth has joined #osdev
gildasio has quit [Remote host closed the connection]
<moon-child> isn't (a,b) the same as a,b in python?
gildasio has joined #osdev
<\Test_User> a(a,b) and a((a,b)) is different
<\Test_User> er that was already stated... but when used in that kind of situation
<klange> as an expression on its own, yes, (a,b) and a,b are equivalent.
terrorjack has quit [Quit: The Lounge - https://thelounge.chat]
terrorjack has joined #osdev
[itchyjunk] has quit [Read error: Connection reset by peer]
<mrvn> I'm not sure f(a,b) isn't a tuple too. You can do "def foo(*args)" after all.
elastic_dog has quit [Ping timeout: 264 seconds]
elastic_dog has joined #osdev
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
<klange> mrvn: it is not, but it was historically, keyword arguments necessitated entirely different parsing and that happened a long time ago
heat has quit [Read error: Connection reset by peer]
heat has joined #osdev
<mrvn> right, with kwargs you have to split it into a tuple and a dict.
<mrvn> but python is somewhat screwy there. def foo(x=0): print(x)
<mrvn> >>> foo(1)
<mrvn> 1
<mrvn> >>> foo(x=2)
<mrvn> 2
<mrvn> Is that a positional or key word arg?
<klange> positional vs. keyword is determined by the call, so it's positional in the first and keyword in the second
<mrvn> except when you do "def foo(*, x=0)"
<mrvn> eat all positional, so x must be keyword then
<klange> That's not what bare * means.
<klange> It means "all following arguments cannot be provided as positional arguments"
<mrvn> i.e. x must be given as keyword argument, the callee determined that
<klange> it mandated that, it didn't determine it
<mrvn> anyway, my point was that it must be a real pain to parse
<klange> There are ordering requirements that make it not too bad in that context.
<mrvn> can't just chuck any value without name into a tuple and any "name=val" into a dict. The position of the named args matters.
<klange> The ordering of the named args does not matter.
<mrvn> they have to be after unnamed args
<klange> Yes, and that's what makes it easier to parse...
<klange> Because, again, it absolutely is not a tuple...
<mrvn> I do love the "*args" and "**kwargs" syntax though. Does anything but python have that feature?
zaquest has quit [Remote host closed the connection]
zaquest has joined #osdev
<klange> I have it in Kuroko, but then whether that is "anything but Python" or not is up to you.
<moon-child> mrvn: iirc, raku has something like it
gxt has quit [Ping timeout: 258 seconds]
<energizer> mrvn: julia has f(a...; kw...)
gxt has joined #osdev
heat has quit [Ping timeout: 250 seconds]
Ram-Z has quit [Ping timeout: 265 seconds]
dormito has quit [Ping timeout: 268 seconds]
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
dormito has joined #osdev
epony has quit [Read error: Connection reset by peer]
epony has joined #osdev
<geist> aww, sadness when your favorite chromebook starts bulging. the inevitable bulging battery syndrome
<geist> well had a good run, 7 years old
<\Test_User> can always remove the battery and just use it with an outlet nearby
<\Test_User> kinda defeats the point but still more mobile than a normal PC
<geist> yeah i will probably do that
tarel2 has joined #osdev
<pitust> geist: you could replace the battery, no
<pitust> ?
m5zs7k has quit [Ping timeout: 268 seconds]
m5zs7k has joined #osdev
<zid> I'm still not sure what you're supposed to do with bulged lithiums
<zid> throwing them into the trash seems like a recipe for the garbage truck to set on fire
<zid> and I don't wanna keep in my house
<mrvn> it's toxic waste, definitely not for normal garbage
<bradd> maybe a car battery shop could dispose of them?
<mrvn> return to producer?
<zid> put them on a convenient man-hole and fire BBs at it is my guess
<mrvn> car batteries are just acid and lead. Nothing explosive/flammable there. But maybe.
<geist> i actually disposed of some recently at a local recycling place that had basically a metal container to toss used batteries
<geist> outside
<zid> a burn barrel? :P
<geist> basically
<geist> well, in case it became a spontaneous burn barrel, that is
<geist> must be fun dealing with it downstream
<zid> maybe if lithium's price goes up from all the lithium being mined out slowly, they'll end up wanting them back as recycling
Ali_A has joined #osdev
archenoth has quit [Ping timeout: 248 seconds]
GeDaMo has joined #osdev
Ram-Z has joined #osdev
maxdev has joined #osdev
<maxdev> I think I have a misconception on interrupt handling. If a timer IRQ occurs, and my ISR is executed.. will, for example a keyboard IRQ, interrupt my ISR again? Or when is the next interrupt triggered, after exiting my ISR, or after writing the EOI?
<Mutabah> Depends on the interrupt controller
<maxdev> x86 APIC
<Mutabah> The legacy PIC allows lower IRQs to interrupt higher ones (.. there's a quirk where the secondary one is in the middle of the primary)
<zid> Bear in mind this is just which IRQ number you end up taking
<Mutabah> I think the APIC has finer-grained priorities... or just doesn't allow nesting
<zid> whether an IRQ happens at all is still up to whether you have interrupts enabled, the PIC has been acked since last time, etc
<Mutabah> ^
<Mutabah> Assuming you've just started handling an IRQ - IF will be clear (because hopefully you used an interrupt gate)
<zid> so in practice this is just "if two things happen simultaneously, which ends up happening first?"
<zid> it doesn't change how any of it works, your IRQs won't magically start nesting
<zid> unless they were already nestable to begin with
<maxdev> is IF also cleared if it's a software interrupt going through an interrupt gate?
<zid> I don't believe so? manual will know
<Mutabah> maxdev: It's just the gate type iirc
<maxdev> interrupts confuse me a lot lol
<maxdev> I tried making my kernel reentrant at some point and it's not very easy
<zid> interrupt gates seem to clear IF
<maxdev> but what is the EOI then good for?
<zid> eoi is for talking to the *pic* to acknowledge you've finished the IRQ it sent you
<zid> and it should send you another
<zid> it's nothing to do with the cpu's interrupts
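To make zid's point concrete, here is a minimal sketch of an EOI for the legacy 8259 PIC pair. The port numbers and the 0x20 command are the standard 8259 values; the `outb_fn` hook and the logging stand-in are hypothetical, added only so the logic can be exercised outside a kernel. On a local APIC the equivalent step is a write of 0 to the EOI register at offset 0xB0, which this sketch does not cover.

```c
#include <stdint.h>

#define PIC1_CMD 0x20   /* primary 8259 command port */
#define PIC2_CMD 0xA0   /* secondary 8259 command port */
#define PIC_EOI  0x20   /* non-specific end-of-interrupt command */

/* Hypothetical port-write hook; a kernel would wrap the `outb` instruction. */
typedef void (*outb_fn)(uint16_t port, uint8_t value);

/* Acknowledge `irq` (0-15) so the PIC may deliver further interrupts.
   This talks to the interrupt controller only; it does not touch IF. */
static void pic_send_eoi(unsigned irq, outb_fn outb)
{
    if (irq >= 8)
        outb(PIC2_CMD, PIC_EOI);  /* IRQs 8-15 arrive via the secondary */
    outb(PIC1_CMD, PIC_EOI);      /* the primary always needs one */
}

/* Stand-in for real port I/O: records which ports were written. */
static uint16_t log_ports[4];
static int log_len;

static void log_outb(uint16_t port, uint8_t value)
{
    (void)value;
    if (log_len < 4)
        log_ports[log_len++] = port;
}
```

Whether another interrupt can actually preempt the handler is a separate question from the EOI: that depends on IF, which an interrupt gate clears on entry.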
<maxdev> ah i think i get it
archenoth has joined #osdev
kof123 has quit [Ping timeout: 268 seconds]
<ddevault> I am confuse
<ddevault> why would rdmsr on fsbase cause a GP fault
<ddevault> only on real hardware
<zid> does your real hw support it?
<zid> I don't have fsgsbase on my cpu
<ddevault> isn't it always available for x86_64?
<zid> some is, but there's an extension
<zid> that adds some stuff
<ddevault> the rdfsbase/wrfsbase instructions
<ddevault> not using those, using the MSRs
<zid> k
<ddevault> wrmsr works fine
<ddevault> but not rdmsr
<ddevault> looks like linux reads the MSR just fine
<ddevault> I am confuse
kof123 has joined #osdev
<ddevault> rdfsbase works fine, though
<ddevault> I guess I'll just bite the bullet and do the cpuid check for it
Ali_A has quit [Quit: Client closed]
nyah has joined #osdev
<ddevault> is there any good database of cpuids which lets me identify CPUs which have a given feature?
eroux has quit [Ping timeout: 268 seconds]
<ddevault> if you have access to an older x86_64 computer running Linux, please run this command and tell me if it prints "no fsgsbase"
<ddevault> lscpu | grep fsgsbase >/dev/null && echo fsgsbase || echo "no fsgsbase"
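The same check can be done from code rather than from `lscpu`. A minimal sketch, assuming a GCC- or Clang-style `<cpuid.h>`: the FSGSBASE feature flag is CPUID leaf 7, subleaf 0, EBX bit 0. The helper names are hypothetical, and the non-x86 fallback exists only so the file builds elsewhere.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.(EAX=7,ECX=0):EBX bit 0 advertises RDFSBASE/WRFSBASE and friends. */
#define CPUID7_EBX_FSGSBASE (1u << 0)

/* Pure bit test, split out so it can be checked without running CPUID. */
static bool ebx_has_fsgsbase(uint32_t leaf7_ebx)
{
    return (leaf7_ebx & CPUID7_EBX_FSGSBASE) != 0;
}

static bool cpu_has_fsgsbase(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;   /* leaf 7 not implemented on this CPU */
    return ebx_has_fsgsbase(ebx);
#else
    return false;       /* not an x86 CPU */
#endif
}
```

The CPUID bit only says the instructions exist. They still fault until the kernel sets CR4 bit 16, which is the bit demindiro asks about further down.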
demindiro has joined #osdev
<zid> I mean, I know I don't have it
<demindiro> ddevault: are you also setting bit 16 in CR4? I did a quick grep through your code but only see bit 9, 10 and 17 being set
<demindiro> Or ah, that's only for wrgsbase etc, nvm
<ddevault> yeah I added that
<ddevault> rdfsbase works fine for me
<ddevault> but reading the MSR does not work on real hardware, GP faults
eroux has joined #osdev
<demindiro> You're trying it from kernel space right?
<ddevault> aye
<bslsk05> ​gist.github.com: lscpu.txt · GitHub
<zid> I usually just cat /proc/cpuinfo
<ddevault> erp
<zid> Legit never used it before
janemba has quit [Ping timeout: 252 seconds]
<ddevault> zid: can you test this ISO out on your hardware? https://mirror.drewdevault.com/boot.iso
<demindiro> My guess is %ecx somehow having garbage as value, no idea what else it could be.
<ddevault> demindiro: ecx looks fine in the panic dump of registers
<ddevault> working would be the classic "thread 1" "thread 2" logging
<ddevault> not working would be GP fault
<demindiro> Does RDMSR work in e.g. the bootloader?
<ddevault> depends on what mode it's in, I bet
<zid> I mean, under virt it seems to work?
<zid> it goes 'hello from helios' for like 10 seconds, then spams thread1 / thread2
<ddevault> right, I know it works fine on virt
<zid> I'm not rebooting my desktop and buying a cd-rom drive
<ddevault> but it doesn't work on my hardware
<ddevault> can boot from USB
<zid> or a usb drive
* ddevault shrugs
<demindiro> I can try
<demindiro> 1 min
<ddevault> demindiro: run lscpu first and look for fsgsbase
<zid> Clearly what I need is.. another desktop
<demindiro> fsgsbase works on my laptop
<ddevault> cool, curious to know if the iso works as well
<demindiro> kernel page fault at non-present address 0x9 @ 0xffffffffc003a839
<ddevault> uh, that's interesting
<ddevault> I think there might be a race condition somewhere, does it happen again if you reboot?
<ddevault> err wtf is that %rip
<demindiro> Yep
<ddevault> can you send me a picture of the panic screen?
<demindiro> I'll try another laptop too
d5k has joined #osdev
<demindiro> Yes, if I can get K9 mail to work and figure out where Android stores camera pictures
epony has quit [Remote host closed the connection]
<demindiro> Ditto on other laptop btw
<demindiro> Anyways, I hate android 10, phone froze again
<ddevault> here's another one to test out when you have a sec: https://mirror.drewdevault.com/rdfsbase.iso
janemba has joined #osdev
<demindiro> That ISO works
<demindiro> Loading etc, seems to halt at /hello
<ddevault> halt at /hello is not a working result, I'm afraid
<ddevault> should spam thread 1/thread 2 over and over
<demindiro> So on one laptop it appears to halt, on the other there's thread 1/2 spam
<ddevault> which laptop does it halt on? what model?
<demindiro> Samsung 520U something IIRC
<ddevault> curious to see lscpu output on that
<demindiro> Samsung NP530U3C
<zid> sounds like a centrino N-348934893 cpu
bauen1 has quit [Ping timeout: 265 seconds]
bauen1 has joined #osdev
<demindiro> So uhm
<demindiro> Apparently that samsung laptop does not have fsgsbase
<demindiro> And I forgot that I emulate it if not present
<ddevault> that's what I wanted to know, thanks
<demindiro> rdmsr stuff works though
<ddevault> nice
<ddevault> I'm surprised that the MSRs don't work consistently
lkurusa has joined #osdev
Ali_A has joined #osdev
Ali_A has quit [Client Quit]
<maxdev> forgot to say thank you for your help earlier, Mutabah and zid
<zid> I accept hard cash and sexual favours
eroux has quit [Remote host closed the connection]
eroux has joined #osdev
eroux has quit [Remote host closed the connection]
<maxdev> xD
<sham1> What about cash favours
<maxdev> what about hard sexual?
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
tarel2 has quit [Ping timeout: 252 seconds]
heat has joined #osdev
<heat> hello
<heat> help i dont get the irq!!!!!!!!!
<zid> I can email you one?
<heat> yes
<heat> irq 10 pls
<zid> okay, IRQ10 is send $20 paypal to zid
<zid> I'll deassert it once it shows up
<heat> what
<heat> but iz free no?
<zid> I'm reverse engineering rollercoaster tycoon cus.. someone asked me to
<zid> would you like to know what the incredible save encryption is
<heat> yes
<zid> add eax, 0x39393939; rol eax, 5;
<zid> You're welcome
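For reference, the two quoted instructions translate to C as below, together with the inverse. This is only a sketch of that one 32-bit step: which direction the game treats as "encrypt" and how it walks the save data are not stated in the log, so the function names are hypothetical and direction-neutral.

```c
#include <stdint.h>

/* Rotate helpers written so that a count of 0 is well defined. */
static inline uint32_t rol32(uint32_t v, unsigned n)
{
    n &= 31;
    return (v << n) | (v >> ((32 - n) & 31));
}

static inline uint32_t ror32(uint32_t v, unsigned n)
{
    n &= 31;
    return (v >> n) | (v << ((32 - n) & 31));
}

/* The quoted sequence: add eax, 0x39393939; rol eax, 5 */
static inline uint32_t scramble(uint32_t v)
{
    return rol32(v + 0x39393939u, 5);
}

/* Inverse: undo the rotate first, then the add. */
static inline uint32_t unscramble(uint32_t v)
{
    return ror32(v, 5) - 0x39393939u;
}
```

Both steps are bijections on 32-bit values, so `unscramble(scramble(x)) == x` for every `x`.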
<heat> sgtm
<zid> suck goats, tiny munchkin
<mjg> oh so you gonna cheat in game?
<mjg> add moneyz to save
<zid> no he wants to multibox it but there's two issues
<zid> one is that there's a single save blob that each copy overwrites in full
<zid> and the second is that it pops up a "this is already running" box
<zid> I fixed the latter already
<zid> the former I was just going to write a thing that merged save files
<mjg> why not get a vm
<mjg> and bounce it
<zid> 'bounce it'?
<mjg> between machines
<zid> pardon?
<heat> shut up
<mjg> there is technology(tm) to migrate vms along with the storage between different physical machines
<heat> go back to path walking flamegraph man
<zid> okay great
<zid> but how does it solve the problem
<zid> you end up with n saves, save 0 has beaten level 0, save 1 has beaten level 1, save 2 has beaten level 2
<zid> and you need a save with level 0 + 1 + 2 beaten so that level 4 unlocks
<mjg> maybe i misunderstood what the goal is
<zid> and the game doesn't partially increment the save file
<zid> it fully overwrites it all
<mjg> oooh wait, you mean the person wants to play *once* but generate several different instances of save files?
<zid> no
<mjg> as if they played the game n times?
<zid> they're playing 3 copies of the game at the same time
<zid> so they can speedrun unlocking level 4
<zid> but it won't unlock unless *one* copy beat level 1 2 and 3
<mjg> that's what i said
<zid> what did you say?
<heat> i think you should all shut the fuck up
<zid> I think heat is angry his IRQ is wedged
<heat> i am
<heat> wanna review it? yes? wonderful
<froggey> how do you multibox this if each level only gets unlocked if you've got the previous level beaten?
<zid> it's in sets, frog
<froggey> ahh
<zid> xyz start unlocked, beating each unlocks tuv
<zid> so you can triple-box xyz
<zid> then triple-box tuv
<zid> but if you *play* xyz, you just get a save game with ..z or .y. or .x. completed depending on which one saved last
<zid> err x.. :P
<zid> so I was considering writing a tool that could merge save game progress. But I think I've talked him into just starting on a save-file with every level unlocked, and just not playing tuv until xyz are beaten
<zid> scout's honor rather than "massive amounts of technical adulteration to the game files"
<zid> oh is that what the blips I couldn't follow were, the threadlocal x prints
<zid> I just noticed the linespacing changed sometimes
<bslsk05> ​git.sr.ht: ~sircmpwn/helios: vulcan/cmd/init/main.ha - sourcehut git
<bslsk05> ​github.com: Riscv work by heatd · Pull Request #45 · heatd/Onyx · GitHub
<heat> if you find a bug I'll suck your dick be very happy
<ddevault> bug: it's written in C++
<heat> WONTFIX
<heat> nexet
<heat> next
gildasio has quit [Quit: WeeChat 3.6]
puck has quit [Excess Flood]
puck has joined #osdev
gildasio has joined #osdev
gildasio has quit [Read error: Connection reset by peer]
<heat> i am confusion
<bslsk05> ​grok.dragonflybsd.org: plic.c (revision 2663ef1b) - OpenGrok cross reference for /freebsd/sys/riscv/riscv/plic.c
<maxdev> o
<maxdev> i'm calling a userspace function from my kernel, so that my driver can handle the IRQ.. is there anything special happening when using a CALL instruction on x86 and calling code in a page that's mapped as a user page? shouldn't matter I think?
<mjg> what
<mjg> it is definitely not going to work if you enable smep
<mjg> but that aside, you do realize you are still in the kernel for all practical purposes
<maxdev> yes
<mjg> so why is the func mapped there?
<maxdev> it isn't, i switch to the driver task and call it
<maxdev> so well it is at that moment
gildasio has joined #osdev
<heat> you should find a way that doesn't involve calling a random userspace function in ring0
<heat> else your pretty microkernel design will come crashing down
<maxdev> :D i had a different solution, but i found it quite messy
<heat> *this* is the less messy one? oh my!
<maxdev> well, what's a good solution? my "messy" solution is having an "interrupted state" on the task, then preparing the task to run in the handler, then switch to the task and once it's done switch back
<heat> that's the least messy solution you'll be able to find
<heat> otoh, it will be a bit slow
<maxdev> hmm it just doesn't satisfy me that i have to mess with the stack and stuff
xenos1984 has quit [Read error: Connection reset by peer]
d5k has quit [Ping timeout: 264 seconds]
<heat> you know threads need to block right?
LostFrog has quit [Read error: Connection reset by peer]
<maxdev> block in what sense?
<heat> in the sense of blocking, suspending
PapaFrog has joined #osdev
<heat> the cleanest solution from a typical perspective would be something like: "main() { int fd = open("/dev/irq/10", O_RDONLY); struct irq_data data; while (ioctl(fd, IRQFD_GET_IRQ, &data) >= 0) { do_stuff(&data); ioctl(fd, IRQFD_ACK); } }"
<heat> now, is it fast? no, you need to take two trips to the kernel
<maxdev> i have blocking/suspending logic etc
<maxdev> i wanted to make it faster with this interruption method, also i need it for signals
<maxdev> that clean solution would basically be polling
<heat> >signals
<heat> stop
<heat> mjg: !signalrant
<maxdev> :D
<heat> something something everyone was on LSD back then
heat has quit [Remote host closed the connection]
heat has joined #osdev
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
xenos1984 has joined #osdev
<maxdev> i really lHoAvTeE osdeving
heat_ has joined #osdev
heat has quit [Read error: Connection reset by peer]
<ddevault> maxdev: the way my microkernel handles IRQs is
<ddevault> we have an IPC primitive called signals (based on seL4), which a task can block on, and which can be written to ("signalled") asynchronously, unblocking any waiting tasks
<ddevault> then there's an IRQ object which userspace can invoke to EOI after waking up on the signal
<ddevault> note: signal in this context is not the same thing as Unix signals
dude12312414 has joined #osdev
nvmd has quit [Quit: WeeChat 3.6]
<mrvn> ddevault: my kernel has mailboxes for IPC. If you want to deal with an IRQ you get a mailbox from the kernel and you send a message to it to EOI. When the IRQ happens the message comes back.
<ddevault> yeah that also works fine
<mrvn> And you can decide if you want to send the message async or wait for a reply.
<mrvn> e.g. the NIC might fetch the incoming frame, fire off the EOI message and then process the data before going back to sleep.
<mrvn> How do you prevent the signal being sent while you are still processing the last signal?
<ddevault> as soon as the thread is unblocked the signal can be delivered again, if that happens while it's still processing then the next call to wait() does not block
<ddevault> also depends on when the interrupt is EOI'd
<mrvn> so if it doesn't wait the signal does nothing till you wait?
<ddevault> aye
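The latching behaviour ddevault describes can be modelled in a few lines. This is a single-threaded toy model under stated assumptions, not Helios or seL4 code: one waiter, one pending bit, and hypothetical names throughout. Real kernels would add a wait queue and locking.

```c
#include <stdbool.h>

struct notification {
    bool pending;   /* signalled while nobody was waiting */
    bool waiting;   /* a task is currently blocked in wait() */
};

/* Returns true if a blocked waiter was woken, false if the signal was latched. */
static bool notif_signal(struct notification *n)
{
    if (n->waiting) {
        n->waiting = false;  /* hand the signal straight to the waiter */
        return true;
    }
    n->pending = true;       /* remember it for the next wait() */
    return false;
}

/* Returns true if the caller would have to block, false if it can continue. */
static bool notif_wait(struct notification *n)
{
    if (n->pending) {
        n->pending = false;  /* consume the latched signal */
        return false;
    }
    n->waiting = true;
    return true;
}
```

A signal that arrives while the driver is still processing the previous one is not lost and does not interrupt anything; it just makes the next `notif_wait()` return immediately. Repeated signals in that window collapse into one.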
<ddevault> you could also have several threads/processes waiting on the same signal, if you want to pool CPU time
<mrvn> can you pin an IRQ to a core?
<mrvn> or cpu set
<ddevault> tbd, I don't have SMP yet so
<mrvn> ahh, the old days, when things were simple
demindiro has quit [Ping timeout: 252 seconds]
puck has quit [Excess Flood]
puck has joined #osdev
epony has joined #osdev
dude12312414 has quit [Remote host closed the connection]
demindiro has joined #osdev
dude12312414 has joined #osdev
bauen1 has quit [Ping timeout: 268 seconds]
bauen1 has joined #osdev
xenos1984 has quit [Ping timeout: 260 seconds]
maxdev has quit [Remote host closed the connection]
xenos1984 has joined #osdev
antranigv has quit [Quit: ZNC 1.8.2 - https://znc.in]
antranigv has joined #osdev
dude12312414 has quit [Ping timeout: 258 seconds]
<mjg> heat_: i just got a prog at $dayjob which takes a pthread mutex from a signal handler
<mjg> heat_: you know what's funny about it?
dude12312414 has joined #osdev
sympt has joined #osdev
demindiro has quit [Ping timeout: 252 seconds]
h4zel has joined #osdev
xenos1984 has quit [Ping timeout: 250 seconds]
poyking16 has joined #osdev
xenos1984 has joined #osdev
<geist> ddevault: that rdmsr of fsbase doesn't make sense. that should *always* work
<ddevault> I know
<geist> the fsgsbase stuff is just some new feature to do it via an instruction (and faster)
<ddevault> I have no clue why it isn't
<geist> so i'd super double check if you have the right number loaded in the right register
<geist> that has to be it, there's no other explanation
d5k has joined #osdev
<ddevault> I checked it several times, cross-referenced from two sources, and checked the constant against the intel and AMD manuals and the osdev wiki
<geist> are you using inline assembly? what does it look like?
<geist> did you look at the disassembly to make sure everythings in the right register?
<ddevault> not inline, but yes assembly
<geist> can you pastebin it?
gog has joined #osdev
<ddevault> I checked the panic dump of the CPU state, has the appropriate values
<gog> meow
<bslsk05> ​git.sr.ht: ~sircmpwn/helios: arch/+x86_64/asm.s - sourcehut git
<ddevault> this is called with the system-v ABI
<geist> what is passed in what arg?
<ddevault> 0xC0000100 in the first parameter
<ddevault> no other parameters
<geist> so do you see the bug then?
<ddevault> ...err
<ddevault> yeah I do
<geist> hehe
<ddevault> ty
<geist> good ol duck debugging saves the day!
<j`ey> is
<geist> anyway yeah fsgsbase is just an optimization if you're using it in the kernel, and it's kinda annoying since it can't get to the alternate gs one.
<j`ey> delete 'mov %eax, %ecx'? i dunno x86 asm
pretty_dumm_guy has joined #osdev
<ddevault> yeah I grok fsgsbase
<ddevault> only started looking at it because the msr had weird issues
<geist> *however* i do remember that it is a lot faster than the MSRs, so it's annoyingly fast enough to probably need to use in the context switch routines
<ddevault> will move the code back to the msr and add fsgsbase gated behind cpuid later
<geist> (obviously optimization, etc)
<geist> yah
<j`ey> what's the fix?
<ddevault> %edi, not %eax
<geist> j`ey: edi -> ecx
<geist> the first arg is passed in edi
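The bug class here (the MSR index sitting in the ABI's argument register instead of ECX) is one that inline assembly constraints can rule out. A sketch in C with GCC/Clang extended asm, not the Helios code, which is hand-written assembly called from Hare: the `"c"` constraint forces the index into ECX regardless of where the calling convention put it. `msr_combine` is a hypothetical helper split out so the bit-joining can be checked in user space.

```c
#include <stdint.h>

#define MSR_FS_BASE 0xC0000100u   /* the index quoted in the log */

/* RDMSR returns the low half in EAX and the high half in EDX. */
static inline uint64_t msr_combine(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__x86_64__) || defined(__i386__)
/* Ring 0 only: executing RDMSR from user mode raises #GP. */
static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    /* "c" places `msr` in ECX, whatever register the ABI passed it in. */
    __asm__ volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return msr_combine(lo, hi);
}
#endif
```

In kernel code the call would be `rdmsr(MSR_FS_BASE)`. With a garbage value in ECX, RDMSR of a reserved or unimplemented MSR raises #GP, which fits the fault seen on real hardware.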
<j`ey> ah
<ddevault> dealing with fsbase/gsbase in a context switch is really fucking annoying
<ddevault> 0/10
<ddevault> should have had the instruction from the start
<geist> ddevault: yeah tell me about it, you need at least a test and if statement
<geist> yah but then you also have to do a swapgs, access, swapgs back too
<geist> (to get to the gs one)
<geist> and if you get an NMI in that spot. ugh.
<geist> if it weren't so much faster than the MSRs i'd just forget the whole thing
<j`ey> geist: do you do patching in fuchsia?
<geist> yup
<geist> if you mean some routines are patched in the kernel, yes
<j`ey> yeah
<geist> in this particular case (using fsgs or not) it's not really worth patching per se, it's more that the ergonomics for using it for context switch are annoying
<geist> mostly because of the swapgs thing
<geist> these instructions are primarily for user space code
<geist> also and i already forgot the specific details, there's a little bit of weirdness with loading 0 or not 0 into the fs and gs registers on AMD and intel
<geist> specifically, iirc one of them will reset the fs/gs base to 0 when you do that in some conditions, and the other wont
<geist> so there's a little bit of an order of operation if you're also resetting the ds/es/fs/gs registers back to 0 in a context switch
<geist> it intersects with when you actually save/restore the fsgsbase MSRs
* geist pets gog
* gog prr
<gog> i finished my work thing and my boss said it was ok for me to grab some smaller jobs for the rest of the week :D
<gog> closed two today
<geist> noice!
<gog> started on a third but i put in a lot of extra time over the last week and a half so i went home on time for once :P
<gog> i really love my job
<mjg> bro
<mjg> oh right, you are new at the job
<mjg> here is a general pro tip: hard work results in more hard work with *no* payoff
<mjg> don't be the guy always taking the shift cause why not
<mjg> people suck and if you let them treat you a certain way, they will think less of you for it (and consequently look for more ways to do it, rinse & repeat)
<mjg> soft skillz for tech people 101
<zid> My strategy is to do 0 work, I've taken it to the extreme
<mjg> you undo work?
<zid> I technically undid some work today I guess?
<zid> I stopped a program doing a thing
<heat_> mjg, i dunno about that pthread_mutex thing
<heat_> if I were to guess, I would say pthread_mutex_lock and co. aren't required to be AS-safe in POSIX
<heat_> but I can also see why this would be a bad guess
<mjg> what if you get the signal while already holding the lock
<mjg> which the code at hand makes no provision to prevent
<heat_> trylock
<mjg> best part is that there should be no threading in this prog to begin with
<mjg> someone just did not know how to write an event loop
<heat_> that's great
<mjg> and then what, you pretend there was no signal? :>
<heat_> btw I thought you worked at the freebsd foundation?
heat_ is now known as heat
<mjg> used to, can't be fucked to update my linkedin profile
<mjg> why are you stalking me
<heat> i didn't stalk you
<zid> I'll stalk you if you want
<mjg> zid: appreciated
<heat> it's just that you work on freebsd and are a committer so it's highly likely you're at the fbsd foundation
* zid googles mateusz guzik
<mjg> heat: well i work on freebsd at the current workplace as well
<zid> no no heat, works on freebsd, has been commited*
<heat> "Kernel development with focus on scalability issues."
<heat> replace that with "haha path go open(2)"
<zid> Kernel development with a focus on memoyr odreign
<zid> that's the alpha branch
<heat> what
<heat> did you have a stroke
<zid> yes, because I ran my code on an alpha
<mjg> chill person in person <-- that's me
<zid> That's because eastern europeans all know the only possible trajectory for their life is upwards, the opposite of the UK and US
<heat> is it?
<heat> you can get invaded
<heat> or shot
<heat> or both
<mjg> you can't get cancelled
<heat> or accidentally mass genocided in ex-yugoslavia
<zid> heat: yea but now tlk about places that aren't the US
<mjg> besides i take offense. i'm from central europe!
<heat> i'm not getting invaded nor shot nor mass genocided
<zid> poland is east sorry, you have an accent that sounds like russian to the untrained
<zid> that makes you eastern european
<mjg> that's just your lack of education!!
<zid> I don't make the rules
<heat> poland = weird letters = eastern europe
<zid> I just enforce them mercilessly
<mats1> is this slav on slav violence?
<mjg> seriously though, there is a common perception that anything east of germany is progressively more backwards
<zid> he says he isn't a slav
<heat> "how many consonants can you fit in a weird"
<heat> s/weird/word/
<heat> although weird is also a fitting description of polish words
<mjg> as in even poles will think less of people east of them
<zid> z with dirt on it is a vowel heat, deal with it
<mjg> zid: bro there are 2 variants!
<mjg> ź and ż
<zid> Pretend I just linked that sketch of the polish farmer giving his name to the german with the typewriter
<mjg> however, if you knew the fucked czech/slovak alphabet, you would consider poland to be very western
<zid> I'm too lazy to find it
<mjg> ye i know
<heat> you mean check
<mjg> brzęczyszcykiewicz
<heat> from the check republic
<zid> yea him
<mjg> classic
<heat> or if you're european, cheque
<zid> if your name ends in -vich, you're eastern european
<mjg> tbf his name sounds like shit even to poles though
<mjg> typical polish names are nowhere near as f-worded
<geist> does the word have not enough vowels? check!
<zid> we can give him some of germany's, as reparations
<zid> they have a lot of trailing e they can give up
<heat> poland bad germany bad france bad god save the queen pound
<zid> heat talking sense for once
<mjg> afghanistan under taliban rule is the place to be
d5k has quit [Quit: leaving]
<heat> "<mjg> afghanistan under taliban rule is the place to be"
<heat> this is a fucking quote
<heat> someone make this the topic
<geist> anyway, speaking of topics
<zid> right, back on topic guys, freebsd is shit
<mjg> i hear onyx does not scale
<heat> freebsd shit freebsd foundation shit praise mjg's new employer
<heat> WHAT
<heat> can't hear you
<mjg> heat: it's been over a year now
<heat> anyway yeah I actually had a real question
<mjg> what
<mjg> can't hear
h4zel has quit [Ping timeout: 265 seconds]
<heat> does freebsd still have slab's caching for the initialization/destruction or is that bs gone like in linux
<mjg> yes :(
<mjg> it has
<mjg> i have work in progress to whack it
<geist> idea is that sort of shenanigans is no longer useful on modern hardware?
<heat> why is it bad
<mjg> there is one nasty consumer which makes it problematic
<mjg> well let's make sure we are on the same page here
<geist> on paper it seems like a good idea
<mjg> there is initialization at *allocation time* as in just before returning to the caller
<mjg> and there is some code to run when you import pages backing the slab, creating objects to begin with
<geist> right, such that it has magazines of partially initialized things ready to go
<mjg> freebsd supports both and the latter is fine, while the former is a bad idea
<mjg> and i intend to remove it
<geist> which to remove?
<mjg> init just prior to return
GeDaMo has quit [Quit: Physics -> Chemistry -> Biology -> Intelligence -> ???]
<mjg> i reviewed several consumers and i did not find any which benefits from it
<geist> oh sure. okay. that makes sense. i thought initially you were saying the 'partially initialized thing in the cache is bad' and i was wondering if that's one of those optimizations that end up not being worth it in the end
<mjg> and the loss is from an indirect function call
<geist> yah and i guess the consumers are still free to initialize/construct the object once you've handed it to them
<mjg> geist: it may or may not be worth it, the problem is that historical papers about it were doing stupid stuff to show the win
<geist> ie, here's 48 bytes of something
<mjg> right
<mjg> it gets largely overwritten most of the time anyway
elderK has joined #osdev
<mjg> and you still pay for branching on the existence of the init func, you waste space in the cache line to store it etc.
<geist> yah okay, makes sense. and yeah i was thinking the everything just overwrites it anyway part with modern software practices may be the reason the partially initialized stuff isn't as useful anymore
<mjg> it's all around a loss
<geist> i'd tend to think with modern machines the idea that you blat out something on the object and then immediately use it means the cache is hot, etc and may be a generally okay pattern
<geist> up to some point of course
<mjg> well let me tell you what prompted the existence of constructors on page import
<mjg> in solaris
<mjg> and it is pretty bad
<mjg> (sorry :))
<geist> yah i remember reading the same slab paper
<geist> that's kinda what i'm going on in my mind
<heat> i was reading the slab paper on my uber today
<mjg> in solaris they have funcs like mutex_init, cv_init which are weirdly expensive
<geist> seems that the upside there is you can do all the initialization while not holding any locks
<geist> whereas alloc_slab -> construct is probably holding some consumer's lock
<mjg> in linux you would literally store a few bytes, done
<mjg> in solaris they have func calls
<zid> "which is weirdly expensive" sums up solaris as a whole doesn't it
<mjg> so ye, if you shave numerous func calls because an obj has 5 mutexes and 3 condvars
<mjg> you can measure a win
<geist> different era. you generally didn't inline the fuck out of things like you do now
<geist> also because you probably had like 4KB cache, *total*
<heat> also a function call in linux
<bslsk05> ​elixir.bootlin.com: mutex.c - kernel/locking/mutex.c - Linux source code (v5.19.12) - Bootlin
<geist> anyway, got it
<mjg> geist: if this was just few stores, it would *likely* still be faster to just have them in one central func allocating stuff
<mjg> note the store is probably shorter than generating a call + recovery from it
<geist> probably. also unclear if compilers at the time even had inlining as a feature. you'd have to macro it probably
<mjg> i'm not saying everyone allocating has to inline this at their own callsite
<mjg> i'm saying: foo_alloc would contain a call to slab + do the init
<geist> anyway, i just remember from say BeOS kernel, which did none of this inlining either
<mjg> that said, there may be legitimately expensive things to init
<mjg> in which case the idea is great
<geist> yah. i think the idea though was if you freed an object it could keep a partially initialized version though
<geist> and thus as it got recycled it wouldn't need to be re-allocated
<geist> er re-initialized
<geist> so it's less of any sort of win on a 'clean' page added to the slab that then gets used, and more of an optimization for the recycle path, right?
<mjg> i'm not disputing it's ok in the worst case even for this mutex stuff
<heat> oh yeah also, is the cache coloring still similar to the original paper? or did you get more fancy?
<mjg> i'm saying the ordeal was prompted from solaris inefficiency in the area
<geist> sure.
<geist> yeah i'm actually looking at this from a 'one of these days we'll do a slab in zircon too and have been vaguely thinking about these sort of ideas'
<heat> I don't fully understand how the coloring works and I don't know how their bus example maps to modern x86/arm64 hardware
<geist> heat: i'm not entirely sure the cache coloring matters much anymore
<geist> i think it matters a lot more on direct mapped and/or very low associativity caches
<geist> which would have been the norm back then
<mjg> geist: for example in freebsd vnodes get added to a global lru list on alloc
<mjg> geist: as in when they are created. it would be prohibitively expensive on actual alloc/free
<geist> mjg: yah i think vnodes are the canonical example of these things
<mjg> i don't know if they are canonical, freebsd happened to not use this stuff prior to 5-ish years ago fwiw :P
<mjg> ... for vnodes
* geist nods
<geist> more of the canonical ideal object
<geist> ie, the one you would use if you wanted to describe it to a classroom of osdevers
<heat> maybe they're not canonical but red hat
<heat> ba dum tss
h4zel has joined #osdev
<mjg> anyway ye, constructor on page import == defo. an optional init function on alloc == please no kthx
<mjg> :)
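The split mjg describes can be sketched roughly as below. This is a toy, not FreeBSD's or Solaris's allocator: `ToyCache`, `Obj`, `obj_ctor`, `obj_dtor` and the call counters are all made-up names for illustration. The constructor runs once per object when a slab is imported, a freed object stays constructed, and the allocation path has no init hook at all.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical object: lock_state stands in for state that is costly to set
// up and can survive a free/alloc cycle; payload is per-use data.
struct Obj {
    int lock_state = -1;
    int payload = 0;
};

// Counters exist only to make the behaviour observable in this sketch.
static int ctor_calls = 0;
static int dtor_calls = 0;

static void obj_ctor(Obj &o) { o.lock_state = 0; ++ctor_calls; }
static void obj_dtor(Obj &o) { o.lock_state = -1; ++dtor_calls; }

// Toy cache; assumes objs_per_slab > 0.
class ToyCache {
public:
    explicit ToyCache(std::size_t objs_per_slab) : per_slab_(objs_per_slab) {}

    // Tearing down the cache releases every slab, so the dtor runs here,
    // once per object, mirroring the ctor on import.
    ~ToyCache() {
        for (auto &slab : slabs_)
            for (std::size_t i = 0; i < per_slab_; ++i)
                obj_dtor(slab[i]);
    }

    Obj *alloc() {
        if (free_.empty())
            import_slab();
        Obj *o = free_.back();
        free_.pop_back();
        return o;  // no per-allocation init hook on this path
    }

    // A freed object stays constructed and just goes back on the free list.
    void release(Obj *o) { free_.push_back(o); }

private:
    void import_slab() {
        slabs_.push_back(std::make_unique<Obj[]>(per_slab_));
        Obj *objs = slabs_.back().get();
        for (std::size_t i = 0; i < per_slab_; ++i) {
            obj_ctor(objs[i]);  // constructor runs once, at import time
            free_.push_back(&objs[i]);
        }
    }

    std::size_t per_slab_;
    std::vector<std::unique_ptr<Obj[]>> slabs_;
    std::vector<Obj *> free_;
};
```

A caller that needs per-use setup would do it in its own `foo_alloc`-style wrapper after `alloc()` returns, which is the pattern mjg suggests instead of an init callback inside the allocator.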
<geist> since you have a fairly large and complicated structure that gets recycled a lot
<mjg> ye agreed
<geist> but i think that also ignores the recycle path
<heat> do you still have dtors?
<mjg> heat: again what's a dtor in this context
<geist> ie, free_but_keep_alive(vnode);
<geist> now it goes in the lru with a ref count of 0
<heat> mjg, as in the dtor you pass to kmem_cache_create
<geist> if asked for again it pops it out, otherwise the slab grabs it, runs dealloc() and then returns it as a new one
<heat> that gets called when a slab goes bye-bye
<mjg> you mean when you "whatever_free" or when you destroy the page
<mjg> so ye, dtors exist and make perfect sense if you have a ctor
<heat> hrm
<mjg> i unlink vnodes from lru in a dtor
<heat> linux... does not have dtors
<mjg> no? that's weird
<heat> they have ctors, but not dtors
<geist> yah i thought a big part of the win is you can do the delayed dtor thing, and since the slab has a global view of what memory is in the system
<geist> you can allow the slab's global view to decide how to allocate memory between them
<geist> and it can globally trim slabs based on memory pressure by removing unused-but-not-dtored objects
<bslsk05> ​lwn.net: Slab allocators: Drop support for destructors [LWN.net]
<mjg> ye i like the general idea, but it tends to not be really utilized like it should
<geist> possible that that sounds good on paper, but something more bespoke is ultimately more flexible
<geist> yeah
<geist> like some sort of memory pressure notification to slab users: "hey trim your shit yo"
<geist> and then have the users do their own LRU
<heat> "Taking a spinlock in a destructor is a bit risky since the slab allocators may run the destructors anytime they decide a slab is no longer needed"
<mjg> one little tidbit you may want consider re dtors/ctors is batching
<mjg> the freebsd api, and likely solaris, just accepts 1 obj at a time
<geist> yah i'd also think that a slab has a better view of page utilization
<mjg> which is highly pessimal for usecases like mine
<geist> so it could decide to free these 8 objects because that makes the page free
<mjg> 8 vnodes fit in a page, so i get 8 lock/unlock cycles on each import
<mjg> it is rare enough to not be a big deal, but it is crap for no good reason
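The cost mjg is pointing at can be made concrete with a small sketch. Nothing here is the FreeBSD API: `ToyLock`, `GlobalLru`, `ctor_one` and `ctor_batch` are hypothetical names, and the lock only counts acquisitions. With a one-object-at-a-time constructor, importing a page of 8 vnodes takes the global LRU lock 8 times; a batched constructor, if the API allowed one, would take it once.

```cpp
#include <cstddef>
#include <mutex>

// Stand-in for a real lock: it only counts how often it is taken.
struct ToyLock {
    int acquisitions = 0;
    void lock() { ++acquisitions; }
    void unlock() {}
};

struct Vnode {
    Vnode *lru_next = nullptr;
};

struct GlobalLru {
    ToyLock lock;
    Vnode *head = nullptr;
    std::size_t length = 0;

    // Caller must hold `lock`.
    void push_locked(Vnode &v) {
        v.lru_next = head;
        head = &v;
        ++length;
    }
};

// Per-object constructor: one lock/unlock cycle for every object.
void ctor_one(GlobalLru &lru, Vnode &v) {
    std::lock_guard<ToyLock> guard(lru.lock);
    lru.push_locked(v);
}

// Hypothetical batched constructor: one lock/unlock cycle per imported page.
void ctor_batch(GlobalLru &lru, Vnode *objs, std::size_t count) {
    std::lock_guard<ToyLock> guard(lru.lock);
    for (std::size_t i = 0; i < count; ++i)
        lru.push_locked(objs[i]);
}
```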
<geist> well anyway, good stuff to think about
<geist> i think we're mostly stuck with one design in zircon, because C++ anyway
<geist> though you can placement new, the language itself is somewhat stricter about lifecycle of objects
<heat> also "Well, constructors are on their way out too because they don't seem to give the performance benefit they were designed for anymore" in 2007
<geist> you can't be as loosy goosey about halfway constructed objects, etc
<mjg> heat: that's basically a remark on the mutex_init et al stuff i mentioned
<geist> and yeah i suspect that's the 'cpus L1 caches are larger and cpus are really good at blatting out a relatively small amount of data in pretty close proximity'
<mjg> and does not account for less common usecases, like the vnode stuff i mentioned
<mjg> [the global lru]
<geist> side note i'm still amazed at how big the vnode cache gets on linux. i checked one time on i think a 64GB or maybe a 128GB machine after touching basically every file in my fs
<mjg> huh
<geist> and sure enough /proc/slabinfo showed like a little over 4GB of vnodes
<geist> millions of them
<mjg> well linux is pretty aggressive about populating the dentry cache
<geist> and the dentries, yeah
<mjg> i guess it is fine if you have enough ram anyway?
<geist> linux seems to go all in on dentry as a load bearing thing, i was a little surprised to recently learn this
<mjg> i would be worried if the system was running into trouble because of it
<heat> oh yeah totally
<geist> my general VFS knowledge is very BSD centric, where a dir cache is purely an accelerator cache, and not strictly necessary
<mjg> [which to be clear does happen :>]
<mjg> geist: funny you mention that, i was looking at making it mandatory :)
<heat> dentries are required if you want to go up or down the path, or if you want fast lookups
<mjg> the problem with an optional cache is that you hinder user debugging
<heat> hitting the fs is bad
<geist> yah at first i'm a little grossed out about it, but then once you think about it it's not too bad
<mjg> for example they open a file, you don't store the name
<geist> i think the hard part will be network fses, they may require a bit more thinking re:dentries being mandatory
<mjg> then you lsof or compatible and there is no entry
<mjg> sucks
<geist> what i was surprised to hear is stuff like tmpfs doesn't need to really maintain its own dir structure, since it apparently just tosses stuff in dentries
<heat> the funny part of linux dentries is that filesystems can invalidate them
<geist> and thus you get the VFS storing your data for you
<geist> oh sure. i'd *assume* the fs can invalidate them. it has to
<mjg> it does not need to do it indeed, i have plans to utilize it in freebsd
<mjg> right now it is de facto double allocating names
<geist> well hard links would generally do that
<geist> but that should be fine: you just have two dentries pointing at the same vnode
<mjg> right
<mjg> fun fact, solaris refrains from entering tmpfs vnodes into dnlc
<mjg> instead they have a local hash
<geist> i think the dentry thing also nicely solves bind mounting and recursive mounts, i think
<heat> what's dnlc
<geist> though actually i haven't resolved in my mind completely how that works
<mjg> directory name lookup cache
<mjg> that's how old unix systems call it
<heat> I do all my mounts in dentries
<geist> ie, if you mount /a and then mount something else at /a/b and then bind mount /a at /c, do you get /c/b? i think in linux you dont
<mjg> geist: i have to admit me neither :)
<heat> a mountpoint is just a dentry with DENTRY_FLAG_MOUNTPOINT
<geist> ie, bind mounts dont drag along the entire submount structure, so there must be some way to nak mountpoint dentries when traversing
<mjg> so to be clear, the idea is not linux-specific, dragonflybsd also has the general approach
<mjg> but i don't know how it differs
<mjg> geist: you use a tuple of dentry + mount point to do the lookup
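A minimal sketch of that idea, assuming nothing about how Linux or DragonFly BSD actually lay this out: `Dentry`, `Mount`, `MountTable` and the integer ids are invented for illustration. "Is something mounted here?" is asked about a (mount, dentry) pair rather than about the dentry alone, so the same dentry reached through a bind mount does not automatically carry the submount with it.

```cpp
#include <map>
#include <utility>

// Hypothetical, stripped-down VFS objects identified by an integer id.
struct Dentry { int id; };
struct Mount  { int id; };

class MountTable {
public:
    // Record that `child` is mounted on dentry `where` as seen through `parent`.
    void attach(const Mount &parent, const Dentry &where, Mount &child) {
        covered_[std::make_pair(parent.id, where.id)] = &child;
    }

    // Path walk asks: arriving at `d` via mount `via`, is it covered?
    // Returns nullptr when nothing is mounted there for that mount.
    Mount *mounted_on(const Mount &via, const Dentry &d) const {
        auto it = covered_.find(std::make_pair(via.id, d.id));
        return it == covered_.end() ? nullptr : it->second;
    }

private:
    std::map<std::pair<int, int>, Mount *> covered_;
};
```

With this keying, a walk through the original mount crosses into the submount, while a walk through a second mount that shares the same dentry tree sees a plain directory, which matches geist's recollection that a non-recursive bind mount does not drag submounts along.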
<heat> geist, I would imagine you would
<geist> that's kinda what i'm thinking
<geist> heat: i think not, because it also lets you recursively bind mount, and you dont want that to go forever
<heat> sure, you have a limit to recursion
<geist> ie, /a can be mounted at /a/b and i think it only goes one level?
<heat> symloops baby
<geist> anyway yeah there must be some sort of mount id instance or something that causes it to nak traversing it the second time, etc
<heat> yes
<heat> I forgot what its called
<heat> vfsmnt?
<heat> *shrug*
<geist> anyway the old scheme i used to use for VFSes is to store the mount point in the vnode itself. basically a field that says 'i'm covered by <pointer to new mount instance>'
<geist> but putting it in the dentry makes sense, if the design makes it load bearing
<bslsk05> ​elixir.bootlin.com: slab.h - mm/slab.h - Linux source code (v5.19.12) - Bootlin
<heat> I know
<heat> I gave you the patch that removed it
<mjg> heat: in 2007
<mjg> heat: for all i know it came back since then
<bslsk05> ​marc.info: '[RFC] Slab allocators: Drop support for destructors' - MARC
<mjg> i'll read it later, have to go afk
<heat> geist, you know, I was just thinking, what if its ok to perform construction at slab allocation in zircon since you can just make construction lighter
<heat> particularly since google's coding style effectively suggests that
<heat> (with no exceptions, etc)
<geist> yah, would have to probably templatize a slab instance on it so it knows which constructor to use, etc
<geist> and then that's mega code bloat (but that's probably what most folks would do first)
<heat> the only drawback is that you'd stop being able to do something like "new inode{inode_num, uid, gid, ...};"
<geist> so you end up with a type safe slab
<heat> yeah
<geist> yah you'd need an Init() routine, which is actually what the style generally calls for
<heat> I have a type safe memory pool but it's ass
<geist> anything that's not trivial in the kernel generally has a ctor + Init() routine pattern
<geist> because Init can return error, etc
<heat> particularly, you can't have a list of slabs, etc
<heat> and then, yes, template bloat woohoo
<geist> yah, but if you do it right can probably avoid any function pointers, at the expense of code bloat
<geist> the fun part is carving off inner routines and making them just arg driven
<geist> like the internal memory allocation and slicing routines
k8yun has joined #osdev
<geist> but really template bloat is par for the course in Google Style. for better or worse. i'm a bit aghast at it but what can i do
<heat> avoid it? :P
<geist> "The avalanche has already started, it's too late for the pebbles to vote"
<heat> every time I look at zircon I find yet another thing that got replaced by some overengineered C++ solution
<heat> usually with templates
<heat> last time it was CPUID
<geist> agreed. imagine when it's your stuff that's being replaced. it's a bit difficult at times
<geist> oh dont get me started on the cpuid stuff.
<geist> at least the codegen on that stuff is the same, the ergonomics are just insane. and people *like it*
<geist> that's what i can't argue with. a whole class of folks just love that stuff
<geist> 'look at how unit testable it is! the fields are self descriptive! i can <insert random templatey neat thing you can do with it>'
<geist> *shrug* yeah it is 'better' i guess. i just like simple unless complexity is necessary
<heat> yeah but then the kernel doubles in compile time and maybe increases in size
<geist> yup. and i've made complaints about that before, but the general retort is 'can you prove that the increase in size is causing a problem?'
<heat> uint8_t ProcessorId::local_apic_id() const {return fbl::ExtractBits<31, 24, uint8_t>(registers_.ebx());}
<heat> just shift and mask?
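For comparison, the hand-written version heat has in mind might look like the sketch below. The function names are made up, and the inclusive `high`/`low` convention of `extract_bits` is an assumption based on how the `ExtractBits<31, 24, uint8_t>` call reads, not on the fbl source.

```cpp
#include <cstdint>

// Bits [31:24] of EBX by hand: shift the field down to bit 0, keep 8 bits.
constexpr std::uint8_t local_apic_id_from_ebx(std::uint32_t ebx) {
    return static_cast<std::uint8_t>((ebx >> 24) & 0xffu);
}

// Generic form of the same thing for an inclusive [high:low] bit range.
// Assumes 31 >= high >= low.
constexpr std::uint32_t extract_bits(std::uint32_t value, unsigned high,
                                     unsigned low) {
    return (value >> low) &
           ((high - low + 1 >= 32) ? 0xffffffffu
                                   : ((1u << (high - low + 1)) - 1u));
}

// Compile-time check that both spellings agree on a sample value.
static_assert(local_apic_id_from_ebx(0xAB123456u) ==
                  extract_bits(0xAB123456u, 31, 24),
              "shift-and-mask and extract_bits should agree");
```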
<geist> sure, but folks that like that stuff point out that that solution is totally superior: no chance for mistakes
<geist> thing is that folks in the trenches that write this sort of stuff for a living in C or whatnot point out that these sort of bit level mistakes are really not that common. 95% of driver/low level stuff errors are not putting the wrong bit in the wrong place, they're overall logic errors, or misunderstandings of hardware
<geist> which no amount of templatey goop will solve
<heat> yea
<mjg> but can obfuscate enough for you to be able to claim progress :>
<mjg> refactor this, refactor that, work is being done boss
<geist> but folks that haven't written drivers for a living love looking at the 'fragile nasty C' code and building complex abstractions to solve a perceived problem
<heat> is there a divide between old school and C++ people?
<geist> but at the end of the day what you *can* do is at least make sure the codegen isn't bad
<geist> heat: yes but the c++ people outnumber the old school 10/1
<geist> so what can you do
<heat> :|
<geist> anyway, it's not as bad as i make it out. there are smart C++ people. they know what is and isn't a good idea re:codegen
<j`ey> join a linux team!
<heat> I think geist would rather die a painful death
<geist> nah, linux is way too toxic. it'd be like asking me to play CSGO
<heat> what's wrong with csgo
<geist> toxic
<heat> noooooo
<heat> rocket league is way worse
<geist> it's all shades of toxicity
<heat> if you just dick around in csgo no one will fuck with you
<geist> fine. anyway re all this i got some work to do today
<heat> in rocket league you get "What a Save!"'d 4 times a match at least
<heat> j`ey, whats your favourite directory in linux
<heat> if you say arch/arm64 I know you're lying, that's a PR answer
<heat> arch/arm is even worse because no one likes that dir
MiningMarsh has quit [Ping timeout: 252 seconds]
MiningMarsh has joined #osdev
<j`ey> heat: drivers/firmware/efi
<heat> lie
<heat> what a fucking lie
<j`ey> D:
<heat> the only correct answer is in a hipster kernel
<heat> you have like 5 BSDs to choose from, and 9front
<heat> also darwin
<heat> and *ONYX* obv
<ddevault> the rewrite is paying off https://l.sr.ht/UekL.png
<ddevault> couldn't do this in the old version
<j`ey> heat: what is it?
<heat> i don't know
<heat> you choose
<heat> linux loving is boring, stanning a hipster kernel is cool
<heat> oh, forgot haiku, silly me
<j`ey> but I spend all my time in arch/arm64 D:
nick64 has joined #osdev
<heat> also too mainstream
<heat> I think ARM should pay you to work on VAX support
<heat> what's a unix without vax support
<nick64> ring0 and ring3 in Intel aren't just software constructs, but physical indicators in the processor, right?
<heat> yes
<gog> yeah
<gog> the instructions the current process has access to will depend on the CPL
<heat> x86 has ring 0, 1, 2 and 3, and then SMM and "theoretically" a hypervisor if you're under it
<gog> ring -1
<gog> as it's sometimes called
<heat> confusingly you can have SMM under a hypervisor as well
<heat> rings in x86 are hard
<gog> yeh x86 is a strange cpu
<heat> anyway use 0 and 3
<heat> the rest don't matter
<gog> yeah 1 and 2 aren't really useful for anything
<gog> almost nobody used them
<gog> they were supposed to be for privileged applications like kernel services for microkernels
<nick64> That is the very next question I was going to ask, are the negative rings really hardware indicators, or just indicators based on some register bits, as a software construct?
<gog> it's more like system management mode is an entirely different thing
<gog> and the programmer doesn't really have access to its inner workings
<heat> gog, funnily enough RISCV has M (machine) mode for fw and S (supervisor) mode for kernel stuff but you can't be in M mode under a hypervisor
<gog> interesting
<heat> so QEMU can't load firmware if kvm is enabled
<nick64> A virtual machine monitor in a type 1 hypervisor, from one point of view is a kernel of the type1 (ring 0), and from another point of view ring -1, since it is hypervisor. Now I am confused, what is it?
<heat> nick64, they're hardware modes
<ddevault> rdmsr/wrmsr proposed for worst x86_64 instruction
<heat> it's a fine instruction
<heat> slow as balls, but fine
<gog> use sparingly
<heat> nick64: anyway, hypervisors are funky. the "ring" they are in really depends on where you are down the stack
<heat> if you're under it, -1, if you're alongside it, 0
<heat> all of this is relative and life is relative and nothing is absolute
<zid> idk why people insisted on extending the ring model beyond its means
<heat> yeah
<heat> it's just marketing at this point
<zid> we're up to like ring -8 now
<nick64> heat: I know that ring0 and ring2 are "real" and "physical". Can't wrap my head around ring -1 being physical
<zid> and we only had 0 and 3 to begin with
<heat> nick64: negative rings are just concepts around things that are "above you" in the traditional sense
<heat> hypervisors are one of them, so is SMM
<zid> I could make an interpretation for ring -1 being "physical". The silicon is cheating on you and presenting an orchestrated fake physical reality.
<heat> yes
<nick64> zid: what other way would you suggest instead of extending the rings, without making it insecure?
<zid> rings has nothing to do with security or lack thereof
<heat> zid isn't talking about "extending the rings"
<zid> it's a silly descriptor of a methodology
<heat> it's that the ring numbers are stupid and don't make sense
<zid> you don't have to remove methodology to remove a silly name
<heat> 0 - 3 really exist as rings
<ddevault> new iso should probably work should anyone care to test it for me: https://mirror.drewdevault.com/boot.iso
<heat> ddevault, doesn't work in kvm
<nick64> I see.. so you mean it should have been called ring0,1,2,3, and there should be some other naming/hardware bits to indicate the virtualization partition?
<ddevault> heat: aware
<ddevault> haven't bothered to investigate yet, patches welcome
<heat> ok great
<ddevault> works on qemu softmmu and on my laptop, that's enough for now
<heat> nick64, arm64 and riscv levels are saner
<zid> I assume "halt" "unhandled IRQ1" is not what it's supposed to do?
<ddevault> that's fine
<ddevault> expected result is https://files.catbox.moe/h6r2g1.jpg
<gog> yeh if we wanna get really technical the ISA and the microarchitecture are not the same and what the programmer sees is a façade
<ddevault> halt is by design, unhandled IRQ1 is because there's no keyboard driver running
<zid> I didn't press anything I swear :,
<ddevault> doubt
<heat> nick64, the really important part to grasp is that every "ring level" is "physical"
<heat> as in, it exists
<nick64> heat: So say CPU is in ring 0, and decides to kick start the virtualisation stuff to spin up a guest. Does the CPU that is handling the monitor sort of "become" ring -1 (from the ring 0 it was), or is it just a visualisation, and it is still ring 0, with the VMCS indicating it is ring 0 root or ring 0 non-root?
<heat> SMM is more privileged than ring 0, hypervisors are more privileged than ring 0
<heat> last
<zid> technically
<zid> you end up with a clone of ring 0-3 again
<heat> yea
<zid> the guest has its own, the host has its own
<zid> the guest isn't supposed to "know" that this is the case
<heat> yeah
<zid> but the host can intercept things that the guest ring 0 did
<heat> TL;DR the negative ring numbering is bullshit
<nick64> Oh! So root partition has rings 0-3, and each non-root partition has their own ring 0-3 rings?
<heat> yes
<zid> so you can say that in *practice* the host ring 0 is actually the guest's ring -1
<heat> you can even have hypervisors under hypervisors
<zid> but it doesn't make ring -1 *exist*
<heat> yeah
<nick64> So it is not a hardware thing below ring 0, just that it is a software datastructure (aka VMCS structure)?
<nick64> How accurate? ^
<heat> it's a hardware thing
<zid> the hardware implements it
<zid> else it wouldn't work at all
<zid> there has to be some silicon to let you make a 'fake' ring0, that you can control from the 'real' ring0
<heat> on a similar note, you can have SMM under a hypervisor and SMM above the hypervisor
<heat> and both can stop you
<zid> And yea, the motherboard is also doing it to you in the first place
<heat> the difference is that the SMM above the hypervisor takes you out the vm, SMM under the hypervisor takes you to SMM inside it
<zid> Your host ring0 has its own ring -1, the motherboard, and the motherboard might have its own ring -1, the NSA :P
<nick64> Hardware implements the negative rings, or hardware implements the magic to make sense of the "software" VMCS structure on the memory?
<heat> which is -2 and which isn't? they both are
<heat> dude
<heat> negative rings don't exist
<zid> you're listening but you ain't listening
<heat> we've been through this
<zid> negative rings are an *abstraction*
<zid> they are not *real*
<nick64> Ok let me read everything all over again :D
<heat> it's just stupid marketing
demindiro has joined #osdev
<zid> It's just how you make sense of a guest client thinking it's in ring0, but not really, what 'number' do you give to the 'real' host ring0 priv level
<heat> -1 and -2 exist to make you go "woowwww there's something above ring0?????"
<nick64> How does CPU know if the ring0 it is in, is in fact the actual ring 0 or one of the VM's ring 0?
<nick64> Or in marketing terms, how does CPU actually know if it is in ring -1 or in ring 0?
<nick64> (Not asking how an application code can ask CPU this, how does the CPU itself know)
<gog> rings are fake. programming is fake
<gog> our lives are meaningless
<nick64> How does CPU know to happily execute in/out in ring 0 of the host, but to cause a trap in ring 0 of the guest?
<gog> anyhow, there are flags for vm hypervisors
<gog> so it can know if it's in a "real" ring0 or not
<zid> of course the cpu knows?
<zid> How could it not, it's the one that implements it
<zid> if you did the "jump to guest instruction", it knows that happened
<zid> and can keep track of it in a bit somewhere
<nick64> I mean, if it "knows", then it is no longer "not real"?
<gog> vmenter and vmexit
<zid> ring -1 is not real, correct
<demindiro> Rings are a social construct and hardware state is magic
<gog> yes
<zid> there's the host ring0, and the guest ring0, that's it
Gooberpatrol_66 has quit [Remote host closed the connection]
<zid> the cpu knows which one it is in, as to whether it ends up vmexiting back to the host or not when something fun happens
<zid> that's the entire implementation and semantics
<zid> there's no special ring -1, either in the manual or in the cpu
<zid> there's just "if we vmentered, fun things should run vmexit"
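zid's description can be put into a toy model. This is not how any real CPU is specified: `ToyCpu`, `Outcome` and `run_privileged` are invented names, and the order of the checks is a simplification, since real hardware has per-instruction rules about whether a fault or a VM exit wins. The point is only that there is no extra ring number, just the usual privilege level plus one piece of state saying whether a guest was entered.

```cpp
#include <cstdint>

enum class Outcome { Executes, FaultsInPlace, ExitsToHost };

// Toy CPU state: the ordinary privilege level (0..3) plus one flag that is
// set when a guest is entered and cleared when the CPU exits back to the host.
struct ToyCpu {
    std::uint8_t cpl;
    bool in_guest;
};

// What happens when software runs a privileged operation that the host has
// chosen to intercept (host_intercepts) or not.
Outcome run_privileged(const ToyCpu &cpu, bool host_intercepts) {
    if (cpu.cpl != 0)
        return Outcome::FaultsInPlace;  // same rule in host and guest
    if (cpu.in_guest && host_intercepts)
        return Outcome::ExitsToHost;    // "fun things should run vmexit"
    return Outcome::Executes;           // ring 0, nothing intercepting
}
```

In this model the host kernel and the guest kernel are both at privilege level 0; the only difference is the `in_guest` flag, which is the distinction zid says should be called host and guest rather than ring -1 and ring 0.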
<nick64> Okay, bear with me for a moment to make a contradicting statement, "ring -1 exists, but it is more appropriate to call it 'ring 0 of root-partition that came into picture post vmxon' rather than the 'stupid?' marketing term"?
<zid> why is it
<zid> every time you are told ring -1 exists
<nick64> How wrong am I ^
<zid> doesn't exist*
<zid> you start every subsequent summary with "ring -1 exists"
<zid> :P
<nick64> :yikes:
<zid> even if you're now trying to devil's advocate, you seemingly can't resist
<zid> It's a buzzword you *might* see on the product info of a software product, where it says "LATEST TECHNOLOGY" and "APPROVED BY 97% OF ALL DOGS"
<gog> semi-related side note but i've been taking locker 386 at the gym because it's a really easy number for me to remember :D
<zid> nice
<zid> I did something similar recently, I got 386th place in trackmania so I just stopped improving that track
<zid> why spend a bunch of time practicing how to ruin it
<gog> nice
<nick64> So ring -1 does not exist. But there exists some sort of hardware magic where CPU can tell if the ring0 is in host or guest. And in that point of view, even the virtual machine monitor is actually just ring 0 (and not some marketing ring -1 stuff), just that it is in root partition which CPU knows is more privileged
<zid> still can't resist
<zid> host, and guest, these are the terms you want
<zid> the rest you can delete
<nick64> Is VMX partitioning real?
<zid> Does vmenter and vmexit deal in vmx partitions
<zid> if yes, yes, if no no
<nick64> Okay I have a better phrased version of my first question. Let us say CPU is running in ring 0. We all agree ring 0 is real I guess. When the code that CPU is running does all that vmxon, vmptrld, vmlaunch magic, and the control flow is transferred to the Ubuntu guest machine kernel of mine, does the "real ring 0" status of the CPU remain in ring 0 itself?
<gog> the cpu is operating in a more-or less normal mode with the exception of certain instructions and access to certain regions of memory
<gog> so when it's in guest ring 0, it's in ring 0 with caveats
<gog> but the guest OS doesn't know or care
<gog> it can do whatever it wants and the hypervisor services it
<nick64> Right. And when there is a VMExit, the code that handles the exit, although the marketing folks would say is in ring-1, is actually just ring0 itself, but with some other part of the CPU not restricting those privileged instructions?
<gog> yes
<zid> so you can probably appreciate why someone *might* refer to it as ring -1, given it's a "super ring0 with privileged instructions" now, given that "guest ring0 without privileged instructions" exists
<nick64> So if the Ubuntu guest, from within it, spawns up another VM inside of it, as far as the CPU is concerned, both the guest ring 0s are of the EXACT SAME privilege, and the only difference is the address of where the exit handler function pointer and other VMCS related points to?
<zid> but it's just a cheeky abstraction to make some lengthy exposition of the situation disappear, not a real thing
<nick64> Same question simplified, as far as the CPU is concerned a VM's kernel running in ring 0, and a nested VM within that VM's kernel which is also running at 0, are of same privilege?
<zid> the only privilege levels it officially acknowledges are 0, 1, 2 and 3
<zid> of which software uses 0 and 3
<zid> nested vm shit is just nested vm shit
<nick64> So how does it differentiate the security privilege of VM1 that runs on host, and the VM2 that is spawned by VM1?
<zid> it makes various datasets have a host and guest relationship
<nick64> VM2 ring0 is not supposed to be able to control VM1 kernel right
<zid> no, that's the point of the encapsulation provided by the virtualization
<zid> vm2's the guest of vm1, its host
<zid> vm1 is in the same relationship and position toward vm0
<nick64> CPU keeps track of the "forward access allowed, backward access not" relationship at multiple depths of nesting?
<zid> yes, it just knows where to look for the current set of flags and shit and updates accordingly at vmenter/exit
<zid> given you provide it when you set up the shit in each layer
<clever> nick64: aarch64 hypervisor api makes it far simpler to keep track of all of that and to explain it
<clever> EL2 is where the hypervisor runs, and it has access to special hypervisor registers
<clever> with those registers, you can add a second set of paging tables, so the "physical address" from EL1(kernel mode) goes thru a second set of translations
<clever> and it can configure the cpu to trap any unauthorized or emulated function
<clever> EL1 then just cant access hw it isnt allowed to access, causing either a normal pagefault or a handler in EL2 to run
<clever> any time you context switch to another guest, EL2 has to re-configure all of those registers, to set the limits based on that other guest
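The second set of translation tables clever mentions can be sketched as two lookups chained together. This is a toy with flat page-number maps, not the real AArch64 table format; `PageMap`, `translate`, `guest_access` and the `Access` outcomes are all hypothetical names. The guest's own tables map a virtual address to what the guest believes is a physical address, and the hypervisor's tables then map that to a real one. A miss in the first stage is the guest's own page fault; a miss in the second stage is something for EL2 to handle.

```cpp
#include <cstdint>
#include <map>

constexpr std::uint64_t kPageSize = 4096;

// Toy page table: page number in, page number out.
using PageMap = std::map<std::uint64_t, std::uint64_t>;

enum class Access { Ok, GuestPageFault, TrapToHypervisor };

// One translation stage; returns false if the page is not mapped.
bool translate(const PageMap &table, std::uint64_t addr, std::uint64_t &out) {
    auto it = table.find(addr / kPageSize);
    if (it == table.end())
        return false;
    out = it->second * kPageSize + addr % kPageSize;
    return true;
}

// stage1 is owned by the guest kernel (EL1), stage2 by the hypervisor (EL2).
Access guest_access(const PageMap &stage1, const PageMap &stage2,
                    std::uint64_t va, std::uint64_t &pa) {
    std::uint64_t ipa = 0;  // the guest's idea of a "physical" address
    if (!translate(stage1, va, ipa))
        return Access::GuestPageFault;    // handled inside the guest
    if (!translate(stage2, ipa, pa))
        return Access::TrapToHypervisor;  // guest touched something not granted
    return Access::Ok;
}
```

Switching to another guest then amounts to pointing the second stage at a different `PageMap`, which corresponds to the register reconfiguration clever describes.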
<nick64> I wonder how it is restoring the trashed register states. Pushing to some virtual memory location like function stacks?
<clever> same as any other irq or exception handler, save all registers to a stack upon starting the handler
<nick64> Oh maybe intel is also doing something like that for nesting VMCS
<heat> demindiro, exactly
<heat> you're the most right person here
<clever> i think for nesting vm's, the hypervisor has to trap access to vm control registers
<clever> and then emulate the hypervisor api
<zid> heat: can I be 3rd most right?
<heat> tes
<heat> yes*
<heat> but also tes
<zid> I figure if I go for 2nd, I might have to fight
<j`ey> clever: nested virt requires extra stuff, linux doesnt even support it yet
<heat> x86 privilege levels are stupid and magic
<zid> and while those guys are fighting, I'll snap up 3rd
<heat> i would never give you 2nd
<heat> i'm 2nd
<nick64> I think we'd need to invite Steven Crowder to settle this
<zid> so we have demindiro, stallman, zid
<heat> hello it's me, richard stallman
<heat> pleasure to meet you
<zid> how do you feel about parrots
<heat> gnu and gplv3+ and GNU plus linucks and Hurd and tivoization and pedofiles and all that
<heat> pedophiles?
<heat> probably
<heat> richard stallman, matthew garrett, undead corpse of tony blair and a literal poltergeist is all this channel comes down to in the end
<zid> which one of us is the corpse of tony blair
<heat> yes
<zid> oh
<heat> the real corpse of tony blair is the friends we make along the way
<CompanionCube> objection: for better or worse, tony blair is still alive
<heat> but is he really
<nick64> heat: is there any difference whatsoever in privilege or anything, when it comes to "the kernel thread that executed the vmx set of instructions to become the monitor for a VM" vs "any other threads spawned by the kernel of the same host machine, running on the metal"
<CompanionCube> well, mostly the hair is bad
heat has quit [Remote host closed the connection]
heat has joined #osdev
<heat> the link crashed my irc client
heat has quit [Remote host closed the connection]
<zid> the linux
heat has joined #osdev
<CompanionCube> the curse of tony blair
<heat> oh wow I can't open that link lmao
<zid> to be fair, it's thetimes.co.uk, you probs don't want to
<CompanionCube> try it in a browser maybe?
<zid> It'll somehow make you vote conservative by declaring that red wine gives you cancer
<CompanionCube> zid: that's the daily express or mail
<zid> and the times
<heat> yeah the hair is NOT IT
<zid> and the express
<zid> and the sun
<heat> the s*n
<zid> and the sunday express
<zid> and the telegraph
<zid> and the evening telegr-
k8yun has quit [Ping timeout: 268 seconds]
<CompanionCube> yes but the two mentioned are the ones who do the cancer thing
<heat> and SKY NEWS
<heat> PRAISE BE THE MURDOCH
<zid> Turns out if you put restrictions on how much of a right-wing dipshit you can be on TV, murdoch buys all your newspapers
<zid> except the left leaning one which is owned by.. a russian oligarch!?
<CompanionCube> and since it's a picture of tony blair, rather he makes you vote New Labour and pretend people still like the Third Way.
<Ermine> Wtf is happening here
<heat> zid, the guardian is nice
<CompanionCube> zid: a russian oligarch who's buddies with boris enough to get into the lords, no less!
<CompanionCube> You wonder how he feels about that 'baron of siberia' title post-invasion.
<zid> UK media needs a reset button pushing
<\Test_User> media in general does
<Bitweasil> False! According to major media fact checking agencies, media is doing just fine, and you're the fault! MediaFactCheck++ Certified Answer!
<Bitweasil> :(
<\Test_User> XD
<Bitweasil> Now's a good time to get hobbies like "kerosene lanterns" and "off grid power systems" and "Wow, you know, being cold isn't so bad when you bundle up!" :(
<zid> I have investigated myself and found no wrongdoing
<\Test_User> being cold doesn't feel bad once you've died
k8yun has joined #osdev
<heat> mjg, how do you make slab go fassssstttttttttttt
<nick64> Is the vm monitor thread running in ring0 (which hypervisor product marketing would call ring -1) somehow any different from the rest of the ring0 threads running in the same kernel, or at exactly the same level/mode/privilege as far as the CPU is concerned?
<heat> same pri
<heat> v
<Bitweasil> The code calling VMLAUNCH/etc is just "ring 0 code," with the same ability to do stuff as other ring 0 code.
<Bitweasil> There's nothing special about it.
<heat> your only privilege is with respect to what's running under you
<nick64> Cool. I think that concretises the idea that -1 is not real, and that it is the guest side of this which is something new (ability to cause a trap on priv instrs)
<Bitweasil> Well... from whose perspective?
<heat> no ring is real in x86
<Bitweasil> ^^
<heat> it's all about perspectives
<gog> one ring to rule them all
<Bitweasil> From the *host* point of view, ring 0 is ring 0, the guest is running as a VMX guest, with its own rings.
<gog> precious
<\Test_User> aka intel me
<heat> ARM and RISCV are a lot more concrete
<Bitweasil> From the *guest* point of view, there are things "below ring 0."
<heat> \Test_User, intel ME is not a ring
<\Test_User> heat: true, but it does rule them all
<Bitweasil> Yeah, ME is a separate CPU core in the system, separate memory, etc.
<Bitweasil> Not really.
<Bitweasil> I'd argue that SMM, or the STM, are the lowest levels of x86.
<nick64> I think \Test_User meant SMM?
<heat> STM?
<Bitweasil> "Yo dawg, I heard you like hypervisors, so I put one in SMM so you can hypervise while you hypervize."
<\Test_User> lol
<demindiro> So bureaucracy
<Bitweasil> It's an obscure little corner of x86 I happen to know very well.
<zid> The lowest level of x86 is actually the superio chip
<nick64> I wonder if SMM to ME is some sort of processor to processor interrupt
<Bitweasil> No...
<zid> because I can fake keystrokes to log in as root via it, ergo it's the deepest attack vector
<heat> ME is totally separate
<nick64> Lowest level of x86 would be microcode, right?
<Bitweasil> So, SMM, system management mode, was the most privileged place in x86.
<heat> sure
<Bitweasil> And SMM handlers were a hot mess.
<Bitweasil> BIOS vendors largely said, "Well, we can't fix it, the guy who wrote that retired a decade ago."
<heat> Bitweasil, they're getting rid of a good chunk of SMM
<heat> PRM they call it
<Bitweasil> Oh? Link plz?
<Bitweasil> Anyway, SMM can violate hypervisor separation of guests, it can touch everything.
<Bitweasil> And back in the day, the High Assurance Platform (HAP) was upset about this.
<gog> SMM is the spooky ghost in the machine
<Bitweasil> So Intel solved it by adding a hypervisor over there.
<heat> let me find the whitepaper
<Bitweasil> So now, that hypervisor, the STM, logically lives "below" the executive hypervisor in the main operating space (you vmexit from ring 0 to the STM, vmlaunch from the STM to normal land ring 0).
<Bitweasil> And it can sandbox the existing legacy SMM handlers.
<Bitweasil> So now, *that* code is the most powerful code on the platform.
<Bitweasil> And if you're a tiny bit creative, you can do an awful lot of fun things from the STM with regards to introspection, snooping on other hypervisors, etc.
<\Test_User> interesting
<bslsk05> github.com: edk2/PrmPkg at master · tianocore/edk2 · GitHub
<Bitweasil> Ooh, thanks!
<Bitweasil> "Problems with SMM" section is nice and to the point. It's an opaque hot mess. We've been polite, but... that's what it is.
<heat> for sure
<heat> it's a fucky area
<heat> no one really likes it
poyking16 has quit [Quit: WeeChat 3.5]
<Bitweasil> Dunno, I like it! :D
<Bitweasil> I've gotten paid for a lot of years to mess in it.
<Bitweasil> I mean, that's only because it's a poorly understood hot mess, but...
<Bitweasil> :p
<nick64> Some SMI handlers will be moved to OS context as part of PRM? Won't that dissolve all the security boundary between the OS and the SMM?
<heat> this is for code that doesn't need a security boundary
<nick64> If CPU executes in SMM privileges on code that can be modified by a malicious OS!
<Bitweasil> It depends on what you're doing. A lot of the stuff SMM does is just platform-specific stuff that happens to be shoved into SMM because legacy reasons.
<Bitweasil> Those *capabilities* need to be there, but they don't need to be executed in SMM.
<heat> you'll still need it for things like the lockbox or authenticated variables
<Bitweasil> They just happen to be, for legacy reasons.
<Bitweasil> Right, or hopefully NOR flashing.
<heat> >If CPU executes in SMM privileges
<heat> it doesn't
<heat> that's the point
<nick64> The code that is executing may be something that does not need a security boundary, but an attacker in the OS can very well modify it into code that opens sesame, if the CPU fetches and executes from OS-mapped memory after getting into SMM, right?
<nick64> Oh got it
<Bitweasil> So, for instance, USB/PS2 emulation lives in SMM because it can be transparent. There's no reason that *needs* to be in SMM. It just... is.
<Bitweasil> Or some thermal response stuff.
<Bitweasil> Etc.
<Bitweasil> You'd be exposing those services to the OS, where it can execute those things without having to trap to SMM.
<Bitweasil> Because a multicore sync over there is hella-expensive.
<Bitweasil> (it doesn't *have* to happen, but most modern systems will lock up if you don't do that)
<nick64> I see. in that case it might actually be better for security, reducing code footprint on high privilege areas
<Bitweasil> Correct.
<Bitweasil> If you can move "the random crap that's in SMM now that doesn't actually need to be there" to somewhere not-SMM, that's an improvement.
<Bitweasil> And should perform a lot better too.
<heat> yeah
<heat> outb 0xb2 is super duper expensive
<Bitweasil> mmhmm.
<zid> what's on b2?
<Bitweasil> Super de duper mega expensive, flashing mah bling, making it rain microseconds!
<heat> register on intel chipsets that triggers an SMI
<Bitweasil> That's a traditional trap to SMI.
<zid> ah interesting
<zid> although every outb is pretty slow isn't it, because it does ISA bus speed, 1 microsecond
<Bitweasil> "These are my father's ports. Archaic interfaces, for a single-core time."
<Bitweasil> "... yeah, you still have to learn them, because stuff still uses them."
<Bitweasil> "Yes, it sucks. No, nobody's going to get rid of them. Still gotta boot DOS, after all. Now, let me tell you the story of the A20 gate..."
<heat> the A20 is not a thing anymore
<Bitweasil> Last I saw it was deprecated. Is it actually *gone gone* now?
<heat> no reset vector messes with that, qemu already comes with it enabled
<Bitweasil> (deprecated != removed)
<heat> I don't know if it's gone, but it's definitely enabled by default
<Bitweasil> But mah DOS!
<heat> actually
<heat> hasn't it always?
<Bitweasil> No, I thought it used to boot with it set to only 1MB of RAM, so non-A20 aware OSes would work properly.
<heat> I guess it's just a thing you disable when BIOS booting/CSM
<Bitweasil> That would make sense.
<heat> because there's no way you could reach the reset vector if A20 was masked
<Bitweasil> Hm? I'm pretty sure you can...
<heat> how?
<Bitweasil> Reset vector is just under 1MB, right?
<heat> no
<heat> 0xfffffff0
eck has quit [Quit: PIRCH98:WIN 95/98/WIN NT:1.0 (build 1.0.1.1190)]
<heat> it was there in 286 times
<heat> the last 64KB (IIRC) are still mirrored there for Legacy Purposes
<heat> (tm)(r) Legacy is a trademark of Intel Corporation
<Bitweasil> I thought it was set up so the segment/offset got you the right thing.
<heat> no
eck has joined #osdev
<Bitweasil> I mean, INIT-SIPI-SIPI is just to work around a bug in *one* 486 variant...
<Bitweasil> (you don't normally need the second one)
<heat> at reset your segments look 286-ish (and would theoretically point to the legacy 1MB area), but anything 386+ has a hidden segment cache whose CS base points at the reset vector just below 4GB
<Bitweasil> Anyway, I've not actually done x86 bringup in a while, so... I will happily defer to anyone doing it currently.
<heat> this is actually pretty well defined in the SDM
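The arithmetic behind the two candidate reset addresses can be sketched in C. This is only an illustration: the function names are made up here, and the reset values used in the usage note (CS selector 0xF000, hidden CS base 0xFFFF0000, IP 0xFFF0) are the ones the SDM documents for 386+ processors.

```c
#include <stdint.h>

/* Plain real-mode addressing: segment base is the selector shifted left by 4.
 * (Hypothetical helper name, for illustration only.) */
uint32_t real_mode_linear(uint16_t selector, uint16_t ip)
{
    return ((uint32_t)selector << 4) + ip;
}

/* 386+ right after reset: the hidden (cached) CS base is used as-is,
 * regardless of what selector << 4 would give. */
uint32_t cached_base_linear(uint32_t cs_base, uint16_t ip)
{
    return cs_base + ip;
}
```

With the documented reset state, `real_mode_linear(0xF000, 0xFFF0)` gives 0xFFFF0 (just under 1MB), while `cached_base_linear(0xFFFF0000, 0xFFF0)` gives 0xFFFFFFF0, which is the address heat quotes. The first far jump reloads CS and drops the special base.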
Starfoxxes has quit [Ping timeout: 265 seconds]
h4zel has quit [Ping timeout: 246 seconds]
Starfoxxes has joined #osdev
<heat> it's so theoretically 286s and 386s could boot in the same mobo
<Bitweasil> *nods*
<heat> just like itanium had an itanium entry point and an ia32 entry point in their firmware
<Bitweasil> I've mostly played in the 64-bit spaces.
<Bitweasil> UEFI is a mess, but it's a different kind. :)
<heat> this also applies to UEFI
k8yun has quit [Read error: Connection reset by peer]
<Bitweasil> Sure, I'm saying I play further along in the process.
<heat> yea
<Bitweasil> Most of my low x86 work is 64-bit hypervisors and SMM/STM.
k8yun has joined #osdev
<heat> I like PEI the most
<Bitweasil> I've done a bit of 32-bit work, but almost nothing below that.
<heat> out of UEFI
<heat> DXE is soo damn bloated
<heat> and SMM is SMM :v
<Bitweasil> Yup.
<Bitweasil> :)
<Bitweasil> Fun place to be, though.
<Bitweasil> "Awww, look at you peons way down there in ring 0! It'd be a shame if *someone* meddled!"
<heat> i find it mildly scary at best
<heat> but you do you :P
pretty_dumm_guy has quit [Ping timeout: 264 seconds]
<Bitweasil> Oh, I said it was a fun place to be. Related, I would *very very much not like* other people to be in there.
<Bitweasil> Since I know what you can do from it.
<Bitweasil> Still irked that Blackhat talk was denied. Not flashy enough, I guess.
<heat> but security in firmware is very much just a sideshow :)
<Bitweasil> I submitted a nice talk on what you can do from down there.
<Bitweasil> You ever mess with QubesOS?
<heat> no
<heat> but yeah I can imagine the stuff you can do
<heat> it's like zombocom
pretty_dumm_guy has joined #osdev
<Bitweasil> "Anything you want."
<Bitweasil> You have the full system at your fingertips and can configure it how you want.
<Bitweasil> Would you like the local APIC to deliver performance monitoring interrupts as SMIs? Sure, have at it! :D
<Bitweasil> (and then you can more or less do your own tracing, get execution every few branches, point PEBS or such at it...)
<heat> I prefer writing the firmware and *forgetting to lock SMRAM*
<heat> then get people to exploit the shit out of it
<heat> (how to have ONE FUCKING JOB and fail miserably at it)
<Bitweasil> There's that, too...
<Bitweasil> Plus that fun little race condition from... what, 12-13 years ago?
<Bitweasil> "Unlock the bit. Oh, wait, SMM goes nuh uh and locks it again!" Fine, for the single core era.
<Bitweasil> But if you have another core racing on things... yeah, help yourself.
<Bitweasil> I honestly don't trust computers anymore.
<Bitweasil> I just do them because it pays well. :(
<heat> lmao
<heat> there was also a problem with local apic relocation and SMRAM
<heat> where you could use it to reach SMM
<mjg> heat: you don't do what solaris did
<mjg> heat: ... which is locks around it
<heat> you can have percpu caches but you still need locks
<heat> and apparently it's a *tiny* bit more complex than that
<heat> given that linux has 3 slab allocators, all of them thousand-line beasts
<mjg> you don't
<heat> you don't?
<mjg> as long as you guarantee there will be no allocs from interrupt context
<mjg> you can just disable preemption around the fast path
<Bitweasil> Yeah, SMM has been a choose your own adventure story for a long while.
<mjg> you try to pop an obj, which normally succeeds, and leave
<mjg> if not, you go to the slowpath, take locks and whatnot
<Bitweasil> "If you want to try a multicore race, turn to page 49. If you want to hope the firmware didn't lock it and just directly access it, turn to page 76."
<mjg> well it is a little more complicated than that if you have 2 magazines, but the general point stands
<heat> how do you pop it atomically?
<mjg> you don't need atomics if that's what you mean
<heat> how?
<heat> 2 cpus pop the same thing, what now?
<mjg> that would be an invariant violation
<heat> why?
<heat> are you just talking about the percpu cache?
<mjg> the entire point is that each cpu has its own collection of objs it can pop without fucking with anyone else
<mjg> yes
<mjg> for refilling the cache you do need locks
<mjg> (well that's the easiest anyway)
<mjg> but the fast path -- pop or push an obj, you get away with disabling preemption
<mjg> if you size things right, the locked slowpath will be rare(tm)
<mjg> (unless you are unlucky with a workload, which does happen)
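What mjg describes could be sketched roughly like this. It is a userspace toy under stated assumptions, not any real kernel's allocator: `preempt_disable()`/`preempt_enable()` are stand-ins that only count nesting, the "lock" is a flag where a real kernel would use a spinlock, the caller passes the CPU index explicitly, and all names and sizes (`MAG_SIZE`, `NCPUS`, `slab_cache`, `depot`) are invented for the example. It also assumes no allocations from interrupt context, as mjg says.

```c
#include <stddef.h>

#define MAG_SIZE 4   /* hypothetical magazine capacity */
#define NCPUS    2   /* hypothetical CPU count */

/* Stand-ins for real kernel primitives; here they only track nesting. */
static int preempt_count;
void preempt_disable(void) { preempt_count++; }
void preempt_enable(void)  { preempt_count--; }

struct magazine {
    void  *objs[MAG_SIZE];
    size_t count;
};

struct slab_cache {
    struct magazine percpu[NCPUS]; /* one magazine per CPU, never shared */
    void  **depot;                 /* shared backing store, needs a lock */
    size_t  depot_count;
    int     depot_locked;          /* placeholder for a real spinlock */
    unsigned long slowpath_hits;   /* how often the locked path ran */
};

/* Slow path: refill this CPU's magazine from the shared depot under the lock. */
static void refill(struct slab_cache *c, struct magazine *m)
{
    c->depot_locked = 1;           /* a real kernel would take a spinlock here */
    c->slowpath_hits++;
    while (m->count < MAG_SIZE && c->depot_count > 0)
        m->objs[m->count++] = c->depot[--c->depot_count];
    c->depot_locked = 0;
}

void *cache_alloc(struct slab_cache *c, int cpu)
{
    void *obj = NULL;

    preempt_disable();             /* nobody else can run on this CPU now */
    struct magazine *m = &c->percpu[cpu];
    if (m->count == 0)
        refill(c, m);              /* rare if magazines are sized well */
    if (m->count > 0)
        obj = m->objs[--m->count]; /* fast path: plain pop, no atomics */
    preempt_enable();
    return obj;
}

void cache_free(struct slab_cache *c, int cpu, void *obj)
{
    preempt_disable();
    struct magazine *m = &c->percpu[cpu];
    if (m->count < MAG_SIZE) {
        m->objs[m->count++] = obj; /* fast path: plain push */
    } else {
        /* Slow path; the caller must guarantee the depot has room. */
        c->depot_locked = 1;
        c->slowpath_hits++;
        c->depot[c->depot_count++] = obj;
        c->depot_locked = 0;
    }
    preempt_enable();
}
```

The point of the sketch is that the pop and push touch only the current CPU's magazine, so with preemption off there is nothing to race with; only the refill and overflow paths touch shared state.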
<heat> right
<heat> or irqs
<mjg> provide a dedicated alloc for irqs
<heat> I like to allocate in interrupts
<mjg> which should not be allocating shit anyway
<mjg> just give them a dedicated magazine
<heat> hrm
<heat> actually, do I allocate in interrupts?
<mjg> anyhow solaris takes per-cpu locks in the fast path, which is pretty slow
<mjg> and in fact their supposed fast path is very long
* mjg mumbles something about i-cache remarks
<heat> per-cpu locks?
<mjg> a lock stored in each per-cpu instance of the slab, next to the magazine
<heat> why?
<heat> so they can shuffle things around I guess?
<mjg> which part
<mjg> why not preemption?
<mjg> or if this protects against other cpus
<mjg> in the slab paper or some other one bonwick claims disabling preemption there is prohibitive
<mjg> and they don't even disable migration to other cpus either
<mjg> [i find the claim to be backwards fwiw :>]
<mjg> so in their setup they do need actual locks
<mjg> but this is self-induced, not inherent to per-cpu slab allocation
<heat> disabling preemption is prohibitive, but they need to disable it anyway?
<heat> I don't see how you would *not disable preemption*
<mjg> they don't disable it
<mjg> they just take a lock
<heat> so what if you get scheduled out
<heat> and something hits the slab allocator again
<mjg> and concede they may get off cpu, and then land on another one
<heat> do you just sit there and wait
<mjg> ye you take extra trips around sleep/wakeup
<mjg> which is why i consider his claim backwards
<mjg> anyway just disable the fucking preemption and be done with it
<heat> define extra trips
<mjg> you take the lock, go off cpu
<mjg> whoever preempted you also wants to alloc, so they take the lock... but that fails since you have it
<mjg> ... so they go off cpu and wake you up
<heat> oh shit
<heat> sleepable locks?
<mjg> oh you did not know
<mjg> YES
<heat> fuck
<heat> actually I'm way too into this to not write a slab allocator now
<heat> also riscv irqs aren't working and I can't be bothered to patch QEMU
<heat> mjg, btw is the general consensus to use the big linear page mapping or do you map things virtually?
<mjg> i don't know what you mean here
<mjg> for what specifically
<mjg> for the most part you try to slap the kernel into a collection of huge pages
<heat> slabs
<heat> are you mapping things virtually here? as in linux vmalloc or so
<mjg> there is a tech debt-induced nuance here i'm afraid
<mjg> but tl;dr apart from certain cases where you huge page, you are going to inject singular mappings
<mjg> one at a time
<mjg> so that, if need be, you can free the page and give it to userspace
<heat> what's the nuance
<mjg> you really like to read my rant, don't you
<heat> yes
<mjg> have you heard of "type stability"?
<heat> no
<heat> pls go on
<mjg> historically the old unixes were unable to ever free numerous object types
<mjg> most notably vnodes
<heat> yes
<mjg> on bsds this also included name cache entries, vm objects and more
<heat> im kinda aware
<mjg> so
<mjg> if you can't *ever* reuse this for any other purpose, you may at least damage control
<mjg> you slap a huge page and allocate from there
<mjg> [except freebsd fails to take advantage of it]
<mjg> vnodes and nc entries are freeable now, but vm objects still are not and that's probably never going to change
<heat> ok, that's weird
<mjg> LS... what was the letter
<heat> and linux doesn't fall into this right
<mjg> i don't know if linux ever had this problem
<mjg> i suspect not
<heat> linux likes to do all its memory stuff on the big mapping
<mjg> if you want weird you can check how 4.4bsd sorts out namecache entries
<mjg> work of fucking art
<heat> we've went through that IIRC
<mjg> *all*?
<heat> yes
<heat> vmalloc is the last resort
<mjg> that sounds very suspicious, 2MB granularity may not be enough to recover in low memory conditions
<heat> which kind of makes sense, given the buddy allocator and whatnot
<mjg> perhaps they default to huge pages which they demote if needed?
<heat> nooo
<heat> you're getting me wrong
<heat> you know the big linear mapping they have right? I don't know if fbsd has a similar one
<mjg> you mean direct map perhaps?
<heat> yes
<mjg> that's the name man :>
<heat> the ~896MB in 32-bit and gajillion bytes in 64-bit
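For readers unfamiliar with the direct map being discussed: the idea is that all physical memory is mapped at one fixed virtual offset, so converting between the two is a single add or subtract. A minimal sketch, assuming a fixed base; the constant below is the direct-map base Linux documents for x86-64 with 4-level paging and no KASLR, used purely as an example, and the helpers return integers rather than pointers so the snippet stays harmless in userspace.

```c
#include <stdint.h>

/* Illustrative base only; other kernels and configurations pick their own. */
#define DIRECT_MAP_BASE 0xffff888000000000ULL

/* Physical address -> address inside the direct map. */
uint64_t phys_to_virt(uint64_t phys)
{
    return phys + DIRECT_MAP_BASE;
}

/* Direct-map address -> physical address. Only valid for addresses that
 * actually lie inside the direct map. */
uint64_t virt_to_phys(uint64_t virt)
{
    return virt - DIRECT_MAP_BASE;
}
```

This is why the direct map makes page-backed allocations cheap: a freshly allocated physical page is usable immediately without installing a new mapping.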
scaleww has joined #osdev
<mjg> ye ye ye, that's fine
<heat> it's fine but it gets me nervous
<mjg> ye direct map is great, freebsd does not fully utilize it
<heat> it's super unsafe
<heat> it also only reaaallyyy works if you have a buddy allocator or something similar that can give out big chunks of pages
<heat> particularly, in a fast way
<heat> i have a list of pages, which doesn't particularly work
<mjg> descending into it is supposed to be rare
<heat> sure
<mjg> i would say for now just map pages as needed
<heat> but going down and allocating a slab is not something I want to be stupidly slow
<mjg> once it all works and whatnot i would revisit direct map as an optimization
dude12312414 has quit [Remote host closed the connection]
dude12312414 has joined #osdev
demindiro has quit [Quit: Client closed]
gildasio has quit [Ping timeout: 258 seconds]
gildasio has joined #osdev
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
pretty_dumm_guy has quit [Ping timeout: 265 seconds]
pretty_dumm_guy has joined #osdev
scaleww has quit [Quit: Leaving]
les_ has quit [Quit: Adios]
les has joined #osdev
vdamewood has joined #osdev
lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]
<heat> mjg, how do you implement kfree()? as in knowing what slab an object belongs to
<heat> I know how linux does it but that's not suitable here, going down the page tables to get a struct page isn't an option
gog has quit [Quit: byee]
<heat> I could try and place some slab struct inside the page itself but then I can't have variable-sized slabs
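The "slab struct inside the page itself" option heat mentions might look like the sketch below, which also shows why it rules out variable-sized slabs: the lookup works by masking the object address down to a fixed, known alignment. The names (`slab_header`, `slab_of`, `SLAB_SIZE`) and the 4 KiB size are assumptions made for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical: every slab is exactly one naturally aligned 4 KiB block. */
#define SLAB_SIZE 4096u

/* Header stored at the very start of each slab. */
struct slab_header {
    size_t obj_size;   /* size of the objects carved out of this slab */
    size_t nr_free;    /* free objects remaining */
};

/* Find the owning slab of any object by rounding its address down to the
 * slab alignment. This only works because SLAB_SIZE is a single constant;
 * with variable-sized slabs there is no one mask to apply. */
struct slab_header *slab_of(const void *obj)
{
    uintptr_t addr = (uintptr_t)obj;
    return (struct slab_header *)(addr & ~(uintptr_t)(SLAB_SIZE - 1));
}
```

A `kfree(ptr)` built on this would call `slab_of(ptr)` and read `obj_size` from the header, with no page-table walk or `struct page` lookup.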
gjnoonan has quit [Read error: Connection reset by peer]
gjnoonan has joined #osdev
vdamewood has quit [Read error: Connection reset by peer]
Arsen has quit [Remote host closed the connection]
qookie has quit [Remote host closed the connection]
vdamewood has joined #osdev
qookie has joined #osdev
Arsen has joined #osdev
nyah has quit [Ping timeout: 248 seconds]
[itchyjunk] has joined #osdev
<heat> ok freebsd handles that problem by simply not handling that problem
<heat> excellent