tacco has quit [Remote host closed the connection]
srjek|home has joined #osdev
srjek has quit [Ping timeout: 265 seconds]
[itchyjunk] has quit [Ping timeout: 252 seconds]
[itchyjunk] has joined #osdev
scoobydoo has quit [Read error: Connection timed out]
scoobydoo has joined #osdev
smach has joined #osdev
[itchyjunk] has quit [Remote host closed the connection]
srjek|home has quit [Ping timeout: 248 seconds]
smeso has quit [Quit: smeso]
smeso has joined #osdev
<geist>
agreed re: not having 512 bit wide alu. glad there's not a bunch of dark silicon dedicated to it
tarel2 has joined #osdev
<mrvn>
My Ryzen cpu can drive either sata or nvme, you have to pick one.
<mxshift>
Huh. I'd have to go pull up the Ryzen zen4 docs to verify the SATA omission. I'm 80% confident EPYC zen4 still shows many of the PCIe lanes can be switched to SATA
smach has quit []
smach has joined #osdev
heat has quit [Ping timeout: 268 seconds]
<moon-child>
disappointed at the slow gather/scatter
<moon-child>
I had some code with zen2 that was way slower with gather than with scalar loads. Pretty sure next time I touch that code I'll upgrade it to avx512 and it'll be faster; 8-way gather on intel is pretty much the same throughput per load as scalar loads
<moon-child>
(not on zen4, though, apparently. ¯\_(ツ)_/¯)
Goodbye_Vincent has joined #osdev
epony has quit [Ping timeout: 252 seconds]
smach has quit []
Ram-Z has quit [Ping timeout: 246 seconds]
scoobydoo has quit [Read error: Connection timed out]
scoobydoo has joined #osdev
m5zs7k has quit [Ping timeout: 250 seconds]
m5zs7k has joined #osdev
SGautam has joined #osdev
Ram-Z has joined #osdev
epony has joined #osdev
scoobydoo has quit [Read error: Connection timed out]
scoobydoo has joined #osdev
jjuran has quit [Read error: Connection reset by peer]
jjuran has joined #osdev
divine has quit [Ping timeout: 268 seconds]
divine has joined #osdev
opal has quit [Remote host closed the connection]
opal has joined #osdev
opal has quit [Remote host closed the connection]
opal has joined #osdev
jjuran has quit [Quit: Killing Colloquy first, before it kills me…]
jjuran has joined #osdev
GeDaMo has joined #osdev
zaquest has quit [Remote host closed the connection]
SGautam has quit [Quit: Connection closed for inactivity]
seer has quit [Ping timeout: 268 seconds]
vdamewood is now known as vinleod
vinleod is now known as vdamewood
opal has quit [Ping timeout: 258 seconds]
opal has joined #osdev
zaquest has joined #osdev
smach has joined #osdev
maxdev has joined #osdev
<maxdev>
helloo
<sham1>
Hello
<maxdev>
man it's been some time since i've been here
<maxdev>
does anyone know if reading the LAPIC id register has any side effects? i'm reading it a lot to identify which core i'm running on, and it's giving me a headache
<zid>
it's easier to do it back to front if you're doing strings, then you can accumulate the value easier and not have to actually make a list, but that's an optimization and the concept is the same
<jafarlihi>
Ok thanks
<heat>
ok so I need some feedback
<zid>
so if you had str[3] = {'8', '5', '1'} to represent 158, d1 = (str[0]-'0') * 3; str[0] = (d1%10)+'0'; d2 = (str[1]-'0') * 3 + d1/10; str[1] = (d2%10)+'0'; d3 = (str[2]-'0') ...
<zid>
You want it that way so that you don't end up writing your remainders to str[-1]
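
A minimal sketch of the back-to-front scheme zid is describing, assuming the number is stored as ASCII digits with the least significant digit first (so {'8','5','1'} is 158) and the buffer has room for whatever digits the final carry adds; the function name is made up for illustration:

    #include <stddef.h>

    /* Multiply a decimal number stored as ASCII digits, least significant
     * digit first, by a small factor. Carries only ever move towards higher
     * indices, so nothing is written before str[0]. Returns the new digit
     * count; the caller must have left room for the growth. */
    static size_t mul_digit_string(char *str, size_t len, unsigned factor)
    {
        unsigned carry = 0;

        for (size_t i = 0; i < len; i++) {
            unsigned d = (unsigned)(str[i] - '0') * factor + carry;
            str[i] = (char)('0' + d % 10);
            carry = d / 10;
        }

        while (carry) {
            str[len++] = (char)('0' + carry % 10);
            carry /= 10;
        }

        return len;
    }

Multiplying "851" (158 stored backwards) by 3 leaves "474" in place, i.e. 474 read least-significant-first, without growing the string in this case.
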
<heat>
I have lots of issues with holding locks and doing things for a loooong time with them held (imagine filesystem lookups, IO, etc)
<heat>
what are the standard patterns for solving this?
<heat>
I know linux does a lot of fuckery with flags and waiting on things futex-style
<zid>
I mean, you described what you were doing, but not that actual issue?
<zid>
If a long operation needs exclusion while you do it, it needs it. What's the *problem* your impl. causes?
<heat>
the issue is that imagine I'm holding the lock for /home/zid
<zid>
That your exclusion periods are not correctly fenceposted and you hold it for longer than you need to? That your locks are too expensive? etc
<heat>
you have 2 threads doing lookups and 1 thread doing writes (which involve hitting the fs as part of O_CREAT)
<heat>
the 2 threads that could do easy, quick lookups to cached dentries will end up being held back by the writer, which is effectively serializing things
<zid>
wouldn't you typically leave that to the reader to deal with? TOCTOU bugs etc
<heat>
the issue is that doing expensive things like IO when holding contested locks will effectively serialize things
<zid>
and if they want to avoid them, they use special interfaces like rename instead of rm; write
<heat>
where does TOCTOU come into play?
<zid>
that's why anyone would care about not being able to read data someone is updating
<zid>
or rather, care that they can, and maybe they shouldn't be able to
<heat>
ah, yes, UAFs?
<heat>
and similar bugs
<heat>
yes, that's a problem, which is why the lock is there
jafarlihi has quit [Ping timeout: 265 seconds]
<zid>
right, I'm saying usually you leave that to the application to deal with, by making them request special primitives that are safer
<heat>
but this crashes the kernel
<zid>
oopsie doopsie
<zid>
you never mentioned kernel crashes
<heat>
if you have UAF bugs you can crash the kernel
<heat>
simple
<heat>
which is why the lock is required to be there
<zid>
yea I hadn't figured out if you were being overzealous or underzealous yet
<zid>
isn't this typically where you'd use an RCU
<zid>
makes the alg lockless, but as a side-effect, also makes it.. lock safe
<zid>
nobody needs to write potentially buggy locking code
<heat>
my problem is that every time I hold the lock in a write-way (my dentry code uses rwlocks and not mutexes) I do something stupidly expensive
<heat>
usually filesystem->open(...), or filesystem->creat(...)
<heat>
you get the idea
<zid>
RCU also helps with that, no?
<heat>
no, AFAIK RCU requires preemption to be disabled
<zid>
don't insert the new file into the dir until it's made and finished
<zid>
rather than locking the dir
<zid>
making the file, unlocking the dir
<heat>
right, but then you have concurrent requests for the same data
<zid>
there's no concurrency issue there
<heat>
or concurrent creats
<heat>
you *do not* want a race condition between creats, they need to be serialized
<heat>
same for renames, yadda yadda
<zid>
there's a gajllion lockless inserts
<heat>
there's no lockless filesystem
<zid>
bear in mind there's two things at play here
<zid>
the bytes on the drive, and the structures in memory
<heat>
anyway, linux solves these kinds of issues by just creating "incomplete" structures and waiting on some flag using a wait queue or a futex-in-the-kernel thing
<zid>
you'll still want to serialize/lock the actual disk update so that two threads aren't shitting on each other, via whatever mechanism you want, dedicated worker thread or whatever, but the actual in-memory versions can have totally different semantics
<zid>
cmpxchg doesn't really exist for hard disks
<heat>
and their solution is OK but it seems complex
<zid>
it does for cpus though
<heat>
it does btw
<heat>
nvme has cmpxchg
<zid>
page level, or byte?
<heat>
page
<zid>
pretty big lock :D
<heat>
unless I'm talking out of my ass here, but I specifically recall NVMe having commands for that
<heat>
anyway
<zid>
I'd probably use that to update the inode or whatever, but the thread organizing that, just reading from an in-memory RCU, that other threads stomp on
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<heat>
do you get my problem?
<zid>
I think you just don't believe in yourself
<zid>
You think you're not capable of writing it lockless *in memory* because the disk version *does* need locks
<heat>
you can create an in-memory version, but if that in-memory version requires an expensive op you just cant/shouldn't hold the lock
<heat>
and if you create an incomplete version of the structure, you'll need to wait for it to be complete (which is fine)
<zid>
right, and if there's no lock, there's no locking bugs
<zid>
you're just adding shit to a queue for the 'serialize to disk' process to happen
<zid>
and the only "hard" part is localized into one spot
<zid>
the dequeuer
<zid>
not "every single appender"
<heat>
no
<heat>
you *need* to hit the filesystem
<heat>
you can't create an in-memory inode (and not hit the filesystem), and then go "oopsie, we had no inodes left"
<heat>
does it need to hit the disk? no, thanks to the buffer cache, etc
<heat>
but you will probably need to read or lookup the buffer cache's version of it
<heat>
and this is just the traditional case
<clever>
i can see multiple ways i might implement such a thing
<heat>
if you go for NFS, etc you'll be 10x more fucked
<clever>
first layer, might be to have a count of free inodes, and a count of allocating inodes
<clever>
grab a lock, check if free > allocating, then allocating++; and drop the lock
<clever>
now you can go off and build the inode, before you even know what index it is, and be guaranteed that one is reserved
<heat>
but you do need to allocate
<heat>
what if you fstat? and you have no st_ino?
<clever>
yeah, at a later stage, you will need to grab some other lock, scan the inode table for an empty slot, and allocate it properly
<pitust>
you can also do this without a lock, by using some atomic stuff
<clever>
and then you need to grab 2 locks, and allocating--
<heat>
yes
<heat>
but all of this is part of your i_fops->creat(...)
<heat>
this is filesystem stuff
<clever>
pitust: with atomics, you could increment allocating, but what about the risk of 2 people incrementing it (properly atomically), but now it exceeds free!
<heat>
cmpxchg
<clever>
ah, yeah that can work
<clever>
so you increment it in a non-atomic copy, and then only store if you won the race
<pitust>
or you can get the old value, and if that exceeds the max, subtract and retry
<pitust>
although if you want the old value GCC and clang still have to use a CAS loop
<pitust>
(on x86)
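
A rough sketch of the reservation step clever and pitust are outlining, using the cmpxchg loop pitust mentions: one compare-exchange reserves an inode against the filesystem-wide free count, and the later per-group scan (under that group's lock) picks the concrete slot. The counter and function names are hypothetical:

    #include <errno.h>
    #include <stdatomic.h>

    /* Filesystem-wide free inode counter (hypothetical). */
    static _Atomic unsigned long free_inodes;

    /* Reserve one inode without yet knowing its index. Fails with -ENOSPC
     * if the fs is out of inodes. The caller later scans an inode group
     * under that group's lock to pick the concrete slot. */
    static int reserve_inode(void)
    {
        unsigned long old = atomic_load(&free_inodes);

        do {
            if (old == 0)
                return -ENOSPC;
            /* On failure the CAS reloads 'old', so the bound is re-checked. */
        } while (!atomic_compare_exchange_weak(&free_inodes, &old, old - 1));

        return 0;
    }

If the later group scan fails, the reservation has to be handed back by incrementing the counter again.
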
nvmd has joined #osdev
<clever>
heat: the benefit i can see, to separate 'allocating' and 'allocated' stages, is if your inode table is broken up into groups
<clever>
you have a very quick cmpxchg based increment, to reserve an inode fs wide
<heat>
but there's no benefit to doing that
<clever>
then each core can grab a lock on a different inode group, in parallel
<heat>
this is generic code
<clever>
and can scan that group for a free slot, in parallel
<clever>
but from memory, i believe ext2/3/4 and xfs have their inode tables split up into groups
<heat>
THIS is my problem
scoobydoo has quit [Read error: Connection timed out]
<heat>
I can reverse the cheap and expensive part, but I need some way to wait for it to be complete
<heat>
which is possible, but clunky and non-standard
scoobydoo has joined #osdev
<heat>
and where you see "dentry" you can also imagine your page cache or something
xenos1984 has quit [Ping timeout: 268 seconds]
<heat>
parallel lookups to the same thing will need to wait for completion
<heat>
and in an ideal, non-serialized world, parallel lookups to other things will be able to go concurrently because no one is holding the lock
<clever>
yeah, thats why i was thinking of a per inode group lock
<heat>
dude
<clever>
but you could do it lockless, and just retry upon failure
<heat>
that's soooooooooooooooooooooo much lower level than what I'm talking about
<heat>
also unavoidable
<heat>
basically all I'm asking is if there's a known pattern for this
<heat>
I can't use "lock and do thing" because that is slow
<zid>
I need to get someone with more authority than me to repeat the bit I said earlier :P
<heat>
i dont understand what you mean
<heat>
lockless is not the issue here
<zid>
stop trying to consider the disk while thinking about the data structures
<heat>
my point here is that the data structures will only be complete when you hit the disk
<zid>
why do you need complete ones?
<heat>
two parallel lookups that hit the disk will need complete ones
<zid>
If I issue three writes all hitting the same directory, if it's possible for my writes not to affect each other on the final disk image, the fancy data structures should handle that
<zid>
if they don't, that's a missed optimization at best
<heat>
"here's the inode I found! caller: which inode? lookup: i dunno"
<heat>
lets ditch the dentry example
<heat>
two threads try to look up page 0 of file foo (pagecache), one allocates the page and starts the IO, the other one will need to wait for the page to become filled
<heat>
if you do the IO while holding the page cache's lock, you serialize everyone, so doing slow things outside the lock is the only valid approach
FreeFull has joined #osdev
<heat>
if you do them outside of it, only threads trying to look up the same thing will get blocked, which is the desired behaviour
<heat>
right?
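
A userspace model of the pattern heat is describing, assuming a toy fixed-size cache and a sleep standing in for the IO: the cache lock only covers publishing a placeholder, the slow fill happens outside it, and only lookups of the same page wait. All the names, the direct-mapped array, and the single condition variable are illustrative, not any particular kernel's API:

    #include <pthread.h>
    #include <stdbool.h>
    #include <unistd.h>

    struct cached_page {
        bool present;       /* slot is published in the cache */
        bool done;          /* the fill IO has finished */
        char data[4096];
    };

    static struct cached_page cache[64];    /* toy cache, collisions ignored */
    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cache_cv = PTHREAD_COND_INITIALIZER;

    static void slow_read_from_disk(struct cached_page *p) { (void)p; sleep(1); }

    struct cached_page *pagecache_get(unsigned index)
    {
        struct cached_page *p = &cache[index % 64];

        pthread_mutex_lock(&cache_lock);
        if (!p->present) {
            /* We are the filler: publish the placeholder, drop the lock,
             * and do the slow part. Lookups of other pages proceed freely;
             * only lookups of this page end up waiting. */
            p->present = true;
            pthread_mutex_unlock(&cache_lock);

            slow_read_from_disk(p);

            pthread_mutex_lock(&cache_lock);
            p->done = true;
            pthread_cond_broadcast(&cache_cv);
            pthread_mutex_unlock(&cache_lock);
            return p;
        }

        /* Someone else is (or was) filling it: wait for completion. */
        while (!p->done)
            pthread_cond_wait(&cache_cv, &cache_lock);
        pthread_mutex_unlock(&cache_lock);
        return p;
    }

A real page cache would hang a per-page wait queue, or a futex-style wait on the flag word as heat suggests, off the entry instead of broadcasting one global condvar.
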
xenos1984 has joined #osdev
seer has joined #osdev
scoobydoo has quit [Read error: Connection timed out]
scoobydoo has joined #osdev
terminalpusher has quit [Remote host closed the connection]
nvmd has quit [Quit: WeeChat 3.6]
nvmd has joined #osdev
gog has joined #osdev
renopt has joined #osdev
dude12312414 has joined #osdev
SGautam has joined #osdev
<mjg>
heat: so it was interrupts after all!
<heat>
yeah
<heat>
which kind of begs the question "why"
<heat>
waking up threads isn't supposed to be slow
<mjg>
that's not what 'begging the question' means
<heat>
maybe it's just natural and not an anomaly
smach has quit [Remote host closed the connection]
<heat>
oopsie
<heat>
you get what i mean
<mjg>
ye, just sayin people misuse this bit so often i repeat it to myself to not fall back :>
smach has joined #osdev
<mjg>
i had a look at your code, all the write locking should be whacked man
<mjg>
when collecting the graphs
<mjg>
erm, you have a global lock which you read lock on each cpu, that's super pessimal
<mjg>
bare minimum, still pessimal, you can implement locks per-cpu so that at least they don't interfere with each other
<mjg>
then when disabling the mechanism you flip the flag to off and wait for all locks to not be taken
<heat>
it was the quickest solution
<mjg>
correct way requires memory barriers and whatnot and is not warranted
<mjg>
dude the above can be coded in the same time + 2 minutes
<mjg>
:>
<heat>
:)
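
A rough shape of the per-cpu scheme mjg is suggesting, with seq_cst atomics standing in for the "memory barriers and whatnot" a production version would tune; NCPUS, the padding and the helpers are assumptions for the sketch:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <sched.h>

    #define NCPUS 64

    static _Atomic bool mechanism_enabled = true;
    static struct {
        _Atomic int readers;
        char pad[60];           /* keep each counter on its own cache line */
    } percpu[NCPUS];

    /* Reader side: touch only this cpu's counter, so readers on different
     * cpus never bounce a shared line. Returns false if the mechanism is
     * being turned off. */
    static bool reader_enter(int cpu)
    {
        atomic_fetch_add(&percpu[cpu].readers, 1);
        if (!atomic_load(&mechanism_enabled)) {
            atomic_fetch_sub(&percpu[cpu].readers, 1);
            return false;
        }
        return true;
    }

    static void reader_exit(int cpu)
    {
        atomic_fetch_sub(&percpu[cpu].readers, 1);
    }

    /* Disable side: flip the flag to off, then wait for every cpu to drain. */
    static void mechanism_disable(void)
    {
        atomic_store(&mechanism_enabled, false);
        for (int cpu = 0; cpu < NCPUS; cpu++)
            while (atomic_load(&percpu[cpu].readers) != 0)
                sched_yield();
    }

Readers on different cpus never touch each other's line, and the disable path only pays when it is actually used.
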
<heat>
anyway I've been tackling bigger issues
<heat>
mainly trying to remove the dentries' rwlock
<heat>
I want a rwspinlock
<mjg>
R C... don't want to trigger anyone
<heat>
lmao
<heat>
you mean EBR
<mjg>
believe it or not, rw lock there should perform just fine at the measly 4 threads you got
<Griwes>
U seem to be really careful about it
<mjg>
in fact it will be ok-ish until about 16
<mjg>
it performs way worse than i'm describing because the implementation you have right now sucks
<mjg>
dentry or not, you will keep running into it, so that should be fixed
<heat>
I've switched it around a bit
<heat>
i take less locks
<heat>
and it seems to be similar to other kernels
<bslsk05>
github.com: Onyx/rwlock.cpp at 77853fcdda34cdc256ed1a3bf5cc7daa9c950d9e · heatd/Onyx · GitHub
<mjg>
it is not since you still take a spinlock just to wait for it
<heat>
something funny I did notice is that vfsmix performs way better when the "let's try to reschedule" code is commented out because there's a lot more idle
<mjg>
this actually may be worse than openbsd :-P
<heat>
mjg, seems to be how lunix does it
<heat>
:)
<mjg>
what
<heat>
yes
<mjg>
where
<mjg>
rwsem?
<heat>
yes
<mjg>
are you sure you did not misread it
<heat>
yup
<mjg>
fallback, the fucking bottom, definitely does it to interlock going off cpu vs unlock
<mjg>
there is also a hack where pending writers serialize on a hand-rolled mcs lock
<mjg>
but that's not the same thing
<heat>
what fallback?
<mjg>
slowpath, call it whatever you want, i have not seen that code in 5 years
<heat>
I tried to look at freebsd but that code was bonkers
<mjg>
ye
<mjg>
you missed this part
<mjg>
if (rwsem_can_spin_on_owner(sem) && rwsem_optimistic_spin(sem)) {
<heat>
i didn't
<heat>
<heat> except no spinning
<mjg>
ok, miscommunicated
<mjg>
the no spinning bit makes/breaks performance man
<heat>
does it?
<mjg>
yep
<mjg>
look for the commit which introduced it
<mjg>
that or the mail thread has numbers
<mjg>
but wait, they don't ever spin for *readers*?
<mjg>
that's defo pessimal
<mjg>
but i understand why
<heat>
they do
xenos1984 has quit [Read error: Connection reset by peer]
<mjg>
where
<heat>
sorry, not spinning
SpikeHeron has quit [Ping timeout: 250 seconds]
<heat>
* Reader optimistic lock stealing.
<mjg>
so the general problem with rw locks where you can go off cpu while holding them
<mjg>
is that there is no sensible way to track if any of the readers is off cpu
<mjg>
so then what
<mjg>
[there are funny ways to try to approach it, but i'm not fond of anything i came up with and i'm unaware of anyone coming up with anything better]
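
For reference, the idea behind the rwsem_can_spin_on_owner() check mjg quoted is roughly this: a contending writer spins only while the current owner is observed running on a cpu, and sleeps as soon as it is not. This is a hypothetical sketch of that policy, not Linux's code, and as mjg notes it has no good reader-side equivalent because there is no single owner to watch:

    #include <stdatomic.h>
    #include <stdbool.h>

    struct thread;                        /* opaque; owned by the scheduler */

    struct mutexish {
        _Atomic(struct thread *) owner;   /* NULL when unlocked */
    };

    /* Hypothetical helpers: */
    bool thread_is_on_cpu(struct thread *t);      /* currently running? */
    void sleep_until_unlocked(struct mutexish *m);
    void cpu_relax(void);

    static bool try_acquire(struct mutexish *m, struct thread *self)
    {
        struct thread *expected = NULL;
        return atomic_compare_exchange_strong(&m->owner, &expected, self);
    }

    static void acquire(struct mutexish *m, struct thread *self)
    {
        while (!try_acquire(m, self)) {
            struct thread *owner = atomic_load(&m->owner);

            if (owner && thread_is_on_cpu(owner))
                cpu_relax();              /* owner is running: likely to release soon */
            else
                sleep_until_unlocked(m);  /* owner off cpu: spinning just burns time */
        }
    }
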
<mjg>
basically this multicore stuff likes to suddenly collapse
<mjg>
in terms of performance
<heat>
right, but it effectively is
<heat>
you're all waiting for the last lock
<mjg>
and you are waiting *longer* if the owner is off cpu
<mjg>
there is a huge multiplication factor here
<mjg>
if that going off cpu could have been avoided, you have a dramatic win
<heat>
how big a win?
<mjg>
let's say the resource is contested and you have 32 cpu threads, which is not much
<mjg>
20 of which want the lock
<mjg>
so whatever extra delay incurred by the lock owner is multiplied by 20
<mjg>
and even then they will be serializing on each other
<mjg>
you went from possibly slow but tolerable to a non-starter
<heat>
right
<mjg>
i can't stress enough how this likes to degrade
<heat>
but if your locks are spin-happy you're also just wasting cpu time for something that may very well take a long time
<mjg>
to illustrate with a real example, there was a point where freebsd was ok-ish at 80 threads when doing buildkernel
<mjg>
on a 4 socket westmere
<mjg>
then it was booted on 4 socket broadwell, 128 threads
<mjg>
and the same workload collapsed into oblivion
<mjg>
heat: i'm not saying every single instance of spinning is good, just that in practice, spinning tends to win
<mjg>
ultimately all locking is just performance damage control, the moment you contend you are already losing
jimbzy has joined #osdev
<mjg>
and in fact you are losing already by having a shared lock, even if it is not contested, as you are bouncing it
<gog>
jimbzy: sosig
xenos1984 has joined #osdev
<jimbzy>
SOSIG
<mjg>
heat: all this aside, i propose a game for you
<mjg>
heat: to make selected benchmarks, like vfsmix, scale better than on openbsd
<mjg>
heat: you in?
<jimbzy>
How are you doing, gog?
<gog>
jimbzy: pretty well actually
<jimbzy>
Love it!
<zid>
I had sosig the other day, in a bnu
<zid>
it was pig slices today
<zid>
You know that noise of people running wood through a huge band saw? *nrrrrrrwwww*
<zid>
Like that
<heat>
mjg, sure
<heat>
sounds good
<mjg>
heat: right on
<mjg>
heat: so i guess you should start with getting an openbsd vm
<heat>
aw
<heat>
im not in anymore
<mjg>
(:
<heat>
openbsd is CRINGE
<mjg>
OH
<heat>
oh
<mjg>
good thing theo is not on the channel
<heat>
what the fuck
<heat>
why are there so many installation options
<mjg>
they wanna fuck with you
<heat>
anyway, something I want to ask you
<heat>
how does fbsd do lookup when you need to hit the disk?
<heat>
i assume your dentries have some sort of rwlock?
<mjg>
there is a fallback to locked lookup
<heat>
point being that I want to replace all my rwlocks with rw spinlocks and do the IO outside the lock
<mjg>
ye that's sensible, but then you will still need a way to serialize on this
<heat>
yes, I have that
<mjg>
kind of a dedicated io lock, so to speak
<heat>
io lock? to protect what?
<mjg>
say you have 2 threads doing the same lookup and finding they need to i/o to proceed
<mjg>
then what
<heat>
oh yeah sure
<heat>
I have a futexish thing
<heat>
I'll make them wait on an address
<mjg>
whatever syncs them is fine
<mjg>
basically the point is to avoid repeat i/o
<mjg>
and not get false negatives
<heat>
I think I'll still need to repeat the lookup if it fails right?
<heat>
you can't assume failure = ENOENT
<mjg>
you do distinguish "we have no entry" from "there is no file like that" from "we have an entry which says there is no file like that"
<heat>
i dont have negative dentries yet
<mjg>
now that i wrote it, do you cache results that there is nothing named like that?
<mjg>
ouch
<heat>
cry path resolution man
SpikeHeron has joined #osdev
* mjg
cries a river
<mjg>
look if you wanna beat openbsd, you have to step up
<mjg>
as is you are probably around their level, unless they fixed something in the last 3 years since i looked
<heat>
lmao
<heat>
i assume that if negative dentries existed, error != ENOENT would mean you discard the negative dentry? and then concurrent lookups would need to retry
<heat>
...or I could store the errno in the negative dentry, but I don't know how iffy that is
<mjg>
if a negative entry exists, where are you getting the error from?
<mrvn>
heat: you get lots of code that searches PATHs, and always for similar files, like libc.so. Seems like it would be useful not to have to read the disk every time.
<mjg>
lookup succeeded without i/o
<mjg>
you just return ENOENT to the caller
<heat>
mrvn, i know that's what a negative dentry is
<mrvn>
mjg: the first lookup produces and error. You store that and return it every time
<mjg>
anyway just make sure you invalidate such entries on file creation, rename, mkdir etc
<mrvn>
and have an option for the FS to disable or limit it. Like NFS.
<heat>
I was assuming negative dentries would only be for non-existent files, vs lookups that errored out
<mjg>
i have difficulty parsing this
<mrvn>
heat: your choice. But why would you do a second lookup on EACCES?
<mjg>
you create a negative entry in the name cache when the fs told you it does not have the requested name
<heat>
filesystems are not returning EACCES
<mrvn>
heat: then the problem doesn't arise. Note: NFS
<heat>
imagine -EIO
<heat>
do I cache that open("stupid.jpeg") returns EIO?
<heat>
is that a cacheable return value?
<mrvn>
questionable.
<mjg>
no
<mjg>
you cache when the fs tells you it got nothing, not when something failed to even find out
<mrvn>
A user can easily DOS you by requesting that over and over and causing your disk and SATA controller to constantly reset.
<heat>
ok, so ENOENT only
<mjg>
i would say let the filesystem add an entry for now
<mrvn>
or the block cache or block device. there are many places you can cache
<mrvn>
Does anyone have a FS interface where you request stat() for a whole path at once and the FS then does a path walk and returns an array?
<heat>
no
<heat>
you /could/
<mrvn>
I kind of want to keep the round trips for path walk small.
<mrvn>
Maybe I should add the idea of an agent. The kernel doesn't ask the FS to stat a file but sends it an agent (function pointer basically) that then runs under the FS process to do a path walk.
<heat>
"fstype: 4.2BSD"
<heat>
am I supposed to be scared mjg?
<mjg>
:)
<mjg>
no
<mjg>
note that they are going to have a single-threaded slowdown vs you due to security mitigations
<mjg>
however, once multicore performance is better, you can look into disabling that bit
<mjg>
mrvn: but who needs that, modulo userspace realpath, which you should implement in the kernel instead
pretty_dumm_guy has joined #osdev
<Griwes>
Implementing a thing _in the kernel_?! Travesty
<heat>
musl used to use linux's kernel realpath but ended up rolling their own because of some issues
<heat>
openbsd doesn't fucking boot
<heat>
wonderful
<mjg>
kvm?
<heat>
yeah
<mjg>
i would do a quick google, chances are decent you can flip something easily
<heat>
booting from the hard drive (after the installation) said No active partition
<heat>
I also picked GPT despite them saying it was possible it couldn't boot sooo
<heat>
retrying with MBR
<mjg>
were you fucking with the installer?
<mjg>
right
<heat>
no
<mjg>
go all defaults man
<heat>
openbsd is a fragile flower
<heat>
(2022 and im using a fucking MBR)
<mjg>
well it does have 2005 scalability....
<mjg>
obsd kernel
<heat>
so do I, but I have gpt support
<mjg>
i hear they added something which lets them get flamegraph tho!
<heat>
ah that was it
<heat>
they don't support GPT disks
<mjg>
?
<mjg>
that would be weird
<mjg>
well i'm not looking into this bit
<heat>
well, it doesn't boot
<heat>
but the MBR installation does sooo
<mjg>
you will need to install 'gmake' and use that instead of make
<heat>
can you pastebin the vfsmix again?
<heat>
wait, no need
<mjg>
you will need your hacked main.c as well
<mjg>
do you have any means to move files between onyx and the rest?
<mjg>
i guess you had to, to get wis working
<heat>
usually I just craft a new fs, it's the easiest
<heat>
i do have a local copy soooo I'll just pastebin it myself
<mjg>
you got ext2?
<heat>
yes
<mjg>
well you still need to patch main.c :-P
<heat>
openbsd has 4.2BSD which is highly superior