klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
<heat> most europeanest sport
<theWeaver> it's fucking cringe tho
<heat> why
<theWeaver> because of danish people i guess
<heat> you have a ball, dribble, jump, throw it towards the goal, hopefully score
<theWeaver> yeah but you let danes play and danes are cringe
<heat> although AFAIK most of them wear pants which is a certified cringe
<heat> sports should be played with shorts and not pants
<theWeaver> are we talking british english pants or american english pants
<heat> american
<theWeaver> idk why i asked, both options would be pretty cringe
<FireFly> theWeaver: well is like, cricket the only noncringe sport or what :p
<theWeaver> FireFly: no, as I said, cricket is fucking cringe as fuck
<theWeaver> re-read the scrollback pls
<FireFly> oops
<heat> chad sport: gymnastics
<heat> which one? all of it
<theWeaver> rn the only non-cringe sport i can think of is basketball
<theWeaver> basketball is just straight baller
<theWeaver> no question
<FireFly> pretty sure danish people do basketball too :P
<heat> nba basketball is cringe, ncaa basketball is cringier
<theWeaver> basketball is so cool even danes can't make it cringe
<heat> all american sports are just poor excuses for ad breaks
<theWeaver> heat: your mum is cringe
<FireFly> ok what about floorball? :p
<heat> no urs is cringe
<theWeaver> heat: yeah she's pretty fuckin cringe
<theWeaver> doesn't change the fact, yours is too
<heat> OH hockey is legit
<theWeaver> actually roller derby is non-cringe
<FireFly> oh yeah
<FireFly> roller derby's cool
<theWeaver> roller derby is mega cool
<theWeaver> it's the most lesbian sport ever and it rules
<heat> you know what's REALLY cringe?
<theWeaver> heat: what
<heat> snooker and cycling
<heat> old people sports
<theWeaver> idk i feel like golf is even worse
<gog> golf is the wrost
<gog> golf is evil
<theWeaver> but snooker and cycling are pretty cringe yeah
<heat> snooker is like the only sport where the top 5 is composed of old, balding british men
<heat> with a belly ofc
<FireFly> billiard is fine in a casual setting :p
<theWeaver> pool is cool
<theWeaver> snooker is cringe
<FireFly> oh idk the precise differences
<theWeaver> FireFly: if it helps i'll repeat it
<theWeaver> the cool one is pool
<theWeaver> the cringe one is snooker
<FireFly> :p
<heat> chess boxing is mega cringe
<heat> just nerds trying to be cool
<theWeaver> ... what the fuck is chess boxing
<FireFly> that's fine :p
<heat> it's exactly what it sounds like
<FireFly> theWeaver: alternating short rounds of boxing and chess until either checkmate or knockout
<theWeaver> i can't conceive of how you even mix those two
<theWeaver> FireFly: what
<theWeaver> are you serious
<FireFly> the idea being that you probably make poorer moves after having been beaten for a bit
<FireFly> yes :p
<FireFly> idk people are allowed to do silly things
<FireFly> I don't mind :p
<theWeaver> i'm not saying they're not allowed
<theWeaver> i just don't really get it
<theWeaver> but then people do all sorts of stupid shit they shouldn't be allowed to that i do get
<FireFly> like what?
<heat> sniffing glue
<theWeaver> voting for tories
<theWeaver> wait shit
<theWeaver> no not that one, i don't get that one
<heat> hahaha
<theWeaver> (definitely stupid and shouldn't be allowed tho)
<heat> labour's now-annual how-to-lose-an-election is definitely a sport
<FireFly> "The goal of eight-ball, which is played with a full rack of fifteen balls and the cue ball, is to claim a suit (commonly stripes or solids in the US, and reds or yellows in the UK), [...]" wait what, red/yellow instead of solid/striped o.o
<theWeaver> someone needs to punch Keith in the dick and make him resign
<theWeaver> useless motherfucker
<heat> who's keith?
<theWeaver> keir starmer
<FireFly> oh yeah rainy island politics is very weird
<heat> not-a-sport: chess, poker
<FireFly> from what I gather from the occasional updates I hear of it
<heat> also darts are cringe
<theWeaver> FireFly: yeah must be strange for someone like you who lives in a partially sane country and comes from a halfway decent one
<heat> britain is relatively sane
<theWeaver> relative to what
<theWeaver> the USA?
<heat> britain is only "omg totally insane lads can't take this" to british people
<theWeaver> because if so yeah kinda but that's a dangerously low bar
<heat> 3/4 of europe at least
<theWeaver> i've yet to find a european country that was more fucked up than the UK
<theWeaver> france maybe
<heat> france, italy, germany, portugal, spain, all of eastern europe
<theWeaver> fuck off, germany is definitely saner
<theWeaver> spain too
<heat> spain is not sane
<FireFly> not sure I think france is more fucked up than the UK tbh
<FireFly> heat: that was not the claim :p
<heat> spain has like 3 separatist movements going on at the same time
<mjg> lol @ your separatist movements
<mjg> in poland there are at least 3 separate monarchist movements
<mjg> one of them has a self-proclaimed regent
<mjg> apart from all of this there was a self-proclaimed king
<mjg> i don't even know who to bow to anymore
<heat> mjg for king
<mjg> i would teach limitations of big O in elementary school
<mjg> feel the tyranny
<heat> history is just unix geezers who wrote PESSIMAL code
<mjg> there are also non-unix geezers who did the same thing man
<heat> and readings of git blame
<theWeaver> tbh, politics be fucked up
<mjg> old unix geezer is the today's webdev
<FireFly> "remember that asymptotic behaviour doesn't necessarily specialise to a specific choice of n, kids" "..can we learn about multiplication now, teacher?"
<mjg> FireFly: "mention balancing your checkbook again. i dare you, i double dare you motherfucker"
<theWeaver> mjg: what?
<heat> theWeaver, germany has insane politics. like late stage UK politics
<mjg> theWeaver: what what
<theWeaver> mjg: what what, in the butt?
<FireFly> germany has dumb politics but doesn't the UK have even more of that? :p from my POV at least
<mjg> theWeaver: no propositioning on a sunday
<FireFly> I mean hey rainy island even went full brexit
<mjg> FireFly: what's the german equivalent of farage?
<theWeaver> mjg: ooookaaaaaaaaaaaay
<heat> UK has a two party system with one party dominance and a bunch of smaller, separate parties
<heat> germany has CDU
<theWeaver> heat: CDU is still not as bad as the tories
<FireFly> in theory it's a SPD/CDU two-major-parties one-on-each-side system though, no?
<geist> okay... so
<FireFly> lessee
* geist points to the stay-on-topic sign
<FireFly> ..reasonable yes
<theWeaver> there's a topic in this channel?
* theWeaver just uses #osdev to shitpost in
<geist> we really should make a #osdev-offtopic channel
<geist> yeah please dont
<mjg> so fun fact: a naive sse2 memcpy *loses* to naive movs for sizes up to ~24
<mjg> the one found in bionic
<heat> i mean wasn't #offtopia #osdev-offtopic?
slidercrank has quit [Ping timeout: 255 seconds]
<mjg> [note: no sse used below 16]
<geist> heat: i didn't parse that sentence
<heat> i think current #offtopia was #osdev-offtopic a few years ago
<geist> well, may not have survived the move to libera
<mjg> what on earth is #Offtopia
<geist> yah i dunno what that is
<heat> it's an offtopic channel with a bunch of #osdev people in it
<mjg> what's the signal:noise ratio on that one
<heat> anyway re: memcpy, bionic sse2 string ops aren't that great
<mjg> i can tell you a well guarded secret: freebsd developer channel is named #sparc64, which is extra funny now that the arch is no longer supported
<mjg> heat: agreed
<geist> surprised it tried to use sse on anything smaller than say 64 bytes or so
<heat> blind rep movsb can beat its fancy sse2 memcpy
<heat> in fact, it mostly does
<geist> mjg: seems like a pretty good way to avoid lookyloos
<mjg> i'm pretty sure glibc uses simd as soon as it can
<geist> but anyway bionic not having an optimized x86 is probably not that surprising, considering android on x86 is not that big of a thing
<mjg> so either 16 or 32 depending on instruction set
<heat> geist, it does, contributed by intel
<geist> exactly.
<mjg> quite frankly i would expect that code to be slapped in from whatever internals they had
<geist> and then probably promptly dropped on the floor
<mjg> probably shared with glibc to some extent at the time
<heat> glibc's string ops code is nuts
<mjg> so it's not like it was coded up by an intern
<heat> the way they have it, avx512 code is just avx code which is just sse code but all with different params
<mjg> these ops have hardcoded parameters for one uarch
<mjg> i'm guessing the asm over there is what glibc would have used at the time for said arch
<mjg> extracted from entire machinery
<geist> so here's a question: microarchitecturally speaking is it *always* a good idea to have the bestest fastest possible memcpy
<geist> ie, given that you have a cpu that has potentially a lot of things in flight, or prefetching this and that
<mjg> but what makes the besterest mate
<geist> does shoving the maximum amount of data through the memory subsystem per clock negatively affect other things that may be going on at the same time
<heat> mjg, btw i have experimentally verified that the avx2 memset is really solid
<geist> or even competing against other hyperthreads
<geist> well, i'd say bestest as in 'maximum number of bytes/clock'
<heat> it doubles the bandwidth of sse2 memset
<mjg> geist: that's a funny one
<heat> so if you had avx memcpy you could potentially also do double
<geist> i'm sure the answer is probably yes, but there are i'm sure sometimes downsides, kinda like how you mentioned the clzero on AMD can saturate one socket
<heat> and that's more or less what glibc does too
<mjg> geist: short answer is 'who the fuck knows', in practice people go for the fastest possible
<mjg> i will note all benchez i had seen do stuff in isolation
<geist> yah i mean in lieu of anything else, fastest > not fastest
<geist> hypothetically a fast memcpy competes negatively with SMT pairs that are off running 'regular' code at the same time, for example
selve has quit [Remote host closed the connection]
<mjg> i also have to note that glibc has tons of uarch-specific quirks
<mjg> in its string ops
<mjg> thus i found it plausible that they damage-control concerns like the above
* geist nods
<mjg> to the point where you end up with a net win
selve has joined #osdev
<mjg> that said, personally i don't have good enough basis to make a definitive statement on the matter
<geist> yah was more of a thought experiment than anything else
<geist> one for which there isn't a solid answer probably
<mjg> i'll note one common idea is to roll with nt stores
<geist> or there is an answer, microarchitecturally, in very specific situations but in aggregate it is a win
<mjg> past certain size
<mjg> which is already a massive 'who knows'
<geist> yah that probably helps. i assume for example that NT stores dont chew up lots of write queue units
<mjg> the folks at G had the right idea with their automemcpy paper
<mjg> instead of hoping for generic 'least bad everywhere', they created best suited for the workload
<geist> if the cpu can track say 16 outstanding writes, and some memcpy comes along and fills at 16, then you probably have to wait for all the previous ones to finish
<geist> or maybe NT stores only use one at a time
<geist> while the other writebacks can finish
<mjg> well really depends when you start doing them
<geist> and similarly if the memcpy is slamming the load units, then the cpu may not internally compete well with it for preemptive reads
<mjg> past some arbitrary threshold or perhaps when you know the total wont fit the cache
<mjg> ultimately ram bandwidth is not infinite either
<mjg> as usual the win is to not do memcpy if it can be helped :p
<geist> re: the discussion of SMT static vs dynamic assignment of resources, iirc the ryzen at least has at least some amount of static assignment in the load/store units, i think
<geist> so maybe they avoided the problem by chopping it in half there
<geist> not so much the load/store units, but the amount of outstanding transactions i think
<mjg> so i may be able to get a sensible data point soon(tm). as noted few times freebsd libc string ops don't use simd, but i can plop in some primitives and see what happens
<heat> ohhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh the llvm-libc string ops are theirs (automemcpy)
<mjg> btw facebook has simd memcpy and memset apache licensed
<bslsk05> ​github.com: folly/memcpy.S at main · facebook/folly · GitHub
<mjg> interestingly this bit:
<mjg> .L_GE2_LE3: movw (%rsi), %r8w movw -2(%rsi,%rdx), %r9w movw %r8w, (%rdi) movw %r9w, -2(%rdi,%rdx)
<mjg> erm
<mjg> .L_GE2_LE3: movw (%rsi), %r8w movw -2(%rsi,%rdx), %r9w movw %r8w, (%rdi) movw %r9w, -2(%rdi,%rdx)
<mjg> sigh
<heat> that's sick
<mjg> welp you see which one
<mjg> this bit used to suffer some uarch bullshit penalty
<mjg> i don't know about today
<moon-child> 'in some implementations they're the same function and legacy programs rely on this behavior' sigh
<moon-child> penalty for what?
<geist> hah i love how that folly project doesn't even attempt to architecturally isolate
<mjg> you had to movzwl to dodge it
<geist> guess facebook dont do arm
<mjg> geist: given their comments it looks like they only do skylake :p
<geist> yah
<mjg> moon-child: partial register reads
<mjg> moon-child: this guy: movw (%rsi), %r8w and this guy: movw -2(%rsi,%rdx), %r9w
<mjg> moon-child: erm, stores to
<moon-child> oh right yes
<mjg> i would not be surprised if this was still a problem
<mjg> there is so much bullshit to know of, it is quite frankly discouraging to try anything
<geist> yah really helps to only worry about one microarchitecture (skylake-x)
<mjg> guaranteed fuckUarch suffers a penalty in a stupid corner case you can't hope to even test
<geist> that's the luxury the big companies can generally rely on
<geist> then everyone scrambles when some new uarch comes along, but that's job security
<moon-child> I think at one point it was able to rename the low bits differently, but only when it knows the high bits are zeroed? But then maybe at some point they walked back on that as not worth the effort, since no one actually uses small registers?
<moon-child> don't remember
<mjg> there was some bullshit where if buffers differ by a multiple of page size there is a massive penalty
<mjg> copying forward
<mjg> like off cpu
<mjg> buffer addresses
<moon-child> oh yess cache associativity stuff
<moon-child> that's a thing pretty much everywhere
<moon-child> apparently it's too expensive to put even a really dumb hash in front of l1
<mjg> and ERMS *backwards* is turbo slow
<heat> rep movsb backwards is not ERMS
<heat> it's explicitly stated
<mjg> ye ye
<mjg> point is
<mjg> another lol corner case
<heat> mjg, what's an overlapping copy
<heat> in memcpy implementation terms, overlapping stores or whatever
<heat> i don't get it
<mjg> copying a buffer partially onto itself
<moon-child> suppose you wanna copy 7 bytes
<moon-child> you copy the first 4 bytes, and the last 4 bytes
<moon-child> these have a one byte overlap in the middle, but it doesn't matter, because it's the same byte
<heat> 1) how 2) why is this faster
<moon-child> this lets you handle 4, 5, 6, 7, 8 bytes in one path with no branches
<mjg> right
<mjg> the overlapping stores suffer a penalty to some extent
<mjg> but it is cheaper than branching on exact size
<moon-child> char *dst, *src; size_t n; if (4 <= n && n <= 8) { int lo4 = *(int*)src, hi4 = *(int*)(src+n-4); *(int*)dst = lo4; *(int*)(dst+n-4) = hi4; }
wand has quit [Remote host closed the connection]
<mjg> the .L_GE2_LE3: has a simple example
gog has quit [Ping timeout: 255 seconds]
<mjg> what breaks my heart is that agner fog recommends a routine which does *NOT* do it
<mjg> 's like wtf man
<mjg> that was my first attempt and it sux0red
<moon-child> don't meet your heroes
<mjg> :]
<mjg> you are too young to truly know that mate
wand has joined #osdev
<heat> ok I understand the simple overlapping stores thing
<heat> how do I use that to write a fast GPRs-only memcpy
<mjg> i'm afraid; see bionic memset for sizes up to 16
<mjg> that covers that range
<bslsk05> ​cgit.freebsd.org: memmove.S « string « amd64 « libc « lib - src - FreeBSD source tree
<mjg> moon-child: funny you mention it, i'm gonna do some fixups over there :]
<mjg> for example rcx is written to for hysterical raisins
<heat> what's wrong with your label names
<mjg> what happened was the original code was rep movs only, that uses rcx for the count
<mjg> heat: sad necessity for macro usage
elderK has joined #osdev
<mjg> heat: i don't like it but did not want to waste more life arguing
<heat> its like you obfuscated the code lol
<mjg> the end result is that this is plopped into copyin/copyout
<mjg> as in the same code shared
<mjg> if there is a nice way to do it i would be happy to change it
<heat> namespace your label names?
<mjg> note it does not inject jumps or whatever to do the actual copy, it's all *in* routines
<heat> .Lmemmove_0_to_8:
<mjg> heat: again macroz, names just don't work, but maybe there is a work around for it
wand has quit [Remote host closed the connection]
<mjg> mate that's what i started with
<mjg> :]
<heat> why don't they work?
<mjg> i don't remember the error, but it craps out
wand has joined #osdev
<mjg> it's been like 5 years since i wrote it
<moon-child> maybe can do token pasting or something?
<moon-child> it generates two versions one with erms and one without, so duplicate label names right?
<mjg> again i don't remember what happened
<mjg> here is what does matter right now:
<mjg> 1. rep sucks
<mjg> 2. there is no speed up from making a loop bigger than 32 bytes per iteration
<mjg> 3. it's all sad tradeoffs
<mjg> 4. did i mention rep sucks?
<heat> doesn't rep unsuck on large sizes?
<mjg> it most definitely does not
<mjg> except the vast majority of all calls are for sizes for which it *does* suck
<heat> also, what happens if you pad with int3's instead of nops?
<mjg> afair some uarchs don't like that
<mjg> as in they get a penalty
<heat> i know ret; int3 does, as its used to stop SLS
<mjg> look mate, it is over 2 am here, i'm heading off
<mjg> that memmove.S has some tradeoffs which maybe are not completely defensible, and defo has some crap which i'm gonna fix
<mjg> i would say look at bionic
<mjg> :]
<mjg> for < 16 bytes
<mjg> hue movb(%rsi),%dl
<mjg> partial fucker not taken care of
<geist> https://gcc.godbolt.org/z/4q58c1sj7 pretty strange looking at the codegen in tehre, gcc does some weird stuff right in the middle of that 8 byte loop
<mjg> gonna movzbl it
<geist> ie, at .L5
<geist> it seems to add 8 to the in var (a5) but it recomputes the dest var (a3) by adding the in var + some precomputed delta between the two
<geist> very strange, the more logical code is to just add 8 to a3
<geist> but it does do the logical thing and compute the max address and do a bne against that, instead of subtracting one from a counter, like the code is written
<mjg> heat: maybe i expressed myself poorly. rep is great for "big enough" sizes, but said sizes are rarely used in comparison to sizes for which it is terrible
<geist> since riscv has nice compare two regs and branch instructions
<heat> yes
<mjg> heat: i'm off
<heat> bye
<geist> clang compiles this code as written, no weird optimizations.
<geist> even puts the store right after the load :(
<heat> i'll either write an ebpf thing or write a memcpy tonight, maybe both
<heat> ideally none
<geist> yeeeesssss
smach has joined #osdev
smach has quit []
nyah has quit [Quit: leaving]
smach has joined #osdev
mrvn has quit [Ping timeout: 246 seconds]
<geist> been piddling with it, and FWIW glibc currently has no specially optimized string routines for riscv
<geist> but the default C version does a fairly good job
<heat> yes, it doesn't
<heat> also if you look at the source you'll see that the generic memcpy has page moving capabilities because of hurd
<heat> :v
<heat> haha, fun fact: GAS .align N, 0x90 actually picks smart nop sizes on x86
<heat> ok memcpy done
<heat> not hard
<heat> mjg will probably shoot it full of holes tomorrow but i'm relatively satisfied
smach has quit [Remote host closed the connection]
[itchyjunk] has quit [Ping timeout: 248 seconds]
[itchyjunk] has joined #osdev
<geist> yeah, i'm doing kinda the same thing
<geist> have a reasonably tight 8 byte at a time riscv memset working
<geist> not really any better than the compiler could do with similar C code, but it's nicer to read and commented
<geist> drat, still gets trounced by glibcs version which unrolls it to 64 byte
<heat> i should check if 64 byte makes a difference here
smeso has quit [Quit: smeso]
<heat> no, it doesnt
smeso has joined #osdev
zxrom has quit [Quit: Leaving]
zxrom has joined #osdev
<moon-child> y'all have riscv hardware?
<geist> i do, now
<geist> woot. my new asm memset now matches or beats glibc
<heat> better share it comrade
<heat> give us our new memcpy
<geist> yah hang on a sec
<geist> memcpy is next, but probably wont get to that tonight
<geist> memset is to just warm up, get used to writing riscv asm in large amounts. there's kinda a zen to it
<heat> yeah im not really comfortable doing it
<heat> for any risc really
<bslsk05> ​gist.github.com: riscv memset · GitHub
<geist> may still be bugs, but it passes my test harness
wand has quit [Ping timeout: 255 seconds]
<heat> // TODO: see if multiplying by 0x1010101010101010 is faster
<heat> is it?
<geist> dunno!
<geist> the hard part is getting the constant into the register, which requires a load
<geist> but when i ran the expand logic into godbolt gcc just does the mul
<heat> yeah
<geist> clang does some even weirder shit: https://gcc.godbolt.org/z/o46vsv59W
<heat> haha that's genius
<geist> i think it's actually doing the shift and add trick to synthesize the constant, and then mul it
<geist> there are a few other tricks i've seen the compiler do to rewrite 'store + add base reg + bne' to 'add base reg + store - 8 + bne'
<geist> though that's kinda debatable, because there still is a dep between incrementing the base reg and something
<heat> does any of that matter on your riscv cpu?
<geist> which part?
<heat> dependencies
<heat> can it do any ooo?
<geist> probably not, though the u74 is at least dual issue so i think there are some deps
wand has joined #osdev
<geist> but there's definitely a huge win to unrolling the inner loop on this thing. to the tune of 10GB/sec vs about 3.5
[itchyjunk] has quit [Remote host closed the connection]
<geist> though that's only when in the L2 cache range (<2MB). once you get past that it settles in to what is apparently bus speed, which seems to be around 800MB/sec
daily has joined #osdev
<heat> that's pretty fast
daily has left #osdev [Leaving]
<geist> yeah this is an actually kinda reasonable cpu. it seems to more or less perform as i expect for this class of core.
vdamewood has quit [Remote host closed the connection]
vdamewood has joined #osdev
gxt__ has quit [Ping timeout: 255 seconds]
gxt__ has joined #osdev
heat has quit [Ping timeout: 248 seconds]
<sham1> mrvn: RE: commenting on adding a tagged integer. I'd expect there to be a macro or something for that. That is, to make a C integer into an OCaml one
Vercas has quit [Ping timeout: 255 seconds]
Vercas has joined #osdev
foudfou_ has joined #osdev
foudfou has quit [Remote host closed the connection]
foudfou_ has quit [Remote host closed the connection]
foudfou has joined #osdev
elderK has quit [Quit: Connection closed for inactivity]
arminweigl_ has joined #osdev
arminweigl has quit [Ping timeout: 246 seconds]
arminweigl_ is now known as arminweigl
vdamewood has quit [Read error: Connection reset by peer]
vdamewood has joined #osdev
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
GeDaMo has joined #osdev
Starfoxxes has quit [Ping timeout: 246 seconds]
zzo38 has joined #osdev
<zzo38> Here I wrote some ideas I have about operating system design: http://zzo38computer.org/fossil/osdesign.ui/dir?ci=trunk&name=design (Now you can complain about it, and/or other comment)
<bslsk05> ​zzo38computer.org: Unnamed Fossil Project: File List
bgs has joined #osdev
<kof123> "does shoving the maximum amount of data through the memory subsystem per clock negatively affect other things" anyone can make a faster cpu, the trick is to make a fast system -- cow tools guy ^H^H^H^H^H^H^H cray
<AmyMalik> the trick to making a fast CPU useful is to keep the fast CPU fed
<AmyMalik> bandwidth, latency, and actually having tasks you need done
<zid> hence cpu caches, hence speculative execution, hence, hence
slidercrank has joined #osdev
<sham1> Hence Spectre
<sham1> And hence a nice James Bond movie
<zid> yes, the pinnacle of cpu design, spectre
<zid> The true end goal
<bslsk05> ​www.tomshardware.com: AMD Unveils More Ryzen 3D Packaging and V-Cache Details at Hot Chips | Tom's Hardware
<sham1> Well, you go so fast as to break security. At what point can we start saying that CPUs are fast enough
<sham1> We need both horizontal and vertical scaling
<zid> GeDaMo: You haven't bought one yet?
<GeDaMo> You know I haven't :|
<zid> Annoyingly AMD have done that thing where the most useful config is the cheapest model, so gets the worst silicon
<zid> my friend just did
<zid> so we've been playing with it
craigo has joined #osdev
<GeDaMo> Is fast? :P
craigo has quit [Client Quit]
<zid> yep, tis fast
craigo has joined #osdev
<netbsduser> zzo38: it sounds very mainframey
gog has joined #osdev
<netbsduser> the record-based files especially, and echoes of IBM i in the object stuff
<gog> hi
<netbsduser> gog: well come
<lav> hii
* gog patpatpatpat lav
<lav> ee
* lav purrs
<zid> I'm swearing off unix for being too woke, I did ls / and what do I see? Libs.
<lav> It's a little-known fact that Qt actually stands for Queer & transgender
<zid> and kde is.. kaleidoscope of dicks everywhere?
<lav> yup
<gog> i'm a qt qt
<gog> fr fr
<lav> agreed
<Ermine> hi gog, may I pet you?
<gog> yes
* Ermine pets gog
* gog prr
<zid> gog: how sure are we that you're not just a sussy cis sissy?
danilogondolfo has joined #osdev
<gog> you don't need to be sure of anything breh
nyah has joined #osdev
frkzoid has quit [Ping timeout: 255 seconds]
mavhq has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]
mavhq has joined #osdev
jimbzy has quit [Ping timeout: 260 seconds]
foudfou has quit [Ping timeout: 255 seconds]
foudfou has joined #osdev
Vercas8 has joined #osdev
Vercas has quit [Ping timeout: 255 seconds]
Vercas8 is now known as Vercas
dennis95 has joined #osdev
Starfoxxes has joined #osdev
foudfou has quit [Remote host closed the connection]
foudfou has joined #osdev
<netbsduser> fuse seems to be completely antithetical to a sane VFS
Turn_Left has joined #osdev
<netbsduser> there seems to be no separation of the file from the vnode layer, so you end up with the most outrageous requirements, like requiring you to pass FUSE_RELEASE the same flags that you FUSE_OPEN'd something with
<netbsduser> i will just add two opaque fields to kernel file descriptors + pass the kernel fd to vnode ops purely for the sake of this monstrosity, since (fuse being undocumented) i don't dare try to figure out how to route around the nuttiness
Vercas has quit [Ping timeout: 255 seconds]
<netbsduser> another bit of stupidity: the root inode number is explicitly specified to be FUSE_ROOT_ID, but (at least with virtiofs) its `.` and `..` entries are not! this would play havoc with the vnode cache
Vercas has joined #osdev
[itchyjunk] has joined #osdev
Vercas has quit [Ping timeout: 255 seconds]
Vercas has joined #osdev
Vercas has quit [Ping timeout: 255 seconds]
Turn_Left has quit [Ping timeout: 256 seconds]
Vercas has joined #osdev
wand has quit [Remote host closed the connection]
wand has joined #osdev
nur has joined #osdev
mrvn has joined #osdev
Beato has quit [Quit: I have been discovered!]
theboringkid has joined #osdev
Vercas has quit [Quit: Ping timeout (120 seconds)]
Vercas has joined #osdev
eroux has quit [Ping timeout: 268 seconds]
eroux has joined #osdev
danilogondolfo has quit [Remote host closed the connection]
<netbsduser> virtiofs is half-baked
<mrvn> so you can write files but not read them to check if it actually worked?
<netbsduser> i've got a root dir with a subdir "test" in it. result of FUSE_LOOKUP on the root dir for `test` = fuse node number 3. result of FUSE_LOOKUP on that folder for `.` is fuse node number 13968. oh, and while the root dir's fuse-recognised number is actually `1`, FUSE_LOOKUP `..` in 'test' = 13955
heat has joined #osdev
<mrvn> is that maybe the inode of the mountpoint?
<mrvn> and what does stat on / say?
Brnocrist has quit [Ping timeout: 256 seconds]
<mrvn> Does anyone actually use "." and ".." from the FS and not generate them internally?
<netbsduser> those are indeed the inode numbers of the underlying mountpoint, virtiofs lets them appear, so it seems you need to treat fuse inode numbers and the inode numbers from getattr or lookup of `.` or `..` as fundamentally different
<mrvn> sounds like a bug in virtiofs though.
<netbsduser> mrvn: i used `.` in a failed attempt to reduce the effort to map fuse/virtiofs semantics to my vfs
<mrvn> If virtiofs doesn't map . and .. properly then I would rather have it not contain them in readdir at all.
<netbsduser> i can abolish my use of `.` but i really do need `..` though, and so i think most would, since it's not as though i carry a `parent directory` pointer in vnodes
<netbsduser> maybe linux does, they are known to be "different" in this area
<mrvn> netbsduser: without parent directory pointer how will you implement "mount --bind"?
<mrvn> or get from the root of a mounted filesystem (e.g. /home) to the parent directory?
Beato has joined #osdev
<netbsduser> mrvn: however the BSDs do nullfs, and by checking `vnodecovered` field of the vnode and then doing lookup `..` on that vnode, respectively
<mrvn> you can skip the parent pointer if you add a generated ".." entry into the vnode of every dir. But that's just another way to store the parent pointer.
<netbsduser> i only store details like that in the name lookup cache (well, i would if i *had* a name lookup cache, i am speaking aspirationally)
<netbsduser> i fear virtio-fs might be completely incompatible with anything other than the Linux VFS if i can't figure out some workaround that at least lets me get `..` lookup to return the right thing
<mrvn> so maybe implementing the name lookup cache will fix the problem.
<netbsduser> it could get me *somewhere*, but i would have to support pinning entries in the cache. the problem remains of people `mv`'ing a directory on the host elsewhere; then the stale entry is stuck, and i can never get the right entry because FUSE_LOOKUP `..` will give me the unusable host fs inode number instead of the fuse inode number
<mrvn> sure. But you are already stuck with bind mounts and crossing filesystem barriers in general. ".." simply isn't unqiue.
<mrvn> Now if you don't want bind mounts then you still need to pin normal mountpoints so you can fix the ".." at the filesystem boundary.
<mrvn> Note: there is also mount --move in linux.
<mjg> heat: don't be so easy on yourself
<mrvn> netbsduser: what happens with virtiofs when something moves a directory on the host?
<mrvn> Do you end up with a "stale handle" error like with NFS?
<netbsduser> i am not convinced that it is necessary for bind mounts, at least not if they are equivalent to bsd nullfs (which mounts a subtree of the system fs at another point)
<netbsduser> and for the case of the parent directory of mountpoints, that's handled specially by the vfs lookup function checking for a `vnodecovered` pointer in a vnode (meaning it's the root of an FS occluding a vnode of another FS; then you look up `..` of the vnodecovered to get the true parent)
<mrvn> netbsduser: you can bind mount /my/little/subdir /bla. Then ".." of /my/little/subdir is /my/little under /my/little/subdir and / under /bla. But both would be the same vnode, right?
<mrvn> Maybe vnodecovered also covers the bind mount case.
<sham1> This is why bind mounting is a bad idea
<sham1> You need to keep track of the full path you used to get to a place in order to properly do ..
<mrvn> I have no idea what code you are copying. I just know you need the .. stored explicitly for mountpoints somewhere.
<netbsduser> mrvn: on moving a directory on the host, i have no idea, it probably segfaults judging by my current experience with virtiofsd (it actually crashes every time qemu exits)
<sham1> Same with symlinks, which is why plan9 removed them
<mrvn> sham1: indeed. You need the full name lookup cache keeping the full path alive so you can follow the parent pointers.
xenos1984 has quit [Read error: Connection reset by peer]
<mrvn> And multiple cached names can point to the same inode.
<mrvn> .oO(But you have that with hardlinks already)
<mrvn> Another special case to keep in mind is chroot, or containers with a new FS namespace.
xenos1984 has joined #osdev
<sham1> Mm. I know that at least Serenity solves this by having essentially the file description remember the path it was accessed through, which is then cached so often used components are shared, and things like openat can then use these to do relative actions
<netbsduser> it appears that nullfs on netbsd at least creates new vnodes on-demand
<netbsduser> now dragonflybsd i remember boasts they need no such thing
theboringkid has quit [Quit: Bye]
<mrvn> is that something to boast about? It's not like it matters. Allocating a few vnodes is peanuts.
<sham1> We should just associate UUIDs with files
<sham1> Or GUIDs
<mrvn> too short.
<sham1> 128 bits is too short?
<sham1> Okay, then it can just be doubled. 256 bits
<mrvn> sham1: every directory you bind mount creates 2 paths to the file. So you need 1 bit to differentiate them. Do that 128 times and you have no bits left to specify the file in the dir at all.
<sham1> Ah
<sham1> So how many bits would a vnode need then
<mrvn> variable. it needs a parent pointer.
<mrvn> or you need a lookup from vnode ID to path
<mrvn> NFS runs into this problem because it's stateless. The client can't just throw some ID on the server because the server might not have the ID cached anymore and can't find the path. NFS handles do some magic to encode the path in some way but even that doesn't always work.
<netbsduser> my plan for now: fusefs_nodes will store their parent's node-id and that will be used to service `..` lookups
<netbsduser> all this stuff falls apart in the presence of the host moving the directory, but it sounds as though it falls apart rather nastily on linux too, so such is life
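netbsduser's plan boils down to a tiny amount of state per node; a sketch (the struct and names are hypothetical, not actual fusefs code):

```c
#include <stdint.h>

/* Hypothetical fusefs node, per the plan above: remember the FUSE
 * node-id of the parent at lookup time so ".." can be answered from
 * local state. Asking the server via FUSE_LOOKUP("..") over virtio-fs
 * would hand back the host filesystem's inode number, which is unusable
 * as a FUSE node-id. The cached parent goes stale if the host mv's the
 * directory, but as noted, that case falls apart everywhere. */
struct fusefs_node {
    uint64_t nodeid;        /* this node's FUSE node-id */
    uint64_t parent_nodeid; /* parent's FUSE node-id, recorded at lookup */
};

static uint64_t fusefs_lookup_dotdot(const struct fusefs_node *n)
{
    return n->parent_nodeid;
}
```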
<mrvn> that's how all the union FSes in the kernel fall apart too, except a segfault in the kernel is worse. Only unionfs-fuse handles FS changes on the underlying FSes properly.
mctpyt has joined #osdev
<mrvn> (well, without crashing, nothing you can do to fix it)
<netbsduser> i wonder on a related thing, how fuse deals with nfs, and this virtiofs fuse setup in particular
<mrvn> netbsduser: the fuse client has handles attached to each dentry and if something on the server changes you get an error about stale handles.
<mrvn> fuse does nothing for NFS, doesn't even see NFS, only the vfs.
<mrvn> s/the fuse client/the nfs client/
<mrvn> remember: fuse filesystems are just user processes that access the filesystem though normal syscalls.
mctpyt has quit [Ping timeout: 264 seconds]
nur has quit [Quit: Leaving]
<zid> moon-child: it was you messing around with pointer tagging doubles and pointers together right?
Brnocrist has joined #osdev
nur has joined #osdev
jimbzy has joined #osdev
<mrvn> What are pointer tagging doubles?
foudfou_ has joined #osdev
<mjg> heat: i'm gonna reap myself a new one for that memcpy i wrote :S
<mjg> heat: looks like we are both going to get chewed out on this one
foudfou has quit [Quit: Bye]
<heat> mjg, does that memcpy suck?
<heat> i took it as an inspiration for mine
<mjg> suck -- no, but it has stupid perf bugs
<mjg> for example i did not take care of partial regs for 1 byte copy forward
<mjg> and for copies <= 4 bytes backward
<mrvn> mjg: memcpy does not handle overlapping
<mjg> mrvn: age old
<heat> ok question: 1) should you interleave loads and stores? 2) isn't this chain of branching a bad idea?
<mjg> look i need this for *memmove* as well, and there is overlap of code
<mrvn> true
<mjg> so my *memmove* has the above problem
<mrvn> heat: isn't that obsoleted by the cpu pipeline and out-of-order execution?
<sham1> So you'd use memmove for that overlapping code, obviously
<mrvn> (1)
<mjg> heat: normally you do all the loads first and stores later, then branch on whatever
<heat> like the 16 byte branch does e.g cmp $16, %len; jb .L8byte
<mjg> just show me your code
<heat> yeah
<heat> btw that lea trick is pretty cute
<mjg> i also note reg allocation is a little questionable, but i had hysterical reasons
<mjg> you would learn lea trick if you checked disasm of any real code mate
<mjg> 's how i did it :p
<mjg> i also *suspect* the code which aligns to 16 bytes could be much better
<mjg> i'll try to express it in C and see what clang comes up with
<bslsk05> ​gist.github.com: memcpy-gpr.S · GitHub
<mrvn> heat: you aren't checking for alignment.
<mjg> movb (%rsi), %cl
<mjg> movzbl
<mjg> that's one of the bugs
<mjg> movw (%rsi), %cx
<mjg> movzwl
<heat> aha riiight
<heat> let me guess, some uarchs have false dependencies on the rest of rcx?
<mjg> i don't remember what happens there, i do remember i did measure a slowdown from not doing it
<mjg> on haswell et al
<mjg> it may be kabylake no longer has the problem
<mjg> .L1_byte: missing ALIGN_TEXT?
<mjg> .Lerms: mov %rdx, %rcx rep movsb -- normally you want to align the buf at least to 16
<heat> I guessed it's stupid to have an ALIGN_TEXT there because it's a single byte memcpy
<mjg> then handle 1 byte early instead
<heat> in my logic, memcpy(1) is already stupidly pessimal anyway, no real reason to pad it early
<mjg> so there is one fundamental tradeoff in that code, which is not 100% defensible
<heat> s/early/at all/
<mjg> you can either roll with some branches upfront and jump once to the target code
<mjg> or you can have a cascade if you will, like in the code above
<heat> yes, that's part of my "to improve" ideas
<heat> bionic memmove.S branches upfront
<mjg> so the idea behind it was that sizes 32-256 make up the majority of the calls
<mjg> so it makes sense to make it the fastest
<mjg> hence fewer branches to get there
<mjg> in my code you slide into it
<mjg> as in no jumps to start
<heat> I added a branch to 16 because I noticed in your histogram that most memcpies were 16 byte long
<mjg> you added enough branches to perhaps defeat that
<heat> did I?
<mjg> again, this one is *super* murky
<mjg> i'm gonna do another take today or tomorrow
<heat> hmm ok
<mjg> generate more datasets, from freebsd and linux
<mjg> and then measure total time to execute them with both variants
<mjg> lower total time wins
<mjg> by dataset i mean collect all sizes along with the number of times they showed up
<heat> yes
<mjg> randomize the order
<mjg> and we will see on a bunch of cpus
<mjg> no claiming perfect, but should be good enough
<heat> is memmove just doing this but backwards?
<mjg> yes
<mjg> i needed to implement it because 'bcopy' which was the goto way
<mjg> used to be memcpy
<mjg> and then a geezer made it into memmove
<mjg> and now i'm screwed
<heat> is there a penalty to always copying backwards?
<heat> having two versions of the same code that do forwards/backwards sounds depressing
<mjg> i don't htink you can get away with that for arbitrarily stupid args
<heat> so I could have memcpy doing forwards and memmove doing backwards, that's my idea
<heat> hmmm ok
<mjg> anyhow i plan to sort out memset first
<mjg> same general issue + same idea what to do
<mjg> btw that 256 is lowballing it
<mjg> my haswell does better
<heat> is it? I think I tried higher and saw really mixed results
<mjg> there may be lullers on your arch which make it into a problem
<mjg> uarch
<mjg> again, fucking cpus man
dennis95 has quit [Quit: Leaving]
<mjg> key though: rep movs is quite pessimal for short sizes, what you do about it is for the most part tradeoff city
<heat> do the same principles apply to SIMD memcpy too?
<zid> mjg do you say other words
<heat> except maybe SSE may have issues storing to unaligned addresses
<heat> I know AVX doesn't
<mjg> zid: my english is limited, i only got 'english for chronic complainers about performance' in school
<zid> makes sense
<zid> are you much more personable in polish
<mjg> of course, i'm a very well read person
<heat> peszimal
<mjg> heat: i don't know the realities for simd which i could 100% defend
<mjg> heat: i could give you a stackoverflow-quality answer
<heat> do it
<mjg> you wanna do overlapping stores as soon as you can
<mjg> but watch out how much you do them for one set
<mjg> [reality check: sse2 /sucks/ when you do it for certain sizes]
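The overlapping-stores idea can be shown in portable C (a sketch of the technique only; the helper name is made up, and real implementations do this with SIMD or GPR moves in asm): any size in 8..16 is covered by one 8-byte move anchored at the start and one anchored at the end, overlapping in the middle instead of branching on the exact size.

```c
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 16, with two possibly-overlapping 8-byte
 * moves: one aligned to the start, one aligned to the end. Bytes in
 * the middle may be written twice, which is harmless and avoids a
 * branch per power-of-two size. Both loads happen before either store
 * so an overlapping src/dst pair still works. */
static void copy_8_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;

    memcpy(&head, src, 8);
    memcpy(&tail, (const char *)src + n - 8, 8);
    memcpy(dst, &head, 8);
    memcpy((char *)dst + n - 8, &tail, 8);
}
```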
<mjg> i have 0 real data for avx
<mjg> i intentionally did not check glibc memcpy so that i can implement my own if needed
<mjg> but preferably i would steal one with an adequate license
<mjg> it was quite a bummer to find how much bionic sucks here :/
<heat> folly has an avx2 one
<mjg> yes i linked it
<heat> yes i know
<heat> just saying, it's an option
<mjg> the problem is apache license would need some finessing
<mjg> also i did not bench it myself
<mjg> also see the automemcpy paper
<mjg> for all i know i can generate an ok memcpy without handrolling any asm
<mjg> which would be the bestest
foudfou_ has quit [Remote host closed the connection]
foudfou has joined #osdev
<heat> linux memcpy_orig seems quite ok
<heat> could use the erms bit for lengthy copies but it seems similar to what we both have
<mjg> oh he, rolls with that jmp chain thing
<mjg> heh even
<mjg> /*
<mjg> * We check whether memory false dependence could occur,
<mjg> * then jump to corresponding copy mode.
<mjg> */
<mjg> cmp %dil, %sil
<mjg> jl .Lcopy_backward
<mjg> i don't know about this bit
<mjg> back then i talked to a big wig at intel about memcpy
<mjg> he told me to do address comparison and then do rep mov forwards or backwards
<mjg> et voila
<heat> lol
<mjg> no seriously
<mjg> the fact that their own optimization manual recommends against it
<mjg> did not faze him
<heat> against what?
<mjg> rep for short copies
<heat> the optimization manual seems to hail rep movsb as the best shit ever
<mjg> also note "fast short rep mov" making an appearance in recent years further proves it is crap
<mrvn> For a memcpy <= 32 byte isn't a simple 1byte copy loop faster than branching for 16, 8, 4, 2, 1 bytes?
<mjg> mrvn: nope. i had various experiments 5 years ago, including 8 byte loops etc
<heat> oh how does fsrm bench with this shit?
<mjg> it was all slower than overlapping stores
<mrvn> mjg: so a loop copying 8 byte that runs maybe twice is better? That's at least one branch misprediction.
<mjg> heat: afair fsrm does not help rep *stos*, it does help rep *movs*, but it is still slower for sizes < 64 or so
<mrvn> mjg: same for any remaining 4 byte and again 2 byte.
<mjg> mrvn: as noted previously i'm about to generate a good real-world-based dataset to memcpy and memset
<mjg> i'll hack up the above to the test mix
<heat> do you think I'll get shot if I try to patch memcpy to Be Good(tm)?
<mrvn> mjg: yes please. How many cpus can you test?
<mjg> mrvn: westmere, sandy, haswell, skylake, coffee lake, ice lake
<mjg> and probably some amd if i can get arsed
<mrvn> mjg: also will you benchmark real code? Replace the memcpy in libc and see what that says.
<mjg> see above for the description of what i intend to do
<mjg> i can easily get hands on more intels but i think that's enough
<mjg> would also be funny to bench no microcode updates vs fresh
<mjg> but i don't know if i can be arsed to get the former
<mrvn> I wish there were a way to mark different entry points into a function for the compiler. Like: enter here if src is 16 byte aligned, enter here if dst is 16 byte aligned, enter here if size > 64, ...
<mjg> heat: where? linux? it is a touchy subject so i would not
<mjg> heat: looks like the L guy and Boris Something are going to sort it out in a manner good enough(tm)
<mrvn> Sometimes I miss templates + constraints in C
<mjg> heat: for example i'm not gonna ship my memset over there :]
<heat> why is it a touchy subject?
<mjg> read the thread
<heat> you probably mean borislav petkov btw
<heat> mkya
<heat> mkay*
<mjg> yea
<mjg> also i guarantee there is something bad i don't even know about, which does affect the routine as coded by me
<mjg> and which some greybeard will point out as PESSIMAL
<mjg> while i welcome that, that's not the setting where i do
<mjg> :p
<heat> btw linux memmove is probably superior to memcpy ATM
<mjg> look i'm done chatting about bs, time to do some data collection
<heat> "And more would be dangerous because both Intel and AMD have errata with rep movsq > 4GB" haha
<mrvn> WTF? I have to rep movsq in blocks of 4GB? hehehehe
<mrvn> That's like DBcc on m68k only using the lower 16-bit of the counter register.
<mrvn> Can't do a full 32bit ripple carry addition, comparison and branch in the wanted cycle time
<mjg> now should i code the program in RUST
<mjg> MOTHERF^Wi don't think
frkazoid333 has joined #osdev
xenos1984 has quit [Ping timeout: 256 seconds]
<mjg> NAME shuf - generate random permutations
<mjg> check this out
xenos1984 has joined #osdev
heat has quit [Remote host closed the connection]
heat has joined #osdev
xenos1984 has quit [Ping timeout: 260 seconds]
<heat> mjg, CHECK WHAT OUT
<mjg> OH
<mjg> STFU
<mjg> i'm saying there is a ready-to-use tool to randomize the numbers
<zid> I can generate permutations in O(n) in both time and memory, in 3 lines of code
<zid> good enough for me
xenos1984 has joined #osdev
<zid> (LFSR with a cycle length the same as the sequence length can do O(1))
<heat> mjg, is there any benefit in interleaving loads and stores?
<zid> not on architectures that matter
<zid> yes on architectures that don't
<heat> I think they (Intel) explicitly say there is for SIMD
<zid> like mips, and atom
<zid> and avx512
<mjg> heat: for simd i don't know
eroux has quit [Ping timeout: 248 seconds]
eroux has joined #osdev
<zzo38> Do you have any suggestions about specific changes to my designs, or if anything about it is unclear, etc?
Arthuria has joined #osdev
wand has quit [Ping timeout: 255 seconds]
wand has joined #osdev
heat has quit [Remote host closed the connection]
heat has joined #osdev
<zzo38> Perhaps one thing I did not mention about the file records, is that the records do not all have to be the same size, and the record numbers do not have to be contiguous (it is likely that many record numbers will be skipped, since that file does not use them)
<zzo38> Does the design of capabilities makes sense, or do you suggest changes?
<zzo38> (It seems to be a problem of other operating systems, that do not properly support making locks and transactions that have several resources grouped at once; they usually only can do them separately.)
<heat> mjg, actually im wondering now if any of the ALIGN_TEXTs matter for small-ass sizes
craigo has quit [Ping timeout: 255 seconds]
<heat> at that point you're already doing something very pessimal, have gone through several branches, just for a 1-8 byte copy
<mjg> they do matter a tad bit
<heat> so wouldn't it be better to save on icache?
<mjg> once the target is far enough from the jump instruction you suffer from it not being aligned
<heat> yes but icache footprint
<mjg> they are most likely useless/harmful if you roll with a "switch" upfront
<mjg> it is a tradeoff, see once more the reasoning for sizes 32-256
<heat> yes
<heat> also I think bionic memmove does test fuckery instead of cmp
<heat> maybe worth a shot
<mjg> i checked in agner fog's tables
<mjg> it's literally the same shit
<mjg> in the cpu
<heat> really?
<mjg> yea
<heat> wtf
<mjg> i mean ports used and whatnot
<heat> yes
<mjg> cycle cost
<mjg> basically no diff that i could bench
<mjg> and see above why
<heat> i would expect an AND operation to be a good bit better than cmp
<heat> guess not
<mjg> i think what actually costs is the fucking branch mate
<mjg> also note instruction fusing
<heat> yep
<mjg> that said, it may be there is a funny corner case
<mjg> absent good reason to *not* follow bionic on this one, i would argue you *should* do it
<mjg> 'looks the same so we gonna go the other way' is what i gave people shit for in the past
<heat> well yes but otoh that memmove isn't all that great AND it was written in 2014
<heat> almost 10 years ago
<mjg> is not this where your cpu is from
<mjg> :XX
<heat> no
<mjg> jinkies
<heat> kabylake is 2016, kabylake R is 2017
<mjg> look at mr modern man ova here
<heat> i bet you're using haswell
<mjg> i really should have added more comments
<mjg> to all that stuff
<mjg> i just rediscovered why 'weird bit' is actually good
<heat> what weird bit?
<mjg> in memset 32 or more i do
<mjg> cmpb $16,%dl
<mjg> ja 201632f
<mjg> movq %r10,-16(%rdi,%rdx)
<mjg> movq %r10,-8(%rdi,%rdx)
<mjg> as in the tail bigger than 16 is handled separately
<mjg> turns out overlapping 16 bytes when it can be avoided is tolerable
<mjg> 32 is a major bummer
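That tail trick in sketch form (plain C stand-in for the asm above; `memset16_overlap` is a made-up name and n >= 16 is assumed): the bulk loop covers everything strictly before the last 16 bytes, then one unconditional 16-byte store anchored at `n - 16` finishes the job, overlapping already-set bytes by up to 15 instead of branching on the exact remainder.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the overlapping-tail idea for memset, n >= 16. pattern[]
 * stands in for a register filled with the byte c; the final store may
 * rewrite up to 15 bytes the loop already set, which is tolerable where
 * branching on the exact tail size is not. */
static void memset16_overlap(unsigned char *p, unsigned char c, size_t n)
{
    unsigned char pattern[16];
    size_t i;

    memset(pattern, c, sizeof(pattern));
    for (i = 0; i + 16 < n; i += 16)     /* bulk: all but the last <=16 bytes */
        memcpy(p + i, pattern, 16);
    memcpy(p + n - 16, pattern, 16);     /* overlapping tail */
}
```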
<moon-child> heat: all basic arithmetic is single cycle for a long time now
<heat> why do you cmp on the actual 8/16-bit reg
<heat> is there any advantage in doing that
<mjg> it is smaller code
<mjg> mr ifunc
<mjg> erm icache
<heat> ifunc, icache, icrap
<mjg> iPhone
<mjg> irepstos
<heat> yes smaller code and then you blow it out the water with a nice 10-byte nop or whatever
<mjg> but i can fit more in there if needed
<mjg> mofer
<moon-child> won't somebody please think of the bytes!
<mjg> aight, got a db of 520684993 real-world calls to memset
<heat> export it to SQL and query away
<mjg> i'm gonna do it on the cloud mate
<heat> oracle database moment?
d5k has joined #osdev
<heat> I feel dirty using r8d and r8w
<nikolar> nah it's fine
<heat> it's not fine
<heat> 1) needs an extra prefix
<heat> 2) clunky naming
<mjg> you don't need these regs
<mjg> i only used them so that i can safely embedd into copyin/copyout
<mjg> which already use some regs
<mjg> and i dnot want to save/restore
<heat> i do need them
<heat> rdi, rsi, rdx are used by the args, rax is primed with the return value
<heat> so that leaves me with rcx, r8, r9, r10, r11
<mjg> see bionic
d5k has quit [Quit: leaving]
heat_ has joined #osdev
heat has quit [Read error: Connection reset by peer]
heat_ is now known as heat
<heat> <heat> bionic saves rbx
<mjg> wut
<heat> yes
<heat> although they do have a funny trick here where they reuse rsi for the last load when doing the tail copying
<mjg> just be happy this is not ia32
<mjg> famine register state
<moon-child> ia64
<moon-child> 128 registers
<mjg> bring back itanium!!
<moon-child> everything else is trash by comparison
<heat> YESSIR
<mjg> now i'm curious how a memset runs there
<mjg> i mean looks like
<bslsk05> ​elixir.bootlin.com: memcpy.S - sysdeps/ia64/memcpy.S - Glibc source code (glibc-2.37.9000) - Bootlin
<heat> it looks stunning
<heat> as in "i'm stunned wtf is going on"
<moon-child> 'memcpy assumes little endian mode' wat
<moon-child> why doesn't it matter? Don't loads have the same endianness as stores either way?
<heat> KEEP HATING moon-child
<moon-child> lol
<mjg> haters gonna hate
<mjg> fuck you moon-child
<mjg> !!!
<heat> shut up mjg cpu architecture fascist
<heat> mjg? more like bitchjg
<mjg> E10k or bust motherfucker
<bslsk05> ​www.youtube.com: Sun Enterprise 10000 - YouTube
bgs has quit [*.net *.split]
smeso has quit [*.net *.split]
warlock has quit [*.net *.split]
bauen1 has quit [*.net *.split]
m5zs7k has quit [*.net *.split]
mahk has quit [*.net *.split]
matthews has quit [*.net *.split]
bnchs has quit [*.net *.split]
zzo38 has quit [*.net *.split]
ZipCPU has quit [*.net *.split]
sprock has quit [*.net *.split]
dminuoso_ has quit [*.net *.split]
k4m1 has quit [*.net *.split]
fkrauthan has quit [*.net *.split]
aws has quit [*.net *.split]
Archer has quit [*.net *.split]
j`ey has quit [*.net *.split]
Stary has quit [*.net *.split]
DoubleJ has quit [*.net *.split]
particleflux has quit [*.net *.split]
night has quit [*.net *.split]
meisaka has quit [*.net *.split]
zzo38 has joined #osdev
dminuoso_ has joined #osdev
mahk has joined #osdev
sprock has joined #osdev
bnchs has joined #osdev
matthews has joined #osdev
fkrauthan has joined #osdev
j`ey has joined #osdev
k4m1 has joined #osdev
Archer has joined #osdev
DoubleJ has joined #osdev
warlock has joined #osdev
smeso has joined #osdev
bauen1 has joined #osdev
ZipCPU has joined #osdev
aws has joined #osdev
particleflux has joined #osdev
Stary has joined #osdev
night has joined #osdev
meisaka has joined #osdev
bgs has joined #osdev
m5zs7k has joined #osdev
Archer has quit [Max SendQ exceeded]
dminuoso_ has quit [Max SendQ exceeded]
<heat> why does bionic memcpy also handle memmove?
<heat> is this mildly concerning?
<mjg> it used to be that glibc did it
<mjg> and trying to not do resulted in buggz
<heat> is this an actual compatibility concern?
dminuoso has joined #osdev
<mjg> depends, i don't know if glibc is doing it today
<mjg> people claim it is not
<heat> generic memcpy isn't I think
<heat> so...
<zid> because you can't trust people who'd use bionic
<mjg> that funky memcpy does
rein-er has joined #osdev
<zid> to stick to the actual semantics of memcpy
<moon-child> I would rather check for overlap and fault if so
<heat> their generic memcpy also supports page moving for GNU hurd
<moon-child> fix yo shit
<zid> I'm actually really lazy about using memcpy instead of memmove >_<
<mrvn> memcpy() should check and assert so bad code gets fixed.
<mjg> fucking
<heat> fucking. - mjg
<mjg> i wrote that toy prog i mentioned, very wip
<mjg> runtimes vary wildly
<mjg> turns out the total time is so long it gets preempted
<mjg> :d
<heat> toy prog for what?
<mjg> heheszek-read-bionic time 8920719533
<mjg> heheszek-read-erms time 10307939142
<mjg> heheszek-read-current time 8229679417
<mjg> heheszek-read-current time 10446866317
<mjg> heheszek-read-current time 6845942134
<mjg> running the 50 mln memsets
<geist> yah i think most libcs i've looked at simply have memcpy and memmove be the same symbol
<mrvn> mjg: pin the test to the core and pin everything else away from it
<mjg> mrvn: i already did
<geist> is it silly? yeah, but then really having two separate symbols is
<mjg> i may need to boot on linux and isolate cpus
<mrvn> mjg: then linux tickless implementation sucks.
<geist> it's like sprintf or gets, they're bad ideas from an older era
<mjg> mrvn: that's on freebsd :)
<mjg> mrvn: will do it on linux
<mrvn> mjg: did you remember to pin the IRQs too?
<mjg> i can't do that on that sucker
<mjg> again, will do it right on linux, but bummer i have to resort to it
<heat> geist, i think separate memcpy still makes sense. you optimize out a branch
<geist> but you might break code that misuses it
<heat> just like having separate memcpy_aligned_N or memcpy_nt makes sense
<geist> also means you need to write two implementations
<zid> It makes very sense for specifically a language like C to make both available as builtins
<zid> why use C if you don't want optimizations like that to happen and break your code, use rust :P
<heat> ruuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
<geist> ruuuuuuuuuuuust
<geist> though rust will probably just call through to memcpy to be fair
<geist> but it should know if things overlap
<heat> that's right, if you don't want memcpy, use go
* heat watches as they reimplement memcpy
Vercas2 has joined #osdev
<moon-child> you can't know if stuff overlaps
<moon-child> in general
<moon-child> cus you can compute arbitrary subscripts into an array
<mjg> everyone is a smartass until they need to code a website
<geist> rust can probably via a series of rules know that any two objects can't overlap
<geist> it's only situations where you're moving stuff within the same array
<mrvn> moon-child: but then the compiler knows it's the same array
<moon-child> sometimes you're in the same array and you know your subscript never overlap
<geist> rust in particular would know this. C/C++ wouldn't
Vercas has quit [Ping timeout: 255 seconds]
gxt__ has quit [Ping timeout: 255 seconds]
Vercas2 is now known as Vercas
<moon-child> I'm not saying it doesn't sometimes know, I'm saying it doesn't always know
<moon-child> c can restrict
<mrvn> moon-child: If it can't know if ranges of objects overlap then it can call memmove
<geist> does make me think, i do remember there was a fairly clever sequence of instructions to detect overlap, and how much
<mjg> i once more note some of the intel folks claim you need to check relative addresses for performance reasons anyway
<geist> and also interestingly, depending on how two things overlap and by how much and how long your copy stride is you can still a lot of the time use your core algorithm
<mjg> and if you do that, the issue goes away
<heat> why is no one on the page loaning koolaid
<geist> that'd have to be fast as fuck to beat a copy
<mrvn> if (abs(dst - src) > 256) memcpy_I_dont_care()
<moon-child> geist, first thing that comes to mind is sign bit of (startx - starty) xor (startx + len - starty)
<geist> moon-child: something like that
<geist> mrvn: hmm, but does that work with both directions of overlap?
<mrvn> geist: no. sometimes you have to copy backwards.
<geist> if your algorithm copies forward then one of the overlaps will be problematic, i think
<mrvn> geist: if (abs(dst - src) > len) memcpy_its_safe()
<moon-child> even if they overlap the 'wrong' way, as long as it's less than your buffer size, it's fine
<mrvn> moon-child: no. some overlaps you have to copy backwards.
<geist> moon-child: only if you copy in the right direction, because otherwise you'll start ovewriting source data before you get to it
<moon-child> err no
<moon-child> ignore me
<mrvn> You can ignore backwards copying when len < buffer_size
<geist> but yeah you *should* be generally ablve to write a reverse copy version of your algorithm at pretty much the same speed
<geist> so now you have to write two versions of the whole thing, forward and backwards
<geist> then handle the overlap case where they're too close: which is probably just revert to a bytewise forward or back copy
<geist> then that's basically the guts of memmove
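Those guts, condensed into a byte-wise C sketch (illustrative only; a real memmove replaces the loops with the wide forward/backward copies discussed above):

```c
#include <stddef.h>

/* Minimal memmove logic: copy forward when the destination starts below
 * the source (or the ranges are disjoint), backward when the destination
 * lands inside the source range, so no source byte is overwritten before
 * it has been read. */
static void *my_memmove(void *dstv, const void *srcv, size_t n)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;

    if (dst < src || dst >= src + n) {
        for (size_t i = 0; i < n; i++)   /* forward is safe */
            dst[i] = src[i];
    } else {
        for (size_t i = n; i-- > 0; )    /* overlap: go backward */
            dst[i] = src[i];
    }
    return dstv;
}
```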
<mrvn> Copying backwards is bad for the performance though. So you probably want to copy smaller chunks forward if the overlap is larger than your buffer size.
<geist> not really. i think most decent architectures detect reverse stride just as well
<moon-child> why would backwards be bad?
<mrvn> e.g. on an overlap with 64k offset
<moon-child> yeah prefetcher can detect backwards accesses
<mrvn> moon-child: because RAM chips are made to read incrementing sequentially.
<geist> oh modern stuff is so abstracted from what ram chips are doing that's immaterial
<mrvn> ok
<geist> but sure you can still do say 64 byte chunks forward as you step backwards
<mrvn> that info is maybe 12 decades old.
<geist> i mean the ram chip probably does have an open next row thing (depending on if it's row or column first) but then the rows are large, and i think you can easily address backwards in it
<mrvn> geist: would it matter for 64byte? That's a cache line. I don't think it matters in what direction you access a cache line.
<geist> right
<geist> hence why that wouldn't really matter if it did it forward or backwards within the cache line, probably.
<mrvn> geist: I was more thinking about 1k or so. Where the prefetcher would fail to get the next cache line ahead of time.
* geist nods
<moon-child> there is sub-cacheline structure. But I don't think order matters
<geist> now if it's just that level of arch where it can't really detect prefetching stride, then yeah you might get worse performance
<mrvn> And then as you say the next size is the row of the DIMMs. Every time you have to load a new row address you lose time.
<geist> but we're already talking about the sub case where things overlap in some way. if A is less than B or vice versa with no overlap there's no reason doing a backwards copy
<moon-child> (I think usually, sub-cacheline, is organised into groups of 8 bytes. A misaligned 8-byte load can grab from two 8-byte groups, doesn't matter if they're in the same line, hence misaligned 8-byte loads may be fast even when they cross a cache line boundary, contra wider loads)
<geist> honestly you could probably just revert to bytewise for overlap cases and probably wouldn't be a big deal
<geist> i mean it would be sub optimal but that's sort of a TODO case
<mrvn> checking for all the overlap cases though is a couple of branches. So for memcpy() it's worth it not to need those.
<geist> lots of overlap is usually folks moving strings around, and they're probably fairly small *or* it's something incredibly stupid
<geist> again depends on if you trust all users of memcpy to do the right thing. i'm not sure if that's a wise idea because of the existing implementations that just union the two things
<geist> and sure that's broken code, etc etc, but that's also the life of an OS hacker
<geist> dealing with dumbass users
<mrvn> geist: are you calling me dumbass? :)
<geist> welllll
<mrvn> being the only user has the pro and con that all users are as dumb as I expect them to be.
<geist> yah
heat has quit [Read error: Connection reset by peer]
heat has joined #osdev
<mrvn> I like having assert()s that check for overlap on memcpy though. Because I know I'm dumb. :)
gxt__ has joined #osdev
<geist> you can certainly do it in a wrapper without messing with the core implementation
<mrvn> It's best placed (or at least duplicated) in the .h file so the compiler can evaluate it at compile time where possible. Only the case where everything is unknown should call the full memcpy/memmove with all the branches.
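A sketch of that checked wrapper (the name is made up; note the cross-object pointer comparison is technically UB in ISO C, though fine on the flat-address-space targets discussed here):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical checked memcpy: assert the ranges are disjoint, then
 * forward to the real thing. As a static inline in a header, constant
 * pointers and sizes let the compiler evaluate the check at compile
 * time; only the fully-unknown case pays for the branches at run time. */
static inline void *memcpy_checked(void *dst, const void *src, size_t n)
{
    const char *d = (const char *)dst;
    const char *s = (const char *)src;

    assert(d + n <= s || s + n <= d); /* no overlap allowed */
    return memcpy(dst, src, n);
}
```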
<heat> mjg, turns out 256 is indeed too conservative for erms
<heat> i bumped it up to 512 here
<mjg> that's pushin it
<heat> it is
<mrvn> mjg: with all your memcpy benchmarking can you say at what point memcpy will be slower than changing page tables?
<heat> also I assume erms on backwards sucks harder than a manual copy?
<mjg> mrvn: i'm only benchin small sizes
<mjg> mrvn: < 256
<mjg> mrvn: this is for kernel memcpy
<mjg> et al
<mjg> heat: yea
<mrvn> that's exactly where I would move pages around too
<geist> i dunno unless you have an extremely specialized situation, moving around pages is very expensive
<geist> you'd have to be local cpu, involve no cross cpu TLB shootdowns, etc
<mjg> i once more refer you to https://people.freebsd.org/~mjg/bufsizes.txt
<heat> moving pages around's cost probably scales with the number of CPUs involved
<mjg> see memmove_erms et al
<mjg> tons and tons of ops are super small
<mrvn> geist: no threads, no shared memory, so no cross cpu worries there.
<zid> what's an erms
<geist> no matter what you do you should probably optimize a page sized memset and page sized copy
<geist> hypothetically that should be most of what your kernel does
<zid> I feel it isn't efficient reverse memory sausage
<mjg> i have to go afk now, can respond later to whatever
<mrvn> geist: it is, so far.
<geist> mrvn: sure. ie, the extremely specialized situation
<heat> zid, Enhanced REP MOVSB
<geist> in more general purpose things, it's hard to justify fiddling with page tables at this level
<zid> oh right
<mrvn> still, would be nice to know at what size a local page remap, global page remap, cross cpu shootdown, ... becomes cheaper than copying
<geist> i think it made sense farther back in time, but it's one of these cases where modern cpus are faster copying things than the overhead of fiddling with the mmu. plus all the cross cpu shootdowns
<mrvn> isn't that reversing with the cross cpu shootdown mechanics that don't use IPIs?
<geist> like in ARM64? possibly, but that's also not free
<heat> they are NOT free
<geist> for example you'd probably have to to break-before-make
<heat> and also O(n) with the number of CPUs involved
<geist> since you're replacing one page with another
<mrvn> Not free. Cross cpu shootdowns just have become so expensive that they are getting optimized now.
<geist> the ARM TLB shootdown stuff is nice because it doesn't have to do a full IPI, but there's still cost
<geist> the new intel TLB shootdown proposal effectively bumps the IPI up to some sort of pseudo microcode/SMM mode thing
<geist> so hypothetically that helps a bit, since it's not a full interrupt
<mrvn> I just want to invalidate the other cores TLB cache entry.
<geist> and of course AMD has their solution too
<heat> why are x86 vendors so stubborn?
<heat> screw them both
<geist> yay!
<geist> go Via! be the dark horse third one
<heat> i want a bootleg soviet 386
<geist> anyway re: TLB shootdown on ARM. it *has* to be more efficient than an explicit IPI like you get on x86 or riscv, but i honestly dont know the numbers
<geist> it's one of these things where you really just dont have any other choice, since there's functionally no other way to do it
<geist> so hypothetically the tlb shootdown is 'free' but probably in reality it's highly based on the microarchitecture, how many cpus are active, what they're doing at the time, etc
<geist> i've never actually measured it to be honest
<mrvn> geist: it should really be the same cost as a local TLB shootdown except you send it to the other cache. The only costly thing would be when 2 cores try to do it at the same time. Then one has to somehow wait etc...
<mrvn> out of curiosity: Does ARM support multiple sockets?
<geist> yeah
<dh`> there must be multiple-socket arm64 boards by now
<geist> yep. i have one right here
<dh`> the thing I have always wondered about hardware-assisted tlb shootdown is: tlb shootdown has to ultimately be synchronous (that is, you have to wait for it to complete before you continue with whatever VM ops you're doing) but suspending the current processor entirely for that time seems like a bad plan
<geist> the mechanism by which the TLB invalidates gets broadcast is not documented, but presumably it's somewhat like a cache eviction thing
<zid> geist loves his athlon 64 x2 x2
<geist> dh`: the mechanism ARM (and AMD) do is you start the eviction with an instruction, and then later on there's another instruction to sync
<geist> DSB on arm, TLBSYNC on AMD
<mrvn> dh`: it's just a pipeline flush
<geist> so that lets you at least do some other work in the middle
<dh`> sounds like not a full switch to another thread though
<geist> so it does mitigate that somewhat. you can schedule multiple flushes in your code and then synchronize on the way out
<dh`> but maybe the latencies aren't high enough for that to matter
<heat> no, you don't switch
<heat> the point is that you can keep batching tlb shootdowns as you go and TLBI'ing them
<geist> what is annoying about swapping pages vs just general unmap/etc is you have to do break-before-make, which is somewhat more synchronous
<heat> then at the end you tlbsync/dsb
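[The batching pattern geist and heat describe — issue the per-page invalidates as you go, then one sync (DSB on arm64, TLBSYNC on AMD) at the end — sketched in C. The `tlbi()`/`tlb_sync()` primitives here are stand-ins that just count calls; on real hardware they'd be the broadcast-invalidate and barrier instructions.]

```c
#include <stdint.h>

static int tlbi_calls, sync_calls;

static void tlbi(uint64_t va) { (void)va; tlbi_calls++; } /* e.g. TLBI VAE1IS */
static void tlb_sync(void)    { sync_calls++; }           /* e.g. DSB ISH    */

static void unmap_range(uint64_t va, uint64_t len, uint64_t page_size)
{
    for (uint64_t off = 0; off < len; off += page_size) {
        /* ...clear the PTE for va+off here... */
        tlbi(va + off);  /* start the broadcast invalidate, don't wait */
    }
    tlb_sync();          /* wait once for all of them to complete */
}
```

The win is that N invalidates cost one synchronization point instead of N, and you can do other work between starting the invalidates and syncing.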
<geist> ie, you have to right then and there shoot down the TLB and wait for it to complete before you can put the new entry in
<dh`> with an IPI it's quite feasible to run another thread
<mrvn> dh`: you only need to sync before the next instruction that could access the evicted address.
<dh`> since even without latency from interrupts being off on some of the other cpus, the interrupt and interrupt dispatch takes considerably longer than two thread switches
<mrvn> geist: why? I would say the opposite, swapping pages is make-before-break
<heat> i mean, you could schedule out I think?
<geist> nope. not on ARM. it's complicated
<geist> it's all about avoiding the situation of having conflicting TLB entries on multiple cpus at the same time
<mrvn> geist: you write the new address into the page table, you invalidate the TLB.
<geist> certain subsets of TLB shootdowns involve break-before-make
<geist> nope. that's not how it works mrvn
<geist> there's a whole treatise in the ARM manual about why you need break-before-make, and which precise situations you need it
<mrvn> geist: doesn't that guarantee that after sync all cores will have the same entry (or none)?
<geist> it guarantees that *at all points in the sync* they have the same entry
<geist> which is mandatory for Reasons
<mrvn> urgs.
<geist> ie you remove the old entry, TLB sync, then add the new entry
<geist> ie, break-before-make
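[Geist's break-before-make sequence, sketched in C. The primitives are stubs that just record the order of operations; on real ARM64 they'd be a PTE store, DSB ISH, TLBI by VA, and another DSB. Names are illustrative.]

```c
#include <stdint.h>

enum { BBM_MAX_STEPS = 8 };
static const char *bbm_log[BBM_MAX_STEPS];
static int bbm_nsteps;

static void step(const char *what) { bbm_log[bbm_nsteps++] = what; }

/* Hypothetical primitives; on real hardware these are MMU/barrier ops. */
static void pte_write(uint64_t *pte, uint64_t val)
{
    *pte = val;
    step(val ? "write-new" : "write-invalid");
}
static void dsb(void)          { step("dsb"); }
static void tlbi_va(uint64_t va) { (void)va; step("tlbi"); }

/* Replace the translation for `va` without ever letting two CPUs
 * hold conflicting TLB entries: break-before-make. */
static void bbm_replace(uint64_t *pte, uint64_t va, uint64_t new_pte)
{
    pte_write(pte, 0);       /* 1. break: invalidate the old entry        */
    dsb();                   /* 2. make the store visible                 */
    tlbi_va(va);             /* 3. invalidate stale TLB entries, all cores */
    dsb();                   /* 4. wait for the broadcast to complete     */
    pte_write(pte, new_pte); /* 5. make: only now install the new entry   */
}
```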
<mrvn> I would have thought you just ignore the big race hole between make and break. It's UB anyway.
<geist> so only subsets of page table modifies hit this case, but changing what is mapped at a particular slot to something else is one of them
<geist> unmapping or mapping doesn't cause it, of course
<mrvn> geist: does the reason involve the A/D bits?
<geist> yes
<dh`> I vaguely remember having this discussion once before here
<mrvn> ahh, not using (or having even) those.
<geist> everything to do with other cores having weak memory model writeback to A/D bits and having out of sync TLB entries
<geist> mrvn: congratulations
<mrvn> geist: the A/D bits can get triggered between make and break and then you indeed have problems.
<geist> yep, that's the main issue, and there's some other subtle reason
<geist> i encourage you to read the manual on the topic before you fiddle too much, so at least you know if you're playing with fire or not
<mrvn> Are A/D hardware bits mandatory on AArch64 or still optional?
<geist> >=v8.1
<dh`> you can get inconsistent reads if it's a page shared with another process
<geist> >=v8.1
<mrvn> geist: I have no shared memory, no mmap, no page getting remapped actually. I (so far) only have map, unmap and move between 2 page tables.
<geist> congratulations
<mrvn> geist: I'm sticking with a very simplistic memory model for reasons. :)
<geist> i understand you have a very simple system that bypasses most of these concerns, but it really doesn't matter to me (or probably anyone else)
<geist> it's not useful to recommend things to other people based on your personal needs
<mrvn> just saying why I haven't run into this issue
<geist> sure
<geist> but again you should probably read the manual on the topic. i think there's some other reason why break-before-make may be necessary
<dh`> suppose it's a copy-on-write remap; process 1 thread 1 reads 6, goes to write 9, updates the mapping, starts shootdown, process 2 writes 7, process 1 thread 2 reads the 7, then the shootdown completes
<mrvn> dh`: if you have a shared page and one thread/process remaps the page without some form of synchronization then you have a race condition already.
<geist> there's some trickery when detaching page tables that involve a BBM style thing
<heat> gang, i need assembly help
<heat> jz .Lout
<heat> sub $32, %rdx
<heat> cmp $32, %rdx
<heat> jae .L32b_backwards
<geist> to keep other cpus from having a page table cache entry floating around before the page is reused
<heat> why is the cmp not redundant?
<heat> WAIT
<heat> im stupid
<dh`> when rdx starts out at 65 :-)
<geist> yay duck debugging
<mrvn> dh`: In your example both processes would trigger a page fault. The original entry is read-only.
<dh`> no? p1 t1 was already doing the page fault, p1 t2 was only reading, p2 might not have it readonly
GeDaMo has quit [Quit: That's it, you people have stood in my way long enough! I'm going to clown college!]
<mrvn> dh`: process 2 writes 7 ==> page fault
<heat> geist, what does zircon use to handle copyin/out page faults?
<heat> fixup table?
<dh`> p2 might not have it readonly
<mrvn> dh`: It's COW, it must be read-only.
<dh`> says who? see MAP_PRIVATE
<geist> fixup table i think, depending on precisely what you mean by fixup table
<mrvn> dh`: that's how COW work. Both sides of the COW get a read-only entry.
<dh`> I mean, arguably MAP_PRIVATE is a bug, but it's the agreed-upon default behavior
<heat> struct fixup_entry {u64 ip; u64 fixup_ip;} table[];
<mrvn> dh`: the page only becomes read-write when you resolve the COW and then the page is no longer shared.
<dh`> mrvn: I repeat, MAP_PRIVATE
<geist> dh`: i think the key here is at some point after the page table entry is updated either the other process still sees the RO version (and page faults) or the RW, but it's okay for that to be slippery
<mrvn> dh`: MAP_PRIVATE has nothing to do with that.-
<heat> like, erm, I want to plug this into copy_to/from_user but adding fixup entries for every single memory access sucks
<dh`> sure it does
<geist> worst case the second cpu page faults, then goes in to discover the page is already RW and shrugs and retries
<dh`> MAP_PRIVATE allows other processes' changes to show through to you until you trigger your own copy
<geist> heat: ah no we just do it as a register/deregister thing
<dh`> so some other process could be making such a change
<mrvn> dh`: when you MAP_PRIVATE you get a COW read-only mapping of everything. When you fault on a write you get a not shared copy.
<heat> ah, like BSD then
<geist> ie, the start of the copyin/copyout code sets a recovery pointer in the current thread structure
<heat> yep
<geist> i think that works reasonably well
<geist> it's not balls to the wall optimally fast, but i think it's a reasonable compromise
<geist> having the pre-canned table seems like a microoptimization
<dh`> geist: defining magic ranges for the trap handler to treat specially shifts a couple instructions out of the common fast path
<geist> yep
<mrvn> dh`: process 2 doing a MAP_PRIVATE won't allow it to write to pages so process 1 sees it
<dh`> yeah, micro-optimization
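[The register/deregister style geist describes — set a recovery pointer in the current thread before touching user memory, clear it after — sketched here with setjmp/longjmp standing in for a real page-fault handler. All names (`struct thread`, `cur`, `fault`) are made up for illustration.]

```c
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

struct thread {
    jmp_buf *recovery;   /* non-NULL while a user copy is in flight */
};
static struct thread cur;

/* What the page-fault handler would do on a faulting user access. */
static void fault(void)
{
    if (cur.recovery)
        longjmp(*cur.recovery, 1);
    /* otherwise: a genuine kernel fault -> panic */
}

static int copy_from_user(void *dst, const void *src, size_t n,
                          int simulate_fault)
{
    jmp_buf env;
    if (setjmp(env)) {
        cur.recovery = NULL;   /* deregister on the fault path */
        return -1;             /* i.e. -EFAULT */
    }
    cur.recovery = &env;       /* register before touching user memory */
    if (simulate_fault)
        fault();               /* stand-in for a faulting access */
    memcpy(dst, src, n);
    cur.recovery = NULL;       /* deregister on success */
    return 0;
}
```

Compared with a per-access fixup table, this costs a couple of stores in the common path but needs no table entry per memory-access instruction — the micro-optimization trade-off mentioned above.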
<dh`> no, mrvn
<dh`> process 2 is doing something else, maybe via MAP_SHARED, maybe just write(2)
<dh`> *you* have the MAP_PRIVATE and you're in the middle of copying
<mrvn> dh`: ahh, that is a horse of another color
<dh`> not really
<mrvn> mixing write and mmap without syncronization is a race condition. So yeah, you might get 7 or 9. It's a race.
<dh`> actually the example I provided is invalid but you can construct other more complicated ones that involve different addresses on the same page
<dh`> the point is that you can get traces where thread 1 sees that the copy happened before process 2 and thread 2 sees that it happened afterward
<dh`> and if you then do this on two separate pages you can get an observable inconsistency
<mrvn> dh`: sure. hence the need to synchronize. One way to do that is to first break the mapping like geist says you need to do anyway. Then all processes run into a fault and that blocks till you're finished remapping everything.
<dh`> right
<dh`> that was the point, ultimately
<dh`> you can't engage the new mapping until the old mapping is revoked everywhere
<dh`> that's also what I was blathering about when I was talking about waiting for completion
<mrvn> dh`: if it involves user processes I would still think all those cases are actually UB or race conditions. The kernel doing COW is a separate matter and the kernel needs to synchronize that.
<dh`> there's no UB at the machine level
<heat> ARM does have some UB
<dh`> and also, things like mprotect changes that can be triggered from userlevel come with implicit atomicity guarantees
<heat> in the IMPLEMENTATION DEFINED stuff or whatever they call it
<dh`> typically processors only allow unprivileged execution to be UNPREDICTABLE rather than UNDEFINED because the latter is bad for security
<mrvn> dh`: if one thread calls mprotect to make a page read-only and another thread writes to it then it's undefined whether the other thread segfaults or not. Depends on the exact timing. That kind of stuff.
<dh`> no, but it must happen either before or after
<dh`> and if you let that become fuzzy it becomes possible to construct ways to observe the inconsistency
<mrvn> even that might not be the case with compiler and cpu reordering stuff
<dh`> at least in unix it's a basic guarantee of the syscall api
<mrvn> A lot of that stuff predated threads :)
<dh`> and a lot of stupid stuff had to be sorted out when it became possible to have multiple threads observing in a single process
<dh`> I think at this point posix doesn't make promises about what happens if you mprotect or unmap regions that are arguments to read/write when those calls are in progress
<netbsduser> i am pinning my buffers nowadays
<mrvn> dh`: where did you see that mprotect is atomic? man 7 pthread doesn't have it in the list of thread safe functions and the manpage doesn't mention it here.
<netbsduser> if you do read/write it wires down the underlying pages
<dh`> all system calls are atomic unless explicitly stated otherwise
<dh`> anyway there's valid reasons for wanting mprotect to be atomic and no real excuse for fumbling it :-)
<mrvn> dh`: then they would all be thread safe
<dh`> what specifically do you mean by "thread safe"?
<geist> i think in general the rules are you can't observe thing in any particular order, but you at least observe old or new state, and nothing in between
<mrvn> never mind, found it.
<geist> that's really the only reasonable thing anything can guarantee
<mrvn> dh`: "A thread-safe function is one that can be safely (i.e., it will deliver the same results regardless of whether it is) called from multiple threads at the same time."
<dh`> all system calls that are actually system calls should be thread-safe in that sense
<dh`> that is very basic
<mrvn> dh`: they are "except for the following functions: ..."
<geist> well, that's not really true intrinsically. you generally have to jump through at least some number of hoops to guarantee that
<geist> like, say the file descriptor moves the cursor atomically
<dh`> calls that are allowed to be wrappers in libc are somewhat different
<geist> or, a file descriptor is either open or not at any given point
<mrvn> dh`: anything with static buffers is on that list
<dh`> there are no syscalls with static buffers
<dh`> it doesn't make sense
<mrvn> dh`: but way too many libc functions
<geist> what is really hard to do is guarantee that at the completion of a syscall all of its results are observed everywhere
<dh`> yes but that's an entirely different issue
<geist> easy to do up until you have multiple threads calling things simultaneously
<mrvn> geist: or even still valid
<mrvn> dh`: that was in reference to "wrappers in libc"
<geist> right, we ended up for zircon declaring the model is fairly loose
<dh`> right
<heat> is there no way to define descriptive function-local labels in GNU as?
<heat> .Lblah is not function local
<heat> this is exactly what mjg was talking about yesterday
<dh`> heat: only file-local
<dh`> what's a "function" in assembly anyway? not a meaningful concept
<heat> :^) shoot me
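[As dh` says, `.L` labels are file-local, not function-local. The usual GNU as workaround is numeric local labels (`1:`, referenced as `1b`/`1f`), which may be redefined, so every routine can reuse them — at the cost of not being descriptive. A small illustration via inline asm (fallback C loop on non-x86; requires n >= 1):]

```c
/* Sums n + (n-1) + ... + 1 using a numeric local label as the loop head. */
static unsigned sum_to(unsigned n)
{
    unsigned acc = 0;
#if defined(__x86_64__)
    __asm__("1:\n\t"          /* numeric label: redefinable, reusable */
            "add %1, %0\n\t"  /* acc += n          */
            "sub $1, %1\n\t"  /* n -= 1            */
            "jnz 1b"          /* 1b = nearest '1:' looking backward */
            : "+r"(acc), "+r"(n));
#else
    while (n)
        acc += n--;
#endif
    return acc;
}
```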
<dh`> geist: for anything that updates kernel state unlocking that state should force global visibility
<geist> yeah but the tough part is what does it do to syscalls that are simultaneously occurring on the same state
<geist> the canonical example is a syscall that frobs a handle simultaneously being called with a syscall that closes the handle
<dh`> right, one has to execute first
<geist> that's too difficult, without serializing the entire kernel
<dh`> that's part of what's meant by the atomicity guarantees for unix syscalls
<geist> the obvious 2 cases are: close happens first, frob fails. frob occurs first, close happens
<geist> but the 3rd and less obvious case is: frob occurs first, close happens *and* exits, frob continues to happen
<geist> ie you have a syscall that's still occurring on a handle that is now closed
<dh`> should not be, boils down to a single word-sized access of an entry in the descriptor table, even if you do everything unlocked one will go before the other
<geist> i'm not entirely sure posix defines that
heat has quit [Read error: Connection reset by peer]
<geist> the key is what happens when the frob syscall looks up the underlying object, gets a reference to it. it's now *past* the descriptor table.
heat has joined #osdev
<dh`> in principle it means the read happened before the close
<geist> it did, but then as a result the frob syscall goes about its business *after* the handle was closed
<dh`> even if all it actually means is that the read crossed the descriptor table before the close
<dh`> that defines the ordering
<geist> so you have to consider that case, or explicitly add machinery to make the close syscall wait until all outstanding operations on it have completed
<geist> we chose not to do the latter
<dh`> in order to have an inconsistency you need to then be able to observe something that shows that the close executed before the read
<geist> i think you're missing the point here. the point is not that the close happened before the *start* of the read. it's that the read is continuing to happen
<geist> because syscalls aren't atomic in units of time
<dh`> right, but can you construct an observation that shows that?
<dh`> you can call gettimeofday() after each call but that tells you nothing
<geist> absolutely. it's easy. do a blocking read on something and then close it
<geist> i actually dunno what posix does there. does it abort any read operations on the fd?
<geist> (probably). but is it defined that way
<geist> or is that just a side effect of implementation
<dh`> traditionally? the close affects the table, not the file object (or vnode)
<geist> exactly
<dh`> how do you observe that the read is still in progress though?
<geist> the fact that the read syscall is still occurring after T0, where T0 is where the close syscall returns
<dh`> how do you know it's occurring, and where do you get that stamp?
<geist> i'm not sure i understand this line of thinking. it's easy to observe all this stuff using standard observational stuff
<geist> ie, a universal clock to the computer
<dh`> sure, you can also in principle monitor the electron flows in the cpu
<dh`> but the semantically important question is what a program can observe
<geist> also i'm trying not to be posix specific here. i think posix sidesteps a lot of this by not having a tremendous number of types of frobs you can do on handle. i also think it sidesteps a fair number of these things by being implementation specific
<geist> thread B is still sitting in the read syscall after thread A has completed closing the handle
<dh`> can you distinguish that from thread B from having returned and stalled before doing anything else?
<geist> and after some reasonable time does not return with ERR_CANCELLED or whatnot
<geist> yes yes, i know what you're trying to do here, but that's not the point
<dh`> it _is_ the point though because all these notions are relative to some model
<geist> so perhaps a better way of saying it is does thread B get its syscall cancelled as a result of thread A calling close
<geist> or does thread A wait until thread B exits, etc
<geist> and i simply dont know what posix states here, or if it states anything at all
<dh`> the whole point of having something more parallel than fetch one instruction at a time and execute it to completion is that there's supposed to be a model in which the execution is still consistent
<dh`> it's usually safe to assume that posix states either nothing or nothing helpful :-)
<geist> right. so my point is you have to define some sort of model ideally. even if the model if precisely undefined in some situations (ie, could be A or B but can't tell)
<dh`> at some point we discovered that someone had changed netbsd's close to behave in a manner other than the usual default and there was a big ruckus about it
<dh`> let me see if I can find that
<dh`> but I think the conclusion was that what whoever did was legal, just unexpected and possibly undesirable
<geist> what we did for zircon is since basically every syscall operates the same way: take a handle to a thing and do an operation on it. there's a phase in the syscall where the handle is looked up, and at that point the caller has access to it, even if the handle goes away simultaneously
<geist> and since handles cannot be modified, only added or removed, it avoids a bunch of races with handle changing permissions or whatnot
<mrvn> dh`: you can send thread2 a signal and see if read return EAGAIN
<geist> ie a slot in the handle table is in one of two states: present with a set of rights and pointing to an object, or empty. and can only transition between those two states
<mrvn> dh`: thread2: read(fd), thread1: close(fd); closed = true; signal(USR1); If read returns EINVAL or something then close aborted the read. If read returns EAGAIN / closed is true then close() didn't abort for sure.
<mrvn> you can add a sufficiently large sleep() after close to make it even more observable.
<dh`> and since you can't post signals explicitly to threads, what if the signal is only ever delivered to thread 1? :-)
<dh`> (that's being difficult, yeah, it's a possible way to observe it)
<dh`> but the question isn't whether close aborted the read, that you can presumably tell by whether the read fails with EBADF
<dh`> it's whether you can observe that the read is still running after close completed
<mrvn> dh`: how would that look like? close() aborts the read but then read still returns data written to the file after close()?
<dh`> anyway, I'll just retreat to the next fortification, which is that file descriptors being handles is part of the semantics of the unix system call API and there's no reason to require operations on handles to affect other operations that have passed the look-up-handle stage
<mrvn> ack
bgs has quit [Remote host closed the connection]
<dh`> mrvn: no, the idea is that close doesn't abort the read
<geist> yah that's the model i think that makse sense
<mrvn> dh`: but that part the signal would show.
<dh`> basically read looks up its filetable entry, close removes that entry, close returns, read continues
<mrvn> There is a grey zone in my test where close would abort the read(), the read doesn't return yet though and the signal then still happens to catch the sleeping read.
<dh`> I guess another more direct way to observe this is: thread 1 calls read, thread 2 calls close then writes to a pipe, thread/process 3 reads from the pipe and writes to thread 1's file, thread 1 reads that data
<mrvn> would be an odd implementation though for the read to be aborted but still catch signals and change the return code.
<geist> that being said i wonder what happens to network sockets
<geist> though that may be exactly why shutdown() exists
<mrvn> geist: shutdown is so you can close the sending side while still reading.
<dh`> mrvn: in most implementations it would post the signal handler but return EBADF and not EINTR
<dh`> most signal implementations, that is
<geist> hmm, you're right. so then what happens if you close a socket that has a pending blocking operation on it? seems in that particular case like it'd be silly to keep it going
<geist> since it could, hypothetically, block forever
<mrvn> geist: how ever would the read wake up in that case? It's not getting any more data.
<dh`> the argument I'd make is that if that's what you want you should call revoke rather than close
<mrvn> dh`: if you really want to know 100% then you have to read the kernel source.
<geist> i'm gonna bet it aborts the read, and it comes down to a case where posix is unclear and its implementation defined what happens
<heat> i think linux just wakes everyone up on shutdown
<mrvn> otherwise the test shows 99.9% sure
<heat> like t1: read(sockfd, ...); t2: shutdown(sockfd, RD) t3: read() = 0
<heat> s/t3/t1/
<mrvn> geist: I think a close on a socket or pipe means EOF so read should wake up.
<mrvn> Unlike on a file where close doesn't change that.
<dh`> my expectation would be that even for a socket the close would only close the handle, not the socket, and the socket would go away after the read releases it
<mrvn> dh`: then the read never wakes up and the socket remains behind forever.
<geist> yah reading the man pages for close it doesn't really say what happens on simultaneous operations, but it seems to imply that if it's the last handle then everything will be cleaned up
<geist> which implies that if something is blocking at least it'll get unblocked as the data structure its on is getting removed
<dh`> the natural implementation is to incref the file object when you fetch it from the file table for read, so you own a reference to it, and nothing under it goes away until you drop that reference
<dh`> the reason for whatever weird thing happened in netbsd was that someone wanted to avoid the atomic incref for that
<geist> but if it's something non blocking, like a read that is just taking time to copy data, it probably as an implementation detail ends up waiting until that is finished
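[dh`'s "natural implementation" — read takes its own reference when it fetches the object from the descriptor table, so close() only clears the slot — sketched as a toy fget/fput. The `freed` flag stands in for actually destroying the object; real kernels also take a lock or use RCU around the table lookup.]

```c
#include <stddef.h>

struct file {
    int refs;
    int freed;   /* for illustration: set when the last ref drops */
};

#define NFILES 16
static struct file *fd_table[NFILES];

static struct file *fget(int fd)
{
    struct file *f = fd_table[fd];  /* real code: under a lock or RCU */
    if (f)
        f->refs++;                  /* caller now owns a reference */
    return f;
}

static void fput(struct file *f)
{
    if (--f->refs == 0)
        f->freed = 1;               /* stand-in for freeing the object */
}

static int do_close(int fd)
{
    struct file *f = fd_table[fd];
    if (!f)
        return -1;                  /* i.e. -EBADF */
    fd_table[fd] = NULL;            /* slot cleared immediately... */
    fput(f);                        /* ...object survives while reads hold refs */
    return 0;
}
```

With this shape, a blocked read keeps the underlying object alive past close(), which is exactly the Linux behavior the man page excerpt below describes.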
<mrvn> it's a valid but probably not that useful implementation
<dh`> mrvn: if you close the other end of the socket that will cause read to finish and return 0
<mrvn> dh`: I expect close() to close the socket. You are breaking that promise.
<geist> i think the idea there is there's a difference between bumping a ref and holding onto the object, and the object itself getting cancelled such that any blocking operations bounce out
<mrvn> dh`: so the other end never sees the socket close and won't close its own end.
<dh`> mrvn, that's not consistent with the existence of dup() let alone anything more complex
<geist> they are really two different things. the pending read can bounce out of something that is cancelled
<geist> if it's blocking
<mrvn> dh`: I assumed it's the close of the last copy of the socket.
<geist> if it's doing something non blocking, like page by page copying data out of a file cache, it could abort early or finish i suppose and still be consistent
<dh`> mrvn: but the thread reading has its own working copy of/reference to the socket
<dh`> if you wanted to revoke that you should have called revoke()
<mrvn> dh`: In my mind the process is this: close() -> socket close -> tcp close -> socket cancel blocked ops
<dh`> (and persuaded the maintainers of your kernel to support revoke on sockets)
<mrvn> dh`: the destruction of the tcp connection wakes up the read in the end.
<dh`> yes, but that's not the model you get by default
<mrvn> it's the behavior I expect posix systems to have. close on sockets should wake reads. Not sure what I expect on files.
<mrvn> The difference being EOF waking up read.
<dh`> it is definitely the case that that's not guaranteed, because like I said the natural implementation is that the read secures its own reference to the socket while it's working
<dh`> EOF will wake up read, but closing your file handle under the read doesn't cause that
<mrvn> dh`: so the socket would remain open forever? Even though the TCP side is closed?
<dh`> no
* dh` fails to understand what's so hard about this
<dh`> if you close the write end, the reader on the read end will wake up and exit
<mrvn> dh`: you close both ends with close()
<dh`> that's not a well-specified state
<dh`> close() closes file handles, not sockets.
<dh`> if you close the last reference to the write end, the reader on the read end will wake up and exit
<mrvn> then lets simplify: shutdown(fd, SHUT_RDWR);
<dh`> that should also cause any readers on the read end to wake up and exit
<mrvn> and close() should do something else on sockets?
<mrvn> In my mind close(fd) and shutdown(fd, SHUT_RDWR); should be the same.
<dh`> close closes file handles, not internal kernel objects
<dh`> they are not, because shutdown does not close the file handle
<geist> i'm not sure the man pages agree with that
xenos1984 has quit [Read error: Connection reset by peer]
<dh`> so your mind needs to visit a few man pages :-p
<geist> at least on linux and mac
<geist> both of them have verbiage to the effect of 'if it's the last file descriptor internal resources are freed'
<geist> lots of ways to read that but it seems to indicate that the fd count going to zero at least triggers some sort of internal shutdown path
<geist> even if there are still references to the objects floating around in the kernel
<mrvn> If fd is the last file descriptor referring to the underlying open file description (see open(2)), the resources associated with the open file description are freed;
<mrvn> *last file descriptor*, not internal reference
<dh`> maybe, it's not clear that whoever wrote that text even thought about pending references
<geist> not sure if these man pages are describing the behavior of the implementation of how its specced, however
<geist> s/of/or
<mrvn> possible
<dh`> and it's definitely inadvisable to impute intent regarding something to documentation that never considered it
<geist> the mac one is a bit more interesting
<mrvn> The manpage also says: "On Linux (and possibly some other systems), the behavior is different: the blocking I/O system call holds a reference to the underlying open file description, and this reference keeps the description open until the I/O system call completes. (See open(2) for a discussion of open file descriptions.) Thus, the blocking system call in the first thread may successfully complete after the close() in the second thread."
<geist> "The close() call deletes a descriptor from the per-process object reference table. If this is the last reference to the underlying object, the object will be deactivated. For example, on the last close of a file the current seek pointer associated with the file is lost; on the last close of a socket(2) associated naming information and queued data are discarded; on the last close of a file holding an advisory lock the lock is released (see further flock(2))."
<geist> the mac one seems to indicate it does the other path. ie, the object is closed when the last ref goes away, and internal refs also work
<dh`> geist: that's the same text we have in netbsd
<geist> yah probably derived from the same BSD docs
<dh`> yeah
<geist> actually says BSD 4 at the bottom yeah
<mrvn> I still think keeping a read() on a socket (and the socket and tcp connection) you closed alive is not desirable.
<mrvn> now I think the only thing left to do is test how bsd and linux actually behave.
<geist> so all this aside i think what we can derive is different posix systems don't handle this consistently
<geist> but since linux is the only thing that matters...
<heat> AMEN
<mrvn> hehe. zircon matters too
<heat> also HP-UX
piraty is now known as Piraty
<geist> i say the last thing with a heavy heart
<heat> only itanium supporting systems
<dh`> in netbsd the text dates back to -r1.1 in 1993 so probably from 4.4-lite
<sham1> closing a file descriptor should cause an EINTR or something like that
<geist> zircon actually has something more subtle: an object can have any number of internal references, including just plain user facing handles. but there *is* a one way signal called on the object when the user handle count goes to zero
<geist> .OnZeroHandles() or something like that on the object
<sham1> Basically to just stop the blocking read and saying "sorry, the file is now closed. Shouldn't have used threads like this"
<geist> so there are some cases where the last user handle going away automatically triggers some sort of internal cleanup even if some internal references are held
<dh`> as we just spent a long time discussing, that is not guaranteed and not how it's implemented in most places
<mrvn> geist: can you close a socket while it still has references?
<mrvn> (the tcp side)
<geist> in what case?
<mrvn> close()
<geist> i dunno, what OS are you talking about?
<dh`> whether this behavior violates the basic atomicity guarantees is at least debatable
<mrvn> zircon, one thread does read(fd), another does close(fd).
<geist> the kernel doesn't implement file systems or net stack
<geist> but the gist is the other side would see that the last handle to the IPC channel went away and start shutting down
<heat> this is why microkernels are superior
<geist> ie, the network stack gets a signal when the other end is closed (ie, on zero handles to the client side of the IPC channel that the socket is implemented over)
<heat> you avoid all kinds of debates by just not doing it
<geist> really IPC objects are the main users of the OnZeroHandles state, since otherwise you can't tell if the other side hung up
<dh`> in a microkernel environment, what does it even mean to have an operation pending while you close the handle?
<dh`> there are only messages
<geist> and since you cant construct a new handle from zero handles, it's a one way road: once you get to zero handles it's a permanent state
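The OnZeroHandles mechanism geist describes could be sketched roughly like this (hypothetical names, not Zircon's actual API): internal references and user handles are counted separately, and the handle count hitting zero fires a one-way notification exactly once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of a zircon-style OnZeroHandles signal, not the
 * real API: an object keeps separate counts of internal kernel
 * references and user-facing handles.  When the handle count hits
 * zero, a one-way callback fires exactly once; since no new handle
 * can be made from zero handles, the state is permanent. */
typedef struct object {
    int internal_refs;      /* kernel-internal references */
    int user_handles;       /* user-facing handles */
    bool zero_handles_seen; /* one-way: set once, never cleared */
    void (*on_zero_handles)(struct object *);
} object_t;

static void handle_close(object_t *o)
{
    assert(o->user_handles > 0);
    if (--o->user_handles == 0 && !o->zero_handles_seen) {
        o->zero_handles_seen = true;   /* permanent state */
        if (o->on_zero_handles)
            o->on_zero_handles(o);     /* e.g. signal "IPC peer closed" */
    }
}
```

An IPC server would hook `on_zero_handles` on its end of the channel to notice the client hung up, even when the client was simply killed and never sent an explicit shutdown message.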
<mrvn> geist: yeah. but tcp sockets are a bit different. they have connections that you can shutdown without the object getting deleted.
<mrvn> (at least in posix)
<geist> in that case a shutdown() would almost certainly be a message of the IPC
<geist> dh`: depends on the type of microkernel. zircon is fairly uh... 'rich'
<mrvn> yep.
<geist> in that it is on the larger side of it, but what we do kinda consistently is there are N types of objects and they all operate using the same model
<geist> threads, processes, jobs, ipcs, memory objects, etc
<dh`> or does the system guarantee you a reply paired with your request or something so there is still some kind of pending state?
<geist> so yes you can 'read' from a VM object for example, which is kinda file like
<mrvn> geist: so my mind model would be that close(fd); does send a shutdown IPC message for sockets and then later when the refcount becomes 0 the resources get freed.
<mrvn> and removes the handle from the table
<heat> how do I make concurrent open()'s unsuck?
<heat> or suck less
<geist> yah though in the case where the process simply gets axed and all the handles closed you always have to have a mechanism for servers in a µkernel world to detect the closing of the other end
<geist> so the built in OnZeroHandles mechanism works for zircon for that
<dh`> define suck in this case?
<geist> ie, in lieu of any explicit shutdown at least the server notices the other side went away
<heat> imagine a fd table with an rwlock, open/socket/dup/whatever that creates a fd needs to write lock
<heat> which Sucks(tm)
<heat> I think most UNIXes have a workaround for this (full blown RCU or other more dollar store solutions)
<dh`> open the object first, only lock the table to scan it and insert
<geist> mrvn: but yeah for a socket style close() (if you're trying to implement POSIX on top of the µkernel) you could do some sort of pending message to it
<heat> yes, but that's still slow
<dh`> (or alternatively, lock the table first to scan it and insert a placeholder, then leave only the placeholder locked while you're working)
<heat> you'll still have a bunch of contention there which will be PESSIMAL
<heat> I remember NetBSD also had a workaround for this
<dh`> unfortunately for unix-style handles you're supposed to guarantee that you return the lowest available slot so you can't avoid the scanning
<geist> mrvn: a lot of it depends on how much you do or dont try to map posix fds to a IPC object. if you did 1:1 it might make sense to just use the IPC channel close semantics to do the same thing as close
<dh`> you can cache it away in some cases
<geist> but if it's something more complicated, where you're multiplexing N fds over M IPC channels then you could build up your own state there in user space
<mrvn> geist: it's more about sockets having a shutdown method separate from the socket object getting destroyed.
<geist> sure
<geist> shutdown() i'd assume would be a message over the IPC channel to the network server
<mrvn> yep. as it is with tcp
<geist> since you're already going to need some messaging scheme for all the other out of band data
<mrvn> files don't have that semantic so I have no idea what read() on a file should do. different expectation there.
<mrvn> but with the IPC mechanism having files and sockets behave the same, i.e. close(fd) sends a shutdown over the IPC connection, it would make sense to have them behave the same.
<dh`> heat: you could imagine something like always keeping the descriptor table dense by allocating placeholder objects for holes and then keeping the placeholder objects on a linked list
<geist> yah part of the sort of half solution of modelling sockets as files in unix
xenos1984 has joined #osdev
<geist> like, it sort of works except where it doesn't
<dh`> so when you go to allocate you pop the first placeholder off the freelist, and if there isn't any you grab the next table entry
<netbsduser> i do like to allocate placeholder objects
<mrvn> dh`: union { int next_free_fd; file_descr fd; } fds[max];
<dh`> whether this is actually any better than just locking the table and scanning it (especially if you keep track of the start and end points for scanning) is an open question
<dh`> I'd guess not
<mjg> burp
<mjg> lemme tell you something
<heat> mjg, omg hi rick
<mrvn> dh`: both are O(1) if the table has a maximum size.
<netbsduser> it's how i do page-in efficiently: you allocate the page and mark it busy, and abandon all locks, then you wait on an event (to which the page structure points) until the page is in-memory
<dh`> mrvn: that doesn't work well because you need to be able to insert
<mjg> heat: EZ
<dh`> EVERYTHING is O(1) if the size is bounded, that's not useful
<mjg> - spin_lock(&lock);
<mrvn> dh`: insert what?
<mjg> + //spin_lock(&lock);
<netbsduser> so if someone decides to munmap the area, then it just sets a flag in the page saying "you are surplus to requirements, please be freed when this I/O is done"
<mjg> now you are LOCKLESS
<dh`> mrvn: freelist entries
<mrvn> dh`: how do you insert an FD between 4 and 5?
<heat> mjg, I'm reading netbsd's fd_expand, etc and I don't get it
<dh`> mrvn: suppose fds 0-2, 4-7, and 8-10 are open and I close 5
<mrvn> dh`: ahh, why? you reuse the last closed FD first.
<heat> it looks like running with scissors atomics version
<mjg> heat: just like with openbsd, i'm not looking at that
<mrvn> nothing says open should get the lowest free FD, right?
<dh`> not in unix you don't, you are required to return the lowest-numbered available fd
<mjg> mrvn: posix says
<mjg> which is a major pita
<mrvn> you want to do POSIX? you have bigger problems. :)
<dh`> posix says, because traditionally there was no dup2 and if you wanted to do I/O redirection you had to rely on that semantic
<heat> mjg, freebsd uses SMR right?
<mjg> whether you want to do posix or not, this has been the case for decades
<mjg> heat: for what
<mjg> so you can't just change it
<heat> this stuff
<mjg> heat: no
<dh`> realistically these days you're unlikely to break anything by violating that rule
<mjg> there is code which expects the order
<heat> mjg, ok father, then how does it do stuff
<heat> does it handroll some weird RCU too
<mjg> heat: it is all stupid
<mjg> heat: file * objs are *never* actually freed
<dh`> mjg: have you seen any such code in the wild in the last say 10 years? I haven't
<mjg> heat: and file tables only disappear after the proc exits
<heat> god what
<mjg> dh`: i did, the idea is: close all shit, then open /dev/null a bunch of times to fill in 0/1/2
<mjg> heat: GEEZER motherfucker
<dh`> yes, I know the idea
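The idiom mjg is referring to is the classic daemonize pattern, which only works because open() is required to return the lowest free descriptor. A minimal sketch:

```c
#include <fcntl.h>
#include <unistd.h>

/* The pattern mjg describes: close everything, then rely on the POSIX
 * rule that open() returns the lowest free fd so three opens of
 * /dev/null land exactly on 0, 1 and 2.  If a kernel stopped
 * guaranteeing lowest-fd allocation, this code would silently send
 * stdio to arbitrary descriptors. */
static void detach_stdio(void)
{
    close(0);
    close(1);
    close(2);
    open("/dev/null", O_RDWR);  /* lowest free fd -> 0 (stdin)  */
    open("/dev/null", O_RDWR);  /* -> 1 (stdout) */
    open("/dev/null", O_RDWR);  /* -> 2 (stderr) */
}
```

This is exactly why "just return any free fd" breaks decades-old code: the three opens carry no fd numbers at all and depend entirely on the allocation order.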
<dh`> where did you see it and why didn't you patch it out?
<mjg> well if you no longer guarantee lowest fd, you are dead here
<mjg> i don't even remember, does it matter?
<heat> so erm erm erm if I expand the fd table a bunch of times do you keep them all cached?
<mjg> point is there is code like that out there
<mjg> you can't just change the behavior from under it
<mjg> heat: not if the process is single threaded, otherwise yes
<mrvn> anyway, you can make it a sorted doubly linked list.
<dh`> it matters because if you decide to break that rule you want to know what the probability is of hitting something that doesn't work
<mjg> heat: it does not have to be like that, rcu or not, but here we are. mostly because geezer
<mrvn> or just keep a pointer to the lowest free FD and search from there.
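mrvn's "pointer to the lowest free FD" idea (which dh` also alludes to with tracking scan start points) could be sketched like this; names and sizes are made up for illustration, and locking is omitted:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of POSIX lowest-fd allocation with a cached scan hint:
 * allocation scans from the hint instead of from 0, and freeing a
 * lower fd pulls the hint back down, preserving the lowest-fd rule. */
#define FD_MAX 64

struct fdtable {
    void *files[FD_MAX]; /* NULL means the slot is free */
    int lowest_free;     /* invariant: no free slot below this index */
};

static int fd_alloc(struct fdtable *t, void *file)
{
    for (int fd = t->lowest_free; fd < FD_MAX; fd++) {
        if (t->files[fd] == NULL) {
            t->files[fd] = file;
            t->lowest_free = fd + 1; /* next scan starts past us */
            return fd;
        }
    }
    return -1; /* table full: EMFILE */
}

static void fd_free(struct fdtable *t, int fd)
{
    t->files[fd] = NULL;
    if (fd < t->lowest_free)
        t->lowest_free = fd; /* keep the lowest-fd guarantee */
}
```

The worst case is still a linear scan (close the lowest fd of a dense table and the next alloc walks everything), which is the problem the bitmap schemes discussed below this point attack.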
<heat> god.jpeg
<heat> netbsd seems to do similar
<mjg> yes, the idea was taken from netbsd
<mjg> it is all geezer
<heat> now I want to see what OpenBSD does
<mjg> :DD
<mjg> brah
<heat> i bet 10 on BKL
<mjg> openbsd is doing turbo stupid
<mjg> here is a funny story
<heat> hey no spoilers!
<mjg> traditionally unix would allocate fds by traversing an array
<mjg> bsds including
<mjg> openbsd was the first bsd to implement a bitmap, in fact two level
<mjg> some time later the rest followed suit
<mjg> exdcept freebsd has one level with no explanation why not two
<mjg> all of which referred to the same paper
<mjg> so sounds like obsd has a leg up, or at least did...
<geist> okay, again.
<heat> what paper?
<mjg> except apart from the bitmaps they *still* do linear array scans
<mjg> give me a minute
<zzo38> I think there is something wrong with the wiki. Even if you use real mode does not necessarily mean that you have to use the PC BIOS, and it is not necessary to use the PC BIOS for all operations even if you have it available.
<mjg> heat: Scalable kernel performance for Internet servers under realistic loads
<mjg> heat: gaurav banga & co
<geist> zzo38: this is true. is there a good example of this?
<zzo38> It is true that some of the hardware features are a bit messy due to compatibility (such as the A20 gate), but some of the things still can be sensible for some kinds of systems.
<geist> i can imagine there's stuff that goes out of its way to use bios calls to write to the text display
<zzo38> Also, UEFI is even more messy and even worse, in many ways.
<netbsduser> this is why i keep well away from both
<zzo38> BIOS calls are probably most useful during the initial booting to read the kernel and drivers from the disk; after that, presumably you will have better drivers suitable for your system.
<geist> that's the idea, yeah
<mjg> heat: btw the paper incorrectly claims the approach is logarithmic
<mjg> heat: kind of funny
<heat> mjg, open seems to have copied net too
<mjg> heat: too bad they did not bench vs single-level bitmap
<mjg> in what regard
<mjg> *bitmaps* were first in openbsd afair, it was the rest which copied from there
<mjg> obsd got it in 2003 or so
<mjg> very positively surprised with dtrace: dtrace -n 'fbt::memset:entry { printf("%d %d", cpu, arg2); }' -o /tmp/memsets
<mjg> per-cpu trace of all calls with 0 drops
<mjg> very nice
<heat> also fyi linux also does single-level I think
<mjg> no, linux got 2 level ~7 years ago
<zzo38> Also, the PC BIOS provides the booting function, and UEFI is too complicated in that way. Furthermore, I think it is not legitimate to be called "PC" if the PC BIOS is not implemented. (I do have an idea about how to design a better booting system in ROM, but it is not a PC booting system but it would be possible to implement both if it is desirable)
<mjg> heat: the real interesting bit is solaris which has a *tree* instead
<mjg> heat: dfly copied from there
<mjg> i have no idea how that performs
<dh`> two layers is still a tree
<zzo38> I think that HDMI and USB also is no good
<mrvn> a stunted tree
<mjg> guaranteed 2 layers no matter what
<mjg> is not
<heat> how does a 2-layer bitmap work?
<mjg> cmon dwag
<mrvn> heat: top layer bit says if there is a leaf for the 2nd level bitmap
<mjg> read some openbsd!
<geist> i assume you just have a top layer bitmap that determines if there are holes in blocks of lower level nodes
<mjg> right
<mjg> that's it, literally 0 magic
<dh`> however you want, but my guess would be that each bit in the lower layer indicates the state of one fd entry and each bit in the upper layer indicates whether there's a free bit in each word of the lower layer
<dh`> because with 32-bit words and 1024 fds max that all fits tidily
<heat> right
<geist> yah that's what i'd think. you could do something more complicated like a bit that signifies if the entire sub tree is occupied or not
<netbsduser> just flicked Solaris Internals open to the page on the fd tree, funny, they have a comment on exactly what people were chatting about earlier on colliding read() and close()
<dh`> but it seems stupid given that the granularity of the upper layer should be a whole cache line of the lower layer
<mrvn> So 2 find_lowest_zero() calls give you the FD you can use.
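The two-level scheme dh`, geist, and mrvn just described could be sketched like this. This is a hypothetical layout (64-bit words, 64*64 = 4096 fds), not copied from any of the BSDs; the two "find lowest zero" steps are the two `__builtin_ctzll` calls:

```c
#include <assert.h>
#include <stdint.h>

/* Two-level bitmap fd allocator sketch.  A set bit means "in use";
 * a bit in the top word is set when its whole lower word is full,
 * so finding the lowest free fd is two find-first-zero operations
 * instead of a linear array scan. */
#define WORDS 64

struct fdmap {
    uint64_t top;          /* bit i set => low[i] is completely full */
    uint64_t low[WORDS];
};

static int fd_bitmap_alloc(struct fdmap *m)
{
    if (m->top == ~0ULL)
        return -1;                       /* everything in use */
    int w = __builtin_ctzll(~m->top);    /* first word with a hole */
    int b = __builtin_ctzll(~m->low[w]); /* first free bit inside it */
    m->low[w] |= 1ULL << b;
    if (m->low[w] == ~0ULL)
        m->top |= 1ULL << w;             /* word just became full */
    return w * 64 + b;                   /* lowest free fd overall */
}

static void fd_bitmap_free(struct fdmap *m, int fd)
{
    m->low[fd / 64] &= ~(1ULL << (fd % 64));
    m->top &= ~(1ULL << (fd / 64));      /* word has a hole again */
}
```

Note this is still "guaranteed 2 layers no matter what", as mjg puts it, not a general tree: capacity is fixed at 4096 here, and a real implementation grows the arrays.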
<geist> it's all because of the stupid property that fds are first fit
<geist> and the source of a whole class of bugs and exploits
<dh`> and furthermore, each entry in the lower layer may as well represent a whole cache line's worth of the table itself
<heat> netbsduser, why do you have a Solaris Internals
<geist> i have that book too
<heat> do you have an Internals for every SVR4 descendent
<geist> it's quite well written
<netbsduser> heat: i like to read about other OSes to appreciate them + ruthlessly steal ideas i like
<mrvn> geist: AMEN, open() should return the next free FD with 64bit rollover.
<heat> how's the STREAMS
<geist> mrvn: we explicitly randomized handles in zircon to avoid this stuff
Vercas7 has joined #osdev
<dh`> <mrvn> yeah I want a fdtable of size 2^64
gxt__ has quit [Write error: Connection reset by peer]
Vercas has quit [Remote host closed the connection]
Vercas7 is now known as Vercas
<mrvn> dh`: you hash that
<heat> geist, but randomized handles forces you to use a tree which sucks
<geist> not necessarily
<mrvn> geist: so nobody can guess a handle?
foudfou_ has joined #osdev
<heat> you can't do anything remotely flat can you?
<geist> mrvn: correct. and more importantly if you close a handle it wont get reused quickly
gxt__ has joined #osdev
<geist> heat: depends on how well you 'randomize' it.
<netbsduser> it's coming along, i want to figure out whether i can implement a unified low-level module which pipes/fifos, ttys, etc can all use
foudfou has quit [Ping timeout: 255 seconds]
<mrvn> heat: you can hash the handle down to a small int and choose your random handles so the hash doesn't collide.
<geist> basically we feed it through a hash and put some salt in it so that the same slot doesn't net the same id
<mrvn> heat: creating a handle might have to try a few times.
<geist> at the end of the day it is indeed slots in a table, but the process sees it hashed
<heat> oh ok, so the handles aren't indices?
<geist> it's not cryptographically perfect. you can guess, but the main point is to avoid reusing handles quickly
<geist> so that most use-after-free bugs are caught
<dh`> just allocating sequentially mod the table size serves that purpose well enough
<geist> they're 'random' to the process, though post hash + salt they are indeed just indices
<dh`> (like with process ids, random process ids still seem like a stupid idea)
<mjg> geist: you *randomize* fd for security? am i misreading something?
<geist> basically. though zircon doesnt have fds per se. but it's the handle table
<geist> less of security and more of a bug catching thing
<geist> ie, handles take a very long time to get recycled
<mrvn> mjg: I think more a mitigation against bad code
<geist> we have some additional constraints you can put on a process to cause them to instantly abort if a bad handle is used
heat has quit [Read error: Connection reset by peer]
<geist> that catches a ton of things
<mjg> do you have dup2-equivalent?
heat_ has joined #osdev
<geist> no
<moon-child> imagine fd is stored in memory and buffer overflow corrupts it
<mjg> geist: ye that is a real problem
<moon-child> you're better off if malicious actor can't control which fd it turns into
<geist> you can absolutely not in any circumstances create a handle at a known value
<mjg> there are known multithreaded progs which use fds as they are being closed
<mjg> untintentionally
<moon-child> I heard the following anecdote: somebody forked, closed stderr, and then mapped some memory
<moon-child> then wrote a log message
<mjg> there was a bug in freebsd once which broke them
<mjg> kind of funny
<moon-child> mapped memory reused the stderr fd
<moon-child> so log message stomped mmap
<heat_> are we looping
<mjg> :d
<geist> we explicitly designed the handle mechanism to try to deal with this whole known set of posix issues with fd recycling and whatnot
<geist> works pretty well
<netbsduser> moon-child: that's appalling
<netbsduser> where did that happen
<zzo38> I would have solved it by making file descriptors that are not explicitly given a number to have a minimum file descriptor number; if you want a lower number then you must explicitly request it.
<moon-child> arcan
<mrvn> zzo38: that's even worse. Now all libraries compete for low numbers.
<geist> iirc QNX did something like putting all posix fds in positive space, and all other handles to QNX specific stuff in negative space (bit 31)
<dh`> you can't have both well-known addresses and a scheme for avoiding well-known addresses
<geist> or something along those lines, so the kernel can use different allocation strategies
<dh`> I can't imagine that would work since < 0 being invalid is baked in everywhere
<mrvn> dh`: you can pass the "well-known" addresses as arguments to a process.
<geist> idea is that for internal qnx stuff that's not doing posix, the negative handles are *bad*
<geist> so if they do leak out to posix space they wont work
<dh`> they'd have to audit pretty much every open for only testing -1 explicitly instead of < 0
<geist> qnx being a microkernel, it's implementing posix in user space
<netbsduser> geist: clever trick, i might have to imitate that
<mjg> geist: my seal of approval
<dh`> I guess
<zzo38> mrvn: Well, normally 0 is used for stdin, 1 for stdout, 2 for stderr. Libraries shouldn't need to compete for low numbers, since they are only used for standard I/O anyways, I think
heat_ is now known as heat
<dh`> mrvn: you can but there are various costs to that
<heat> mjg's seal of approval is RARE
<geist> with some caveats being that they have some affordance for the kernel to directly map some ipc channels to fds, and in those case the fd-to-handle mapping is 1:1
<mrvn> zzo38: but we don't have 0, 1, 2 anyway so that point is moot.
<geist> and for everything else, handles to things that are meaningless to posix, they're in a different namespace, basically
<mjg> heat: true mjg!
<mjg> heat: true mjg@
<heat> ok mjg@
<geist> negative, not negative, doesn't matter. idea is the namespacing really
<netbsduser> i do know of some software which uses any means necessary to find out all open fds in a process and close them, but i suppose you can simply hide them from any posixy ways to find that out
<zzo38> However, my own (currently unnamed) operating system design does not have file descriptor numbers (although it can be emulated, if required for POSIX capabilities).
<netbsduser> (namely systemd uses linux's procfs to find them out)
<heat> netbsduser, FreeBSD has a syscall for that
<heat> and so does linux now
<mrvn> netbsduser: lots of software does that. Modern software should use CLOEXEC and the posix call to iterate over the fds.
<netbsduser> heat: is there a specific syscall or is it via sysctl?
<geist> i suppose it'd be easy enough to implement some sort of close_range() call
<heat> syscall, close_range in both
<mrvn> netbsduser: scanning procfs fails with threads.
<heat> and closefrom in the libc I think
<geist> you can only make a best effort. even close_range() would intrinsically race with opens in another thread
<geist> but you define most likely that it makes one pass, and closes them in a particular order
<netbsduser> systemd wants to close everything not on a whitelist it creates of acceptable fds, so i am not sure whether a close_range would work for it
<geist> such that races with any other threads are at least somewhat predictable
<zzo38> I do not have a name for my operating system design, so far
<zzo38> What operating systems were you designing and do you have any link of documentation?
<heat> netbsduser, sure does, use the gaps
<dh`> I don't think anyone here much cares what silly things systemd does
<mrvn> geist: posix_spawn can make it atomic
<mrvn> or you close after fork()
<netbsduser> dh`: i need to for the sake of a pointless publicity stunt
<heat> doing close_range for all the gaps is probably still a good bit faster than looping through fds and closing
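heat's "use the gaps" suggestion could be sketched like this: given a sorted whitelist of fds to keep, issue one close_range(2) call per gap between them instead of closing fds one at a time. close_range exists on Linux (since 5.9) and FreeBSD; the gap walk is split out here so it can be exercised without actually closing anything, and error handling is omitted:

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Walk the gaps between a sorted ascending whitelist of fds and
 * invoke fn() once per gap; close_all_except() plugs close_range(2)
 * in as the gap action, giving one syscall per gap. */
typedef void (*gap_fn)(unsigned first, unsigned last, void *arg);

static void for_each_gap(const int *keep, int nkeep, int maxfd,
                         gap_fn fn, void *arg)
{
    int next = 0;                      /* start of the current gap */
    for (int i = 0; i < nkeep; i++) {  /* keep[] is sorted ascending */
        if (keep[i] > next)
            fn(next, keep[i] - 1, arg);
        next = keep[i] + 1;
    }
    if (next <= maxfd)
        fn(next, maxfd, arg);
}

static void do_close_range(unsigned first, unsigned last, void *arg)
{
    (void)arg;
#ifdef SYS_close_range
    syscall(SYS_close_range, first, last, 0); /* Linux >= 5.9 */
#else
    for (unsigned fd = first; fd <= last; fd++)
        close(fd);                            /* portable fallback */
#endif
}

static void close_all_except(const int *keep, int nkeep, int maxfd)
{
    for_each_gap(keep, nkeep, maxfd, do_close_range, NULL);
}
```

As geist notes above, this still races with concurrent open() in other threads; it is only well-defined in the post-fork single-threaded window.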
<geist> right, where the multithreading isn't an issue
<netbsduser> porting systemd to my kernel would make excellent hackernews bait
<geist> because post fork it's just a single thread
<heat> lol
gog has quit [Ping timeout: 246 seconds]
<geist> (until you start making new ones)
<mrvn> and you really shouldn't close random FDs before the fork()
<heat> didn't you port systemd to BSD?
<heat> or do I have the wrong guy
<netbsduser> i did, it was mostly for the same reason
<heat> ah, you do like the headlines
<geist> hah elaborate ways to get social headlines huh
<geist> i suppose that checks out
<netbsduser> i have an insatiable inner troll but i couldn't bear to do it the old-fashioned way with incendiary posts to forums and suchlike
<heat> you should port glibc to BSD now
<mrvn> Which actually brings me to a problem I had at work on friday: How do you get the highest FD.fileno that's open under python?
<heat> and then coreutils
<netbsduser> doing weird things with software is much more professional
<mrvn> heat: Debian kfreebsd
<netbsduser> heat: glibc did have a freebsd port at one point
<heat> i know
<netbsduser> at least formally it's a retargetable libc, i know someone is trying to bring it to managarm
<heat> so did debian
<heat> i have an in-progress port to Onyx
<netbsduser> they are having big trouble with its native posix threading library, which is very linux
<heat> it's a good libc
<heat> bah, nptl is fine
<heat> i hacked musl's nptl stuff and glibc isn't that much harder
<heat> you can also just implement your own separate library because glibc is ofc completely configurable
<netbsduser> in my experience i've found gnu stuff is often surprisingly portable, who else (but perl) checks for dynix, eunice, and the windows subsystem for posix applications in their configure scripts?
<geist> glibc, yeah that's been ported to all sorts of non linux things
<mjg> sorry to interject, do you have a minute to flame about memset?
<geist> haiku uses it, and back in the day BeOS did
<heat> yes gnu stuff is Great(tm)
<mjg> i got a real trace, all memsets made during build kernel, for each cpu
* dh` chuckles politely
<heat> supports all kinds of crap systems
<mjg> and a prog to execute them
<netbsduser> i never knew they were using glibc at haiku
<mjg> heheszek-read-current 148708742 cycles
<mjg> heheszek-read-bionic 98762683 cycles
<mjg> heheszek-read-erms 233876267 cycles
<mrvn> mjg: do you have a histogram of sizes?
<mjg> bionic "cheats" by using simd
<netbsduser> i know vmware esxi uses it (but i'm not sure if they just implemented a rudimentary linux abi compatibility)
<mrvn> mjg: how often is memset called to bzero?
<mjg> literally every time
<mjg> anyhow, as you can see, erms is turbo slower
<heat> mjg, ok, what's the point
<mrvn> literally or every time? Is there even one exception?
<mjg> mrvn: not in producton
<mjg> debug has it for poisoning
<mjg> heat: what's the point of what
<mrvn> mjg: with a byte value or 32/64 bit pattern?
<heat> what's the big revelation in these results?
<heat> rep stosb bad, simd good, current ok?
<mjg> heat: there is no revelation, just confirmation erms crapper
<mjg> heat: and more importantly now there is a realworld-er (if you will) setup to bench changes to memset
<heat> where
<mjg> on my test box!
<heat> is this Proprietary(tm)
<mjg> not-heat licensed
<mjg> look mate, the code looks like poo right now
<mjg> i'm gonna play around with memset, clean that up and then publish somewhere
<heat> cool
<mjg> will be useful for that linux flame thread
<heat> no one flamed man
<mjg> note there was one major worry here: that there is branch prediction misses
<heat> how is that a flame thread?
<mjg> with ever changing sizes
<mjg> heat: see my previous remark about polack word choice
<heat> that thread is probably the tamest the lkml has ever been
<heat> particularly since linus likes you so much
<mjg> i don't think he does mate, but senkju
Vercas3 has joined #osdev
<heat> you're way better than the other mjg
<mjg> i'm going to generate more traces, including from linux
<mjg> for memset, memcpy and copyin/copyout
<mjg> then we will see what happens
<heat> geist, hello sir do u have time to run something on one of your ryzens?
Vercas has quit [Ping timeout: 255 seconds]
Vercas3 is now known as Vercas
<mjg> heat: do you have a memset?
<heat> no
<mjg> aight, no biggie
<heat> i wanted to try borislav's "rep movsb is totally good on amd" claim
<mjg> which amd tho
<heat> recent probably
<mjg> right
<heat> everything was bad on bulldozer
<mjg> it may not even be on the market
<mjg> even so, i have to note the typical way of benchmarking string ops by calling them in a loop with same size stuff can be quite misleading
<mrvn> heat: you should make the kernel/libc benchmark memcpy/memset at start and pick the fastest for the actual cpu.
<mjg> for example, if you trained the branch predictor, a 32-byte loop is way faster than erms for sizes past 256 bytes even
<mjg> but this goes down the drain if you get misses
<mjg> tradeoff city
<mjg> in fact you may get slower
<heat> yes but don't forget this is all microbenchmarking
<mjg> north remembers
<heat> does any of this REALLY matter on a real workload? probably not
<mjg> ha
<mjg> wrong!
<heat> maybe slightly
<mjg> lemme find it
<mjg> well it mostly does not once you reach basic sanity
<heat> it's like the age old "just use rep movsb/q/l/w, cuz icache"
<mjg> i got numbers showing *tar* speed up after unfucking the string ops
<mjg> they used to be rep stosq ; rep stosb
<mjg> and so on
<mjg> absolute fucking massacre for the cpu
<mjg> bummer, can't find it right now
<heat> yeah but tar is just a fancy exercise in memory copying isn't it
<mjg> but bottom line, the really bad ops were demolishing perf
<heat> read(...) + write(...)
<mjg> handsome, tar was doing a metric fuckton of few byte ops
<mjg> not the actual data extraction
<mjg> and this was most affected
<heat> did you just call me handsome
<mjg> it is my goto insult
<heat> it's the harshest canadian insult after all
<mjg> so the jury is still out
airplanemodes has joined #osdev
<mjg> i *randomized* tons of sizes and fed them
<mjg> into the bench
<mjg> this makes erms faster *sometimes* and it is all because of branch mispredicts
<mjg> 19% for current memset, 4% for erms
<mjg> i note real-world trace has a win because the calls tend to repeat
<mjg> but in principle there may be another workload where the above happens instead
<zzo38> Are there wiki pages relating to such things as capability-based security?
<mjg> not that i know of, but one thing to google is: capsicum
<heat> and fuchsia
nyah has quit [Quit: leaving]