#osdev on 2023-03-05 — irc logs at libera.irclog.whitequark.org

2021-05-23 01:57 klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books

00:00 <heat> most europeanest sport

00:00 <theWeaver> it's fucking cringe tho

00:00 <heat> why

00:00 <theWeaver> because of danish people i guess

00:00 <heat> you have a ball, dribble, jump, throw it towards the goal, hopefully score

00:00 <theWeaver> yeah but you let danes play and danes are cringe

00:00 <heat> although AFAIK most of them wear pants which is a certified cringe

00:01 <heat> sports should be played with shorts and not pants

00:01 <theWeaver> are we talking british english pants or american english pants

00:01 <heat> american

00:01 <theWeaver> idk why i asked, both options would be pretty cringe

00:01 <FireFly> theWeaver: well is like, cricket the only noncringe sport or what :p

00:02 <theWeaver> FireFly: no, as I said, cricket is fucking cringe as fuck

00:02 <theWeaver> re-read the scrollback pls

00:02 <FireFly> oops

00:02 <heat> chad sport: gymnastics

00:02 <heat> which one? all of it

00:02 <theWeaver> rn the only non-cringe sport i can think of is basketball

00:02 <theWeaver> basketball is just straight baller

00:02 <theWeaver> no question

00:03 <FireFly> pretty sure danish people do basketball too :P

00:03 <heat> nba basketball is cringe, ncaa basketball is cringier

00:03 <theWeaver> basketball is so cool even danes can't make it cringe

00:03 <heat> all american sports are just poor excuses for ad breaks

00:03 <theWeaver> heat: your mum is cringe

00:03 <FireFly> ok what about floorball? :p

00:03 <heat> no urs is cringe

00:04 <theWeaver> heat: yeah she's pretty fuckin cringe

00:04 <theWeaver> doesn't change the fact, yours is too

00:04 <heat> OH hockey is legit

00:04 <theWeaver> actually roller derby is non-cringe

00:04 <FireFly> oh yeah

00:04 <FireFly> roller derby's cool

00:04 <theWeaver> roller derby is mega cool

00:05 <theWeaver> it's the most lesbian sport ever and it rules

00:05 <heat> you know what's REALLY cringe?

00:05 <theWeaver> heat: what

00:05 <heat> snooker and cycling

00:05 <heat> old people sports

00:05 <theWeaver> idk i feel like golf is even worse

00:05 <gog> golf is the worst

00:05 <gog> golf is evil

00:05 <theWeaver> but snooker and cycling are pretty cringe yeah

00:06 <heat> snooker is like the only sport where the top 5 is composed by old, balding british men

00:06 <heat> with a belly ofc

00:06 <FireFly> billiard is fine in a casual setting :p

00:06 <theWeaver> pool is cool

00:06 <theWeaver> snooker is cringe

00:06 <FireFly> oh idk the precise differences

00:07 <theWeaver> FireFly: if it helps i'll repeat it

00:07 <theWeaver> the cool one is pool

00:07 <theWeaver> the cringe one is snooker

00:07 <FireFly> :p

00:07 <heat> chess boxing is mega cringe

00:07 <heat> just nerds trying to be cool

00:07 <theWeaver> ... what the fuck is chess boxing

00:07 <FireFly> that's fine :p

00:07 <heat> it's exactly what it sounds like

00:08 <FireFly> theWeaver: alternating short rounds of boxing and chess until either checkmate or knockout

00:08 <theWeaver> i can't conceive of how you even mix those two

00:08 <theWeaver> FireFly: what

00:08 <theWeaver> are you serious

00:08 <FireFly> the idea being that you probably make poorer moves after having been beaten for a bit

00:08 <FireFly> yes :p

00:08 <FireFly> idk people are allowed to do silly things

00:08 <FireFly> I don't mind :p

00:09 <theWeaver> i'm not saying they're not allowed

00:09 <theWeaver> i just don't really get it

00:09 <theWeaver> but then people do all sorts of stupid shit they shouldn't be allowed to that i do get

00:09 <FireFly> like what?

00:10 <heat> sniffing glue

00:11 <theWeaver> voting for tories

00:11 <theWeaver> wait shit

00:11 <theWeaver> no not that one, i don't get that one

00:11 <heat> hahaha

00:11 <theWeaver> (definitely stupid and shouldn't be allowed tho)

00:11 <heat> labour's now-anual how-to-lose-an-election is definitely a sport

00:12 <heat> annual*

00:12 <FireFly> "The goal of eight-ball, which is played with a full rack of fifteen balls and the cue ball, is to claim a suit (commonly stripes or solids in the US, and reds or yellows in the UK), [...]" wait what, red/yellow instead of solid/striped o.o

00:12 <theWeaver> someone needs to punch Keith in the dick and make him resign

00:12 <theWeaver> useless motherfucker

00:12 <heat> who's keith?

00:12 <theWeaver> keir starmer

00:13 <FireFly> oh yeah rainy island politics is very weird

00:13 <heat> not-a-sport: chess, poker

00:13 <FireFly> from what I gather from the occasional updates I hear of it

00:14 <heat> also darts are cringe

00:14 <theWeaver> FireFly: yeah must be strange for someone like you who lives in a partially sane country and comes from a halfway decent one

00:16 <heat> britain is relatively sane

00:16 <theWeaver> relative to what

00:16 <theWeaver> the USA?

00:16 <heat> britain is only "omg totally insane lads can't take this" to british people

00:16 <theWeaver> because if so yeah kinda but that's a dangerously low bar

00:17 <heat> 3/4 of europe at least

00:17 <theWeaver> i've yet to find a european country that was more fucked up than the UK

00:17 <theWeaver> france maybe

00:19 <heat> france, italy, germany, portugal, spain, all of eastern europe

00:19 <theWeaver> fuck off, germany is definitely saner

00:19 <theWeaver> spain too

00:19 <heat> spain is not sane

00:19 <FireFly> not sure I think france is more fucked up than the UK tbh

00:19 <FireFly> heat: that was not the claim :p

00:20 <heat> spain has like 3 separatist movements going on at the same time

00:20 <mjg> lol @ your separatist movements

00:21 <mjg> in poland there are at least 3 separate monarchist movements

00:21 <mjg> one of them has a self-proclaimed regent

00:21 <mjg> apart from all of this there was a self-proclaimed king

00:21 <mjg> i don't even know who to bow to anymore

00:21 <heat> mjg for king

00:22 <mjg> i would teach limitations of big O in elementary school

00:22 <mjg> feel the tyranny

00:23 <heat> history is just unix geezers who wrote PESSIMAL code

00:23 <mjg> there are also non-unix geezers who did the same thing man

00:23 <heat> and readings of git blame

00:23 <theWeaver> tbh, politics be fucked up

00:23 <mjg> old unix geezer is the today's webdev

00:23 <FireFly> "remember that asymptotic behaviour doesn't necessarily specialise to a specific choice of n, kids" "..can we learn about multiplication now, teacher?"

00:24 <mjg> FireFly: "mention balancing your checkbook again. i dare you, i double dare you motherfucker"

00:24 <theWeaver> mjg: what?

00:24 <heat> theWeaver, germany has insane politics. like late stage UK politics

00:24 <mjg> theWeaver: what what

00:24 <theWeaver> mjg: what what, in the butt?

00:24 <FireFly> germany has dumb politics but doesn't the UK have even more of that? :p from my POV at least

00:24 <mjg> theWeaver: no propositioning on a sunday

00:25 <FireFly> I mean hey rainy island even went full brexit

00:25 <mjg> FireFly: what's the german equivalent of farage?

00:25 <theWeaver> mjg: ooookaaaaaaaaaaaay

00:25 <heat> UK has a two party system with one party dominance and a bunch of smaller, separate parties

00:25 <heat> germany has CDU

00:26 <theWeaver> heat: CDU is still not as bad as the tories

00:26 <FireFly> in theory it's a SPD/CDU two-major-parties one-on-each-side system though, no?

00:26 <geist> okay... so

00:26 <FireFly> lessee

00:26 * geist points to the stay-on-topic sign

00:26 <FireFly> ..reasonable yes

00:26 <theWeaver> there's a topic in this channel?

00:26 * theWeaver just uses #osdev to shitpost in

00:26 <geist> we really should make a #osdev-offtopic channel

00:26 <geist> yeah please dont

00:27 <mjg> so fun fact: a naive sse2 memcpy *loses* to naive movs for sizes up to ~24

00:27 <mjg> the one found in bionic

00:27 <heat> i mean wasn't #offtopia #osdev-offtopic?

00:27 slidercrank has quit [Ping timeout: 255 seconds]

00:27 <mjg> [note: no sse used below 16]

00:27 <geist> heat: i didn't parse that sentence

00:27 <heat> i think current #offtopia was #osdev-offtopic a few years ago

00:28 <geist> well, may not have survived the move to libera

00:28 <mjg> what on earth is #Offtopia

00:28 <geist> yah i dunno what that is

00:28 <heat> it's an offtopic channel with a bunch of #osdev people in it

00:28 <mjg> what's the signal:noise ratio on that one

00:29 <heat> anyway re: memcpy, bionic sse2 string ops aren't that great

00:29 <mjg> i can tell you a well guarded secret: freebsd developer channel is named #sparc64, which is extra funny now that the arch is no longer supported

00:29 <mjg> heat: agreed

00:29 <geist> surprised it tried to use sse on anything smaller than say 64 bytes or so

00:30 <heat> blind rep movsb can beat its fancy sse2 memcpy

00:30 <heat> in fact, it mostly does

00:30 <geist> mjg: seems like a pretty good way to avoid lookyloos

00:30 <mjg> i'm pretty sure glibc uses simd as soon as it can

00:30 <geist> but anyway bionic not having an optimized x86 is probably not that surprising, considering android on x86 is not that big of a thing

00:30 <mjg> so either 16 or 32 depending on instruction set

00:30 <heat> geist, it does, contributed by intel

00:30 <geist> exactly.

00:31 <mjg> quite frankly i would expect that code to be slapped in from whatever internals they had

00:31 <geist> and then probably promptly dropped on the floor

00:31 <mjg> probably shared with glibc to some extent at the time

00:31 <heat> glibc's string ops code is nuts

00:31 <mjg> so it's not like it was coded up by an intern

00:31 <heat> the way they have it, avx512 code is just avx code which is just sse code but all with different params

00:32 <mjg> these ops have hardcoded parameters for one uarch

00:32 <mjg> i'm guessing the asm over there is what glibc would have used at the time for said arch

00:32 <mjg> extracted from entire machinery

00:33 <geist> so here's a question: microarchitecturally speaking is it *always* a good idea to have the bestest fastest possible memcpy

00:33 <geist> ie, given that you have a cpu that has potentially a lot of things in flight, or prefetching this and that

00:33 <mjg> but what makes the besterest mate

00:33 <geist> does shoving through the maximum amount of data through the memory subsystem per clock negatively affect other things that may be going on at the same time

00:33 <heat> mjg, btw i have experimentally verified that the avx2 memset is really solid

00:33 <geist> or even competing against other hyperthreads

00:33 <geist> well, i'd say bestest as in 'maximum number of bytes/clock'

00:33 <heat> it doubles the bandwidth of sse2 memset

00:34 <mjg> geist: that's a funny one

00:34 <heat> so if you had avx memcpy you could potentially also do double

00:34 <geist> i'm sure the answer is probably yes, but there are i'm sure sometimes downsides, kinda like how you mentioned the clzero on AMD can saturate one socket

00:34 <heat> and that's more or less what glibc does too

00:35 <mjg> geist: short answer is 'who the fuck knows', in practice people go for the fastest possible

00:35 <mjg> i will note all benchez i had seen do stuff in isolation

00:35 <geist> yah i mean in lieu of anything else, fastest > not fastest

00:35 <geist> hypothetically a fast memcpy competes negatively with SMT pairs that are off running 'regular' code at the same time, for example

00:35 selve has quit [Remote host closed the connection]

00:35 <mjg> i also have to note that glibc has tons of uarch-specific quirks

00:36 <mjg> in its string ops

00:36 <mjg> thus i found it plausible they damage control concerns like the above

00:36 * geist nods

00:36 <mjg> to the point were yu end up with a net win

00:36 selve has joined #osdev

00:36 <mjg> where

00:37 <mjg> that aside, personally i don't have good enough basis to make a definitive statement on the matter

00:37 <geist> yah was more of a thought experiment than anything else

00:37 <geist> one for which there isn't a solid answer probably

00:37 <mjg> i'll note one common idea is to roll with nt stores

00:37 <geist> or there is an answer, microarchitecturally, in very specific situations but in aggregate it is a win

00:37 <mjg> past certain size

00:37 <mjg> which is already a massive 'who knows'

00:38 <geist> yah that probably helps. i assume for example that NT stores dont chew up lots of write queue units

00:38 <mjg> the folks at G had the right idea with their automemcpy paper

00:38 <mjg> instead of hoping for generic 'least bad everywhere', they created best suited for the workload

00:38 <geist> if the cpu can track say 16 outstanding writes, and some memcpy comes along and fills at 16, then you probably have to wait for all the previous ones to finish

00:38 <geist> or maybe NT stores only use one at a time

00:38 <geist> while the other writebacks can finish

00:39 <mjg> well really depends when you start doing them

00:39 <geist> and similarly if the memcpy is slamming the load units, then the cpu may not internally compete well with it for preemptive reads

00:39 <mjg> past some arbitrary threshold or perhaps when you know the total wont fit the cache

00:40 <mjg> ultimately ram bandwidth is infinite either

00:40 <mjg> not*

00:40 <mjg> as usual the win is to not do memcpy if it can be helped :p

00:41 <geist> re: the discussion of SMT static vs dynamic assignment of resources, iirc the ryzen at least has at least some amount of static assignment in the load/store units, i think

00:41 <geist> so maybe they avoided the problem by chopping it in half there

00:41 <geist> not so much the load/store units, but the amount of outstanding transactions i think

00:42 <mjg> so i may be able to get a sensible data point soon(tm). as noted few times freebsd libc string ops don't use simd, but i can plop in some primitives and see what happens

00:42 <heat> ohhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh the llvm-libc string ops are their (automemcpy)

00:42 <heat> theirs*

00:43 <mjg> btw facebook has simd memcpy and memset apache licensed

00:43 <mjg> https://github.com/facebook/folly/blob/main/folly/memcpy.S

00:43 <bslsk05> github.com: folly/memcpy.S at main · facebook/folly · GitHub

00:44 <mjg> interestingly this bit:

00:44 <mjg> .L_GE2_LE3: movw (%rsi), %r8w movw -2(%rsi,%rdx), %r9w movw %r8w, (%rdi) movw %r9w, -2(%rdi,%rdx)

00:44 <mjg> erm

00:44 <mjg> .L_GE2_LE3: movw (%rsi), %r8w movw -2(%rsi,%rdx), %r9w movw %r8w, (%rdi) movw %r9w, -2(%rdi,%rdx)

00:44 <mjg> sigh

00:44 <heat> that's sick

00:44 <mjg> welp you see which one

00:44 <mjg> this bit used to suffer some uarch bullshit penalty

00:44 <mjg> i don't know about today

00:44 <moon-child> 'in some implementations they're the same function and legacy programs rely on this behavior' sigh

00:45 <moon-child> penalty for what?

00:45 <geist> hah i love how that folly project doesn't even attempt to architecturally isolate

00:45 <mjg> you had to movzwl to dodge it

00:45 <geist> guess facebook dont do arm

00:45 <mjg> geist: given their comments looks like they only do skylake :p

00:45 <geist> yah

00:45 <mjg> moon-child: partial register reads

00:46 <mjg> moon-child: this guy: movw (%rsi), %r8w and this guy: movw -2(%rsi,%rdx), %r9w

00:46 <mjg> moon-child: erm, stores to

00:46 <moon-child> oh right yes

00:46 <mjg> i would not be surprised if this was still a problem

00:46 <mjg> there is so much bullshit to know of it is quite frankly discouraging to try anything

00:47 <geist> yah really helps to only worry about one microarchitecture (skylake-x)

00:47 <mjg> guaranteed fuckUarch suffers a penalty in a stupid corner case you can't hope to even test

00:47 <geist> that's the luxury the big companies can generally rely on

00:47 <geist> then everyone scrambles when some new uarch comes along, but that's job security

00:47 <moon-child> I think at one point it was able to rename the low bits differently, but only when it knows the high bits are zeroed? But then maybe at some point they walked back on that as not worth the effort, since no one actually uses small registers?

00:47 <moon-child> don't remember

00:47 <mjg> there was some bullshit how if buffers differ in a multiple of page size there is a massive penalty

00:48 <mjg> copying forward

00:48 <mjg> like off cpu

00:48 <mjg> buffer addresses

00:48 <moon-child> oh yess cache associativity stuff

00:48 <moon-child> that's a thing pretty much everywhere

00:48 <moon-child> apparently it's too expensive to put even a really dumb hash in front of l1

00:48 <mjg> and ERMS *backwards* is turbo slow

00:48 <heat> rep movsb backwards is not ERMS

00:48 <heat> it's explicitly stated

00:49 <mjg> ye ye

00:49 <mjg> point is

00:49 <mjg> another lol corner case

00:51 <heat> mjg, what's an overlapping copy

00:51 <heat> in memcpy implementation terms, overlapping stores or whatever

00:51 <heat> i don't get it

00:51 <mjg> copying a buffer partially onto itself

00:51 <moon-child> suppose you wanna copy 7 bytes

00:51 <moon-child> you copy the first 4 bytes, and the last 4 bytes

00:51 <moon-child> these have a one byte overlap in the middle, but it doesn't matter, because it's the same byte

00:52 <heat> 1) how 2) why is this faster

00:52 <moon-child> this lets you handle 4, 5, 6, 7, 8 bytes in one path with no branches

00:52 <mjg> right

00:52 <mjg> the overlapping stores suffer a penalty to some extent

00:52 <mjg> but it is cheaper than branching on exact size

00:53 <moon-child> char *dst, *src; size_t n. if (4 <= n && n <= 8) { int lo4 = *(int*)src, hi4 = *(int*)(src+n-4); *(int*)dst = lo4; *(int*)(dst+n-4) = hi4; }
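
The overlapping-copy idea described above can be sketched in portable C roughly as follows. This is only an illustration, not code from bionic, folly, or FreeBSD: the function names `copy_4_to_8` and `copy_2_to_3` are made up here, and fixed-size `memcpy` calls stand in for the unaligned loads and stores (compilers typically lower them to single moves). All loads happen before any store, as in the snippet above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, 4 <= n <= 8, with no branch on the
 * exact size. The first and last 4 bytes are moved; for n < 8 the two
 * moves overlap in the middle, which is harmless because the overlapping
 * bytes carry the same data. */
static void copy_4_to_8(unsigned char *dst, const unsigned char *src, size_t n)
{
    uint32_t lo4, hi4;
    memcpy(&lo4, src, 4);          /* first 4 bytes */
    memcpy(&hi4, src + n - 4, 4);  /* last 4 bytes */
    memcpy(dst, &lo4, 4);
    memcpy(dst + n - 4, &hi4, 4);
}

/* Same trick for 2 <= n <= 3, mirroring the shape of the .L_GE2_LE3 case. */
static void copy_2_to_3(unsigned char *dst, const unsigned char *src, size_t n)
{
    uint16_t lo2, hi2;
    memcpy(&lo2, src, 2);
    memcpy(&hi2, src + n - 2, 2);
    memcpy(dst, &lo2, 2);
    memcpy(dst + n - 2, &hi2, 2);
}
```

A dispatcher would still branch on size classes (2..3, 4..8, and so on), but not on every individual length inside a class.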

00:53 wand has quit [Remote host closed the connection]

00:53 <mjg> the .L_GE2_LE3: has a simple example

00:54 gog has quit [Ping timeout: 255 seconds]

00:55 <mjg> what breaks my heart is that agner fog recommends a routine which does *NOT* do it

00:55 <mjg> 's liek wtf man

00:55 <mjg> that was my first attempt and it sux0red

00:56 <moon-child> don't meet your heroes

00:56 <mjg> :]

00:56 <mjg> you are too young to truly know that mate

00:58 wand has joined #osdev

00:58 <heat> ok I understand the simple overlapping stores thing

00:59 <heat> how do I use that to write a fast GPRs-only memcpy

00:59 <mjg> i'm afraid see bionic memset for sizes up to 16

01:00 <mjg> that covers that range

01:00 <moon-child> https://cgit.freebsd.org/src/tree/lib/libc/amd64/string/memmove.S or so? :)

01:00 <bslsk05> cgit.freebsd.org: memmove.S « string « amd64 « libc « lib - src - FreeBSD source tree

01:00 <mjg> moon-child: funny you mention it, i'm gonna do some fixups over there :]

01:00 <mjg> for example rcx is written to for hysterical raisins

01:01 <heat> what's wrong with your label names

01:01 <mjg> what happened was the original code was rep movs only, that uses rcx for the count

01:01 <mjg> heat: sad necessity for macro usage

01:01 elderK has joined #osdev

01:01 <mjg> heat: i don't like it but did not want to waste more life arguing

01:02 <heat> its like you obfuscated the code lol

01:02 <mjg> the end result is that this is plopped into copyin/copyout

01:02 <mjg> as in the same code shared

01:03 <mjg> if there is a nice way to do it i would be happy to change it

01:03 <heat> namespace your label names?

01:03 <mjg> note it does not inject jumps or whatever to do the actual copy, it's all *in* routines

01:03 <heat> .Lmemmove_0_to_8:

01:03 <mjg> heat: again macroz, names just don't work, but maybe there is a work around for it

01:03 wand has quit [Remote host closed the connection]

01:03 <mjg> mate that's what i started with

01:03 <mjg> :]

01:03 <heat> why don't they work?

01:04 <mjg> i don't remember the error, but it craps out

01:04 wand has joined #osdev

01:04 <mjg> it's been like 5 years since i wrote it

01:04 <moon-child> maybe can do token pasting or something?

01:04 <moon-child> it generates two versions one with erms and one without, so duplicate label names right?

01:04 <mjg> again i don't remember what happened

01:05 <mjg> here is what does matter right now:

01:05 <mjg> 1. rep sucks

01:05 <mjg> 2. there is no speed up from making a loop bigger than 32 bytes per iteration

01:05 <mjg> 3. it's all sad tradeoffs

01:06 <mjg> 4. did i mention rep sucks?
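
For reference, a loop that moves 32 bytes per iteration using only general-purpose registers might look like the C sketch below. It is an assumption-laden illustration, not the FreeBSD memmove.S code: the name `copy_ge32` is hypothetical, the buffers are assumed not to overlap, and n is assumed to be at least 32. The tail reuses the overlapping-store idea by re-copying the final 32 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: forward copy for n >= 32, non-overlapping buffers. */
static void copy_ge32(unsigned char *dst, const unsigned char *src, size_t n)
{
    uint64_t a, b, c, d;
    size_t i = 0;

    /* Main loop: four 8-byte loads, then four 8-byte stores. */
    for (; i + 32 <= n; i += 32) {
        memcpy(&a, src + i, 8);
        memcpy(&b, src + i + 8, 8);
        memcpy(&c, src + i + 16, 8);
        memcpy(&d, src + i + 24, 8);
        memcpy(dst + i, &a, 8);
        memcpy(dst + i + 8, &b, 8);
        memcpy(dst + i + 16, &c, 8);
        memcpy(dst + i + 24, &d, 8);
    }

    /* Tail: copy the last 32 bytes again. This overlaps bytes the loop
     * already wrote, which avoids a byte-by-byte remainder loop. */
    if (i < n) {
        memcpy(&a, src + n - 32, 8);
        memcpy(&b, src + n - 24, 8);
        memcpy(&c, src + n - 16, 8);
        memcpy(&d, src + n - 8, 8);
        memcpy(dst + n - 32, &a, 8);
        memcpy(dst + n - 24, &b, 8);
        memcpy(dst + n - 16, &c, 8);
        memcpy(dst + n - 8, &d, 8);
    }
}
```

Whether 32 bytes per iteration is actually the sweet spot depends on the microarchitecture; the sketch only shows the shape of such a loop.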

01:06 <heat> doesn't rep unsuck on large sizes?

01:06 <mjg> it most definitely does not

01:06 <mjg> except vast majority of all calls are for sizes for which it *does* suck

01:06 <heat> also, what happens if you pad with int3's instead of nops?

01:07 <mjg> https://people.freebsd.org/~mjg/bufsizes.txt

01:07 <mjg> afair some uarchs don't like that

01:07 <mjg> as in they get a penalty

01:08 <heat> i know ret; int3 does, as its used to stop SLS

01:08 <mjg> look mate, it is over 2 am here, i'm heading off

01:09 <mjg> that memmove.S has some tradeoffs which maybe are not completely defensible, and defo has some crap which i'm gonna fix

01:09 <mjg> i would say look at bionic

01:09 <mjg> :]

01:09 <mjg> for < 16 bytes

01:09 <mjg> hue movb(%rsi),%dl

01:09 <mjg> partial fucker not taken care of

01:09 <geist> https://gcc.godbolt.org/z/4q58c1sj7 pretty strange looking at the codegen in there, gcc does some weird stuff right in the middle of that 8 byte loop

01:09 <mjg> gonna movzbl it

01:10 <geist> ie, at .L5

01:10 <geist> it seems to add 8 to the in var (a5) but it recomputes the dest var (a3) by adding the in + some precomputed delta between

01:10 <geist> very strange, the more logical code is to just add 8 to a3

01:11 <geist> but it does do the logical thing and compute the max address and do a bne against that, instead of subtracting one from a counter, like the code is written

01:11 <mjg> heat: maybe i expressed myself poorly. rep is great for "big enough" sizes, but said sizes are rarely used in comparison to sizes for which it is terrible

01:11 <geist> since riscv has nice compare two regs and branch instructions

01:12 <heat> yes

01:12 <mjg> heat: i'm off

01:12 <heat> bye

01:12 <geist> clang compiles this code as written, no weird optimizations.

01:12 <geist> even puts the store right after the load :(

01:12 <heat> i'll either write an ebpf thing or write a memcpy tonight, maybe both

01:12 <heat> ideally none

01:19 <geist> yeeeesssss

01:20 smach has joined #osdev

01:25 smach has quit []

01:38 nyah has quit [Quit: leaving]

01:38 smach has joined #osdev

01:52 mrvn has quit [Ping timeout: 246 seconds]

01:53 <geist> been piddling with it, and FWIW glibc currently has no specially optimized string routines for riscv

01:54 <geist> but the default C version does a fairly good job

01:55 <heat> yes, it doesn't

01:56 <heat> also if you looked at the source you'll see that the generic memcpy has page moving capabilities because of hurd

01:56 <heat> :v

02:01 <heat> haha, fun fact: GAS .align N, 0x90 actually picks smart nop sizes on x86

02:23 <heat> ok memcpy done

02:23 <heat> not hard

02:23 <heat> mjg will probably shoot it full of holes tomorrow but i'm relatively satisfied

02:25 smach has quit [Remote host closed the connection]

02:38 [itchyjunk] has quit [Ping timeout: 248 seconds]

02:43 [itchyjunk] has joined #osdev

02:43 <geist> yeah, i'm doing kinda the same thing

02:44 <geist> have a reasonably tight 8 byte at a time riscv memset working

02:44 <geist> not really any better than the compiler could do with similar C code, but it's nicer to read and commented

02:45 <geist> drat, still gets trounced by glibcs version which unrolls it to 64 byte
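
As a rough picture of the difference between an 8-byte-at-a-time loop and one unrolled to 64 bytes, here is a C sketch. It is not geist's assembly and not glibc's code: `memset_words` is a hypothetical name, and the alignment handling is the simplest thing that works rather than anything tuned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-at-a-time memset with a 64-byte unrolled main loop. */
static void *memset_words(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    const unsigned char b = (unsigned char)c;
    /* Replicate the byte into all 8 lanes of a 64-bit word. */
    const uint64_t v = (uint64_t)b * UINT64_C(0x0101010101010101);

    /* Head: byte stores until p is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 7) != 0) {
        *p++ = b;
        n--;
    }

    /* Main loop: eight 8-byte stores (64 bytes) per iteration. */
    while (n >= 64) {
        memcpy(p, &v, 8);
        memcpy(p + 8, &v, 8);
        memcpy(p + 16, &v, 8);
        memcpy(p + 24, &v, 8);
        memcpy(p + 32, &v, 8);
        memcpy(p + 40, &v, 8);
        memcpy(p + 48, &v, 8);
        memcpy(p + 56, &v, 8);
        p += 64;
        n -= 64;
    }

    /* Remaining whole words, then trailing bytes. */
    while (n >= 8) {
        memcpy(p, &v, 8);
        p += 8;
        n -= 8;
    }
    while (n > 0) {
        *p++ = b;
        n--;
    }
    return dst;
}
```

The unrolled loop pays the pointer increment and branch once per 64 bytes instead of once per 8, which is the kind of overhead that matters on a narrow in-order core.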

02:47 <heat> i should check if 64 byte makes a difference here

03:00 smeso has quit [Quit: smeso]

03:05 <heat> no, it doesnt

03:10 smeso has joined #osdev

03:23 zxrom has quit [Quit: Leaving]

03:25 zxrom has joined #osdev

03:30 <moon-child> y'all have riscv hardware?

03:30 <geist> i do, now

03:33 <geist> woot. my new asm memset now matches or beats glibc

03:34 <heat> better share it comrade

03:34 <heat> give us our new memcpy

03:34 <geist> yah hang on a sec

03:34 <geist> memcpy is next, but probably wont get to that tonight

03:34 <geist> memset is to just warm up, get used to writing riscv asm in large amounts. there's kinda a zen to it

03:35 <heat> yeah im not really comfortable doing it

03:35 <heat> for any risc really

03:38 <geist> https://gist.github.com/travisg/7c6b5494990162241a8f590fb2cb06ba

03:38 <bslsk05> gist.github.com: riscv memset · GitHub

03:38 <geist> may still be bugs, but it passes my test harness

03:38 wand has quit [Ping timeout: 255 seconds]

03:39 <heat> // TODO: see if multiplying by 0x1010101010101010 is faster

03:39 <heat> is it?

03:39 <geist> dunno!

03:39 <geist> the hard part is getting the constant into the register, which requires a load

03:40 <geist> but when i ran the expand logic into godbolt gcc just does the mul

03:40 <geist> https://gcc.godbolt.org/z/r3Wa6ov4T

03:40 <heat> yeah

03:41 <geist> clang does some even weirder shit: https://gcc.godbolt.org/z/o46vsv59W

03:41 <heat> haha that's genius

03:41 <geist> i think it's actually doing the shift and add trick to synthesize the constant, and then mul it
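
The two ways of expanding a fill byte into a full 64-bit word discussed here can be written in C as below. The function names are made up for the comparison. Note that the constant which replicates a byte into every lane is 0x0101010101010101 (a 1 in each byte position). Which variant is faster depends on the core's multiplier latency and on how cheaply the constant can be materialised, so this sketch makes no claim either way.

```c
#include <stdint.h>

/* Shift-and-or expansion: no constant to load, three dependent steps. */
static uint64_t broadcast_shift(uint8_t c)
{
    uint64_t v = c;
    v |= v << 8;   /* 2 copies */
    v |= v << 16;  /* 4 copies */
    v |= v << 32;  /* 8 copies */
    return v;
}

/* Multiply expansion: one mul, but the 64-bit constant must first be
 * loaded or synthesised in a register. */
static uint64_t broadcast_mul(uint8_t c)
{
    return (uint64_t)c * UINT64_C(0x0101010101010101);
}
```

Both functions return the same value for every input byte, so a compiler is free to turn one form into the other.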

03:44 <geist> there are a few other tricks i've seen the compiler do to rewrite 'store + add base reg + bne' to 'add base reg + store - 8 + bne'

03:44 <geist> though that's kinda debatable, because there still is a dep between incrementing the base reg and something

03:46 <heat> does any of that matter on your riscv cpu?

03:46 <geist> which part?

03:46 <heat> dependencies

03:47 <heat> can it do any ooo?

03:47 <geist> probably not, though the u74 is at least dual issue so i think there are some deps

03:51 wand has joined #osdev

03:51 <geist> but there's definitely a huge win to unrolling the inner loop on this thing. to the tune of 10GB/sec vs about 3.5

03:53 [itchyjunk] has quit [Remote host closed the connection]

03:53 <geist> though that's only when in the L2 cache range (<2MB). once you get past that it settles in to what is apparently bus speed, which seems to be around 800MB/sec

03:53 daily has joined #osdev

03:55 <heat> that's pretty fast

04:01 daily has left #osdev [Leaving]

04:02 <geist> yeah this is an actually kinda reasonable cpu. it seems to more or less perform as i expect for this class of core.

04:37 vdamewood has quit [Remote host closed the connection]

04:38 vdamewood has joined #osdev

05:06 gxt__ has quit [Ping timeout: 255 seconds]

05:08 gxt__ has joined #osdev

05:13 heat has quit [Ping timeout: 248 seconds]

05:19 <sham1> mrvn: RE: commenting on adding a tagged integer. I'd expect there to be a macro or something for that. That is, to make a C integer into an OCaml one

05:28 Vercas has quit [Ping timeout: 255 seconds]

05:28 Vercas has joined #osdev

05:32 foudfou_ has joined #osdev

05:34 foudfou has quit [Remote host closed the connection]

05:37 foudfou_ has quit [Remote host closed the connection]

05:38 foudfou has joined #osdev

05:51 elderK has quit [Quit: Connection closed for inactivity]

06:09 arminweigl_ has joined #osdev

06:10 arminweigl has quit [Ping timeout: 246 seconds]

06:10 arminweigl_ is now known as arminweigl

06:51 vdamewood has quit [Read error: Connection reset by peer]

06:52 vdamewood has joined #osdev

06:57 dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]

07:03 GeDaMo has joined #osdev

07:12 Starfoxxes has quit [Ping timeout: 246 seconds]

07:40 zzo38 has joined #osdev

07:40 <zzo38> :Here I wrote some ideas I have about operating system design: http://zzo38computer.org/fossil/osdesign.ui/dir?ci=trunk&name=design (Now you can complain about it, and/or other comment)

07:40 <bslsk05> zzo38computer.org: Unnamed Fossil Project: File List

07:47 bgs has joined #osdev

08:26 <kof123> "does shoving through the maximum amount of data through the memory subsystem per clock negatively affect other things" anyone can make a faster cpu, the trick is to make a fast system -- cow tools guy ^H^H^H^H^H^H^H cray

08:28 <AmyMalik> the trick to making a fast CPU useful is to keep the fast CPU fed

08:28 <AmyMalik> bandwidth, latency, and actually having tasks you need done

08:43 <zid> hence cpu caches, hence speculative execution, hence, hence

08:45 slidercrank has joined #osdev

08:48 <sham1> Hence Spectre

08:48 <sham1> And hence a nice James Bond movie

08:48 <zid> yes, the pinnacle of cpu design, spectre

08:48 <zid> The true end goal

08:49 <GeDaMo> I'm just reading about AMD's 3D cache https://www.tomshardware.com/news/amd-unveils-more-ryzen-3d-packaging-and-v-cache-details-at-hot-chips

08:49 <bslsk05> www.tomshardware.com: AMD Unveils More Ryzen 3D Packaging and V-Cache Details at Hot Chips | Tom's Hardware

08:49 <sham1> Well, you go so fast as to break security. At what point can we start saying that CPUs are fast enough

08:50 <sham1> We need both horizontal and vertical scaling

08:50 <zid> GeDaMo: You haven't bought one yet?

08:50 <GeDaMo> You know I haven't :|

08:50 <zid> Annoyingly AMD have done that thing where the most useful config is the cheapest model, so gets the worst silicon

08:50 <zid> my friend just did

08:51 <zid> so we've been playing with it

08:52 craigo has joined #osdev

08:52 <GeDaMo> Is fast? :P

08:52 craigo has quit [Client Quit]

08:53 <zid> yep, tis fast

08:53 craigo has joined #osdev

09:39 <netbsduser> zzo38: it sounds very mainframey

09:40 gog has joined #osdev

09:41 <netbsduser> the record-based files especially, and echoes of IBM i in the object stuff

09:41 <gog> hi

09:42 <netbsduser> gog: well come

09:43 <lav> hii

09:44 * gog patpatpatpat lav

09:44 <lav> ee

09:44 * lav purrs

09:48 <zid> I'm swearing off unix for being too woke, I did ls / and what do I see? Libs.

09:53 <lav> It's a little-known fact that Qt actually stands for Queer & transgender

09:53 <zid> and kde is.. kaleidoscope of dicks everywhere?

09:54 <lav> yup

09:57 <gog> i'm a qt qt

09:57 <gog> fr fr

09:57 <lav> agreed

09:58 <Ermine> hi gog, may I pet you?

09:59 <gog> yes

09:59 * Ermine pets gog

10:00 * gog prr

10:03 <zid> gog: how sure are we that you're not just a sussy cis sissy?

10:04 danilogondolfo has joined #osdev

10:05 <gog> you don't need to be sure of anything breh

10:49 nyah has joined #osdev

11:06 frkzoid has quit [Ping timeout: 255 seconds]

11:08 mavhq has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]

11:09 mavhq has joined #osdev

11:47 jimbzy has quit [Ping timeout: 260 seconds]

11:56 foudfou has quit [Ping timeout: 255 seconds]

11:57 foudfou has joined #osdev

12:01 Vercas8 has joined #osdev

12:04 Vercas has quit [Ping timeout: 255 seconds]

12:04 Vercas8 is now known as Vercas

12:14 dennis95 has joined #osdev

12:20 Starfoxxes has joined #osdev

12:22 foudfou has quit [Remote host closed the connection]

12:22 foudfou has joined #osdev

12:27 <netbsduser> fuse seems to be completely antithetical to a sane VFS

12:27 Turn_Left has joined #osdev

12:28 <netbsduser> there seems to be no separation of the file from the vnode layer, so you end up with the most outrageous requirements, like requiring to pass FUSE_RELEASE the same flags that you FUSE_OPEN'd something with

12:30 <netbsduser> i will just add two opaque fields to kernel file descriptors + pass the kernel fd to vnode ops purely for the sake of this monstrosity, since (fuse being undocumented) i don't dare try to figure out how to route around the nuttiness

12:31 Vercas has quit [Ping timeout: 255 seconds]

12:36 <netbsduser> another bit of stupidity, the root inode number is especially specified to be FUSE_ROOT_NODE, but (at least with virtiofs) its `.` and `..` entries are not! this would play havoc with the vnode cache

12:43 Vercas has joined #osdev

12:57 [itchyjunk] has joined #osdev

13:00 Vercas has quit [Ping timeout: 255 seconds]

13:33 Vercas has joined #osdev

13:39 Vercas has quit [Ping timeout: 255 seconds]

13:53 Turn_Left has quit [Ping timeout: 256 seconds]

13:57 Vercas has joined #osdev

14:04 wand has quit [Remote host closed the connection]

14:09 wand has joined #osdev

14:11 nur has joined #osdev

14:12 mrvn has joined #osdev

14:58 Beato has quit [Quit: I have been discovered!]

15:08 theboringkid has joined #osdev

15:11 Vercas has quit [Quit: Ping timeout (120 seconds)]

15:11 Vercas has joined #osdev

15:13 eroux has quit [Ping timeout: 268 seconds]

15:17 eroux has joined #osdev

15:19 danilogondolfo has quit [Remote host closed the connection]

15:22 <netbsduser> virtiofs is half-baked

15:22 <mrvn> so you can write files but not read them to check if it actually worked?

15:24 <netbsduser> i've got a root dir with a subdir "test" in it. result of FUSE_LOOKUP on the root dir for `test` = fuse node number 3. result of FUSE_LOOKUP on that folder for `.` is fuse node number 13968. oh, and while the root dir's fuse-recognised number is actually `1`, FUSE_LOOKUP `..` in 'test' = 13955

15:27 heat has joined #osdev

15:30 <mrvn> is that maybe the inode of the mountpoint?

15:31 <mrvn> and what does stat on / say?

15:32 Brnocrist has quit [Ping timeout: 256 seconds]

15:32 <mrvn> Does anyone actually use "." and ".." from the FS and not generate them internally?

15:32 <netbsduser> those are indeed the inode numbers of the underlying mountpoint, virtiofs lets them appear, so it seems you need to treat fuse inode numbers and the inode numbers from getattr or lookup of `.` or `..` as fundamentally different

15:32 <mrvn> sounds like a bug in virtiofs though.

15:33 <netbsduser> mrvn: i used `.` in a failed attempt to reduce the effort to map fuse/virtiofs semantics to my vfs

15:34 <mrvn> If virtiofs doesn't map . and .. properly then I would rather have it not contain them in readdir at all.

15:34 <netbsduser> i can abolish my use of `.` but i really do need `..` though, and so i think most would, since it's not as though i carry a `parent directory` pointer in vnodes

15:35 <netbsduser> maybe linux does, they are known to be "different" in this area

15:35 <mrvn> netbsduser: without parent directory pointer how will you implement "mount --bind"?

15:35 <mrvn> or get from the root of a mounted filesystem (e.g. /home) to the parent directory?

15:36 Beato has joined #osdev

15:37 <netbsduser> mrvn: however the BSDs do nullfs, and by checking `vnodecovered` field of the vnode and then doing lookup `..` on that vnode, respectively

15:38 <mrvn> you can skip the parent pointer if you add a generated ".." entry into the vnode of every dir. But that's just another way to store the parent pointer.

15:40 <netbsduser> i only store details like that in the name lookup cache (well, i would if i *had* a name lookup cache, i am speaking aspirationally)

15:40 <netbsduser> i fear virtio-fs might be completely incompatible with anything other than the Linux VFS if i can't figure out some workaround that at least lets me get `..` lookup to return the right thing

15:40 <mrvn> so maybe implementing the name lookup cache will fix the problem.

15:43 <netbsduser> it could get me *somewhere*, but i would have to support pinning entries in the cache. the problem remains of people `mv`'ing a directory on the host elsewhere; then the stale entry is stuck, and i can never get the right entry because FUSE_LOOKUP `..` will give me the unusable host fs inode number instead of the fuse inode number

15:44 <mrvn> sure. But you are already stuck with bind mounts and crossing filesystem barriers in general. ".." simply isn't unique.

15:45 <mrvn> Now if you don't want bind mounts then you still need to pin normal mountpoints so you can fix the ".." at the filesystem boundary.

15:45 <mrvn> Note: there is also mount --move in linux.

15:47 <mjg> heat: don't be so easy on yourself

15:48 <mrvn> netbsduser: what happens with virtiofs when something moves a directory on the host?

15:49 <mrvn> Do you end up with a "stale handle" error like with NFS?

15:49 <netbsduser> i am not convinced that it is necessary for bind mounts, at least not if they are equivalent to bsd nullfs (which mounts a subtree of the system fs into another point)

15:50 <netbsduser> and for the case of parent directory of mountpoints, that's handled especially by the vfs lookup function checking for a `vnodecovered` pointer in a vnode (meaning it's the root of an FS and occluding a vnode of another FS; then you lookup `..` of the vnodecovered to get the true parent)
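
A rough C sketch of the `vnodecovered` scheme netbsduser describes here. The struct layout and the names `fs_dotdot` and `lookup_dotdot` are invented for illustration and are not taken from any BSD kernel or from netbsduser's code; the only idea borrowed from the log is that the root of a mounted FS resolves `..` by doing the lookup on the vnode it covers.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified vnode (assumption, not a real kernel type). */
struct vnode {
    struct vnode *fs_dotdot;    /* what the filesystem itself returns for ".." */
    struct vnode *vnodecovered; /* non-NULL only on the root of a mounted FS */
};

/* Resolve ".." the way the VFS lookup function in the log is described. */
struct vnode *lookup_dotdot(struct vnode *dir)
{
    /* Step down through (possibly stacked) mounts onto the occluded directory. */
    while (dir->vnodecovered != NULL)
        dir = dir->vnodecovered;
    /* Now ask the filesystem that owns that directory for its parent. */
    return dir->fs_dotdot;
}
```

In this sketch the system root simply points `fs_dotdot` at itself, so `..` of `/` stays at `/`.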

15:50 <mrvn> netbsduser: you can bind mount /my/little/subdir /bla. Then ".." of /my/little/subdir is /my/little under /my/little/subdir and / under /bla. But both would be the same vnode, right?

15:50 <mrvn> Maybe vnodecovered also covers the bind mount case.

15:51 <sham1> This is why bind mounting is a bad idea

15:51 <sham1> You need to keep track of the full path you used to get to a place in order to properly do ..

15:51 <mrvn> I have no idea what code you are copying. I just know you need the .. stored explicitly for mountpoints somewhere.

15:51 <netbsduser> mrvn: on moving a directory on the host, i have no idea, it probably segfaults judging by my current experience with virtiofsd (it actually crashes every time qemu exits)

15:51 <sham1> Same with symlinks, which is why plan9 removed them

15:52 <mrvn> sham1: indeed. You need the full name lookup cache keeping the full path alive so you can follow the parent pointers.

15:52 xenos1984 has quit [Read error: Connection reset by peer]

15:53 <mrvn> And multiple cached names can point to the same inode.

15:53 <mrvn> .oO(But you have that with hardlinks already)

15:54 <mrvn> Another special case to keep in mind is chroot, or containers with a new FS namespace.

15:54 xenos1984 has joined #osdev

15:55 <sham1> Mm. I know that at least Serenity solves this by having essentially the file description remember the path it was accessed through, which is then cached so often used components are shared, and things like openat can then use these to do relative actions

15:55 <netbsduser> it appears that nullfs on netbsd at least creates new vnodes on-demand

15:56 <netbsduser> now dragonflybsd i remember boasts they need no such thing

15:57 theboringkid has quit [Quit: Bye]

15:58 <mrvn> is that something to boast about? It's not like it matters. Allocating a few vnodes is peanuts.

15:59 <sham1> We should just associate UUIDs with files

15:59 <sham1> Or GUIDs

15:59 <mrvn> too short.

16:00 <sham1> 128 bits is too short?

16:01 <sham1> Okay, then it can just be doubled. 256 bits

16:01 <mrvn> sham1: every directory you bind mount creates 2 paths to the file. So you need 1 bit to differentiate them. Do that 128 times and you have no bits left to specify the file in the dir at all.

16:02 <sham1> Ah

16:02 <sham1> So how many bits would a vnode need then

16:02 <mrvn> variable. it needs a parent pointer.

16:03 <mrvn> or you need a lookup from vnode ID to path

16:04 <mrvn> NFS runs into this problem because it's stateless. The client can't just throw some ID on the server because the server might not have the ID cached anymore and can't find the path. NFS handles do some magic to encode the path in some way but even that doesn't always work.

16:09 <netbsduser> my plan for now: fusefs_nodes will store their parent's node-id and that will be used to service `..` lookups
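
A minimal sketch of that plan in C, assuming a per-node struct that remembers the fuse node-id of the directory it was looked up from. `struct fusefs_node`, `fusefs_node_init`, `fusefs_lookup_dots` and `SKETCH_ROOT_NODEID` are hypothetical names, not netbsduser's code or any FUSE API; the value 1 for the root comes from the numbers quoted earlier in the log.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_ROOT_NODEID 1u /* the root's fuse node number, per the log */

/* Hypothetical per-node bookkeeping kept by the guest's fuse client. */
struct fusefs_node {
    uint64_t nodeid;        /* fuse node-id returned by FUSE_LOOKUP */
    uint64_t parent_nodeid; /* node-id of the directory it was looked up in */
};

/* Record the parent when a node is created; the root is its own parent. */
void fusefs_node_init(struct fusefs_node *n, uint64_t nodeid,
                      const struct fusefs_node *parent)
{
    n->nodeid = nodeid;
    n->parent_nodeid = parent ? parent->nodeid : nodeid;
}

/*
 * Answer "." and ".." locally instead of trusting the server's reply.
 * Returns 1 and fills *out if handled; returns 0 if the name is ordinary
 * and a real FUSE_LOOKUP request would still be needed.
 */
int fusefs_lookup_dots(const struct fusefs_node *dir, const char *name,
                       uint64_t *out)
{
    if (strcmp(name, ".") == 0) {
        *out = dir->nodeid;
        return 1;
    }
    if (strcmp(name, "..") == 0) {
        *out = dir->parent_nodeid;
        return 1;
    }
    return 0;
}
```

As noted in the discussion, the stored parent goes stale if the host moves the directory; the sketch does nothing about that.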

16:11 <netbsduser> all this stuff falls apart in the presence of the host moving the directory, but it sounds as though it falls apart rather nastily on linux too, so such is life

16:13 <mrvn> that's how all the union FSes in the kernel fall apart too, except a segfault in the kernel is worse. Only unionfs-fuse handles FS changes on the underlying FSes properly.

16:13 mctpyt has joined #osdev

16:13 <mrvn> (well, without crashing, nothing you can do to fix it)

16:17 <netbsduser> i wonder on a related thing, how fuse deals with nfs, and this virtiofs fuse setup in particular

16:23 <mrvn> netbsduser: the fuse client has handles that are attached to each dentry and if something on the server changes you get an error about stale handles.

16:24 <mrvn> fuse does nothing for NFS, doesn't even see NFS, only the vfs.

16:24 <mrvn> s/the fuse client/the nfs client/

16:25 <mrvn> remember: fuse filesystems are just user processes that access the filesystem through normal syscalls.

16:27 mctpyt has quit [Ping timeout: 264 seconds]

16:32 nur has quit [Quit: Leaving]

16:32 <zid> moon-child: it was you messing around with pointer tagging doubles and pointers together right?

16:34 Brnocrist has joined #osdev

16:35 nur has joined #osdev

16:46 jimbzy has joined #osdev

16:48 <mrvn> What are pointer tagging doubles?

16:59 foudfou_ has joined #osdev

16:59 <mjg> heat: i'm gonna reap myself a new one for that memcpy i wrote :S

17:00 <mjg> heat: looks like we are both going to get chewed out on this one

17:00 foudfou has quit [Quit: Bye]

17:01 <heat> mjg, does that memcpy suck?

17:02 <heat> i took it as an inspiration for mine

17:02 <mjg> suck -- no, but it has stupid perf bugs

17:02 <mjg> for example i did not take care of partial regs for 1 byte copy forward

17:02 <mjg> and for copies <= 4 bytes backward

17:03 <mrvn> mjg: memcpy does not handle overlapping

17:03 <mjg> mrvn: age old

17:03 <heat> ok question: 1) should you interleave loads and stores? 2) isn't this chain of branching a bad idea?

17:03 <mjg> look i need this for *memmove* as well, and there is overlap of code

17:03 <mrvn> true

17:04 <mjg> so my *memmove* has the above problem

17:04 <mrvn> heat: isn't that obsoleted by the cpu pipeline and out-of-order execution?

17:04 <sham1> So you'd use memmove for that overlapping code, obviously

17:04 <mrvn> (1)

17:04 <mjg> heat: normally you do all the loads first and stores later, then branch on whatever

17:04 <heat> like the 16 byte branch does e.g cmp $16, %len; jb .L8byte

17:04 <mjg> just show me your code

17:04 <heat> yeah

17:05 <heat> btw that lea trick is pretty cute

17:05 <mjg> i also note reg allocation is a little questionable, but i had hysterical reasons

17:05 <mjg> you would learn lea trick if you checked disasm of any real code mate

17:05 <mjg> 's how i did it :p

17:05 <mjg> i also *suspect* the code which aligns to 16 bytes could be much better

17:06 <mjg> i'll try to express it in C and see what clang comes up with

17:06 <heat> https://gist.github.com/heatd/fe2c9a2d3a4ef04616d481ee6660c722

17:06 <bslsk05> gist.github.com: memcpy-gpr.S · GitHub

17:07 <mrvn> heat: you aren't checking for alignment.

17:07 <mjg> movb (%rsi), %cl

17:07 <mjg> movzbl

17:07 <mjg> that's one of the bugs

17:08 <mjg> movw (%rsi), %cx

17:08 <mjg> movzwl

17:08 <heat> aha riiight

17:08 <heat> let me guess, some uarchs have false dependencies on the rest of rcx?

17:09 <mjg> i don't remember what happens there, i do remember i did measure a slowdown from not doing it

17:09 <mjg> on haswell et al

17:09 <mjg> it may be kabylake no longer has the problem

17:09 <mjg> .L1_byte: missing ALIGN_TEXT?

17:09 <mjg> .Lerms: mov %rdx, %rcx rep movsb -- normally you want to align the buf at least to 16

17:09 <heat> I guessed it's stupid to have an ALIGN_TEXT there because it's a single byte memcpy

17:10 <mjg> then handle 1 byte early instead

17:10 <heat> in my logic, memcpy(1) is already stupidly pessimal anyway, no real reason to pad it early

17:10 <mjg> so there is one fundamental tradeoff in that code, which is not 100% defensible

17:10 <heat> s/early/at all/

17:11 <mjg> you can either roll with some branches upfront and jump once to the target code

17:11 <mjg> or you can have a cascade if you will, like in the code above

17:11 <heat> yes, that's part of my "to improve" ideas

17:11 <heat> bionic memmove.S branches upfront

17:11 <mjg> so the idea behind it was that sizes 32-256 consist of majority of the calls

17:11 <mjg> so it makes sense to make it the fastest

17:11 <mjg> hence fewer branches to get there

17:11 <mjg> in my code you slide into it

17:12 <mjg> as in no jumps to start

17:12 <heat> I added a branch to 16 because I noticed in your histogram that most memcpies were 16 byte long

17:12 <mjg> you added enough branches to perhaps defeat that

17:12 <heat> did I?

17:12 <mjg> again, this one is *super* murky

17:12 <mjg> i'm gonna do another take today or tomorrow

17:12 <heat> hmm ok

17:13 <mjg> generate more datasets, from freebsd and linux

17:13 <mjg> and then measure total time to execute them with both variants

17:13 <mjg> lower total time wins

17:13 <mjg> by dataset i mean collect all sizes along with the number of times they showed up

17:13 <heat> yes

17:14 <mjg> randomize the order

17:14 <mjg> and we will see on a bunch of cpus

17:14 <mjg> no claiming perfect, but should be good enough
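
The dataset preparation mjg outlines (collect sizes with their counts, then randomize the order) could look roughly like this in C. This is a sketch under assumptions: `struct size_count`, `expand_histogram` and `shuffle_sizes` are hypothetical names, the shuffle is a plain Fisher-Yates driven by a xorshift64 generator picked for brevity, and nothing here is mjg's actual tool.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One histogram bucket: a call size and how many times it was observed. */
struct size_count {
    size_t size;
    size_t count;
};

/* Expand (size, count) pairs into a flat list of call sizes; caller frees. */
size_t *expand_histogram(const struct size_count *h, size_t nh, size_t *total_out)
{
    size_t total = 0, k = 0;
    for (size_t i = 0; i < nh; i++)
        total += h[i].count;
    size_t *calls = malloc((total ? total : 1) * sizeof *calls);
    if (calls == NULL)
        return NULL;
    for (size_t i = 0; i < nh; i++)
        for (size_t j = 0; j < h[i].count; j++)
            calls[k++] = h[i].size;
    *total_out = total;
    return calls;
}

/* Small PRNG (xorshift64); the state must never be zero. */
static uint64_t xorshift64(uint64_t *s)
{
    uint64_t x = *s;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *s = x;
}

/* Fisher-Yates shuffle; the modulo introduces a slight bias, fine for a sketch. */
void shuffle_sizes(size_t *a, size_t n, uint64_t seed)
{
    uint64_t s = seed ? seed : 1;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)(xorshift64(&s) % i);
        size_t tmp = a[i - 1];
        a[i - 1] = a[j];
        a[j] = tmp;
    }
}
```

The shuffled array would then be replayed against each memset or memcpy variant while timing the whole run, with the lower total time winning, as described above.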

17:14 <heat> is memmove just doing this but backwards?

17:14 <mjg> yes

17:15 <mjg> i needed to implement it because 'bcopy' which was the goto way

17:15 <mjg> used to be memcpy

17:15 <mjg> and then a geezer made it into memmove

17:15 <mjg> and now i'm screwed

17:15 <heat> is there a penalty to always copying backwards?

17:16 <heat> having two versions of the same code that do forwards/backwards sounds depressing

17:16 <mjg> i don't think you can get away with that for arbitrarily stupid args

17:16 <heat> so I could have memcpy doing forwards and memmove doing backwards, that's my idea

17:16 <heat> hmmm ok

17:17 <mjg> anyhow i plan to sort out memset first

17:17 <mjg> same general issue + same idea what to do

17:18 <mjg> btw that 256 is lowballing it

17:18 <mjg> my haswell does better

17:19 <heat> is it? I think I tried higher and saw really mixed results

17:19 <mjg> there may be lullers on your arch which make it into a problem

17:19 <mjg> uarch

17:19 <mjg> again, fucking cpus man

17:20 dennis95 has quit [Quit: Leaving]

17:20 <mjg> key though: rep movs is quite pessimal for short sizes, what you do about it is for the most part tradeoff city

17:21 <heat> do the same principles apply to SIMD memcpy too?

17:21 <zid> mjg do you say other words

17:21 <heat> except maybe SSE may have issues storing to unaligned addresses

17:21 <heat> I know AVX doesn't

17:21 <mjg> zid: my english is limited, i only got 'english for chronic complainers about performance' in school

17:21 <zid> makes sense

17:22 <zid> are you much more personable in polish

17:22 <mjg> of course, i'm a very well read person

17:22 <heat> peszimal

17:22 <mjg> heat: i don't know the realities for simd which i could 100% defend

17:22 <mjg> heat: i could give you a stackoverflow-quality answer

17:22 <heat> do it

17:23 <mjg> you wanna do overlapping stores as soon as you can

17:23 <mjg> but watch out how much you do them for one set

17:23 <mjg> [reality check: sse2 /sucks/ when you do it for certain sizes]

17:23 <mjg> i have 0 real data for avx

17:23 <mjg> i intentionally did not check glibc memcpy so that i can implement my own if needed

17:24 <mjg> but preferably i would steal one with an adequate license

17:24 <mjg> it was quite a bummer to find how much bionic sucks here :/

17:24 <heat> folly has an avx2 one

17:25 <mjg> yes i linked it

17:25 <heat> yes i know

17:25 <heat> just saying, it's an option

17:25 <mjg> the problem is apache license would need some finessing

17:25 <mjg> also i did not bench it myself

17:25 <mjg> also see the automemcpy paper

17:25 <mjg> for all i know i can generate an ok memcpy without handrolling any asm

17:25 <mjg> which would be the bestest

17:25 foudfou_ has quit [Remote host closed the connection]

17:26 foudfou has joined #osdev

17:27 <heat> linux memcpy_orig seems quite ok

17:28 <heat> could use the erms bit for lengthy copies but it seems similar to what we both have

17:28 <mjg> oh he, rolls with that jmp chain thing

17:29 <mjg> heh even

17:29 <mjg> /*

17:29 <mjg> * We check whether memory false dependence could occur,

17:29 <mjg> * then jump to corresponding copy mode.

17:29 <mjg> */

17:29 <mjg> cmp %dil, %sil

17:29 <mjg> jl .Lcopy_backward

17:29 <mjg> i don't know about this bit

17:29 <mjg> back then i talked to a big wig at intel about memcpy

17:30 <mjg> he told me to do address comparison and then do rep mov forwards or backwards

17:30 <mjg> et voila

17:30 <heat> lol

17:30 <mjg> no seriously

17:31 <mjg> the fact that their own optimization manual recommends against it

17:31 <mjg> did not faze him

17:31 <heat> against what?

17:31 <mjg> rep for short copies

17:31 <heat> the optimization manual seems to hail rep movsb as the best shit ever

17:31 <mjg> also note "fast short rep mov" making an appearance in recent years further proves it is crap

17:32 <mrvn> For a memcpy <= 32 byte isn't a simple 1byte copy loop faster than branching for 16, 8, 4, 2, 1 bytes?

17:32 <mjg> mrvn: nope. i had various experiments 5 years ago, including 8 byte loops etc

17:32 <heat> oh how does fsrm bench with this shit?

17:32 <mjg> it was all slower than overlapping stores
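
For readers unfamiliar with the term, "overlapping stores" can be illustrated in C for the 8 to 16 byte case: instead of looping or branching down through 8, 4, 2 and 1 byte steps, load the first 8 and the last 8 bytes and store both, letting them overlap in the middle. This is a sketch of the general technique, not mjg's or heat's routine; `copy_8_to_16` is a hypothetical name, and the fixed-size `memcpy` calls are only the portable C way to express unaligned 8-byte loads and stores, which compilers typically turn into single moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 16, with two possibly overlapping 8-byte moves. */
void copy_8_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;

    assert(n >= 8 && n <= 16);

    /* Do both loads first, then both stores. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const unsigned char *)src + n - 8, 8);
    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + n - 8, &tail, 8);
}
```

For n == 16 the two halves do not overlap at all; for n == 8 they coincide completely. There is no loop and no size-dependent branch inside the range.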

17:33 <mrvn> mjg: so a loop copying 8 byte that runs maybe twice is better? That's at least one branch misprediction.

17:33 <mjg> heat: afair fsrm does not help rep *stos*, it does help rep *movs*, but it is still slower for sizes < 64 or so

17:33 <mrvn> mjg: same for any remaining 4 byte and again 2 byte.

17:33 <mjg> mrvn: as noted previously i'm about to generate a good real-world-based dataset to memcpy and memset

17:33 <mjg> i'll hack up the above to the test mix

17:34 <heat> do you think I'll get shot if I try to patch memcpy to Be Good(tm)?

17:34 <mrvn> mjg: yes please. How many cpus can you test?

17:34 <mjg> mrvn: westmere, sandy, haswell, skylake, coffy lake, ice lake

17:34 <mjg> and probably some amd if i can get arsed

17:34 <mrvn> mjg: also will you benchmark real code? Replace the memcpy in libc and see what that says.

17:35 <mjg> see above for the description of what i intend to do

17:35 <mjg> i can easily get hands on more intels but i think that's enough

17:36 <mjg> would also be funny to bench no microcode updates vs fresh

17:36 <mjg> but i don't know if i can be arsed to get the former

17:36 <mrvn> I wish there were a way to mark different entry points into a function for the compiler. Like: enter here if src is 16 byte aligned, enter here if dst is 16 byte aligned, enter here if size > 64, ...

17:36 <mjg> heat: where? linux? it is a touchy subject so i would not

17:37 <mjg> heat: looks like the L guy and Boris SOmething are going to sort it out in a manner good enough(tm)

17:37 <mrvn> Sometimes I miss templates + constraints in C

17:37 <mjg> heat: for example i'm not gonna ship my memset over there :]

17:38 <heat> why is it a touchy subject?

17:38 <mjg> read the thread

17:38 <heat> you probably mean borislav petkov btw

17:38 <heat> mkya

17:38 <heat> mkay*

17:39 <mjg> yea

17:39 <mjg> also i guarantee there is something bad i don't even know about, which does affect the routine as coded by me

17:39 <mjg> and which some greybeard will point out as PESSIMAL

17:40 <mjg> while i welcome that, that's not the setting where i do

17:40 <mjg> :p

17:40 <heat> btw linux memmove is probably superior to memcpy ATM

17:40 <mjg> look i'm done chatting about bs, time to do some data collection

17:42 <heat> "And more would be dangerous because both Intel and AMD have errata with rep movsq > 4GB" haha

17:44 <mrvn> WTF? I have to rep movsq in blocks of 4GB? hehehehe

17:47 <mrvn> That's like DBcc on m68k only using the lower 16-bit of the counter register.

17:48 <mrvn> Can't do a full 32bit ripple carry addition, comparison and branch in the wanted cycle time

17:57 <mjg> now should i code the program in RUST

17:58 <mjg> MOTHERF^Wi don't think

18:00 frkazoid333 has joined #osdev

18:01 xenos1984 has quit [Ping timeout: 256 seconds]

18:01 <mjg> NAME shuf - generate random permutations

18:01 <mjg> check this out

18:01 xenos1984 has joined #osdev

18:03 heat has quit [Remote host closed the connection]

18:03 heat has joined #osdev

18:11 xenos1984 has quit [Ping timeout: 260 seconds]

18:18 <heat> mjg, CHECK WHAT OUT

18:18 <heat> YOURE MAKING ME MAD

18:18 <mjg> OH

18:19 <mjg> STFU

18:19 <mjg> i'm saying there is a ready-to-use tool to randomize the numbers

18:22 <zid> I can generate permutations in O(n) in both time and memory, in 3 lines of code

18:23 <zid> good enough for me

18:25 xenos1984 has joined #osdev

18:26 <zid> (LFSR with a cycle length the same as the sequence length can do O(1))
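
One way to read zid's LFSR remark, sketched in C: a maximal-length LFSR visits every nonzero state exactly once per period, so stepping it and skipping values that fall outside the wanted range yields each index once without storing the permutation. The 16-bit Galois form with mask 0xB400 is a commonly published maximal-length example; `lfsr16_next` and `lfsr_next_index` are hypothetical names and this is not zid's code.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a 16-bit Galois LFSR (mask 0xB400, period 65535). */
uint16_t lfsr16_next(uint16_t s)
{
    uint16_t lsb = s & 1u;
    s >>= 1;
    if (lsb)
        s ^= 0xB400u;
    return s;
}

/*
 * Advance *state to the next value in 1..n and return it as an index 0..n-1.
 * Calling this n times yields every index exactly once.
 */
uint32_t lfsr_next_index(uint16_t *state, uint32_t n)
{
    assert(n >= 1 && n <= 65535 && *state != 0);
    do {
        *state = lfsr16_next(*state);
    } while ((uint32_t)*state - 1u >= n);
    return (uint32_t)*state - 1u;
}
```

The memory cost is a single 16-bit state. The skipping only stays cheap when the register width is chosen close to the sequence length, which is presumably why zid asks for a cycle length equal to the sequence length. The order produced is fixed by the polynomial and the starting state, so it is far less random than a real shuffle.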

18:26 <heat> mjg, is there any benefit in interleaving loads and stores?

18:26 <zid> not on architectures that matter

18:26 <zid> yes on architectures that don't

18:26 <heat> I think they (Intel) explicitly say there is for SIMD

18:26 <zid> like mips, and atom

18:26 <zid> and avx512

18:27 <mjg> heat: for simd i don't know

18:29 eroux has quit [Ping timeout: 248 seconds]

18:34 eroux has joined #osdev

18:47 <zzo38> Do you have any suggestions about specific changes to my designs, or if anything about it is unclear, etc?

18:48 Arthuria has joined #osdev

18:49 wand has quit [Ping timeout: 255 seconds]

18:50 wand has joined #osdev

18:54 heat has quit [Remote host closed the connection]

18:55 heat has joined #osdev

18:56 <zzo38> Perhaps one thing I did not mention about the file records, is that the records do not all have to be the same size, and the record numbers do not have to be contiguous (it is likely that many record numbers will be skipped, since that file does not use them)

18:58 <zzo38> Does the design of capabilities makes sense, or do you suggest changes?

19:00 <zzo38> (It seems to be a problem of other operating systems, that do not properly support making locks and transactions that have several resources grouped at once; they usually only can do them separately.)

19:22 <heat> mjg, actually im wondering now if any of the ALIGN_TEXTs matter for small-ass sizes

19:22 craigo has quit [Ping timeout: 255 seconds]

19:22 <heat> at that point you're already doing something very pessimal, have gone through several branches, just for a 1-8 byte copy

19:23 <mjg> they do matter a tad bit

19:23 <heat> so wouldn't it be better to save on icache?

19:23 <mjg> once the target is far enough from the jump instruction you suffer from it not being aligned

19:23 <heat> yes but icache footprint

19:24 <mjg> they are most likely useless/harmful if you roll with a "switch" upfront

19:24 <mjg> it is a tradeoff, see once more the reasoning for sizes 32-256

19:24 <heat> yes

19:25 <heat> also I think bionic memmove does test fuckery instead of cmp

19:25 <heat> maybe worth a shot

19:25 <mjg> i checked in agner fog's tables

19:25 <mjg> it's literally the same shit

19:25 <mjg> in the cpu

19:25 <heat> really?

19:25 <mjg> yea

19:25 <heat> wtf

19:26 <mjg> i mean ports used and whatnot

19:26 <heat> yes

19:26 <mjg> cycle cost

19:26 <mjg> basically no diff that i could bench

19:26 <mjg> and see above why

19:26 <heat> i would expect an AND operation to be a good bit better than cmp

19:26 <heat> guess not

19:26 <mjg> i think what actually costs is the fucking branch mate

19:26 <mjg> also note instruction fusing

19:27 <heat> yep

19:27 <mjg> that said, it may be there is a funny corner case

19:27 <mjg> absent good reason to *not* follow bionic on this one, i would argue you *should* do it

19:28 <mjg> 'looks the same so we gonna go the other way' is what i gave people shit for in the past

19:28 <heat> well yes but otoh that memmove isn't all that great AND it was written in 2014

19:28 <heat> almost 10 years ago

19:28 <mjg> is not this where your cpu is from

19:28 <mjg> :XX

19:28 <heat> no

19:29 <mjg> jinkies

19:29 <heat> kabylake is 2016, kabylake R is 2017

19:29 <mjg> look at mr modern man ova here

19:30 <heat> i bet you're using haswell

19:30 <mjg> i really should have added more comments

19:30 <mjg> to all that stuff

19:30 <mjg> i just rediscovered why 'weird bit' is actually good

19:31 <heat> what weird bit?

19:31 <mjg> in memset 32 or more i do

19:31 <mjg> cmpb $16,%dl

19:31 <mjg> ja 201632f

19:31 <mjg> movq %r10,-16(%rdi,%rdx)

19:31 <mjg> movq %r10,-8(%rdi,%rdx)

19:31 <mjg> as in the tail bigger than 16 is handled separately

19:31 <mjg> turns out overlapping 16 bytes when it can be avoided is tolerable

19:31 <mjg> 32 is a major bummer

19:36 <moon-child> heat: all basic arithmetic is single cycle for a long time now

19:40 <heat> why do you cmp on the actual 8/16-bit reg

19:40 <heat> is there any advantage in doing that

19:40 <mjg> it is smaller code

19:40 <mjg> mr ifunc

19:40 <mjg> erm icache

19:40 <heat> ifunc, icache, icrap

19:40 <mjg> iPhone

19:40 <mjg> irepstos

19:40 <heat> yes smaller code and then you blow it out the water with a nice 10-byte nop or whatever

19:41 <mjg> but i can fit more in there if needed

19:41 <mjg> mofer

19:41 <moon-child> won't somebody please think of the bytes!

19:42 <mjg> aight, got a db of 520684993 real-world calls to memset

19:42 <heat> export it to SQL and query away

19:42 <mjg> i'm gonna do it on the cloud mate

19:43 <heat> oracle database moment?

19:46 d5k has joined #osdev

19:46 <heat> I feel dirty using r8d and r8w

19:47 <nikolar> nah it's fine

19:47 <heat> it's not fine

19:47 <heat> 1) needs an extra prefix

19:47 <heat> 2) clunky naming

19:51 <mjg> you don't need these regs

19:51 <mjg> i only used them so that i can safely embed into copyin/copyout

19:51 <mjg> which already use some regs

19:51 <mjg> and i don't want to save/restore

19:54 <heat> i do need them

19:54 <heat> rdi, rsi, rdx are used by the args, rax is primed with the return value

19:54 <heat> so that leaves me with rcx, r8, r9, r10, r11

19:54 <mjg> see bionic

19:55 d5k has quit [Quit: leaving]

19:58 heat_ has joined #osdev

19:58 heat has quit [Read error: Connection reset by peer]

19:58 heat_ is now known as heat

19:58 <heat> bionic saves rbx

19:58 <mjg> wut

19:59 <heat> yes

19:59 <heat> although they do have a funny trick here where they reuse rsi for the last load when doing the tail copying

20:04 <mjg> just be happy this is not ia32

20:04 <mjg> famine register state

20:04 <moon-child> ia64

20:04 <moon-child> 128 registers

20:04 <mjg> bring back itanium!!

20:04 <moon-child> everything else is trash by comparison

20:04 <heat> YESSIR

20:04 <mjg> now i'm curious how a memset runs there

20:04 <mjg> i mean looks like

20:05 <heat> https://elixir.bootlin.com/glibc/latest/source/sysdeps/ia64/memcpy.S

20:05 <bslsk05> elixir.bootlin.com: memcpy.S - sysdeps/ia64/memcpy.S - Glibc source code (glibc-2.37.9000) - Bootlin

20:06 <heat> it looks stunning

20:06 <heat> as in "i'm stunned wtf is going on"

20:06 <moon-child> 'memcpy assumes little endian mode' wat

20:06 <moon-child> why doesn't it matter? Don't loads have the same endianness as stores either way?

20:06 <heat> KEEP HATING moon-child

20:07 <moon-child> lol

20:07 <mjg> haters gonna hate

20:07 <mjg> fuck you moon-child

20:07 <mjg> !!!

20:09 <heat> shut up mjg cpu architecture fascist

20:09 <heat> mjg? more like bitchjg

20:09 <mjg> E10k or bust motherfucker

20:10 <mjg> https://www.youtube.com/watch?v=OSprsQTsy7c

20:10 <bslsk05> www.youtube.com: Sun Enterprise 10000 - YouTube

20:10 bgs has quit [*.net *.split]

20:10 smeso has quit [*.net *.split]

20:10 warlock has quit [*.net *.split]

20:10 bauen1 has quit [*.net *.split]

20:10 m5zs7k has quit [*.net *.split]

20:10 mahk has quit [*.net *.split]

20:10 matthews has quit [*.net *.split]

20:10 bnchs has quit [*.net *.split]

20:10 zzo38 has quit [*.net *.split]

20:10 ZipCPU has quit [*.net *.split]

20:10 sprock has quit [*.net *.split]

20:10 dminuoso_ has quit [*.net *.split]

20:10 k4m1 has quit [*.net *.split]

20:10 fkrauthan has quit [*.net *.split]

20:10 aws has quit [*.net *.split]

20:10 Archer has quit [*.net *.split]

20:10 j`ey has quit [*.net *.split]

20:10 Stary has quit [*.net *.split]

20:10 DoubleJ has quit [*.net *.split]

20:10 particleflux has quit [*.net *.split]

20:10 night has quit [*.net *.split]

20:10 meisaka has quit [*.net *.split]

20:11 zzo38 has joined #osdev

20:11 dminuoso_ has joined #osdev

20:11 mahk has joined #osdev

20:11 sprock has joined #osdev

20:11 bnchs has joined #osdev

20:11 matthews has joined #osdev

20:11 fkrauthan has joined #osdev

20:11 j`ey has joined #osdev

20:11 k4m1 has joined #osdev

20:11 Archer has joined #osdev

20:11 DoubleJ has joined #osdev

20:11 warlock has joined #osdev

20:11 smeso has joined #osdev

20:11 bauen1 has joined #osdev

20:11 ZipCPU has joined #osdev

20:11 aws has joined #osdev

20:11 particleflux has joined #osdev

20:11 Stary has joined #osdev

20:11 night has joined #osdev

20:11 meisaka has joined #osdev

20:11 bgs has joined #osdev

20:11 m5zs7k has joined #osdev

20:12 Archer has quit [Max SendQ exceeded]

20:12 dminuoso_ has quit [Max SendQ exceeded]

20:12 <heat> why does bionic memcpy also handle memmove?

20:13 <heat> is this mildly concerning?

20:13 <mjg> it used to be that glibc did it

20:13 <mjg> and trying to not do resulted in buggz

20:13 <heat> is this an actual compatibility concern?

20:13 dminuoso has joined #osdev

20:13 <mjg> depends, i don't know if glibc is doing it today

20:13 <mjg> people claim it is not

20:14 <heat> generic memcpy isn't I think

20:14 <heat> so...

20:14 <zid> because you can't trust people who'd use bionic

20:14 <mjg> that funky memcpy does

20:14 rein-er has joined #osdev

20:14 <zid> to stick to the actual semantics of memcpy

20:14 <moon-child> I would rather check for overlap and fault if so

20:14 <heat> their generic memcpy also supports page moving for GNU hurd

20:14 <moon-child> fix yo shit

20:15 <zid> I'm actually really lazy about using memcpy instead of memmove >_<

20:15 <mrvn> memcpy() should check and assert so bad code gets fixed.

20:15 <mjg> fucking

20:15 <heat> fucking. - mjg

20:15 <mjg> i wrote that toy prog i mentioned, very wip

20:16 <mjg> runtimes vary wildly

20:16 <mjg> turns out the total time is so long it gets preempted

20:16 <mjg> :d

20:16 <heat> toy prog for what?

20:16 <mjg> heheszek-read-bionic time 8920719533

20:16 <mjg> heheszek-read-erms time 10307939142

20:16 <mjg> heheszek-read-current time 8229679417

20:16 <mjg> heheszek-read-current time 10446866317

20:16 <mjg> heheszek-read-current time 6845942134

20:16 <mjg> running the 50 mln memsets

20:16 <geist> yah i think most libcs i've looked at simply have memcpy and memmove be the same symbol

20:16 <mrvn> mjg: pin the test to the core and pin everything else away from it

20:16 <mjg> mrvn: i already did

20:16 <geist> is it silly? yeah, but then really having two separate symbols is

20:17 <mjg> i may need to boot on linux and isolate cpus

20:17 <mrvn> mjg: then linux tickless implementation sucks.

20:17 <geist> it's like sprintf or gets, they're bad ideas from an older era

20:17 <mjg> mrvn: that's on freebsd :)

20:17 <mjg> mrvn: will do it on linux

20:17 <mrvn> mjg: did you remember to pin the IRQs too?

20:17 <mjg> i can't do that on that sucker

20:17 <mjg> again, will do it right on linux, but boomer i have to resort to it

20:17 <heat> geist, i think separate memcpy still makes sense. you optimize out a branch

20:18 <geist> but you might break code that misuses it

20:18 <heat> just like having separate memcpy_aligned_N or memcpy_nt makes sense

20:18 <geist> also means you need to write two implementations

20:18 <zid> It makes a lot of sense for specifically a language like C to make both available as builtins

20:19 <zid> why use C if you don't want optimizations like that to happen and break your code, use rust :P

20:19 <heat> ruuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu

20:19 <geist> ruuuuuuuuuuuust

20:19 <geist> though rust will probably just call through to memcpy to be fair

20:19 <geist> but it should know if things overlap

20:19 <heat> that's right, if you don't want memcpy, use go

20:20 * heat watches as they reimplement memcpy

20:20 Vercas2 has joined #osdev

20:20 <moon-child> you can't know if stuff overlaps

20:20 <moon-child> in general

20:21 <moon-child> cus you can compute arbitrary subscripts into an array

20:21 <mjg> everyone is a smartass until they need to code a website

20:21 <geist> rust can probably via a series of rules know that any two objects can't overlap

20:21 <geist> it's only situations where you're moving stuff within the same array

20:21 <mrvn> moon-child: but then the compiler knows it's the same array

20:21 <moon-child> sometimes you're in the same array and you know your subscript never overlap

20:21 <geist> rust in particular would know this. C/C++ wouldn't

20:21 Vercas has quit [Ping timeout: 255 seconds]

20:21 gxt__ has quit [Ping timeout: 255 seconds]

20:21 Vercas2 is now known as Vercas

20:21 <moon-child> I'm not saying it doesn't sometimes know, I'm saying it doesn't always know

20:21 <moon-child> c can restrict

20:22 <mrvn> moon-child: If it can't know if ranges of objects overlap then it can call memmove

20:22 <geist> does make me think, i do remember there was a fair clever sequence of instructions to detect overlap, and how much

20:23 <mjg> i once more note some of the intel claim you need to check relative addresses for performance reasons anyway

20:23 <geist> and also interestingly, depending on how two things overlap and by how much and how long your copy stride is you can still a lot of the time use your core algorithm

20:23 <mjg> and if you do that, the issue goes away

20:23 <heat> why is no one on the page loaning koolaid

20:23 <geist> that'd have to be fast as fuck to beat a copy

20:23 <mrvn> if (abs(dst - src) > 256) memcpy_I_dont_care()

20:24 <moon-child> geist, first thing that comes to mind is sign bit of (startx - starty) xor (startx + len - starty)

20:24 <geist> moon-child: something like that

20:24 <geist> mrvn: hmm, but does that work with both directions of overlap?

20:24 <mrvn> geist: no. sometimes you have to copy backwards.

20:24 <geist> if your algorithm copies forward then one of the overlaps will be problematic, i think

20:25 <mrvn> geist: if (abs(dst - src) > len) memcpy_its_safe()
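
A minimal C sketch of the overlap tests being discussed here. The helper names `ranges_overlap` and `forward_copy_is_safe` are made up for illustration and do not come from any libc, and the code assumes a flat address space where pointers can be compared as integers. A distance of exactly `len` means the ranges only touch, so the test is `dist < len`; the second helper folds the direction check into a single unsigned compare.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if [a, a+len) and [b, b+len) share any byte. */
static bool ranges_overlap(const void *a, const void *b, size_t len)
{
    uintptr_t x = (uintptr_t)a;
    uintptr_t y = (uintptr_t)b;
    /* Unsigned distance between the two start addresses. */
    uintptr_t dist = x > y ? x - y : y - x;
    return dist < len;
}

/*
 * Hypothetical helper: true if a plain forward copy cannot overwrite source
 * bytes before they are read. If dst is below src the subtraction wraps to
 * a huge value, so one unsigned compare covers both safe cases. It
 * conservatively reports dst == src (with len > 0) as unsafe.
 */
static bool forward_copy_is_safe(const void *dst, const void *src, size_t len)
{
    return (uintptr_t)dst - (uintptr_t)src >= len;
}
```

When `forward_copy_is_safe` returns false, the destination starts inside the source range and the copy has to run backwards, which is the case mrvn and geist point out.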

20:25 <moon-child> even if they overlap the 'wrong' way, as long as it's less than your buffer size, it's fine

20:26 <mrvn> moon-child: no. some overlaps you have to copy backwards.

20:26 <geist> moon-child: only if you copy in the right direction, because otherwise you'll start overwriting source data before you get to it

20:26 <moon-child> err no

20:26 <moon-child> ignore me

20:26 <mrvn> You can ignore backwards copying when len < buffer_size

20:26 <geist> but yeah you *should* be generally able to write a reverse copy version of your algorithm at pretty much the same speed

20:26 <geist> so now you have to write two versions of the whole thing, forward and backwards

20:27 <geist> then handle the overlap case where they're too close: which is probably just revert to a bytewise forward or back copy

20:27 <geist> then that's basically the guts of memmove
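
The "guts of memmove" described above can be sketched roughly as follows. This is a bytewise illustration only, with the made-up name `sketch_memmove`; a real implementation would use wide loads and stores in both directions, and nothing here is taken from an actual libc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative memmove: pick the copy direction from the pointer layout. */
static void *sketch_memmove(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d - (uintptr_t)s >= len) {
        /* dst is below src, or at least len bytes above it:
         * a forward copy never clobbers unread source bytes. */
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    } else {
        /* dst starts inside the source range: copy from the end
         * so each source byte is read before it is overwritten. */
        for (size_t i = len; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

For example, moving the first four bytes of "abcdefgh" two positions to the right takes the backward path and yields "ababcdgh".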

20:27 <mrvn> Copying backwards is bad for the performance though. So you probably want to copy smaller chunks forward if the overlap is larger than your buffer size.

20:27 <geist> not really. i think most decent architectures detect reverse stride just as well

20:27 <moon-child> why would backwards be bad?

20:27 <mrvn> e.g. on an overlap with 64k offset

20:28 <moon-child> yeah prefetcher can detect backwards accesses

20:28 <mrvn> moon-child: because RAM chips are made to read incrementing sequential.

20:28 <geist> oh modern stuff is so abstracted from what ram chips are doing that's immaterial

20:28 <mrvn> ok

20:28 <geist> but sure you can still do say 64 byte chunks forward as you step backwards

20:28 <mrvn> that info is maybe 1-2 decades old.

20:29 <geist> i mean the ram chip probably does have a open next row thing (depending on if it's row or column first) but then the rows are large, and i think you can easily address backwards in it

20:29 <mrvn> geist: would it matter for 64byte? That's a cache line. I don't think it matters in what direction you access a cache line.

20:29 <geist> right

20:30 <geist> hence why that wouldn't really matter if it did it forward or backwards within the cache line, probably.

20:30 <mrvn> geist: I was more thinking about 1k or so. Where the prefetcher would fail to get the next cache line ahead of time.

20:31 * geist nods

20:31 <moon-child> there is sub-cacheline structure. But I don't think order matters

20:31 <geist> now if it's just that level of arch where it can't really detect prefetching stride, then yeah you might be worse performance

20:31 <mrvn> And then as you say the next size is the row of the DIMMs. Every time you have to load a new row address you lose time.

20:32 <geist> but we're already talking about the sub case where things overlap in some way. if A is less than B or vice versa with no overlap there's no reason doing a backwards copy

20:32 <moon-child> (I think usually, sub-cacheline, is organised into groups of 8 bytes. A misaligned 8-byte load can grab from two 8-byte groups, doesn't matter if they're in the same line, hence misaligned 8-byte loads may be fast even when they cross a cache line boundary, contra wider loads)

20:32 <geist> honestly you could probably just revert to bytewise for overlap cases and probably wouldn't be a big deal

20:32 <geist> i mean it would be sub optimal but that's sort of a TODO case

20:33 <mrvn> checking for all the overlap cases though is a couple of branches. So for memcpy() it's worth it not to need those.

20:33 <geist> lots of overlap is usually folks moving strings around, and they're probably fairly small *or* it's something incredibly stupid

20:33 <geist> again depends on if you trust all users of memcpy to do the right thing. i'm not sure if that's a wise idea because of the existing implementations that just union the two things

20:34 <geist> and sure that's broken code, etc etc, but that's also the life of an OS hacker

20:34 <geist> dealing with dumbass users

20:35 <mrvn> geist: are you calling me dumbass? :)

20:36 <geist> welllll

20:36 <mrvn> being the only user has the pro and con that all users are as dumb as I expect them to be.

20:37 <geist> yah

20:38 heat has quit [Read error: Connection reset by peer]

20:38 heat has joined #osdev

20:38 <mrvn> I like having assert()s that check for overlap on memcpy though. Because I know I'm dumb. :)

20:39 gxt__ has joined #osdev

20:47 <geist> you can certainly do it in a wrapper without messing with the core implementation

20:49 <mrvn> It's best placed (or at least duplicated) in the .h file so the compiler can evaluate it at compile time where possible. Only the case where everything is unknown should call the full memcpy/memmove with all the branches.
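
One way the header-level check could look, as a hedged sketch: the wrapper name `checked_memcpy` is an assumption, not an existing API. Being `static inline` in a header, the assert is visible at every call site, so a compiler can often drop it when both pointers and the length are known at compile time, and `NDEBUG` removes it entirely in release builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical debug wrapper around memcpy that traps on overlap. */
static inline void *checked_memcpy(void *dst, const void *src, size_t len)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    /* Unsigned distance between the two start addresses. */
    uintptr_t dist = d > s ? d - s : s - d;

    /* Overlapping ranges are a caller bug: fail loudly in debug builds. */
    assert(dist >= len && "memcpy called with overlapping ranges");
    return memcpy(dst, src, len);
}
```

The core memcpy stays untouched, which matches geist's point about doing the check in a wrapper.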

20:58 <heat> mjg, turns out 256 is indeed too conservative for erms

20:59 <heat> i bumped it up to 512 here

20:59 <mjg> that's pushin it

20:59 <heat> it is

20:59 <mrvn> mjg: with all your memcpy benchmarking can you say at what point memcpy will be slower than changing page tables?

21:00 <heat> also I assume erms on backwards sucks harder than a manual copy?

21:00 <mjg> mrvn: i'm only benchin small sizes

21:00 <mjg> mrvn: < 256

21:00 <mjg> mrvn: this is for kernel memcpy

21:01 <mjg> et al

21:01 <mjg> heat: yea

21:01 <mrvn> that's exactly where I would move pages around too

21:01 <geist> i dunno unless you have an extremely specialized situation, moving around pages is very expensive

21:01 <geist> you'd have to be local cpu, involve no cross cpu TLB shootdowns, etc

21:02 <mjg> i once more refer you to https://people.freebsd.org/~mjg/bufsizes.txt

21:02 <heat> moving pages around's cost probably scales with the number of CPUs involved

21:02 <mjg> see memmove_erms et al

21:02 <mjg> tons and tons of ops are super small

21:02 <mrvn> geist: no threads, no shared memory, so no cross cpu worries there.

21:02 <zid> what's an erms

21:02 <geist> no matter what you do you should probably optimize a page sized memset and page sized copy

21:03 <geist> hypothetically that should be most of what your kernel does

21:03 <zid> I feel it isn't efficient reverse memory sausage

21:03 <mjg> i have to go afk now, can respond later to whatever

21:03 <mrvn> geist: it is, so far.

21:03 <geist> mrvn: sure. ie, the extremely specialized situation

21:03 <heat> zid, Enhanced REP MOVSB

21:04 <geist> in more general purpose things, it's hard to justify fiddling with page tables at this level

21:04 <zid> oh right

21:04 <mrvn> still, would be nice to know at what size a local page remap, global page remap, cross cpu shootdown, ... becomes cheaper than copying

21:04 <geist> i think it made sense farther back in time, but it's one of these cases where modern cpus are faster copying things than the overhead of fiddling with the mmu. plus all the cross cpu shootdowns

21:05 <mrvn> isn't that reversing with the cross cpu shootdown mechanics that don't use IPIs?

21:06 <geist> like in ARM64? possibly, but that's also not free

21:06 <heat> they are NOT free

21:06 <geist> for example you'd probably have to do break-before-make

21:06 <heat> and also O(n) with the number of CPUs involved

21:06 <geist> since you're replacing one page with another

21:06 <mrvn> Not free. Cross cpu shootdowns just have become so expensive that they are getting optimized now.

21:06 <geist> the ARM TLB shootdown stuff is nice because it doesn't have to do a full IPI, but there's still cost

21:07 <geist> the new intel TLB shootdown proposal effectively bumps the IPI up to some sort of pseudo microcode/SMM mode thing

21:07 <geist> so hypothetically that helps a bit, since it's not a full interrupt

21:08 <mrvn> I just want to invalidate the other cores TLB cache entry.

21:08 <geist> and of course AMD has their solution too

21:08 <heat> why are x86 vendors so stubborn?

21:08 <heat> screw them both

21:08 <geist> yay!

21:08 <geist> go Via! be the dark horse third one

21:08 <heat> i want a bootleg soviet 386

21:10 <geist> anyway re: TLB shootdown on ARM. it *has* to be more efficient than an explicit IPI like you get on x86 or riscv, but i honestly dont know the numbers

21:10 <geist> it's one of these things where you really just dont have any other choice, since there's functionally no other way to do it

21:10 <geist> so hypothetically the tlb shootdown is 'free' but probably in reality it's highly based on the microarchitecture, how many cpus are active, what they're doing at the time, etc

21:10 <geist> i've never actually measured it to be honest

21:12 <mrvn> geist: it should really be the same cost as a local TLB shootdown except you send it to the other cache. The only costly thing would be when 2 cores try to do it at the same time. Then one has to somehow wait etc...

21:13 <mrvn> out of curiosity: Does ARM support multiple sockets?

21:13 <geist> yeah

21:13 <dh`> there must be multiple-socket arm64 boards by now

21:13 <geist> yep. i have one right here

21:15 <dh`> the thing I have always wondered about hardware-assisted tlb shootdown is: tlb shootdown has to ultimately be synchronous (that is, you have to wait for it to complete before you continue with whatever VM ops you're doing) but suspending the current processor entirely for that time seems like a bad plan

21:15 <geist> the mechanism by which the TLB invalidates gets broadcast is not documented, but presumably it's somewhat like a cache eviction thing

21:15 <zid> geist loves his athlon 64 x2 x2

21:15 <geist> dh`: the mechanism ARM (and AMD) do is you start the eviction with an instruction, and then later on there's another instruction to sync

21:15 <geist> DSB on arm, TLBSYNC on AMD

21:15 <mrvn> dh`: it's just a pipeline flush

21:15 <geist> so that lets you at least do some other work in the middle

21:16 <dh`> sounds like not a full switch to another thread though

21:16 <geist> so it does mitigate that somewhat. you can schedule multiple flushes in your code and then synchronize on the way out

21:16 <dh`> but maybe the latencies aren't high enough for that to matter

21:16 <heat> no, you don't switch

21:17 <heat> the point is that you can keep batching tlb shootdowns as you go and TLBI'ing them

21:17 <geist> what is annoying about swapping pages vs just general unmap/etc is you have to do break-before-make, which is somewhat more synchronous

21:17 <heat> then at the end you tlbsync/dsb

21:17 <geist> ie, you have to right then and there shoot down the TLB and wait for it to complete before you can put the new entry in

21:17 <dh`> with an IPI it's quite feasible to run another thread

21:17 <mrvn> dh`: you only need to sync before the next instruction that could access the evicted address.

21:17 <dh`> since even without latency from interrupts being off on some of the other cpus, the interrupt and interrupt dispatch takes considerably longer than two thread switches

21:18 <mrvn> geist: why? I would say the opposite, swapping pages is make-before-break

21:18 <heat> i mean, you could schedule out I think?

21:18 <geist> nope. not on ARM. it's complicated

21:18 <geist> it's all about avoiding the situation of having conflicting TLB entries on multiple cpus at the same time

21:18 <mrvn> geist: you write the new address into the page table, you invalidate the TLB.

21:18 <geist> certain subsets of TLB shootdowns involve break-before-make

21:19 <geist> nope. that's not how it works mrvn

21:19 <geist> there's a whole treatise in the ARM manual about why you need break-before-make, and which precise situations you need it

21:19 <mrvn> geist: doesn't that guarantee that after sync all cores will have the same entry (or none)?

21:19 <geist> it guarantees that *at all points in the sync* they have the same entry

21:19 <geist> which is mandatory for Reasons

21:19 <mrvn> urgs.

21:19 <geist> ie you remove the old entry, TLB sync, then add the new entry

21:20 <geist> ie, break-before-make
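
The ordering geist describes can be modelled in plain C as below. This is only a sketch of the sequence: `tlb_invalidate_page` and `tlb_sync` are hypothetical stand-ins that just record that they were called (on real hardware they would be something like TLBI and DSB on ARM), and the extra barriers a real port needs are omitted.

```c
#include <assert.h>
#include <stdint.h>

enum step { STEP_CLEAR, STEP_INVALIDATE, STEP_SYNC, STEP_WRITE };

static enum step step_log[8];
static int step_count;

/* Record a step so the ordering can be inspected afterwards. */
static void log_step(enum step s)
{
    if (step_count < 8)
        step_log[step_count++] = s;
}

/* Hypothetical stand-ins for the architecture's TLB operations. */
static void tlb_invalidate_page(uintptr_t va) { (void)va; log_step(STEP_INVALIDATE); }
static void tlb_sync(void) { log_step(STEP_SYNC); }

/* Replace one valid mapping with another using break-before-make. */
static void remap_break_before_make(volatile uint64_t *pte, uintptr_t va,
                                    uint64_t new_entry)
{
    *pte = 0;                 /* break: remove the old entry */
    log_step(STEP_CLEAR);
    tlb_invalidate_page(va);  /* shoot down any cached translation */
    tlb_sync();               /* wait until the invalidation has completed */
    *pte = new_entry;         /* make: only now install the new entry */
    log_step(STEP_WRITE);
}
```

Between the break and the make, any access to the address faults, so no CPU can hold the old and new translations at the same time.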

21:20 <mrvn> I would have thought you just ignore the big race hole between make and break. It's UB anyway.

21:20 <geist> so only subsets of page table modifies hit this case, but changing what is mapped at a particular slot to something else is one of them

21:20 <geist> unmapping or mapping doesn't cause it, of course

21:21 <mrvn> geist: does the reason involve the A/D bits?

21:21 <geist> yes

21:21 <dh`> I vaguely remember having this discussion once before here

21:21 <mrvn> ahh, not using (or having even) those.

21:21 <geist> everything to do with other cores having weak memory model writeback to A/D bits and having out of sync TLB entries

21:21 <geist> mrvn: congratulations

21:22 <mrvn> geist: the A/D bits can get triggered between make and break and then you indeed have problems.

21:22 <geist> yep, that's the main issue, and there's some other subtle reason

21:22 <geist> i encourage you to read the manual on the topic before you fiddle too much, so at least you know if you're playing with fire or not

21:22 <mrvn> Is A/D hardware bits mandatory on AArch64 or still optional?

21:23 <geist> >=v8.1

21:23 <dh`> you can get inconsistent reads if it's a page shared with another process

21:23 <geist> >=v8.1

21:24 <mrvn> geist: I have no shared memory, no mmap, no page getting remapped actually. I (so far) only have map, unmap and move between 2 page tables.

21:24 <geist> congratulations

21:24 <mrvn> geist: I'm sticking with a very simplistic memory model for reasons. :)

21:24 <geist> i understand you have a very simple system that bypasses most of these concerns, but it really doesn't matter to me (or probably anyone else)

21:24 <geist> it's not useful to recommend things to other people based on your personal needs

21:24 <mrvn> just saying why I haven't run into this issue

21:25 <geist> sure

21:25 <geist> but again you should probably read the manual on the topic. i think there's some other reason why break-before-make may be necessary

21:26 <dh`> suppose it's a copy-on-write remap; process 1 thread 1 reads 6, goes to write 9, updates the mapping, starts shootdown, process 2 writes 7, process 1 thread 2 reads the 7, then the shootdown completes

21:26 <mrvn> dh`: if you have a shared page and one thread/process remaps the page without some form of synchronization then you have a race condition already.

21:26 <geist> there's some trickery when detaching page tables that involve a BBM style thing

21:26 <heat> gang, i need assembly help

21:26 <heat> jz .Lout

21:26 <heat> sub $32, %rdx

21:26 <heat> cmp $32, %rdx

21:26 <heat> jae .L32b_backwards

21:26 <geist> to keep other cpus from having a page table cache entry floating around before the page is reused

21:26 <heat> why is the cmp not redundant?

21:26 <heat> WAIT

21:26 <heat> im stupid

21:26 <dh`> when rdx starts out at 65 :-)

21:26 <geist> yay duck debugging
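
A small C model of why the `cmp` matters, assuming heat's snippet is the tail of a loop that copies one 32-byte block per iteration and is entered with at least 32 bytes left (the surrounding code is not shown in the log, so that loop shape is a guess). After `sub $32, %rdx` the carry flag only says whether the subtraction borrowed, not whether another full block remains, so `jae` without the `cmp` runs one iteration too many.

```c
#include <assert.h>
#include <stddef.h>

/* Models: loop { copy 32 bytes; sub $32, %rdx; cmp $32, %rdx; jae loop } */
static size_t blocks_with_cmp(size_t rdx)
{
    size_t blocks = 0;
    do {
        blocks++;            /* one 32-byte block copied */
        rdx -= 32;           /* sub $32, %rdx */
    } while (rdx >= 32);     /* cmp $32, %rdx ; jae */
    return blocks;
}

/*
 * Same loop with the cmp removed: jae now tests the carry flag left by the
 * sub, i.e. it loops whenever the subtraction did not borrow.
 */
static size_t blocks_without_cmp(size_t rdx)
{
    size_t blocks = 0;
    int no_borrow;
    do {
        blocks++;
        no_borrow = rdx >= 32;   /* CF is clear after sub $32 */
        rdx -= 32;               /* wraps on the final iteration */
    } while (no_borrow);
    return blocks;
}
```

With 65 bytes, as in dh`'s example, the version with the `cmp` copies two blocks and leaves one byte for the tail, while the version without it copies a third block past the end.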

21:28 <mrvn> dh`: In your example both processes would trigger a page fault. The original entry is read-only.

21:28 <dh`> no? p1 t1 was already doing the page fault, p1 t2 was only reading, p2 might not have it readonly

21:28 GeDaMo has quit [Quit: That's it, you people have stood in my way long enough! I'm going to clown college!]

21:29 <mrvn> dh`: process 2 writes 7 ==> page fault

21:29 <heat> geist, what does zircon use to handle copyin/out page faults?

21:29 <heat> fixup table?

21:29 <dh`> p2 might not have it readonly

21:29 <mrvn> dh`: It's COW, it must be read-only.

21:29 <dh`> says who? see MAP_PRIVATE

21:29 <geist> fixup table i think, depending on precisely what you mean by fixup table

21:29 <mrvn> dh`: that's how COW work. Both sides of the COW get a read-only entry.

21:29 <dh`> I mean, arguably MAP_PRIVATE is a bug, but it's the agreed-upon default behavior

21:29 <heat> struct fixup_entry {u64 ip; u64 fixup_ip;} table[];
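
A sketch of how a table like heat's might be consulted from a page fault handler. The lookup function `find_fixup`, the sorted-by-`ip` assumption, and the use of 0 as "no entry" are illustrative choices and not taken from any particular kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fixup_entry {
    uint64_t ip;        /* address of an instruction that may fault */
    uint64_t fixup_ip;  /* where to resume execution if it does */
};

/*
 * Binary search for the faulting instruction pointer. Returns the fixup
 * address, or 0 if the fault did not come from a registered instruction.
 * Assumes the table is sorted by ip.
 */
static uint64_t find_fixup(const struct fixup_entry *table, size_t count,
                           uint64_t fault_ip)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].ip == fault_ip)
            return table[mid].fixup_ip;
        if (table[mid].ip < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

Every user-memory access in the copy routine needs its own entry, which is the cost heat complains about further down.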

21:30 <mrvn> dh`: the page only becomes read-write when you resolve the COW and then the page is no longer shared.

21:30 <dh`> mrvn: I repeat, MAP_PRIVATE

21:31 <geist> dh`: i think the key here is at some point after the page table entry is updated either the other process still sees the RO version (and page faults) or the RW, but it's okay for that to be slippery

21:31 <mrvn> dh`: MAP_PRIVATE has nothing to do with that.-

21:31 <heat> like, erm, I want to plug this into copy_to/from_user but adding fixup entries for every single memory access sucks

21:31 <dh`> sure it does

21:31 <geist> worst case the second cpu page faults, then goes in to discover the page is already RW and shrugs and retries

21:31 <dh`> MAP_PRIVATE allows other processes' changes to show through to you until you trigger your own copy

21:31 <geist> heat: ah no we just do it as a register/deregister thing

21:31 <dh`> so some other process could be making such a change

21:31 <mrvn> dh`: when you MAP_PRIVATE you get a COW read-only mapping of everything. When you fault on a write you get a not shared copy.

21:31 <heat> ah, like BSD then

21:31 <geist> ie, the start of the copyin/copyout code sets a recovery pointer in the current thread structure
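
The register/deregister scheme can be modelled like this. It is a sketch only: the structure and function names are invented for illustration and are not Zircon's or any BSD's actual code, and a real kernel would be redirecting a hardware trap frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct thread {
    uint64_t fault_recovery_ip;  /* 0 means no user copy is in progress */
};

struct trap_frame {
    uint64_t ip;  /* instruction pointer to resume at after the trap */
};

/* Called at the start of copyin/copyout: register where to bail out to. */
static void user_copy_begin(struct thread *t, uint64_t recovery_ip)
{
    t->fault_recovery_ip = recovery_ip;
}

/* Called on the normal exit path: deregister the recovery pointer. */
static void user_copy_end(struct thread *t)
{
    t->fault_recovery_ip = 0;
}

/*
 * Called by the fault handler when a kernel-mode fault cannot be resolved.
 * Returns true if execution was redirected to the recovery code.
 */
static bool fault_try_recover(struct thread *t, struct trap_frame *frame)
{
    if (t->fault_recovery_ip == 0)
        return false;            /* not in a user copy: a genuine bug */
    frame->ip = t->fault_recovery_ip;
    t->fault_recovery_ip = 0;
    return true;
}
```

Compared with a fixup table, this costs a store on entry and exit of every copy but needs no per-instruction bookkeeping.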

21:31 <heat> yep

21:32 <geist> i think that works reasonably well

21:32 <geist> it's not balls to the wall optimally fast, but i think it's a reasonable compromise

21:32 <geist> having the pre-canned table seems like a microoptimization

21:32 <dh`> geist: defining magic ranges for the trap handler to treat specially shifts a couple instructions out of the common fast path

21:32 <geist> yep

21:32 <mrvn> dh`: process 2 doing a MAP_PRIVATE won't allow it to write to pages so process 1 sees it

21:32 <dh`> yeah, micro-optimization

21:32 <dh`> no, mrvn

21:33 <dh`> process 2 is doing something else, maybe via MAP_SHARED, maybe just write(2)

21:33 <dh`> *you* have the MAP_PRIVATE and you're in the middle of copying

21:33 <mrvn> dh`: ahh, that is a horse of another color

21:33 <dh`> not really

21:34 <mrvn> mixing write and mmap without synchronization is a race condition. So yeah, you might get 7 or 9. It's a race.

21:34 <dh`> actually the example I provided is invalid but you can construct other more complicated ones that involve different addresses on the same page

21:35 <dh`> the point is that you can get traces where thread 1 sees that the copy happened before process 2 and thread 2 sees that it happened afterward

21:36 <dh`> and if you then do this on two separate pages you can get an observable inconsistency

21:36 <mrvn> dh`: sure. hence the need to synchronize. One way to do that is to first break the mapping like geist says you need to do anyway. Then all processes run into a fault and that blocks till you're finished remapping everything.

21:37 <dh`> right

21:37 <dh`> that was the point, ultimately

21:38 <dh`> you can't engage the new mapping until the old mapping is revoked everywhere

21:38 <dh`> that's also what I was blathering about when I was talking about waiting for completion

21:38 <mrvn> dh`: if it involves user processes I would still think all those cases are actually UB or race conditions. The kernel doing COW is a separate matter and the kernel needs to synchronize that.

21:38 <dh`> there's no UB at the machine level

21:39 <heat> ARM does have some UB

21:39 <dh`> and also, things like mprotect changes that can be triggered from userlevel come with implicit atomicity guarantees

21:39 <heat> in the IMPLEMENTATION DEFINED stuff or whatever they call it

21:39 <dh`> typically processors only allow unprivileged execution to be UNPREDICTABLE rather than UNDEFINED because the latter is bad for security

21:40 <mrvn> dh`: if one thread calls mprotect to make a page read-only and another thread writes to it then it's undefined whether the other thread segfaults or not. Depends on the exact timing. That kind of stuff.

21:40 <dh`> no, but it must happen either before or after

21:41 <dh`> and if you let that become fuzzy it becomes possible to construct ways to observe the inconsistency

21:41 <mrvn> even that might not be the case with compiler and cpu reordering stuff

21:41 <dh`> at least in unix it's a basic guarantee of the syscall api

21:42 <mrvn> A lot of that stuff predated threads :)

21:42 <dh`> and a lot of stupid stuff had to be sorted out when it became possible to have multiple threads observing in a single process

21:44 <dh`> I think at this point posix doesn't make promises about what happens if you mprotect or unmap regions that are arguments to read/write when those calls are in progress

21:45 <netbsduser> i am pinning my buffers nowadays

21:45 <mrvn> dh`: where did you see that mprotect is atomic? man 7 pthread doesn't have it in the list of thread safe functions and the manpage doesn't mention it here.

21:45 <netbsduser> if you do read/write it wires down the underlying pages

21:45 <dh`> all system calls are atomic unless explicitly stated otherwise

21:46 <dh`> anyway there's valid reasons for wanting mprotect to be atomic and no real excuse for fumbling it :-)

21:46 <mrvn> dh`: then they would all be thread safe

21:47 <dh`> what specifically do you mean by "thread safe"?

21:47 <geist> i think in general the rules are you can't observe thing in any particular order, but you at least observe old or new state, and nothing in between

21:47 <mrvn> never mind, found it.

21:47 <geist> that's really the only reasonable thing anything can guarantee

21:47 <mrvn> dh`: "A thread-safe function is one that can be safely (i.e., it will deliver the same results regardless of whether it is) called from multiple threads at the same time.

21:47 <mrvn> "

21:48 <dh`> all system calls that are actually system calls should be thread-safe in that sense

21:48 <dh`> that is very basic

21:48 <mrvn> dh`: they are "except for the following functions: ..."

21:48 <geist> well, that's not really true intrinsically. you generally have to jump through at least some number of hoops to guarantee that

21:48 <geist> like, say the file descriptor moves the cursor atomically

21:49 <dh`> calls that are allowed to be wrappers in libc are somewhat different

21:49 <geist> or, a file descriptor is either open or not at any given point

21:49 <mrvn> dh`: anything with static buffers is on that list

21:49 <dh`> there are no syscalls with static buffers

21:49 <dh`> it doesn't make sense

21:49 <mrvn> dh`: but way too many libc functions

21:49 <geist> what is really hard to do is guarantee that at the completion of a syscall all of its results are observed everywhere

21:49 <dh`> yes but that's an entirely different issue

21:49 <geist> easy to do up until you have multiple threads calling things simultaneously

21:50 <mrvn> geist: or even still valid

21:50 <mrvn> dh`: that was in reference to "wrappers in libc"

21:50 <geist> right, we ended up for zircon declaring the model is fairly loose

21:50 <dh`> right

21:50 <heat> is there no way to define descriptive function-local labels in GNU as?

21:51 <heat> .Lblah is not function local

21:51 <heat> this is exactly what mjg was talking about yesterday

21:51 <dh`> heat: only file-local

21:51 <dh`> what's a "function" in assembly anyway? not a meaningful concept

21:51 <heat> :^) shoot me

21:52 <dh`> geist: for anything that updates kernel state unlocking that state should force global visibility

21:52 <geist> yeah but the tough part is what does it do to syscalls that are simultaneously occurring on the same state

21:52 <geist> the canonical example is a syscall that frobs a handle simultaneously being called with a syscall that closes the handle

21:53 <dh`> right, one has to execute first

21:53 <geist> that's too difficult, without serializing the entire kernel

21:53 <dh`> that's part of what's meant by the atomicity guarantees for unix syscalls

21:53 <geist> the obvious 2 cases are: close happens first, frob fails. frob occurs first, close happens

21:54 <geist> but the 3rd and less obvious case is: frob occurs first, close happens *and* exits, frob continues to happen

21:54 <geist> ie you have a syscall that's still occurring on a handle that is now closed

21:54 <dh`> should not be, boils down to a single word-sized access of an entry in the descriptor table, even if you do everything unlocked one will go before the other

21:54 <geist> i'm not entirely sure posix defines that

21:54 heat has quit [Read error: Connection reset by peer]

21:54 <geist> the key is what happens when the frob syscall looks up the underlying object, gets a reference to it. it's now *past* the descriptor table.

21:54 heat has joined #osdev

21:55 <dh`> in principle it means the read happened before the close

21:55 <geist> it did, but then as a result the frob syscall goes about its business *after* the handle was closed

21:55 <dh`> even if all it actually means is that the read crossed the descriptor table before the close

21:55 <dh`> that defines the ordering

21:55 <geist> so you have to consider that case, or explicitly add machinery to make the close syscall wait until all outstanding operations on it have completed

21:55 <geist> we chose not to do the latter

21:56 <dh`> in order to have an inconsistency you need to then be able to observe something that shows that the close executed before the read

21:56 <geist> i think you're missing the point here. the point is not that the close happened before the *start* of the read. it's that the read is continuing to happen

21:56 <geist> because syscalls aren't atomic in units of time

21:57 <dh`> right, but can you construct an observation that shows that?

21:57 <dh`> you can call gettimeofday() after each call but that tells you nothing

21:57 <geist> absolutely. it's easy. do a blocking read on something and then close it

21:57 <geist> i actually dunno what posix does there. does it abort any read operations on the fd?

21:57 <geist> (probably). but is it defined that way

21:58 <geist> or is that just a side effect of implementation

21:58 <dh`> traditionally? the close affects the table, not the file object (or vnode)

21:58 <geist> exactly

21:58 <dh`> how do you observe that the read is still in progress though?

21:59 <geist> the fact that the read syscall is still occurring after T0, where T0 is where the close syscall returns

21:59 <dh`> how do you know it's occurring, and where do you get that stamp?

21:59 <geist> i'm not sure i understand this line of thinking. it's easy to observe all this stuff using standard observational stuff

22:00 <geist> ie, a universal clock to the computer

22:01 <dh`> sure, you can also in principle monitor the electron flows in the cpu

22:01 <dh`> but the semantically important question is what a program can observe

22:01 <geist> also i'm trying not to be posix specific here. i think posix sidesteps a lot of this by not having a tremendous number of types of frobs you can do on handle. i also think it sidesteps a fair number of these things by being implementation specific

22:01 <geist> thread B is still sitting in the read syscall after thread A has completed closing the handle

22:02 <dh`> can you distinguish that from thread B from having returned and stalled before doing anything else?

22:02 <geist> and after some reasonable time does not return with ERR_CANCELLED or whatnot

22:02 <geist> yes yes, i know what you're trying to do here, but that's not the point

22:03 <dh`> it _is_ the point though because all these notions are relative to some model

22:03 <geist> so perhaps a better way of saying it is does thread B get its syscall cancelled as a result of thread A calling close

22:03 <geist> or does thread A wait until thread B exits, etc

22:03 <geist> and i simply dont know what posix states here, or if it states anything at all

22:03 <dh`> the whole point of having something more parallel than fetch one instruction at a time and execute it to completion is that there's supposed to be a model in which the execution is still consistent

22:04 <dh`> it's usually safe to assume that posix states either nothing or nothing helpful :-)

22:04 <geist> right. so my point is you have to define some sort of model ideally. even if the model is precisely undefined in some situations (ie, could be A or B but can't tell)

22:04 <dh`> at some point we discovered that someone had changed netbsd's close to behave in a manner other than the usual default and there was a big ruckus about it

22:04 <dh`> let me see if I can find that

22:05 <dh`> but I think the conclusion was that what whoever did was legal, just unexpected and possibly undesirable

22:05 <geist> what we did for zircon is since basically every syscall operates the same way: take a handle to a thing and do an operation on it. there's a phase in the syscall where the handle is looked up, and at that point the caller has access to it, even if the handle goes away simultaneously

22:06 <geist> and since handles cannot be modified, only added or removed, it avoids a bunch of races with handle changing permissions or whatnot

22:06 <mrvn> dh`: you can send thread2 a signal and see if read return EAGAIN

22:06 <geist> ie a slot in the handle table is in one of two states: present with a set of rights and pointing to an object, or empty. and can only transition between those two states
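
A single-threaded C model of the handle semantics described above: a slot is either empty or holds an object plus rights, a lookup takes its own reference, and closing the handle only empties the slot. All names are invented for this sketch, it is not Zircon code, and a real kernel would need a lock or atomics around the table and the reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 4

struct object {
    int refcount;
    bool destroyed;  /* set when the last reference is dropped */
};

struct handle_slot {
    struct object *obj;  /* NULL means the slot is empty */
    unsigned rights;
};

struct handle_table {
    struct handle_slot slots[TABLE_SIZE];
};

/* Drop one reference; the object dies with the last one. */
static void object_release(struct object *o)
{
    if (--o->refcount == 0)
        o->destroyed = true;
}

/* Fill the first empty slot; the table holds one reference. */
static int handle_add(struct handle_table *t, struct object *o, unsigned rights)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (t->slots[i].obj == NULL) {
            o->refcount++;
            t->slots[i].obj = o;
            t->slots[i].rights = rights;
            return i;
        }
    }
    return -1;
}

/* The lookup phase of a syscall: returns a new reference, or NULL. */
static struct object *handle_lookup(struct handle_table *t, int h)
{
    if (h < 0 || h >= TABLE_SIZE || t->slots[h].obj == NULL)
        return NULL;
    t->slots[h].obj->refcount++;
    return t->slots[h].obj;
}

/* Close empties the slot and drops only the table's reference. */
static bool handle_close(struct handle_table *t, int h)
{
    if (h < 0 || h >= TABLE_SIZE || t->slots[h].obj == NULL)
        return false;
    struct object *o = t->slots[h].obj;
    t->slots[h].obj = NULL;
    t->slots[h].rights = 0;
    object_release(o);
    return true;
}
```

An operation that already passed `handle_lookup` keeps the object alive until it calls `object_release`, even if another thread closes the handle in the meantime; later lookups of the same handle simply fail.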

22:10 <mrvn> dh`: thread2: read(fd), thread1: close(fd); closed = true; signal(USR1); If read returns EINVAL or something then close aborted the read. If read returns EAGAIN / closed is true then close() didn't abort for sure.

22:11 <mrvn> you can add a sufficiently large sleep() after close to make it even more observable.

22:13 <dh`> and since you can't post signals explicitly to threads, what if the signal is only ever delivered to thread 1? :-)

22:13 <dh`> (that's being difficult, yeah, it's a possible way to observe it)

22:14 <dh`> but the question isn't whether close aborted the read, that you can presumably tell by whether the read fails with EBADF

22:14 <dh`> it's whether you can observe that the read is still running after close completed

22:15 <mrvn> dh`: how would that look like? close() aborts the read but then read still returns data written to the file after close()?

22:16 <dh`> anyway, I'll just retreat to the next fortification, which is that file descriptors being handles is part of the semantics of the unix system call API and there's no reason to require operations on handles to affect other operations that have passed the look-up-handle stage

22:16 <mrvn> ack

22:16 bgs has quit [Remote host closed the connection]

22:16 <dh`> mrvn: no, the idea is that close doesn't abort the read

22:16 <geist> yah that's the model i think that makes sense

22:16 <mrvn> dh`: but that part the signal would show.

22:17 <dh`> basically read looks up its filetable entry, close removes that entry, close returns, read continues

22:18 <mrvn> There is a grey zone in my test where close would abort the read(), the read doesn't return yet though and the signal then still happens to catch the sleeping read.

22:19 <dh`> I guess another more direct way to observe this is: thread 1 calls read, thread 2 calls close then writes to a pipe, thread/process 3 reads from the pipe and writes to thread 1's file, thread 1 reads that data

22:19 <mrvn> would be an odd implementation though for the read to be aborted but still catch signals and change the return code.

22:19 <geist> that being said i wonder what happens to network sockets

22:19 <geist> though that may be exactly why shutdown() exists

22:19 <mrvn> geist: shutdown is so you can close the sending side while still reading.

22:19 <dh`> mrvn: in most implementations it would post the signal handler but return EBADF and not EINTR

22:20 <dh`> most signal implementations, that is

22:20 <geist> hmm, you're right. so then what happens if you close a socket that has a pending blocking operation on it? seems in that particular case like it'd be silly to keep it going

22:20 <geist> since it could, hypothetically, block forever

22:20 <mrvn> geist: how ever would the read wake up in that case? It's not getting any more data.

22:20 <dh`> the argument I'd make is that if that's what you want you should call revoke rather than close

22:21 <mrvn> dh`: if you really want to know 100% then you have to read the kernel source.

22:21 <geist> i'm gonna bet it aborts the read, and it comes down to a case where posix is unclear and its implementation defined what happens

22:21 <heat> i think linux just wakes everyone up on shutdown

22:21 <mrvn> otherwise the test shows 99.9% sure

22:21 <heat> like t1: read(sockfd, ...); t2: shutdown(sockfd, RD) t3: read() = 0

22:21 <heat> s/t3/t1/

22:21 <mrvn> geist: I think a close on a socket or pipe means EOF so read should wake up.

22:22 <mrvn> Unlike on a file where close doesn't change that.

22:22 <dh`> my expectation would be that even for a socket the close would only close the handle, not the socket, and the socket would go away after the read releases it

22:22 <mrvn> dh`: then the read never wakes up and the socket remains behind forever.

22:22 <geist> yah reading the man pages for close it doesn't really say what happens on simultaneous operations, but it seems to imply that if it's the last handle then everything will be cleaned up

22:22 <geist> which implies that if something is blocking at least it'll get unblocked as the data structure it's on is getting removed

22:23 <dh`> the natural implementation is to incref the file object when you fetch it from the file table for read, so you own a reference to it, and nothing under it goes away until you drop that reference
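
A minimal sketch of the "natural implementation" dh` describes, under the assumption of a single-threaded model (a real kernel would use atomic increments and locking). The names `fd_get`, `fd_put`, `fd_install`, and `fd_close` are hypothetical. It shows why close can return while the read continues: close only removes the table entry and drops the table's reference.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FDS 16

struct file {
    int refs;      /* references held by table entries and in-flight syscalls */
    int destroyed; /* set when the last reference is dropped */
};

struct fdtable { struct file *slot[MAX_FDS]; };

/* Drop one reference; the object goes away only when nobody holds it. */
void file_unref(struct file *f) {
    assert(f->refs > 0);
    if (--f->refs == 0) f->destroyed = 1; /* a real kernel would free here */
}

/* Install a file in a given slot; the table owns one reference. */
int fd_install(struct fdtable *t, int fd, struct file *f) {
    if (fd < 0 || fd >= MAX_FDS || t->slot[fd] != NULL) return -1;
    f->refs++;
    t->slot[fd] = f;
    return 0;
}

/* Lookup phase of a syscall such as read(): take our own reference. */
struct file *fd_get(struct fdtable *t, int fd) {
    if (fd < 0 || fd >= MAX_FDS || t->slot[fd] == NULL) return NULL;
    t->slot[fd]->refs++; /* an atomic increment in a real kernel */
    return t->slot[fd];
}

/* End of the syscall: release the reference taken by fd_get(). */
void fd_put(struct file *f) { file_unref(f); }

/* close(): remove the entry and drop only the table's reference. */
int fd_close(struct fdtable *t, int fd) {
    if (fd < 0 || fd >= MAX_FDS || t->slot[fd] == NULL) return -1; /* EBADF */
    struct file *f = t->slot[fd];
    t->slot[fd] = NULL;
    file_unref(f);
    return 0;
}
```

In this model a reader that called `fd_get()` before `fd_close()` keeps the object alive until its own `fd_put()`, while any later lookup of the same descriptor fails.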

22:23 <dh`> the reason for whatever weird thing happened in netbsd was that someone wanted to avoid the atomic incref for that

22:23 <geist> but if it's something non blocking, like a read that is just taking time to copy data, it probably as an implementation detail ends up waiting until that is finished

22:23 <mrvn> it's a valid but probably not that useful implementation

22:23 <dh`> mrvn: if you close the other end of the socket that will cause read to finish and return 0

22:24 <mrvn> dh`: I expect close() to close the socket. You are breaking that promise.

22:24 <geist> i think the idea there is there's a difference between bumping a ref and holding onto the object, and the object itself getting cancelled such that any blocking operations bounce out

22:24 <mrvn> dh`: so the other end never sees the socket close and won't close its own end.

22:24 <dh`> mrvn, that's not consistent with the existence of dup() let alone anything more complex
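
dh`'s point about dup() can be observed from user space. The sketch below assumes a system with `socketpair()` and the `MSG_DONTWAIT` flag (available on Linux and the BSDs); the helper names are made up for illustration. It closes one descriptor for a socket while a duplicate is still open and checks whether the peer sees EOF.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the peer would block (no EOF yet), 0 on EOF, -1 otherwise. */
static int peer_state(int peer) {
    char c;
    ssize_t n = recv(peer, &c, 1, MSG_DONTWAIT);
    if (n == 0) return 0;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return 1;
    return -1;
}

/* Closes one end of a socketpair in two steps (original fd, then its dup)
 * and records what the peer observes after each step. Returns 0 on success. */
int dup_keeps_socket_open(int *after_first_close, int *after_last_close) {
    int sv[2];
    assert(after_first_close != NULL && after_last_close != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    int copy = dup(sv[0]);
    if (copy < 0) { close(sv[0]); close(sv[1]); return -1; }

    close(sv[0]);                       /* closes a handle, not the socket */
    *after_first_close = peer_state(sv[1]);

    close(copy);                        /* last descriptor: socket goes away */
    *after_last_close = peer_state(sv[1]);

    close(sv[1]);
    return 0;
}
```

The peer only sees EOF after the last descriptor is closed, which matches the "close() closes file handles, not sockets" reading. It says nothing about in-kernel references held by a blocked read, which is the part the discussion leaves implementation-defined.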

22:24 <geist> they are really two different things. the pending read can bounce out of something that is cancelled

22:24 <geist> if it's blocking

22:25 <mrvn> dh`: I assumed it's the close of the last copy of the socket.

22:25 <geist> if it's doing something non blocking, like page by page copying data out of a file cache, it could abort early or finish i suppose and still be consistent

22:26 <dh`> mrvn: but the thread reading has its own working copy of/reference to the socket

22:26 <dh`> if you wanted to revoke that you should have called revoke()

22:26 <mrvn> dh`: In my mind the process is this: close() -> socket close -> tcp close -> socket cancel blocked ops

22:26 <dh`> (and persuaded the maintainers of your kernel to support revoke on sockets)

22:26 <mrvn> dh`: the destruction of the tcp connection wakes up the read in the end.

22:27 <dh`> yes, but that's not the model you get by default

22:27 <mrvn> it's the behavior I expect posix systems to have. close on sockets should wake reads. Not sure what I expect on files.

22:28 <mrvn> The difference being EOF waking up read.

22:28 <dh`> it is definitely the case that that's not guaranteed, because like I said the natural implementation is that the read secures its own reference to the socket while it's working

22:29 <dh`> EOF will wake up read, but closing your file handle under the read doesn't cause that

22:29 <mrvn> dh`: so the socket would remain open forever? Even though the TCP side is closed?

22:29 <dh`> no

22:29 * dh` fails to understand what's so hard about this

22:29 <dh`> if you close the write end, the reader on the read end will wake up and exit

22:30 <mrvn> dh`: you close both ends with close()

22:30 <dh`> that's not a well-specified state

22:30 <dh`> close() closes file handles, not sockets.

22:31 <dh`> if you close the last reference to the write end, the reader on the read end will wake up and exit

22:31 <mrvn> then lets simplify: shutdown(fd, SHUT_RDWR);

22:31 <dh`> that should also cause any readers on the read end to wake up and exit

22:31 <mrvn> and close() should do something else on sockets?

22:32 <mrvn> In my mind close(fd) and shutdown(fd, SHUT_RDWR); should be the same.

22:32 <dh`> close closes file handles, not internal kernel objects

22:32 <dh`> they are not, because shutdown does not close the file handle

22:32 <geist> i'm not sure the man pages agree with that

22:32 xenos1984 has quit [Read error: Connection reset by peer]

22:33 <dh`> so your mind needs to visit a few man pages :-p

22:33 <geist> at least on linux and mac

22:33 <geist> both of them have verbiage to the effect of 'if it's the last file descriptor internal resources are freed'

22:34 <geist> lots of ways to read that but it seems to indicate that the fd count going to zero at least triggers some sort of internal shutdown path

22:34 <geist> even if there are still references to the objects floating around in the kernel

22:34 <mrvn> If fd is the last file descriptor referring to the underlying open file description (see open(2)), the resources associated with the open file description are freed;

22:34 <mrvn> *last file descriptor*, not internal reference

22:34 <dh`> maybe, it's not clear that whoever wrote that text even thought about pending references

22:34 <geist> not sure if these man pages are describing the behavior of the implementation of how its specced, however

22:35 <geist> s/of/or

22:35 <mrvn> possible

22:35 <dh`> and it's definitely inadvisable to impute intent regarding something to documentation that never considered it

22:35 <geist> the mac one is a bit more interesting

22:35 <mrvn> The manpage also says: "On Linux (and possibly some other systems), the behavior is different: the blocking I/O system call holds a reference to the underlying open file description, and this reference keeps the description open until the I/O system call completes. (See open(2) for a discussion of open file descriptions.) Thus, the blocking system call in the first thread

22:36 <mrvn> may successfully complete after the close() in the second thread."

22:36 <geist> "The close() call deletes a descriptor from the per-process object reference table. If this is the last reference to the underlying object, the object will be deactivated. For example, on the last close of a file the current seek pointer associated with the file is lost; on the last close of a socket(2) associated naming information and queued data are discarded; on the last close of a file holding an advisory lock the lock is

22:36 <geist> released (see further flock(2))."

22:36 <geist> the mac one seems to indicate it does the other path. ie, the object is closed when the last ref goes away, and internal refs also work

22:36 <dh`> geist: that's the same text we have in netbsd

22:37 <geist> yah probably derived from the same BSD docs

22:37 <dh`> yeah

22:37 <geist> actually says BSD 4 at the bottom yeah

22:37 <mrvn> I still think keeping a read() on a socket (and the socket and tcp connection) you closed alive is not desirable.

22:37 <mrvn> now I think the only thing left to do is test how bsd and linux actually behave.

22:37 <geist> so all this aside i think what we can derive is different posix systems don't handle this consistently

22:38 <geist> but since linux is the only thing that matters...

22:38 <heat> AMEN

22:38 <mrvn> hehe. zircon matters too

22:38 <heat> also HP-UX

22:38 piraty is now known as Piraty

22:38 <geist> i say the last thing with a heavy heart

22:38 <heat> only itanium supporting systems

22:39 <dh`> in netbsd the text dates back to -r1.1 in 1993 so probably from 4.4-lite

22:39 <sham1> closing a file descriptor should cause an EINTR or something like that

22:39 <geist> zircon actually has something more subtle: an object can have any number of internal references, including just plain user facing handles. but there *is* a one way signal called on the object when the user handle count goes to zero

22:40 <geist> .OnZeroHandles() or something like that on the object
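
A rough sketch of the split geist describes between user-facing handles and internal references. It is not Zircon's implementation; `struct kobject` and the `kobj_*` functions are hypothetical, and the model is single-threaded. The handle count reaching zero raises a one-way flag (standing in for an OnZeroHandles-style hook), while destruction waits for the last reference of any kind.

```c
#include <assert.h>
#include <stdbool.h>

struct kobject {
    int handle_count;           /* user-visible handles */
    int ref_count;              /* all references, handles included */
    bool zero_handles_signaled; /* one-way: never reset once set */
    bool destroyed;
};

/* A new object starts with one handle, which also holds one reference. */
void kobj_init(struct kobject *o) {
    o->handle_count = 1;
    o->ref_count = 1;
    o->zero_handles_signaled = false;
    o->destroyed = false;
}

/* Internal reference, e.g. taken by an in-flight operation. */
void kobj_ref(struct kobject *o) { o->ref_count++; }

void kobj_unref(struct kobject *o) {
    assert(o->ref_count > 0);
    if (--o->ref_count == 0) o->destroyed = true; /* would free here */
}

/* New handles cannot be created once the count has hit zero. */
int kobj_handle_add(struct kobject *o) {
    if (o->zero_handles_signaled) return -1;
    o->handle_count++;
    o->ref_count++;
    return 0;
}

/* Closing the last handle fires the one-way signal even if internal
 * references keep the object itself alive. */
void kobj_handle_close(struct kobject *o) {
    assert(o->handle_count > 0);
    if (--o->handle_count == 0)
        o->zero_handles_signaled = true; /* on-zero-handles hook goes here */
    kobj_unref(o);
}
```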

22:40 <sham1> Basically to just stop the blocking read and saying "sorry, the file is now closed. Shouldn't have used threads like this"

22:40 <geist> so there are some cases where the last user handle going away automatically triggers some sort of internal cleanup even if some internal references are held

22:40 <dh`> as we just spent a long time discussing, that is not guaranteed and not how it's implemented in most places

22:40 <mrvn> geist: can you close a socket while it still has references?

22:40 <mrvn> (the tcp side)

22:40 <geist> in what case?

22:40 <mrvn> close()

22:40 <geist> i dunno, what OS are you talking about?

22:41 <dh`> whether this behavior violates the basic atomicity guarantees is at least debatable

22:41 <mrvn> zircon, one thread does read(fd), another does close(fd).

22:41 <geist> the kernel doesn't implement file systems or net stack

22:42 <geist> but the gist is the other side would see that the last handle to the IPC channel went away and start shutting down

22:42 <heat> this is why microkernels are superior

22:42 <geist> ie, the network stack gets a signal when the other end is closed (ie, on zero handles to the client side of the IPC channel that the socket is implemented over)

22:42 <heat> you avoid all kinds of debates by just not doing it

22:42 <geist> really IPC objects are the main users of the OnZeroHandles state, since otherwise you can't tell if the other side hung up

22:43 <dh`> in a microkernel environment, what does it even mean to have an operation pending while you close the handle?

22:43 <dh`> there are only messages

22:43 <geist> and since you cant construct a new handle from zero handles, it's a one way road: once you get to zero handles it's a permanent state

22:43 <mrvn> geist: yeah. but tcp sockets are a bit different. they have connections that you can shutdown without the object getting deleted.

22:43 <mrvn> (at least in posix)

22:43 <geist> in that case a shutdown() would almost certainly be a message of the IPC

22:43 <geist> dh`: depends on the type of microkernel. zircon is fairly uh... 'rich'

22:43 <mrvn> yep.

22:44 <geist> in that it is on the larger side of it, but what we do kinda consistently is there are N types of objects and they all operate using the same model

22:44 <geist> threads, processes, jobs, ipcs, memory objects, etc

22:44 <dh`> or does the system guarantee you a reply paired with your request or something so there is still some kind of pending state?

22:44 <geist> so yes you can 'read' from a VM object for example, which is kinda file like

22:45 <mrvn> geist: so my mind model would be that close(fd); does send a shutdown IPC message for sockets and then later when the refcount becomes 0 the resources get freed.

22:45 <mrvn> and removes the handle from the table

22:45 <heat> how do I make concurrent open()'s unsuck?

22:45 <heat> or suck less

22:45 <geist> yah though in the case where the process simply gets axed and all the handles closed you always have to have a mechanism for servers in a µkernel world to detect the closing of the other end

22:45 <geist> so the built in OnZeroHandles mechanism works for zircon for that

22:45 <dh`> define suck in this case?

22:46 <geist> ie, in lieu of any explicit shutdown at least the server notices the other side went away

22:46 <heat> imagine a fd table with an rwlock, open/socket/dup/whatever that creates a fd needs to write lock

22:46 <heat> which Sucks(tm)

22:46 <heat> I think most UNIXes have a workaround for this (full blown RCU or other more dollar store solutions)

22:46 <dh`> open the object first, only lock the table to scan it and insert

22:47 <geist> mrvn: but yeah for a socket style close() (if you're trying to implement POSIX on top of the µkernel) you could do some sort of pending message to it

22:47 <heat> yes, but that's still slow

22:47 <dh`> (or alternatively, lock the table first to scan it and insert a placeholder, then leave only the placeholder locked while you're working)

22:47 <heat> you'll still have a bunch of contention there which will be PESSIMAL

22:47 <heat> I remember NetBSD also had a workaround for this

22:47 <dh`> unfortunately for unix-style handles you're supposed to guarantee that you return the lowest available slot so you can't avoid the scanning

22:47 <geist> mrvn: a lot of it depends on how much you do or dont try to map posix fds to a IPC object. if you did 1:1 it might make sense to just use the IPC channel close semantics to do the same thing as close

22:48 <dh`> you can cache it away in some cases

22:48 <geist> but if it's something more complicated, where you're multiplexing N fds over M IPC channels then you could build up your own state there in user space

22:48 <mrvn> geist: it's more about sockets having a shutdown method separate from the socket object getting destroyed.

22:48 <geist> sure

22:48 <geist> shutdown() i'd assume would be a message over the IPC channel to the network server

22:49 <mrvn> yep. as it is with tcp

22:49 <geist> since you're already going to need some messaging scheme for all the other out of band data

22:49 <mrvn> files don't have that semantic so I have no idea what read() on a file should do. different expectation there.

22:50 <mrvn> but with the IPC mechanism having files and sockets behave the same, i.e. close(fd) sends a shutdown over the IPC connection, it would make sense to have them behave the same.

22:50 <dh`> heat: you could imagine something like always keeping the descriptor table dense by allocating placeholder objects for holes and then keeping the placeholder objects on a linked list

22:50 <geist> yah part of the sort of half solution of modelling sockets as files in unix

22:50 xenos1984 has joined #osdev

22:50 <geist> like, it sort of works except where it doesn't

22:51 <dh`> so when you go to allocate you pop the first placeholder off the freelist, and if there isn't any you grab the next table entry

22:51 <netbsduser> i do like to allocate placeholder objects

22:51 <mrvn> dh`: union { int next_free_fd; file_descr fd; } fds[max];
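
mrvn's one-liner can be fleshed out into a small table with an intrusive freelist. This is a sketch under assumptions: the `in_use` tag, `fdt_*` names, and `next_unused` high-water mark are additions for illustration. Note that it reuses the most recently freed slot first, so it does not give the lowest-numbered descriptor.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FDS 8
#define NO_FREE (-1)

struct file_descr { void *obj; };

struct fd_entry {
    int in_use;
    union {
        int next_free_fd;     /* valid while the slot is on the freelist */
        struct file_descr fd; /* valid while the slot is in use */
    } u;
};

struct fdtable {
    struct fd_entry fds[MAX_FDS];
    int free_head;   /* most recently freed slot, or NO_FREE */
    int next_unused; /* first slot that has never been handed out */
};

void fdt_init(struct fdtable *t) {
    assert(t != NULL);
    for (int i = 0; i < MAX_FDS; i++) t->fds[i].in_use = 0;
    t->free_head = NO_FREE;
    t->next_unused = 0;
}

/* Pop the freelist if possible, otherwise take the next untouched slot. */
int fdt_alloc(struct fdtable *t, void *obj) {
    int fd;
    if (t->free_head != NO_FREE) {
        fd = t->free_head;
        t->free_head = t->fds[fd].u.next_free_fd;
    } else if (t->next_unused < MAX_FDS) {
        fd = t->next_unused++;
    } else {
        return -1; /* table full */
    }
    t->fds[fd].in_use = 1;
    t->fds[fd].u.fd.obj = obj;
    return fd;
}

/* Push the slot onto the freelist (LIFO). */
int fdt_free(struct fdtable *t, int fd) {
    if (fd < 0 || fd >= t->next_unused || !t->fds[fd].in_use) return -1;
    t->fds[fd].in_use = 0;
    t->fds[fd].u.next_free_fd = t->free_head;
    t->free_head = fd;
    return 0;
}
```

With descriptors 0 to 5 open, freeing 2 and then 5 makes the next allocation return 5, not 2. Allocation and free are constant time, but the order differs from the lowest-available-slot behavior that unix-style descriptors are expected to have.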

22:52 <dh`> whether this is actually any better than just locking the table and scanning it (especially if you keep track of the start and end points for scanning) is an open question

22:52 <dh`> I'd guess not

22:52 <mjg> burp

22:52 <mjg> lemme tell you something

22:52 <heat> mjg, omg hi rick

22:52 <mrvn> dh`: both are O(1) if the table have a maximum size.

22:52 <netbsduser> it's how i do page-in efficiently: you allocate the page and mark it busy, and abandon all locks, then you wait on an event (to which the page structure points) until the page is in-memory

22:52 <dh`> mrvn: that doesn't work well because you need to be able to insert

22:52 <heat> TELL ME HOW TO DO LOCKLESS

22:52 <mjg> heat: EZ

22:52 <dh`> EVERYTHING is O(1) if the size is bounded, that's not useful

22:52 <mjg> - spin_lock(&lock);

22:52 <mrvn> dh`: insert what?

22:52 <mjg> + //spin_lock(&lock);

22:52 <netbsduser> so if someone decides to munmap the area, then it just sets a flag in the page saying "you are surplus to requirements, please be freed when this I/O is done"

22:52 <mjg> now you are LOCKLESS

22:53 <dh`> mrvn: freelist entries

22:53 <mrvn> dh`: how do you insert an FD between 4 and 5?

22:53 <heat> mjg, I'm reading netbsd's fd_expand, etc and I don't get it

22:53 <dh`> mrvn: suppose fds 0-2, 4-7, and 8-10 are open and I close 5

22:53 <mrvn> dh`: ahh, why? you reuse the last closed FD first.

22:53 <heat> it looks like running with scissors atomics version

22:53 <mjg> heat: just like with openbsd, i'm not looking at that

22:53 <mrvn> nothing says open should get the lowest free FD, right?

22:53 <dh`> not in unix you don't, you are required to return the lowest-numbered available fd

22:53 <mjg> mrvn: posix says

22:54 <mjg> which is a major pita

22:54 <mrvn> you want to do POSIX? you have bigger problems. :)

22:54 <dh`> posix says, because traditionally there was no dup2 and if you wanted to do I/O redirection you had to rely on that semantic
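
The lowest-available rule is easy to observe in a single-threaded program. The sketch below assumes `/dev/null` exists and can be opened; the function name is made up. It is the same property the pre-dup2 redirection idiom relied on: `close(0)` followed by `open(...)` lands the new file on descriptor 0.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Opens two descriptors, closes the lower one, and opens again.
 * Returns 1 if the new descriptor reuses the closed number, 0 if not,
 * and -1 if any open() fails. Only meaningful when no other thread is
 * opening or closing descriptors at the same time. */
int lowest_fd_is_reused(void) {
    int a = open("/dev/null", O_RDONLY);
    if (a < 0) return -1;
    int b = open("/dev/null", O_RDONLY);
    if (b < 0) { close(a); return -1; }
    assert(a < b); /* b was allocated while a was still taken */

    close(a);
    int c = open("/dev/null", O_RDONLY); /* should land on a's old number */
    int reused = (c == a);

    if (c >= 0) close(c);
    close(b);
    return c < 0 ? -1 : reused;
}
```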

22:54 <heat> mjg, freebsd uses SMR right?

22:54 <mjg> whether or not you want to do posix, this has been the case for decades

22:54 <mjg> heat: for what

22:54 <mjg> so you can't just change it

22:54 <heat> this stuff

22:54 <mjg> heat: no

22:54 <dh`> realistically these days you're unlikely to break anything by violating that rule

22:55 <mjg> there is code which expects the order

22:55 <heat> mjg, ok father, then how does it do stuff

22:55 <heat> does it handroll some weird RCU too

22:55 <mjg> heat: it is all stupid

22:55 <mjg> heat: file * objs are *never* actually freed

22:55 <dh`> mjg: have you seen any such code in the wild in the last say 10 years? I haven't

22:55 <mjg> heat: and file tables only disappear after the proc exits

22:55 <heat> god what

22:55 <mjg> dh`: i did, the idea is: close all shit, then open /dev/null a bunch of times to fill in 0/1/2

22:55 <mjg> heat: GEEZER motherfucker

22:55 <dh`> yes, I know the idea

22:56 <dh`> where did you see it and why didn't you patch it out?

22:56 <mjg> well if you no longer guarantee lowest fd, you are dead here

22:56 <mjg> i don't even remember, does it matter?

22:56 <heat> so erm erm erm if I expand the fd table a bunch of times do you keep them all cached?

22:56 <mjg> point is there is code like that out there

22:56 <mjg> you can't just change the behavior from under it

22:56 <mjg> heat: not if the process is single threaded, otherwise yes

22:56 <mrvn> anyway, you can make it a sorted doubly linked list.

22:56 <dh`> it matters because if you decide to break that rule you want to know what the probability is of hitting something that doesn't work

22:57 <mjg> heat: it does not have to be like that, rcu or not, but here we are. mostly because geezer

22:57 <mrvn> or just keep a pointer to the lowest free FD and search from there.

22:57 <heat> god.jpeg

22:57 <heat> netbsd seems to do similar

22:57 <mjg> yes, the idea was taken from netbsd

22:57 <mjg> it is all geezer

22:57 <heat> now I want to see what OpenBSD does

22:58 <mjg> :DD

22:58 <mjg> brah

22:58 <heat> i bet 10 on BKL

22:58 <mjg> openbsd is doing turbo stupid

22:58 <mjg> here is a funny story

22:58 <heat> hey no spoilers!

22:58 <mjg> traditionally unix would allocate fds by traversing an array

22:58 <mjg> bsds including

22:58 <mjg> openbsd was the first bsd to implement a bitmap, in fact two level

22:58 <mjg> some time later the rest followed suit

22:59 <mjg> except freebsd has one level with no explanation why not two

22:59 <mjg> all of which referred to the same paper

22:59 <mjg> so sounds like obsd has a leg up, or at least did...

22:59 <geist> okay, again.

22:59 <heat> what paper?

22:59 <mjg> except apart from the bitmaps they *still* do linear array scans

22:59 <mjg> give me a minute

23:00 <zzo38> I think there is something wrong with the wiki. Even if you use real mode does not necessarily mean that you have to use the PC BIOS, and it is not necessary to use the PC BIOS for all operations even if you have it available.

23:00 <mjg> heat: Scalable kernel performance for Internet servers under realistic loads

23:00 <mjg> heat: gaurav banga & co

23:01 <geist> zzo38: this is true. is there a good example of this?

23:01 <zzo38> It is true that some of the hardware features are a bit messy due to compatibility (such as the A20 gate), but some of the things still can be sensible for some kinds of systems.

23:01 <geist> i can imagine there's stuff that goes out of its way to use bios calls to write to the text display

23:01 <zzo38> Also, UEFI is even more messy and even worse, in many ways.

23:02 <netbsduser> this is why i keep well away from both

23:02 <zzo38> BIOS calls are probably most useful during the initial booting to read the kernel and drivers from the disk; after that, presumably you will have better drivers suitable for your system.

23:02 <geist> that's the idea, yeah

23:03 <mjg> heat: btw the paper incorrectly claims the approach is logarithmic

23:03 <mjg> heat: kind of funny

23:03 <heat> mjg, open seems to have copied net too

23:03 <mjg> heat: too bad they did not bench vs single-level bitmap

23:03 <mjg> in what regard

23:03 <mjg> *bitmaps* were first in openbsd afair, it was the rest which copied from there

23:03 <mjg> obsd got it in 2003 or so

23:04 <mjg> very positively surprised with dtrace: dtrace -n 'fbt::memset:entry { printf("%d %d", cpu, arg2); }' -o /tmp/memsets

23:04 <mjg> per-cpu trace of all calls with 0 drops

23:04 <mjg> very nice

23:04 <heat> also fyi linux also does single-level I think

23:04 <mjg> no, linux got 2 level ~7 years ago

23:04 <zzo38> Also, the PC BIOS provides the booting function, and UEFI is too complicated in that way. Furthermore, I think it is not legitimate to be called "PC" if the PC BIOS is not implemented. (I do have an idea about how to design a better booting system in ROM, but it is not a PC booting system but it would be possible to implement both if it is desirable)

23:05 <mjg> heat: the real interesting bit is solaris which has a *tree* instead

23:05 <mjg> heat: dfly copied from there

23:05 <mjg> i have no idea how that perform

23:05 <mjg> s

23:06 <dh`> two layers is still a tree

23:06 <zzo38> I think that HDMI and USB also is no good

23:06 <mrvn> a stunted tree

23:06 <mjg> guaranteed 2 layers no matter what

23:06 <mjg> is not

23:06 <heat> how does a 2-layer bitmap work?

23:06 <mjg> cmon dwag

23:06 <mrvn> heat: top layer bit says if there is a leaf for the 2nd level bitmap

23:06 <mjg> read some openbsd!

23:06 <geist> i assume you just have a top layer bitmap that determines if there are holes in blocks of lower level nodes

23:06 <mjg> right

23:07 <mjg> that's it, literally 0 magi

23:07 <mjg> c

23:07 <dh`> however you want, but my guess would be that each bit in the lower layer indicates the state of one fd entry and each bit in the upper layer indicates whether there's a free bit in each word of the lower layer

23:07 <dh`> because with 32-bit words and 1024 fds max that all fits tidily

23:07 <heat> right

23:07 <geist> yah that's what i'd think. you could do something more complicated like a bit that signifies if the entire sub tree is occupied or not

23:08 <netbsduser> just flicked Solaris Internals open to the page on the fd tree, funny, they have a comment on exactly what people were chatting about earlier on colliding read() and close()

23:08 <dh`> but it seems stupid given that the granularity of the upper layer should be a whole cache line of the lower layer

23:08 <mrvn> So 2 find_lowest_zero() calls give you the FD you can use.
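
The two-level scheme sketched in the last few messages, sized the way dh` suggests (32-bit words, 1024 descriptors). This is an illustration, not any BSD's or Linux's code. One assumption differs from dh`'s wording: the upper bit is set when a lower word is full, so that both levels can be searched with the same find-lowest-zero helper as mrvn notes. The helper is a portable loop; a real implementation would likely use a count-trailing-zeros instruction.

```c
#include <assert.h>
#include <stdint.h>

#define NFDS 1024
#define WORD_BITS 32
#define NWORDS (NFDS / WORD_BITS) /* 32 lower words, one summary word */

struct fdmap {
    uint32_t full;          /* bit w set: lower[w] has no free bit */
    uint32_t lower[NWORDS]; /* bit b of lower[w] set: fd w*32+b is in use */
};

/* Index of the lowest zero bit in x, or -1 if every bit is set. */
static int find_lowest_zero(uint32_t x) {
    for (int i = 0; i < WORD_BITS; i++)
        if (!(x & (UINT32_C(1) << i))) return i;
    return -1;
}

/* Returns the lowest free descriptor and marks it used, or -1 if full. */
int fdmap_alloc(struct fdmap *m) {
    int w = find_lowest_zero(m->full);
    if (w < 0) return -1;
    int b = find_lowest_zero(m->lower[w]);
    assert(b >= 0); /* the summary bit promised a free slot in this word */
    m->lower[w] |= UINT32_C(1) << b;
    if (m->lower[w] == UINT32_MAX) m->full |= UINT32_C(1) << w;
    return w * WORD_BITS + b;
}

/* Marks a descriptor free again; returns -1 if it was not in use. */
int fdmap_free(struct fdmap *m, int fd) {
    if (fd < 0 || fd >= NFDS) return -1;
    int w = fd / WORD_BITS, b = fd % WORD_BITS;
    if (!(m->lower[w] & (UINT32_C(1) << b))) return -1;
    m->lower[w] &= ~(UINT32_C(1) << b);
    m->full &= ~(UINT32_C(1) << w); /* the word now has at least one hole */
    return 0;
}
```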

23:08 <geist> it's all because of the stupid property that fds are first fit

23:09 <geist> and the source of a whole class of bugs and exploits

23:09 <dh`> and furthermore, each entry in the lower layer may as well represent a whole cache line's worth of the table itself

23:09 <heat> netbsduser, why do you have a Solaris Internals

23:09 <geist> i have that book too

23:09 <heat> do you have an Internals for every SVR4 descendent

23:09 <geist> it's quite well written

23:09 <netbsduser> heat: i like to read about other OSes to appreciate them + ruthlessly steal ideas i like

23:09 <mrvn> geist: AMEN, open() should return the next free FD with 64bit rollover.

23:09 <heat> how's the STREAMS

23:10 <geist> mrvn: we explicitly randomized handles in zircon to avoid this stuff

23:10 Vercas7 has joined #osdev

23:10 <dh`> <mrvn> yeah I want a fdtable of size 2^64

23:10 gxt__ has quit [Write error: Connection reset by peer]

23:10 Vercas has quit [Remote host closed the connection]

23:10 Vercas7 is now known as Vercas

23:10 <mrvn> dh`: you hash that

23:10 <heat> geist, but randomized handles forces you to use a tree which sucks

23:10 <geist> not necessarily

23:10 <mrvn> geist: so nobody can guess a handle?

23:10 foudfou_ has joined #osdev

23:10 <heat> you can't do anything remotely flat can you?

23:11 <geist> mrvn: correct. and more importantly if you close a handle it wont get reused quickly

23:11 gxt__ has joined #osdev

23:11 <geist> heat: depends on how good you 'randomize' it.

23:11 <netbsduser> it's coming along, i want to figure out whether i can implement a unified low-level module which pipes/fifos, ttys, etc can all use

23:11 foudfou has quit [Ping timeout: 255 seconds]

23:11 <mrvn> heat: you can hash the handle down to a small int and choose your random handles so the hash doesn't collide.

23:11 <geist> basically we feed it through a hash and put some salt in it so that the same slot doesn't net the same id

23:11 <mrvn> heat: creating a handle might have to try a few times.

23:11 <geist> at the end of the day it is indeed slots in a table, but the process sees it hashed

23:12 <heat> oh ok, so the handles aren't indices?

23:12 <geist> it's not cryptographically perfect. you can guess, but the main point is to avoid reusing handles quickly

23:12 <geist> so that most use-after-free bugs are caught

23:12 <dh`> just allocating sequentially mod the table size serves that purpose well enough

23:12 <geist> they're 'random' to the process, though post hash + salt they are indeed just indices
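
One way to get handle values that are table indices underneath but do not repeat quickly is to mix a per-slot generation counter and a per-process secret into the value. The sketch below is an assumption-laden illustration of that idea and is not claimed to be what Zircon does; the XOR mixing, the 10-bit index, and all names are hypothetical, and the secret should come from a real random source.

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 10
#define NSLOTS (1u << INDEX_BITS)
#define INDEX_MASK (NSLOTS - 1u)
#define GEN_MASK ((1u << (32 - INDEX_BITS)) - 1u)

struct hslot {
    uint32_t gen; /* bumped on every close so old values go stale */
    int in_use;
};

struct htable {
    uint32_t secret; /* per-process salt */
    struct hslot slots[NSLOTS];
};

/* Build the user-visible value for a slot: generation and index, salted. */
uint32_t handle_encode(const struct htable *t, uint32_t index) {
    assert(index < NSLOTS);
    return ((t->slots[index].gen << INDEX_BITS) | index) ^ t->secret;
}

/* Map a user-visible value back to a slot index, or -1 if it is stale
 * or was never issued. */
int handle_decode(const struct htable *t, uint32_t handle) {
    uint32_t v = handle ^ t->secret;
    uint32_t index = v & INDEX_MASK;
    uint32_t gen = v >> INDEX_BITS;
    const struct hslot *s = &t->slots[index];
    if (!s->in_use || s->gen != gen) return -1;
    return (int)index;
}

/* Mark a slot used and return its current handle value. */
uint32_t handle_open(struct htable *t, uint32_t index) {
    assert(index < NSLOTS && !t->slots[index].in_use);
    t->slots[index].in_use = 1;
    return handle_encode(t, index);
}

/* Free the slot and advance its generation. */
void handle_close(struct htable *t, uint32_t index) {
    assert(index < NSLOTS && t->slots[index].in_use);
    t->slots[index].in_use = 0;
    t->slots[index].gen = (t->slots[index].gen + 1u) & GEN_MASK;
}
```

A stale handle kept after close fails to decode even when the same slot is reused, which is the use-after-close bug class geist mentions wanting to catch.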

23:13 <dh`> (like with process ids, random process ids still seem like a stupid idea)

23:13 <mjg> geist: you *randomize* fd for security? am i misreading something?

23:14 <geist> basically. though zircon doesnt have fds per se. but it's the handle table

23:14 <geist> less of security and more of a bug catching thing

23:14 <geist> ie, handles take a very long time to get recycled

23:14 <mrvn> mjg: I think more a mitigation against bad code

23:15 <geist> we have some additional constraints you can put on a process to cause them to instantly abort if a bad handle is used

23:15 heat has quit [Read error: Connection reset by peer]

23:15 <geist> that catches a ton of things

23:15 <mjg> do you have dup2-equivalent?

23:15 heat_ has joined #osdev

23:15 <geist> no

23:15 <moon-child> imagine fd is stored in memory and buffer overflow corrupts it

23:15 <mjg> geist: ye that is a real problem

23:15 <moon-child> you're better off if malicious actor can't control which fd it turns into

23:15 <geist> you can absolutely not in any circumstances create a handle at a known value

23:15 <mjg> there are known multithreaded progs which use fds as they are being closed

23:15 <mjg> unintentionally

23:15 <moon-child> I heard the following anecdote: somebody forked, closed stderr, and then mapped some memory

23:15 <moon-child> then wrote a log message

23:15 <mjg> there was a bug in freebsd once which broke them

23:15 <mjg> kind of funny

23:15 <moon-child> mapped memory reused the stderr fd

23:16 <moon-child> so log message stomped mmap

23:16 <heat_> are we looping

23:16 <mjg> :d

23:16 <geist> we explicitly designed the handle mechanism to try to deal with this whole known set of posix issues with fd recycling and whatnot

23:16 <geist> works pretty well

23:16 <netbsduser> moon-child: that's appalling

23:16 <netbsduser> where did that happen

23:16 <zzo38> I would have solved it by making file descriptors that are not explicitly given a number to have a minimum file descriptor number; if you want a lower number then you must explicitly request it.

23:16 <moon-child> arcan

23:17 <mrvn> zzo38: that's even worse. Now all libraries compete for low numbers.

23:17 <geist> iirc QNX did something like putting all posix fds in positive space, and all other handles to QNX specific stuff in negative space (bit 31)

23:17 <dh`> you can't have both well-known addresses and a scheme for avoiding well-known addresses

23:17 <geist> or something along those lines, so the kernel can use different allocation strategies

23:17 <dh`> I can't imagine that would work since < 0 being invalid is baked in everywhere

23:17 <mrvn> dh`: you can pass the "well-known" addresses as arguments to a process.

23:18 <geist> idea is that for internal qnx stuff that's not doing posix, the negative handles are *bad*

23:18 <geist> so if they do leak out to posix space they wont work

23:18 <dh`> they'd have to audit pretty much every open for only testing -1 explicitly instead of < 0

23:18 <geist> qnx being a microkernel, it's implementing posix in user space

23:18 <netbsduser> geist: clever trick, i might have to imitate that

23:18 <mjg> geist: my seal of approval

23:18 <dh`> I guess

23:18 <zzo38> mrvn: Well, normally 0 is used for stdin, 1 for stdout, 2 for stderr. Libraries shouldn't need to compete for low numbers, since they are only used for standard I/O anyways, I think

23:18 heat_ is now known as heat

23:18 <dh`> mrvn: you can but there are various costs to that

23:19 <heat> mjg's seal of approval is RARE

23:19 <geist> with some caveats being that they have some affordance for the kernel to directly map some ipc channels to fds, and in those case the fd-to-handle mapping is 1:1

23:19 <mrvn> zzo38: but we don't have 0, 1, 2 anyway so that point is moot.

23:19 <geist> and for everything else, handles to things that are meaningless to posix, they're in a different namespace, basically

23:19 <mjg> heat: true mjg!

23:19 <mjg> heat: true mjg@

23:19 <heat> ok mjg@

23:19 <geist> negative, not negative, doesn't matter. idea is the namespacing really

23:19 <netbsduser> i do know of some software which uses any means necessary to find out all open fds in a process and close them, but i suppose you can simply hide them from any posixy ways to find that out

23:19 <zzo38> However, my own (currently unnamed) operating system design does not have file descriptor numbers (although it can be emulated, if required for POSIX capabilities).

23:19 <netbsduser> (namely systemd uses linux's procfs to find them out)

23:20 <heat> netbsduser, FreeBSD has a syscall for that

23:20 <heat> and so does linux now

23:20 <mrvn> netbsduser: lots of software does that. Modern software should use CLOEXEC and the posix call to iter over the fds.

23:20 <netbsduser> heat: is there a specific syscall or is it via sysctl?

23:20 <geist> i suppose it'd be easy enough to implement some sort of close_range() call

23:20 <heat> syscall, close_range in both

23:21 <mrvn> netbsduser: scanning procfs fails with threads.

23:21 <heat> and closefrom in the libc I think

23:21 <geist> you can only make a best effort. even close_range() would intrinsically race with opens in another thread

23:21 <geist> but you define most likely that it makes one pass, and closes them in a particular order

23:21 <netbsduser> systemd wants to close everything not on a whitelist it creates of acceptable fds, so i am not sure whether a close_range would work for it

23:21 <geist> such that races with any other threads are at least somewhat predictable

23:21 <zzo38> I do not have a name for my operating system design, so far

23:22 <zzo38> What operating systems were you designing and do you have any link of documentation?

23:22 <heat> netbsduser, sure does, use the gaps

23:22 <dh`> I don't think anyone here much cares what silly things systemd does

23:22 <mrvn> geist: posix_spawn can make it atomic

23:22 <mrvn> or you close after fork()

23:22 <netbsduser> dh`: i need to for the sake of a pointless publicity stunt

23:22 <heat> doing close_range for all the gaps is probably still a good bit faster than looping through fds and closing
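
A minimal sketch of heat's "use the gaps" idea, in Python rather than C: given a whitelist of fds to keep, compute the ranges between them and issue one range close per gap. The names `gaps` and `close_all_except` and the `max_fd` bound are illustrative assumptions, not anything systemd or a libc provides. `os.closerange` is the standard library call; it takes a half-open range and ignores errors. As geist notes above, this is still only best effort if other threads are opening files at the same time.

```python
import os

def gaps(keep, max_fd):
    """Return inclusive (low, high) ranges in [0, max_fd] that skip every fd in keep."""
    ranges = []
    low = 0
    # Only whitelist entries inside [0, max_fd] can split the range.
    for fd in sorted(f for f in set(keep) if 0 <= f <= max_fd):
        if fd > low:
            ranges.append((low, fd - 1))
        low = fd + 1
    if low <= max_fd:
        ranges.append((low, max_fd))
    return ranges

def close_all_except(keep, max_fd):
    """Close every fd in [0, max_fd] that is not in keep, one call per gap."""
    for low, high in gaps(keep, max_fd):
        # os.closerange closes [low, high + 1) and silently skips fds that are not open.
        os.closerange(low, high + 1)
```

For example, `gaps([0, 1, 2, 7], 10)` yields `[(3, 6), (8, 10)]`, so keeping four fds costs two range calls instead of one close per candidate fd.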

23:23 <geist> right, where the multithreading isn't an issue

23:23 <netbsduser> porting systemd to my kernel would make excellent hackernews bait

23:23 <geist> because post fork it's just a single thread

23:23 <heat> lol

23:23 gog has quit [Ping timeout: 246 seconds]

23:23 <geist> (until you start making new ones)

23:23 <mrvn> and you really shouldn't close random FDs before the fork()

23:23 <heat> didn't you port systemd to BSD?

23:23 <heat> or do I have the wrong guy

23:23 <netbsduser> i did, it was mostly for the same reason

23:24 <heat> ah, you do like the headlines

23:24 <geist> hah elaborate ways to get social headlines huh

23:24 <geist> i suppose that checks out

23:24 <netbsduser> i have an insatiable inner troll but i couldn't bear to do it the old-fashioned way with incendiary posts to forums and suchlike

23:24 <heat> you should port glibc to BSD now

23:24 <mrvn> Which actually brings me to a problem I had at work on friday: How do you get the highest FD.fileno that's open under python?

23:25 <heat> and then coreutils

23:25 <netbsduser> doing weird things with software is much more professional

23:25 <mrvn> heat: Debian kfreebsd

23:25 <netbsduser> heat: glibc did have a freebsd port at one point

23:25 <heat> i know

23:25 <netbsduser> at least formally it's a retargetable libc, i know someone is trying to bring it to managarm

23:25 <heat> so did debian

23:26 <heat> i have an in-progress port to Onyx

23:26 <netbsduser> they are having big trouble with its native posix threading library, which is very linux

23:26 <heat> it's a good libc

23:26 <heat> bah, nptl is fine

23:26 <heat> i hacked musl's nptl stuff and glibc isn't that much harder

23:27 <heat> you can also just implement your own separate library because glibc is ofc completely configurable

23:27 <netbsduser> in my experience i've found gnu stuff is often surprisingly portable, who else (but perl) checks for dynix, eunice, and the windows subsystem for posix applications in their configure scripts?

23:28 <geist> glibc, yeah that's been ported to all sorts of non linux things

23:28 <mjg> sorry to interject, do you have a minute to flame about memset?

23:28 <geist> haiku uses it, and back in the day BeOS did

23:28 <heat> yes gnu stuff is Great(tm)

23:28 <mjg> i got a real trace, all memsets made during build kernel, for each cpu

23:28 * dh` chuckles politely

23:28 <heat> supports all kinds of crap systems

23:28 <mjg> and a prog to execute them

23:28 <netbsduser> i never knew they were using glibc at haiku

23:29 <mjg> heheszek-read-current 148708742 cycles

23:29 <mjg> heheszek-read-bionic 98762683 cycles

23:29 <mjg> heheszek-read-erms 233876267 cycles

23:29 <mrvn> mjg: do you have a histogram of sizes?

23:29 <mjg> bionic "cheats" by using simd

23:29 <netbsduser> i know vmware esxi uses it (but i'm not sure if they just implemented a crudimentary linux abi compatibility)

23:29 <mrvn> mjg: how often is memset called to bzero?

23:29 <mjg> literally every time

23:29 <mjg> anyhow, as you can see, erms is turbo slower

23:29 <heat> mjg, ok, what's the point

23:29 <mrvn> literally or every time? Is there even one exception?

23:30 <mjg> mrvn: not in production

23:30 <mjg> debug has it for poisoning

23:30 <mjg> heat: what's the point of what

23:30 <mrvn> mjg: with a byte value or 32/64 bit pattern?

23:30 <heat> what's the big revelation in these results?

23:31 <heat> rep stosb bad, simd good, current ok?

23:31 <mjg> heat: there is no revelation, just confirmation erms crapper

23:31 <mjg> heat: and more importantly now there is a realworld-er (if you will) setup to bench changes to memset

23:31 <heat> where

23:31 <mjg> on my test box!

23:31 <heat> is this Proprietary(tm)

23:32 <mjg> not-heat licensed

23:32 <mjg> look mate, the code looks like poo right now

23:32 <mjg> i'm gonna play around with memset, clean that up and then publish somewhere

23:32 <heat> cool

23:33 <mjg> will be useful for that linux flame thread

23:33 <heat> no one flamed man

23:33 <mjg> note there was one major worry here: that there is branch prediction misses

23:33 <heat> how is that a flame thread?

23:33 <mjg> with ever changing sizes

23:33 <mjg> heat: see my previous remark about polack word choice

23:34 <heat> that thread is probably the tamest the lkml has ever been

23:34 <heat> particularly since linus likes you so much

23:34 <mjg> i don't think he does mate, but senkju

23:34 Vercas3 has joined #osdev

23:34 <heat> you're way better than the other mjg

23:34 <mjg> i'm going to generate more traces, including from linux

23:34 <mjg> for memset, memcpy and copyin/copyout

23:35 <mjg> then we will see what happens

23:35 <heat> geist, hello sir do u have time to run something on one of your ryzens?

23:35 Vercas has quit [Ping timeout: 255 seconds]

23:35 Vercas3 is now known as Vercas

23:36 <mjg> heat: do you have a memset?

23:36 <heat> no

23:36 <mjg> aight, no biggie

23:37 <heat> i wanted to try borislav's "rep movsb is totally good on amd" claim

23:37 <mjg> which amd tho

23:37 <heat> recent probably

23:37 <mjg> right

23:37 <heat> everything was bad on bulldozer

23:37 <mjg> it may not even be on the market

23:38 <mjg> even so, i have to note the typical way of benchmarking string ops by calling them in a loop with same size stuff can be quite misleading

23:38 <mrvn> heat: you should make the kernel/libc benchmark memcpy/memset at start and pick the fastest for the actual cpu.
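
mrvn's suggestion amounts to timing a few candidate routines once at startup and binding the winner to the name everyone calls. A rough sketch of that selection step, in Python for brevity: the two zeroing routines, `pick_fastest`, and the buffer size and round count are all made up for illustration. A real kernel or libc would do this with CPU-specific assembly variants, and, per mjg's remark just above, a fixed-size loop like this one can give a misleading answer.

```python
import time

def zero_slice_assign(buf):
    """Zero the buffer with one bulk slice assignment."""
    buf[:] = bytes(len(buf))

def zero_byte_loop(buf):
    """Zero the buffer one byte at a time."""
    for i in range(len(buf)):
        buf[i] = 0

def pick_fastest(candidates, size=4096, rounds=20):
    """Time each candidate on a scratch buffer and return the quickest one."""
    best, best_time = None, float("inf")
    for fn in candidates:
        buf = bytearray(b"\xff" * size)
        start = time.perf_counter()
        for _ in range(rounds):
            fn(buf)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = fn, elapsed
    return best

# Chosen once at startup; later callers go through this name.
fast_zero = pick_fastest([zero_slice_assign, zero_byte_loop])
```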

23:38 <mjg> for example, if you trained the branch predictor, a 32-byte loop is way faster than erms for sizes past 256 bytes even

23:38 <mjg> but this goes down the drain if you get misses

23:38 <mjg> tradeoff city

23:39 <mjg> in fact you may get slower

23:39 <heat> yes but don't forget this is all microbenchmarking

23:40 <mjg> north remembers

23:40 <heat> does any of this REALLY matter on a real workload? probably not

23:40 <mjg> ha

23:40 <mjg> wrong!

23:40 <heat> maybe slightly

23:40 <mjg> lemme find it

23:40 <mjg> well it mostly does not once you reach basic sanity

23:40 <heat> it's like the age old "just use rep movsb/q/l/w, cuz icache"

23:41 <mjg> i got numbers showing *tar* speed up after unfucking the string ops

23:41 <mjg> they used to be rep stosq ; rep stosb

23:41 <mjg> and so on

23:41 <mjg> absolute fucking massacre for the cpu

23:42 <mjg> bummer, can't find it right now

23:42 <heat> yeah but tar is just a fancy exercise in memory copying isn't it

23:42 <mjg> but bottom line, the really bad ops were demolishing perf

23:42 <heat> read(...) + write(...)

23:42 <mjg> handsome, tar was doing a metric fuckton of few byte ops

23:42 <mjg> not the actual data extraction

23:42 <mjg> and this was most affected

23:42 <heat> did you just call me handsome

23:43 <mjg> it is my goto insult

23:44 <heat> it's the harshest canadian insult after all

23:44 <mjg> so the jury is still out

23:44 airplanemodes has joined #osdev

23:44 <mjg> i *randomized* tons of sizes and fed them

23:44 <mjg> into the bench

23:45 <mjg> this makes erms faster *sometimes* and it is all because of branch mispredicts

23:45 <mjg> 19% for current memset, 4% for erms

23:45 <mjg> i note real-world trace has a win because the calls tend to repeat

23:46 <mjg> but in principle there may be another workload where the above happens instead
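
The difference mjg describes between a replayed trace and randomized sizes can be made concrete by comparing how often consecutive calls share a size. The sketch below is not mjg's tool: `bursty_sizes` is a hypothetical stand-in for a trace in which calls tend to repeat, and `repeat_rate` is only a crude proxy for how predictable a size-dependent branch would be, not a model of a branch predictor.

```python
import random

def repeat_rate(sizes):
    """Fraction of calls whose size equals the previous call's size."""
    if len(sizes) < 2:
        return 0.0
    same = sum(1 for a, b in zip(sizes, sizes[1:]) if a == b)
    return same / (len(sizes) - 1)

def randomized_sizes(n, max_size, seed=0):
    """Independent uniform sizes: the case with ever-changing sizes."""
    rng = random.Random(seed)
    return [rng.randint(1, max_size) for _ in range(n)]

def bursty_sizes(n, max_size, burst=8, seed=0):
    """Hypothetical trace stand-in: each drawn size repeats `burst` times in a row."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        out.extend([rng.randint(1, max_size)] * burst)
    return out[:n]
```

Feeding both streams to the same memset benchmark would then show whether a result depends on the size pattern rather than on the routine itself.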

23:48 <zzo38> Are there wiki pages relating to such things as capability-based security?

23:49 <mjg> not that i know of, but one thing to google is: capsicum

23:50 <heat> and fuchsia

23:50 nyah has quit [Quit: leaving]