<bslsk05>
'Michael got new cell mate who doesn't sleep at all - Prison Break' by PopularScenes2020 (00:02:26)
<kindofwonderful>
that
<kindofwonderful>
klys: the last
<klys>
okok
<klys>
are you trying to swap everything out
<klys>
got a bug in the hibernation
<kindofwonderful>
klys: i try to poweroff but i have input/output error
<klys>
user error
<klys>
of course it's your project
<klys>
just commit
<kindofwonderful>
i commit to nothing, i have done the marriage mistake once
<klys>
and that's why you don't have a github?
<kindofwonderful>
yes
<klys>
words, man, semantics
<klys>
you don't want to give an inch, I suppose
<klys>
you think someone will take a mile
<kindofwonderful>
why are you blaming me for this ? lol
<kindofwonderful>
i just have some peculiarities
<kindofwonderful>
like ..
<kindofwonderful>
I DONT SLEEP
<kindofwonderful>
AT ALL
<klys>
it
<kindofwonderful>
and if I do sleep, with strong medication, it's with open eyes ( as i have been told )
<kindofwonderful>
it's affecting me socially
<kindofwonderful>
the medication has strong side effects which means i only take it once in 3 days
<kindofwonderful>
other than that and the fact i lost 15 pounds of muscle im pretty normal
<kindofwonderful>
give or take
<kindofwonderful>
now im getting tired ...
<kindofwonderful>
klys: cya :)
axis9 has joined #osdev
kindofwonderful has quit [Ping timeout: 264 seconds]
axis9 has quit [Ping timeout: 268 seconds]
axis9_ has joined #osdev
netbsduser has joined #osdev
<netbsduser>
decided to go ahead with trying to do some kind of single-level-store smalltalk. lots of interesting problems to tackle to make it work
<axis9_>
a fixed
<axis9_>
speech off
<axis9_>
touch off
<axis9_>
ident: optIn
<netbsduser>
the on-disk format will be one area of particular interest
<netbsduser>
i am expecting to be imitating a log structured filesystem for that
<axis9_>
slivation ON
<mrvn>
isn't a tree better?
<mrvn>
scanning a 1TB log to recover would take long
<mrvn>
B-tree with COW and multiple roots, one per generation
<netbsduser>
mrvn: that could work well, i read a bit on ZFS recently and such a design could work
<mrvn>
the biggest question is how you deal with mutable objects.
gog has joined #osdev
<netbsduser>
i am expecting to cow them. for small objects this is actually a potential sore point; i am aiming at 128 bytes or under as the object table entry size (therefore the minimum size of an object), and so if a load of small objects are modified which are spread across different blocks, there is an excess of copying afterwards. large objects i want to cow in parts. but here again comes a problem
<netbsduser>
any object which is big and which is subject to random-looking write distributions will quickly become fragmented
<netbsduser>
if i went for something like checkpointing automatically every 30s it could be a real nightmare
<netbsduser>
but that might just be a problem i have to endure. i could always add some kind of hinting mechanism, so that e.g. objects which are expected to live a long time can be marked as such, and then subject to a different strategy for cow, perhaps copying a bigger extent if possible rather than an individual block
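
A minimal sketch of mrvn's "B-tree with COW and multiple roots, one per generation" suggestion, with hypothetical names and allocation/disk I/O stubbed out; updating a leaf copies every node on the path up, so each checkpoint appends a new root while old roots keep describing consistent on-disk snapshots:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define FANOUT 16

    struct node {
        uint64_t blockno;            /* on-disk address of this copy */
        int      nkeys;
        uint64_t keys[FANOUT];       /* keys[i] = lower bound of child[i] */
        struct node *child[FANOUT];  /* NULL at the leaf level */
    };

    extern uint64_t alloc_block(void);   /* assumed: never reuses live blocks */

    /* Copy-on-write update: returns the new copy of 'n'; the original is
     * left untouched and still reachable from older roots. */
    static struct node *cow_update(struct node *n, uint64_t key, int depth)
    {
        struct node *copy = malloc(sizeof(*copy));
        memcpy(copy, n, sizeof(*copy));
        copy->blockno = alloc_block();   /* fresh block, old one stays valid */

        if (depth > 0) {
            int i = 0;
            while (i < n->nkeys - 1 && key >= n->keys[i + 1])
                i++;
            copy->child[i] = cow_update(n->child[i], key, depth - 1);
        }
        /* at depth 0 the caller mutates 'copy' before it is written out */
        return copy;
    }

    struct generation { uint64_t gen; struct node *root; };

    /* one root per generation: a checkpoint is just the newest root */
    struct generation checkpoint(struct generation old, uint64_t key, int depth)
    {
        struct generation g = { old.gen + 1, cow_update(old.root, key, depth) };
        return g;
    }
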
terminalpusher has joined #osdev
wootehfoot has quit [Read error: Connection reset by peer]
wootehfoot has joined #osdev
terminalpusher has quit [Remote host closed the connection]
terminalpusher has joined #osdev
marshmallow has quit [Ping timeout: 252 seconds]
eroux has quit [Ping timeout: 252 seconds]
eroux has joined #osdev
dude12312414 has joined #osdev
xvmt has quit [Ping timeout: 260 seconds]
Lumia has joined #osdev
[itchyjunk] has joined #osdev
<mjg>
geist: so how did it go? :)
gareppa has joined #osdev
gareppa has quit [Remote host closed the connection]
<mjg>
mxshift: for what sizes
<mjg>
mxshift: using nt stores for string ops past certain size is not a new idea
<mjg>
mxshift: the question is if nt stores are a win for the most part when employed in clear_page/pagezero/younameit, which is far from obvious, despite typical handwaving
<mjg>
mxshift: ... and which does seem *pessimal* on cpus made in the last few years
<mjg>
mxshift: well the q is if this was any good on the 32 bit suckers with sse, which may or may not be true, but which was never justified one way or the other with actual tests, that i could find anyway
<mjg>
mxshift: what i did find is people handwaving and rolling with it
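
For reference, a sketch of the two page-zeroing strategies being compared, using SSE2 intrinsics; this is illustrative only, not any particular kernel's pagezero. The trade-off mjg is pointing at: NT stores avoid evicting hot cache lines while zeroing, but the first later touch of the page then misses all the way to memory.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    #define PAGE_SIZE 4096

    static void zero_page_cached(void *page)
    {
        uint64_t *p = page;
        for (int i = 0; i < PAGE_SIZE / 8; i++)
            p[i] = 0;                 /* allocates every line in cache */
    }

    static void zero_page_nt(void *page)
    {
        __m128i zero = _mm_setzero_si128();
        char *p = page;
        for (int i = 0; i < PAGE_SIZE; i += 64) {
            /* movntdq: write-combining stores that bypass the cache */
            _mm_stream_si128((__m128i *)(p + i),      zero);
            _mm_stream_si128((__m128i *)(p + i + 16), zero);
            _mm_stream_si128((__m128i *)(p + i + 32), zero);
            _mm_stream_si128((__m128i *)(p + i + 48), zero);
        }
        _mm_sfence();   /* order the NT stores before the page is handed out */
    }
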
heat has joined #osdev
<heat>
computer
<mjg>
so far my history digging for page zeroing shows: solaris/illumos using nt stores, already came like that with initial import
<mjg>
well export :-P
<mjg>
freebsd: added nt stores with no explanation in the commit message
<mjg>
netbsd: added nt stores with no explanation in the commit message
<mjg>
openbsd: added claiming copy-paste from freebsd and a speed up in zeroing, but no methodology was shown or if actual usage (instead of *just* zeroing) was benchmarked
<mxshift>
I don't recall what sizes ended up benefiting. Apple didn't care about 32bit x86 machines for the most part. macOS Leopard/SnowLeopard would be where the results of those experiments got introduced.
<mjg>
found linux, it is using sse for zeroing, but the stores used don't bypass the cache?
<mjg>
" movq %%mm0, (%0)\n"
<mjg>
" movq %%mm0, 8(%0)\n"
<mjg>
" movq %%mm0, 16(%0)\n"
<mjg>
will have to check intel docs
<heat>
that's not sse
<heat>
that's mmx
<heat>
wow
<mjg>
right
<heat>
is that for i686?
<mjg>
yes
<mjg>
i know the shit is bad on amd64
<mjg>
the question is about the real old stuff
rwb is now known as rb
<sham1>
Wouldn't the use of MMX also be expensive?
<heat>
yes
<heat>
i'm more surprised by the use of mmx itself lol
<heat>
but i guess in i686 that's what you have
<mjg>
there is sse2 in later cpus
<heat>
turns out those two mmx instructions in the reset vector aren't the only ones used :(
netbsduser has quit [Remote host closed the connection]
<sham1>
Could always REP STOSD
<sham1>
Since IIRC that does bypass caches
<heat>
no?
<heat>
they only get funky cache behavior on erms AFAIK
<sham1>
Ah, I see
<sham1>
So that's probably what I remember reading
<heat>
and rep stosd is probably worse for i686 than just manually looping
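
For the record, a sketch of the REP STOSD variant (GNU inline asm, so the AT&T mnemonic is stosl); whether this beats a plain loop on pre-ERMS cores is exactly the open question above.

    #include <stddef.h>

    /* zero 'bytes' bytes (assumed to be a multiple of 4) at 'buf' */
    static void zero_rep_stosl(void *buf, size_t bytes)
    {
        void *d = buf;
        size_t dwords = bytes / 4;
        __asm__ volatile ("rep stosl"
                          : "+D" (d), "+c" (dwords)   /* dest, count */
                          : "a" (0)                   /* store value */
                          : "memory");
    }
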
netbsduser has joined #osdev
<heat>
why would you use an i386 kernel on modern (erms capable), 64-bit hardware?
<sham1>
Good point
<heat>
they didn't even mitigate retbleed
<sham1>
Even if one had a 64 bit machine with not that much memory (although I'd hazard a guess that it wouldn't qualify as modern then) you'd probably be better off using the x32 ABI for AMD64 for the smaller pointers while having all the extra registers
<heat>
yes
<heat>
or the 32-bit compat for less fuckery I guess
[itchyjunk] is now known as [spookyjunk]
<sham1>
Compat doesn't get the extra registers though, right?
<heat>
yup
<sham1>
And just those alone are very appealing
<heat>
i wonder how x32 affects performance
<heat>
if we still had far and near pointers we could possibly enjoy reduced memory usage using an x32 + x64 hybrid
<heat>
thank you for coming to my ted talk
<heat>
or if we had segmentation
<heat>
time to bring all these things back
<GeDaMo>
You could use a 32 bit offset to a 64 bit base
<heat>
but then the codegen sux
<GeDaMo>
That's the compiler's problem! :P
<heat>
we could also make things more compact by using 4-byte aligned u64/uptr
<heat>
as is in 32-bit land
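
A sketch of GeDaMo's base+offset idea with made-up names (essentially the "compressed pointers" trick JVMs use); heat's codegen complaint is the extra add hidden in every dereference.

    #include <stdint.h>

    static char *heap_base;        /* set once, e.g. from mmap() */

    typedef uint32_t cptr;         /* 32-bit offset from heap_base */

    static inline void *cptr_decode(cptr c)  { return heap_base + c; }
    static inline cptr  cptr_encode(void *p) { return (cptr)((char *)p - heap_base); }

    /* 4-byte links: half the size of the 64-bit-pointer version, at the
     * cost of a 4 GiB heap limit (more if offsets are scaled by alignment) */
    struct node {
        cptr     next;
        uint32_t value;
    };
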
<sham1>
We need 69-bit pointers. 64 bits for the address, 4 for a pointer tag if needed, and 1 for parity
<sham1>
Makes it all very nice
<sham1>
Or all 5 extra bits for checksumming
<sham1>
Although a more realistic thing would be to have two 64-bit pointers acting like a base+offset pair, but that's not quite as funny
<heat>
420-bit pointers
<zid>
Can we have 4 bits of parity
<zid>
and a special ECC instruction to compute it
<mjg>
dafuq's x32?
<sham1>
It's not really parity if it's four bits
<heat>
only if it's only available in xeons
<zid>
x32 is 64bit regs with 32bit memory
<mjg>
regular i386 + all the extra amd64 regs?
<heat>
mjg, pointers are 32-bits, the rest is 64
<heat>
it uses the normal x86_64 isa
<mjg>
was not that x64?
<zid>
IL64P32
<heat>
no, x64 is x86_64 is x86-64 is amd64
<sham1>
x64 is a Microsoft's name for AMD64
<heat>
that is also IA-32e
<zid>
x64 is itanium and you can't stop me
<heat>
and Intel 64
* zid
runs around ululating
<mjg>
ye what zid said is my headcanon now
<sham1>
x64 for itanium would make sense
<sham1>
Well, so far as the x86-style naming makes sense period
<heat>
86 should've been doubled as the bits doubled
<heat>
x172
<zid>
8086 -> 80172 or 8172?
<heat>
yes
<heat>
why not both
<heat>
80172 in the US, 8172 in the rest of the world
<sham1>
Clearly 160172
xenos1984 has quit [Read error: Connection reset by peer]
<heat>
that is also a distinct possibility
<sham1>
I wonder what the next address space extension would be, since we're already at PLT5 thanks to Intel
<sham1>
And 52 bits
<sham1>
PTL5*
<sham1>
PML5**
<sham1>
And actually it was 57 bits
<sham1>
Man, I'm just misremembering like all the details
<heat>
is PML5 already implemented by hw?
<sham1>
So it leaves 7 bits for a hypothetical PML6
rorx has joined #osdev
<heat>
oh yeah, icelake
<sham1>
heat: since Ice lake
<sham1>
Yeah
<sham1>
So PML6 is going to have 128 PML6e
<heat>
i don't believe we're getting PML6
<heat>
because address tagging
<sham1>
You're not supposed to use the high bits for tags, although that is a distinct possibility
<bslsk05>
en.wikichip.org: Top-byte Ignore (TBI) - ARM - WikiChip
<sham1>
Usually I've seen the low order bits being used as tags across the board
<heat>
and related intel and amd extensions
<sham1>
Hum
<heat>
which they chose to implement *separately* and in an incompatible way
<heat>
you can't actually store much data in the lower bits
<heat>
at most 4 bits if you have normal malloc 16-byte alignment
<sham1>
Depends on where you align. For example if you align your allocation to 16 bytes, as said you get 4 tag bits and then you can do other neat things like making it so that if the lowest order bit is 0 it's a fixnum and if not, it's something else, interrogate the higher tag bits
<sham1>
Which gives you either 63 or 31 bit integers depending on the width used
<heat>
yeah but that's not an address is it :)
<heat>
you have 4 bits
<heat>
that's not a lot
<sham1>
Well you could have a singular tag for "heap allocated object" which in turn would have a header
<heat>
if you use the upper address range you get 16-bits to play with
<heat>
upper address bit range*
terminalpusher has quit [Remote host closed the connection]
<heat>
which is in practice what JITs do
<sham1>
Well clearly not V8 at least
<heat>
and why most of them have issues when you expand the available address space
<sham1>
V8 tags SMIs like I just described
<sham1>
SMall Integer
eau has joined #osdev
<sham1>
There's some happy consequences there, like that you can add a fixnum/SMI to another and get back a new fixnum/SMI. You'd have to promote to a full width register if your "this is actually an integer and not a pointer" bit was at the top
<sham1>
Subtraction and negation also work, assuming 2's complement
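
A sketch of the tag-0 SMI scheme sham1 is describing, with hypothetical names (not V8's actual code); because the integer tag is 0, tagged add and subtract are plain machine add and subtract, since 2a + 2b = 2(a+b).

    #include <stdint.h>
    #include <assert.h>

    typedef intptr_t value;   /* tagged machine word */

    #define TAG_MASK 1        /* low bit: 0 = small int, 1 = heap pointer */

    static inline value    smi_box(intptr_t n)  { return n << 1; }
    static inline intptr_t smi_unbox(value v)   { return v >> 1; }
    static inline int      is_smi(value v)      { return (v & TAG_MASK) == 0; }

    static inline value ptr_box(void *p)   { return (intptr_t)p | 1; }  /* needs 2-byte alignment */
    static inline void *ptr_unbox(value v) { return (void *)(v & ~(intptr_t)TAG_MASK); }

    /* no untag/retag needed: */
    static inline value smi_add(value a, value b) { return a + b; }
    static inline value smi_sub(value a, value b) { return a - b; }

    int main(void)
    {
        assert(smi_unbox(smi_add(smi_box(20), smi_box(22))) == 42);
        return 0;
    }
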
<heat>
python uses nanboxing
xenos1984 has joined #osdev
<sham1>
Right, as does LuaJIT and IIRC also SpiderMonkey
<sham1>
And that only works while your addresses and stuff stay sufficiently small
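
A minimal NaN-boxing sketch for reference (the general trick, not LuaJIT's or SpiderMonkey's actual layout): every value is a 64-bit double, and non-double values live in the payload bits of a quiet NaN. The 48-bit pointer mask is the hard dependence on address-space size that breaks when user addresses grow.

    #include <stdint.h>
    #include <string.h>

    #define QNAN    0x7ff8000000000000ull   /* exponent all ones + quiet bit */
    #define TAG_PTR 0x0004000000000000ull   /* one payload bit as a type tag */
    #define PAYLOAD 0x0000ffffffffffffull   /* 48 bits for the pointer */

    typedef uint64_t value;

    static value  box_double(double d)  { value v; memcpy(&v, &d, 8); return v; }
    static double unbox_double(value v) { double d; memcpy(&d, &v, 8); return d; }

    /* true for every non-NaN double; real VMs canonicalize incoming NaNs */
    static int is_double(value v) { return (v & QNAN) != QNAN; }

    static value box_ptr(void *p)   { return QNAN | TAG_PTR | ((uintptr_t)p & PAYLOAD); }
    static void *unbox_ptr(value v) { return (void *)(uintptr_t)(v & PAYLOAD); }
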
<heat>
I know luajit was broken on M1-sized address spaces
<heat>
at cloudflare the kernel team tried enabling larger address spaces on arm64 but luajit just wouldn't work
<heat>
linux x86 deals with it by just not giving you a >48-bit address in mmap :)
<sham1>
I could see something like MAP_57BITS or something being added as a Linux extension to those who really want to map even larger things
<heat>
"An application that needs that much memory, and which does not play games with virtual addresses, can provide an address hint above the boundary in a call to mmap(), at which point the kernel will understand that mappings in the upper range are accessible. "
<j`ey>
heat: 52 bit VA on arm64 or?
<heat>
j`ey, yeah that was probably what they tried to enable, cant remember
<zid>
I've only used more than like a couple of gigs of memory once ever
<zid>
and it was to use my mmu as a trie implementation
<zid>
cus I was too lazy to write one
ss4 has joined #osdev
wootehfoot has quit [Ping timeout: 250 seconds]
<mrvn>
sham1: for some reason pointers are usually 0 tagged and integers 1 tagged. So a + b ==> a + b - tag
<mrvn>
It's odd because pointer access in hardware usually allows an offset, which you could use to remove the tag.
xenos1984 has quit [Ping timeout: 264 seconds]
<sham1>
I don't see why. I mean, as I said, v8 tags (small) integers as 0 and pointers with 1, and it makes sense since then a + b is a + b. Adding and subtracting integers is so very common. Pointer access for languages like Javascript is common too, fair, but I suppose v8 takes care of that by being the absolute master of inlining and unboxing
xenos1984 has joined #osdev
<sham1>
I'd certainly tag pointers with 1
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<mrvn>
the drawback is that you need a larger opcode for the pointer access to store the tag
<netbsduser>
i've opted for pointers tagged 0 in my smalltalk dialect and smallintegers 1
<sham1>
Fixnum addition, especially for things like loops with increments by 1, are so common that I simply don't understand that decision
<mrvn>
The thing about integer arithmetic is that you often don't do that on memory. You load the values into registers, remove the tag, do lots of compuations, add the tag, store in memory.
<netbsduser>
(and a few other things too at some point; Characters and SmallFloats; Apple famously encoded short strings into tagged pointers, very useful for early iPhone OS)
<bslsk05>
opensource.googleblog.com: Announcing KataOS and Sparrow | Google Open Source Blog
<mrvn>
sham1: increment by 1 just becomes increment by 2
<sham1>
Yeah. And then you don't even need to remove and readd the tag
<sham1>
Even if done by register, that's still precious instructions
<mrvn>
sham1: any literals you just remove the tag there.
<mrvn>
* and / need to shift to remove the tag
<mrvn>
sham1: I don't get why many GCs tag ints with 1 but they do.
<sham1>
Yeah, but those are also not *quite* as common operations. They're still common, but not quite as common
<netbsduser>
it would be interesting to see the two competing approaches benchmarked
<mrvn>
I guess someone should implement both ways and benchmark that with modern code and modern cpus.
<sham1>
You need to retag and such anyway because a multiplication will give you 128 bits and in the worst case you'll have to promote to a bignum at least if you have something like a Lisp
<netbsduser>
mrvn: great minds think alike, and fools seldom differ
<mrvn>
I'm sure someone did that in the deep past but a lot has changed since then.
<sham1>
Same with division. You might end up with a rational number
<netbsduser>
sham1: this is the behaviour of smalltalk
<mrvn>
sham1: a * b ==> a / 2 * (b - 1) + 1 horrible
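
mrvn's tag-1 identities written out as a sketch (OCaml-style representation r = 2n + 1, helper names made up); the low bit is always 1 so the GC can tell an int from an aligned pointer.

    #include <stdint.h>
    #include <assert.h>

    typedef intptr_t value;

    static inline value    tag(intptr_t n) { return (n << 1) | 1; }
    static inline intptr_t untag(value r)  { return r >> 1; }

    /* (2a+1) + (2b+1) - 1 = 2(a+b) + 1 */
    static inline value add(value a, value b) { return a + b - 1; }

    /* increment by 1 becomes increment by 2: (2a+1) + 2 = 2(a+1) + 1 */
    static inline value incr(value a) { return a + 2; }

    /* a/2 * (b-1) + 1:  (2a+1)/2 = a,  (2b+1)-1 = 2b,  a*2b + 1 = 2ab + 1 */
    static inline value mul(value a, value b) { return (a >> 1) * (b - 1) + 1; }

    int main(void)
    {
        assert(untag(add(tag(20), tag(22))) == 42);
        assert(untag(mul(tag(6),  tag(7)))  == 42);
        return 0;
    }
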
<sham1>
Yeah, any language with a competent numeric tower like Smalltalk, CL, Scheme and so on has to do so
<mrvn>
In any complex arithmetic the compiler can strip out a lot of the tagging.
<sham1>
Yeah, if it can unbox it. Of course that goes for a lot of things
<mrvn>
Ocaml uses the lowest bit for INT_TAG (1), and otherwise the lowest 2 bits for pointers (00) and special constants (10).
<mrvn>
On 64bit it could use 3 tag bits.
xenos1984 has quit [Ping timeout: 268 seconds]
<mrvn>
sham1: one thing why you might want pointers 0 tagged is access with offset and shift: r0 + 2<<3 to access an 8 byte aligned value. If you 1 tag pointers you can't shift and reduce the offset you can reach.
<mrvn>
and r0 + r1<<3 - 1 needs more bytes on x86.
<sham1>
Well usually that access would look more like r0 + constant
<mrvn>
sham1: no, r0 + constant << shift. In most cpus.
<sham1>
Why shift? I mean, if you know which field you're accessing, you could just precalculate the offset
<mrvn>
sham1: it's part of the opcode. uses fewer bits to access the offset
Lumia has quit [Ping timeout: 246 seconds]
<sham1>
If, say, I had a 0b111 as a heap object signature for whatever reason and I wanted to access the 2nd 8-byte word of an object, it could just calculate something like mov %r15, [%rdx + 1] or something to that effect
<sham1>
Sure, you'd have to embed the offset into the opcode, but still
<sham1>
And that is assuming I know that I wish to access the second word
<mrvn>
sham1: try accessing the 256th.
<mrvn>
256 * 8 works but 2047 needs too many bits.
<sham1>
I mean fair enough, but I do not know how often that'd be needed
<mrvn>
yeah, that part you have to benchmark.
<mrvn>
I think this was also intended for array access: x[i] ==> r0 + r1 << 3. But compilers either just increment r0 in the loop or increment r1 by 8.
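
The addressing-mode point in miniature, assuming 8-byte-aligned objects and a 1-tag in the low bit; the untag can ride along in the displacement, so the load stays a single instruction with a slightly longer encoding.

    #include <stdint.h>

    /* field i of a 0-tagged object:
     *     mov rax, [rdi + rsi*8]          ; short encoding */
    static inline int64_t load_untagged(int64_t *obj, int64_t i)
    {
        return obj[i];
    }

    /* same load through a 1-tagged pointer:
     *     mov rax, [rdi + rsi*8 - 1]      ; one displacement byte longer */
    static inline int64_t load_tagged(intptr_t tagged, int64_t i)
    {
        return *(int64_t *)((char *)(tagged - 1) + i * 8);
    }
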
<bslsk05>
git.kernel.org: kernel/git/torvalds/linux.git - Linux kernel source tree
<heat>
a linux release name had emojis
<Ermine>
<3
<Ermine>
send me a fax
<heat>
no
<heat>
its 2022
<Ermine>
Oh, really?
<heat>
yes
frkzoid has quit [Ping timeout: 264 seconds]
<heat>
fax
frkzoid has joined #osdev
<Ermine>
windows server 2012 has fax server role
<geist>
heat: well, problem is that 6.4 is too old
<geist>
it's off the support train so a) the source repo doesn't exist in cvs anymore, so none of the cvsup/etc stuff works anymore
pbx has joined #osdev
<geist>
and b) the ports/package stuff is too old, none of it seems to exist on the servers anymore
<geist>
so it's pretty hard to do anything with it. i fiddled with it for a bit but couldn't install any of the handful of ports i wanted and couldn't sync any source
<geist>
and a new ports dir predictably is wayyy too new for the old freebsd
<pbx>
speaking of ports, i'm working on an interesting one for my OS: https://media.discordapp.net/attachments/1028035833149800510/1031288200431419512/unknown.png
<geist>
this is pretty annoying. i found that at least old netbsd's can still kinda work, because the source and ports cvs are still there and still accessible, even if using a 20 year old version
<heat>
pbx, oh wow
* Ermine
have got another crazy idea
\Test_User has joined #osdev
<heat>
have you thought of, erm, not doing that?
<Ermine>
No
<mjg>
geist: kldload hwpmc && pmcstat -L
<geist>
neither of those work
<mjg>
wut
<netbsduser>
pbx: that would indeed be an interesting port to see
<geist>
-L switch doesn't exist in this version
<mjg>
geist: give me 5
<geist>
and the kldload doesn't have that module
<netbsduser>
the only OSes i am aware of with X11 ports are the major ones of the osdev discord
<netbsduser>
hobby OSes rather. managarm, lyre, and aero
<mjg>
geist: what exact error are you getting from kldload
<mjg>
cause
<mjg>
The hwpmc driver first appeared in FreeBSD 6.0.
<geist>
hwpmc does, the -L switch doesn't exist on the thing
<geist>
oh lemme see, gotta boot the machine up again
<geist>
it doesn't do power management very well, so it seems to burn a fair amount of power constantly :)
<geist>
yay pentium 4
<mjg>
you got a meter attached to it?
<geist>
no but you can feel the heat coming out of it
<geist>
though i should find my old kill-a-watt somewhere
<clever>
i need to test my new router, i can feel the heat wafting off the top of it
<geist>
booting it up now. iirc there was a syslog message to the effect of 'kernel not configured with ....'
<geist>
as if the hwpmc driver was not a default option in the SMP kernel
<mjg>
does the following work: pmcstat -S dc-misses -O /tmp/out
<geist>
booting it up...
<clever>
the router PSU is rated for 12v 5a
<geist>
pmcstat: ERROR: Initialization of the pmc(3) library failed: No such file or directory
<mjg>
did you kldload
<geist>
"this kernel has not been compiled with options HWPMC_HOOKS"
<geist>
ie, top and sysstat and whatnot dont seem to show any activity on cpu 1
<mjg>
if i have my history right the first non-toy smp release was 7
<geist>
yah, i do remember that 5 and 6 i think were kinda the dark ages of freebsd
<mjg>
everything prior to it was a tirefire
<mjg>
well i had a 5.3 in production which was rock solid fwiw :-P
<geist>
for SMP, but freebsd 4.x was a super solid OS
<mjg>
but also unicore
<mjg>
interestingly it was freebsd which worked for me and not linux
<geist>
twas the era where SMP x86 machines were still kinda exotic, or at least you knew that you weren't getting a 2x speedup, but that was better than nothing
<mjg>
for example in linux you still had to manually pick different /dev/dsp* (or whatever the name) devices for different programs
<mjg>
so that more than one prog can emit sound (LOL)
<mjg>
freebsd made it work with just 1
<heat>
geist, were the CPUs themselves bottlenecked?
<heat>
(for the "not getting a 2x speedup" comment)
<geist>
oh i mean in terms of the state of the art, OS stuff was not as sophisticated as it is now
<mjg>
kernels were turbo bottlenecked
<heat>
well, yes
<mjg>
it's basically giant lock everywhere, like openbsd today
<geist>
also things like per cpu run queues and whatnot were not the usual thing
<mjg>
heat: if you are asking if the hw was inherently running into problems, then i don't know
<heat>
i'm curious if the cpu technology itself made it a smaller gain than 100%
<mjg>
right
<geist>
but it worked well enough, in general. there were obviously workloads that would collapse to 1x because of bottlenecks, but on the average it was still better than one cpu
<mjg>
i would expect MESI was dogshit slow
<mjg>
and consequently there was an inherent drop no matter what
<geist>
again, i dont think it was that bad
<geist>
i think a general rule of thumb was something like 1.8x or whatnot
<geist>
but totally depends on what you're doing
<heat>
was there ever a serious attempt to run a kernel instance per cpu?
<geist>
it's all a matter of perspective. if you have the option of adding a second cpu to your machine you weighed the cost, but at a point in time there (late 90s) before intel and amd started bifurcating their market into server and client stuff, the overhead was you paid more for a dual socket mobo
<geist>
but you had the option of just sticking another cpu in
<geist>
but then you needed an OS that actually supported it, but it was generally a reasonable speedup
<mjg>
did windows 9x do smp?
<mjg>
sound silly now that i wrote it
<heat>
lol
<geist>
was it optimal? probably not, but was > 1x by a pretty good margin, and worst case it'd probably collapse to about what you got with a single cpu, so it was a win no matter what
<mjg>
.. looks like you needed nt
<mjg>
imagine smp dos (LOL)
<geist>
and yeah win 9x was very smp
<geist>
very not smp
<geist>
i remember in college, every time i had to reboot my SMP machine into win9x to play some game a little bit of me died inside
<mjg>
:)
<geist>
i got a dual ppro in about 1996. was my main machine up until 1999 when i got the dual p3 which i still have
smach has quit []
<mjg>
wut 2 cpus in 1996?
<mjg>
are you from old money or something
<geist>
no but i worked at compaq at the time and there was a lot of... uh, discarded hardware
<mjg>
what were you running which could even take advantage of it
<geist>
ie if you were okay running an A0 rev ppro 180MHz, go for it
<geist>
was going in the trash otherwise
<mjg>
oh ok, hw lying around in big corp is a topic of its own
<geist>
but you could get mobos for not that much more. the market hadn't yet bifurcated into workstation/server (== $$) and desktop
<geist>
a dual socket board was just a bit more expensive. but once you paid the cost you could just stick another cpu in later, and cpus weren't intrinsically SMP or not SMP capable at the time
<geist>
since most of it was in the chipset anyway
<mjg>
so how did cache coherency work?
<geist>
oh they had a MOESI already there, but that was just present in basically 486+
<geist>
was part of the bus
<geist>
but ppro was a lot more sophisticated though, since it pulled in an L2 onto the socket
<geist>
so the bus protocol i think was more complex there
<geist>
but 486 and pentium there was an L1 cache on the cpu socket, so it had to at least be coherent there
<geist>
but any L2s were on the motherboard, and a single one, so the cache coherency there was part of the chipset
<geist>
okay, got the new kernel, booted. lets see...
<geist>
okay, so got the hwpmc module loaded, but of course the commandline is less featureful
<heat>
what are we doing again
<heat>
old freebsd, old cpu, but what for?
<geist>
i dunno, i was just using it as an excuse to fiddle with old freebsd
<mjg>
heat: you know, someone may join this channel and ask what would you develop a custom os for
<mjg>
geist: so does this guy crap out or not: pmcstat -S dc-misses -O /tmp/out
<geist>
fwiw when it loaded the hwpmc hwpmc: TSC/1/0x20<REA> P4/18/0xfff<INT,USR,SYS,EDG,THR,REA,WRI,INV,QUA,PRC,TAG,CSC>
<geist>
no, says it can't do dc-misses
<geist>
ie, the command has a different syntax, etc
<bslsk05>
www.phoronix.com: With AMD Zen 4, It's Surprisingly Not Worthwhile Disabling CPU Security Mitigations - Phoronix
<geist>
well, it's not *useful* as in it is a waste of power to run old hardware that chews up a lot of it just to actually compute things
<geist>
that's honestly why i dont find some cute way to have all of them on all the time in the garage or whatnot
<mjg>
heat: dude phoronix is a perf tabloid
<geist>
which would be fun, but i also pay the power bill and care a little bit about the environment
<zid>
I could believe they spent their powerbudget elsewhere though
<zid>
other than speculation
<heat>
mjg, doesn't matter, did you see the results?
<mjg>
heat: phoronix performs fire & forget benchmarking
<geist>
i was talking to some folks at work about it. basically their take is zen 4 > *
<mjg>
heat: who tf knows what actually happened
<geist>
and zen 3-4 > zen 1-2 by a lot on the mitigation front
<geist>
ie, 1 and 2 have some fairly gnarly issues
<zid>
I just have mitties off, spectre just means you can install printer drivers on my machine
<zid>
if you have my user account I've already lost everything else
<geist>
and yeah i usually run with mitties off, but frankly i can also just go into my bios and bump the power budget a bit and that more than makes up for it
<mjg>
geist: so... i'm afraid i don't see a way to get pmcstat working (perhaps other release would, but i'm not going there)
<\Test_User>
and bumping the power budget doesn't work without spectre mitigations? :P just do both
<geist>
basically too old of a freebsd huh?
<geist>
what about a freebsd 13 install? would it be testable?
<mjg>
geist: if you give me 15 i'll give you a 6.4-friendly will-it-scale to bench page faults
<geist>
oh sure. i'd be happy to compile and run a program
<mjg>
ok stay tuned
<heat>
STAY HARD
<geist>
this little 22 year old computer wants to please!
<geist>
yah, for a while shuttles were really popular
<geist>
they were kinda ahead of their time with the mini itx wave back in the early 2000s
<geist>
it has this elaborate heat pipe to turn the heat sink 90 degrees so it dumps heat directly out the back
<heat>
np i understand
<geist>
re: the zen 4 faster with mitigations, was asking someone at work about it but they dont know precisely which one it is
<mjg>
so it's true?
<mjg>
wut
<geist>
i should ask him to pop in here sometime, he is a wealth of knowledge that stuff
<heat>
we want all the knowledge
<mjg>
sounds pretty weird
<mjg>
:)
<heat>
when is bryan cantrill popping in too?
<mjg>
dude, don;t
<geist>
it's probably something far more mundane like it's not that the mitigations make it faster, but the path where the mitigations aren't taken is for some reason very pessimal on that hardware
<mjg>
cantrill is a great example of an old geezer
<heat>
he's not that old
<mjg>
tell you what
<geist>
anyway work on your program
<mjg>
ok, fair
<heat>
why do these new uarchs still have mitigations though?
<heat>
i would assume they would fix it in silicon
<geist>
cause they find more, or the mitigation is an opt in thing
<geist>
as in 'to avoid problems with this sort of thing, frob this when context switching'
<geist>
and you can choose to not frob it
<geist>
it's kinda the new thing now: things you should do when doing X or Y but dont *have* to
<mjg>
funny how certain tooling is missing, you would think it was there since the 80s
<mjg>
anyhow, given the above, i do expect sse2 to ultimately win. i'm somewhat surprised, but not shocked
<geist>
so what was the default?
<mjg>
sse2
<mjg>
if available
<geist>
so they were onto something!
<geist>
so i think this whole exercise is, they were right at the time
<mjg>
it looks like so, yes
<heat>
woah
<geist>
the p3 wont take this route because it didn't have SSE2, i assume
<geist>
because SSE had just come out on P3
<mjg>
it might have
<mjg>
lemme do a quick check
<mjg>
which uarch is this
<geist>
this is a pentium 4. which is where SSE2 was introduced
<heat>
what's the point of zeroing with temporal stores though?
<geist>
the precise one.... good question
<heat>
s/temporal/cached/
<geist>
it's a later P4, the first with HT
<heat>
assuming it's faster these days, why?
<geist>
northwood maybe?
<heat>
maybe the caches are just big enough now?
<geist>
well, i think the answer is NT stuff is a complicated problem. it trades this sort of thing for that
<geist>
huh this cpu may be 2002. northwoods with HT came out in early 2002
<geist>
the narrative in wikipedia was about 2002 or so they finally got the clock rate up (2.4Ghz and above) and was starting to beat the athlons at the time with sheer mhz
<geist>
prior to that the p4s were kinda a joke
<mjg>
oooh wait!
<mjg>
wait
<mjg>
this test only accesses one line from each page
<mjg>
no wonder it's a win
<mjg>
let's access some more
<geist>
but that causes it to blat out the whole page
<mjg>
it zeroes the whole page
<geist>
ah i see you mean if i read in the whole page it should start losing with NT
<mjg>
but you then only pay for fetching one line
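
The rough shape of the benchmark being run (a reconstruction of the structure, not mjg's actual will-it-scale code): fault in anonymous pages, touch a configurable number of 64-byte lines in each, unmap, repeat. With CLINES = 1 the kernel zeroes the whole page but the test fetches only one line, which is the case that flatters NT zeroing.

    #include <sys/mman.h>
    #include <stdio.h>

    #define AREA   (128u << 20)   /* 128 MB, the default mentioned above */
    #define PGSZ   4096
    #define CLINES 1              /* cache lines touched per page, 1..64 */

    int main(void)
    {
        unsigned long iters = 0;
        for (;;) {
            char *p = mmap(NULL, AREA, PROT_READ | PROT_WRITE,
                           MAP_ANON | MAP_PRIVATE, -1, 0);
            for (char *c = p; c < p + AREA; c += PGSZ)
                for (int l = 0; l < CLINES; l++)
                    c[l * 64] = 1;          /* fault the page, touch a line */
            munmap(p, AREA);    /* unmap so the footprint stays bounded */
            printf("%lu\n", ++iters);
        }
    }
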
<bslsk05>
'Unix50 - Unix Today and Tomorrow: The Kernel' by Nokia Bell Labs (00:51:33)
<mjg>
heat: 2. i'm patching it up to address different sizes
<heat>
you obviously need a more realistic bench
<heat>
I propose: kernel building
<heat>
the classic
<mjg>
that wont see a diff
ss4 has quit [Quit: Leaving]
<heat>
ah, it doesn't matter then
<mjg>
at least i don't expect it to
<heat>
checkmate
<mjg>
no
<mjg>
it wont because of utter single threaded slowness all over
<mjg>
which masks any difference in this area
<geist>
this is functionally a single threaded machine
<geist>
actually the performance numbers are all over the place here
<geist>
looking at it running its 5 runs
<geist>
very inconsistent between runs
<mjg>
hence the script
<geist>
this is with the script
<mjg>
if you get total noise with the script
<mjg>
that's real bad
<mjg>
i would make sure to kill off any possible source
<mjg>
cron
<geist>
ooooh it's because it's causing the machine to swap out
<mjg>
sendmail
<mjg>
lol
<mjg>
haha
<heat>
hahahaha
<geist>
how much memory are you trying to map here?
<mjg>
geist: how much ram do you have there
<heat>
128MiB
<mjg>
me?
<geist>
1GB
<mjg>
that's the default from there: 128 mb
<geist>
but for some reason it's using a shitton more here
<mjg>
i think we can safely drop it to 32
<geist>
oh i see why. a local fix here causes it not to unmap the old memory
<mjg>
it busts all the cases anyway
<mjg>
lol
<gog>
hi
<geist>
because i did c += pgsize
<heat>
gog, hello gog.com
<gog>
welcome
<gog>
to zombocom
<heat>
welcome to gogcom
<heat>
pbx, hrmm what is it using to draw to the fb?
<heat>
/dev/fb0 or something stupid like that?
poyking16 has quit [Quit: WeeChat 3.5]
<geist>
okay, now with it locally fixed i'm seeing extremely consistent results: nothing matters
<geist>
all three are basically the same speed
<mjg>
for clines 128?
<geist>
yeah, maybe about 2% slower for the sse
<geist>
102k vs 106k
<mjg>
can you only access half
<mjg>
64 i mean
<geist>
lemme tweak it to 64
<geist>
64 lines of 64 now
<mjg>
that's a full page
<geist>
same numbers it seems: about 102k for SSE, 107k for the others
<geist>
(using 64 byte cache lines)
<geist>
ah you mean half the page, sure
<mjg>
and last one: change mmapped area to 2 pages
<mjg>
i promise to stop after :)
<geist>
at half the page..... it's about 118k for sse, 112k/110k for the others
<geist>
so based on this super simple test case: at best the SSE stuff is a little slower, but as you touch less and less of the page in subsequent usage, it becomes more of a win
<geist>
maybe the takeaway is NT clearing isn't bad because cpus are very very good at filling cache lines on subsequent touchings, if they are
<mjg>
do 2 pages man
<mjg>
it is basically expected the less of the area you touch the better it is for nt
<geist>
yeah i'm doing it now
<geist>
and you're right
<mjg>
the question is what real workloads are doing, but as pmcstat does not work, we can't test it on this sucker :/
<mjg>
if you can be arsed some other time, perhaps plopping linux in there would help answer it
<mjg>
:)
<mjg>
if their profiling tooling, whatever it was at the time, worked
<geist>
102k for sse....96k/94k
<mjg>
for 2 page area?
<geist>
yes
<mjg>
huh
<mjg>
not what i expected
<heat>
lmao
<mjg>
can you paste the final patched prog somewhere?
<geist>
fine.
<geist>
my interest in this has crossed zero and is starting to go deeply negative now
<mjg>
ye just give me the prog and i'm off your back
<mjg>
as promised
<heat>
i think you're trying to find a difference where there isn't any
<mjg>
read this. it's basically solaris smp bag of tricks and if you think that's any good, we need to have a serious convo
<heat>
not right now
<heat>
but skimming through it, most things seem sensible
<mjg>
> Hash tables are common data structures in performance-critical systems software, and sometimes they must be accessed in parallel. In this case, adding a lock to each hash chain, with the per-chain lock held while readers or writers iterate over the chain, seems straightforward.
<mjg>
you already don't scale
netbsduser has quit [Remote host closed the connection]
<mjg>
for read locking a chain
<mjg>
it was already true at scales solaris claimed to operate at the time
<heat>
huh?
<mjg>
huh what
<heat>
what's the problem here?
<heat>
lock per chain
<mrvn>
mjg: hashtables are supposed to not have collisions. So chains should be short (or you resize) and few cores would lock the same chain.
<heat>
seems simple
<heat>
and effective
<mrvn>
just don't access the same item on multiple cores.
<mrvn>
mjg: The big problem with hashtables is how to resize them.
<mjg>
you spawn n threads, each of which looks up foo/bar/baz/quux{$thread_id}
<mjg>
if you employ rcu or an equivalent and skip hash locking you scale perfectly
<mjg>
otherwise all of them bounce lines on their way to the final component
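
The two lookup disciplines in miniature, with pthread and C11 atomics standing in for kernel primitives; the lockless variant is schematic and assumes an RCU-like grace period protects readers against reclamation.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <string.h>

    struct entry {
        const char *key;
        void *val;
        struct entry *_Atomic next;
    };

    struct chain {
        pthread_mutex_t lock;          /* per-chain lock */
        struct entry *_Atomic head;
    };

    /* Locked lookup: even pure readers write the lock word, so n threads
     * walking the same hot chain bounce that cache line between cores. */
    void *lookup_locked(struct chain *c, const char *key)
    {
        void *v = NULL;
        pthread_mutex_lock(&c->lock);
        for (struct entry *e = c->head; e != NULL; e = e->next)
            if (strcmp(e->key, key) == 0) { v = e->val; break; }
        pthread_mutex_unlock(&c->lock);
        return v;
    }

    /* RCU-style lookup: loads only, nothing shared is written, so the hot
     * prefix (foo/bar/baz/...) stays cached on every core; writers still
     * serialize against each other with the chain lock. */
    void *lookup_lockless(struct chain *c, const char *key)
    {
        for (struct entry *e = atomic_load_explicit(&c->head, memory_order_acquire);
             e != NULL;
             e = atomic_load_explicit(&e->next, memory_order_acquire))
            if (strcmp(e->key, key) == 0)
                return e->val;
        return NULL;
    }
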
<heat>
but this is 2008 and no one had rcu
<heat>
except... lunix
<heat>
freebsd got rcu like 3 years ago
<mjg>
then maybe you can concede solaris not having it could not scale
<mrvn>
except when each thread deletes foo/bar/baz/quux{$thread_id} you end up with O(n^2) work
<heat>
it scales worse
<mjg>
mrvn: we also had this convo man
<mrvn>
except when each thread deletes foo/bar/baz/quux{$thread_id} you end up with O(n^3) work, I mean
<mjg>
mrvn: *modification* do write lock
<heat>
but it still scales
<mjg>
mrvn: so no, no O(n^2)
<heat>
get a bigger hashtable, use read-write locks, etc
<mrvn>
mjg: can't do write locks, that means you need read locks and your rcu is pointless
<heat>
(and a good hash function)
<mjg>
mrvn: no
<mjg>
mrvn: again we had this convo
<mrvn>
mjg: and you are still wrong
<mjg>
mrvn: dude it's literally how linux and freebsd do it
<mjg>
mrvn: write lock for changes, lockless for lookups
<mjg>
on the same chain
<mjg>
now you are telling me this does not work
<mrvn>
no, I'm telling you you are missing something
<mjg>
heat: no amount of hash resizing or hashing func changes is going to change the fact that foo/bar/baz/quux{$thread_id} lookups all visit the same 3 elements
<heat>
mjg, yeah but erm, this is not specifically for paths?
<heat>
they're talking about hashtables
<mjg>
what?
<heat>
and they said is sensible
<mjg>
so for example their name cache is hash table
<mjg>
and when you have to lock stuff to access it you run into the above?
<mjg>
they happen to have a mutex instead of a rw lock for chains, but the same problem would exist
<mjg>
you fundamentally can't jump over the fact that rcu-like lookup of foo/bar/baz/quux{$thread_id} bounces nothing
<mjg>
while read-write locking bounces *twice* for the first 3 components
<mrvn>
Is there such a thing as a timed lock? You can aquire it for N ticks and then it reverts to unlocked. And you can check if it's already locked for >= M ticks and skip locking.
<mjg>
so 6 times in total
<mjg>
multiply that by n workers and you are fucked
<mrvn>
mjg: for me N == 4 - 16 so totally not a problem.
y0m0n has joined #osdev
Burgundy has quit [Ping timeout: 252 seconds]
<mrvn>
mjg: FYI your example is pretty bad. If you have threads using foo/bar/baz/quux{$thread_id} then first they have to create it, which is O(n^3) with RCU (which is why you need the write lock). Then they would operate on the FD they opened so no path lookup and then cleanup in O(n^3) with RCU again.
<mjg>
dude
<mjg>
let's agree to disagree
<mrvn>
mjg: better example would be N gcc all parsing #include <stdio.h>
<mrvn>
N parallel path lookups and no modifications.
<epony>
it does not scale linearly without software rework
<mrvn>
epony: ???
<Matt|home>
memory stack is a cpu hardware feature correct?
<mrvn>
mjg: what's a stack? my cpu has no stack
<mrvn>
Matt|home: ^^
<Matt|home>
i ask because im thinking if it's _actually_ a bad idea to treat the entire memory space as virtual addressing and only using heap memory storage..
<bslsk05>
en.wikipedia.org: Amdahl's law - Wikipedia
<Matt|home>
if that word salad made sense
<mrvn>
Matt|home: like haskell does?
<epony>
also you have concurrency on cache lines
<Matt|home>
not familiar with haskell
<mrvn>
Matt|home: doesn't really matter. Point is that it allocated "stack" frames on the heap
<Matt|home>
yeah basically. the _main_ reason im thinking about it is because of an off-hand comment made earlier about the C compiler, where "eventually you'll stop thinking about memory in terms of stack and heap". like.. memory segmentation is a hardware feature it's not an abstraction
<Matt|home>
so how dumb would it be for a system to act that way
<Matt|home>
or am i thinking of something very dumb here..
<Matt|home>
eh im thinking of something very dumb, nevermind ignore me
<mrvn>
Matt|home: any language with callcc basically has to do it that way.
<mrvn>
and it probably only makes sense for GC languages
<Matt|home>
yeh
<mrvn>
In something like C every function descends till it reaches a leaf and then returns, so a stack makes a lot of sense.
<mrvn>
the heap probably makes the least sense. You really don't want a classic heap in a modern OS
<bslsk05>
en.wikipedia.org: C dynamic memory allocation - Wikipedia
<mrvn>
heat: you are using dlmalloc to improve the classic heap
<mrvn>
epony: please stop pasting random links
<heat>
the "classic heap" always had something on top of it
<heat>
K&R malloc isn't just sbrk(size)
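
What that means, condensed: the K&R allocator (The C Programming Language, section 8.7) runs a circular free list on top of sbrk. A simplified sketch, leaving out the free()/coalescing half:

    #include <unistd.h>
    #include <stddef.h>

    typedef struct header {
        struct header *next;   /* next block on the free list */
        size_t nunits;         /* block size, in header-sized units */
    } Header;

    static Header base;        /* degenerate list head */
    static Header *freep;

    void *kr_malloc(size_t nbytes)
    {
        size_t nunits = (nbytes + sizeof(Header) - 1) / sizeof(Header) + 1;
        if (freep == NULL) {                    /* first call: empty list */
            base.next = freep = &base;
            base.nunits = 0;
        }
        for (Header *prev = freep, *p = prev->next;; prev = p, p = p->next) {
            if (p->nunits >= nunits) {          /* found a big enough block */
                if (p->nunits == nunits)
                    prev->next = p->next;       /* exact fit: unlink */
                else {                          /* split the tail off */
                    p->nunits -= nunits;
                    p += p->nunits;
                    p->nunits = nunits;
                }
                freep = prev;
                return (void *)(p + 1);
            }
            if (p == freep) {                   /* wrapped: ask for more core */
                size_t grow = nunits < 1024 ? 1024 : nunits;
                Header *h = sbrk(grow * sizeof(Header));
                if (h == (void *)-1)
                    return NULL;
                h->nunits = grow;
                h->next = freep->next;          /* splice into the list */
                freep->next = h;
            }
        }
    }
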
<AmyMalik>
mrvn, I have epony on ignore.
<epony>
also, make sure you actually target some HW for your kernel, just being an application is less demanding on your designs
y0m0n has quit [Ping timeout: 248 seconds]
<kazinsal>
he should be on +b
<AmyMalik>
true.
<mrvn>
heat: most importantly dlmalloc uses mmap potentially
<heat>
it heavily prefers sbrk
<AmyMalik>
on FreeBSD, sbrk doesn't exist on arm64 or riscv.
<heat>
huh
<heat>
weird
<AmyMalik>
how?
<AmyMalik>
The brk() and sbrk() functions are legacy interfaces from before the advent of modern virtual memory management. They are deprecated and not present on the arm64 or
<AmyMalik>
riscv architectures. The mmap(2) interface should be used to allocate pages instead.
<heat>
how what?
DanDan has quit [Ping timeout: 252 seconds]
<AmyMalik>
how is it weird?
<heat>
it's a fairly simple syscall that may see some usage, particularly on some mallocs
civa has joined #osdev
<heat>
but I guess freebsd just uses jemalloc
<mrvn>
most programs just use libc and don't bake their own malloc/free so that isn't a big problem.
<AmyMalik>
yeah, FreeBSD does use jemalloc
<heat>
it's also trivially emulatable in user-space using mmap + mremap
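
One shape that emulation could take, sketched with a fixed up-front reservation (rather than mremap, which FreeBSD lacks); names are made up and error handling is minimal.

    #include <sys/mman.h>
    #include <stdint.h>
    #include <errno.h>

    #define EMU_HEAP_MAX (1ull << 32)   /* 4 GiB of reserved address space */

    static char *heap_lo, *heap_brk;

    void *emu_sbrk(intptr_t incr)
    {
        if (heap_lo == NULL) {          /* reserve once, no backing yet */
            void *r = mmap(NULL, EMU_HEAP_MAX, PROT_NONE,
                           MAP_ANON | MAP_PRIVATE, -1, 0);
            if (r == MAP_FAILED)
                return (void *)-1;
            heap_lo = heap_brk = r;
        }
        char *old = heap_brk;
        if (incr > 0) {
            if ((uintptr_t)(heap_brk - heap_lo) + (uintptr_t)incr > EMU_HEAP_MAX) {
                errno = ENOMEM;
                return (void *)-1;
            }
            if (mprotect(old, incr, PROT_READ | PROT_WRITE) != 0)
                return (void *)-1;      /* commit the newly exposed range */
        } else if (old + incr < heap_lo) {
            errno = EINVAL;
            return (void *)-1;
        }
        heap_brk += incr;
        return old;                     /* previous break, like sbrk */
    }
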
<bslsk05>
en.wikipedia.org: CPU cache - Wikipedia
<mrvn>
Does FreeBSD libc implement (s)brk?
<heat>
AmyMalik, re: reimplementing malloc, sure?
<heat>
you can do it wherever you prefer
<heat>
your OS, an existing one, etc
<AmyMalik>
right on
<mrvn>
AmyMalik: have you planted a tree, built a house and walked a road yet?
<mrvn>
wrote an editor=
<mrvn>
?
<heat>
usually if its a userspace memory allocator I would prefer doing it in Linux
<AmyMalik>
no, no, no and probably not
<heat>
because it's easier to debug if things go poopy
<heat>
which they inevitably will
<heat>
not quite as easy to debug your own OS :)
<heat>
(yet!)
<AmyMalik>
hm
<AmyMalik>
I don't have my own OS yet
<epony>
quote of the week: -"I worked in a number of high profiled failures.." -"You're fired."
dude12312414 has joined #osdev
terminalpusher has quit [Remote host closed the connection]
frkzoid has quit [Ping timeout: 260 seconds]
<Matt|home>
i'll look at it later but what's the basic idea behind 'writing your own malloc' implementation on an existing system. each program has its own memory space, is the idea just to use what constraints you're given or are you expected to make some system calls or what
<Matt|home>
or is it just an exercise that you can design however you want