foudfou has quit [Remote host closed the connection]
gog has quit [Quit: byee]
gog has joined #osdev
ZombieChicken has quit [Quit: WeeChat 3.5]
nyah has quit [Ping timeout: 268 seconds]
elastic_dog has quit [Ping timeout: 255 seconds]
elastic_dog has joined #osdev
ghee has quit [Quit: EOF]
saltd has quit [Remote host closed the connection]
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
[itchyjunk] has quit [Ping timeout: 244 seconds]
<heat>
configure scripts are pretty useful as posix unit tests
saltd has joined #osdev
<klange>
that'd be an integration test, though
<zid>
yea a unit test would be to check that 1+1 is 2 still, even if you used a makefile :P
<klange>
the individual things that go into a configure script are reasonable, but you'd have to have compiled them all elsewhere for it to be a proper set of unit tests
<heat>
bah, you get the point
<heat>
I've managed to find a bunch of brokenness in my system (both POSIXY and not) by getting my first configure script to run
<heat>
also, yay i managed to build GNU hello
<klange>
that is quite an achievement given what GNU hello exists for
<heat>
I tried ncurses but I can't get multithreaded make to behave properly there for some reason...... although it builds single threaded
<klange>
I really should revisit some of this stuff.
<heat>
what stuff?
<klange>
stuff that isn't mine
<heat>
lol
<klange>
I did get that dash port working nicely. I should get busybox ported and see if I can manage to run configure scripts with that setup.
<klange>
I put so much effort into doing my own things, but my Unix toolset is still very incomplete, and it's not something I care too much about.
<heat>
try toybox
<heat>
it's exactly like busybox but worse (although liberally licensed and probably lighter on deps)
<klange>
I don't really care that Busybox is GPL.
<klange>
Toybox is what Android uses, right?
<heat>
my configure ran on top of dash, toybox, OpenBSD tr, and nawk
<heat>
yes
<klange>
Have you tried the one true awk? It apparently got an update earlier this year that added Unicode support.
<heat>
yes, that's nawk
<klange>
I am quite happy to have produced a rather complete system of my own things, but maybe it's time to rejoin the rest of the world again :)
<heat>
i had to pick between gawk and nawk and... yeah I'd rather not have more GNU crap
<heat>
time to abandon your inner NIH and embrace your literally-not-invented-here-half-my-system-isnt-mine
<klange>
Perhaps not bothering with these Unix things will help me shed the "Unix clone" derision. You want a proper POSIX Toaru, here's dash and *box, have fun.
<klange>
I do want to redo my shell, though... but I can avoid bothering to go for a POSIX sh if I just shrug and provide a dash package.
<klange>
Though maybe Drew was right all those years ago when we had our falling out over my shell being placed at /bin/sh
<heat>
context?
<klange>
(playing ffxiv so can't be too responsive right now)
<klange>
can only irc while autowalking or in an unskippable cutscene
<klange>
thankfully there are a lot of the latter in the content I am playing at the moment
<klange>
heat: many many years ago, when drew was working on a POSIX shell implementation, I was in their IRC channel, and some choice words were thrown around about the fact that I put my shell at /bin/sh when it is very much not a POSIX shell.
<heat>
i would say that's wrong but "choice words" sounds pretty harsh
<zid>
You can't end your binaries in .exe, that's what playstation uses.
<kazinsal>
It's only a unix clone if it runs on a 16-bit machine from the 1970s, otherwise it's just sparkling timesharing
<heat>
unix clones and derivatives are shit
<heat>
at&t code only
<kazinsal>
"I will never write a unix clone," I say, writing a unix clone for the IBM PC/XT
<heat>
are you writing a unix clone now?
<zid>
You mean a sparkling timeshare for the IBM PC, it's not a mainframe from the main au framé region of france
<kazinsal>
I got bored one night and started fiddling with Open Watcom and am likely going to end up just reimplementing V3 or something similarly simple and early
<heat>
upvote
<heat>
I've wanted to try and do something similar myself
<kazinsal>
The hardest part was trying to figure out exactly which fucking invocation program to use for Open Watcom
<heat>
one of the options was to grab some early BSD (4.2 or so) and port it to the modern times
<kazinsal>
Because there's half a dozen different command-line frontends and only one of them has the correct combination of options available to compile freestanding flat 8086 binaries
[itchyjunk] has joined #osdev
<klange>
ugh my truetype rendering is so slow
<geist>
yah every time i sit down to hack more on a user space for LK i just end up implementing a little more unix
<geist>
because why not
<klange>
I have this calculator app, which is backed by Kuroko, and I [relatively] recently implemented bignums, so now it can do 68**420
<zid>
Are you willing to do insane shit?
<klange>
except when you calculate that there's so many digits to render it slows to a crawl
<zid>
glyph cache, glyph cache with 3x of each glyph so you can interpolate subpixel AA?
<zid>
string cache, character pair cache? *adds more caches*
<klange>
need to start with just a plain old glyph cache
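A "plain old glyph cache" is just a map from codepoint to a previously rasterized bitmap, so the slow TrueType path runs once per glyph instead of once per draw. A minimal sketch in C; every name in it (glyph_t, render_glyph, get_glyph) is a hypothetical stand-in, not klange's actual code:

    #include <stdint.h>

    #define CACHE_BUCKETS 256

    typedef struct glyph {
        uint32_t codepoint;
        int w, h;
        uint8_t *bitmap;        /* 8-bit coverage, w*h bytes */
        struct glyph *next;     /* hash-chain link */
    } glyph_t;

    static glyph_t *cache[CACHE_BUCKETS];

    /* the slow path: rasterize via the TrueType engine (hypothetical) */
    extern glyph_t *render_glyph(uint32_t cp);

    glyph_t *get_glyph(uint32_t cp)
    {
        glyph_t **head = &cache[cp % CACHE_BUCKETS];
        for (glyph_t *g = *head; g; g = g->next)
            if (g->codepoint == cp)
                return g;               /* hit: reuse the bitmap */
        glyph_t *g = render_glyph(cp);  /* miss: rasterize once */
        g->next = *head;
        *head = g;
        return g;
    }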
<heat>
geist, tbf unix is probably the only thing you haven't done
<heat>
i think
<heat>
(also nt but that's legally dubious)
<geist>
to a certain extent yes
<zid>
dw, they made running windows legal ages ago
<heat>
whether or not you can build a fancy pants microkernel on top of lk is pretty clear by now
<heat>
the real question is if it can support a full svr4 implementation, abi compatible
<geist>
no reason, it's just a different set of stuff
<geist>
and a more structured driver and fs api
<heat>
such is the power of lk
<heat>
my kernel is pretty opinionated I think
<heat>
disintegrating the unix personality from the core kernel would end up with a worse end product I think
<geist>
yeah fork is the real test, and the LK vm is not set up for it, etc
<geist>
or at least not set up enough to do it efficiently
<geist>
without some sort of stop the world, iterate over everything and replace with something else style fork
<heat>
I don't think fork() is a big issue
<heat>
all in all it's pretty simple
<geist>
oh gosh no. it gets really complicated when dealing with things like memory mapped files and whatnot
<heat>
go through every private mapping, mark it cow for the child and the parent
<geist>
or like i said, doing it *efficiently* is difficult
<heat>
for shared mappings, just increment page refs
<geist>
even that presumes there's a notion of 'shared mappings'
<geist>
but then i guess you have to due to posix
<geist>
since posix has a more explicit notion of it
<heat>
well yeah you'd need to attach some notion of private vs shared to the vm subsystem
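The walk heat is describing looks roughly like this in C. All of the types and helpers here (struct vm_region, pte_write_protect, install_pte, ...) are hypothetical stand-ins, not any real kernel's API; shared mappings just take a page reference, private ones get write-protected so the next write faults and copies:

    #include <stdint.h>

    #define PAGE_SIZE  4096u
    #define VM_PRIVATE 1u

    /* minimal hypothetical VM types */
    typedef uint64_t pte_t;
    struct vm_region { uintptr_t base, size; unsigned flags; struct vm_region *next; };
    struct addrspace { struct vm_region *regions; /* + page tables */ };

    /* hypothetical helpers a real kernel would provide */
    extern pte_t *lookup_pte(struct addrspace *, uintptr_t va);
    extern void   install_pte(struct addrspace *, uintptr_t va, pte_t);
    extern void   pte_write_protect(struct addrspace *, uintptr_t va);
    extern void   page_ref_inc(pte_t);
    extern void   clone_region_metadata(struct vm_region *, struct addrspace *);

    void fork_vm(struct addrspace *parent, struct addrspace *child)
    {
        for (struct vm_region *r = parent->regions; r; r = r->next) {
            clone_region_metadata(r, child);
            for (uintptr_t va = r->base; va < r->base + r->size; va += PAGE_SIZE) {
                pte_t *pte = lookup_pte(parent, va);
                if (!pte)
                    continue;               /* never faulted in, nothing to copy */
                page_ref_inc(*pte);         /* both spaces now reference the page */
                if (r->flags & VM_PRIVATE)
                    pte_write_protect(parent, va);  /* COW: next write faults and
                                                       copies; shared stays writable */
                install_pte(child, va, *pte);       /* child inherits the (possibly
                                                       now read-only) mapping */
            }
        }
    }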
gog has quit [Ping timeout: 248 seconds]
<heat>
honestly, I don't think the techniques behind fork() have evolved much
<geist>
modern designs i guess you have to do vfork() and whatnot
<geist>
it's not that fork is hard, fork is easy if you design the VM around it
<geist>
but if you dont have a VM designed around it it gets somewhat more difficult, is i guess my point
<geist>
and i currently do not have the LK/zircon vm aligned in that way
<geist>
otoh the LK vm is very very simple now, so it's not hard to go in that direction
<geist>
it's really nothing more than a simple PMM + some ability to map (without demand faulting) some pages to a virtual spot
<heat>
yeah ofc you need some functionality you wouldn't need otherwise
<geist>
no sharing, no mmap files, no demand paging. so it could be evolved in that direction
<heat>
vfork() is easy, I think that would be the easiest part
<heat>
even fuchsia could do that
<heat>
if a kernel has any notion of processes sharing address spaces, it Just Works(tm)
<heat>
probably the hardest UNIX idea to retrofit would be signals
<heat>
because that infects your whole kernel
<klange>
I spent a bunch of time revamping my signal implementation.
<raggi>
signals really need ttys and sessions so that expands into a whole historical mess
<geist>
yeah i had a basic signal + job control system in newos that was a pain
<geist>
was my first real experience with it
<heat>
I do wonder if you could make an almost-fully interruptible kernel with stack unwinding
<geist>
that's actually the biggest thing i gotta do with LK before i get too far with a user space: add mutex unwinding for thread termination
<geist>
currently there's no such thing, but it was one of the first big retrofits i had to do for zircon
<geist>
would have to do it again
<heat>
what's that?
<geist>
unwinding threads that are blocked on something because you want to kill them
<geist>
there is no thread_kill() in LK, somewhat on purpose. but when you have a user space and something kills a process
<geist>
you have to stop all the threads owned by the process that are blocked on something (say waiting on a device read or whatnot)
<geist>
so to do that you probably need a generalized way to forcibly unblock a thread and allow it to unwind from the deep kernel out to the syscall interface. not rocket science, but would have to retrofit the LK code for it
<klange>
I was trying to naïvely do stuff with that before I did my signal rewrite. toaru32 signal handling was... sort of actually doing signals in the kernel. It would save the stack state for the kernel thread and go off and do something else. It was bad.
<geist>
yah
<geist>
for zircon (retrofitted from LK) we just mark most blocking events as interruptible, and the thread structure remembers what wait queue it's blocked on (can only be blocked on one by definition)
<klange>
Now I do the "signals are only handled on normal return-to-userspace" thing, with delivered-and-not-ignored signals tripping sleeping threads to resume with an 'interrupted' bit.
<geist>
so it's a matter of forcibly removing the thread from the wait queue and arranging it to return from its blocking event with a custom error code (ERR_INTERRUPTED, etc)
<geist>
and then make sure all the code leading up to it knows how to unwind the state
<klange>
I even did restartable system calls!
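Put together, the scheme geist and klange describe is: a killed or signalled thread is forcibly pulled off its wait queue, unwinds out of the kernel with an "interrupted" error, and actual signal delivery happens only at the return-to-userspace boundary. A hedged sketch with hypothetical names (finish_syscall, ERR_INTERRUPTED as a kernel-internal code, and so on):

    #include <errno.h>

    /* hypothetical kernel-internal names */
    #define ERR_INTERRUPTED  (-99)
    #define SYSCALL_INSN_LEN 2          /* e.g. x86 "syscall" */

    struct thread;
    extern int  has_pending_signal(struct thread *);
    extern int  signal_wants_restart(struct thread *);  /* SA_RESTART etc. */
    extern void setup_signal_frame(struct thread *);    /* push handler ctx */
    extern void rewind_user_ip(struct thread *, int bytes);

    /* runs on every path back to user mode */
    long finish_syscall(struct thread *t, long ret)
    {
        if (ret == ERR_INTERRUPTED) {
            /* the thread was yanked off its wait queue and unwound to here */
            if (signal_wants_restart(t))
                rewind_user_ip(t, SYSCALL_INSN_LEN);  /* re-issue after handler */
            else
                ret = -EINTR;                         /* what userspace sees */
        }
        /* signals are only ever delivered here, never at arbitrary points */
        if (has_pending_signal(t))
            setup_signal_frame(t);
        return ret;
    }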
<heat>
that's very linuxy yeah
<heat>
although linux has an important distinction between normal signals and SIGKILL
<geist>
yah it's one of these 'do i want to inflict that upon LK' sort of design things that i'd have to think about a bit
<geist>
since most code does not need it
<heat>
namely, you can specify a waiting state where you can be killed (interrupted by a SIGKILL) but not interrupted
<klange>
It's probably easier to think about when designing things than "can I just arbitrarily lose this lock and be descheduled".
<geist>
yah at the minimum having it be an opt in: 'block on this interruptibly' means the blocking code is prepared to handle the special error case
<geist>
and then it's a matter of finding most places where things really block for long times (usually waiting on IO, etc) and making sure they can deal with it
<heat>
per tradition, UNIX kernels aren't interruptible on a lot of IO
<heat>
disk writes, etc aren't interruptible
<raggi>
Kill traditionally still works
<mrvn>
I sometimes think threads should have a "critical section" bit the user space can set. Then when scheduling a process only threads with the bit set may run at first.
<heat>
you don't schedule processes, you schedule threads
<heat>
or tasks which are like processes but can be threads or just really weird processes
<heat>
its hard :|
<mrvn>
heat: do you? Then one process can easily swamp the system with threads to DoS everyone else.
<heat>
yes, you do
<mats1>
yes
<heat>
i think geist mentioned before that windows is an outlier in that regard
<heat>
in that it schedules processes
<mrvn>
no, everyone does. scheduling in multiple levels can be quite useful
<heat>
you should check your local linux and freebsd copies
<mrvn>
linux doesn't really have the concept of process and thread. It's all just tasks that may or may not share namespaces.
<heat>
exactly
<mrvn>
doesn't mean everyone does that or that it's even a good idea.
<heat>
(well, it kind of does, it's called a thread group, but whatever, doesn't matter here)
<heat>
freebsd (and most BSDs?) also do it like that
<heat>
zircon also schedules threads, but has a distinction between threads and processes
<mrvn>
unix in general
<mrvn>
their design kind of predates threads
<heat>
anyway, linux has a similarish concept called restartable sequences
<heat>
it's like a critical section but without the DoS
<heat>
you give it a range of instructions, and if you get preempted in the middle of it, you get restarted back to the beginning
<mrvn>
heat: does that just save the address in the vdso or does it need a syscall?
<heat>
syscall
<heat>
well, you set them up in advance
<heat>
you don't syscall when entering and exiting
<mrvn>
So on every task switch the kernel has to look through the process's list of restartable sequences to see if the PC is in any one of them?
<heat>
no, it doesn't use lists
<heat>
its kind of complicated and I honestly don't understand the specifics very well
<mrvn>
.oO(obviously, linked list would be O(n))
<mrvn>
you can probably use some mix of hashtable and tree to make it O(1) in most cases.
<heat>
ok so reading through the man page
<mrvn>
what's it called?
<geist>
it can probably see it at preempt time and rewind it then
<geist>
or at the time it was rescheduled
<heat>
you register a structure, the kernel looks at that structure (directly from user space), that structure then has a pointer to the current code sequence (which you set when entering the rseq)
<geist>
which would make it much easier at task switch time, it would just need to look to see if it had been registered
<heat>
you can't even do syscalls
<geist>
oh this is a user space thing, interesting
<heat>
it's very much just a neat little thing for highly optimized stuff
<heat>
tcmalloc for instance is a user of it
<heat>
for percpu thread cache lists
<mrvn>
"Each critical section need to be implemented in assembly." that sucks
<mrvn>
Looks like librseq defines a number of atomic operations (e.g. rseq_cmpeqv_storev) that you can then call from your own code. You aren't meant to write your own asm so much.
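For reference, registration itself is one syscall, per rseq(2): you hand the kernel a struct rseq that it reads directly from user memory, and userspace points rseq_cs at a descriptor for the current critical section on entry. A minimal sketch (note that recent glibc registers its own struct rseq at startup, so a second registration can fail with EBUSY):

    #define _GNU_SOURCE
    #include <linux/rseq.h>      /* struct rseq */
    #include <sys/syscall.h>
    #include <unistd.h>

    #define RSEQ_SIG 0x53053053  /* arbitrary signature; must match the
                                    signature placed before each abort handler */

    static __thread struct rseq rs __attribute__((aligned(32)));

    int register_rseq(void)
    {
        /* the kernel inspects rs directly from user memory on preemption;
           before entering a critical section, userspace stores a pointer
           to that section's descriptor in rs.rseq_cs */
        return syscall(SYS_rseq, &rs, sizeof(rs), 0, RSEQ_SIG);
    }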
heat has quit [Ping timeout: 256 seconds]
[itchyjunk] has quit [Remote host closed the connection]
<mrvn>
geist: in a kernel it would be a bit useless as you could just disable IRQs
<geist>
not unless what you wanted to do is expensive
<geist>
or something you expect to fault
<geist>
but yes.
<geist>
in general if it's a short critical section you just disable preemption one way or another
<geist>
if what you want is something like 'i want to do this potentially expensive idempotent operation in a critical way, but i dont want to actually disable preemption, so it's okay to just rewind me to the start if you do desire to preempt'
<remexre>
are there any nice "posixy subsets" that have a more modern select alternative? Was looking at WASI, but looks like its poll_oneoff is basically just select(), performance problems and all
xenos1984 has quit [Read error: Connection reset by peer]
<mrvn>
you could go half way, not disable IRQs but disable task switching.
<mrvn>
when I see "potentially expensive idempotent operation" then what I read is "I don't want to do this twice, it's too expensive for that"
<moon-child>
> wasi
<moon-child>
> nice
<moon-child>
pick one
<geist>
oh yeah re: posix, forgot about select() and poll()
<geist>
those are also a pain
<remexre>
yeah, I'd never worked with it before, but I was at least _aware_ of it -- looking for other things to be aware of instead
<moon-child>
but is there a problem with just stealing epoll or kqueue?
<geist>
if you're doing posix you have to at least implement the old ones
<geist>
you can't just pick only the new bits (no fork, just vfork. no select, only epoll, etc)
<geist>
or at least you can but you will only get a subset of working stuff
<moon-child>
sure. But implementing anything 'new and nice' will get you even fewer working applications
<remexre>
yeah, but select can at least be emulated with epoll, and afaik faithfully; similarly, fork and vfork with clone (not that I want clone...)
<moon-child>
if you want legacy compat, you gotta do the legwork. Or don't, which is completely valid
<remexre>
I don't think there's a problem with stealing epoll or kqueue (all of this is emulated anyway), but I'm more looking for a subset that's complete enough to get more than just coreutils working
<moon-child>
I mean
<geist>
really i guess klange or sortie are the best to ask here, being that they have actually trod this particular path before us
<geist>
and what their experience has been
<moon-child>
I feel like the answer there is: look at what you want to get working, and see what they need
<geist>
yah that's my guess
<moon-child>
repeat until all the compile errors go away
<geist>
plus kinda fun anyway. if you stress about all the things you might need you will just despair
xenos1984 has joined #osdev
<mrvn>
I always find myself having one problem with epoll. I want edge trigger for writing and level trigger for reading. So every socket needs 2 FDs.
<mrvn>
remexre: also look at async IO and uring
<klange>
added a quick dumb glyph image cache for the terminal, really helping a lot there; doing a generic one for regular text will take some thought
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<clever>
mrvn: what if you just remove an FD from epoll when its writable, and store it in your own internal list
<clever>
and if it ever returns EWOULDBLOCK, move it back to epoll?
<mrvn>
clever: that's an expensive syscall
<clever>
and being level triggered, it becoming writable before you epoll_wait() again, will solve itself
<remexre>
mrvn: yeah, my actual syscalls are much more uring-shaped, plus a dash of objcaps
<mrvn>
better to do edge trigger and keep track of the level manually
<clever>
mrvn: but how do you know the depth of the buffers?
<mrvn>
what depth?
<clever>
so you know how many bytes you can safely write
<mrvn>
you just write non-blocking
<clever>
or do you just write as much as you want, and let the kernel do a partial write?
<clever>
and EWOULDBLOCK when it cant do any
<mrvn>
when you get a short write you know you hit the full level and then you wait for the edge to fall again
<clever>
and that edge trigger, i assume means it has some room?
<mrvn>
yep.
<clever>
so you can constantly top it up, and never let the fifo run dry
<mrvn>
the complex case is when you don't have enough to write. You have to track the "can still write" level.
<clever>
yeah
<mrvn>
or more to the point the "still have stuff to read" level.
kof123 has quit [Ping timeout: 268 seconds]
<mrvn>
What I want to do is "epoll" and then read one request from every FD with data and repeat. But if an FD has 2 requests pending the edge trigger won't trigger again on the second epoll.
<mrvn>
You can read as much data from each FD as present but then one FD can DOS you.
<clever>
i think i saw it in the LK pl011 uart code, but when you get an rx interrupt, you dont just read 1 byte, you keep reading until the fifo reports its empty
<clever>
and translating that to epoll, just keep reading until EWOULDBLOCK, and then move to the next FD
<mrvn>
So for reading you have to keep a list of FDs that haven't returned EWOULDBLOCK, do an epoll with 0 timeout (don't block) and merge the two lists.
<clever>
but i can see some risk for a race condition here, what if more data arrives before you epoll again? does the edge trigger buffer up?
<mrvn>
clever: if you get EWOULDBLOCK, then epoll triggers again
<clever>
ah, but you just reminded me of a complex case
<clever>
you cant just read everything, in all cases
<clever>
if you're doing something like an nginx reverse proxy, the write channel blocking can stop you from reading the read channel
<clever>
and then you have to leave data un-read
<clever>
and now it wont edge trigger again, so yeah, you have to track that
<mrvn>
yeah, that's where level trigger on read stops working.
<mrvn>
if you want to do read throttling then you need edge trigger and your own logic for levels.
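The edge-triggered-write half of what mrvn describes tends to look like this: register EPOLLOUT | EPOLLET once, track "writable" yourself, and only clear it when a write comes up short. A hypothetical sketch, not anyone's production event loop:

    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <unistd.h>

    struct conn {
        int  fd;
        bool writable;   /* last EPOLLOUT edge seen, no short write since */
    };

    /* flush as much as possible; stop (and wait for the next edge) on EAGAIN */
    static size_t try_flush(struct conn *c, const char *buf, size_t len)
    {
        size_t done = 0;
        while (c->writable && done < len) {
            ssize_t n = write(c->fd, buf + done, len - done);
            if (n < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    c->writable = false;   /* buffer full: arm for next edge */
                break;                     /* real errors handled elsewhere */
            }
            done += (size_t)n;
        }
        return done;   /* caller queues the remainder, if any */
    }

    /* event-loop side (fd registered once with EPOLLIN|EPOLLOUT|EPOLLET):
       if (ev.events & EPOLLOUT) { c->writable = true; flush queued data; } */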
<bslsk05>
github.com: rpi-open-firmware/uart-manager.cpp at master · librerpi/rpi-open-firmware · GitHub
<clever>
this is the only time ive actually written my own epoll code
<clever>
lets see, it watches 3 FD's, the /dev/ttyUSB0, stdin, and a signalfd so i dont have to deal with volatiles on ctrl+c
<clever>
but there is no read throttling, and no watching the write side
<mrvn>
using a eventfd to watch the epoll_fd for activity is fun too.
<clever>
if anything comes from /dev/ttyUSB0, just copy it to stdout, stdout is always faster
<mrvn>
or selecting on the epoll_fd
<clever>
if anything comes from stdin, its a slow-ass human, he can never fill a 115200 baud buffer :P
GeDaMo has joined #osdev
<mrvn>
hehe, until you have foobar | uart-thing
<clever>
but now that i look over it, the xmodem side
<clever>
this program does 2 things a normal serial console doesnt
<clever>
1: it pulls DTR high on startup, and low on ctrl+c, wire that to the reset/run pin, and now code can only ever run when being watched
<clever>
2: if it detects the string "press X to stop autoboot", it will send an X over the uart, and then xmodem transmit an entire file, then stdin goes back into control
<clever>
mrvn: if i did `foobar | uart-thing`, then i cant use stdin afterwards, at least not without opening /dev/tty
<mrvn>
clever: I just meant that stdin isn't always human.
<clever>
yaah
<clever>
but it being something else, would defeat most of the point of this program
<clever>
ah, and i see why xmodem never has to deal with writes blocking
<clever>
it basically has a window-size of 256 bytes
<clever>
it just sends a single block, and then waits for an ack/nak
<mrvn>
was that bidirectional?
<clever>
its setup to transfer a .elf file from the host pc to a remote cpu over serial
<mrvn>
but it's only a problem if you have a higher half kernel and it happens in boot.S before mapping the higher half.
<clever>
c40036d6: 03 46 sub r3,r0
<clever>
c40036d8: 13 7a lsr r3,0x1
<mrvn>
clever: ARM64?
<clever>
c40036da: 30 42 add r0,r3
<clever>
c40036dc: c0 7a lsr r0,0xc
<clever>
VPU
<mrvn>
well, no idea what mul/div/mod functions the VPU has in libgcc.
<clever>
this never called libgcc
<clever>
it did the entire thing in asm
<mrvn>
I thought that was the arm side.
<clever>
arm.c is responsible for turning the arm core on
<clever>
and enabling peripherals that the arm side lacks drivers for
<clever>
ok, so the first thing the above asm does, is load the pllc_core0 freq (324MHz) and a magic constant of 0x57a05c9f, and then does a high-side mult
<clever>
i believe that will result in 0x69c3ba35093c700 and then truncate it down to 0x69c3ba3
<clever>
then i get lost in why its doing all of those shifts
<clever>
ive seen some magic tricks before (and forgot the name) which let gcc do division with mults
genpaku has quit [Remote host closed the connection]
<clever>
if it knows the divisor ahead of time
<mrvn>
clever: some cases will be off by one. the shifts do a correction for those
genpaku has joined #osdev
<mrvn>
clever: it's (x * ((1<<N)/z)) >> N
<clever>
ah, its recovering the wrong bit0, that makes the shifts and adds make a bit more sense
<clever>
ah yes, that reminds me, the 3d core uses 1/w for some stuff
<clever>
and thats probably what gcc is doing here
<clever>
and like you said, 1<<N lets you do that in a fixed-point form
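The trick clever can't remember the name of is division by invariant multiplication (multiplying by a fixed-point reciprocal): for a known divisor z, precompute m ≈ 2^N / z and replace x / z with a multiply-high plus shifts, with the small sub/add/lsr correction seen in the disassembly above. A self-checking illustration for z = 10, where the classic magic constant happens to be exact for every uint32_t:

    #include <assert.h>
    #include <stdint.h>

    static uint32_t div10(uint32_t x)
    {
        /* m = ceil(2^35 / 10) = 0xCCCCCCCD; exact for all 32-bit x */
        return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    }

    int main(void)
    {
        for (uint64_t x = 0; x <= 0xFFFFFFFFu; x += 9973)   /* sampled check */
            assert(div10((uint32_t)x) == (uint32_t)x / 10);
        return 0;
    }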
<mrvn>
except lr and r3 are constants. why doesn't the compiler do that at compile time?
<mrvn>
oops, ld, never mind
<mrvn>
I was reading "lea"
<clever>
freq_pllc_core0 is kinda a constant, but you would need whole program LTO to discover that
<bslsk05>
github.com: lk-overlay/pll_control.c at master · librerpi/lk-overlay · GitHub
<clever>
and after setting the PLL up, it computes what the freq should be based on those, and stores it in a global
<clever>
and with enough LTO, you could optimize that variable out, and turn it into a constant
<mrvn>
it's worse if you get the frequency from the DT
<clever>
yeah, at that point you cant optimize it out
<mrvn>
nor multiply by 1/z
<clever>
only if the 25mhz goal comes from DT
<mrvn>
you can do a lookup table though if you really don't want to do division.
<clever>
the VPU does also have a 32bit division opcode, takes about 20 clock cycles
<clever>
but these are 64bit ints
<clever>
which leads into a bug the official firmware had: for the longest time, the config parser used signed 32bit ints
<clever>
so anything over 2.147ghz, would overflow
<clever>
and it was impossible to request an overclock above that
<clever>
it wasnt until the pi400 came out, that the SoC could reliably overclock that hard, and the bug was discovered
<clever>
mrvn: as for why i even need this 25mhz clock, thats because RPF decided to shave a few cents off the production cost!
<clever>
the chip that does usbhub and usbNIC, needs a 25mhz crystal, or a 25mhz square wave to function
<clever>
and RPF swapped the crystal out for a GPCLK from the main SoC
<mrvn>
at a 100 million boards that's a few million bucks.
jimbzy has quit [Ping timeout: 252 seconds]
<clever>
mrvn: exactly
<mrvn>
At least it's something you can work without. Not like the memory-protection-unit they removed from the Amiga back when. Couldn't separate processes from each other so they didn't include that in the OS design. Then later when MMU came about you couldn't put it back into the OS.
<clever>
there is also the vl805 spi flash chip, which is a grey zone
<mrvn>
anyone here using a global address space for their OS?
<clever>
you can do without it, but you need to modify your xhci driver some, and then its no longer obeying the xhci specs
<mrvn>
that's just incremental growth. I want exponential.
<netbsduser>
mrvn: interestingly some members of the amiga community developed a sort of cult following to the lack of MMU
<netbsduser>
despite the father of the amiga Jay Miner recognising it as a demerit, and designing one into the (rejected by Commodore) amiga ranger which would've been released around 1987
<Ermine>
Otoh they have the fastest microkernel possible
<mrvn>
and it's OO
<mrvn>
libraries are objects with a vtable.
<netbsduser>
what is striking about Amiga Workbench is that it's 'different' in how it feels in some ways to modern OSes, but it fundamentally is a modern OS, nothing like the typical "just a BASIC interpreter" ROMs of other home computers of its era, nor even like DOS
<mrvn>
the libraries even have versioned symbols
<netbsduser>
well, fundamentally a modern OS minus virtual memory
<mrvn>
netbsduser: and all that in 256k rom
<mrvn>
or 512k later
<netbsduser>
mrvn: i love how they used OOP to great effect in things like datatypes.library, which lets an amiga app written in 1991 that wants to deal with "some kind of image" seamlessly handle a WebP or PNG implemented as a datatype by someone decades later
<netbsduser>
this is one area where workbench even *beats* most modern OSes
<mrvn>
yeah. They had the concept of plugins decades ahead of everyone else
<mrvn>
any code that used datatypes to load sound can also play mp3 or aac
<mrvn>
I think the only thing missing was a video data type.
<netbsduser>
at such a universal level, too, i think the majority of programs use datatypes.library. the ARexx port is another nicety, though for some bizarre reason, nowadays it seems it's become unfashionable for most programs to let themselves be scripted
gxt has quit [Ping timeout: 268 seconds]
gxt has joined #osdev
<mrvn>
netbsduser: if the kernel came with code to handle image and sound formats for you, would you implement your own?
<mrvn>
datatypes was something they added in 2.0. older code didn't have it.
<mrvn>
and games didn't use it much at all I think because a) they didn't have their data in such nice, readable formats. b) they didn't run on the workbench.
<mrvn>
lots of custom GUIs in games.
<clever>
mrvn: the windows kernel had font rendering support, complete with buffer overflow exploits and a turing complete font language, lol
<clever>
hence, nobody runs untrusted fonts thru it
nyah has joined #osdev
foudfou has joined #osdev
foudfou_ has quit [Ping timeout: 268 seconds]
gxt has quit [Remote host closed the connection]
gxt has joined #osdev
ecs has quit [Remote host closed the connection]
ecs has joined #osdev
ecs has quit [Remote host closed the connection]
ecs has joined #osdev
the_lanetly_052 has joined #osdev
<kazinsal>
There's still an official HTTP server kernel module for Windows
<clever>
wut
<kazinsal>
IIS in fact uses it as the low-level HTTP/HTTPS responder/filter
<kazinsal>
The complicated stuff all happens in userspace but requests are grabbed by http.sys and if possible responded to without passing them on to user mode
<clever>
kazinsal: thats just crazy
<kazinsal>
Yeah it's rad
<clever>
kazinsal: about the only reason i can see that being a benefit, is response time for static files in the fs cache?
xenos1984 has quit [Read error: Connection reset by peer]
<Ermine>
Need more funky kernel modules
<clever>
no need to context switch, just slap some IP and http headers on it, and stream the fs cache right into the NIC with scatter-gather
<kazinsal>
That's pretty much what they list as the benefits for it
<kazinsal>
Cached requests basically get turned right around immediately
gxt has quit [Ping timeout: 268 seconds]
xenos1984 has joined #osdev
gxt has joined #osdev
\Test_User has quit [Ping timeout: 252 seconds]
kof123 has joined #osdev
<jjuran>
mrvn: I develop two OSes with a global address space, though neither runs on bare metal.
vdamewood has joined #osdev
sm2n has quit [Ping timeout: 244 seconds]
sm2n has joined #osdev
wootehfoot has joined #osdev
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<bslsk05>
github.com: llvm-project/libc at main · llvm/llvm-project · GitHub
<mjg>
gotta check string routines
<zid>
Looks.. kinda like what i'd expect
<zid>
it's very basic
<zid>
cos() is just fcos etc
<heat>
mjg, they're doing it in C++ with intrinsics
<heat>
that's just nasty
<LittleFox>
heat, musl is strictly linux only iirc
<heat>
well ofc
<heat>
you need to adapt it to your own needs
<heat>
that's what I did
<bauen1>
LittleFox: but it's easy to start hacking away on it (and the code is much easier to understand and modify than newlib from my experience)
<heat>
>the code is much easier to understand
<heat>
>musl
<heat>
does not compute
* kof123
gives heat fonz-like hit
<zid>
musl's code is amazing
<kof123>
*elbow
<zid>
we spent an hour trying to figure out what one function was doing here between 4 of us
<bauen1>
heat: I didn't exactly find newlib code to be particularly nice
<LittleFox>
I read that linux-only as not caring about portability to other OS', but that's not the case then?
<heat>
bauen1, but musl is not it
<LittleFox>
newlib is ... something, but not nice
<heat>
LittleFox, and it doesn't care, but everything is hackable
<LittleFox>
hmhm
<heat>
fuchsia also used musl
<LittleFox>
mahk, pointed me to mlibc which I forgot existed
<LittleFox>
the one from managarm
<LittleFox>
fuchsia uses musl?
<heat>
now it uses a hybrid of musl and llvm
gelatram has joined #osdev
<heat>
yes
<raggi>
fuchsias musl is heavily forked
<heat>
correct
<heat>
I think it can be best described as a hybrid of musl, fuchsia and llvm
<raggi>
musl will treat you well as being easy to integrate and mostly correct. It won't provide highly competitive performance
<raggi>
Fuchsia will probably move to llvm once it's mature enough
gelatram has quit [Quit: Client closed]
<LittleFox>
llvm looks extremely incomplete still, like, am I blind or is there no malloc yet?
gelatram has joined #osdev
<raggi>
Llvm has a portable secure malloc already
<LittleFox>
ok, I'm blind then ^^
<heat>
right, you're probably supposed to use scudo
<raggi>
Yeah, scudo, I was drawing a blank on the name for a minute
<raggi>
LittleFox: 6/half dozen, you're probably right it's not wired up out the box
<heat>
i think llvm will take a substantial amount of time to get complete and stable enough
<LittleFox>
already have a forked llvm as submodule in my repo, would be great to stay in that ecosystem
<heat>
(and the concept of "toolchain-official libc" is also dubious)
<raggi>
It's definitely not complete yet, I recently needed a better performing sysroot than musl and I looked at it disappointed and made a static safe glibc instead :-(
<heat>
raggi, what do you find slow?
<heat>
malloc is noticeably slow but that can be replaced pretty trivially
<raggi>
heat: I kinda disagree. Having the toolchain and libc agree on what becomes intrinsics and how is no bad thing, and llvm will make both portable and easily crossable which is a breath of fresh air frankly
<raggi>
heat: there are no native implementations for lots of stuff, the use case I had recently was static SQLite embedded into a hosted runtime, and the libc switch cost nearly 40% on a fairly heavy workload, which was more than we could accept
<heat>
yeah but *what*
<raggi>
malloc was one, that'd be easy to fix, but also mem*, str*, etc
<heat>
right, string functions, those are hard
<raggi>
It's perhaps not really fair to accuse musl of being slow, and more fair to say glibc really does have fast impls
<heat>
the problem with musl is that there's no MIT-like-licensed string ops
<heat>
maybe bionic's. I was going to try them out today or so
<heat>
but those are still old. 2014 or so
<raggi>
Bionic had slow ones for a long time, I did see some fast impls going in recently, I did a quick pass too :-)
<heat>
there's a patch going around #musl for an AMD memcpy implementation - but it's GPLv2
<raggi>
I'm sure some profs could be sniped into writing some
<heat>
of course it beats the shit out of musl's horrible memcpy
<raggi>
Yep
<raggi>
But e.g. this is where llvm can just do this
<heat>
they're using rep movsq, they don't even use ERMS...
<heat>
can it? it's just SIMD in C++
<heat>
there's no tuning
<raggi>
It'll take a while though, as they also want to intentionally try to reduce the reliance on dropping to asm, and instead reach for competitive performance with compiler work
<heat>
just SIMD instructions in the middle of whatever shitpaste the compiler gives you
<raggi>
I mean the project as a whole, not the vm :-)
<raggi>
That group of people will get it done
<heat>
maybe
<raggi>
fair
<raggi>
I have hopes
<heat>
but they could've just contributed to one of the libcs instead
<raggi>
Debatable
<heat>
and *I know* that musl people are a bit stubborn
<raggi>
Rich got extremely vocally angry at fuchsia over the musl fork
<heat>
even then, forking
<heat>
Rich got angry because he didn't see changes back
<heat>
now, I've heard that you folks tried to upstream stuff and it didn't go well. I don't have the full story
<LittleFox>
maybe not declare something as linux-only when you want changes from os-specific forks back?
<raggi>
But there's a fairly fundamental goal difference here with integrating well with the toolchain to solve problems like performance, that's hard to put into an intentionally language spec portable impl
wootehfoot has joined #osdev
<raggi>
A lot of the fuchsia fork things have no place in upstream, like deleting entire dirs of stuff it doesn't need or want
<raggi>
Or removing stuff that's insecure by specification / making it obviously broken (like rand)
<raggi>
Those are strong project local opinions
<heat>
LittleFox, it being blatantly linux-only is what makes the project simpler
<heat>
well, one of the contributing factors
<raggi>
Anyway, whatever the triggers were, getting publicly angsty wasn't a good way to create change :-)
<raggi>
Though it was noticed, I remember people jumping in to watch, and some discussions were had
<heat>
probably. but forking something was pretty doable
<heat>
or, I don't know, use bionic ;)
<heat>
musl is the second largest libc in the GNU/linux space and it's severely understaffed
saltd has quit [Quit: joins libera]
<heat>
C standard libraries are hard :|
<raggi>
the judgement was made, before my time, that bionic would have been more work to fork/port, and while I haven't looked that closely I can totally believe that to be true
<raggi>
Yeah, I would have liked to have seen more investment back to musl in some form, that's also not always as easy as it could be
<raggi>
The fact that only individuals show up on the big donations list is a bit sad
frkzoid has joined #osdev
<heat>
true
<heat>
wasn't it freebsd who got a 3000 USD donation from apple?
<heat>
sorry, $100-249
<mjg>
that would be funny
<heat>
anyway, friendly reminder that most open source projects are severely underfunded or unmaintained
<heat>
I don't understand what's going on or how they're doing things
<heat>
all I know is that this is definitely not ERMS
<heat>
a straight up rep movsb (no extra logic) is half as fast as glibc memcpy
<heat>
mybe its prefetching?
<GeDaMo>
I would have expected movsb to do that too
Matt|home has joined #osdev
<GeDaMo>
Could it be using page mapping?
<heat>
like moving pages?
<heat>
that would be slow af
brynet has joined #osdev
<zid>
huzzah, my code is finally confirmed to work on a real playstation
<zid>
It basically worked before but the person I had test it happened to be using a disc where my DMA source address was all zeros so it looked like it failed
<bslsk05>
sourceware.org: Szabolcs Nagy - using IFUNC outside the libc
<geist>
i've had a fairly hard time convincing folks at work over the years that something other than intel exists on x86. at the time someone replaced it, they were like 'well all modern x86 cpus are erms, so lets do that'
<geist>
but OTOH it was a general improvement over what was there
<bslsk05>
github.com: sabotage/musl-improved-memcpy.patch at master · sabotage-linux/sabotage · GitHub
<moon-child>
geist: iirc zen 3 has erms now. So just a matter of time. Unless you're talking about via or something like that?
<geist>
or goldmont, etc
<heat>
this is super fast but not mergeable due to licensing issues
<geist>
though goldmont better have erms and ferms because it's the efficiency half of an alder lake
<mjg>
heat: honestly i think they should just take raw bionic strings
<mjg>
heat: bsd licensed
<heat>
yes but
<moon-child>
I think we've discussed specialising separately for big/little cores here before
<mjg>
heat: not super optimized, but would be perfectly acceptable
<geist>
okay. haven't been reading the scrollback
<moon-child>
but there's still no os-level affordances for it, and seemingly no one wants to add them
<geist>
as i said, there's a task to do this, but x86 has not gotten a tremendous amount of love on fuchsia
<moon-child>
and yeah bionic is solid
<geist>
or more to the point it runs rings around the arm hardware we also run on
<geist>
so there doesn't tend to be as much of a push to optimize every bit
<moon-child>
haha, well
<geist>
one of the big issues is to properly optimize this stuff you really should run on a wide variety of hardware
<geist>
and one of the issues i generally have with how we (fuchsia and google) tend to go about testing is if you can't continually reproduce it there's a negative impetus to get started
<geist>
so it'll tend to be 'yeah you want to test on 10 different microarchitectures? well, right now we can only run CQ/CI on 2'
<heat>
mjg, the bionic memcpy isn't a win/big win
<moon-child>
heat: hahaha, they have an optimised memcpy, but memmove is still just rep stosq
<heat>
it's slightly better, but that's it. I've seen rep movsq be fast
<heat>
who?
<moon-child>
the musl thing you linked
scoobydoo has quit [Read error: Connection timed out]
<heat>
oh yeah sure, that's just a dollar-store patch
<heat>
not merged, not going to be merged, not mergeable
<heat>
that AMD code is GPLv2
<geist>
put it another way, since we're so testing focused (not bad), changes that are performance oriented tend to require a corresponding performance regression benchmark get added to the fleet
<geist>
and we currently dont have the capability to run on a variety of microarchitectures
<geist>
so it tends to be 'ugh, lemme work on something else'
<geist>
vs just getting it done and then hoping it doesn't regress later
scoobydoo has joined #osdev
<geist>
whereas i'd assume something like freebsd would just be yolo, get it done
<geist>
which is refreshing
<geist>
i mean not yolo as in completely unstructured, but i would assume someone that does a fair amount of due diligence on their own isn't required to maintain a regression tester that runs daily forever more
<mjg>
tru.dat
<mjg>
heat: it has some issues for sure, which is why i did not import it into freebsd
<mjg>
heat: ... yet. when i have time i'm going to patch it up
<mjg>
i think key is that musl cannot runtime check for any extensions
<zid>
oh that was meant for elswhere, but you can have it too
<mjg>
is that zoomer humor?
<zid>
no, millennial
<moon-child>
zid, do you have two monitors and a light-up keyboard
<klange>
lol, look at this loser with only two monitors
<zid>
I do
<geist>
oh daaaamn two monitors
<geist>
gosh i remember when that was so exotic. way back in the dos days the only way to do it was vga + monochrome
<geist>
(and yes i'm sure amiga did it like 35 years earlier than that)
<zid>
I mean, my setup isn't much different
<zid>
one is a hd lcd, the other is a crt
<mjg>
crt?
<mjg>
are you playing old games or something?
<zid>
cathode ray tomato
<zid>
oh
<zid>
I do do that with it yes, but it's also just my old main monitor
<zid>
good enough to throw a video onto or whatever
<zid>
or another pdf
<geist>
yah for a long time there they were still pretty coveted for having better color reproduction too than first or second gen LCDs
<geist>
i think in general against an IPS or so the line was crossed. but i also remember first gen color LCDs were washed out crap
<zid>
They still have basically 0 input latency
<geist>
side note it seems that monitor class OLEDs may be finally starting to happen, though they're of course really pricey
<geist>
yah that too
<zid>
so still superior for games as long as you don't have unlocked fps PC titles
<geist>
well, i wouldn't quite say superior, but they hold their own
<zid>
in terms of performance
<geist>
there are basically zero latency good gaming LCDs too
<geist>
but then those tend to be TN, which look worse, etc etc
<zid>
There are *low* latency LCDs, there are no 0 latency LCDs
<geist>
though, if you can then run the LCD at a higher refresh rate you can make up for it
<zid>
yea, I mentioned
<geist>
since an LCD running at twice the hz, even if it takes up to a frame, is still faster than a 60
<zid>
it's better to have an LCD if you're running unlocked fps PC stuff
<zid>
but if you're playing a console game or whatever, CRT still wins
<zid>
I have an arcade game I play on it a fair amount
<geist>
oh 100% especially if it's a retro console. there was that period there with the advent of 3d gaming where it was really relying on the CRT to smear out the jagged lines or whatnot
<zid>
I still want a GDM-FW900
<geist>
yah though i didn't have that one, i did have a sony 19" for a long while and it was lovely
<geist>
replaced the 17" also nice viewsonic (i think) i had before
<bslsk05>
www.ebay.com: Sony GDM-FW900 CRT Monitor for sale online | eBay
<geist>
i remember the trinitron being the shit
<zid>
I've actually seen them for $50, collection only
<geist>
i'm not sad i got rid of it though, they were so huge
<zid>
and also for yea, 4000
<zid>
depending on if they realize what it is and market it well
<geist>
the last trinitron i had on hand was a SGI Indy monitor, got rid of it in 2015 or so before moving
<zid>
my CRT is infact, a trinitron
<geist>
i think it was 21"
<zid>
basically all hdcrts are technically trinitrons though, I think either they buy the licence or it fell out of patent, not sure which
<geist>
not so sure about that, i think sony was pretty hard about not licensing it
<zid>
in the 80s and 90s definitely
<geist>
but maybe very late on either they were the only maker or as you say people copied it
<geist>
yah
<zid>
CRTs didn't stop being made until like 2012
<geist>
i vaguely remember the first LCD i got was a viewsonic in about 2001. and it was terrible, i returned it
<geist>
but 2 or 3 years later they started to get better enough that they were acceptable
<geist>
probably 2007 or so? I still have the early Dell monitor i got
heat has quit [Ping timeout: 256 seconds]
heat_ has joined #osdev
<zid>
Somewhat counter-intuitively, LCDs tend to die easily
<zid>
because of cheap powersupplies and stuff
<geist>
17" Dell. still a great monitor for hooking up random things since it also does composite and whatnot
<geist>
yah or the backlight in the early gens tended to be a big gas filled thing
<geist>
also early gen LCDs were power hogs
<zid>
They still are tbh
<heat_>
mjg, sse2 is still as fast/slightly slower than rep movsb/q
<zid>
until LED TV was a thing
<zid>
and the LED there literally just means it's backlit by LEDs
<zid>
which is *very* recent
<heat_>
I assume you get the big improvement with a super tuned loop with avx stores
<geist>
nah, not anymore. the ones with lights distributed across the back use mostly nothing sitting still
<zid>
and they *still* die
<geist>
sure, but i think computer monitors have generally done the LED stuff for a while
<zid>
they get warm and the LEDs cook
<heat_>
mjg, re: runtime checks, it's an odd issue. they don't support GNU ifuncs
<geist>
all of my monitor deaths i've experienced have been either cracked screen (fairly fragile, something poked em) or a line or a quadrant just dies
<heat_>
iirc it was a big philosophical question, like most of musl problems really
heat_ is now known as heat
<geist>
re: libc and runtime checks, i also think that's why we've put it off in fuchsia. roland is starting to work on a) getting the runtime patching working nicely and i think b) moving away from musl
<geist>
so i suspect the memcpy/etc updates will come as a result of that
<geist>
i think there's some emphasis on the llvm-libc project. hopefully us putting some energy behind it will accelerate that project
<geist>
it *seems* like a pretty good idea and the design is i think supposed to honor embedded libc needs as well
<heat>
what's the minimum cpu requirement for fuchsia?
<zid>
1
<geist>
so maybe we'll finally get a good modern libc that can also be scaled down properly
<heat>
thank
<geist>
i think the needs of the pigweed project (also under the fuchsia umbrella) is informing that
<heat>
that's under fuchsia?
<geist>
heat: on arm it's armv8.0, 64bit. on x86 it's basically -march=x86-64-v2 and about that line in the sand where cores supported it
<geist>
it's a relatively new categorization that gcc and llvm have standardized on
<zid>
oh sse4
<geist>
yah basically 2nd gen x86-64s
<geist>
so that cuts out very first gen atoms (bonnell, etc) and is about where nehalem came around
<zid>
That's the "non intel core that only appeared in 2 small laptop ranges and nobody has ever actually seen" designation :P
<geist>
which is i think a fairly acceptable line
<heat>
I require Haswell
<heat>
which has, among other things, AVX
<zid>
rip me then
xenos1984 has quit [Read error: Connection reset by peer]
<geist>
also there are things like XN and 1GB pages that zircon intriniscally assumes is present
<zid>
what's haswell got that you need?
<geist>
the 1GB pages we can probably standardize on, but haven't looked super closely at where it precisely showed up
<heat>
avx, 1GB pages (I don't need them, my kernel doesn't assume any of the features really, but they're nice)
<zid>
oh sandy has those
<geist>
XN i think was basically there from day one except maybe some very early P4s
<heat>
I think I use rdrand?
<geist>
also v2 picks up things like cmpxchg16b
<zid>
I do not have rdrand
<geist>
ie, dual word cmpxchg, which wasn't present in K8 (x86-64-v1)
<geist>
there are a few things in the kernel that use it
frkzoid has quit [Ping timeout: 244 seconds]
<heat>
my kernel is pretty cpuid agnostic
<heat>
give it an x86_64 cpu with long mode and it Just Works
<heat>
userspace isn't though
<geist>
yah, LK generally is too, but there are a few features that aren't necessarily implicit (1GB pages is a common one that's easy to just assume is there)
<zid>
I am also cpuid agnostic
<zid>
in that I don't check it
<geist>
and of course the kernel has to opt in for the various AVX context switch stuff
<geist>
also v2 is kinda nice because you can rely on things like popcnt which comes in handy
<geist>
anyway recent clang and gcc both support -march=x86-64-vN as a switch
<geist>
so it's nice to use to set the baseline
<zid>
yea seems like a very good addition
<zid>
considering literally 0 people own a 'core' cpu and not a core2
<mjg>
heat: sse2 is way faster than erms up to a point
<mjg>
heat: and definitely faster than regular movs
<moon-child>
why use rdrand?
<heat>
entropy
<mjg>
heat: ... for the range where they make sense
<heat>
early boot entropy at least
<moon-child>
ehhhh
<geist>
mjg: yeah but i think at the point that really starts to matter you're already tuning something that works well
<geist>
i think most of us are just trying to get the thing to work. whether or not it's tuned, that's gravy
<geist>
and or a fun weekend project
<mjg>
for real man, erms for sizes < 256 or so is straight up garbage
<mjg>
but i think it is time to ditch this subject
<geist>
my general take is the kernel shouldn't copy things around unless it has to, or if it is it's probably user/kernel copies (which tend to be special code anyway) or it's page aligned
<mjg>
want the last word? :)
<moon-child>
heat: sufficiently malicious rdrand can look at what entropy you've already gathered and deliberately poison it
<mjg>
it's not just copy, it is tons of zeroing
<heat>
mjg, my benchmarks show that it's not for "usual erms sizes"
<geist>
sure. trouble is most zeroing i've seen (in zircon at least) is compiler generated
<geist>
and then it does whatever the fuck it wants
<zid>
tinybench shows that I might as well use movs unless I use a super avx version on a big copy
<geist>
or it's page zeroing, in which case you already have a custom routine
<zid>
I get 40GB/s no matter what the hell happens
<geist>
yah totes mcgotes
<heat>
moon-child, oh no?
<mjg>
geist: do you have a way to attach yourself to memset?
<mjg>
and trace generated calls?
<mjg>
it happens a lot on linux and freebsd, and sizes tend to be small
<geist>
in general most of the zeroing is stuff like 'initialize this object to zero' in which case it's implicit
<mjg>
i would be surprised if it did not on fuchsia
<geist>
but a lot of that is because c++
<heat>
moon-child, i'm not sure why my first concern would be "rdrand is fucked" if the cpu is against me
<geist>
that *tends* to be a bunch of movqs
<mjg>
it is a bunch of them if the compiler knows the size
<geist>
which actually drives me nuts to see the codegen is a series of 10-byte movqs with 8 bytes of 0 in the instruction
<mjg>
and it is small-ish
<zid>
Yea small objects almost always just assemble to a couple of inline movqs regardless
<mjg>
when it does not you are screwed
<geist>
*otoh* the cpu probably eats that shit up
<geist>
my risc brain says 'fill in a register and write it out!'
<geist>
but most likely movqs is superior, even if it's huge code
<heat>
movq?
<heat>
the mmx thing?
<geist>
x86-64's 8 byte move
<zid>
mov qword
<zid>
(or movabs if you're feeling constant)
<geist>
*with* an immediate. iirc it's the only way to easily get an 8 byte immediate into a register or directly to memory
<geist>
oh wait, movabs is actually what i was thinking about
<moon-child>
huh I'm surprised that's better than the alternative
<geist>
the compiler tends to spit out a ton of movabs
<moon-child>
given zeroing idiom should just rename
<geist>
moon-child: that's my thought too, but both gcc and clang seem to think movabs is superior
<geist>
and they're probably right, annoyingly
<geist>
or at least they're right in a microbenchmark where icache pressure doesn't matter
<heat>
they probably avoid it due to register allocation?
<moon-child>
mmmm
<geist>
ignoring it chewing up lots of icache i assume the core just flattens a movabs 0 to some specific µop that slams it out
<moon-child>
but again, zeroing idiom should be handled in the frontend, so all you have to retire is a store
<geist>
but yeah agreed. perhaps you have to use -Os to get a zeroed register + store
<moon-child>
probably yeah. I bet it's register allocation, like heat said, but would be nice if it could do it dynamically, depending on if there are free registers
<moon-child>
usually do instruction selection before register allocation. I heard there was some research on doing them at the same time
<geist>
that itself is interesting, that gcc thinks haswell vs skylake is a different thing
<moon-child>
yeah
<moon-child>
rep stos is the same on haswell and skylake according to agner
<geist>
though at 256 bytes it switches to rep stosb
<geist>
so it presumably knows about the 'stosb doesn't get fast until approx 256 bytes' thing that mjg is talking about
<mjg>
stosb does not get fast for a long time, it's just without simd your only option is to do movs and those start sucking real fast
xenos1984 has joined #osdev
<mjg>
so it's the lesser crapper
<heat>
gcc with -Os will make memcpy into rep movsb pretty quickly
<geist>
so really the summary of all of this is 'with erms you should move <256 with a loop and ideally 8 bytes at a time until you can't'
<mjg>
fastest is 32-byte loop afaics
<mjg>
for the < 256 range
<geist>
via 4 stores of 8?
<mjg>
it beats the 16 byte i had and 64 byte does not add speed up
<heat>
how does fsrm work?
<mjg>
yea
<geist>
and then fsrm says just use rep stosb and be done with it?
<mjg>
fsrm claims the startup latency is way lower, but i don't see any numbers
<geist>
so without erms it's back to the usual 'move 32 bytes at a time until you have less than 32 bytes and then whatever?'
<mjg>
for all i know it still remains pessimal up to a range
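The shape being argued about, in one function: a plain 32-byte store loop below the roughly 256-byte break-even point, rep stosb above it. Purely illustrative x86-64 C (it assumes an 8-byte-aligned destination and glosses over the strict-aliasing and head-alignment work a real implementation would do):

    #include <stddef.h>
    #include <stdint.h>

    void *memset_sketch(void *dst, int c, size_t n)
    {
        unsigned char *d = dst;
        uint64_t v = 0x0101010101010101ull * (unsigned char)c;

        if (n >= 256) {
            /* big: the microcoded path wins (ERMS, or FSRM from byte 0) */
            __asm__ volatile("rep stosb"
                             : "+D"(d), "+c"(n)
                             : "a"(c)
                             : "memory");
            return dst;
        }
        while (n >= 32) {                  /* 32 bytes per iteration */
            ((uint64_t *)d)[0] = v;
            ((uint64_t *)d)[1] = v;
            ((uint64_t *)d)[2] = v;
            ((uint64_t *)d)[3] = v;
            d += 32;
            n -= 32;
        }
        while (n--)                        /* tail; a tuned version would use
                                              the overlapping-store trick */
            *d++ = (unsigned char)c;
        return dst;
    }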
<geist>
well, i'm putting together an order for an alder lake i7-12700k tomorrow probably
<geist>
so should have one floating around to run tests for folks once it comes in and i build it
<mjg>
i have free credits for ec2 'n shit
<mjg>
i can probably get an fsrm-capable box within minutes
<geist>
yah though might want to be careful there since the memory bandwidth may be all over the place
<mjg>
i just don\t have benchmark code readily available
<mjg>
one can start with the most naive test possible: just zero or copy something small in a loop
<mjg>
with the same src and target
<heat>
you should use my thingy
<geist>
i dont mind helping out if you need it. have a medly of skylakes, ivy bridges, zen 1, zen 2, and soon alder lake
<moon-child>
I don't have credits, but I had to rent one of their servers recently
<mjg>
if startup latency is to be measured here, that's good enough
<moon-child>
to test/repro a 32-bit overflow; needed 100gb of ram or so
<mjg>
geist: i got skylake, westmere, sandy, haswell and some lol atoms
<moon-child>
just for a few minutes, it was less than a dollar I think
<moon-child>
really cheap
<geist>
mjg: yah you should get some AMDs into the mix since they tend to be different enough
<geist>
actually the erms startup cost on a zen 3 would be fascinating, since that
<mjg>
i know
<geist>
's when they claim ERMS (though no FSRM)
<mjg>
i'm working on long term access to some fresh zens
<geist>
i have one, can run code for you if you want
<geist>
my desktop machine is a 5950x
<mjg>
it's getting late here, will have to turn in soon(tm)
<geist>
no worries
<mjg>
i'll hack up something trivial to just measure startup latency this week
<mjg>
[already monday here :>]
<geist>
i'm just hanging out sunday afternoon at the local brewery. it's inside and away from the daystar
<mjg>
very likely a funny test will do fine: plug in the code into will-it-scale and check iterations/s
<mjg>
no cache flushing or other shenanigans
<mjg>
since again, only startup latency is to be checked
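A startup-latency micro-benchmark of the kind mjg means really is this simple: same small buffer, fixed size, count iterations per second so only the per-call overhead shows. A hedged sketch:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        enum { ITERS = 100 * 1000 * 1000 };
        static char buf[64];               /* small and hot: stays in L1 */
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; i++) {
            memset(buf, 0, sizeof buf);
            __asm__ volatile("" : : "r"(buf) : "memory");  /* keep the call alive */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f iters/s\n", ITERS / secs);
        return 0;
    }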
<geist>
yah funny i was firing up some older raspberry pi the other day (pi400 i think) and it had some benchmark code from doug16k on it
<geist>
he'd got me to run something on an arm
<moon-child>
need some kinda barrier
<kof123>
you guys are really hammering the "summon doug16k" today :D
<moon-child>
to depend on the movs result
<moon-child>
to make sure they don't overlap
<geist>
haha doug16k is the chosen one
<geist>
he wil bring balance to the osdev
<kof123>
he came back for a short appearance
<geist>
oh earlier today? or just earlier in the last fwe weeks?
<kof123>
weeks/months.
<heat>
like a month or two ago
<kazinsal>
couple weeks back iirc
<geist>
ah yeah
<mjg>
the really weird part about all of this is that intel folks patched linux to just roll with erms
<mjg>
claiming FAST
<mjg>
years later FSRM shows up, which debunks the previous claim
<mjg>
in the meantime someone patched copy_to/from_user to use regular movs up to 64 bytes or so
<mjg>
instead of erms
<mjg>
but did not patch memcpy
<mjg>
weird af
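A hedged sketch of what that copy_to/from_user change amounts to; the function name is made up, the cutoff is the rough "64 bytes or so" quoted above, and it assumes x86-64, GNU C, and non-overlapping buffers:

    #include <stddef.h>
    #include <string.h>

    static void copy_small_aware(void *dst, const void *src, size_t len) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        if (len > 64) {
            /* large: the erms startup cost amortizes, let rep movsb run */
            asm volatile("rep movsb"
                         : "+D"(d), "+S"(s), "+c"(len)
                         :
                         : "memory");
            return;
        }
        /* small: regular moves, 8 bytes at a time, then a byte tail */
        while (len >= 8) {
            memcpy(d, s, 8);   /* a single movq after inlining */
            d += 8; s += 8; len -= 8;
        }
        while (len--)
            *d++ = *s++;
    }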
<heat>
it's like a conspiracy but everyone is bad at everything
<moon-child>
lol
<mjg>
i mean if you found out the hard way erms for small sizes sucks, why do you only patch ONE place
<mjg>
although in our current landscape with all the vuln mitigations i can't say if fixing this kind of stuff is more important than ever
<mjg>
... or meaningless
<geist>
yeah well also intel probably assumes AMD doesn't exist and vice versa
<geist>
though they at least had the common courtesy to put it behind a cpuid bit
<mjg>
well i'm happy to add to "wtf man"
<geist>
mjg: yeah that's also what i've been a little sad about
<geist>
with vulns it feels like a lot of these micro-optimizations which used to be fun and/or matter don't
<mjg>
the documented idiom is to rep stosq + finish with rep stosb
<geist>
since it's far more important the code be safe etc than doing things quickly
<geist>
or the thing you thought mattered now doesn't because the codegen is terrible around it, etc
<mjg>
... except that's TERRIBLY slow and there is a cheap hack which works around it big time
<geist>
very much an instance of 'we can't have nice things'
<mjg>
you can finish it off like so:
<mjg>
movq %r10,-8(%rdi,%rdx)
<mjg>
rdi is the target buf, rdx is the tail
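Put together, the documented rep stosq bulk plus that overlapping tail store looks roughly like this (x86-64, GNU C; assumes len >= 8, and the helper name is invented):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static void zero_buf(void *buf, size_t len) {
        unsigned char *dst = buf;
        uint64_t zero = 0;
        size_t quads = len / 8;

        asm volatile("rep stosq"               /* bulk: len/8 8-byte stores */
                     : "+D"(dst), "+c"(quads)  /* rdi and rcx are consumed */
                     : "a"(zero)
                     : "memory");

        /* the tail: one unaligned 8-byte store at the very end of the
           buffer, overlapping what stosq already wrote, instead of a
           rep stosb for the remaining 0-7 bytes */
        memcpy((unsigned char *)buf + len - 8, &zero, 8);
    }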
<geist>
the arm64 memcpy also does the interesting idiom of always moving by (multiple of register bytes) including the last iteration
<geist>
where the last iteration is just offset and double copied
<geist>
a thing i wouldn't have thought of honestly
<geist>
since i had always assumed you always copy every byte precisely once
<heat>
overlapping stores?
<geist>
yah
<heat>
mjg loves overlapping stores
<geist>
like if you have to copy, say, 9 bytes. the trivial way would be an 8 byte move + another 8 byte move offset by 1
<mjg>
that's the trick
<geist>
(obviously it's more complicated, but that's a simplified version)
<mjg>
movq %r10,(%rdi)
<mjg>
movq %r10,-8(%rdi,%rcx)
<geist>
yah my old days of doing memcopies were all about aligning everything because you couldn't do unaligned
<mjg>
bam, range 8-16 with no branching on the exact size
<mjg>
not my idea, but fucking great
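The same two stores in C, for readers who don't think in AT&T syntax; the function name is invented, and memcpy of a constant 8 is just the portable spelling of an unaligned movq:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* fill any len in [8,16] with exactly two 8-byte stores */
    static void set8to16(unsigned char *p, size_t len, uint64_t pattern) {
        memcpy(p, &pattern, 8);            /* movq %r10,(%rdi)        */
        memcpy(p + len - 8, &pattern, 8);  /* movq %r10,-8(%rdi,%rcx) */
    }

The second store is anchored to the end of the range, so the overlap between the pair absorbs the size difference and no branch on the exact length is needed.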
<geist>
yah
<heat>
your idea for sure
nyah has quit [Quit: leaving]
<geist>
OWN IT
<mjg>
actually you are correct
<heat>
genius memcpy man show us the way
<mjg>
as soon as i get back from the patent office
<geist>
i haven't looked into what the current state of the art is on riscv but presumably there's room to optimize for cores that can and can't do unaligned accesses
<geist>
it's like the early 2000s era arm all over again
<geist>
where the exact alignment and prefetchability and whatnot really matters
<mjg>
heh i had not looked at riscv in almost any capacity
<heat>
i assume that the current state of the art on riscv is "none, cpus are still slow, check again in 4 years"
<mjg>
i ran into one lolzer though
<moon-child>
i thought riscv has an sve
<moon-child>
so you can do masked stores
<mjg>
there was a hand-rolled bit op somewhere which i replaced with a compiler builtin
<mjg>
... which turned out to generate slower code on that platform
<mjg>
(:
<geist>
moon-child: right, except it's also optional and you probably don't want to use it in the kernel, so you're back to multiple versions again
<geist>
i think the annoying thing is the riscv arch does not specify that unaligned accesses work and/or are efficient
<geist>
unlike say armv8 which finally just flat out mandated it
<mjg>
makes me wonder if the above addressing was invented by someone pissed at memset et al
<mjg>
it's now my headcanon
<moon-child>
which? -8(%rdi,%rcx)?
<moon-child>
i like it
<geist>
and as far as i can tell there's no good way to determine if unaligned is supported, since there's no cpuid equivalent and i don't think it's described in the device tree
<geist>
this is part of the growing up that riscv is dealing with currently
<moon-child>
'no cpuid equivalent' wat
<heat>
yeah
<heat>
it's a machine register
<heat>
misa?
<moon-child>
for something that relies so heavily on extensions, how do you even
<moon-child>
oh
<heat>
it's just a string in the device tree
<moon-child>
so it does have a cpuid equivalent :P
<heat>
you parse the characters
<moon-child>
oh
<moon-child>
ok
<geist>
yah and that only tells you if a feature is present, and it's only available in machine mode
<moon-child>
that's fine, I guess
<heat>
starts with "rv$BITNESS"
<geist>
so it's literally like 26 bits of features
<heat>
then each char is an extension
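A sketch of parsing such a string, e.g. "rv64imafdc" from the riscv,isa devicetree property; the helper name is made up and multi-letter extensions (underscore-separated) are ignored:

    #include <stdbool.h>
    #include <string.h>

    static bool riscv_isa_has_ext(const char *isa, char ext) {
        if (strncmp(isa, "rv", 2) != 0)     /* expect "rv32"/"rv64"/... */
            return false;
        isa += 2;
        while (*isa >= '0' && *isa <= '9')  /* skip the bitness digits */
            isa++;
        for (; *isa && *isa != '_'; isa++)  /* single-letter extensions */
            if (*isa == ext)
                return true;
        return false;
    }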
<mjg>
you do the access on purpose and see if you got the trap
<mjg>
there
* moon-child
trouts mjg
<geist>
mjg: yeah except perhaps the machine mode monitor emulates it for you
<heat>
practical development with overlapping stores man
<mjg>
it's not a serious proposal, but i would expect linux to do it
<heat>
why is it not serious?
<heat>
it totally works
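In userspace the probe could look like the following sketch: catch the fault in a signal handler and see whether the access completes. As noted just above, a machine-mode monitor (or the kernel) may transparently emulate the access, in which case this reports "works" even when every such access is slow:

    #include <setjmp.h>
    #include <signal.h>
    #include <stdbool.h>
    #include <stdint.h>

    static sigjmp_buf probe_env;

    static void on_fault(int sig) {
        (void)sig;
        siglongjmp(probe_env, 1);
    }

    static bool unaligned_loads_complete(void) {
        struct sigaction sa = {0}, old_bus, old_segv;
        sa.sa_handler = on_fault;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, &old_bus);
        sigaction(SIGSEGV, &sa, &old_segv);

        bool ok = false;
        static volatile uint64_t buf[2];
        if (sigsetjmp(probe_env, 1) == 0) {
            /* deliberately misaligned 8-byte load */
            volatile uint64_t v =
                *(volatile uint64_t *)((uintptr_t)buf + 1);
            (void)v;
            ok = true;
        }
        sigaction(SIGBUS, &old_bus, NULL);
        sigaction(SIGSEGV, &old_segv, NULL);
        return ok;
    }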
<geist>
anyway, it's part of this yin and yang that riscv is going through now. trying to keep it simple but also deal with the practical realities of a fractured architecture
<geist>
so we'll see
<mjg>
heat: hack as fuck
<heat>
you're in the kernel
<heat>
hack as fuck is your middle name
<geist>
except in riscv the kernel is not the root. you have this machine mode to worry about
<heat>
do you think .text patching isn't hack as fuck? it's hack as fuck^fuck
<mjg>
.text patching is fine
<geist>
anyway i think mjg was going to bed
<mjg>
unless it's what's solaris is doing
<geist>
i tried not to keep going!
<mjg>
last rant!
<moon-child>
lemme just memcpy the sse memcpy over the avx memcpy if cpuid says there's no avx support--wait...
<mjg>
they introduced "zero cost probes" for dtrace
<mjg>
except if you take a look, they cost a lot
<zid>
have you considered AVX copying your memset code into memory, to see if it makes it faster by filling icache sooner
<mjg>
the marketing was that there is no branching on whether the probe is on
<mjg>
what they did is the laziest fucking hack you can imagine
<mjg>
they let the compiler generate the func call to the probe as if it was there all along
<mjg>
and then they nop out the actual call instruction
<heat>
ok?
<heat>
that's pretty standard
<heat>
linux does that
<mjg>
no
<mjg>
they still set up registers for the call
<geist>
yes and no. if you did it the way mjg describes it also means all the code around it is... yeah
<mjg>
and recovery afterwards
<mjg>
in the fast path
<geist>
flattening regs, etc
<heat>
particularly for ftrace
<geist>
whereas what you probably really want is something that calls a veneer routine that dumps registers and whatnot and then calls through to the real one
<mjg>
hot patching as done normally with asm goto injects a nop sled
<heat>
the only thing you know is the callsite
<geist>
that way you take the hit only when you make the call
<mjg>
but all the reg + call manipulation is moved elsewhere
<mjg>
so the impact on the func with the probe is just the nops
<mjg>
as opposed to a partially erased function call
<geist>
it's the hidden cost of a function call, you end up dumping regs, trashing some of them, etc
<mjg>
so tl;dr the "zero cost probes" are far from zero cost
<geist>
yah seems like it'd also turn things like leaf functions into non-leaf functions
<mjg>
and to be clear, the 5 byte nops would qualify as zero in my book
<geist>
because there is otherwise a function call
<mjg>
.. as seen with asm goto
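A hedged sketch of that nop-sled pattern with gcc/clang asm goto, loosely modeled on Linux jump labels (this is not any kernel's actual API, and the runtime patching itself is not shown):

    #include <stdio.h>

    static void probe_body(void) {
        puts("probe fired");
    }

    static void traced_function(void) {
        /* patch site: a 5-byte nop, rewritten at runtime to jmp probe;
           the fast path pays for nothing but these bytes */
        asm goto(".byte 0x0f, 0x1f, 0x44, 0x00, 0x00"
                 : /* no outputs */
                 : /* no inputs */
                 : /* no clobbers */
                 : probe);
        return;  /* hot path: no register setup, no call */

    probe:
        probe_body();  /* cold, out of line, only reached when patched in */
    }

    int main(void) {
        traced_function();  /* prints nothing until the nop is patched */
        return 0;
    }

All the argument setup and the call live after the label, out of the fast path, which is exactly the difference from merely nopping out the call instruction.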
<geist>
yah and they developed it mostly on sparc probably, which would have been just a single call (and potentially the branch delay slot after it)
<mjg>
i need to write a complete rant about supposed solaris scalability and post it to tuhs
<mjg>
i need answers
<geist>
well most likely the arguments for it were from the 90s/2000s where it probably was superior to the alternatives
<geist>
but times change
<moon-child>
could even make it a 2 byte nop; 1 byte jump to a 4 byte jump
<moon-child>
that's what windows does iirc
<mjg>
cheeky
<geist>
god, reminds me of some exploit fix we have to compile the kernel with for some reason that causes it to nop-pad branches that cross 32 byte boundaries or whatnot
<geist>
because of some stupid skylake hack
<mjg>
geist: i don't know about that man. i can tell you for a fact though that an era appropriate machine (sandy bridge, 40 threads) from when sun was still "alive" runs into drastic perf problems on solaris
<geist>
it's bad. i should ask if we still need it
<mjg>
geist: which i profiled to just terrible scalability overall
<geist>
mjg: frankly i'm not sure solaris was ever that serious on x86
<mjg>
i guarantee it sucks terribly on sparc as well
<geist>
it was mostly serious when they had 32 core sparc machines while everything else was still 2 or 4 way
<mjg>
it's all mutexes et al taken A LOT
<mjg>
on shared objects
<mjg>
in general their kernel weirdly lacks smp infrastructure to begin with
<mjg>
even the basic stuff like annotations to keep vars in disjoint cache lines
<geist>
(these are all things you're describing that zircon does not yet do :) )
<mjg>
i mean one annotated var per line
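In C that annotation can be as small as an alignment qualifier; the type name and cpu count here are placeholders:

    #include <stdalign.h>

    #define MAX_CPUS 64  /* placeholder */

    struct cpu_counter {
        alignas(64) unsigned long value;  /* sizeof rounds up to 64, so
                                             neighbours in the array never
                                             share a cache line */
    };

    static struct cpu_counter per_cpu_ctr[MAX_CPUS];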
<mjg>
does zircon have a reputation of scaling to high core counts?
<mjg>
i don't see any mentions in the old books man :)
<geist>
depends on what high is nowadays
<geist>
8 or 10 cores was a shit ton of cores like 20 years ago
<moon-child>
wasn't zircon supposed to replace android
<mjg>
even bonwick's slab paper is kind of funny here
<moon-child>
on phones with, like, 4 cores?
<geist>
that was written in the 90s man
<geist>
seriously 4 cores was a fairly large machine then
<mjg>
replaced globally locked allocator, got a great speed up for it
<mjg>
.. but that means the kernel did not scale for shit already
<mjg>
the paper is late 90s afair
<heat>
moon-child, no
<heat>
fuchsia is supposed to run on everything
<geist>
i met jeff bonwick once. he's a really nice guy
<heat>
just like zombocom, everything is possible
<geist>
ZOMBOCOM
<geist>
we have a tremendous amount of work to do, which is why we're always looking for good kernel engineers
<heat>
but probably not too low end hardware; it relies on 64-bit stuff and memory fragmentation and all that
<geist>
and engineers that are also good at dealing with the fact that you can't do everything at once in one step
<geist>
it's a long process. that's part of the essence of engineering. making baby steps
<mjg>
aah there it is
<mjg>
the supposedly multithreading-friendly userspace malloc turned out to be de facto globally locked
<mjg>
seen in the vmem paper
<mjg>
bonwick accidentally calling sun out
<heat>
hm?
<geist>
again i'm not really sure it's fair to judge old designs by modern stuff
<geist>
though os design hasn't changed that much in the last 20-30 years, i think a tremendous amount of the details of how to scale has
<geist>
and also the very real realization that you can't always get what you want
<geist>
it may be their stuff wasn't perfect at the time, but it was still better than most of the competition which was even 'worse'
<mjg>
i don't judge ideas
<mjg>
i'm pointing out their own test results show that the code did not scale
<moon-child>
apparently hoard showed up in 1999
<mjg>
... at time of publication
<moon-child>
sure, but did anyone know about it prior to that?
<geist>
mjg: scale to what?
<mjg>
to whatever hw they were testing on
<geist>
compared to what they had before? to the competition?
<moon-child>
anyway, I'd expect all allocators prior to that to suck on multithreaded workloads
<geist>
to modern standards?
<mjg>
well let me restate
<geist>
moon-child: yeah i remember hoard being a thing there. i actually used it in newos years ago. we were thinking about it (or maybe did use it) in beos at the time
<mjg>
by the end of the 90s the solaris kernel had a globally locked kernel allocator. with this state there is no way it scaled for shit at the time
<geist>
mjg: okay. fine.
<mjg>
despite popular opinions to the contrary
<mjg>
then they patched "mtmalloc". which was for userspace and was supposed to scale
<mjg>
erm, bad sentence
<mjg>
then they ported the new allocator to userspace and benchmarked it against mtmalloc
<mjg>
turns out the multithreading-friendly allocator they had was de facto globally locked as well
<mjg>
you can literally see it in the vmem paper
<mjg>
so again, did not scale for shit
<mjg>
the graph claims a 10 cpu system
<mjg>
and that's from sun's own published material
<moon-child>
anyway their thing looks pretty linear? What's the issue with that graph?
* geist
goes to take a walk
<mjg>
from checking commit history to mutex code (2005 onward i think) i also know for a fact that any contention on boxes bigger than a few cores very negatively affected perf
<mjg>
moon-child: the graph is fine. the point is the malloc they had for multithreading progs turned out to not work
<mjg>
see mtmalloc performance
<heat>
oh yeah I should switch to scudo
<moon-child>
oh the mtmalloc thing was what they were using?
<moon-child>
I see
<mjg>
re mutex, they check if the lock owner is running. but they used to do it by walking all per-cpu scheduler state and checking if the thread is perhaps scheduled on that one
<moon-child>
urk
<mjg>
effectively constantly pulling out what should be virtually always exclusively owned lines
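A sketch of the contrast, with invented names: an adaptive mutex wants to poll one flag the scheduler maintains on the owner thread itself, not sweep every per-cpu run queue on each spin iteration:

    #include <stdatomic.h>
    #include <stdbool.h>

    struct thread {
        _Atomic bool on_cpu;  /* flipped by the scheduler at switch-in/out */
    };

    struct mutex {
        _Atomic(struct thread *) owner;  /* NULL when unlocked */
    };

    /* spin while the owner is running; bail out to block otherwise */
    static bool worth_spinning(struct mutex *m) {
        for (int i = 0; i < 1000; i++) {   /* arbitrary spin budget */
            struct thread *owner = atomic_load(&m->owner);
            if (owner == NULL)
                return true;               /* looks free: go try the CAS */
            if (!atomic_load(&owner->on_cpu))
                return false;              /* owner is off cpu: just block */
        }
        return false;
    }

Reading owner->on_cpu touches a line owned by one CPU; walking all per-cpu scheduler state drags in a line per CPU, every iteration, which is the behavior being complained about.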
<mjg>
rant over
<mjg>
good night :)
<heat>
what kind of night doesn't end with a rant about solaris scalability really
<heat>
or D O O R S
<moon-child>
new possibilities lie just outside the door!
<moon-child>
oh, on the topic of self-published benchmarks where you Just Don't Scale
<bslsk05>
github.com: conc-map-bench/ReadHeavy.fx.throughput.svg at master · xacrimon/conc-map-bench · GitHub
<heat>
come unix
<klange>
Also I discovered last night that I had accidentally committed a (shareware, thankfully) DOOM1.WAD to my repository earlier this year while working on ARM stuff.