#osdev on 2022-08-21 — irc logs at libera.irclog.whitequark.org

2021-05-23 01:57 klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books

00:04 matt__ has joined #osdev

00:04 matt__ is now known as freakazoid333

00:15 ghee has joined #osdev

00:48 gxt has quit [Remote host closed the connection]

00:49 gxt has joined #osdev

01:08 foudfou_ has joined #osdev

01:10 foudfou has quit [Remote host closed the connection]

01:11 gog has quit [Quit: byee]

01:14 gog has joined #osdev

01:16 ZombieChicken has quit [Quit: WeeChat 3.5]

01:18 nyah has quit [Ping timeout: 268 seconds]

01:19 elastic_dog has quit [Ping timeout: 255 seconds]

01:20 elastic_dog has joined #osdev

01:29 ghee has quit [Quit: EOF]

01:31 saltd has quit [Remote host closed the connection]

01:48 gxt has quit [Remote host closed the connection]

01:49 gxt has joined #osdev

02:07 [itchyjunk] has quit [Ping timeout: 244 seconds]

02:16 <heat> configure scripts are pretty useful as posix unit tests

02:17 saltd has joined #osdev

02:17 <klange> that'd be an integration test, though

02:18 <zid> yea a unit test would be to check that 1+1 is 2 still, even if you used a makefile :P

02:18 <klange> the individual things that go into a configure script are reasonable, but you'd have to have compiled them all elsewhere for it to be a proper set of unit tests

02:19 <heat> bah, you get the point

02:20 <heat> I've managed to find a bunch of brokenness in my system (both POSIXY and not) by getting my first configure script to run

02:20 <heat> also, yay i managed to build GNU hello

02:21 <klange> that is quite an achievement given what GNU hello exists for

02:21 <heat> I tried ncurses but I can't get multithreaded make to behave properly there for some reason...... although it builds single threaded

02:22 <klange> I really should revisit some of this stuff.

02:24 <heat> what stuff?

02:24 <klange> stuff that isn't mine

02:25 <heat> lol

02:26 <klange> I did get that dash port working nicely. I should get busybox ported and see if I can manage to run configure scripts with that setup.

02:27 <klange> I put so much effort into doing my own things, but my Unix toolset is still very incomplete, and it's not something I care too much about.

02:27 <heat> try toybox

02:27 <heat> it's exactly like busybox but worse (although liberally licensed and probably lighter on deps)

02:28 <klange> I don't really care that Busybox is GPL.

02:28 <klange> Toybox is what Android uses, right?

02:28 <heat> my configure ran on top of dash, toybox, OpenBSD tr, and nawk

02:28 <heat> yes

02:29 <klange> Have you tried the one true awk? It apparently got an update earlier this year that added Unicode support.

02:29 <heat> yes, that's nawk

02:31 <klange> I am quite happy to have produced a rather complete system of my own things, but maybe it's time to rejoin the rest of the world again :)

02:31 <heat> i had to pick between gawk and nawk and... yeah I'd rather not have more GNU crap

02:32 <heat> time to abandon your inner NIH and embrace your literally-not-invented-here-half-my-system-isnt-mine

02:34 <klange> Perhaps not bothering with these Unix things will help me shed the "Unix clone" derision. You want a proper POSIX Toaru, here's dash and *box, have fun.

02:35 <klange> I do want to redo my shell, though... but I can avoid bothering to go for a POSIX sh if I just shrug and provide a dash package.

02:35 <klange> Though maybe Drew was right all those years ago when we had our falling out over my shell being placed at /bin/sh

02:36 <heat> context?

02:38 <klange> (playing ffxiv so can't be too responsive right now)

02:40 <klange> can only irc while autowalking or in an unskippable cutscene

02:40 <klange> thankfully there are a lot of the latter in the content I am playing at the moment

02:41 <klange> heat: many many years ago, when drew was working a POSIX shell implementation, I was in their IRC channel, and some choice words were thrown around about the fact that I put my shell at /bin/sh when it is very much not a POSIX shell.

02:43 <heat> i would say that's wrong but "choice words" sounds pretty harsh

02:48 <zid> You can't end your binaries in .exe, that's what playstation uses.

02:52 <kazinsal> It's only a unix clone if it runs on a 16-bit machine from the 1970s, otherwise it's just sparkling timesharing

02:53 <heat> unix clones and derivatives are shit

02:53 <heat> at&t code only

02:54 <kazinsal> "I will never write a unix clone," I say, writing a unix clone for the IBM PC/XT

02:55 <heat> are you writing a unix clone now?

02:55 <zid> You mean a sparkling timeshare for the IBM PC, it's not a mainframe from the main au framé region of france

02:55 <kazinsal> I got bored one night and started fiddling with Open Watcom and am likely going to end up just reimplementing V3 or something similarly simple and early

02:56 <heat> upvote

02:57 <heat> I've wanted to try and do something similar myself

02:57 <kazinsal> The hardest part was trying to figure out exactly which fucking invocation program to use for Open Watcom

02:57 <heat> one of the options was to grab some early BSD (4.2 or so) and port it to the modern times

02:58 <kazinsal> Because there's half a dozen different command-line frontends and only one of them has the correct combination of options available to compile freestanding flat 8086 binaries

03:03 [itchyjunk] has joined #osdev

03:14 <klange> ugh my truetype rendering is so slow

03:15 <geist> yah every time i sit down to hack more on a user space for LK i just end up implementing a little more unix

03:16 <geist> because why not

03:16 <klange> I have this calculator app, which is backed by Kuroko, and I [relatively] recently implemented bignums, so now it can do 68**420

03:17 <zid> Are you willing to do insane shit?

03:17 <klange> except when you calculate that there's so many digits to render it slows to a crawl

03:17 <zid> glyph cache, glyph cache with 3x of each glyph so you can interpolate subpixel AA?

03:18 <zid> string cache, character pair cache? *adds more caches*

03:18 <klange> need to start with just a plain old glyph cache

03:20 <heat> geist, tbf unix is probably the only thing you haven't done

03:20 <heat> i think

03:20 <heat> (also nt but that's legally dubious)

03:20 <geist> to a certain extent yes

03:20 <zid> dw, they made running windows legal ages ago

03:27 <heat> whether or not you can build a fancy pants microkernel on top of lk is pretty clear by now

03:27 <heat> the real question is if it can support a full svr4 implementation, abi compatible

03:29 <geist> no reason, it's just a different set of stuff

03:29 <geist> and a more structured driver and fs api

03:31 <heat> such is the power of lk

03:31 <heat> my kernel is pretty opinionated I think

03:33 <heat> disintegrating the unix personality from the core kernel would end up with a worse end product I think

03:33 <geist> yeah fork is the real test, and the LK vm is not set up for it, etc

03:34 <geist> or at least not set up enough to do it efficiently

03:34 <geist> without some sort of stop the world, iterate over everything and replace with something else style fork

03:34 <heat> I don't think fork() is a big issue

03:34 <heat> all in all it's pretty simple

03:35 <geist> oh gosh no. it gets really complicated when dealing with things like memory mapped files and whatnot

03:35 <heat> go through every private mapping, mark it cow for the child and the parent

03:35 <geist> or like i said, doing it *efficiently* is difficult

03:35 <heat> for shared mappings, just increment page refs

03:35 <geist> even that presumes there's a notion of 'shared mappings'

03:35 <geist> but then i guess you ahve to do to posix

03:36 <geist> since posix has a more explicit notion of it

03:36 <heat> well yeah you'd need to attach some notion of private vs shared to the vm subsystem

03:38 gog has quit [Ping timeout: 248 seconds]

03:39 <heat> honestly, I don't think the techniques behind fork() have evolved much

03:39 <geist> modern designs i guess you have to do vfork() and whatnot

03:40 <geist> it's not tha fork is hard, fork is easy if yuo design the VM around it

03:40 <geist> but if you dont have a VM designed around it it gets somewhat more difficult, is i guess my point

03:40 <geist> and i currently do not have the LK/zircon vm aligned in that way

03:40 <geist> otoh the LK vm is very very simple now, so it's not hard to go in that direction

03:41 <geist> it's really nothing more than a simple PMM + some ability to map (without demand faulting) some pages to a virtual spot

03:41 <heat> yeah ofc you need some functionality you wouldn't need otherwise

03:41 <geist> no sharing, no mmap files, no demand paging. so it could be evolved in that direction

03:41 <heat> vfork() is easy, I think that would be the easiest part

03:42 <heat> even fuchsia could do that

03:42 <heat> if a kernel has any notion of processes sharing address spaces, it Just Works(tm)

03:43 <heat> probably the hardest UNIX idea to retrofit would be signals

03:43 <heat> because that infects your whole kernel

03:44 <klange> I spent a bunch of time revamping my signal implementation.

03:44 <raggi> signals really need ttys and sessions so that's expands into a whole historical mess

03:48 <geist> yeah i had a basic signal + job control system in newos that was a pain

03:48 <geist> was my first real experience with it

03:48 <heat> I do wonder if you could make an almost-fully interruptible kernel with stack unwinding

03:49 <geist> that's actuallty the biggest thing i gotta do with LK before i get too far with a user space: add mutex unwinding for thread termination

03:49 <geist> currently there's no such thing, but it was one of the firs big retrofits had to do for zircon

03:49 <geist> would have to do it again

03:49 <heat> what's that?

03:50 <geist> unwinding threads that are blocked on something because you want to kill them

03:50 <geist> there is no thread_kill() in LK, somewhat on purpose. but when you have a user space and something kills a process

03:51 <geist> you have to stop all the threads owned by the process that are blocked on someting (say waiting on a device read or whatnot)

03:51 <geist> so to do that you probably need a generalized way to forcibly unblock a thread and allow it to unwind from the deep kernel out to the syscall itnerface. not rocket science, but would have to retrofit the LK code for it

03:51 <klange> I was trying to naïvely do stuff with that before I did my signal rewrite. toaru32 signal handling was... sort of actually doing signals in the kernel. It would save the stack state for the kernel thread and go off and do something else. It was bad.

03:51 <geist> yah

03:52 <geist> for zircon (retrofitted from LK) we just mark most blocking events as unterruptable, and the thread structure remembers what wait queue its blocked on (can only be blocked on one by definition)

03:52 <klange> Now I do the "signals are only handled on normal return-to-userspace" thing, with delivered-and-not-ignored signals tripping sleeping threads to resume with an 'interrupted' bit.

03:52 <geist> so it's matter of forcibly removing the thread from the wait queue and arranging it to return from it's blocking event with a custom error code (ERR_INTERRUPTED, etc)

03:52 <geist> and then make sure all the code leading up to it knows how to unwind the state

03:53 <klange> I even did restartable system calls!

03:53 <heat> that's very linuxy yeah

03:53 <heat> although linux has an important distinction between normal signals and SIGKILL

03:53 <geist> yah it's one of these 'do i want to inflict that upon LK' sort of design things that i'd have to think about a bit

03:53 <geist> since most code does not need it

03:53 <heat> namely, you can specify a waiting state where you can be killed(interrupted by a SIGKILL) but not interruptied

03:54 <klange> It's probably easier to think about when designing things than "can I just arbitrarily lose this lock and be descheduled".

03:54 <geist> yah at the minimum having it be an opt in: 'block on this interruptably' means the blocking code is prepared to handle the special error case

03:54 <geist> and then it's matter of finding most places where things really block for long times (usually waiting on IO, etc) and making sure they can deal with it

03:56 <heat> per tradition, UNIX kernels aren't interruptible on a lot of IO

03:57 <heat> disk writes, etc aren't interruptible

04:03 <raggi> Kill traditionally still works

04:11 <mrvn> I sometimes think threads should have a "critical section" bit the user space can set. Then when scheduling a process only threads with the bit set may run at first.

04:13 <heat> you don't schedule processes, you schedule threads

04:13 <heat> or tasks which are like processes but can be threads or just really weird processes

04:13 <heat> its hard :|

04:13 <mrvn> heat: do you? Then one process can easily swamp the system with threads to DOS everoyne else.

04:14 <heat> yes, you do

04:14 <mats1> yes

04:14 <heat> i think geist mentioned before that windows is an outlier in that regard

04:14 <heat> in that it schedules processse

04:15 <mrvn> no everyone does. scheduling in multipple levels can be quite useful

04:15 <heat> you should check your local linux and freebsd copies

04:16 <mrvn> linux doesn't really have the concept of process and thread. It's all just tasks that amy or may not share namespaces.

04:16 <heat> exactly

04:16 <mrvn> doesn't mean everyone does that or that it's even a good idea.

04:16 <heat> (well, it kind of does, its called a thread group, but whatever, doesn't matter here)

04:17 <heat> freebsd (and most BSDs?) also do it like that

04:17 <heat> zircon also schedules threads, but has a distinction between threads and processes

04:17 <mrvn> unix in general

04:17 <mrvn> their design kind of predates threads

04:19 <heat> anyway, linux has a similarish concept called restartable sequences

04:19 <heat> it's like a critical section but without the DoS

04:20 <heat> you give it a range of instructions, and if you get preempted in the middle of it, you get restarted back to the beginning

04:20 <mrvn> heat: does that just save the addess in the vdso or does it need a syscall?

04:21 <heat> syscall

04:21 <heat> well, you set them up in advance

04:21 <heat> you don't syscall when entering and exiting

04:22 <mrvn> So on every task switch the kernel has to look through the processes list of restartable sequences to see if the PC is in any one of them?

04:26 <heat> no, it doesn't use lists

04:26 <heat> its kind of complicated and I honestly don't understand the specifics very well

04:27 <mrvn> .oO(obviously, linked list would be O(n))

04:27 <mrvn> you can probably use some mix of hashtable and tree to make it O(1) in most cases.

04:27 <heat> ok so reading through the man page

04:27 <mrvn> what it's called?

04:28 <geist> it can probaqbly see a preempt time and rewind it then

04:28 <geist> or at the time it was rescheduled

04:28 <heat> you register a structure, the kernel looks at that structure (directly from user space), that structure then has a pointer to the current code sequence (which you set when entering the rseq)

04:28 <heat> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2

04:28 <bslsk05> git.kernel.org: rseq.2 « man « doc - librseq/librseq.git - Restartable Sequences Library

04:29 <geist> yah probably cant nest them

04:29 <geist> which would make it much easier at task switch time, it would just need to look to see if it had been registered

04:29 <heat> you can't even do syscalls

04:29 <geist> oh this is a user space thing, interesting

04:29 <heat> it's very much just a neat little thing for highly optimized stuff

04:30 <heat> tcmalloc for instance is a user of it

04:30 <heat> for percpu thread cache lists

04:33 <mrvn> "Each critical section need to be implemented in assembly." that sucks

04:37 <mrvn> Looks like librseq defines a number of atomic operations (e.g. rseq_cmpeqv_storev) that you can then call from your own code. You aren't ment to write your own asm so much.

04:39 heat has quit [Ping timeout: 256 seconds]

04:41 [itchyjunk] has quit [Remote host closed the connection]

04:45 <mrvn> geist: in a kernel it would be a bit useless as you could just disabled IRQs

04:46 <geist> not unless what you wanted to do is expensive

04:46 <geist> or something you expect to fault

04:46 <geist> but yes.

04:46 <geist> in general if it's a short critical section you just disable preemption one way or another

04:47 <geist> if what you awnt is something like 'i want to do this potentially expensive idempotent operation in a critical way, but i dont want to actually disable preemption, so it's okay to just rewind me to the start of you do desire to preempt'

05:03 <remexre> are there any nice "posixy subsets" that have a more modern select alternative? Was looking at WASI, but looks like its' poll_oneoff is basically just select(), performance problems and all

05:03 xenos1984 has quit [Read error: Connection reset by peer]

05:03 <mrvn> you could go half way, not disable IRQs but disable task switching.

05:05 <mrvn> when I see "potentially expensive idempotent operation" then what I read is "I don't want to do this twice, it's too expensive for that"

05:08 <moon-child> > wasi

05:08 <moon-child> > nice

05:08 <moon-child> pick one

05:08 <geist> oh yeah re: posix, forgot about select() and poll()

05:08 <geist> those are also a pain

05:08 <remexre> yeah, I'd never worked with it before, but I was at least _aware_ of it -- looking for other things to be aware of instead

05:08 <moon-child> but is there a problem with just stealing epoll or kqueue?

05:08 <geist> if you're doing posix you have to at least implement the old ones

05:09 <geist> you can't just pick only the new bits (no fork, just vfork. no select, only epoll, etc)

05:09 <geist> or at least you can but you will only get a subset of working stuff

05:10 <moon-child> sure. But implementing anything 'new and nice' will get you even fewer working applications

05:10 <remexre> yeah, but select can at least be emulated with epoll, and afaik faithfully; similarly, fork and vfork with clone (not that I want clone...)

05:10 <moon-child> if you want legacy compat, you gotta do the legwork. Or don't, which is completely valid

05:12 <remexre> I don't think there's a problem with stealing epoll or kqueue (all of this is emulated anyway), but I'm more looking for a subset that's complete enough to get more than just coreutils working

05:13 <moon-child> I mean

05:13 <geist> reallyi guess klange or sortie are the best to ask here, being that they have actually trod this particular path before us

05:13 <geist> and what their experience has been

05:13 <moon-child> I feel like the answer there is: look at what you want to get working, and see what they need

05:13 <geist> yah that's my guess

05:13 <moon-child> repeat until all the compile errors go away

05:13 <geist> plus kinda fun anyway. if you stress about all the things you might need you will just despair

05:21 xenos1984 has joined #osdev

05:57 <mrvn> I always find myself having one problem with epoll. I want edge trigger for writing and level trigger for reading. So every socket needs 2 FDs.

05:59 <mrvn> remexre: also look at async IO and uring

06:01 <klange> added a quick dumb glyph image cache for the terminal, really helping a lot there; doing a generic one for regular text will take some thought

06:02 vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]

06:05 <clever> mrvn: what if you just remove an FD from epoll when its writable, and store it in your own internal list

06:05 <clever> and if it ever returns EWOUDLBLOCK, move it back to epoll?

06:05 <mrvn> clever: that's an expensive syscall

06:05 <clever> and being level triggered, it becoming writable before you epoll_wait() again, will solve itself

06:06 <remexre> mrvn: yeah, my actual syscalls are much more uring-shaped, plus a dash of objcaps

06:06 <mrvn> better to do edge trigger and keep track of the level manually

06:07 <clever> mrvn: but how do you know the depth of the buffers?

06:07 <mrvn> what depth?

06:07 <clever> so you know how many bytes you can safely write

06:07 <mrvn> you just write non-blocking

06:08 <clever> or do you just write as much as you want, and let the kernel do a partial write?

06:08 <clever> and EWOULDBLOCK when it cant do any

06:08 <mrvn> when you get a short write you know you hit the full level and then you wait for the edge to fall again

06:08 <clever> and that edge trigger, i assume means it has some room?

06:08 <mrvn> yep.

06:08 <clever> so you can constantly top it up, and never let the fifo run dry

06:09 <mrvn> the complex case is when you don't have enough to write. You have to track the "can still write" level.

06:09 <clever> yeah

06:10 <mrvn> or more to the point the "still have stuff to read" level.

06:11 kof123 has quit [Ping timeout: 268 seconds]

06:11 <mrvn> What I want to do is "epoll" and then read one request from every FD with data and repeat. But if an FD has 2 requests pending the edge trigger won't trigger again on the second epoll.

06:12 <mrvn> You can read as much data from each FD as present but then one FD can DOS you.

06:13 <clever> i think i saw it in the LK pl011 uart code, but when you get an rx interrupt, you dont just read 1 byte, you keep reading until the fifo reports its empty

06:13 <clever> and translating that to epoll, just keep reading until EWOULDBLOCK, and then move to the next FD

06:13 <mrvn> So for reading you have to keep a list of FDs that haven't returned EWOULDBLOCK, do an epoll with 0 timeout (don't block) and merge the two lists.

06:14 <clever> but i can see some risk for a race condition here, what if more data arrives before you epoll again? does the edge trigger buffer up?

06:14 <mrvn> clever: if you get EWOULDBLOCK, then epoll triggers again

06:14 <clever> ah, but you just reminded me of a complex case

06:14 <clever> you cant just read everything, in all cases

06:15 <clever> if your doing something like an nginx reverse proxy, the write-channel blocking, can stop you from reading the read-channel

06:15 <clever> and then you have to leave data un-read

06:15 <clever> and now it wont edge trigger again, so yeah, you have to track that

06:15 <mrvn> yeah, that's where level trigger on read stops working.

06:16 <mrvn> if you want to do read throtteling then you need edge trigger and your own logic for levels.

06:16 <clever> yeah

06:17 <clever> https://github.com/librerpi/rpi-open-firmware/blob/master/uart-manager/uart-manager.cpp

06:17 <bslsk05> github.com: rpi-open-firmware/uart-manager.cpp at master · librerpi/rpi-open-firmware · GitHub

06:17 <clever> this is the only time ive actually written my own epoll code

06:18 <clever> lets see, it watches 3 FD's, the /dev/ttyUSB0, stdin, and a signalfd so i dont have to deal with volatiles on ctrl+c

06:18 <clever> but there is no read throttling, and no watching the write side

06:19 <mrvn> using a eventfd to watch the epoll_fd for activity is fun too.

06:19 <clever> if anything comes from /dev/ttyUSB0, just copy it to stdout, stdout is always faster

06:19 <mrvn> or slecting on the epoll_fd

06:19 <clever> if anything comes from stdin, its a slow-ass human, he can never fill a 115200 baud buffer :P

06:19 GeDaMo has joined #osdev

06:20 <mrvn> hehe, until you have foobar | uart-thing

06:20 <clever> but now that i look over it, the xmodem side

06:20 <clever> this program does 2 things a normal serial console doesnt

06:20 <clever> 1: it pulls DTR high on startup, and low on ctrl+c, wire that to the reset/run pin, and now code can only ever run when being watched

06:21 <clever> 2: if it detects the string "press X to stop autoboot", it will send an X over the uart, and then xmodem transmit an entire file, then stdin goes back into control

06:21 <clever> mrvn: if i did `foobar | uart-thing`, then i cant use stdin afterwards, at least not without opening /dev/tty

06:22 <mrvn> clever: I just ment that stdin isn't always human.

06:22 <clever> yaah

06:22 <clever> but it being something else, would defeat most of the point of this program

06:22 <clever> ah, and i see why xmodem never has to deal with writes blocking

06:22 <clever> it basically has a window-size of 256 bytes

06:23 <clever> it just sends a single block, and then waits for an ack/nak

06:23 <mrvn> was that bidirectional?

06:24 <clever> its setup to transfer a .elf file from the host pc to a remote cpu over serial

06:24 <clever> https://github.com/librerpi/rpi-open-firmware/blob/master/uart-manager/uart-manager.cpp#L218-L234

06:24 <bslsk05> github.com: rpi-open-firmware/uart-manager.cpp at master · librerpi/rpi-open-firmware · GitHub

06:24 <mrvn> or the other way. But can it do both at the same time?

06:24 <clever> not currently

06:24 <clever> the VPU code is receive only, and the host code is transmit only

06:25 <clever> and it wound up taking >60 seconds to transfer a file, which was just too slow to be of use

06:26 <mrvn> too long to wait, not long enough to get a coffee

06:26 <clever> fixing that would require usb-host and usb NIC drivers

06:26 <clever> then i can do 100mbit

06:28 <clever> and speaking of that...

06:29 <clever> https://github.com/librerpi/lk-overlay/blob/2b48c0914b39c34cac0c535ef51e3b789669bb0b/platform/bcm28xx/arm/arm.c#L174

06:29 <bslsk05> github.com: lk-overlay/arm.c at 2b48c0914b39c34cac0c535ef51e3b789669bb0b · librerpi/lk-overlay · GitHub

06:29 <clever> mrvn: can you guess what kind of bug this line can cause?

06:30 <mrvn> no

06:30 <clever> that assumes that the PLL is running at a given freq

06:30 <clever> and if that freq is wrong, this just malfunctions, and doesnt produce the required 25mhz clock

06:30 <clever> which entirely breaks usb

06:31 <mrvn> then get your divisor right

06:31 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/arm/arm.c#L173

06:31 <bslsk05> github.com: lk-overlay/arm.c at master · librerpi/lk-overlay · GitHub

06:31 <clever> this is a fix i did yesterday, it now computes the correct divisor on the spot

06:32 <mrvn> That sucks, can't do that before being in virtual higher half because it needs lbgcc

06:33 <clever> mrvn: this is running on a core without an MMU, and its integer based math

06:33 <mrvn> clever: I believe it generates a call do udivmod3

06:34 <clever> *looks*

06:35 <clever> c40036c6: 03 e7 ea c6 07 f8 ld r3,c407fdb0 <freq_pllc_core0>

06:35 <clever> c40036cc: 1a e8 9f 5c a0 57 mov lr,0x57a05c9f

06:35 <clever> c40036d2: 60 c4 1a 1f mulhd.uu r0,r3,lr

06:35 <mrvn> but it's only a problem if you have a higher half kernel and it happens in boot.S before mapping the higher half.

06:35 <clever> c40036d6: 03 46 sub r3,r0

06:35 <clever> c40036d8: 13 7a lsr r3,0x1

06:35 <mrvn> clever: ARM64?

06:35 <clever> c40036da: 30 42 add r0,r3

06:35 <clever> c40036dc: c0 7a lsr r0,0xc

06:35 <clever> VPU

06:35 <mrvn> well, no idea what mul/div/mod functions the VPU has in libgcc.

06:35 <clever> this never called libgcc

06:36 <clever> it did the entire thing in asm

06:36 <mrvn> I thought that was the arm side.

06:36 <clever> arm.c is responsible for turning the arm core on

06:36 <clever> and enabling peripherals that the arm side lacks drivers for

06:37 <clever> ok, so the first thing the above asm does, is load the pllc_core0 freq (324MHz) and a magic constant of 0x57a05c9f, and then does a high-side mult

06:38 <clever> i believe that will result in 0x69c3ba35093c700 and then truncate it down to 0x69c3ba3

06:39 <clever> then i get lost in why its doing all of those shifts

06:40 <clever> ive seen some magic tricks before (and forgot the name) which let gcc do division with mults

06:40 genpaku has quit [Remote host closed the connection]

06:40 <clever> if it knows the divisor ahead of time

06:40 <mrvn> clever: some cases will be off by one. the shifts do a correction for those

06:40 genpaku has joined #osdev

06:40 <mrvn> clever: it's x * (1<<N/z) >> N

06:40 <clever> ah, its recovering the wrong bit0, that makes the shifts and adds make a bit more sense

06:41 <clever> ah yes, that reminds me, the 3d core uses 1/w for some stuff

06:41 <clever> and thats probably what gcc is doing here

06:42 <clever> and like you said, 1<<N lets you do that in a fixed-point form

06:42 <mrvn> except lr and r3 are constants. why doesn't the compiler do that at compile time?

06:42 <mrvn> ups, ld, never mind

06:42 <mrvn> I was reading "lea"

06:43 <clever> freq_pllc_core0 is kinda a constant, but you would need whole program LTO to discover that

06:43 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/platform.c#L320-L334

06:43 <bslsk05> github.com: lk-overlay/platform.c at master · librerpi/lk-overlay · GitHub

06:44 <clever> this sets up the divisors for configuring the PLL

06:44 <clever> https://github.com/librerpi/lk-overlay/blob/master/platform/bcm28xx/pll/pll_control.c#L576-L578

06:44 <bslsk05> github.com: lk-overlay/pll_control.c at master · librerpi/lk-overlay · GitHub

06:44 <clever> and after setting the PLL up, it computes what the freq should be based on those, and stores it in a global

06:44 <clever> and with enough LTO, you could optimize that variable out, and turn it into a constant

06:45 <mrvn> it's worse if you get the frequency from the DT

06:45 <clever> yeah, at that point you cant optimize it out

06:46 <mrvn> nor multiply by 1/z

06:46 <clever> only if the 25mhz goal comes from DT

06:47 <mrvn> you can do a lookup table though if you really don't want to do division.

06:47 <clever> the VPU does also have a 32bit division opcode, takes about 20 clock cycles

06:47 <clever> but these are 64bit ints

06:48 <clever> which leads into a bug the official firmware had, for the longest time, the config parser uses signed 32bit ints

06:48 <clever> so anything over 2.147ghz, would overflow

06:48 <clever> and it was impossible to request an overclock above that

06:49 <clever> it wasnt until the pi400 came out, that the SoC could reliably overclock that hard, and the bug was discovered

06:53 <clever> mrvn: as for why i even need this 25mhz clock, thats because RPF decided to shave a few cents off the production cost!

06:53 <clever> the chip that does usbhub and usbNIC, needs a 25mhz crystal, or a 25mhz square wave to function

06:54 <clever> and RPF swapped the crystal out for a GPCLK from the main SoC

06:59 <mrvn> at a 100 million boards that's a few million bucks.

07:00 jimbzy has quit [Ping timeout: 252 seconds]

07:00 <clever> mrvn: exactly

07:02 <mrvn> At least it's something you can work without. Not like the memory-protection-unit they removed from the Amiga back when. Couldn't separate processes from each other so they didn't include that in the OS design. Then later when MMU came about you couldn't put it back into the OS.

07:04 <clever> there is also the vl805 spi flash chip, which is a grey zone

07:04 <mrvn> anyone here using a global address space for their OS?

07:04 <clever> you can do without it, but you need to modify your xhci driver some, and then its no longer obeying the xhci specs

07:05 <mrvn> specs are there to be violated

07:07 <clever> mrvn: https://xkcd.com/927/ they can also multiply

07:07 <bslsk05> xkcd - Standards

07:08 <mrvn> that's just incremental growth. I want exponential.

07:09 <netbsduser> mrvn: interestingly some members of the amiga community developed a sort of cult following to the lack of MMU

07:11 <netbsduser> despite the father of the amiga Jay Miner recognising it as a demerit, and designing one into the (rejected by Commodore) amiga ranger which would've been released around 1987

07:11 <Ermine> Otoh they have the fastest microkernel possible

07:12 <mrvn> and it's OO

07:12 <mrvn> libraries are objects with a vtable.

07:12 <netbsduser> what is striking about Amiga Workbench is that it's 'different' in how it feels in some ways to modern OSes, but it fundamentally is a modern OS, nothing like the typical "just a BASIC interpreter" ROMs of other home computers of its era, nor even like DOS

07:13 <mrvn> the libraries even have versioned symbols

07:13 <netbsduser> well, fundamentally a modern OS minus virtual memory

07:13 <mrvn> netbsduser: and all that in 256k rom

07:13 <mrvn> or 512k later

07:14 <netbsduser> mrvn: i love how they used OOP to great effect in things like datatypes.library, which lets an amiga app written 1991 that wants to deal with "some kind of image" able to seamlessly deal with a WebP or PNG implemented as a datatype by someone decades later

07:15 <netbsduser> this is one area where workbench even *beats* most modern OSes

07:15 <mrvn> yeah. They had the concept of plugins decades ahead of everyone else

07:16 <mrvn> any code that used datatypes to load sound can also play mp3 or aac

07:17 <mrvn> I think the only thing missing was a video data type.

07:18 <netbsduser> at such a universal level, too, i think the majority of programs use datatypes.library. the ARexx port is another nicety, though for some bizarre reason, nowadays it seems it's become unfashionable for most programs to let themselves be scripted

07:26 gxt has quit [Ping timeout: 268 seconds]

07:27 gxt has joined #osdev

07:28 <mrvn> netbsduser: if your kernel came with code to handle image and sound formats for you would you implement your own?

07:28 <mrvn> s/your kernel/the kernel/

07:29 <mrvn> datatypes was something they added in 2.0. older code didn't have it.

07:30 <mrvn> and games didn't use it much at all I think because a) they didn't have their data in such nice, readable formats. b) they didn't run on the workbench.

07:31 <mrvn> lots of custom GUIs in games.

07:40 <clever> mrvn: the windows kernel had font rendering support, complete with buffer overflow exploits and a turing complete font language, lol

07:40 <clever> hence, nobody runs untrusted fonts thru it

07:42 nyah has joined #osdev

07:49 foudfou has joined #osdev

07:50 foudfou_ has quit [Ping timeout: 268 seconds]

07:53 gxt has quit [Remote host closed the connection]

07:54 gxt has joined #osdev

08:21 ecs has quit [Remote host closed the connection]

08:22 ecs has joined #osdev

08:23 ecs has quit [Remote host closed the connection]

08:23 ecs has joined #osdev

08:42 the_lanetly_052 has joined #osdev

09:10 <kazinsal> There's still an official HTTP server kernel module for Windows

09:11 <clever> wut

09:13 <kazinsal> IIS in fact uses it as the low-level HTTP/HTTPS responder/filter

09:14 <kazinsal> The complicated stuff all happens in userspace but requests are grabbed by http.sys and if possible responded to without passing them on to user mode

09:19 <clever> kazinsal: thats just crazy

09:22 <kazinsal> Yeah it's rad

09:28 <clever> kazinsal: about the only reason i can see that being a benefit, is response time for static files in the fs cache?

09:28 xenos1984 has quit [Read error: Connection reset by peer]

09:29 <Ermine> Need more funky kernel modules

09:29 <clever> no need to context switch, just slap some IP and http headers on it, and stream the fs cache right into the NIC with scatter-gather

09:31 <kazinsal> That's pretty much what they list as the benefits for it

09:32 <kazinsal> Cached requests basically get turned right around immediately

09:46 gxt has quit [Ping timeout: 268 seconds]

09:47 xenos1984 has joined #osdev

09:48 gxt has joined #osdev

09:57 \Test_User has quit [Ping timeout: 252 seconds]

10:09 kof123 has joined #osdev

10:09 <jjuran> mrvn: I develop two OSes with a global address space, though neither runs on bare metal.

10:12 vdamewood has joined #osdev

10:20 sm2n has quit [Ping timeout: 244 seconds]

10:21 sm2n has joined #osdev

10:51 wootehfoot has joined #osdev

11:00 vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]

11:13 gog has joined #osdev

11:17 mavhq has quit [Quit: http://quassel-irc.org - Chat comfortably. Anywhere.]

11:18 mavhq has joined #osdev

11:47 the_lanetly_052 has quit [Remote host closed the connection]

11:47 toluene has quit [Read error: Connection reset by peer]

11:48 toluene has joined #osdev

12:01 the_lanetly_052 has joined #osdev

12:19 foudfou has quit [Quit: Bye]

12:20 foudfou has joined #osdev

12:51 freakazoid333 has quit [Ping timeout: 255 seconds]

13:05 FreeFull has joined #osdev

13:11 ptrc has quit [Remote host closed the connection]

13:12 ptrc has joined #osdev

13:34 [itchyjunk] has joined #osdev

14:37 <LittleFox> I don't want to use newlib anymore x.x

14:37 <LittleFox> but is there any other sensible libc to use?

14:37 <LittleFox> llvm libc does not look nearly ready yet, sadly

14:42 \Test_User has joined #osdev

15:04 the_lanetly_052 has quit [Ping timeout: 248 seconds]

15:07 wootehfoot has quit [Ping timeout: 252 seconds]

16:26 divine has quit [Read error: Connection reset by peer]

16:27 divine has joined #osdev

16:28 heat has joined #osdev

16:29 <heat> LittleFox, I know I'm late but musl

16:29 <mjg> llvm libc? wut

16:30 <heat> wut wut

16:30 <heat> do you not know what that great piece of software is

16:30 <zid> enver heard of it either

16:30 <heat> it's Google's new NIH

16:31 <heat> https://github.com/llvm/llvm-project/tree/main/libc

16:31 <bslsk05> github.com: llvm-project/libc at main · llvm/llvm-project · GitHub

16:32 <mjg> gotta check string routines

16:33 <zid> Looks.. kinda like what i'd expect

16:33 <zid> it's very basic

16:33 <zid> cos() is just fcos etc

16:33 <heat> mjg, they're doing it in C++ with intrinsics

16:33 <heat> that's just nasty

16:34 <LittleFox> heat, musl is strictly linux only iirc

16:34 <heat> well ofc

16:35 <heat> you need to adapt it to your own needs

16:35 <heat> that's what I did

16:35 <bauen1> LittleFox: but it's easy to start hacking away on it (and the code is much easier to understand and modify than newlib from my experience)

16:35 <heat> >the code is much easier to understand

16:35 <heat> >musl

16:35 <heat> does not compute

16:35 * kof123 gives heat fonz-like hit

16:35 <zid> musl's code is amazing

16:35 <kof123> *elbow

16:35 <zid> we spent an hour trying to figure out what one function was doing here between 4 of us

16:36 <bauen1> heat: I didn't exactly found newlib code to be particularely nice

16:36 <bauen1> *find

16:36 <LittleFox> I read that linux-only as not caring about portability to other OS', but that's not the case then?

16:36 <heat> bauen1, but musl is not it

16:36 <LittleFox> newlib is ... something, but not nice

16:36 <heat> LittleFox, and it doesn't care, but everything is hackable

16:36 <LittleFox> hmhm

16:36 <heat> fuchsia also used musl

16:37 <LittleFox> mahk, pointed me to mlibc which I forgot existed

16:37 <LittleFox> the one from managarm

16:37 <LittleFox> fuchsia uses musl?

16:37 <heat> now it uses a hybrid of musl and llvm

16:37 gelatram has joined #osdev

16:37 <heat> yes

16:37 <raggi> fuchsias musl is heavily forked

16:37 <heat> correct

16:38 <heat> I think it can be best described as a hybrid of musl, fuchsia and llvm

16:38 <raggi> musl will treat you well as being easy to integrate and mostly correct. It won't provide highly competitive performance

16:38 <raggi> Fuchsia will probably move to llvm once it's mature enough

16:39 gelatram has quit [Quit: Client closed]

16:39 <LittleFox> llvm looks extremely incomplete still, like, am I blind or is there no malloc yet?

16:39 gelatram has joined #osdev

16:39 <raggi> Llvm has a portable secure malloc already

16:39 <LittleFox> ok, I'm blind then ^^

16:39 <heat> right, you're probably supposed to use scudo

16:40 <raggi> Yeah, scudo, I was drawing a blank on the name for a minute

16:40 <raggi> LittleFox: 6/half dozen, you're probably right it's not wired up out the box

16:41 <heat> i think llvm will take a substantial amount of time to get complete and stable enough

16:42 <LittleFox> already have a forked llvm as submodule in my repo, would be great to stay in that ecosystem

16:42 <heat> (and the concept of "toolchain-official libc" is also dubious)

16:42 <raggi> It's definitely not complete yet, I recently needed a better performing sysroot than musl and I looked at it disappointed and made a static safe glibc instead :-(

16:42 <heat> raggi, what do you find slow?

16:43 <heat> malloc is noticeably slow but that can be replaced pretty trivially

16:43 <raggi> heat: I kinda disagree. Having the toolchain and libc agree on what becomes intrinsics and how is no bad thing, and llvm will make both portable and easily crossable which is a breath of fresh air frankly

16:44 <raggi> heat: there are no native implementations for lots of stuff, the use case I had recently was static SQLite embedded into a hosted runtime, and the libc switch cost nearly 40% on a fairly heavy workload, which was more than we could accept

16:45 <heat> yeah but *what*

16:46 <raggi> malloc was one, that'd be easy to fix, but also mem*, str*, etc

16:46 <heat> right, string functions, those are hard

16:47 <raggi> It's perhaps not really fair to accuse musl of being slow, and more fair to say glibc really does have fast impls

16:47 <heat> the problem with musl is that there's no MIT-like-licensed string ops

16:47 <heat> maybe bionic's. I was going to try them out today or so

16:48 <heat> but those are still old. 2014 or so

16:48 <raggi> Bionic had slow ones for a long time, I did see some fast impls going in recently, I did a quick pass too :-)

16:48 <heat> there's a patch going around #musl for an AMD memcpy implementation - but it's GPLv2

16:48 <raggi> I'm sure some profs could be sniped into writing some

16:49 <heat> of course it beats the shit out of musl's horrible memcpy

16:49 <raggi> Yep

16:49 <raggi> But e.g. this is where llvm can just do this

16:49 <heat> they're using rep movsq, they don't even use ERMS...

16:49 <heat> can it? it's just SIMD in C++

16:50 <heat> there's no tuning

16:50 <raggi> It'll take a while though, as they also want to intentionally try to reduce the reliance on dropping to asm, and instead reach for competitive performance with compiler work

16:50 <heat> just SIMD instructions in the middle of whatever shitpaste the compiler gives you

16:50 <raggi> I mean the project as a whole, not the vm :-)

16:50 <raggi> That group of people will get it done

16:51 <heat> maybe

16:51 <raggi> fair

16:51 <raggi> I have hopes

16:51 <heat> but they could've just contributed to one of the libcs instead

16:51 <raggi> Debatable

16:52 <heat> and *I know* that musl people are a bit stubborn

16:52 <raggi> Rich got extremely vocally angry at fuchsia over the musl fork

16:52 <heat> even then, forking

16:52 <heat> Rich got angry because he didn't see changes back

16:53 <heat> now, I've heard that you folks tried to upstream stuff and it didn't go well. I don't have the full story

16:53 <LittleFox> maybe not declare something as linux-only when you want changes from os-specific forks back?

16:53 <raggi> But there's a fairly fundamental goal difference here with integrating well with the toolchain to solve problems like performance, that's hard to put into an intentionally language spec portable impl

16:53 wootehfoot has joined #osdev

16:53 <raggi> A lot of the fuchsia fork things have no place in upstream, like deleting entire dirs of stuff it doesn't need or want

16:54 <raggi> Or removing specification insecure stuff / making it obviously broke (like rand)

16:54 <raggi> Those are strong project local opinions

16:56 <heat> LittleFox, it being blatantly linux-only is what makes the project simpler

16:56 <heat> well, one of the contributing factors

16:56 <raggi> Anyway, whatever the triggers were, getting publicly angsty wasn't a good way to create change :-)

16:57 <raggi> Though it was noticed, I remember people jumping in to watch, and some discussions were had

16:57 <heat> probably. but forking something was pretty doable

16:57 <heat> or, I don't know, use bionic ;)

16:58 <heat> musl is the second largest libc in the GNU/linux space and it's severely understaffed

16:58 saltd has quit [Quit: joins libera]

16:59 <heat> C standard libraries are hard :|

16:59 <raggi> the judgement was made, before my time, that bionic would have been more work to fork/port, and while I haven't looked that closely I can totally believe that to be true

17:00 <raggi> Yeah, I would have liked to have seen more investment back to musl in some form, that's also not always as easy as it could be

17:01 <raggi> The fact that only individuals show up on the big donations list is a bit sad

17:04 frkzoid has joined #osdev

17:11 <heat> true

17:11 <heat> wasn't freebsd who got a 3000 USD donation from apple?

17:13 <heat> sorry, $100-249

17:14 <mjg> that would be funny

17:14 <heat> anyway, friendly reminder that most open source projects are severely underfunded or unmaintained

17:15 <j`ey> https://freebsdfoundation.org/our-donors/donors/?donationType=individual&donationYear=2020 apple

17:15 <bslsk05> freebsdfoundation.org: Donors | FreeBSD Foundation

17:16 <mjg> :D

17:16 <mjg> nice

17:16 wootehfoot has quit [Ping timeout: 252 seconds]

17:18 gelatram has quit [Ping timeout: 252 seconds]

17:25 wootehfoot has joined #osdev

17:26 ghee has joined #osdev

17:35 wootehfoot has quit [Ping timeout: 256 seconds]

18:04 heat_ has joined #osdev

18:05 heat has quit [Ping timeout: 256 seconds]

18:05 xenos1984 has quit [Read error: Connection reset by peer]

18:16 heat_ is now known as heat

18:16 <heat> this is ridiculous

18:17 <heat> glibc does twice the speed of rep movsb on memcpy and memmove

18:23 xenos1984 has joined #osdev

18:27 <mjg> twice than what

18:27 <mjg> also are you sure you are aligning the target

18:29 <mjg> you should start with aligned_alloc(BIGNUM, 1024); and then start playing with misaligning it

18:33 <heat> https://gist.github.com/heatd/97ecb1421d4c43ce2efddb5495be3821

18:33 <bslsk05> gist.github.com: bionic_bench.cpp · GitHub

18:34 <mjg> you don't want 8 byte alignment

18:34 <mjg> somtimes it will land at 16 or even 32, the latter being optimally aligned

18:35 <mjg> use aligned_alloc

18:35 <mjg> and later play with adding an offset to the buffer

18:35 <mjg> also glibc will NOT use ERMS for small bufs

18:35 <mjg> afair < 2KB

18:36 <heat> my ERMS get peak 54GB/s

18:36 <heat> glibc's memcpy gets 90

18:36 <mjg> with what alignment

18:36 <heat> whatever malloc is giving me

18:36 <mjg> for src and dst

18:36 <heat> 16?

18:36 <mjg> that's my point, see the remarks above

18:37 <heat> how does that matter?

18:37 <heat> glibc is manually destroying ERMS on similar sizes

18:37 <heat> and its *not* written in microcode

18:37 <mjg> there is significant penalty for the target not being aligned to 16, and some for not being aligned to 32

18:38 <mjg> i don't remember about src

18:38 <mjg> also erms is going to get REKT for all the sizes below 2KB

18:38 <heat> I know

18:38 <mjg> in the current bench you may sometimes hand over a buf which is optimally aligned, when you should not

18:38 <heat> that's not the point, you should run it

18:39 <mjg> i'm saying the glibc win >= 2KB is most likely accidental

18:39 <mjg> due to lucking out on alignment

18:39 <mjg> i have to go afk for about 90 minutes

18:42 <heat> https://gist.github.com/heatd/0d107a2933f81e81cc649dd6bf06962e

18:42 <bslsk05> gist.github.com: gist:0d107a2933f81e81cc649dd6bf06962e · GitHub

18:42 <heat> after manually unaligning everything by 1

18:46 ghee has quit [Ping timeout: 268 seconds]

18:48 MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]

18:49 brynet has quit [Quit: leaving]

18:50 frkzoid has quit [Ping timeout: 244 seconds]

18:51 ghee has joined #osdev

18:52 MiningMarsh has joined #osdev

19:04 Matt|home has quit [Ping timeout: 255 seconds]

19:09 <heat> I don't understand what's going on or how they're doing things

19:11 <heat> all I know is that this is definitely not ERMS

19:11 <heat> a straight up rep movsb (no extra logic) is half as fast as glibc memcpy

19:13 <heat> mybe its prefetching?

19:15 <GeDaMo> I would have expected movsb to do that too

19:16 Matt|home has joined #osdev

19:20 <GeDaMo> Could it be using page mapping?

19:23 <heat> like moving pages?

19:23 <heat> that would be slow af

19:24 brynet has joined #osdev

19:28 <zid> huzzah, my code is finally confirmed to work on a real playstation

19:29 <zid> It basically worked before but the person I had test it happened to be using a disc where my DMA source address was all zeros so it looked like it failed

19:39 MiningMarsh has quit [Quit: ZNC 1.8.2 - https://znc.in]

19:43 MiningMarsh has joined #osdev

19:50 <heat> okay I got a nice speedup in all string functions except memcpy

19:56 GeDaMo has quit [Quit: A program is just a bunch of functions in a trenchcoat.]

19:58 frkzoid has joined #osdev

20:26 <mjg> heat: past certain size they may be doint nt stores

20:26 <mjg> heat: you can check with perf top what's going on

20:27 <heat> I tried perf record + perf report and I got garbagio

20:27 <mjg> perf top while you bench

20:27 <mjg> BM_string_memcpy/65536 -- is this glibc?

20:27 <mjg> BM_string_erms/65536 1936 ns 1936 ns 360292 bytes_per_second=31.5324G/s

20:27 <heat> memcpy is glibc, erms is erms

20:27 <mjg> BM_string_memcpy/65536 1858 ns 1857 ns 375127 bytes_per_second=32.8738G/s

20:27 <heat> yes, those are way past the caches

20:28 <mjg> looks like excluding measurement error you got the same result for the misaligned case

20:28 <heat> the L1 cache I mean

20:28 <heat> there's no measurement error

20:28 <mjg> look, if they are twice as fast for the same misalignment, they are NOT doing erms

20:28 <mjg> most likely some simd they picked for your cpu

20:29 <mjg> just write a lol prog which keeps doing said memcpy in a loop and have a look at it with perf top

20:30 <mjg> for the 8k case

20:30 <heat> REP_MOVSB_THRESHOLD(2048 * (VEC_SIZE / 16))

20:30 <heat> 2048 * (32/16) = 4096

20:31 <heat> it should start using it after 4KB

20:31 <mjg> don't speculate and just perf top already

20:32 <mjg> what cpu do you have btw

20:34 <heat> Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz

20:34 <heat> kbl

20:36 <mjg> so what's the result

20:36 <mjg> i mean what instructions

20:37 <heat> just sec

20:41 <heat> mjg, it's using vmovdqa

20:44 <mjg> there you go then, no magic

20:45 <heat> how's it not magic

20:45 <mjg> how is it magic

20:45 <heat> it's beating rep movsb's ass

20:45 <mjg> of course it is

20:45 <mjg> you have way too much faith in erms

20:45 <heat> wdym of course it is

20:46 <mjg> if you fuck around writing memcpy and memset you immeidately find that erms sucks

20:46 <mjg> now that i said it, can you bench memset on a 4KB

20:46 <mjg> i mean perf top it

20:47 <mjg> the question is if perhaps simd will also outperform erms on that one

20:48 <mjg> and if so, could it be that clear_page would be faster with it, even with the cost of saving/restoring the simd regs

20:48 <mjg> this would be a big deal because clear_page likes to be up there on the profile when doing a fork/exec heavy load

20:49 <mjg> fwiw intel optimization manual claims that rep stosb + the size of 4096 is special cased

20:50 <heat> glibc picks stosb for 8192

20:50 <mjg> do 4096

20:50 <heat> same

20:50 <mjg> bummer

20:55 <heat> all praise glibc, the source of ancient knowledge

20:55 <heat> and.. shit static linking?

20:57 <mjg> is that even supported anymore?

20:57 <mjg> curious if all the hotpatching magic works there

20:58 <heat> i think they started supporting again recently

20:58 <heat> don't quote me on that one though

20:58 <mjg> pinky swear

20:59 <heat> https://cs.opensource.google/fuchsia/fuchsia/+/main:zircon/third_party/ulib/musl/src/string/x86_64/memcpy.S

20:59 <bslsk05> cs.opensource.google <no title>

20:59 <heat> aw no

20:59 <heat> no you didn't

21:01 <heat> i think that whatever bionic-musl hybrid I've crafted here for Onyx should me mergeable

21:02 <zid> .google

21:02 <zid> they've officially BECOME the web

21:02 <heat> .goog, .gle, .youtube

21:02 <moon-child> iirc I found rep stosb beats avx2 around l2

21:03 <mjg> heat: that routine is defo crap

21:03 <mjg> erms or not

21:03 <moon-child> and yeah you wanna align the thingy when you erms

21:04 <mjg> huh. i'm pretty sure musl uses regular stores for memset

21:04 <mjg> curious https://cs.opensource.google/fuchsia/fuchsia/+/main:zircon/third_party/ulib/musl/src/string/x86_64/memset.S does not

21:04 <bslsk05> cs.opensource.google <no title>

21:05 <mjg> up to some size of course

21:05 <mjg> interestingly even that is simplemented in a pessimal manner

21:05 <mjg> [ignoring whether simd is an option]

21:05 <mjg> [... but sse2 should be]

21:05 <heat> https://cs.opensource.google/fuchsia/fuchsia/+/b6a5a281ce50a036d20bda12805ee047f3e644d4

21:05 <bslsk05> cs.opensource.google <no title>

21:05 <mjg> wow

21:06 <mjg> that's blatantly false

21:06 <moon-child> yeah

21:06 <heat> no its not

21:06 <mjg> bro

21:06 <heat> musl's string ops are shit

21:07 <heat> I've just sped them up like 5x

21:07 <mjg> their memset was questionable, but would still beat erms for very small sizes

21:07 <mjg> < 128 or so

21:07 <mjg> and they are very common

21:07 <moon-child> mjg: out of curiosity, how is the 'fast short rep movs' on icl?

21:07 <mjg> looking at memcpy it is total crap

21:07 <mjg> the original

21:08 <mjg> moon-child: i could not be bothered to bench yet

21:08 <mjg> i don't have my original test jig so wuldh ave to recreate something

21:08 <heat> have you noticed the original memcpy has no comments?

21:08 <heat> it's great

21:08 <heat> true musl

21:08 <mjg> again, their original memcpy is 200% crap

21:08 <mjg> no argument

21:09 <moon-child> wait what the fuck is this

21:09 <moon-child> why even

21:09 <heat> why what

21:09 <moon-child> the memcpy

21:10 <heat> idk

21:10 <heat> memcpy copies memory

21:10 <heat> you need that

21:10 <heat> why is it written like that? beats me

21:11 <mjg> looks like they added code to align 8 bytes

21:11 <mjg> except they do it with per-byte loop

21:11 <mjg> utter crap

21:11 <heat> I've told them that rep movsq is also crap

21:11 <heat> which it is

21:11 <mjg> ye-ish

21:11 <geist> there's an open task to revisit this

21:11 <mjg> geist: i'm commenting on the replaced musl code here

21:11 <mjg> as for the code it got replaced with, i already commented :)

21:12 <geist> yeah i remember at the time pointing out that not everything loves rep movsb

21:12 <mjg> (if this is for the kernel, you definitely want regular stores for sizes < $SMALLNUM (256-ish))

21:12 <mjg> if this is for userspace, simd

21:12 <geist> for the kernel we dont have that either, but we at least pick variants based on the bits

21:12 <geist> erms, etc

21:13 <moon-child> by the by, thoughts on gnu_ifunc?

21:13 <moon-child> it clearly serves a purpose, but

21:13 <moon-child> https://sourceware.org/legacy-ml/libc-alpha/2015-11/msg00108.html this is an unpleasant set of provisos

21:13 <bslsk05> sourceware.org: Szabolcs Nagy - using IFUNC outside the libc

21:13 <geist> i've had a fairly hard time convincing folks at work over the years that something other than intel exists on x86. at the time someone replaced that they were like 'well all modern x86 cpus are erms, so lets do that'

21:13 <geist> but OTOH it was a general improvement over what was there

21:13 <mjg> https://git.musl-libc.org/cgit/musl/tree/src/string/x86_64/memcpy.s

21:13 <bslsk05> git.musl-libc.org: memcpy.s\x86_64\string\src - musl - musl - an implementation of the standard library for Linux-based systems

21:13 <mjg> remains screwed

21:14 <heat> it does

21:14 <mjg> geist: well ye, that code is seriously terrible and i'm very negatively surprised

21:14 <mjg> i mean the original musl memcpy

21:15 <mjg> finishing the tail with another byte loop is another wtf

21:15 <heat> https://github.com/sabotage-linux/sabotage/blob/master/KEEP/musl-improved-memcpy.patch

21:15 <bslsk05> github.com: sabotage/musl-improved-memcpy.patch at master · sabotage-linux/sabotage · GitHub

21:15 <moon-child> geist: iirc zen 3 has erms now. So just a matter of time. Unless you're talking about via or something like that?

21:15 <geist> or goldmont, etc

21:15 <heat> this is super fast but not mergeable due to licensing issues

21:16 <geist> though goldmont better have erms and ferms because it's the efficiency half of an alder lake

21:16 <mjg> heat: honestly i think they should just take raw bionic strings

21:16 <mjg> heat: bsd licensed

21:16 <heat> yes but

21:16 <moon-child> I think we've discussed specialising separately for big/little cores here before

21:16 <mjg> heat: not super optimized, but would be perfectly acceptable

21:17 <geist> okay. haven't been reading the scrollback

21:17 <moon-child> but there's still no os-level affordances for it, and seemingly no one wants to add them

21:17 <geist> as i said, there's a task to do this, but x86 has not gotten a tremendous amount of love on fuchsia

21:17 <moon-child> and yeah bionic is solid

21:17 <geist> or more to the point it runs rings around the arm hardware we also run on

21:17 <geist> so there doesn't tend to be as much of a push to optimize every bit

21:17 <moon-child> haha, well

21:18 <geist> one of the big issues is to properly optimize this stuff you really should run on a wide variety of hardware

21:18 <geist> and one of the issues i generally have with how we (fuchsi and google) tend to go about testing is if you can't continually reproduce it there's a negative impetus to get started

21:19 <geist> so it'll tend to be 'yeah you want to test on 10 differnet microarchitectures? well, right now we can only run CQ/CI on 2'

21:19 <heat> mjg, the bionic memcpy isn't a win/big win

21:19 <moon-child> heat: hahaha, they have an optimised memcpy, but memmove is still just rep stosq

21:19 <heat> it's slightly better, but that's it. I've seen rep movsq be fast

21:19 <heat> who?

21:20 <moon-child> the musl thig you linked

21:20 <moon-child> thing

21:20 scoobydoo has quit [Read error: Connection timed out]

21:20 <heat> oh yeah sure, that's just a dollar-store patch

21:21 <heat> not merged, not going to be merged, not mergeable

21:21 <heat> that AMD code is GPLv2

21:21 <geist> put it another way, since we're so testing focused (not bad), changes that are performance oriented tend to require a corresponding performance regression benchmark get added to the fleet

21:21 <geist> and we currently dont have the capability to run on a variety of microarchitectures

21:21 <geist> so it tends to be 'ugh, lemme work on something else'

21:21 <geist> vs just getting it done an then hoping it doesn't regress later

21:22 scoobydoo has joined #osdev

21:22 <geist> whereas i'd assume something like freebsd would just be yolo, get it done

21:22 <geist> which is refreshing

21:23 <geist> i mean not yolo as in completely unstructured, but i would assume somone that does a fair amount of due diligence on their own isn't required to maintain a regression tester that runs daily forever more

21:28 <mjg> tru.dat

21:28 <mjg> heat: it has some issues for sure, which is why i did not import it into freebsd

21:29 <mjg> heat: ... yet. when i have time i'm going to patch it up

21:29 <mjg> i think key is that musl cannot runtime check for any extensions

21:29 <mjg> so at best it can do sse2

21:31 <mjg> heat: in https://github.com/sabotage-linux/sabotage/blob/master/KEEP/musl-improved-memcpy.patch that memmove patch remains atrocious

21:32 <mjg> +2:movsb

21:32 <mjg> +dec %edx

21:32 <mjg> +jnz 2b

21:32 <mjg> say no to loops

21:32 <mjg> say yes to overlapping stores

21:50 <zid> https://preview.redd.it/a01fv2j6r2j91.jpg?width=772&auto=webp&s=0b25d970219811c9e4427c659a816cfa7bd44d55

21:50 toluene has quit [Ping timeout: 252 seconds]

21:50 <zid> oh that was meant for elswhere, but you can have it too

21:58 <mjg> is that zoomer humor?

21:59 <zid> no, millenial

22:07 <moon-child> zid, do you have two monitors and a light-up keyboard

22:08 <klange> lol, look at this loser with only two monitors

22:08 <zid> I do

22:11 <geist> oh daaaamn two monitors

22:11 <geist> gosh i remember when that was so exotic. way back in the dos days the only way to do it was vga + monochrome

22:12 <geist> (and yes i'm sure amiga did it like 35 years earlier than that)

22:12 <zid> I mean, my setup isn't my different

22:12 <zid> much*

22:12 <zid> one is a hd lcd, the other is a crt

22:14 <mjg> crt?

22:14 <mjg> are you playing old games or soemthing?

22:14 <zid> cathode ray tomato

22:14 <zid> oh

22:14 <zid> I do do that with it yes, but it's also just my old main monitor

22:15 <zid> good enough to throw a video onto or whatever

22:15 <zid> or another pdf

22:16 <geist> yah for a long time there they were still pretty coveted for having better color reproduction too than first or second gen LCDs

22:17 <geist> i think in geeral against a IPS or so the line was crossed. but i also remember first gen color LCDs were washed out crap

22:18 <zid> They still have basically 0 input latency

22:18 <geist> side note it seems that monitor class OLEDs may be finally starting to happen, though they're of course really pricey

22:18 <geist> yah that too

22:18 <zid> so still superior for games as long as you don't have unlocked fps PC titles

22:18 <geist> well, i wouldn't quite say superior, but they hold their own

22:18 <zid> in terms of performance

22:18 <geist> there are basically zero latency good gaming LCDs too

22:18 <geist> but then those tend to be TN, which look worse, etc etc

22:18 <zid> There are *low* latency LCDs, there are no 0 latency LCDs

22:19 <geist> though, if you can then run the LCD at a higher refresh rate you can make up for it

22:19 <zid> yea, I mentioned

22:19 <geist> since a LCD running at twice the hz even if it takes up to a frame its still faster than a 60

22:19 <zid> it's better to have an LCD if you're running unlocked fps PC stuff

22:19 <zid> but if you're playing a console game or whatever, CRT still wins

22:20 <zid> I have an arcade game I play on it a fair amount

22:20 <geist> oh 100% especially if it's a retro console. there was that period there with the advent of 3d gaming where it was really relying on the CRT to smear out the jagged lines or whatnot

22:20 <zid> I still want a GDM-FW900

22:21 <geist> yah though i didn't have tha tone, i did have a sony 19" for a long while and it was lovelyk

22:21 <geist> replaced the 17" also nice viewsonic (i think) i had before

22:21 <moon-child> https://www.ebay.com/p/52108670 $4k, apparently

22:21 <bslsk05> www.ebay.com: Sony GDM-FW900 CRT Monitor for sale online | eBay

22:22 <geist> i remember the trinitron being the shit

22:22 <zid> I've actually seen them for $50, collection only

22:22 <geist> i'm not sad i got rid of it though, they were so huge

22:22 <zid> and also for yea, 4000

22:22 <zid> depending on if they realize what it is and market it well

22:22 <geist> the last trinitron i had on hand was a SGI Indy monitor, got rid of it in 2015 or so before moving

22:22 <zid> my CRT is infact, a trinitron

22:22 <geist> i think it was 21"

22:23 <zid> basically all hdcrts are technically trinitrons though, I think either they buy the licence or it fell out of patent, not sure which

22:23 <geist> not so sure about that, i think sony was pretty hard about not licensing it

22:23 <zid> in the 80s and 90s definitely

22:23 <geist> but maybe very late on either they were the only maker or as you say people copioed it

22:23 <geist> yah

22:23 <zid> CRTs didn't stop being made until like 2012

22:24 <geist> i vaguely remember the first LCD i got was a viewsonic in about 2001. and it was terrible, i returned it

22:24 <geist> but 2 or 3 years later they started to get better enough that they were acceptable

22:24 <geist> probably 2007 or so? I still have the early Dell monitor i got

22:24 heat has quit [Ping timeout: 256 seconds]

22:24 heat_ has joined #osdev

22:24 <zid> Somewhat counter-intuatively, LCDs tend to die easily

22:24 <zid> because of cheap powersupplies and stuff

22:25 <geist> 17" Dell. still a great monitor for hooking up random things since it also does composite and whatnot

22:25 <geist> yah or the backlight in the early gens tended to be a big gas filled thing

22:25 <geist> also early gen LCDs were power hogs

22:25 <zid> They still are tbh

22:25 <heat_> mjg, sse2 is still as fast/slightly slower than rep movsb/q

22:25 <zid> until LED TV was a thing

22:25 <zid> and the LED there literally just means it's backlit by LEDs

22:25 <zid> which is *very* recent

22:25 <heat_> I assume you get the big improvement with a super tuned loop with avx stores

22:25 <geist> nah, not anymore. the ones with lights distributed across the back use mostly nothing sitting still

22:26 <zid> and they *still* die

22:26 <geist> sure, but i think computer monitors have generally doe the LED styff for a while

22:26 <zid> they get warm and the LEDs cool

22:26 <zid> cook*

22:26 <heat_> mjg, re: runtime checks, it's an odd issue. they don't support GNU ifuncs

22:26 <geist> all of my monitor deaths i've experienced have been either cracked screen (fairly fragile, something poked em) or a line or a quadrant just dies

22:26 <heat_> iirc it was a big philosophical question, like most of musl problems really

22:27 heat_ is now known as heat

22:27 <geist> re: libc and runtime checks, i also think that's why we've put it off in fuchsia. roland is starting to work on a) getting the runtime patching working nicely and i think b) moving away from musl

22:27 <geist> so i suspect the memcpy/etc updates will come as a result of that

22:28 <geist> i think there's some emphasis on the llvm-libc project. hopefully us putting some energy behidn it will accellerate that project

22:28 <geist> it *seems* like a pretty good idea and the design is i think supposed to honor embedded libc needs as well

22:28 <heat> what's the minimum cpu requirement for fuchsia?

22:28 <zid> 1

22:28 <geist> so maybe finally will get a good modern libc that can also be scaled down properly

22:28 <heat> thank

22:29 <geist> i think the needs of the pigweed project (also under the fuchsia umbrella) is informing that

22:29 <heat> that's under fuchsia?

22:29 <geist> heat: on arm it's armv8.0, 64bit. on x86 it's basically -march=x86-64-v2 and about that line in the sand where cores supported it

22:30 <geist> so approximately 2009 era x86s

22:30 <zid> huh what's the -v2?

22:30 <heat> sysv abi stuff

22:30 <heat> google ti

22:30 <heat> *it

22:30 <heat> or bing it

22:30 <zid> I'll ask jeeves

22:30 <geist> https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels

22:30 <bslsk05> en.wikipedia.org: x86-64 - Wikipedia

22:30 <zid> ah, sse3

22:30 <geist> it's a relatively new categorization that gcc and llvm have standardized on

22:30 <zid> oh sse4

22:31 <geist> yah basically 2nd gen x86-64s

22:31 <geist> so that cuts out very first gen atoms (bonnell, etc) and is about where nehalem came around

22:31 <zid> That's the "non intel core that only appeared in 2 small laptop ranges and nobody has ever actually seen" designation :P

22:31 <geist> which is i think a fairly acceptable line

22:31 <heat> I require Haswell

22:31 <heat> which has, among other things, AVX

22:32 <zid> rip me then

22:32 xenos1984 has quit [Read error: Connection reset by peer]

22:32 <geist> also there are things like XN and 1GB pages that zircon intriniscally assumes is present

22:32 <zid> what's haswell got that you need?

22:32 <geist> the 1GB pages we can probably standardize on, but haven't looked super closely at where it precisley showed up

22:32 <heat> avx, 1GB pages (I don't need them, my kernel doesn't assume any of the features really, but they're nice)

22:32 <zid> oh sandy has those

22:32 <geist> XN i think was basically there from day one except maybe some very early P4s

22:32 <heat> I think I use rdrand?

22:32 <geist> also v2 picks up things like cmpxhg8b

22:32 <zid> I do not have rdrand

22:33 <geist> ie, dual word cmpxgh, which wasn't present in K8 (x86-64-v1)

22:33 <geist> there are a few things in the kernel that use it

22:33 frkzoid has quit [Ping timeout: 244 seconds]

22:33 <heat> my kernel is pretty cpuid agnostic

22:33 <heat> give it an x86_64 cpu with long mode and it Just Works

22:33 <heat> userspace isn't though

22:34 <geist> yah, LK generally is too, but tere are a few features that aren't necessarily implicit (1GB pages is a common one that's easy to just assume is there)

22:34 <zid> I am also cpuid agnostic

22:34 <zid> in that I don't check it

22:34 <geist> and of course the kernel has to opt in for the various AVX context switch stuff

22:34 <geist> also v2 is kinda nice becaus eyou can rely on things like popcnt which comes in handy

22:35 <geist> anyway recent clang and gcc both support -march=x86-64-vN as a switch

22:35 <geist> so it's nice to use to set the baseline

22:35 <zid> yea seems like a vrey good addition

22:35 <zid> considering literally 0 people own a 'core' cpu and not a core2

22:35 <mjg> heat: sse2 is way faster than erms up to a point

22:35 <mjg> heat: and definitely faster than regular movs

22:35 <moon-child> why use rdrand?

22:35 <heat> entropy

22:35 <mjg> heat: ... for the range where they make sense

22:35 <heat> early boot entropy at least

22:35 <moon-child> ehhhh

22:36 <geist> mjg: yeah but i think at the point that really starts to matter you're already tuning something that works well

22:36 <geist> i think most of us are just trying to get the thing to work. whether or not it's tuned, that's gravy

22:36 <geist> and or a fun weekend project

22:36 <mjg> for real man, erms for sizes < 256 or so is straight up garbage

22:36 <mjg> but i think it is time to ditch this subject

22:36 <geist> my general take is the kernel shouldn't copy things around if i has to, or if it is it's probalby user/kernel copies (which tend to be special code anyway) or it's page aligned

22:36 <mjg> want the last word? :)

22:36 <moon-child> heat: sufficiently malicious rdrand can look at what entropy you've already gathered and deliberately poison it

22:36 <mjg> it's not just copy, it is tons of zeroing

22:37 <heat> mjg, my benchmarks show that it's not for "usual erms sizes"

22:37 <geist> sure. trouble is most zeroing i've seen (in zircon at least) is compiler generated

22:37 <geist> and then it does whatever the fuck it wants

22:37 <zid> tinybench shows that I might as well use movs unless I use a super avx version on a big copy

22:37 <geist> or it's page zeroing, in which case you already have ac ustom routine

22:37 <zid> I get 40GB/s no matter what the hell happens

22:37 <geist> yah totes mcgotes

22:37 <heat> moon-child, oh no?

22:37 <mjg> geist: do you have a way to attach yourself to memset?

22:38 <mjg> and trace generated calls?

22:38 <mjg> it happens a lot on linux nad freebsd, and sizes tend to be small

22:38 <geist> in general most of the zeroing is stuff like 'initialize this pobject to zero' in which case it's implicit

22:38 <mjg> i would be surprised if it did not on fuchsia

22:38 <geist> but a lot of that is because c++

22:38 <heat> moon-child, i'm not sure why my first concern would be "rdrand is fucked" if the cpu is against me

22:38 <geist> that *tends* to be a bunch of movqs

22:38 <mjg> it is a bunch of them if the compiler knows the size

22:39 <geist> which actually drives me nuts to see the codegen is a series of 10 byte moveqs with 8 bytes of 0 in the instruction

22:39 <mjg> and it is small-ish

22:39 <zid> Yea small objects almost always just assemble to a couple of inline movqs regardless

22:39 <mjg> when it does not you are scrweed

22:39 <geist> *otoh* the cpu probably eats that shit up

22:39 <geist> my risc brain says 'fill in a register and write it out!'

22:39 <geist> but most likely movqs is superior, even if it's huge code

22:40 <heat> movq?

22:40 <heat> the mmx thing?

22:40 <geist> x86-64's 8 byte move

22:40 <zid> mov qword

22:40 <zid> (or movabs if you're feeling constnat)

22:40 <geist> *with* an immediate. iirc it's the only way to easily get an 8 byte immediate into a register or diretly to memory

22:40 <geist> oh way, movabs isactually what i was thinking about

22:40 <moon-child> huh I'm surprised that's better than the alternative

22:40 <geist> the compiler tends to spit out a ton of movabs

22:40 <moon-child> given zeroing idiom should just rename

22:41 <geist> moon-child: that's my thought too, but both gcc and clang seem to think movabs is superiod

22:41 <geist> and they're probably right, annoyingly

22:41 <geist> or at least they're right in a microbenchmark where icache pressure doesn't matter

22:41 <heat> they probably avoid it due to register allocation?

22:41 <moon-child> mmmm

22:42 <geist> ignoring it chewing up lots of icache i assume the compiler just flattens a movabs 0 to some specific µop that slams it out

22:42 <geist> s/compiler/core

22:43 <moon-child> but again, zeroing idiom should be handled in the frontend, so all you have to retire is a store

22:43 <geist> but yeah agreed. perhaps you have to use -Os to get a zered register + store

22:44 <moon-child> probably yeah. I bet it's register allocation, like heat said, but would be nice if it could do it dynamically, depending on if there are free registers

22:45 <moon-child> usually do instruction selection before register allocation. I heard there was some research on doing them at the same time

22:45 <geist> https://gcc.godbolt.org/z/5dn939GTY is what i'm talking about

22:45 <moon-child> apparently it's a lot more expensive, but somebody was working on making it less expensive than it otherwise would be

22:45 <geist> it'd be far less code if it zeroed a register and then used smaller instructions

22:45 <zid> Yea it does feel wrong doesn't it

22:45 <geist> and yeah it's movq (movabs i think can only move a constant to a register)

22:45 <zid> to see it generate all those immediate 0s

22:46 <moon-child> haha, gcc generates rep stos with -Os

22:47 <zid> gcc is a bro

22:47 <geist> https://gcc.godbolt.org/z/vaEnWdf5W also interesting that gcc switches to rep stos pretty fast

22:47 <geist> actually doesn't need -O2 really

22:47 <geist> though i picked an arbitrary -mtune there

22:47 <geist> https://gcc.godbolt.org/z/eWEhnsv41 switch it to skylake and it's back to movq

22:48 <geist> that itself is intereting that gcc thinks haswell vs skylake is a different thing

22:49 <moon-child> yeah

22:49 <moon-child> rep stos is the same on haswell and skylake according to agner

22:49 <geist> though at 256 bytes it switches to rep stosb

22:50 <geist> so it presumably knows about the 'stosb doesn't get fast until approx 256 bytes' thing that mjg is talking about

22:51 <mjg> stosb does not get fast for a long time, it's just without simd your only option is to do movs and those start sucking real fast

22:52 xenos1984 has joined #osdev

22:52 <mjg> so it's the lesser crapper

22:52 <heat> gcc with -Os will make memcpy into rep movsb pretty quickly

22:52 <geist> so really the summary of all of this is 'with erms you should move <256 with a loop and ideally 8 bytes at a time untl you cant'

22:52 <mjg> fastest is 32-byte loop afaics

22:53 <mjg> for the < 256 range

22:53 <geist> via 4 stores of 8?

22:53 <mjg> it beats the 16 byte i had and 64 byte does not add speed up

22:53 <heat> how does frms work?

22:53 <mjg> yea

22:53 <geist> and then frms says just use rep stosb and be done with it?

22:54 <mjg> frms claims the startup latency is way slower, but i don't see any numbers

22:54 <geist> so without emrs it's back to the usual 'move 32 bytes at a time until you have less than 32 bytes and then whatever?'

22:54 <mjg> for all i know it still remains pessimal up to a range

22:54 <geist> well, i'm putting together an order for an alder lake i7-12700k tomorrow probably

22:54 <geist> so should have one floating around to run tests for folks once it omes in and i build it

22:54 <mjg> i have free credits for ec2 'n shit

22:55 <mjg> i can probably get an fsrm-capable box within minutes

22:55 <geist> yah though might want to be careful there since the memory bandwidth may be all over the place

22:55 <mjg> i just don\t have benchmark code readily available

22:55 <mjg> one can start with most naive test possible: just zero or copy something small in a loop

22:56 <mjg> with the same src and target

22:56 <heat> you should use my thingy

22:56 <geist> i dont mind helping out if you need it. have a medly of skylakes, ivy bridges, zen 1, zen 2, and soon alder lake

22:56 <moon-child> I don't have credits, but I had to rent one of their servers recently

22:56 <mjg> if startup latency is to be measured here, that's good enough

22:56 <moon-child> to test/repro a 32-bit overflow; needed 100gb of ram or so

22:56 <mjg> geist: i got skylake, westmere, sandy, haswell and some lol atoms

22:56 <moon-child> just for a few minutes, it was less than a dollar I think

22:56 <moon-child> really cheap

22:56 <geist> mjg: yah you should get some AMDs into the mix since they tend to be different enough

22:57 <geist> actually the erms startup cost on a zen 3 would be fascinating, since that

22:57 <mjg> i know

22:57 <geist> 's when they claim ERMS (though no FRMS)

22:57 <mjg> i'm working on long term access to some fresh zens

22:57 <geist> i have one, can run code for you if you want

22:57 <geist> my desktop machine is a 5950x

22:57 <mjg> it's getting late here, will have to turn in soon(tm)

22:57 <geist> no worries

22:58 <mjg> i'll hack up something trivial to just measure startup latency this week

22:58 <mjg> [already monday here :>]

22:58 <geist> i'm just hanging out sunday afternoon at the local brewery. it's inside and away from the daystar

22:58 <mjg> very likely a funny test will do fine: plug in the code into will-it-scale and check iterations/s

22:59 <mjg> no cache flushing or other shenanigans

22:59 <mjg> since again, only startup latency is to be checked

22:59 <geist> yah funny i was firing up some older raspberry pi the other day (pi400 i think) and it had some benchmark code from doug16k on it

22:59 <geist> had got me to run something on an arm

22:59 <moon-child> need some kinda barrier

22:59 <kof123> you guys are really hammering the "summon doug16k" today :D

22:59 <moon-child> to depend on the movs result

22:59 <moon-child> to make sure they don't overlap

22:59 <geist> haha doug16k is the chosen one

23:00 <geist> he wil bring balance to the osdev

23:00 <kof123> he came back for a short appearance

23:00 <geist> oh earlier today? or just earlier in the last fwe weeks?

23:01 <kof123> weeks/months.

23:01 <heat> like a month or two ago

23:01 <kazinsal> couple weeks back iirc

23:01 <geist> ah yeah

23:04 <mjg> really weird part about all of this is that intel folks patched linux to just roll with erms

23:04 <mjg> claiming FAST

23:04 <mjg> years later FSRM shows up, which debunks the previous claim

23:04 <mjg> in the meantime someone patched copy_to/from_user to use regular movs up to 64 bytes or so

23:04 <mjg> instead of erms

23:04 <mjg> but did not patch memcpy

23:05 <mjg> weird af

23:05 <heat> it's like a conspiracy but everyone is bad at everything

23:05 <moon-child> lol

23:05 <mjg> i mean if you found out the hard way erms for small sizes sucks, why do you only patch ONE place

23:06 <mjg> although in our current landscape with all the vuln mitigations i can't say if fixing this kind of stuff is more important than ever

23:06 <mjg> ... or meaningless

23:06 <geist> yeah well also intel probably assumes AMD doesn't eixst and vice versa

23:07 <geist> though they at least had the cpommon courtesy to put it behind a cpuid bit

23:07 <mjg> well i'm happy to add to "wtf man"

23:07 <geist> mjg: yeah that's also what i've been a little sad about

23:07 <geist> with vulns it feels like a lot of these microoptimzations which used to be fun and/or matter dont

23:07 <mjg> the documented idiom is to rep stosq + finish with rep stosb

23:07 <geist> since it's far more important the code be safe etc than doing things quickly

23:08 <geist> or the thing you thought mattered now doesn't because the codegen is terrible around it, etc

23:08 <mjg> ... except that's TERRIBLY slow and there is a cheap hack which works around it big time

23:08 <geist> very much an instance of 'we can't have nice things'

23:08 <mjg> you can finish it off like so:

23:08 <mjg> movq %r10,-8(%rdi,%rdx)

23:09 <mjg> rdi is the target buf, rdx is the tail

23:09 <geist> the arm64 memcpy does also the interesting idiom of always moving by (multiple of register bytes) including th elast iteration

23:09 <geist> where the last iteration is just offset and double copied

23:09 <geist> a thing i wouldn't hae thought of honestly

23:09 <geist> since i had always assumed you always copy every byte precisely once

23:09 <heat> overlapping stores?

23:09 <geist> yah

23:09 <heat> mjg loves overlapping stores

23:09 <geist> like if you have to copy, say, 9 bytes. the trivial way would be an 8 byte move + another 8 byte move offset by 1

23:10 <mjg> that's the trick

23:10 <geist> (obviously it's more complicated, but that's a simplified version)

23:10 <mjg> movq %r10,(%rdi)

23:10 <mjg> movq %r10,-8(%rdi,%rcx)

23:10 <geist> yah my old days of doing memcopies was all about aligning everything because can't do unaligned

23:10 <mjg> bam, range 8-16 with no branching on the exact size

23:11 <mjg> not my idea, but fucking great

23:11 <geist> yah

23:11 <heat> your idea for sure

23:11 nyah has quit [Quit: leaving]

23:11 <geist> OWN IT

23:11 <mjg> actually you are correct

23:11 <heat> genius memcpy man show us the way

23:11 <mjg> as soon as i get back from the patent office

23:12 <geist> i havne't looked into what the current state of the art is on riscv but presumably there's room to optimize for cores that can and cant do unaligned accesses

23:12 <geist> it's like the early 2000s era arm all over agian

23:12 <geist> where the exact alignment and prefetcability and whatnot really matters

23:12 <mjg> heh i had not looked at riscv in almost any capacity

23:12 <heat> i assume that the current state of the art on riscv is "none, cpus are still slow, check again in 4 years"

23:12 <mjg> i ran into one lolzer though

23:12 <moon-child> i thought riscv has an sve

23:12 <moon-child> so you can do masked stores

23:12 <mjg> there was a hand-rolled bit op somewhere which i replaced with a compiler builtin

23:13 <mjg> ... which turn out to generate slower code on that platform

23:13 <mjg> (:

23:13 <geist> moon-child: right except it's also optional and probalby dont want to use in the kernel, so you're back to multiple versions again

23:13 <geist> i think the annoying thing is the riscv arch does not specify that unaliged accesses work and/or are efficient

23:13 <geist> unlike say armv8 which finally just flat out mandated it

23:13 <mjg> makes me wonder if the above addressing was invented by someone pissed at memset et al

23:14 <mjg> it's now my headcannon

23:14 <moon-child> which? -8(%rdi,%rcx)?

23:14 <moon-child> i like it

23:14 <geist> and as far as i can tell there's no good way to determine if unaligned is supported, since there's no cpuid equivalent and i ont think it's described in the device tree

23:14 <geist> this is part of the growing up that riscv is dealing with currently

23:14 <moon-child> 'no cpuid equivalent' wat

23:14 <heat> yeah

23:14 <heat> its a machine register

23:14 <heat> misa?

23:14 <moon-child> for something that relies so heavily on extensions, how do you even

23:14 <moon-child> oh

23:15 <heat> it's just a string in the device tree

23:15 <moon-child> so it does have a cpuid equivalent :P

23:15 <heat> you parse the characters

23:15 <moon-child> oh

23:15 <moon-child> ok

23:15 <geist> yah and that only tells you if a feature is present,a nd it's only available in machine mode

23:15 <moon-child> that's fine, I guess

23:15 <heat> starts with "rv$BITNESS"

23:15 <geist> so it's literally like 26 bits of features

23:15 <heat> then each char is an extension

23:15 <mjg> you do the access on purpose and see if you got the trap

23:15 <mjg> there

23:15 * moon-child trouts mjg

23:15 <geist> mjg: yeah except perhaps the machine mode monitor emulates it for you

23:15 <heat> practical development with overlapping stores man

23:16 <mjg> it's not a serious proposal, but i would expect linux to do it

23:16 <heat> why is it not serious?

23:16 <heat> it totally works

23:16 <geist> anyway, it's part of this ying and yang that riscv is going through now. trying to keep simple but also deal with practical realities of a fractured arhitecture

23:16 <geist> so will se

23:16 <mjg> heat: hack as fuck

23:16 <heat> you're in the kernel

23:16 <heat> hack as fuck is your middle name

23:16 <geist> edxdcept in riscv kernel is not the root. you have this machine mode to worry about

23:17 <heat> do you think .text patching isn't hack as fuck? it's hack as fuck^fuck

23:17 <mjg> .text patching is fine

23:17 <geist> anyway i think mjg was going to bed

23:17 <mjg> unless it's what's solaris is doing

23:17 <geist> i tried not to keep going!

23:17 <mjg> last rant!

23:17 <moon-child> lemme just memcpy the sse memcpy over the avx memcpy if cpuid says there's no avx support--wait...

23:18 <mjg> they introduced "zero cost probes" for dtrace

23:18 <mjg> except if you take a look, they cost a lot

23:18 <zid> have you considered AVX copying your memset code into memory, to see if it makes it faster by filling icache sooner

23:18 <mjg> the marketing was that there is no branching on whether the probe is on

23:18 <mjg> what they did is the laziest fcking hack you can imagine

23:18 <mjg> the let the compiler generate the func call to the probe as if it was there all along

23:19 <mjg> and then they nop out actual call instruction

23:19 <heat> ok?

23:19 <heat> that's pretty standard

23:19 <heat> linux does that

23:19 <mjg> no

23:19 <mjg> they still set up registers for the call

23:19 <geist> yes and no. if you did it the way mjg descvribes it also means all the code around it is... yeah

23:19 <mjg> and recovery afterwards

23:19 <mjg> in the fast path

23:19 <geist> flattening regs, etc

23:19 <heat> particularly for ftrace

23:19 <geist> whereas what you probablyr eally want is something that calls a vaneer routine that dumps registers an whatnot and then calls through to the real one

23:19 <mjg> hot patching as done normally with asm goto injects a nop sled

23:19 <heat> the only thing you know is the callsite

23:20 <geist> that way you take the hit only when you make the call

23:20 <mjg> but all the reg + call manipulation is moved elsewhere

23:20 <mjg> so the impact on the func with the probe is just the nops

23:20 <mjg> as opposed to partially erased function call

23:20 <geist> it's the hidden cost of a function call, you end up dumping regs, trashing some of them, etc

23:21 <mjg> so tl;dr the "zero cost probes" are far from zero cost

23:21 <geist> yah seems like it'd also turn things like leaf nodes into not leaf nodes

23:21 <mjg> and to be clear, the 5 byte nops would qualify as zero in my book

23:21 <geist> beause there is otherwise a function call

23:22 <mjg> .. as seen with asm goto

23:22 <geist> yah and they deeloped it mostly in sparc probably, which would have been just a single call (and potentially the branch delay slot after it)

23:22 <mjg> i need to write a complete rant about supposed solaris scalability and post it to tuhs

23:22 <mjg> i need answers

23:23 <geist> well most likely the arguments for it was in the 90s/2000s wher eit probalby was superior to the alternatives

23:23 <geist> but times change

23:23 <moon-child> could even make it a 2 byte nop; 1 byte jump to a 4 byte jump

23:23 <moon-child> that's what windows does iirc

23:23 <mjg> cheeky

23:24 <geist> god reminds me of some exploit fix we have to compile the kernel with for some reason taht causes it to nop stuff branches that cross 32 byte boundaries or whatnot

23:24 <geist> beause of some stupid skylake hack

23:24 <mjg> geist: i don't know about that man. i can tell you for a fact though that an era appropriate machine (sandy bridge, 40 threads) from where sun was still "alive" runs into drastic perf problems on solaris

23:24 <geist> it's bad. i should ask if we still need it

23:24 <mjg> geist: which i profiled to just terrible scalability overall

23:24 <geist> mjg: frankly i i'm not sure solaris was ever that serious on x86

23:25 <mjg> i guarantee it sucks terribly on sparc as well

23:25 <geist> it was mostly serious when they had 32 core sparc machines when everything else was doing go to be 2 or 4 way

23:25 <mjg> it's all mutexes et al taken A LOT

23:25 <mjg> on shared objects

23:26 <mjg> in general their kernel weirdly lacks smp infrastructure to begin with

23:27 <mjg> even the basic stuff like annotations to keep vars in disjoint cache lines

23:27 <geist> (these are all things you're describing that zircon does not yet do :) )

23:27 <mjg> i mean one annotated var per line

23:27 <mjg> does zircon have a reputation of scaling to high core counts?

23:27 <mjg> i don't see any mentions in the old books man :)

23:27 <geist> depends on what high is nowadays

23:28 <geist> 8 or 10 cores was a shit ton of cores like 20 years ago

23:28 <moon-child> wasn't zircon supposed to replace android

23:28 <mjg> even bonwick's slab paper is kind of funny here

23:28 <moon-child> on phones with, like, 4 cores?

23:28 <geist> that was written in the 90s man

23:28 <geist> seriously 4 cores was a fairly large machine then

23:28 <mjg> replaced globally locked allocator, got a great speed up for it

23:28 <mjg> .. but that means the kernel did not scale for shit already

23:28 <mjg> the paper is late 90s afair

23:29 <heat> moon-child, no

23:30 <heat> fuchsia is supposed to run on everything

23:30 <geist> i met jeff bonwick once. he's a really nice guy

23:30 <heat> just like zombocom, everything is possible

23:30 <geist> ZOMBOCOM

23:30 <geist> we have a tremendous amount of work to do, which is why we're always looking for good kernel engineers

23:30 <heat> but probably not too low end hardware; it relies on 64-bit stuff and memory fragmentation and all that

23:30 <geist> and engineers that are also good at dealing with the fact that you an't do everything at once in one step

23:31 <geist> it's a long process. that's part of the essence of engineering. making baby steps

23:31 <mjg> aah there it is

23:31 <mjg> supposed multithreading-friendly userspace malloc turned out to be de facto globally locked

23:31 <mjg> seen in the vmem paper

23:31 <mjg> bonwick accidentally calling sun out

23:31 <heat> hm?

23:31 <geist> again 'm not really sure it's fair to judge old designs by modern stuff

23:32 <geist> though os design hasn't changed tha tmuch int he last 20-30 years i think a tremendous amount of the details of how to scale has

23:32 <geist> and also the very real realization that you can't always get wha tyou want

23:32 <geist> it may be their stuff wans't perfect at the time, but was sitll better htna most of the competition which was even 'worse'

23:32 <mjg> i don't judge ideas

23:33 <mjg> i'm pointing out their own tests result show that the code did not scale

23:33 <moon-child> apparently hoard showed up in 1999

23:33 <mjg> ... at time of publication

23:33 <moon-child> sure, but did anyone know about it prior to that?

23:33 <geist> mjg: scale to what?

23:33 <mjg> to whatever hw they were testing on

23:33 <geist> compared to what they had before? to the competition?

23:33 <moon-child> anyway, I'd expect all allocators prior to that to suck on multithreaded

23:33 <geist> to modern standards?

23:34 <mjg> well let me restate

23:34 <geist> moon-child: yeah i remember hoard being a thing there. i actually used it in newos years ago. we were thinkig about it (or aybe did use it) in beos at the time

23:34 <mjg> by the end of the 90s solaris kernel had a globally locked kernel allocator. with this state there is no way it scaled for shit at the time

23:35 <geist> mjg: okay. fine.

23:35 <mjg> despite popular opinions to the contrary

23:35 <mjg> then they patched "mtmalloc". which was for userspace and was supposed to scale

23:36 <mjg> erm, bad sentence

23:36 <mjg> then they ported the new allocator to userspace and benchmarekd it against mtmalloc

23:36 <mjg> turns out the multithreading-friendly allocator they had was de facto globally locked as well

23:36 <mjg> you can literally see it in the vmem paper

23:36 <mjg> so again, did not scale for shit

23:36 <mjg> the graph claims a 10 cpu system

23:37 <mjg> and that's from sun's own published material

23:38 <mjg> https://www.usenix.org/legacy/publications/library/proceedings/usenix01/full_papers/bonwick/bonwick.pdf page 16

23:39 <moon-child> oh they cite hoard

23:40 <moon-child> anyway their thing looks pretty linear? What's the issue with that graph?

23:40 * geist goes to take a walk

23:40 <mjg> from checking commit history to mutex code (2005 onward i think) i also know for a fact that any contentnion on boxes bigger than few cores very negatively affected perf

23:41 <mjg> moon-child: the graph is fine. the point is the malloc they had for multithreading progs turned out to not work

23:41 <mjg> see mtmalloc performance

23:41 <heat> oh yeah I should switch to scudo

23:42 <moon-child> oh the mtmalloc thing was what they were using?

23:42 <moon-child> I see

23:42 <mjg> re mutex, they check if the lock owner is running. but they used to do it by walking all per-cpu scheduler state and checking if the thread is perhaps scheduled on that one

23:43 <moon-child> urk

23:43 <mjg> effectively constantly pulling out what should be virtually always exclusively owned lines

23:44 <mjg> rant over

23:44 <mjg> good night :)

23:45 <heat> what kind of night doesn't end with a rant about solaris scalability really

23:45 <heat> or D O O R S

23:45 <moon-child> new possibilities lie just outside the door!

23:46 <moon-child> oh, on the topic of self-published benchmarks where you Just Don't Scale

23:47 <klange> suffering from decision paralysis

23:47 <klange> too many things I want to do

23:47 <moon-child> https://github.com/xacrimon/conc-map-bench/blob/master/results/ReadHeavy.fx.throughput.svg here's a benchmark from rust people of their 'blazing fast concurrent hashmap' (ahem)

23:47 <bslsk05> github.com: conc-map-bench/ReadHeavy.fx.throughput.svg at master · xacrimon/conc-map-bench · GitHub

23:47 <heat> come unix

23:48 <klange> Also I discovered last night that I had accidentally comitted a (shareware, thankfully) DOOM1.WAD to my repository earlier this year while working on ARM stuff.

23:48 FreeFull has quit []