#osdev on 2022-02-19 — irc logs at libera.irclog.whitequark.org

2021-05-23 01:57 klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books

00:09 <klange> would be nice, I use __seg_gs for x86_64; for extra fun, it's static const = 0;

00:11 <klange> I think gcc is producing very similar code against x18 for me, so I've been content

00:27 <mrvn> klange: as long as you always access stuff through the pointer. I would like to make x18 part of the type so pointers in structs will be relative to x18.

00:29 <mrvn> auto *p = &this_core->foo; should e.g. be 8 and *p compile to 8(x18)

00:32 [itchyjunk] has quit [Ping timeout: 240 seconds]

00:36 [itchyjunk] has joined #osdev

00:41 blockhead has joined #osdev

01:01 Matt|home has joined #osdev

01:07 <klange> mrvn: I think it does, but I'm away from my desk - I'll check when I get home.

01:14 <klange> oh I misread, yeah, especially not in C, but also yes I always access through the pointer and anything else is bound to be nonsensical; if I want other references, like pointers to the values within the structs, those end up as refs through the array that the thread pointers point into

01:15 <heat> im so confused

01:15 <heat> everything is fucked

01:19 <mrvn> klange: I think I will try to trick the compiler into doing this in C++ with some operator overloads, making pointer like objects. I want the peripheral base address in a register on the RPi.

01:22 heat has quit [Remote host closed the connection]

01:34 <clever> mrvn: mostly, ive just made it a compile time constant, i can make the entire pi0-pi3 range have the same value

01:34 <clever> but the MMU can also be used to solve the model differences

01:34 <mrvn> pre-mmu stuff.

01:35 <clever> yeah, that becomes a bit more of an issue

01:35 <mrvn> are you tweaking the config to make them all use the same base address? How?

01:35 <clever> there is a broadcom custom mmu between "arm physical" and the real bus

01:35 <clever> which can remap 16mb pages

01:35 <mrvn> I know, but how do I change the mapping?

01:35 <clever> you would need to use the open firmware

01:35 <mrvn> or do you mean your own VC firmware?

01:36 <mrvn> ahh, ok

01:36 <clever> the closed firmware doesnt give you the choice

01:36 <clever> and picks a random number out of a hat every time you jump models :P

01:36 <mrvn> I like getting the address from the device tree. Have to do that on other archs too.

01:37 <clever> if you're respecting DT properly, every single peripheral gets its own base-addr, and you have a dozen variables, for the base of each hw block

01:37 <clever> so you're always doing local-base + reg-offset

01:37 <mrvn> hmm, true.

01:37 <clever> but if you're ignoring DT like i am, you can hard-code the right addr, and it just becomes a 32bit constant to load

01:38 <mrvn> the local-base won't be a compile time constant so there is no gain in basing it off a fixed register.

01:38 <clever> the compiler may cheat, load a common base, and then do short offsets

01:38 <mrvn> *sigh* there goes that idea.

01:38 <clever> the register idea only works if you're not respecting DT

01:38 <mrvn> *nod*

01:38 <clever> LK solves this whole issue, by just not doing mmio until the mmu has been setup

01:39 <mrvn> using a special pointer like object will be useful though. Encapsulate the volatile access and such automatically.

01:39 <clever> linux mostly does the same thing, but early-printk can either compile a phys addr into the kernel, or set a phys addr in the kernel cmdline

01:40 <clever> rpi-open-firmware is doing things in a few layers

01:40 <clever> #define SCALER_DISPCTRL HW_REGISTER_RW( 0x7e400000 )

01:40 <clever> first, you define a register like this

01:40 <mrvn> It's for raspbootin, my chain loader to load the real kernel over serial. So I don't want to change anything in the hardware I don't have to before the real kernel starts. But I need serial IO.

01:40 <clever> for VPU side code, that is the real address, so you just #define HW_REGISTER_RW(addr) (*(volatile uint32_t *)(addr))

01:41 <mrvn> I want a single raspbootin image that boot on all my arm devices no matter what.

01:41 <clever> for ARM baremetal, you have to translate things, so you:

01:41 <clever> # define HW_REGISTER_RW(addr) (*(volatile uint32_t *)(VC4_TO_ARM_PERIPH(addr)))

01:41 <clever> #define VC4_TO_ARM_PERIPH(addr) ((addr - VC4_PERIPH_BASE) + ARM_PERIPH_BASE)

01:41 <clever> but, i added more magic:

01:41 <clever> # define HW_REGISTER_RW(addr) (*(volatile uint32_t *)(mmiobase + (addr & 0x00ffffff)))

01:41 <clever> mrvn: this assumes you have an mmiobase variable in scope, and references everything off that!
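A minimal sketch of the `mmiobase` variant of the macro, assuming a hosted build where a plain array stands in for the 16 MiB peripheral window. `fake_periph` and `demo_mmio` are hypothetical names for illustration; on real hardware `mmiobase` would instead be set to the detected ARM-side peripheral base, or to the result of mapping /dev/mem.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the 16 MiB peripheral window (hypothetical, so this runs hosted). */
static uint32_t fake_periph[0x01000000 / sizeof(uint32_t)];

/* Set once at startup, e.g. after detecting the model. */
static volatile uint8_t *mmiobase;

/* Keep only the low 24 bits (offset inside the window) and rebase onto mmiobase. */
#define HW_REGISTER_RW(addr) \
    (*(volatile uint32_t *)(mmiobase + ((addr) & 0x00ffffff)))

#define SCALER_DISPCTRL HW_REGISTER_RW(0x7e400000)

/* Hypothetical helper: write through the macro, then read the backing store directly. */
static uint32_t demo_mmio(uint32_t value)
{
    mmiobase = (volatile uint8_t *)fake_periph;
    assert(((uintptr_t)mmiobase & 3) == 0);   /* 32-bit accesses need alignment */
    SCALER_DISPCTRL = value;
    return fake_periph[0x400000 / sizeof(uint32_t)];
}
```

Because every register goes through one variable, the same binary can serve several models, at the cost of a load and an add per access instead of a compile-time constant address.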

01:42 Brnocrist has quit [Ping timeout: 240 seconds]

01:42 <clever> this lets me access MMIO from linux userland (mmap /dev/mem) or what you're wanting, just set a var after you detect the model

01:42 <mrvn> but as you said every peripheral base address is variable, coming from the DT

01:43 <klange> what if peripheral base... as externally defined symbol, with self-relocations? fill it in on load... :thinking-emoji:

01:43 <clever> heh

01:43 * gog contemplates becoming one of the ARM kool kids

01:43 <klange> we've got delicious pi

01:43 <gog> that is alluring ngl

01:44 <clever> mrvn: u-boot doesnt even try to solve the problem you're working on, they just compile a new copy for each model, and the config.txt syntax can be told to load the right one

01:44 Brnocrist has joined #osdev

01:44 <mrvn> The way this currently works is that my peripherals have a placement new, where the argument is the base address of the peripheral. The constructor then initializes the periphery.

01:45 <mrvn> auto uart = new(dt->find_base_address("serial")) Uart(115200);

01:45 <clever> mrvn: but the pi has between 2 and 6? uarts, so you want something more loop based

01:45 <mrvn> clever: only one serial alias

01:46 <clever> serial0 and serial1 aliases last i looked

01:46 <mrvn> but yeah, the real code has some fallbacks looking for different things till it finds a serial.

01:46 <clever> and you could wind up needing either driver

01:46 <clever> serial0 is always the one on the gpio header

01:46 <clever> serial1 is always the one on the bt controller

01:46 <clever> but, which one is the PL011 and which is the mini-uart, varies, depending on config.txt entries

01:47 <clever> so you need to read the compatible, and spawn the right driver

01:47 <mrvn> ideally the raspbootin should listen on every single one because I can't know where I will connect on the next ARM board I will try this on.

01:48 <mrvn> I want something where I can buy a brand new ARM board, flash raspbootin on it as is and have it give me a prompt when I power it up no matter what.

01:48 <clever> in my most recent fun, i have been doing this:

01:48 <clever> [root@amd-nixos:/home/clever/apps/rpi/usbboot]# ./rpiboot -d /home/clever/apps/rpi/lk-overlay/build-bootcode-fast-vga/

01:48 <clever> [nix-shell:~/apps/rpi/lk-overlay]$ make PROJECT=bootcode-fast-vga && ls -lh build-bootcode-fast-vga/lk.bin

01:48 <clever> -rwxr-xr-x 1 clever users 26K Feb 18 21:22 build-bootcode-fast-vga/lk.bin

01:49 <clever> that produces a custom bootcode.bin file (a symlink renames it), and then ships it off to a pi0 in usb-device mode (like DFU on other devices)

01:49 <clever> so, i just plug the pi0 in (or trigger a reset), and it runs the latest build

01:49 <clever> no fussing with SD cards

01:49 <mrvn> nice

01:50 <clever> once i had it doing the desired job, i realized it was only using ~60kb of the binary

01:50 <clever> that leaves plenty of L2 cache to spare, so i just deleted the dram driver :P

01:50 <clever> and now its down to 26kb

01:50 <clever> libc and kernel are the biggest chunks of fat now

01:51 <clever> 80000c68 000007da T __adddf3

01:52 <clever> and checking the `vc4-elf-nm build-bootcode-fast-vga/lk.elf -S --size-sort | tail` report, i can see that floating point routines are a large cost

01:52 <clever> which i'm using to compute the pixel clocks

01:52 <clever> but, i have hw float, why am i using double?

01:53 <klange> i am a wee bit bigger than that

01:53 <clever> platform/bcm28xx/dpi/dpi.c:103:45: error: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Werror=double-promotion] desired_divider = (int)(desired_divider + 0.5);

01:54 <clever> hmmm, how do i tell gcc that 0.5 is a float, not a double?

01:54 <klange> put an f after it?

01:54 <clever> -rwxr-xr-x 1 clever users 23K Feb 18 21:54 build-bootcode-fast-vga/lk.bin

01:54 <clever> perfect, that shakes another 6kb off!
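The suffix fix can be sketched as follows. The function names are hypothetical; the point is only that an unsuffixed `0.5` is a `double` literal, so the `float` operand is promoted and the addition happens in double precision (via helpers such as `__adddf3` when the hardware only does single precision), while `0.5f` keeps the expression in `float`.

```c
#include <assert.h>

/* 0.5 is a double literal: x is promoted and the add is done in double. */
static int round_via_double(float x)
{
    return (int)(x + 0.5);
}

/* 0.5f is a float literal: the whole expression stays single precision. */
static int round_via_float(float x)
{
    return (int)(x + 0.5f);
}
```

Both versions round the same way for ordinary inputs; only the second avoids the double-precision support routines.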

01:56 <clever> __divdf3 is now the next victim

01:57 <clever> -rwxr-xr-x 1 clever users 21K Feb 18 21:57 build-bootcode-fast-vga/lk.bin

01:58 <klange> -rwxrwxr-x 1 klange klange 17M Feb 18 18:44 kernel8.img

01:59 <klange> you are rounding error to my payload :(

01:59 <clever> :D

01:59 <clever> i also just saved a div, without even removing code

02:00 <clever> `static const uint32_t xtal_freq = CRYSTAL;` this was an extern in a .h, and defined in a .c

02:00 <klange> (to be fair... there's a lot of JPGs and PNGs and shit in that 17M...)

02:00 <clever> but its a constant, so why not declare it as such? now `(double)xtal_freq/1000/1000` gets computed at compile time
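A sketch of that change, assuming a hypothetical 19.2 MHz value for `CRYSTAL` and a hypothetical helper `xtal_mhz`. With the definition visible in the header, an optimizing compiler can typically fold the division; with only an `extern` declaration it has to emit the load and the double-precision divide. Whether the folding actually happens depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stdint.h>

#define CRYSTAL 19200000u   /* hypothetical crystal frequency in Hz */

/* Before: `extern const uint32_t xtal_freq;` in the header, defined in a .c
 * file, so the value is unknown when other files are compiled.
 * After: the value is visible wherever the header is included. */
static const uint32_t xtal_freq = CRYSTAL;

/* With the constant visible, this can reduce to a literal at -O1 and above. */
static double xtal_mhz(void)
{
    return (double)xtal_freq / 1000 / 1000;
}
```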

02:01 <clever> i only support tga files currently

02:01 <clever> __udivdi3 is now coming up...

02:02 <klange> dividing integers is for weenies

02:02 <clever> i think thats 64bit division, cant really avoid that

02:02 <clever> i have a 32bit div opcode, but no 64bit div opcode

02:02 <klange> i don't even get out of bed unless the divisor is a power of two

02:02 <clever> my hw clock ticks in uSec, but LK wants msec

02:02 <clever> so i need to /1000 things

02:02 <klange> f (and not the float suffix one)

02:03 <clever> _printf_engine is my next biggest cost, this binary has zero way of emitting text

02:03 <clever> if only i had an easy way to disable printf globally...

02:04 <klange> #define printf(...) (void)0;

02:05 <klange> er, minus the semicolon I guess

02:05 <klange> or if you know all your printfs are statement function calls, just `#define printf(...)` and yolo it

02:05 <clever> :D

02:05 <mrvn> clever: do you have a printf that does "uint64_t / 10"?

02:06 <clever> mrvn: that is why __umoddi3 is in the binary, longlong_to_string.constprop.0

02:07 <clever> and why i'm trying to put printf on the chopping block :P

02:07 <gog> aaay clang caught a bug for me

02:07 <mrvn> clever: I replaced that with a few shifts and bit ops for a hand crafted / 10. No need to pull in libgcc for that.

02:07 <gog> using || when i needed &&

02:08 <klange> null dereference? foo || foo->bar?

02:08 <gog> nah overlapping comparisons

02:08 <klange> oh that's neat... I don't think gcc has anything for those

02:08 <gog> it apparently doesn't because that code's been there for about 30 commits back

02:09 <klange> i'm sure it figures them out as part of optimization, but haven't seen a warning or whatever and I know I've done it once or twice...

02:09 <clever> mrvn: https://github.com/littlekernel/lk/blob/master/lib/libc/printf.c#L103 would be the source

02:09 <bslsk05> github.com: lk/printf.c at master · littlekernel/lk · GitHub

02:09 <gog> i don't think it does i have all warnings turned on

02:10 <clever> mrvn: yeah, its the %10, not the /10 that does the __umoddi3

02:12 <clever> https://github.com/littlekernel/lk/blob/master/lib/debug/debug.c#L29

02:12 <bslsk05> github.com: lk/debug.c at master · littlekernel/lk · GitHub

02:12 <clever> panic also depends on printf!

02:12 <clever> and that module isnt optional

02:13 <clever> so yeah, i would have to neuter printf and vprintf globally, via some header

02:15 <clever> klange: in terms of my binary being a rounding error for you, i dont even need the ram to be online, lol, and its not even using 20% of the L2 cache

02:15 <clever> but its not quite enough to have a full framebuffer

02:16 <clever> i could maybe get a 320x320 palette based framebuffer

02:16 <clever> or RGB332

02:19 <mrvn> clever: can't find my code for uint64_t div10 but it's like this: https://stackoverflow.com/questions/5558492/divide-by-10-using-bit-shifts 2nd answer

02:19 <bslsk05> stackoverflow.com: math - Divide by 10 using bit shifts? - Stack Overflow

02:20 <clever> mrvn: ive also seen gcc doing similar with high-side mults and bit masking

02:20 <clever> its pretty weird

02:20 <clever> oh, what happens if i hit the LTO button?

02:20 <mrvn> clever: mult is something different and needs higher precision for the mult. x/10 == (x * (1/10 << 32)) >> 32

02:21 <mrvn> clever: sometimes you need other shifts than 32 but 32 is the best for 32bit cpus.

02:23 <clever> yeah

02:23 <mrvn> gcc and clang are pretty good at finding inverse constants so the multiply actually is true for all inputs but they don't always find the same constants.
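For 32-bit inputs, the reciprocal-multiply form can be sketched like this. The constant `0xCCCCCCCD` is ceil(2^35 / 10), so the shift is 35 rather than 32, which is one of the cases where a different shift is needed for the result to be exact for every input. `div10_mul` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* x / 10 for any 32-bit x: multiply by ceil(2^35 / 10) in 64 bits, then shift. */
static uint32_t div10_mul(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

This needs a 32x32 to 64-bit multiply, which is the "higher precision" requirement mentioned above.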

02:24 <mrvn> The bit shifts and adds (I had) are derived from multiplying with the inverse. You just do the mult with shifts and adds basically.

02:25 <mrvn> 1/10 has a nice binary pattern so that works out great.
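A sketch of a shift-and-add division by 10 for `uint64_t`, in the style of the linked Stack Overflow answer; this is not mrvn's actual code. It approximates n * 0.8 using the repeating binary pattern of 1/10, shifts down by 3, and then corrects the estimate from the remainder, so no 64-bit divide or modulo helper is needed. `div10_shift` and `div10_matches` are hypothetical names; the second only exists to compare against the built-in operators.

```c
#include <assert.h>
#include <stdint.h>

/* Returns n / 10 and stores n % 10 in *rem, using only shifts, adds and a
 * multiply by a small constant. */
static uint64_t div10_shift(uint64_t n, uint32_t *rem)
{
    uint64_t q = (n >> 1) + (n >> 2);   /* about n * 0.75 */
    q += q >> 4;                        /* each step doubles the number of  */
    q += q >> 8;                        /* correct bits of the 0.8 factor   */
    q += q >> 16;
    q += q >> 32;
    q >>= 3;                            /* about n / 10, never too large */

    uint64_t r = n - q * 10;            /* truncation can leave q slightly low */
    while (r >= 10) {                   /* fix up the estimate */
        q++;
        r -= 10;
    }
    *rem = (uint32_t)r;
    return q;
}

/* Checks the routine against the compiler's own / and % for one value. */
static int div10_matches(uint64_t n)
{
    uint32_t rem;
    uint64_t q = div10_shift(n, &rem);
    return q == n / 10 && rem == n % 10;
}
```

Returning the remainder alongside the quotient fits the digit-by-digit conversion in a printf implementation, where both are needed on every step.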

02:26 pretty_dumm_guy has quit [Quit: WeeChat 3.4]

02:27 pretty_dumm_guy has joined #osdev

02:29 heat has joined #osdev

02:29 pretty_dumm_guy has quit [Client Quit]

02:31 * clever re-reads https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html

02:31 <bslsk05> gcc.gnu.org: LTO Overview (GNU Compiler Collection (GCC) Internals)

02:37 orthoplex64 has joined #osdev

02:44 <heat> holy fucking shit I was looking at the wrong thing

02:44 <heat> it was a context switching bug

02:46 <clever> been there, got the t-shirt

02:48 <heat> i got no t-shirt

02:48 <heat> :(

02:48 <clever> in my case, i wrote a bit of critical-section code, that would disable irq's, then enable irq's

02:49 <clever> it got ran by the scheduler when irq's were off

02:49 <clever> so it turned irq back on, in the middle of the context switching routine

02:49 <clever> which then context switched in the middle of context switching
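The failure mode can be modelled in a few lines. This is a simulation with a plain variable standing in for the CPU's interrupt-enable flag, and all names are hypothetical; it is not clever's code or his fix. It contrasts a critical section that unconditionally re-enables interrupts with one that restores whatever state it found, which is one common way to make such sections safe to call with interrupts already off.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the CPU interrupt-enable flag (hypothetical model). */
static bool irq_enabled = true;
static int counter;

static bool irq_save_disable(void)
{
    bool was = irq_enabled;
    irq_enabled = false;
    return was;
}

static void irq_restore(bool was)
{
    irq_enabled = was;
}

/* Buggy shape: always enables on exit, even if the caller had IRQs off. */
static void bump_unconditional(void)
{
    irq_enabled = false;
    counter++;
    irq_enabled = true;
}

/* Nest-safe shape: leaves the flag exactly as it was on entry. */
static void bump_nest_safe(void)
{
    bool was = irq_save_disable();
    counter++;
    irq_restore(was);
}

/* Calls fn with IRQs already off, as a scheduler would, and reports the
 * flag state fn left behind. */
static bool irq_state_after(void (*fn)(void))
{
    irq_enabled = false;
    fn();
    bool seen = irq_enabled;
    irq_enabled = true;
    return seen;
}
```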

02:50 <heat> i was getting issues with corruption of *everything* in the stack

02:50 <heat> looked at the mmu a bunch, no dice

02:50 <heat> (although I did find a cute little bug)

02:50 <heat> it turns out the bug was that I forgot a user thread could be inside the kernel

02:51 <heat> so I was setting the scratch tp register wrong when switching back to user threads in kernel space

02:51 <clever> ah

02:51 <heat> and it wasn't faulting, but just wreaking havoc everywhere

02:52 <clever> for me, it faulted, but only if an irq happens at the wrong time in context switching

02:52 <clever> so it was fine until i tried to do irq heavy tasks

02:52 <heat> the havoc really reminded me of TLB problems, so I looked at that a bunch (since I wasn't doing TLB stuff, but I also wasn't really touching memory such that I would need to shootdown stuff)

02:53 <heat> I also tried porting a new allocator just to check

02:53 <heat> now I have a half-finished port of scudo

02:54 <heat> oh also this was somehow triggered by a malloc call in fpu code

02:54 <heat> malloc corrupts everything turned out to be a context switching bug

02:55 <heat> scudo might be a good allocator though

02:55 <heat> dunno about the perf though

02:56 <heat> i know it's supposed to be more secure, and that both android, fuchsia and trusty use it

02:56 <heat> I'm using the old musl malloc, which is just *not good*

02:59 ElectronApps has joined #osdev

03:01 masoudd has quit [Ping timeout: 272 seconds]

03:35 srjek_ has quit [Read error: Connection reset by peer]

03:36 <klange> mine's pretty shit, too, predates all the rest of my OS

03:37 <klange> one of my uni+apple friends helped me with it, smart guy

03:47 mahmutov has quit [Ping timeout: 240 seconds]

04:02 nyah has quit [Ping timeout: 272 seconds]

04:08 heat has quit [Read error: Connection reset by peer]

04:08 heat has joined #osdev

04:21 heat_ has joined #osdev

04:22 heat has quit [Read error: Connection reset by peer]

04:51 gog has quit [Ping timeout: 240 seconds]

05:04 bradd has joined #osdev

06:06 k8yun has joined #osdev

06:20 k8yun has quit [Quit: Leaving]

06:33 elastic_dog has quit [Ping timeout: 240 seconds]

06:36 elastic_dog has joined #osdev

06:38 eroux has joined #osdev

06:47 xenos1984 has quit [Remote host closed the connection]

06:47 xenos1984 has joined #osdev

06:55 fkrauthan has quit [Quit: ZNC - https://znc.in]

06:57 fkrauthan has joined #osdev

07:23 heat_ has quit [Ping timeout: 245 seconds]

07:34 mahmutov has joined #osdev

07:36 eroux has quit [Ping timeout: 272 seconds]

07:52 chronon has left #osdev [#osdev]

07:54 [itchyjunk] has quit [Read error: Connection reset by peer]

07:56 copier_ has joined #osdev

07:56 copier_ has left #osdev [#osdev]

08:05 lkurusa has joined #osdev

08:06 <j`ey> https://twitter.com/a13xp0p0v/status/1494774701176021001/photo/1

08:06 <bslsk05> twitter: <a13xp0p0v> Finished my new security research: ␤ ␤ Hacking Zircon microkernel of Fuchsia OS developed by @GoogleOSS . ␤ ␤ Sharing the screenshot of the PoC exploit demo! ␤ ␤ I'll publish a detailed write-up. https://pbs.twimg.com/media/FL6BIyDX0AElQob.jpg

08:10 lkurusa has quit [Client Quit]

08:18 ThinkT510 has quit [Quit: WeeChat 3.4]

08:21 ThinkT510 has joined #osdev

08:27 mahmutov has quit [Ping timeout: 256 seconds]

10:01 <klange> rpi crashed last night; started it up this morning, it ran for 12 hours fine, deployed a new kernel it froze a few minutes in... worried I've got a really unlikely deadlock...

10:02 <klange> gonna run the hvf vm in a similar config and see if I can get the same result where I can attach a debugger...

11:10 GeDaMo has joined #osdev

11:25 vin has quit [Remote host closed the connection]

12:18 <froggey> I have a magic debug button, press the special key combo and back traces for all threads gets dumped to the serial port

12:18 <froggey> Extremely useful for debugging deadlocks like that on real hardware

12:27 <mrvn> would be nice to run kernel code through some thread sanitizers.

12:32 kleinweby has quit [Quit: ZNC 1.6.6+deb1ubuntu0.2 - http://znc.in]

12:32 <mjg> linux has kcsan

12:33 <j`ey> now on arm64 too!

12:42 <mjg> oh?

12:43 <j`ey> https://github.com/torvalds/linux/commit/dd03762ab608e058c8f390ad9cf667e490089796

12:43 <bslsk05> github.com: arm64: Enable KCSAN · torvalds/linux@dd03762 · GitHub

12:44 <mjg> nice

13:00 pretty_dumm_guy has joined #osdev

13:05 <catern> is there a name for the observed phenomenon that the CPUs and memory have gotten much further away from disks in speed than they used to be?

13:07 <GeDaMo> Time dilation? :P

13:08 <mjg> i wonder, do you have machinery in your kernels to detect lock ordering issues?

13:08 <mjg> without having to run into actual deadlocks

13:16 gog has joined #osdev

13:24 <mrvn> catern: has it actually? With SSDs and M2.key disk speed has leaped. 2GB/s compares much better to memory/cpu speed than the 160MB/s of rotating disks.

13:24 <kazinsal> being an order of magnitude off instead of two orders of magnitude off is definitely an improvement, yeah

13:26 <mrvn> cpu speed also hasn't improved in the last years. You only got more cpus. (basically)

13:38 <catern> i mean since 1970

13:40 <mrvn> do you have data for all that time?

13:55 nyah has joined #osdev

14:19 orthoplex64 has quit [Ping timeout: 256 seconds]

14:20 orthoplex64 has joined #osdev

14:48 [itchyjunk] has joined #osdev

14:56 mahmutov has joined #osdev

15:16 X-Scale` has joined #osdev

15:17 X-Scale has quit [Ping timeout: 256 seconds]

15:17 X-Scale` is now known as X-Scale

15:18 masoudd has joined #osdev

15:28 ElectronApps has quit [Read error: Connection reset by peer]

15:38 X-Scale` has joined #osdev

15:39 X-Scale has quit [Ping timeout: 272 seconds]

15:39 X-Scale` is now known as X-Scale

16:06 troseman has joined #osdev

16:38 Bonstra has joined #osdev

16:43 xenos1984 has quit [Remote host closed the connection]

16:44 xenos1984 has joined #osdev

16:51 myon98 has quit [Ping timeout: 250 seconds]

17:15 dude12312414 has joined #osdev

17:16 simpl_e has quit [Remote host closed the connection]

17:54 zaquest has quit [Ping timeout: 252 seconds]

17:57 zaquest has joined #osdev

19:21 rustyy has quit [Quit: leaving]

19:35 MiningMarsh has quit [Ping timeout: 240 seconds]

19:57 rustyy has joined #osdev

20:11 dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]

20:12 wootehfoot has joined #osdev

20:21 [itchyjunk] has quit [Remote host closed the connection]

20:27 Bonstra has quit [Quit: Pouf c'est tout !]

20:45 GeDaMo has quit [Remote host closed the connection]

20:49 k8yun has joined #osdev

21:28 MiningMarsh has joined #osdev

21:33 Bonstra has joined #osdev

21:50 heat_ has joined #osdev

21:51 heat_ is now known as heat

21:52 <heat> updog

21:52 <gog> what's opdog

21:52 <heat> its all good

21:52 <heat> hehehehehe

21:52 <heat> hehehehe

21:52 <heat> hehe

21:52 <heat> he

21:53 <j`ey> at

21:53 <heat> hat

21:54 <heat> i feel compelled to start my arm64 port

22:00 heat has quit [Remote host closed the connection]

22:01 heat has joined #osdev

22:04 matrice64 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]

22:10 <heat> asm goto in GCC doesn't support output constraints

22:11 <heat> your outputs need to be passed as inputs, and if you change memory, you add memory as a clobber, forcing the compiler to reload everything from memory

22:11 <heat> yuck

22:11 <klange> 13 minutes in, instruction abort on a ret... 40 minutes in, instruction abort on a ret...

22:20 <mrvn> why do you want to asm goto?

22:20 <heat> get_user without doing it all in assembly

22:21 <mrvn> Why would that need any asm at all?

22:24 <heat> to be able to take a page fault and recover

22:25 <j`ey> I had a look at linux's 'extable' stuff for that recently, it's cool

22:26 <heat> yes

22:26 <heat> i do something like that

22:26 <j`ey> my coworker refactored the arm64 impl

22:26 <j`ey> to use less asm

22:26 <heat> define a label at the instruction that accesses user memory, .pushsection eh_table and fill out the struct, .popsection

22:26 <klange> if this turns out to be that I needed one instruction before this other instruction imma be mad... i hate these bugs where running a full-blast VM for an hour or more is what it takes to reproduce them...

22:28 <j`ey> heat: leads to funny code: https://github.com/torvalds/linux/blob/master/arch/arm64/mm/extable.c#L16

22:28 <bslsk05> github.com: linux/extable.c at master · torvalds/linux · GitHub

22:28 <j`ey> &ex->fixup + ex->fixup

22:28 <heat> it's PC relative?

22:29 <j`ey> yeah

22:29 <j`ey> well, relative to the fixup itself
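The `&ex->fixup + ex->fixup` idiom can be sketched with a hypothetical two-field entry in which each field stores the signed distance from the field itself to its target. The layout and names below are assumptions for illustration and are simpler than the real Linux structure. Two 32-bit offsets take 8 bytes where two 64-bit pointers would take 16, and the table needs no relocation when the image moves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical entry: both fields are offsets relative to their own address. */
struct ex_entry {
    int32_t insn;    /* distance to the instruction that may fault */
    int32_t fixup;   /* distance to the recovery code */
};

static uintptr_t rel_target(const int32_t *field)
{
    return (uintptr_t)((intptr_t)field + *field);
}

static void rel_set(int32_t *field, uintptr_t target)
{
    *field = (int32_t)((intptr_t)target - (intptr_t)field);
}

/* Returns the recovery address for a faulting pc, or 0 if none is registered. */
static uintptr_t find_fixup(const struct ex_entry *tbl, int n, uintptr_t pc)
{
    for (int i = 0; i < n; i++)
        if (rel_target(&tbl[i].insn) == pc)
            return rel_target(&tbl[i].fixup);
    return 0;
}

static unsigned char fake_text[64];   /* stand-in for kernel code */
static struct ex_entry table[1];

static int demo_extable(void)
{
    uintptr_t insn = (uintptr_t)&fake_text[8];
    uintptr_t fixup = (uintptr_t)&fake_text[40];
    rel_set(&table[0].insn, insn);
    rel_set(&table[0].fixup, fixup);
    return find_fixup(table, 1, insn) == fixup &&
           find_fixup(table, 1, insn + 4) == 0;
}
```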

22:31 <heat> that's funny

22:31 <heat> so you get a smaller exception table?

22:32 <j`ey> I assume that's part of it yet

22:32 <j`ey> *yeah

22:34 <mrvn> heat: why don't you check the memory region against your memory information for the task? You don't want the user process accessing bits of memory where only the kernel has access.

22:35 <heat> i check for the limit address

22:36 <klange> 30 minutes in, nothing yet...

22:36 <klange> it is absolutely pouring rain outside, tho

22:36 <heat> but checking in the vm tree if it's mapped (and the perms) doesn't work and is slow

22:36 <heat> way faster to optimise for the common case (access goes through) and works with threads

22:37 <mrvn> heat: you don't have a range based VM structure?

22:37 <heat> yes

22:37 <heat> lock_address_space(); do_access(uptr); unlock_address_space(); is slow

22:37 <heat> it also runs into horrible issues when you take a page fault

22:38 <heat> the pf handler needs to somehow know you were holding the lock and doing a user access, as to avoid running into a deadlock

22:38 <mrvn> you have to check the range, not just the pointer. And keep the lock while copying the memory.

22:39 <heat> as I was saying, that is slow

22:39 <mrvn> read lock, so the PF handler can read lock too

22:39 <mrvn> From my point of view the memcpy is the bigger problem, takes longer.

22:39 <heat> the atomic operation for a read lock is still a hell of a lot slower than just doing a cute load

22:41 <mrvn> So your solution is to just copy blindly and hope the MMU pagefaults when the user tries something bad?

22:41 <heat> yes my kernel expects the MMU to work

22:41 <mrvn> except you are in kernel mode, so you have access to kernel only mapped memory.

22:41 <heat> except I literally told you I check for the user address limit

22:42 <mrvn> not sure what is supposed to tell me

22:42 <mrvn> +that

22:43 <heat> if (uptr < USER_ADDRESS_LIMIT) return -EFAULT;

22:43 <heat> er, >
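A sketch of a limit check that covers the whole range rather than a single pointer, written so the addition cannot overflow. The limit value and the names are hypothetical, and here `USER_ADDRESS_LIMIT` is taken to mean the first address that is not user space, which may differ from heat's definition. A check like this only keeps the access out of kernel addresses; unmapped or protected user pages are still left to the page-fault recovery path.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical: first address that does NOT belong to user space. */
#define USER_ADDRESS_LIMIT 0x0000800000000000ull

/* Returns 0 if [uptr, uptr + len) lies entirely below the limit. */
static int check_user_range(uint64_t uptr, uint64_t len)
{
    if (uptr >= USER_ADDRESS_LIMIT)
        return -EFAULT;
    if (len > USER_ADDRESS_LIMIT - uptr)   /* cannot wrap, uptr < limit */
        return -EFAULT;
    return 0;
}
```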

22:44 <j`ey> heat: you know aarch64 has some 'unprivileged' instructions for this

22:44 <j`ey> LDTR etc

22:45 <heat> cool

22:45 <klange> https://developer.arm.com/documentation/ddi0595/2021-06/AArch64-Instructions/AT-S1E0R--Address-Translate-Stage-1-EL0-Read

22:45 <bslsk05> developer.arm.com: Documentation – Arm Developer

22:45 <heat> riscv and x86 have turn on, turn off user access instructions

22:46 <mrvn> heat: this is all so much simpler if you don't have threads

22:46 <klange> 42 minutes, doing fine...

22:47 <heat> mrvn, nah this is pretty simple

22:47 <heat> https://github.com/heatd/Onyx/blob/master/kernel/arch/riscv64/usercopy.cpp#L117

22:47 <bslsk05> github.com: Onyx/usercopy.cpp at master · heatd/Onyx · GitHub

22:48 <mrvn> klange: "Absence of evidence is not evidence of absence."

22:49 <heat> then I have a piece of code that gets a recovery PC if I take an exception and it exists

22:49 <klange> please, my blood pressure is already over 150, I don't need reminders that this situation is f***ed

22:49 <heat> it's like C++ exceptions but way simpler

22:49 <mrvn> klange: you have a special get 64bit from user?

22:49 <mrvn> heat^^

22:49 <heat> yes

22:50 <mrvn> heat: would have thought the compiler optimizes the generic path to that if the size is 8.

22:51 <heat> the compiler can't see into assembly, and it doesn't technically work for x86

22:51 <heat> copy_from/to_user use rep movsb, and those have weird caching properties

22:52 <mrvn> heat: as discussed yesterday that's slower for <256 byte.

22:52 <heat> I don't have SIMD

22:52 <heat> this is the kernel

22:53 <mrvn> hence 256 and not 8k or what it was for simd

22:53 <heat> no it's 256 for SIMD

22:54 <mrvn> must be remembering it wrong then, it was 4 in the morning the other day.

22:54 <heat> https://i.imgur.com/KakoZvw.png

22:55 <mrvn> heat: where is the comparison between repl and just movs?

22:56 <heat> it's not

22:56 <heat> but "Using ERMSB always delivers better performance than using REP MOVSD+B"

23:02 <mrvn> I think I will try this: define a fixed size byte array in a struct of the right size, cast the pointers, assign it.

23:02 <mrvn> Let the compiler insert the best code to copy a block of memory of known size.

23:02 <mrvn> s/byte/alignment appropriate type/
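The idea can be sketched as below, assuming a 64-byte block and suitably aligned pointers; `struct block64`, `copy_block64` and `demo_copy` are hypothetical names. Assigning a struct of known size lets the compiler choose the copy sequence itself. In real code the casts are subject to alignment and aliasing rules, and a user-space access would still need the fault recovery discussed earlier.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size block made of an alignment-appropriate element type. */
struct block64 {
    uint64_t w[8];   /* 64 bytes */
};

/* Copy one block by struct assignment; the compiler picks the instructions. */
static inline void copy_block64(void *dst, const void *src)
{
    *(struct block64 *)dst = *(const struct block64 *)src;
}

static int demo_copy(void)
{
    struct block64 src, dst;
    for (int i = 0; i < 8; i++)
        src.w[i] = (uint64_t)i * 3 + 1;
    memset(&dst, 0, sizeof dst);
    copy_block64(&dst, &src);
    return memcmp(&dst, &src, sizeof dst) == 0;
}
```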

23:13 <klange> 1h10m on this vm running a workload with a lot of yield calls and the suspected fix in place, so far so good, also spinning up the rpi again

23:14 <mrvn> klange: does your fix include a check for the condition that was a bug and log "Fix triggered"?

23:14 <klange> the fix would happen thousands of times a second, so no