klange changed the topic of #osdev to: Operating System Development || Don't ask to ask---just ask! || For 3+ LoC, use a pastebin (for example https://gist.github.com/) || Stats + Old logs: http://osdev-logs.qzx.com New Logs: https://libera.irclog.whitequark.org/osdev || Visit https://wiki.osdev.org and https://forum.osdev.org || Books: https://wiki.osdev.org/Books
<doug16k> no it doesn't
<heat> yes it does
<doug16k> no it doesn't
<heat> yes it does
<doug16k> no it doesn't
<heat> y e s
<geist> my experience is it generally does, but only in that it causes there to be possibly extraneous destructor paths
<geist> accessing/etc is usually pretty free, but now it exposes anyone that has it to the 'omg you might be the last'
<bslsk05> ​'CppCon 2019: Chandler Carruth “There Are No Zero-cost Abstractions”' by CppCon (00:59:52)
<geist> well, shared_ptr does that a lot
<doug16k> heat, when's the last time you profiled something?
<geist> honestly that's sort of the biggest cause of bloat i've seen in shared_ptr environments. everyone has to destruct everything
<heat> doug16k: dunno
<heat> why?
<doug16k> just wondering how you are so sure. I profile stuff all the time
<heat> (go to the 17:00 for the relevant part)
<heat> I look at disassembly sometimes
<doug16k> I also disassemble the code and check
<heat> and I watched this talk
<heat> it's an ABI issue, not a compiler issue
<geist> geezus would you guys quit trying to out profile each other?
<doug16k> I watched my years of experience
<doug16k> unique_ptr is zero cost abstraction
<geist> aaaand the thing at 19 is precisely what i was talking about
<geist> it has to add all these delete paths
<doug16k> geist, and you wouldn't have added them by hand? just leak
<heat> there's no destructive move
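A minimal sketch of the point heat is making: C++ has no destructive move, so a moved-from unique_ptr is left in a null state and its destructor still runs at end of scope, which is the extra destructor path geist mentions. The helper name below is made up for illustration.

```cpp
#include <memory>

// Sketch: because C++ has no destructive move, moving out of a unique_ptr
// does not end its lifetime. The moved-from object is merely set to null,
// and its destructor (a null check, at minimum) still executes when it
// goes out of scope -- the "extraneous destructor path".
bool moved_from_is_null() {
    auto p = std::make_unique<int>(42);
    std::unique_ptr<int> q = std::move(p);  // ownership transferred to q
    // p was not destroyed by the move; its destructor will still run here.
    return p == nullptr && *q == 42;
}
```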
<geist> doug16k: in cases where i know it wont go out of scope, yes
<doug16k> so your solution is to make it shit C code
<doug16k> where you hand leak it
<doug16k> because you think you are so smart
<doug16k> this is why there are 100k security holes in OSes
<geist> geezus dude
<geist> i'm steps outta this one
<doug16k> I mean people, not you personally
<geist> sounded pretty personal to me
<doug16k> I realized that
<doug16k> sorry
immibis_ has quit [Ping timeout: 255 seconds]
<geist> anyway o
<geist> i'm on vacation! and learning rust
<geist> and rust makes all of this moot because you can never ever ever write bad code in rust
<geist> because it's perfect
<geist> and all your C++ people need to get with the program!
* geist hands you a rust bible
<heat> hexchat is on 100% CPU usage
<heat> rust rewrite?
heat has quit [Quit: Leaving]
<geist> yes there are perfect programs and programs that just haven't been rewritten in rust yet
heat has joined #osdev
<geist> (sarcasm in case folks didn't pick up, but so far i'm actually kinda pleased with the language)
<heat> yes it's nice but I'm like on the 2nd or 3rd year of "trying to learn rust and giving up in the process"
<heat> i somehow can't get myself to learn new programming languages
<geist> yah i'm being more or less forced to deal with it at work, which is probably not a bad way to go about it
<geist> grumble, but then i have to stick with it so i can See The Light
<doug16k> yeah, being forced by clients is the best way to learn new languages :P
<Mutabah> geist: If you need help, ##rust is pretty active
<geist> word.
<doug16k> no way in hell I'd touch React or AWS with a ten foot pole, unless clients dragged me there, kicking and screaming
<doug16k> forced to learn Swift same way
<doug16k> I thought simplistic apache shared hosting was bad, until I saw AWS
<doug16k> apache shared host is super high performance in comparison
<doug16k> if you asked me to come up with a webserver as slow as AWS, I'm not sure I could without deliberate sleeps/stalls
<doug16k> maybe if I kept the server off and powered it up with libvirt and restored snapshot
<doug16k> ...between requests
<doug16k> if you told me that webservers would be resuming from sleep in 2005, I'd have said you have a very vivid imagination
<kingoffrance> i would have said dont give them ideas :)
<doug16k> the comedy begins when your lambda server is off, and your db server is off, and the server hosting your secure credential storage is off, you stall starting the lambda, stall getting credentials, stall turning on sql server, stall waiting for query. could be 8 seconds to do a request
* kingoffrance sees ghost of inetd
silverwhitefish has joined #osdev
<heat> the osdev wiki is quoted by kernel.org docs :0
<bslsk05> ​www.kernel.org: 1. About this Book — The Linux Kernel documentation
<heat> s/quoted/linked
iorem has joined #osdev
ElectronApps has joined #osdev
dutch has quit [Quit: WeeChat 3.0.1]
<geist> oh neat
<klysm> is the HLT on the trampoline really such a bad idea?
<geist> what do you mean?
<klysm> and if so, how long do the delays get?
<klysm> I've heard some advocacy in here for removing HLT from the trampoline.
<klysm> so if all the threads yield, and there is no interrupt, the cpu isn't doing nothing
<heat> what trampoline?
<klysm> is the trampoline a misgiven idea then?
<heat> what trampoline
<klysm> the scheduler's piece where it falls through to when there is nothing to do
<heat> you mean the idle loop?
<heat> ok you mean the idle loop
<klysm> probably, yes.
<heat> removing HLT is a bad idea
<heat> the point of idling is well, to idle
<heat> busy looping just wastes power
<klysm> doug16k was saying he had removed HLT, I think.
<heat> you get smaller latencies but you use a lot more power
<heat> klysm: dunno what you're talking about but note using monitor/mwait != removing hlt
<klysm> yes, and smaller latencies was what he says is desirable
<heat> *shrug*
<heat> smaller latencies is nice but so is low power usage
dutch has joined #osdev
mekeor has quit [Quit: ERC (IRC client for Emacs 27.2)]
<moon-child> if you just busy loop there your cpu is gonna burn up
<doug16k> I didn't remove hlt
<klysm> oh thx
<doug16k> must be someone else
<doug16k> mwait would be the other way on x86
<doug16k> wait, what trampoline
<klysm> in linux I think that file of code is called trampoline.S, not completely sure though
<pony> why the capital s
<klysm> assembly
<doug16k> pony, makes it preprocessed
<pony> but
<heat> what doug said
<pony> why not small s
<doug16k> #include #define #if works
<pony> oh, ok
<doug16k> macros, etc
<doug16k> it passes it through cpp
<doug16k> normally, C compile would mean preprocess the c, cc1 compile that to assembly, assemble that
<doug16k> S file skips the cc1 step, it's already assembly
<doug16k> s file skips both, assembles original
<bslsk05> ​github.com: linux/trampoline_64.S at master · torvalds/linux · GitHub
<heat> klysm: that's a panic
<doug16k> ya that is just a deliberate hang
<heat> but since it can't panic that early, it hangs
<doug16k> the jmp is just in case of NMI
<doug16k> not needed for SMM, if SMM preempts hlt, it crawls back into the halt on return from smm
<doug16k> they can't have it break every program not thoughtful enough to jmp after hlt
<heat> when do NMIs fire?
<heat> if not for the local APIC IPIs that is
<doug16k> could come from anything
<doug16k> every MSI(x) PCI device can send an NMI
<doug16k> all the LINT inputs can be mapped to generate an NMI
<doug16k> other cpus can send you an NMI
<doug16k> can send yourself one
<heat> huh
<heat> didn't know MSI could do that
<doug16k> it's the delivery type
<doug16k> it's encoded into the address
<doug16k> on x86 it can
<doug16k> it's not universal
<doug16k> msi just uses whatever address you say
<doug16k> on x86, you can deliver different types of interrupts
<doug16k> correction, delivery type is in the data value
isaacwoods has quit [Quit: WeeChat 3.2]
<doug16k> 10.11.2 of vol3 sdm
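A sketch of the MSI encoding being discussed, per the SDM section doug16k cites: the destination lives in the message address, the delivery mode in the message data (bits 10:8, where 0b100 is NMI). The constant and function names are made up for illustration.

```cpp
#include <cstdint>

// x86 MSI encoding sketch (Intel SDM vol 3, 10.11): the address selects
// the target LAPIC, the data value carries delivery mode and vector.
constexpr uint32_t MSI_ADDR_BASE = 0xFEE00000u;

constexpr uint32_t msi_address(uint8_t dest_apic_id) {
    return MSI_ADDR_BASE | (uint32_t(dest_apic_id) << 12);  // bits 19:12 = destination
}

constexpr uint32_t msi_data_nmi() {
    return 0b100u << 8;  // delivery mode bits 10:8 = NMI; vector is ignored for NMI
}

constexpr uint32_t msi_data_fixed(uint8_t vector) {
    return vector;       // delivery mode 0b000 = fixed, vector in bits 7:0
}
```

This is why "every MSI(x) PCI device can send an NMI": the device just writes whatever data value the OS programmed, and nothing stops that value from encoding NMI delivery.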
<doug16k> lol, you can send a sipi with msi if you want
<heat> inb4 you can wake up smp cores with a PCI device
freakazoid333 has joined #osdev
<doug16k> the probability of getting NMI isn't even close to zero, at least early on. after you have configured every PCI and LAPIC, then you can say NMI is not going to occur
<klysm> no, the trampoline is not what I thought it was. I meant the idle loop. the trampoline was used for booting other CPU cores in SMP configurations.
tacco has quit []
<geist> yah for the idle loop you want to hlt or mwait
<geist> otherwise you'll just burn power
<geist> and on a modern machine it would probably slow the system down. even if the core was inside cache entirely, it'd bring up the thermal floor
<geist> and the cpu would probably throttle itself
<doug16k> what do you do if ACPI tells you to use NMI?
gog has quit [Ping timeout: 272 seconds]
<heat> great question
<heat> I don't do anything right now
<doug16k> halt is good because it can save up power budget, which it can use up when it does have something useful to do, by boosting the cpu to a high clock
<doug16k> same idea with temperature. it can save up coolness so it can boost longer, if thermally limited, when you hlt
<doug16k> hlt helps battery life drastically
<heat> is mwait slower?
<doug16k> it would be madness to spin on battery
<heat> if you go for the same power state as hlt
<doug16k> you can control how slow it is allowed to be
<geist> in practice it's all very complicated
<geist> sometimes hlt is better, sometimes mwait is better, sometimes both
<doug16k> you can make it not even evict anything and stay fully awake, or go all the way to writeback invalidate powerdown cache
<geist> but by the time you're worrying about that you're already into microoptimize territory
<geist> there's how it should work, and then there's how Real Implementations deal with it
<doug16k> you tell the hardware what cstate to go down to and mwait does that
<geist> i only say that because we've been fiddling with it on fuchsia, and it turns out there are plenty of cases where mwait isn't the right implementation, etc
<geist> even though it's supported, etc
<geist> for example we found experimentally that Zen version 1 machines save much more power (10W at the wall) when idling using hlt instead of mwait
<geist> for Reasons
<doug16k> yeah I noticed that long ago from perf top
<heat> step 1: use hlt
<heat> step 2: find out mwait exists
<heat> step 3: try out mwait
<heat> step 4: cry
<geist> also various virtual machines do or don't like mwait, but that's usually obvious because they'll just mask it off in cpuid
<doug16k> if you force linux to use hlt it doesn't even show up in perf top. with mwait, you get huge percent screwing with cstate or something
<geist> well it's like lots of things in x86. there's levels of implementation, start with the basics and then add more advanced (newer, partially supported, etc) stuff
<doug16k> kernel parameter is idle=halt
<doug16k> there is also, oddly, idle=nomwait
<bslsk05> ​github.com: linux/process.c at a931dd33d370896a683236bba67c0d6f3d01144d · torvalds/linux · GitHub
<bslsk05> ​github.com: linux/irqflags.h at 64a925c9271ec50714b9cea6a9980421ca65f835 · torvalds/linux · GitHub
<heat> doug16k: does amd use intel_idle?
<doug16k> no
<heat> klysm, that's not it either
<klysm> oh, it's stop_this_cpu()
<klysm> :)
<doug16k> heat, not sure I understood the question. you mean module?
<doug16k> or do you mean kernel function
<heat> doug16k: i was asking if amd uses the intel_idle idling infrastructure (see drivers/idle/intel_idle.c)
<heat> a quick look at the source says no
<doug16k> on amd you would use schedutil and that controls the throttle
<doug16k> makes throttle a pushbutton like intel. where it's max turbo most of the time
<heat> what's the cpuidle driver?
<doug16k> schedutil
<doug16k> default is probably ondemand
<doug16k> schedutil is default sometimes I think
<doug16k> it lets the scheduler directly control the throttle
<heat> that's a governor no?
<doug16k> oh I see what you mean
<doug16k> same thing, same instructions
<heat> sure but surely the behaviour is different
<doug16k> why?
<heat> amd should know how their cpus work pretty well
<heat> like if hlt saves 10W they go for hlt
<doug16k> I'd expect amd's implementation to save all the power no matter which way
<bslsk05> ​github.com: linux/irqflags.h at 5bfc75d92efd494db37f5c4c173d3639d4772966 · torvalds/linux · GitHub
<doug16k> iirc amd can have twice the cores for same power as intel
<heat> klysm: that's in tools/
<heat> you can't hlt in user-space
<doug16k> funny one would even say amd won't run circles around intel's halt/mwait/whatever
<bslsk05> ​elixir.bootlin.com: irqflags.h - arch/x86/include/asm/irqflags.h - Linux source code (v5.13.1) - Bootlin
<doug16k> if you mean building little low-core-count processors, yes, intel is good at that
<doug16k> they will run well for hideous clock and small cache and won't turbo to hell
<doug16k> amd would have trouble making a little zen that beats it
<doug16k> zen is good for being still fast even though lot of cores
<klysm> heat, just that I'm not finding anything good that calls that function
<heat> klysm: cpu idling under linux is complex
<heat> you're not going to find a simple void do_idle() { while(true) asm volatile("hlt"); }
<heat> if you're on an intel cpu, this is how linux knows how to idle: https://elixir.bootlin.com/linux/latest/source/drivers/idle/intel_idle.c
<bslsk05> ​elixir.bootlin.com: intel_idle.c - drivers/idle/intel_idle.c - Linux source code (v5.13.1) - Bootlin
<doug16k> all you need is an idle thread that does while (1) halt() it's not even a waste, that is the stack that hosts the irq handler if idling
Izem has quit [Quit: Izem]
<heat> (it doesn't hlt, but rather prefers mwait)
Izem has joined #osdev
<heat> ^^for all purposes hobby osdev, you can do that
<heat> it's already great
<doug16k> most irqs will be handled in the idle thread
<doug16k> it's not even close to a waste to have a stack for idle thread
<doug16k> most computers idle far more than doing anything else
<doug16k> you always imagine all these pegged cpus everywhere when designing a scheduler, when really, it's an unreasonably large number of threads that almost entirely block
<doug16k> pegged is the unlikely case
<heat> (unless hexchat is stupid and hogs a core)
<doug16k> ya and with 32 cpus, pegged core shows up as 3% cpu usage
<doug16k> I wonder how much that screws up the scheduler
<doug16k> it shouldn't be 1600*2 percent. it should be 1600*1.25 percent total
<doug16k> because SMT thread is at best 25% more throughput
<doug16k> not 100
<doug16k> I have effectively 20 cpus of throughput. 16*1.25. not 16*2
<doug16k> shouldn't the scheduler treat it that way?
<doug16k> actually, what I mean is, shouldn't "top" treat it that way
ElectronApps has quit [Read error: Connection reset by peer]
<doug16k> would affect accounting too though
<doug16k> both threads are approx 62% throughput when both running, but accounting treats it as 200%
<doug16k> as 100% each
ElectronApps has joined #osdev
<doug16k> getting ripped off. paying 1s of timeslice when you got 620ms
<doug16k> does it account for that ripoff?
<doug16k> mine doesn't
<doug16k> it should discount the timeslice consumption by about 38%, prorated by what portion of time it overlapped the other thread being active
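The arithmetic in the last few lines, written out as a sketch (assuming the ~25% SMT uplift figure doug16k is using):

```cpp
// If a second SMT thread adds ~25% throughput, two co-running siblings each
// get ~62.5% of a core, not 100% -- so 16 cores / 32 threads is ~20 CPUs of
// throughput, and a full timeslice while overlapped is worth ~620ms of work
// per second, a ~38% discount.
constexpr double smt_uplift     = 1.25;             // core + sibling vs core alone
constexpr double per_thread     = smt_uplift / 2.0; // 0.625 when both siblings run
constexpr double effective_cpus = 16 * smt_uplift;  // 20.0, not 32
constexpr double discount       = 1.0 - per_thread; // ~0.375, the "ripoff"
```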
<doug16k> interference has been conveniently ignored in most schedulers
<doug16k> if the other thread is killing perf for this victim, tough
<doug16k> even though today we can tell when it happens
<doug16k> usually you have enough measurement capability to know for sure how much work this thread got done, precisely
<doug16k> ops retired, for instance
<doug16k> measure ops vs insns ratio to estimate this program and use ops retired per unit of time to detect interference
sts-q has quit [Ping timeout: 272 seconds]
<doug16k> maybe even slightly starve the thread that hammers everyone else
<doug16k> could even detect how much it hammers cache
<doug16k> hammering could make your timeslices penalized
<doug16k> there is nothing like that at all, is there?
<doug16k> I read an article that talked about using it to detect malicious stuff, like hammering clflush in rowhammer or something
sts-q has joined #osdev
__sen has joined #osdev
<doug16k> is there an advantage to blocking in linux? do you get bonus time, or is that just lost opportunity and you might as well have pulled all your data into the cache at 100% cpu?
<heat> blocking doesn't waste cpu time
<doug16k> so a selfish program has no reason not to spin on all cores to get minimum workitem startup latency
<heat> if the scheduler doesn't punish you yeah
<doug16k> in my kernel, the more nice the thread is, the more the kernel bends over backward to give it instant wakeup latency
<doug16k> if you were burning up cpu, you wouldn't be any more important than the currently running thread
<doug16k> it's likely I mean
<doug16k> that causes threads that wait for input queues to wake up with realtime latency
<doug16k> so gui threads are instant response, even if all cpus at 100%
<doug16k> gui threads use practically zero cpu
<kingoffrance> "I have effectively 20 cpus of throughput. 16*1.25. not 16*2" makes total sense, i suspect maybe some expensive real time system does distinguish, while most os do not
<kingoffrance> seemingly gets more skewed the more "almost cpus" you have
<doug16k> yeah, it's off by 12 cpus by the time you are up to 16C 32T
<doug16k> there are probably ultra corner cases where you get almost 200%, but realistically it's in the 20's
<kingoffrance> and not to drift off, but rather: im certain other types of "accounting" have this same issue too. probably can find similar in a few fields
<kingoffrance> if you want accuracy, better distinguish and not just lump disparate things as "the same" when they are not
<doug16k> yeah. cache I briefly mentioned. you can screw up the execution pipeline or screw up cache content
<doug16k> or both
<doug16k> cache one goes beyond SMT, can happen even on single core
<doug16k> add zero to the first byte of as many lines as you can in an infinite loop, evict everything
<doug16k> cpu will pray the write allocate was worth it, who'd be dumb enough to touch one byte per line? :P
<doug16k> that's basically what gcc looks like it is doing when you run concurrently with it :D
<doug16k> except each byte it touches is super computer scienced and it is actually being fast
<doug16k> I am amused by how cpu power use and heat goes way up when doing real work in good algorithm, and stays quite a bit cooler with naive algorithm
<doug16k> the universe won't let you get away with doing that much more work without more power :D
Retr0id9 has joined #osdev
freakazoid333 has quit [Read error: Connection reset by peer]
Retr0id has quit [Ping timeout: 252 seconds]
Retr0id9 is now known as Retr0id
<doug16k> is crc32 any good as a hash table algorithm?
<doug16k> for string hashing
<geist> good question!
<doug16k> it's almost free on good x86 cpu
<geist> yah same on arm
<geist> qustion is does it distribute well across 32bit space
<geist> its not really required to
<doug16k> is the field prime?
<doug16k> if so then the low bits will be well randomized
<geist> well, someone asked the same question on stack overflow and gave a what looks on the surface pretty good answer
<doug16k> can you believe this is java's string.hashCode: while (i < len) hash += 31 * hash + charcode(str[i])
<doug16k> er, not +=, just =
<doug16k> 31 is prime, so it makes the low bits noisy
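The Java String.hashCode being quoted (after the += correction), as a sketch in uint32_t arithmetic, which wraps the same way Java's 32-bit int does. The known Java value for "Hello" is 69609650.

```cpp
#include <cstdint>
#include <string_view>

// Java's String.hashCode: h = 31*h + c over each char.
// 31 is prime, so the multiply keeps the low bits noisy.
uint32_t java_hash(std::string_view s) {
    uint32_t h = 0;
    for (unsigned char c : s)
        h = 31 * h + c;
    return h;
}
```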
<kazinsal> CRC32's prime appears to be 79764919
<geist> note that x86 does crc32c
<geist> but it should probably be similar to better
<geist> just a different polynomial
<kazinsal> s/prime/polynomial/
<geist> side note, ben eater did a pretty nice explanation of CRC
<geist> i'd read it before but he walks you through it in fairly good detail
<bslsk05> ​'How do CRCs work?' by Ben Eater (00:47:30)
<doug16k> yes, even if you already know something, if you watch his stuff he gives you a really nice way to think about it
<geist> right
<doug16k> jenkins one-at-a-time hash is soooo good for string keys
<doug16k> I get O(1) lookup for 99% of keys
<doug16k> and almost all the rest in 2 steps
<doug16k> java hashcode is around 92% O(1)
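For comparison, the Jenkins one-at-a-time hash doug16k is praising: more work per byte than the Java hash, but with a final avalanche pass so the low bits (which become the table index) are well mixed.

```cpp
#include <cstdint>
#include <string_view>

// Jenkins one-at-a-time: per-byte add/shift/xor mixing, then a final
// avalanche so every input bit affects the low index bits.
uint32_t jenkins_oaat(std::string_view s) {
    uint32_t h = 0;
    for (unsigned char c : s) {
        h += c;
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}
```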
<heat> ben eater is the best
<geist> anyway someone on stack overflow does a comparison with the jenkins hash
<geist> mark adler actually chimes in
<doug16k> yes, but people get all caught up in how quickly it can compute a hash, and forget how poorly it might make their scan for match
<doug16k> s/ly//
<doug16k> java hashcode destroys jenkins for computing the hash. and jenkins destroys java for instantly finding the right one in the table
<doug16k> so if you lookup far more than you compute hash (because you can often have the string literal hash at compile time with constexpr) then the best distributed hash overwhelms speed of computing hash
<doug16k> but you can't go too far because you do compute some hashes
<doug16k> I was considering throwing crc32 instruction at it and seeing if that is any good
<heat> i've never looked into collisions
<doug16k> I guess with a hash table, you have to TIAS every time
<heat> i usually just use fnv and pray for the best
<doug16k> the "best" hash table might be hash = str[0]; and that's it
<doug16k> assuming they always start with a different letter and you hash a bazillion new strings
<doug16k> er, best hash function
<heat> the best hash table is a linked list
<doug16k> no way
<doug16k> linear probe ftw
<doug16k> excellent locality
<doug16k> have a good hash function and keep load at or below 0.5
<doug16k> cpu loves it
<doug16k> it doesn't have to stall to find next one to look at
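A minimal linear-probe table along the lines being described: power-of-two size, index = hash & (size-1), grow when load would exceed 0.5. The struct name, the stand-in mixing hash, and the choice of 0 as the empty sentinel are all assumptions for the sketch.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Open-addressing table sketch: keys are nonzero uint32_t, 0 means empty.
// Load factor is kept at or below 0.5, so probe runs stay short and every
// scan terminates at an empty slot.
struct ProbeTable {
    std::vector<uint32_t> slots = std::vector<uint32_t>(8, 0);
    size_t used = 0;

    static uint32_t mix(uint32_t x) {            // cheap stand-in hash
        x ^= x >> 16; x *= 0x45d9f3bu; x ^= x >> 16; return x;
    }
    void insert(uint32_t key) {
        if (2 * (used + 1) > slots.size()) grow();
        place(slots, key);
        ++used;
    }
    bool contains(uint32_t key) const {
        size_t mask = slots.size() - 1;
        for (size_t i = mix(key) & mask; slots[i] != 0; i = (i + 1) & mask)
            if (slots[i] == key) return true;    // linear scan: great locality
        return false;
    }
private:
    static void place(std::vector<uint32_t>& t, uint32_t key) {
        size_t mask = t.size() - 1;              // table size is a power of two
        size_t i = mix(key) & mask;
        while (t[i] != 0) i = (i + 1) & mask;
        t[i] = key;
    }
    void grow() {                                // doubling => amortized O(1) insert
        std::vector<uint32_t> bigger(slots.size() * 2, 0);
        for (uint32_t k : slots)
            if (k != 0) place(bigger, k);
        slots.swap(bigger);
    }
};
```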
<kazinsal> I'm surprised the GCC preprocessor doesn't have some kind of compile time hash function
<doug16k> kazinsal, it does in C++, very carefully, with char const (&str)[N] template
<doug16k> constexpr
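A sketch of the compile-time literal-hash trick being described: the char const (&)[N] template binds to string literals (N includes the terminating NUL), so the hash of "this" or "that" folds to a constant. FNV-1a is used here as a stand-in; whatever hash endless-sky actually uses isn't shown in the discussion.

```cpp
#include <cstddef>
#include <cstdint>

// Compile-time FNV-1a over a string literal. The reference-to-array
// parameter only binds to arrays of known size, i.e. literals, and the
// whole call folds away under constexpr evaluation.
template <size_t N>
constexpr uint32_t literal_hash(const char (&str)[N]) {
    uint32_t h = 2166136261u;                    // FNV offset basis
    for (size_t i = 0; i + 1 < N; ++i) {         // stop before the trailing '\0'
        h ^= static_cast<unsigned char>(str[i]);
        h *= 16777619u;                          // FNV prime
    }
    return h;
}

static_assert(literal_hash("this") != literal_hash("that"),
              "evaluated entirely at compile time");
```

So code comparing against "this" or "that" can compare precomputed hashes first instead of calling memcmp.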
<doug16k> I got endless-sky drastically faster with that
<doug16k> 1500 ships is about 60% cpu
<doug16k> choppy before
<kazinsal> nice
<doug16k> before the game was pretty much all strcmp calls
<doug16k> looking for keys in map<string
<doug16k> memcmp actually
<doug16k> now it is DictKey key that knows hash and O(1) 99% of the time straight to the right one
<doug16k> and when hash collision it didn't even compare strings, compares hash 1st
<doug16k> table collision I mean. there hasn't been a hash value collision in 32 bit yet
<doug16k> I measured it 99, not hypothetical
<heat> i do something like that on my dentry cache
<heat> each dentry has a name hash and only name hashes that match get the full memcmp
<doug16k> excellent
<doug16k> that helps drastically
<heat> no fancy data structures though
<heat> it's a linked list
<heat> i should add a hash table or something
<doug16k> ah
<doug16k> so just doing it inefficiently, really efficiently :D
<heat> :D
<heat> it's still probably faster than reading from disk
<heat> even with really big directories
<doug16k> ya for small list it is probably amazing speed
<doug16k> would beat some fancy ones up to some threshold
<doug16k> linear probe hash table is amazing though
<doug16k> if you can afford 0.5 load factor
<doug16k> if you want to sledgehammer more speed, making it 0.25 load factor
<doug16k> it'll be O(1) practically every time
<doug16k> and in the unlikely case of it not being instantly right, it has 3 more chances before it gets grim
<heat> thing is I don't want to use up a lot of memory per directory
<doug16k> your hash function won't be that bad
<heat> most of them are small
<doug16k> you could switch to that when it's huge dir?
<heat> possibly
<heat> a-la ext4
<heat> they only use the fancy hash trees when you have hundreds of entries
<doug16k> beautiful thing about linear probe is, you are actually using (much of) the rest of the cache line it brought in, when you scan for match
<doug16k> you are likely to find match before the next line is needed
<doug16k> nevermind ops, thing locality
<doug16k> think*
<doug16k> and the more pathological the search, the more the cpu prefetches 100% right
<doug16k> so it becomes awesome right when you are sucking
<doug16k> if it pathologically scanned far before finding a certain one
<doug16k> and 98+% of the time you are awesome and the 1st place you looked was correct
<doug16k> probably touch one line
heat has quit [Ping timeout: 255 seconds]
<doug16k> hash table trades worst worst case for unbeatably good common case
<doug16k> if you are already linear searching, you are simulating worst case linear probe already, it can only be faster
YuutaW has quit [Quit: WeeChat 3.1]
<doug16k> if you used the world's dumbest hash function it would be faster
<doug16k> at least it might start close to where it should
<doug16k> and nice hash function will give you the 90's of percent of lookups instantly finding the right one
Affliction has quit [Killed (NickServ (GHOST command used by Affliction`!affliction@user/affliction))]
Affliction has joined #osdev
Oshawott has joined #osdev
<doug16k> my string hash table keys are uint32_t hash, len; char *str; which is 128 bits
<doug16k> 4 keys per line
archenoth has quit [Ping timeout: 252 seconds]
<doug16k> you could intern all the names in the dir into a char vector and hand around those key structs that have precomputed hash and owned string
ckie has quit [Quit: off I go~]
<doug16k> then operator== does the tricks like comparing hash 1st then len, then memcmp
<doug16k> for example
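A sketch of that key struct: the name DictKey and the 128-bit layout come from the discussion above, but the exact definition is an assumption.

```cpp
#include <cstdint>
#include <cstring>

// Precomputed-hash key: operator== compares the hash first, then the
// length, and only falls back to memcmp when both match.
struct DictKey {
    uint32_t hash;
    uint32_t len;
    const char* str;   // not owned here; points into an interned buffer

    friend bool operator==(const DictKey& a, const DictKey& b) {
        return a.hash == b.hash && a.len == b.len &&
               std::memcmp(a.str, b.str, a.len) == 0;
    }
};
```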
ckie has joined #osdev
<doug16k> I also have a thing that wraps a std::string temporary in a DictKey so it uses the string c_str during its lifetime
<doug16k> optimizer loves it, completely understands what I mean
<doug16k> so if your code is checking if it is == "this" or == "that" you have a char const (&str)[N] constexpr that just knows the hash of "this" and "that" and can shortcut the comparison to hash comparison
<doug16k> in your case, it would hardly do that
<doug16k> now the engine has difficulties with things like the AI aiming 4000 turrets, like it should, not hammering memcmp in map find
srjek|home has quit [Ping timeout: 240 seconds]
<doug16k> I wonder why there was no linear probe hash table algorithm in the standard library
<doug16k> only the bucket style one, which stores them all over the place
<doug16k> linear probe is basically a vector that is kept at 2x the capacity a plain vector would use. not that bad for damn near instant lookup and deletion and insertion
<doug16k> it's amortized O(1) insertion too, since you expand it more each time you expand it
<doug16k> probably two instructions to move the entry to the new table when expanding, 128 bit load, 128 bit store
<doug16k> store uses index with different mask
<doug16k> cpu can speculate a mile into that and everything runs as soon as latency permits
<doug16k> no guessing
<doug16k> up to 32 items, you can probably even assume the outer loop branch will predict the end of the loop correctly
<doug16k> if not, it will just cause a lot of false speculation off the end that probably won't be reached due to keeping up with retirement mostly
<doug16k> so you probably have perfect speculation all the way to the last loop branch, then one pipeline flush
<doug16k> it'll overlap the loads, stores, and loop counter updates and address calculations deep enough for them to proceed completely back to back
<doug16k> not like it has no idea where the next one will be until it reads this one
<doug16k> neat thing too is the access pattern, as it scans down, there are only two places it would put this next key, same place in new table, or same offset from middle in 2nd half of new table
<doug16k> so the prefetcher gets to know that you write two contiguous things
<doug16k> the hardware loves linear probe table
<doug16k> about half of them go to 2nd half of new table, so the prefetcher keeps knowing
<doug16k> it's not branches deciding where to put it, it's branch free math - just masking the index
<doug16k> it can overlap that into nothingness
<doug16k> the index is just hash & (table_sz-1)
<doug16k> constrain table_sz is power of two
<doug16k> new table sz is 2x more
<doug16k> one more bit
<doug16k> you can compensate for non-prime table size by using a hash function that emphasizes well mixed least significant bits
<doug16k> just multiplying a number by any prime makes a mess of the low bits
<doug16k> some better than others though
<doug16k> or, test your hash function bswapped, and see if it is suddenly better
<doug16k> i.e., if you would have gotten 1 for your 32 bit hash, you return 0x01000000
<doug16k> it might be better
<doug16k> realistically you would never get 1
<doug16k> most numbers are 8 hex digits. 15/16 of them
<doug16k> 15/16th of all 32 bit numbers have nonzero 31:24
<doug16k> if your hash has even remotely decent mixing, it'll overflow way further than that
<doug16k> er, 31:28
<doug16k> 93.75% of all 32 bit values are >= 268435456
<doug16k> (they would be 8 hex digits to printf %x)
<doug16k> 93.75% of all 64 bit values are >= 1.15292150461e+18
<doug16k> imagine the unlikelihood of 1 for 64 bit hash
<doug16k> assuming decent hash distribution of course
<doug16k> point being, usually you aren't worried about the number being too close to zero and upper hash bits not doing anything
<doug16k> you have index bits coming out your ears
<doug16k> x86 also has a universal way to do crc of any polynomial with vector instructions
<doug16k> the plain crc32c instruction is Castagnoli polynomial 0x1EDC6F41
<doug16k> "Computing a CRC for an Arbitrary Polynomial using PCLMULQDQ"
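A reference bitwise CRC-32C for comparison: the crc32 instruction implements the Castagnoli polynomial 0x1EDC6F41, processed bit-reflected, which is why 0x82F63B78 appears below. The hardware does this 1/2/4/8 bytes at a time; this loop is the slow but readable equivalent. The standard check value for "123456789" is 0xE3069283.

```cpp
#include <cstdint>
#include <cstddef>

// Bitwise CRC-32C (Castagnoli, reflected form of 0x1EDC6F41).
uint32_t crc32c(const void* data, size_t len) {
    const auto* p = static_cast<const uint8_t*>(data);
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; ++k)
            // conditionally xor the polynomial, branch-free
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return ~crc;
}
```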
<clever> that reminds me, the rp2040 MCU, its dma engines can compute a crc as it copies data
<clever> it has a lot of flexibility, to reproduce different checksums, but SD crc isnt one of them
<doug16k> yeah, in hardware, crc computation is almost nothing
<doug16k> for each term of the polynomial there's another xor gate for one bit
<doug16k> nearly nothing
<doug16k> if you do it bit serial
<geist> for lulz the other day i wrote a quickie crc routine to benchmark ARM using the crc instructions
<doug16k> byte serial is more than nothing
<geist> was fairly quick, a few GB/sec
<geist> on a rpi4
<doug16k> probably right at memory bandwidth?
<doug16k> or did you crc the cache
<geist> crced the cache
<geist> 64k in a loop
<geist> but wasnt' far off ideal speed. cpu is running at like 1.8ghz, the crc32c instruction does 32bits at a time, so if it's one per cycle that's like max theoretical around 7GB/sec
<geist> iirc it was something like 4 or 5
<doug16k> 32 bits at a time?
<geist> iirc in the a72 docs it's about 1 per cycle, so that's about right
<doug16k> one crc instruction takes 32 bits of data?
<geist> yah
<geist> i have it turned off or i'd check
<doug16k> unrolled it a good bit so it didn't hide 3 cycle latency behind inc dec jnz?
<geist> not particularly
<geist> like i said it was fairly close to ideal. mostly was interested in making it work
<doug16k> it's 3 cycle latency on zen2, 1 cycle throughput
<doug16k> at like 4.whatever
<doug16k> ghz
<doug16k> and probably way more power
<geist> lets see. the a72 docs say.....
<geist> 2 cycle latency, 1 cycle throughput
<doug16k> amd would have to be 4.7 cycle latency to be same latency
<geist> it also mentions "CRC execution supports late forwarding of the result from a producer CRC μop to a consumer CRC μop. This results in a one cycle reduction in latency as seen by the consumer."
<doug16k> accounting for 4.3/1.8
<geist> sure. but anyway, point is both of them are fairly close to about as good as you can get
<doug16k> point being, not fair to directly compare
<geist> only really thing you can do there is have multiple copies of that execution unit
<geist> more to the point i think using something like crc32 is a fairly good bet nowadays
<doug16k> yeah, they are close to the limit of how well you can lay that logic out
<doug16k> making it pretty awesome
<geist> i suppose could have a 64bit wide crc32 instruction as well
<doug16k> x86 doesn't, I checked
<geist> i forget if arm provides that, but it *does* provide a 16 and 8 bit one too
<doug16k> 8, 16, 32
<geist> so nice for the tail bits
<geist> yah, same on arm i think too
<doug16k> arm has no shortage of special instructions for particular things, right?
<doug16k> it's full of those isn't it?
<doug16k> not even close to "please do everything with and or xor sub add mul div mod not"
<geist> it's pretty good with bit insertions and whatnot
<geist> it's not crazy, but seems to be generally good enough that the compiler has a lot of options to do stuff without long sequences of stuff
<doug16k> yeah, x86 has it rough there. there are amazing instruction set extensions that go largely unused unless you have one of those crazy march=native everything distros
<doug16k> it has extremely fast and complex bit manipulation stuff
<geist> yah arm64 seems to be well designed to Get Stuff Done
<geist> not as highly regular as you'd think, but it's tuned for pretty much what you need
<geist> like, multiple ways to get immediates into the instructions, based on the class of instructions, for example
<geist> less regular, but more powerful. can synthesize more interesting masks for logic ops, vs alu ops which tend to be more integer constant based
<doug16k> they did a great job of cramming a ton of meaning into the instructions
<doug16k> I can imagine disassembling it is pretty rough
<doug16k> just error prone I mean - easy to mess up the interpretation of the bits
<geist> yah but it's a cleaner opcode layout than old arm32 or thumb2
<doug16k> it's actually kind of hard to make a fixed length ISA
<geist> but yeah, there are more than a handful of instruction formats
<geist> vs something like riscv
<doug16k> got to figure out how to use the bits right
<geist> and even riscv has some odd complexities with immediates and whatnot
<geist> designed for hardware, not software emulators
<doug16k> yeah? I thought riscv was easy to emulate due to lack of flags
<geist> oh sure, but the parsing of the immediates is a little wonky
<geist> wonky as in lots of the immediates are broken across 3 or 4 fields
<doug16k> ah, nitpicky sometimes
<doug16k> I'm having too much fun optimizing this open source game code that hammers map,set,string. you keep optimizing the top thing until all the numbers are close together for top bunch of functions
<doug16k> where it was one huge number at top when you started
<doug16k> then the numbers indicate actual good work being done, not how much time it wasted
<bslsk05> ​www.open-std.org: N4455 No Sane Compiler Would Optimize Atomics
<doug16k> then you figure out how to make it do less work, on top of it doing the work it does more efficiently
<moon-child> compiler can use random memory locations to spill to, if it can prove no code will observe those spills
<doug16k> yes, but what it can do and what it will actually do are two different things
<moon-child> sadly
<doug16k> gcc could just crash on purpose the moment you use two union members in C++. it doesn't
<doug16k> it coddles that
<moon-child> yes but using random memory locations for spill is actually ingenius, and doesn't affect program behaviour at all
<moon-child> *ingenious
<doug16k> that's practically how a stack redzone works
<doug16k> you can do that invalid thing up to a point
<doug16k> putting data below the stack pointer being the invalid thing
<moon-child> redzone is only for leaf functions though
<doug16k> yeah
<moon-child> spill usually is you want to preserve a register across a function call
<doug16k> leaves spill too
<moon-child> yeah. But not as much
<doug16k> they tend to be simplistic, yeah
<moon-child> in particular, that optimization improves code size for uninlined code; if you're less likely to inline, your codesz benefits are redoubled
<nur> I now have interrupts working.
<doug16k> neat thing about LTO though, it will take large sections of the program and turn them into one giant leaf
<moon-child> yeah lto is cool
<nur> do I need to enable the timer interrupt to get the timer running?
<doug16k> if you say it really right it will be able to violate the abi and use a more optimal strategy with the registers
<doug16k> like in an anonymous namespace, guaranteeing dynamic link couldn't possibly replace it
<moon-child> I've always wanted compilers to be able to do ad-hoc calling conventions
<moon-child> but do they actually do that in practice?
<doug16k> yes, gcc has an optimization that is exactly that
<doug16k> that's why LTO stresses your __asm__ statements. it holds you to it with your clobbers
<doug16k> can't rely that function already clobbers whatever
<doug16k> it might "know" you don't clobber rcx and rely on it!
<doug16k> even though rcx is supposed to be clobbered
<doug16k> if your asm isn't exactly right, you'll subtly break it in a nearly-impossible-to-debug way
<doug16k> it'll just look like bad codegen
<doug16k> without LTO, it was too blind and had to assume you clobber rcx
<doug16k> usually
<doug16k> so if you forgot rcx clobber, nobody would notice
<doug16k> or if you lied, and said you use a register and didn't say you changed it, but you did
<doug16k> it might ingeniously rely on you not changing that
<doug16k> where it wouldn't have done the crazy flow analysis in normal compile
<doug16k> your code said to set it to whatever value, and flow analysis did a wink and didn't emit instruction, because it "knew" that register is that value
<doug16k> so source is 100% right at point of malfunction
<doug16k> dataflow analysis / value analysis
<doug16k> you proceed with trashed value that your lying __asm__ caused
<moon-child> recent patch to llvm added a thing to automatically check __asm__ constraints
<moon-child> really neat stuff
<doug16k> the problem always exists for callee saved registers. the calling convention can hide the problem though
<moon-child> (though of course my position is that __asm__ is harmful, but¯\_(ツ)_/¯)
<doug16k> for clobbered registers
<doug16k> oh I love __asm__. it is just complete manual override where everything wrong is my fault, and __asm__ gets all the credit when it works right
<doug16k> ideal to me
<doug16k> I wish everything was like that
<moon-child> asm is great, c is great, asm and c don't mix
<moon-child> imo
<doug16k> you get to make IL
<doug16k> it optimizes your stuff into the code
<doug16k> it'll backpressure the register allocator to make stuff already be in the right register
<doug16k> if you constrain it
<doug16k> how awesome is that?
<doug16k> you can let go and say I don't care what register, pick one you like
<doug16k> and register allocator says thank you very much use this stupid one
<doug16k> one that nobody wants to constrain it to use
<doug16k> I don't know how it can be much better than that
<doug16k> more than I'd ask for
<doug16k> you can make it just tell you what register to find things
<doug16k> you don't even care which one
<doug16k> if it's not volatile, it could even participate in code duplication optimizations where it duplicates it for multiple scenarios that are inlined
<doug16k> and each one is optimized into its surroundings
<doug16k> that is 100x more than I'd ask for
<doug16k> in an inline assembly syntax
<doug16k> I can hardly believe how good it is
<doug16k> I'd timidly ask for a way to reliably access local variables and hope they didn't get mad
<doug16k> and I'd fully expect it to be hideous and always force that data into a local variable so I can fetch it
<doug16k> it's drastically better than that
<doug16k> I love what's behind the hideous syntax
<doug16k> the syntax with the string literal is awful
<doug16k> the constraints are hard to use too
<doug16k> high tax for it being optimized into its context
<doug16k> it has hideous maintainability too, it won't even try to validate a thing in gcc
<doug16k> well, it will error a bit, but tons of mistakes are not diagnosed
<doug16k> all the railguns are pointed at your foot but they are not fired if you do it right :P
<doug16k> the railguns point at the problem you are solving with the asm :D
<doug16k> it can perfectly reduce something to one instruction if I want and I can
<doug16k> codegen would be just as good as if the authors of gcc handled that with a real builtin
<doug16k> you are using the same machinery as the backend when you write inline asm
<doug16k> you are emitting a node with inputs and outputs that are handled like everything else
<doug16k> gcc hardly even needs intrinsics, you could just write __asm__ for everything
<doug16k> nothing like msvc where they have an intrinsic for every instruction
<doug16k> and asm is banned
<doug16k> I can see that being a good idea though
<doug16k> as long as they are totally thorough and they always have the instruction I want
<doug16k> gcc method allows it to work with instructions that didn't exist
<doug16k> just coerce it to use different as
<moon-child> intrinsics let it optimize though
<moon-child> it can't reason about the _contents_ of asm
* Griwes . o O (optimizing assemblers)
* Griwes looks at ptxas eating literal hours of cpu time at times
<doug16k> yeah, gpu compilers go berserk inlining
<doug16k> that can make compile very long
<Griwes> it's not just inlining
<doug16k> not too bad nowadays, but in the past when branches were way worse, they went crazy inlining
<Griwes> in tests for our implementation of atomics, we instantiate atomics for a huge number of types and then emit a stream of essentially all operations for them all
<Griwes> ptxas was OOMing on some 8 gig build machines for that test
<Griwes> because it was trying to be clever about the interactions of that entire instruction stream :D
<Griwes> (some __noinline__s placed in a number of places did help)
<doug16k> yeah, you drove up the exponent on some exponential optimization pass
<Griwes> ...yep
<Griwes> working on something else I made it take over 24 hours to finish on a *relatively* simple program
<doug16k> that's the trickiest part of optimizers. you have to have algorithms that can take more time than the sun has fuel, but usually they complete in milliseconds
<doug16k> optimizers have to be able to give up when they realize it's unreasonable
<moon-child> tricky, idk about _trickiest_ though
<doug16k> that happened to AMD's compute shader compiler when first trying huge raytracers. the compiler literally inlined everything, so it was more shader than all the gpus in the world put together
<doug16k> no amount of ram would be enough
<doug16k> the way it works it caused an incredible explosion of cases, because of all the types of materials and all the settings variations exploding
<moon-child> gpus hate loops and branches
<moon-child> so I wouldn't be surprised if they also tend to bloat somewhat unrolling and making things branchfree
<Griwes> all processing units hate loops and branches ;>
<Griwes> and gpus of today are... getting much better at those
<clever> Griwes: i recently found that one of the DSP's ive been playing with has scalar and vector in seperate issue channels
<clever> so while a vector opcode may take 100 clock cycles, the scalar opcodes can freely run in parallel
<moon-child> Griwes: eh...no not really
<clever> so the branch opcode in a for loop, is essentially free, and runs in the same cyclces as the vector opcodes inside the loop
<moon-child> correctly predicted branches are great on cpus
<Griwes> and incorrectly predicted branches mean death
<Griwes> (and side-channels ;p)
<moon-child> and you can arrange for your branches to be predicted correctly. You're not going to be worse off than if you had no branch predictor
<moon-child> actually the same thing is true of gpus, kinda. If _all_ your branches go the same way you're fine. But you have a lot less flexibility than with cpus
<clever> moon-child: that reminds me, the GPU on the rpi is a vector unit, thats wearing a scalar mask
<moon-child> cpus can learn patterns in branches. Newer zen even uses a neural network
<Griwes> warp divergence is much less of a deal nowadays than it used to be
<clever> you can treat it like a scalar processor, if you never branch
<clever> but in reality, its a 16? wide vector unit
<moon-child> clever: huh, neat
<clever> but the branch opcodes, have extra conditions, "if any lane", "if all lanes"
<clever> conditional branch*
<clever> moon-child: essentially, when programming it, you only think of one register bank, and think of the code-flow as scalar
<clever> moon-child: but behind the scenes, the GPU will run your code on a vector core, and compute the color of 16 pixels in parallel, if all use the same shader
<clever> and if you never have conditionals, you never realize the trickery its pulling
<dzwdz> do y'all think that it would be possible to fit some super simple networking into a boot sector?
<dzwdz> it's probably possible to create a valid ethernet frame which is also a bootable x86 binary
<dzwdz> and a boot sector which sends itself across the network sounds like a cool project
<bslsk05> ​en.wikipedia.org: Network booting - Wikipedia
<dzwdz> isn't that handled by the bios?
<GeDaMo> Nowadays it might be, it wasn't always
<klange> PXE is generally implemented in the option ROM for a NIC.
<sahibatko> Hi, just got to a question about UEFI being indifferent (correct me if I'm wrong) about top level paging. So I get my code to run in long mode, 64-bit, can detect the paging level being used + level being supported. So is it really the bootloader's job to make a transition between paging level 4 and 5? Or should I rather just detect + stick with the setting used?
<heat> sahibatko: what do you mean with indifferent?
<heat> UEFI firmware runs on identity-mapped page tables, but they may or may not enable pml5 paging
<heat> (certainly depends on your firmware's version and whether or not they actually care)
<heat> it's your job as the kernel to reinitialise everything
<sahibatko> actually, that answers it, reinitialise everything
* geist yawns
<geist> good afternoon folks
<clever> evening
<nur> hey geist
<nur> I got my isr working
<nur> :D
<heat> nur: an apple a day keeps the sys v abi away
<geist> yay
<geist> re: te 5 level paging question in history. got me thinking, did pml5 ever show up on consumer hardware?
<geist> looks like sunny cove microarch has it from wikipedia
<nur> do I need to enable the timer to get timer interrupts
<nur> or should that just work
<heat> well yes
<geist> the former
<kazinsal> I think the 11th gen Core i-series chips have PML5
<heat> there are also about 4 or 5 different timers so good luck trying to figure things out
<nur> oh boy
<clever> x86 or arm?
<nur> x86 32
<clever> ah, not that familiar with baremetal x86
<nur> but I will ask about arm later
<clever> i know arm in more depth
<nur> will make a point to ask you when that time comes
<heat> the PIT is old and bad (but simple); the HPET is complex but flexible (and kind of bad); the local APIC timer is relatively simple but has lots of dependencies on ACPI and whatnot, probably the best timer you have; the ACPI timer doesn't have interrupts so it's a plain old, bad clock source
<nur> I have a RPI I am itching to hack on
<geist> kazinsal: yah those are rocket lake
<geist> nur: which model?
<nur> heat, I think PIT is what I am looking for
<nur> the... 2B I think
<heat> the TSC is the best clock source you have but has no interrupts on its own and requires the local APIC's help to fire events
<nur> also the 1st one
<geist> yah throw those out and get a new one
<nur> ah I can't afford it right now
<nur> maybe I'll just use qemu's rpi mode
<clever> nur: for hacking, the 2b is fine i think, and some parts are better documented than the 4b
<clever> if you want performance, the 4b is the answer
<heat> note: you need the HPET, PIT or ACPI timer to calibrate the local apic timer and to get the tsc frequency
<geist> 2b is arm32 however
<geist> 3 and 4 are 64bit
<nur> oh
<geist> usually doesn't matter if you're just running linuxy stuff on it, but if you want to bare metal hack it's a fork in the road
<geist> and 64bit is the only fork with a future
<geist> but anyway, start off with the PC stuff
<geist> it's super well documented and lots of folks will help you
<nur> how do I enable the PIT
<bslsk05> ​github.com: Onyx/pit.cpp at master · heatd/Onyx · GitHub
<geist> nur: did you consult the wiki?
<nur> looking at it now
<geist> more to the point, there are lots of articles on the topic, probably better to read those first, then come to us with questions
<heat> oh how did I forget that
<heat> you also have paravirtual clocks like kvm clocks that give you the time
<heat> there's clocks and timers for everyone in x86
<geist> yah linux considers the kvm timesource to be the best AFAICT
<heat> yes
<heat> you can also just use it to get the tsc frequency
<geist> right
<geist> we dont use it as a time base in zircon, but we do read the TSC freq out of it
<heat> also a good tip is to forget periodic interrupts, those suck
<geist> honestly i would say the opposite
<geist> far easier to start with periodic, and it's not as terrible as folks make it out to be
<heat> you can build a timer struct/class that keeps a list of pending clock events and sets the oneshot to the closest event
<geist> sure. but when just getting started that's extra complexity for no real gain
<heat> it's easier but then you get used to a bad design
<geist> disagree. that's an area where you can abstract it the same way, you just replace it with a 'better' design later
<heat> and suddenly your scheduler and timing are mixed together
<kazinsal> yeah you could totally do it with periodics and then dump in a one to one replacement that uses one shots or whatever later so long as your function interface doesn't suck
<geist> while that is a thing i just dont think that's an area that's a huge disaster. you'll end up rewriting that a few times anyway, and it's fairly localized
<heat> if you take the time to abstract it why not do a slightly different code path that does exactly what you want?
<geist> because they're literally just getting started
<geist> and when getting started, it's a lot of times better to get something basically working than the best solution
<geist> secondly, the whole periodic timer thing isn't nearly as bad as you think. it's not modern, but it's totally sufficient
<nur> I am getting overwhelmed
<kazinsal> and this is why the simple solution is the best one to start with
<heat> nur, disregard me
<geist> precisely why i'm saying just set it up to periodic, roll a counter. super simple, will work for years
<heat> do what geist said
<nur> right
<geist> can use it as a time base when just getting started, though it only ticks at 10 or 1ms intervals
<geist> but that's also totally sufficient. dont have to fiddle with TSC, time, etc
<geist> here's my suggestion: figure out how to get the PIT ticking at 100 or 1000hz. get the PIC working so you can take an interrupt
<geist> every time it ticks bump a global volatile variable that is the current time (in units of 10 or 1ms)
<geist> now you killed two birds with one stone with maybe 20 lines of code
<nur> nice!
<geist> now, knowing which 20 lines is the hard part, but my point is it's very simple and it works on *all* PCs
<nur> and qemus
<geist> then you can replace it later with something better as long as you keep that fairly localized behind apis
<heat> nur, qemu emulates a PC
<geist> like 'get_time()' or 'set_timer()' etc for later when you want to build a software timer queue (which you will eventually)
<geist> the key is dont expose the internals of the PIT/etc outside of the timer module so you can upgrade it later, as heat is saying
<nur> right
<geist> but that's a general software engineering thing anyway
<heat> my biggest regret is that I never fully understood how the PIT and PIC work
<heat> they're so cute and simple
<geist> yah
<geist> a bit esoteric, but there's only so much complexity there. simple and weird is at some point still less work than complex and clean
<nur> thanks
<geist> and also to the point there are a bazillion examples/tutorials for PIC and PIT
<geist> though for some reason someone a while back put in a bunch of articles on osdev in some bizarre assembly
<geist> there's always someone like that that has to do it in the hardest possible way to show off their chops
<clever> i prefer doing it in C when possible, then it doesn't matter if you're 32bit or 64bit
<geist> "here's an example of writing to the VGA text mode in brainfuck"
<clever> so you can use older 32bit cpus without any downsides
<geist> or different architectures
<clever> when possible, i don't expect to find a PIT on arm
<geist> indeed. however i *have* seen PIC and PIT on non x86 a bunch of times
<geist> actually if you go into qemu-mips and select one of the machines, 'malta' i think?
<geist> it's basically a PC with a mips on it
<clever> weird, but also reminds me of a post i recently made on the rpi forums
<geist> and maybe some of the CHRP and/or PREP stuff back in the 90s was somewhat PC centric
<clever> somebody was claiming they should make a risc-v based board next, and the engineers replied claiming that they would have to re-learn all of the soc internals
<geist> back in the 90s there were lots of cheapo PC southbridges floating around, so it kinda made sense to toss one on your mobo for your random RISC machine
<clever> my "fix" was to just swap the arm out for risc-v, but keep all of the other broadcom crap :P
<clever> but obviously, its not something RPF can do, it would need to be done by broadcom
<geist> right, and it wouldn't be performant yet since there's not really a riscv core that's as fast as an a72 yet
<geist> maybe the new sifive performance cores
<geist> on paper they look like they're finally in a7x class
<clever> the rpi is also a weird edge case, where C drivers help a lot
<clever> there are a lot of rpi specific drivers, but writing them in arm asm locks out using them on the VPU
<clever> its an edge case, where you can access the hw from 2 different arches, and don't need to swap out the cpu to change arch
<GeDaMo> I read about an open POWER processor and there was a mention of an Alibaba RISC-V where they had to add addressing modes for performance
<bslsk05> ​lists.libre-soc.org: [Libre-soc-dev] ISA analysis
<clever> this also got linked in a recent rpi thread about ecc memory on the pi4b
<clever> the claims in the thread, is that the pi4 has on-chip ecc ram
<clever> so the ram is internally handling ecc, and then presenting a non-ecc api back to the ram controller
<clever> the above pdf, also mentions power savings
<clever> it claims that managing the ecc data increases the ram's power usage by ~5%, but the ability to correct the bit-flips, allows for a much slower refresh rate when in standby/self-refresh mode, resulting in far lower power usage
<bslsk05> ​'Momentum in Open Source Hardware Projects' by OpenPOWER Foundation (01:10:23)
<freakazoid333> mentions IBM Power pi
<heat> screw it, i'm learning rust
<kazinsal> heat: oh no, the rust evangelism strike force claims another one
<heat> they claimed me but I got sidetracked and decided to get firefox to use hardware video acceleration
<heat> my attention span is short.
<heat> i do love it's fucking 2021 and you still need a PhD to enable hardware video acceleration in Linux