<bslsk05>
'CppCon 2019: Chandler Carruth “There Are No Zero-cost Abstractions”' by CppCon (00:59:52)
<geist>
well, shared_ptr does that a lot
<doug16k>
heat, when's the last time you profiled something?
<geist>
honestly that's sort of the biggest cause of bloat i've seen in shared_ptr environments. everyone has to destruct everything
<heat>
doug16k: dunno
<heat>
why?
<doug16k>
just wondering how you are so sure. I profile stuff all the time
<heat>
(go to the 17:00 for the relevant part)
<heat>
I look at disassembly sometimes
<doug16k>
I also disassemble the code and check
<heat>
and I watched this talk
<heat>
it's an ABI issue, not a compiler issue
<geist>
geezus would you guys quit trying to out profile each other?
<doug16k>
I watched my years of experience
<doug16k>
unique_ptr is zero cost abstraction
<geist>
aaaand the thing at 19 is precisely what i was talking about
<geist>
it has to add all these delete paths
<doug16k>
geist, and you wouldn't have added them by hand? just leak
<heat>
there's no destructive move
<geist>
doug16k: in cases where i know it wont go out of scope, yes
<doug16k>
so your solution is to make it shit C code
<doug16k>
where you hand leak it
<doug16k>
because you think you are so smart
<doug16k>
this is why there are 100k security holes in OSes
<geist>
geezus dude
<geist>
i'm steps outta this one
<doug16k>
I mean people, not you personally
<geist>
sounded pretty personal to me
<doug16k>
I realized that
<doug16k>
sorry
immibis_ has quit [Ping timeout: 255 seconds]
<geist>
anyway o
<geist>
i'm on vacation! and learning rust
<geist>
and rust akes all of this moot because you can never ever ever write bad code in rust
<geist>
because it's perfect
<geist>
and all your C++ people need to get with the program!
* geist
hands you a rust bible
<heat>
hexchat is on 100% CPU usage
<heat>
rust rewrite?
heat has quit [Quit: Leaving]
<geist>
yes there are perfect programs and programs that just haven't been rewritten in rust yet
heat has joined #osdev
<geist>
(sarcasm in case folks ddn't pick up, but so far i'm actually kinda pleased with the language)
<heat>
yes it's nice but I'm like on the 2nd or 3rd year of "trying to learn rust and giving up in the process"
<heat>
i somehow can't get myself to learn new programming languages
<geist>
yah i'm being more or less forced to deal with it at work, which is probably not a bad way to go about it
<geist>
grumble, but then i have to stick with it so i can See The Light
<doug16k>
yeah, being forced by clients is the best way to learn new languages :P
<Mutabah>
geist: If you need help, ##rust is pretty active
<geist>
word.
<doug16k>
no way in hell I'd touch React or AWS with a ten foot pole, unless clients dragged me there, kicking and screaming
<doug16k>
forced to learn Swift same way
<doug16k>
I thought simplistic apache shared hosting was bad, until I saw AWS
<doug16k>
apache shared host is super high performance in comparison
<doug16k>
if you asked me to come up with a webserver as slow as AWS, I'm not sure I could without deliberate sleeps/stalls
<doug16k>
maybe if I kept the server off and powered it up with libvirt and restored snapshot
<doug16k>
...between requests
<doug16k>
if you told me that webservers would be resuming from sleep in 2005, I'd have said you have a very vivid imagination
<kingoffrance>
i would have said dont give them ideas :)
<doug16k>
the comedy begins when your lambda server is off, and your db server is off, and the server hosting your secure credential storage is off, you stall starting the lambda, stall getting credentials, stall turning on sql server, stall waiting for query. could be 8 seconds to do a request
* kingoffrance
sees ghost of inetd
silverwhitefish has joined #osdev
<heat>
the osdev wiki is quoted by kernel.org docs :0
<bslsk05>
github.com: linux/trampoline_64.S at master · torvalds/linux · GitHub
<heat>
klysm: that's a panic
<doug16k>
ya that is just a deliberate hang
<heat>
but since it can't panic that early, it hangs
<doug16k>
the jmp is just in case of NMI
<doug16k>
not needed for SMM, if SMM preempts hlt, it crawls back into the halt on return from smm
<doug16k>
they can't have it break every program not thoughtful enough to jmp after hlt
<heat>
when do NMIs fire?
<heat>
if not for the local APIC IPIs that is
<doug16k>
could come from anything
<doug16k>
every MSI(x) PCI device can send an NMI
<doug16k>
all the LINT inputs can be mapped to generate an NMI
<doug16k>
other cpus can send you an NMI
<doug16k>
can send yourself one
<heat>
huh
<heat>
didn't know MSI could do that
<doug16k>
it's the delivery type
<doug16k>
it's encoded into the address
<doug16k>
on x86 it can
<doug16k>
it's not universal
<doug16k>
msi just uses whatever address you say
<doug16k>
on x86, you can deliver different types of interrupts
<doug16k>
correction, delivery type is in the data value
isaacwoods has quit [Quit: WeeChat 3.2]
<doug16k>
10.11.2 of vol3 sdm
<doug16k>
lol, you can send a sipi with msi if you want
<heat>
inb4 you can wake up smp cores with a PCI device
freakazoid333 has joined #osdev
<doug16k>
the probability of getting NMI isn't even close to zero, at least early on. after you have configured every PCI and LAPIC, then you can say NMI is not going to occur
<klysm>
no, the trampoline is not what I thought it was. I meant the idle loop. the trampoline was used for booting other CPU cores in SMP configurations.
tacco has quit []
<geist>
yah for the idle loop you want to hlt or mwait
<geist>
otherwise you'll just burn power
<geist>
and on a modern machine it would probably slow the system down. even if teh core was inside cache entirely, it'd bring up the thermal floor
<geist>
and the cpu would probably throttle itself
<doug16k>
what do you do if ACPI tells you to use NMI?
gog has quit [Ping timeout: 272 seconds]
<doug16k>
ACPI_MADT_REC_TYPE_NMI
<heat>
great question
<heat>
I don't do anything right now
<doug16k>
halt is good because it can save up power budget, which it can use up when it does have something useful to do, by boosting the cpu to a high clock
<doug16k>
same idea with temperature. it can save up coolness so it can boost longer, if thermally limited, when you hlt
<doug16k>
hlt helps battery life drastically
<heat>
is mwait slower?
<doug16k>
it would be madness to spin on battery
<heat>
if you go for the same power state as hlt
<doug16k>
you can control how slow it is allowed to be
<geist>
in practice it's all very complicated
<geist>
sometimes hlt is better, sometimes mwait is better, sometimes both
<doug16k>
you can make it not even evict anything and stay fully awake, or go all the way to writeback invalidate powerdown cache
<geist>
but by the time you're worrying about taht you're already into microoptimize territory
<geist>
there's how it should work, and then there's no Real Implementations deal with it
<doug16k>
you tell the hardware what cstate to go down to and mwait does that
<geist>
i only say that because we've been fiddling with it on fuchsia, and it turns out there are plenty of cases where mwait isn't the right implementation, etc
<geist>
even though it's supported, etc
<geist>
for example we found experimentally that Zen version 1 machines save much more power (10W at the wall) when idling using hlt instead of mwait
<geist>
for Reasons
<doug16k>
yeah I noticed that long ago from perf top
<heat>
step 1: use hlt
<heat>
step 2: find out mwait exists
<heat>
step 3: try out mwait
<heat>
step 4: cry
<geist>
also various virtual machinse do or dont like mwait, but that's usually obvious because they'll just mask it off in cpuid
<doug16k>
if you force linux to use hlt it doesn't even show up in perf top. with mwait, you get huge percent screwing with cstate or something
<geist>
well it's like lots of things in x86. there's levels of implementation, start with the basics and then add more advanced (newer, partially supported, etc) stuff
<doug16k>
all you need is an idle thread that does while (1) halt() it's not even a waste, that is the stack that hosts the irq handler if idling
Izem has quit [Quit: Izem]
<heat>
(it doesn't hlt, but rather prefers mwait)
Izem has joined #osdev
<heat>
^^for all purposes hobby osdev, you can do that
<heat>
it's already great
<doug16k>
most irqs will be handled in the idle thread
<doug16k>
it's not even close to a waste to have a stack for idle thread
<doug16k>
most computers idle far more than doing anything else
<doug16k>
you always imagine all these pegged cpus everywhere when designing a scheduler, when really, it's an unreasonably large number of threads that almost entirely block
<doug16k>
pegged is the unlikely case
<heat>
(unless hexchat is stupid and hogs a core)
<doug16k>
ya and with 32 cpus, pegged core shows up as 3% cpu usage
<doug16k>
I wonder how much that screws up the scheduler
<doug16k>
it shouldn't be 1600*2 percent. it should be 1600*1.25 percent total
<doug16k>
because SMT thread is at best 25% more throughput
<doug16k>
not 100
<doug16k>
I have effectively 20 cpus of throughput. 16*1.25. not 16*2
<doug16k>
shouldn't the scheduler treat it that way?
<doug16k>
actually, what I mean is, shouldn't "top" treat it that way
ElectronApps has quit [Read error: Connection reset by peer]
<doug16k>
would affect accounting too though
<doug16k>
both threads are approx 62% throughput when both running, but accounting treats it as 200%
<doug16k>
as 100% each
ElectronApps has joined #osdev
<doug16k>
getting ripped off. paying 1s of timeslice when you got 620ms
<doug16k>
does it account for that ripoff?
<doug16k>
mine doesn't
<doug16k>
it should discount the timeslice consumption by about 38%, prorated by what portion of time it overlapped the other thread being active
<doug16k>
interference has been conveniently ignored in most schedulers
<doug16k>
if the other thread is killing perf for this victim, tough
<doug16k>
even though today we can tell when it happens
<doug16k>
usually you have enough measurement capability to know for sure how much work this thread got done, precisely
<doug16k>
ops retired, for instance
<doug16k>
measure ops vs insns ratio to estimate this program and use ops retired per unit of time to detect interference
sts-q has quit [Ping timeout: 272 seconds]
<doug16k>
maybe even slightly starve the thread that hammers everyone else
<doug16k>
could even detect how much it hammers cache
<doug16k>
hammering could make your timeslices penalized
<doug16k>
there is nothing like that at all, is there?
<doug16k>
I read an article that talked about using it to detect malicious stuff, like hammering clflush in rowhammer or something
sts-q has joined #osdev
__sen has joined #osdev
<doug16k>
is there an advantage to blocking in linux? do you get bonus time, or is that just lost opportunity and you might as well have pulled all your data into the cache at 100% cpu?
<heat>
blocking doesn't waste cpu time
<doug16k>
so a selfish program has no reason not to spin on all cores to get minimum workitem startup latency
<heat>
if the scheduler doesn't punish you yeah
<doug16k>
in my kernel, the more nice the thread is, the more the kernel bends over backward to give it instant wakeup latency
<doug16k>
if you were burning up cpu, you wouldn't be any more important than the currently running thread
<doug16k>
it's likely I mean
<doug16k>
that causes threads that wait for input queues to wake up with realtime latency
<doug16k>
so gui threads are instant response, even if all cpus at 100%
<doug16k>
gui threads use practically zero cpu
<kingoffrance>
"I have effectively 20 cpus of throughput. 16*1.25. not 16*2" makes total sense, i suspect maybe some expensive real time system does distinguish, while most os do not
<kingoffrance>
seemingly gets more skeewed the more "almost cpus" you have
<doug16k>
yeah, it's off by 12 cpus by the time you are up to 16C 32T
<doug16k>
there are probably ultra corner cases where you get almost 200%, but realistically it's in the 20's
<kingoffrance>
and not to drift off, but rather: im certain other types of "accounting" have this same issue too. probably can find similar in a few fields
<kingoffrance>
if you want accuracy, better distinguish and not just lump disparate things as "the same" when they are not
<doug16k>
yeah. cache I briefly mentioned. you can screw up the execution pipeline or screw up cache content
<doug16k>
or both
<doug16k>
cache one goes beyond SMT, can happen even on single core
<doug16k>
add zero to the first byte of as many lines as you can in an infinite loop, evict everything
<doug16k>
cpu will pray the write allocate was worth it, who'd be dumb enough to touch one byte per line? :P
<doug16k>
that's basically what gcc looks like it is doing when you run concurrently with it :D
<doug16k>
except each byte it touches is super computer scienced and it is actually being fast
<doug16k>
I am amused by how cpu power use and heat goes way up when doing real work in good algorithm, and stays quite a bit cooler with naive algorithm
<doug16k>
the universe won't let you get away with doing that much more work without more power :D
Retr0id9 has joined #osdev
freakazoid333 has quit [Read error: Connection reset by peer]
Retr0id has quit [Ping timeout: 252 seconds]
Retr0id9 is now known as Retr0id
<doug16k>
is crc32 any good as a hash table algorithm?
<doug16k>
for string hashing
<geist>
good question!
<doug16k>
it's almost free on good x86 cpu
<geist>
yah same on arm
<geist>
qustion is does it distribute well across 32bit space
<geist>
its not really required to
<doug16k>
is the field prime?
<doug16k>
if so then the low bits will be well randomized
<geist>
well, someone asked the same question on stack overflow and gave a what looks on the surface pretty good answer
<doug16k>
can you believe this is java's string.hashCode: while (i < len) hash += 31 * hash + charcode(str[i])
<doug16k>
er, not +=, just =
<doug16k>
31 is prime, so it makes the low bits noisy
<kazinsal>
CRC32's prime appears to be 79764919
<geist>
note that x86 does crc32c
<geist>
but it should probably be similar to better
<geist>
just a different polynomial
<kazinsal>
s/prime/polynomial/
<geist>
side note, ben eater did a pretty nice explanation of CRC
<geist>
i'd read it before but he walks you through it in fairly good detail
<bslsk05>
'How do CRCs work?' by Ben Eater (00:47:30)
<doug16k>
yes, even if you already know something, if you watch his stuff he gives you a really nice way to think about it
<geist>
right
<doug16k>
jenkins one-at-a-time hash is soooo good for string keys
<doug16k>
I get O(1) lookup for 99% of keys
<doug16k>
and almost all the rest in 2 steps
<doug16k>
java hashcode is around 92% O(1)
<heat>
ben eater is the best
<geist>
anyway someone on stack overflow does a comparison with the jenkins hash
<geist>
mark adler actually chimes in
<doug16k>
yes, but people get all caught up in how quickly it can compute a hash, and forget how poorly it might make their scan for match
<doug16k>
s/ly//
<doug16k>
java hashcode destroys jenkins for computing the hash. and jenkins destroys java for instantly finding the right one in the table
<doug16k>
so if you lookup far more than you compute hash (because you can often have the string literal hash at compile time with constexpr) then the best distributed hash overwhelms speed of computing hash
<doug16k>
but you can't go too far because you do compute some hashes
<doug16k>
I was considering throwing crc32 instruction at it and seeing if that is any good
<heat>
i've never looked into collisions
<doug16k>
I guess with a hash table, you have to TIAS every time
<heat>
i usually just use fnv and pray for the best
<doug16k>
the "best" hash table might be hash = str[0]; and that's it
<doug16k>
assuming they always start with a different letter and you hash a bazillion new strings
<doug16k>
er, best hash function
<heat>
the best hash table is a linked list
<doug16k>
no way
<doug16k>
linear probe ftw
<doug16k>
excellent locality
<doug16k>
have a good hash function and keep load at or below 0.5
<doug16k>
cpu loves it
<doug16k>
it doesn't have to stall to find next one to look at
<kazinsal>
I'm surprised the GCC preprocessor doesn't have some kind of compile time hash function
<doug16k>
kazinsal, it does in C++, very carefully, with char const (&str)[N] template
<doug16k>
constexpr
<doug16k>
I got endless-sky drastically faster with that
<doug16k>
1500 ships is about 60% cpu
<doug16k>
choppy before
<kazinsal>
nice
<doug16k>
before the game was pretty much all strcmp calls
<doug16k>
looking for keys in map<string
<doug16k>
memcmp actually
<doug16k>
now it is DictKey key that knows hash and O(1) 99% of the time straight to the right one
<doug16k>
and when hash collision it didn't even compare strings, compares hash 1st
<doug16k>
table collision I mean. there hasn't been a hash value collision in 32 bit yet
<doug16k>
I measured it 99, not hypothetical
<heat>
i do something like that on my dentry cache
<heat>
each dentry has a name hash and only name hashes that match get the full memcmp
<doug16k>
excellent
<doug16k>
that helps drastically
<heat>
no fancy data structures though
<heat>
it's a linked list
<heat>
i should add a hash table or something
<doug16k>
ah
<doug16k>
so just doing it inefficiently, really efficiently :D
<heat>
:D
<heat>
it's still probably faster than reading from disk
<heat>
even with really big directories
<doug16k>
ya for small list it is probably amazing speed
<doug16k>
would beat some fancy ones up to some threshold
<doug16k>
linear probe hash table is amazing though
<doug16k>
if you can afford 0.5 load factor
<doug16k>
if you want to sledgehammer more speed, making it 0.25 load factor
<doug16k>
it'll be O(1) practically every time
<doug16k>
and in the unlikely case of it not being instantly right, it has 3 more chances before it gets grim
<heat>
thing is I don't want to use up a lot of memory per directory
<doug16k>
your hash function won't be that bad
<heat>
most of them are small
<doug16k>
you could switch to that when it's huge dir?
<heat>
possibly
<heat>
a-la ext4
<heat>
they only use the fancy hash trees when you have hundreds of entries
<doug16k>
beautiful thing about linear probe is, you are actually using (much of) the rest of the cache line it brought in, when you scan for match
<doug16k>
you are likely to find match before the next line is needed
<doug16k>
nevermind ops, thing locality
<doug16k>
think*
<doug16k>
and the more pathological the search, the more the cpu prefetches 100% right
<doug16k>
so it becomes awesome right when you are sucking
<doug16k>
if it pathologically scanned far before finding a certain one
<doug16k>
and 98+% of the time you are awesome and the 1st place you looked was correct
<doug16k>
probably touch one line
heat has quit [Ping timeout: 255 seconds]
<doug16k>
hash table trades worst worst case for unbeatably good common case
<doug16k>
if you are already linear searching, you are simulating worst case linear probe already, it can only be faster
YuutaW has quit [Quit: WeeChat 3.1]
<doug16k>
if you used the world's dumbest hash function it would be faster
<doug16k>
at least it might start close to where it should
<doug16k>
and nice hash function will give you the 90's of percent of lookups instantly finding the right one
Affliction has quit [Killed (NickServ (GHOST command used by Affliction`!affliction@user/affliction))]
Affliction has joined #osdev
Oshawott has joined #osdev
<doug16k>
my string hash table keys are uint32_t hash, len; char *str; which is 128 bits
<doug16k>
4 keys per line
archenoth has quit [Ping timeout: 252 seconds]
<doug16k>
you could intern all the names in the dir into a char vector and hand around those key structs that have precomputed hash and owned string
ckie has quit [Quit: off I go~]
<doug16k>
then operator== does the tricks like comparing hash 1st then len, then memcmp
<doug16k>
for example
ckie has joined #osdev
<doug16k>
I also have a thing that wraps a std::string temporary in a DictKey so it uses the string c_str during its lifetime
<doug16k>
optimizer loves it, completely understands what I mean
<doug16k>
so if your code is checking if it is == "this" or == "that" you have a char const (&str)[N] constexpr that just knows the hash of "this" and "that" and can shortcut the comparison to hash comparison
<doug16k>
in your case, it would hardly do that
<doug16k>
now the engine has difficulties with things like the AI aiming 4000 turrets, like it should, not hammering memcmp in map find
srjek|home has quit [Ping timeout: 240 seconds]
<doug16k>
I wonder why there was no linear probe hash table algorithm in the standard library
<doug16k>
only the bucket style one, which stores them all over the place
<doug16k>
linear probe is basically a vector that is kept at 2x the capacity a vector would when running. not that bad for damn near instant lookup and deletion and insertion
<doug16k>
it's amortized O(1) insertion too, since you expand it more each time you expand it
<doug16k>
probably two instructions to move the entry to the new table when expanding, 128 bit load, 128 bit store
<doug16k>
store uses index with different mask
<doug16k>
cpu can speculate a mile into that and everything runs as soon as latency permits
<doug16k>
no guessing
<doug16k>
up to 32 items, you can probably even assume the outer loop branch will predict the end of the loop correctly
<doug16k>
if not, it will just cause a lot of false speculation off the end that probably won't be reached due to keeping up with retirement mostly
<doug16k>
so you probably have perfect speculation all the way to the last loop branch, then one pipeline flush
<doug16k>
it'll overlap the loads, stores, and loop counter updates and address calculations deep enough for them to proceed completely back to back
<doug16k>
not like it has no idea where the next one will be until it reads this one
<doug16k>
neat thing too is the access pattern, as it scans down, there are only two places it would put this next key, same place in new table, or same offset from middle in 2nd half of new table
<doug16k>
so the prefetcher gets to know that you write two contiguous things
<doug16k>
the hardware loves linear probe table
<doug16k>
about half of them go to 2nd half of new table, so the prefetcher keeps knowing
<doug16k>
it's not branches deciding where to put it, it's branch free math - just masking the index
<doug16k>
it can overlap that into nothingness
<doug16k>
the index is just hash & (table_sz-1)
<doug16k>
constrain table_sz is power of two
<doug16k>
new table sz is 2x more
<doug16k>
one more bit
<doug16k>
you can compensate for non-prime table size by using a hash function that emphasizes well mixed least significant bits
<doug16k>
just multiplying a number by any prime makes a mess of the low bits
<doug16k>
some better than others though
<doug16k>
or, test your hash function bswapped, and see if it is suddenly better
<doug16k>
i.e., if you would have gotten 1 for your 32 bit hash, you return 0x01000000
<doug16k>
it might be better
<doug16k>
realistically you would never get 1
<doug16k>
most numbers are 8 hex digits. 15/16 of them
<doug16k>
15/16th of all 32 bit numbers have nonzero 31:24
<doug16k>
if your hash has even remotely decent mixing, it'll overflow way further than that
<doug16k>
er, 31:28
<doug16k>
93.75% of all 32 bit values are >= 268435456
<doug16k>
(they would be 8 hex digits to printf %x)
ids1024 has quit [*.net *.split]
yuriks has quit [*.net *.split]
mrkajetanp has quit [*.net *.split]
yuriks_ has joined #osdev
mrkajetanp_ has joined #osdev
ids1024 has joined #osdev
<doug16k>
93.75% of all 64 bit values are >= 1.15292150461e+18
<doug16k>
imagine the unlikelihood of 1 for 64 bit hash
<doug16k>
assuming decent hash distribution of course
andreas303 has quit [*.net *.split]
clever has quit [*.net *.split]
colona has quit [*.net *.split]
hl has quit [*.net *.split]
wereii has quit [*.net *.split]
amj has quit [*.net *.split]
Arsen has quit [*.net *.split]
ElementW has joined #osdev
andreas3- has joined #osdev
<doug16k>
point being, usually you aren't worried about the number being too close to zero and upper hash bits not doing anything
clever has joined #osdev
<doug16k>
you have index bits coming out your ears
hl has joined #osdev
Arsen has joined #osdev
wereii has joined #osdev
YuutaW has joined #osdev
ZipCPU has quit [*.net *.split]
gruetzkopf has quit [*.net *.split]
paulbarker has quit [*.net *.split]
transistor has quit [*.net *.split]
jakesyl has quit [*.net *.split]
zgrep has quit [*.net *.split]
tds has quit [*.net *.split]
seds has quit [*.net *.split]
nohit has quit [*.net *.split]
transistor has joined #osdev
tds has joined #osdev
zgrep has joined #osdev
gruetzkopf has joined #osdev
jakesyl has joined #osdev
Patater has quit [*.net *.split]
abbie has quit [*.net *.split]
les has quit [*.net *.split]
puck has quit [*.net *.split]
MrBonkers has quit [*.net *.split]
kori has quit [*.net *.split]
kciredor has quit [*.net *.split]
tux3 has quit [*.net *.split]
jimbzy has quit [*.net *.split]
kc8apf has quit [*.net *.split]
abbie has joined #osdev
les has joined #osdev
jimbzy has joined #osdev
puckipedia has joined #osdev
kciredor has joined #osdev
tux3 has joined #osdev
Patater has joined #osdev
<doug16k>
x86 also has a universal way to do crc of any polynomial with vector instructions
<doug16k>
the plain crc32c instruction is Castagnoli polynomial 0x1EDC6F41
<doug16k>
"Computing a CRC for an Arbitrary Polynomial using PCLMULQDQ"
<clever>
that reminds me, the rp2040 MCU, its dma engines can compute a crc as it copies data
<clever>
it has a lot of flexibility, to reproduce different checksums, but SD crc isnt one of them
<doug16k>
yeah, in hardware, crc computation is almost nothing
<doug16k>
for each term of the polynomial there's another xor gate for one bit
<doug16k>
nearly nothing
<doug16k>
if you do it bit serial
<geist>
for lulz the other day i wrote a quickie crc routine to benchmark ARM using the crc instructions
<doug16k>
byte serial is more than nothing
<geist>
was fairly quick, a few GB/sec
<geist>
on a rpi4
<doug16k>
probably right at memory bandwidth?
<doug16k>
or did you crc the cache
<geist>
crced the cache
<geist>
64k in a loop
<geist>
but wasnt' far off ideal speed. cpu is running at like 1.8ghz, the crc32c instruction does 32bits at a time, so if it's one per cycle that's like max theoretical around 7GB/sec
<geist>
iirc it was something like 4 or 5
<doug16k>
32 bits at a time?
<geist>
iirc int he a72 dcos it's about 1 per cycle, so that's about right
<doug16k>
one crc instruction takes 32 bits of data?
<geist>
yah
<geist>
i have it turned off or i'd check
<doug16k>
unrolled it a good bit so it didn't hide 3 cycle latency behind inc dec jnz?
<geist>
not particularly
<geist>
like i said it was fairly close to ideal. mostly was interested in making it work
<doug16k>
it's 3 cycle latency on zen2, 1 cycle throughput
<doug16k>
at like 4.whatever
<doug16k>
ghz
<doug16k>
and probably way more power
<geist>
lets see. the a72 docs sazy.....
<geist>
2 cycle latency, 1 cycle throughput
<doug16k>
amd would have to be 4.7 cycle latency to be same latency
<geist>
it also mentions "CRC execution supports late forwarding of the result from a producer CRC μop to a consumer CRC μop. This results in a one cycle reduction in latency as seen by the consumer."
<doug16k>
accounting for 4.3/1.8
<geist>
sure. but anyway, point is both of them are fairly close to about as good as you can get
<doug16k>
point being, not fair to directly compare
<geist>
only really thing you can do there is have multiple copies of that execution unit
Mooncairn has quit [Quit: Quitting]
<geist>
more to the point it hink using somethign like crc32 is a fairly good bet nowadys
<doug16k>
yeah, they are close to the limit of how well you can lay that logic out
<doug16k>
making it pretty awesome
<geist>
i suppose could have a 64bit wide crc32 instruction as well
<doug16k>
x86 doesn't, I checked
<geist>
iforget if arm provides that, but it *does* provide a 16 and 8 bit one too
<doug16k>
8, 16, 32
<geist>
so nice for the tail bits
<geist>
yah, same on arm i think too
<doug16k>
arm has no shortage of special instructions for particular things, right?
<doug16k>
it's full of those isn't it?
kc8apf has joined #osdev
<doug16k>
not even close to "please do everything with and or xor sub add mul div mod not"
<geist>
it's pretty good with bit inserttions and whatnot
<geist>
it's not crazy, but seems to be generally good enough that the compiler has a lot of options to do stuff without long sequences of stuff
<doug16k>
yeah, x86 has it rough there. there are amazing instruction set extensions that go largely unused unless you have one of those crazy march=native everything distros
<doug16k>
it has extremely fast and complex bit manipulation stuff
<geist>
yah arm64 seems to be well designed to Get Stuff Done
<geist>
not as highly regular as you'd think, but it's tuned for pretty much what you need
<geist>
like, multiple ways to get immediates into the instructions, based on the class of instructinos, for example
<geist>
less regular, but more powerful. can synthesize more interesting masks for logic ops, vs alu ops which tend to be more integer constant based
<doug16k>
they did a great job of cramming a ton of meaning into the instructions
<doug16k>
I can imagine disassembling it is pretty rough
<doug16k>
just error prone I mean - easy to mess up the interpretation of the bits
<geist>
yah but it's a cleaner opcode layout than old arm32 or thumb2
<doug16k>
it's actually kind of hard to make a fixed length ISA
<geist>
but yeah, there are more than a handful of instruction formats
<geist>
vs something like riscv
<doug16k>
got to figure out how to use the bits right
<geist>
and even riscv has some odd complexities with immediates and whatnot
<geist>
designed for hardware, not software emulators
<doug16k>
yeah? I thought riscv was easy to emulate due to lack of flags
<geist>
oh sure, but the parsing of the immediates is a little wonky
Izem has quit [Quit: Izem]
<geist>
wonky as in lots of the immdiates are broken across 3 or 4 fields
<doug16k>
ah, nitpicky sometimes
<doug16k>
I'm having too much fun optimizing this open source game code that hammers map,set,string. you keep optimizing the top thing until all the numbers are close together for top bunch of functions
<doug16k>
where it was one huge number at top when you started
<doug16k>
then the numbers indicate actual good work being done, not how much time it wasted
<bslsk05>
www.open-std.org: N4455 No Sane Compiler Would Optimize Atomics
<doug16k>
then you figure out how to make it do less work, on top of it doing the work it does more efficiently
<moon-child>
compiler can use random memory locations to spill to, if it can prove no code will observe those spills
<doug16k>
yes, but what it can do and what it will actually do are two different things
<moon-child>
sadly
<doug16k>
gcc could just crash on purpose the moment you use two union members in C++. it doesn't
<doug16k>
it coddles that
<moon-child>
yes but using random memory locations for spill is actually ingenius, and doesn't affect program behaviour at all
<moon-child>
*ingenious
<doug16k>
that's practically how a stack redzone works
<doug16k>
you can do that invalid thing up to a point
<doug16k>
putting data below the stack pointer being the invalid thing
<moon-child>
redzone is only for leaf functions though
<doug16k>
yeah
<moon-child>
spill usually is you want to preserve a register across a function call
<doug16k>
leaves spill too
<moon-child>
yeah. But not as much
<doug16k>
they tend to be simplistic, yeah
<moon-child>
in particular, that optimization improves code size for uninlined code; if you're less likely to inline, your codesz benefits are redoubled
<nur>
I now have interrupts working.
<doug16k>
neat thing about LTO though, it will take large sections of the program and turn them into one giant leaf
<moon-child>
yeah lto is cool
<nur>
do I need to enable the timer interrupt to get the timer running?
<doug16k>
if you say it really right it will be able to violate the abi and use a more optimal strategy with the registers
<doug16k>
like in an anomymous namespace, guaranteeing dynamic link couldn't possibly replace it
<moon-child>
I've always wanted compilers to be able to do ad-hoc calling conventions
<moon-child>
but do they actually do that in practice?
<doug16k>
yes, gcc has an optimization that is exactly that
<doug16k>
that's why LTO stresses your __asm__ statements. it holds you to it with your clobbers
<doug16k>
can't rely that function already clobbers whatever
<doug16k>
it might "know" you don't clobber rcx and rely on it!
<doug16k>
even though rcx is supposed to be clobbered
<doug16k>
if you asm isn't exactly right, you'll subtly break it in a nearly-impossible-to-debug way
<doug16k>
it'll just look like bad codegen
<doug16k>
without LTO, it was too blind and had to assume you clobber rcx
<doug16k>
usually
<doug16k>
so if you forgot rcx clobber, nobody would notice
<doug16k>
or if you lied, and said you use a register and didn't say you changed it, but you did
<doug16k>
it might ingeniously rely on you not changing that
<doug16k>
where it wouldn't have done the crazy flow analysis in normal compile
<doug16k>
your code said to set it to whatever value, and flow analysis did a wink and didn't emit instruction, because it "knew" that register is that value
<doug16k>
so source is 100% right at point of malfunction
<doug16k>
dataflow analysis / value analysis
<doug16k>
you proceed with trashed value that your lying __asm__ caused
<moon-child>
recent patch to llvm added a thing to automatically check __asm__ constraints
<moon-child>
really neat stuff
<doug16k>
the problem always exists for callee saved registers. the calling convention can hide the problem though
<moon-child>
(though of course my position is that __asm__ is harmful, but¯\_(ツ)_/¯)
<doug16k>
for clobbered registers
<doug16k>
oh I love __asm__. it is just complete manual override where everything wrong is my fault, and __asm__ gets all the credit when it works right
<doug16k>
ideal to me
<doug16k>
I wish everything was like that
<moon-child>
asm is great, c is great, asm and c don't mix
<moon-child>
imo
<doug16k>
you get to make IL
<doug16k>
it optimizes your stuff into the code
<doug16k>
it'll backpressure the register allocator to make stuff already be in the right register
<doug16k>
if you constrain it
<doug16k>
how awesome is that?
<doug16k>
you can let go and say I don't care what register, pick one you like
<doug16k>
and register allocator says thank you very much use this stupid one
<doug16k>
one that nobody wants to constrain it to use
<doug16k>
I don't know how it can be much better than that
<doug16k>
more than I'd ask for
<doug16k>
you can make it just tell you what register to find things
<doug16k>
you don't even care which one
<doug16k>
if it's not volatile, it could even participate in code duplication optimizations where it duplicates it for multiple scenarios that are inlined
<doug16k>
and each one is optimized into its surroundings
<doug16k>
that is 100x more than I'd ask for
<doug16k>
in an inline assembly syntax
puckipedia is now known as puck
<doug16k>
I can hardly believe how good it is
<doug16k>
I'd timidly ask for a way to reliably access local variables and hope they didn't get mad
<doug16k>
and I'd fully expect it to be hideous and always force that data into a local variable so I can fetch it
<doug16k>
it's drastically better than that
<doug16k>
I love what's behind the hideous syntax
<doug16k>
the syntax with the string literal is awful
<doug16k>
the constraints are hard to use too
nyah has quit [Ping timeout: 272 seconds]
<doug16k>
high tax for it being optimized into its context
<doug16k>
it has hideous maintainability too, it won't even try to validate a thing in gcc
<doug16k>
well, it will error a bit, but tons of mistakes are not diagnosed
<doug16k>
all the railguns are pointed at your foot but they are not fired if you do it right :P
<doug16k>
the railguns point at the problem you are solving with the asm :D
<doug16k>
it can perfectly reduce something to one instruction if I want and I can
<doug16k>
codegen would be just as good as if the authors of gcc handled that with a real builtin
<doug16k>
you are using the same machinery as the backend when you write inline asm
<doug16k>
you are emitting a node with inputs and outputs that are handled like everything else
<doug16k>
gcc hardly even needs intrinsics, you could just write __asm__ for everything
<doug16k>
nothing like msvc where they have an intrinsic for every instruction
<doug16k>
and asm is banned
<doug16k>
I can see that being a good idea though
<doug16k>
as long as they are totally thorough and they always have the instruction I want
<doug16k>
gcc method allows it to work with instructions that didn't exist
<doug16k>
just coerce it to use different as
lucf117 has quit [Remote host closed the connection]
<moon-child>
intrinsics let it optimize though
<moon-child>
it can't reason about the _contents_ of asm
* Griwes
. o O (optimizing assemblers)
* Griwes
looks at ptxas eating literal hours of cpu time at times
MarchHare has quit [Ping timeout: 240 seconds]
mhall has joined #osdev
<doug16k>
yeah, gpu compilers go berserk inlining
<doug16k>
that can make compile very long
<Griwes>
it's not just inlining
<doug16k>
not too bad nowadays, but in the past when branches were way worse, they went crazy inlining
<Griwes>
in tests for our implementation of atomics, we instantiate atomics for a huge number of types and then emit a stream of essentially all operations for them all
<Griwes>
ptxas was OOMing on some 8 gig build machines for that test
<Griwes>
because it was trying to be clever about the interactions of that entire instruction stream :D
<Griwes>
(some __noinline__s placed in a number of places did help)
<doug16k>
yeah, you drove up the exponent on some exponential optimization pass
<Griwes>
...yep
<Griwes>
working on something else I made it take over 24 hours to finish on a *relatively* simple program
<doug16k>
that's the trickiest part of optimizers. you have to have algorithms that can take more time than the sun has fuel, but usually they complete in milliseconds
ElectronApps has quit [Remote host closed the connection]
<doug16k>
optimizers have to be able to give up when they realize it's unreasonable
ElectronApps has joined #osdev
<moon-child>
tricky, idk about _trickiest_ though
<doug16k>
that happened to AMD's compute shader compiler when first trying huge raytracers. the compiler literally inlined everything, so it was more shader than all the gpus in the world put together
<doug16k>
no amount of ram would be enough
<doug16k>
the way it works it caused an incredible explosion of cases, because of all the types of materials and all the settings variations exploding
<moon-child>
gpus hate loops and branches
<moon-child>
so I wouldn't be surprised if they also tend to bloat somewhat unrolling and making things branchfree
ElectronApps has quit [Ping timeout: 255 seconds]
ElectronApps has joined #osdev
Giedrius has joined #osdev
<Griwes>
all processing units hate loops and branches ;>
<Griwes>
and gpus of today are... getting much better at those
<clever>
Griwes: i recently found that one of the DSP's ive been playing with has scalar and vector in seperate issue channels
<clever>
so while a vector opcode may take 100 clock cycles, the scalar opcodes can freely run in parallel
<moon-child>
Griwes: eh...no not really
<clever>
so the branch opcode in a for loop, is essentially free, and runs in the same cyclces as the vector opcodes inside the loop
<moon-child>
correctly predicted branches are great on cpus
<Griwes>
and incorrectly predicted branches mean death
<Griwes>
(and side-channels ;p)
<moon-child>
and you can arrange for your branches to be predicted correctly. You're not going to be worse off than if you had no branch predictor
<moon-child>
actually the same thing is true of gpus, kinda. If _all_ your branches go the same way you're fine. But you have a lot less flexibility than with cpus
<clever>
moon-child: that reminds me, the GPU on the rpi is a vector unit, thats wearing a scalar mask
<moon-child>
cpus can learn patterns in branches. Newer zen even uses a neural network
<Griwes>
warp divergence is much less of a deal nowadays than it used to be
<clever>
you can treat it like a scalar processor, if you never branch
<clever>
but in reality, its a 16? wide vector unit
<moon-child>
clever: huh, neat
<clever>
but the branch opcodes, have extra conditions, "if any lane", "if all lanes"
<clever>
conditional branch*
<clever>
moon-child: essentially, when programming it, you only think of one register bank, and think of the code-flow as scalar
<clever>
moon-child: but behind the scenes, the GPU will run your code on a vector core, and compute the color of 16 pixels in parallel, if all use the same shader
<clever>
and if you never have conditionals, you never realize the trickery its pulling
gog has joined #osdev
sprock has quit [Ping timeout: 265 seconds]
GeDaMo has joined #osdev
aquijoule__ has quit [Ping timeout: 252 seconds]
richbridger has joined #osdev
dormito has quit [Ping timeout: 268 seconds]
dormito has joined #osdev
dennis95 has joined #osdev
vdamewood has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
aquijoule_ has joined #osdev
richbridger has quit [Read error: Connection reset by peer]
Izem has joined #osdev
<dzwdz>
do y'all think that it would be possible to fit some super simple networking into a boot sector?
<dzwdz>
it's probably possible to create a valid ethernet frame which is also a bootable x86 binary
<dzwdz>
and a boot sector which sends itself across the network sounds like a cool project
<bslsk05>
en.wikipedia.org: Network booting - Wikipedia
<dzwdz>
isn't that handled by the bios?
<GeDaMo>
Nowadays it might be, it wasn't always
xenos1984 has quit [Quit: Leaving.]
xenos1984 has joined #osdev
<klange>
PXE is generally implemented in the option ROM for a NIC.
z_is_stimky_ has joined #osdev
z_is_stimky has quit [Read error: Connection reset by peer]
ElectronApps has quit [Read error: Connection reset by peer]
ElectronApps has joined #osdev
isaacwoods has joined #osdev
Vercas has quit [Remote host closed the connection]
Vercas has joined #osdev
flx has quit [Remote host closed the connection]
air has joined #osdev
Izem has quit [Quit: Izem]
MrBonkers has joined #osdev
ahalaney has joined #osdev
mcs51 has joined #osdev
<sahibatko>
Hi, just got to a question about UEFI being indifferent (correct me if I'm wrong) about top level paging. So I get my code to run in long mode, 64-bit, can detect the paging level being used + level being supported. So is it really bootloaders job to make a transition between paging level 4 and 5? Or should I rather just detect + stick with the setting used?
<heat>
sahibatko: what do you mean with indifferent?
<heat>
UEFI firmware runs on identity-mapped page tables, but they may or may not enable pml5 paging
<heat>
(certainly depends on your firmware's version and whether or not they actually care)
<heat>
it's your job as the kernel to reinitialise everything
flx has joined #osdev
nyah has joined #osdev
<sahibatko>
actually, that answers it, reinitialise everything
srjek|home has joined #osdev
aquijoule_ has quit [Quit: Leaving]
MarchHare has joined #osdev
Celelibi has quit [Quit: Quitte]
Celelibi has joined #osdev
gareppa has joined #osdev
phr3ak has joined #osdev
gareppa has quit [Client Quit]
paulbarker has joined #osdev
ElectronApps has quit [Remote host closed the connection]
amj has joined #osdev
silverwhitefish has quit [Quit: One for all, all for One (2 Corinthians 5)]
Giedrius has quit [Remote host closed the connection]
heat has quit [Ping timeout: 272 seconds]
srjek|home has quit [Ping timeout: 272 seconds]
freakazoid333 has joined #osdev
heat has joined #osdev
iorem has quit [Quit: Connection closed]
zoey has joined #osdev
freakazoid333 has quit [Read error: Connection reset by peer]
Izem has joined #osdev
IRCMonkey has joined #osdev
IRCMonkey is now known as DeepComa
DeepComa has quit [Client Quit]
heat has quit [Ping timeout: 255 seconds]
heat has joined #osdev
mahmutov has joined #osdev
jjuran has quit [Ping timeout: 258 seconds]
freakazoid333 has joined #osdev
Skyz has joined #osdev
jjuran has joined #osdev
dennis95 has quit [Quit: Leaving]
Skyz has quit [Quit: Client closed]
Izem has quit [Quit: Izem]
Skyz has joined #osdev
Skyz has quit [Quit: Client closed]
Skyz has joined #osdev
silverwhitefish has joined #osdev
springb0k has quit [Read error: Connection reset by peer]
* geist
yawns
<geist>
good afternoon folks
<clever>
evening
<nur>
hey geist
<nur>
I got my isr working
<nur>
:D
<heat>
nur: an apple a day keeps the sys v abi away
tenshi has quit [Quit: WeeChat 3.2]
<geist>
yay
<geist>
re: te 5 level paging question in history. got me thinking, did pml5 ever show up on consumer hardware?
<geist>
looks like sunny cove microarch has it from wikipedia
<nur>
do I need to enable the timer to get timer interrupts
<nur>
or should that just work
<heat>
well yes
<geist>
the former
<kazinsal>
I think the 11th gen Core i-series chips have PML5
<heat>
there are also about 4 or 5 different timers so good luck trying to figure things out
<nur>
oh boy
<clever>
x86 or arm?
<nur>
x86 32
<clever>
ah, not that familiar with baremetal x86
<nur>
but I will ask about arm later
<clever>
i know arm in more depth
<nur>
will make a point to ask you when that time comes
<heat>
the PIT is old and bad (but simple), the HPET is complex but flexible(and kind of bad), the local APIC timer is relatively simple, but has lots of dependencies on ACPI and whatnot, probably the best timer you have, the ACPI timer doesn't have interrupts so it's a plain old, bad clock source
<nur>
I have a RPI I am itching to hack on
<geist>
kazinsal: yah those are rocket lake
<geist>
nur: which model?
<nur>
heat, I think PIT is what I am looking for
<nur>
the... 2B I think
<heat>
the TSC is the best clock source you have but has no interrupts on its own and requires the local APIC's help to fire events
<nur>
also the 1st one
<geist>
yah throw those out and get a new one
Skyz50 has joined #osdev
<nur>
ah I can't afford it right now
<nur>
maybe I'll just use qemu's rpi mode
<clever>
nur: for hacking, the 2b is fine i think, and some parts are better documented then the 4b
<clever>
if you want performance, the 4b is the answer
<heat>
note: you need the HPET, PIT or ACPI timer to calibrate the local apic timer and to get the tsc frequency
<geist>
2b is arm32 however
immibis__ has joined #osdev
<geist>
3 and 4 are 64bit
<nur>
oh
immibis__ is now known as immibis
<geist>
usuall doesn't matter if you're just running linuxy stuff on it, but if you want to bare metal hack it's a fork in the road
<geist>
and 64bit is the only fork with a future
<geist>
but anyway, start off with the PC stuff
<geist>
it's super well documented and lots of folks will help you
<bslsk05>
github.com: Onyx/pit.cpp at master · heatd/Onyx · GitHub
Skyz has joined #osdev
<geist>
nur: did you consult the wiki?
<nur>
looking at it now
<geist>
more to the point, there are lots of articles on the topic, probablybetter to read those first, then come to us with questions
scaleww has joined #osdev
<heat>
oh how did I forget that
<heat>
you also have paravirtual clocks like kvm clocks that give you the time
<heat>
there's clocks and timers for everyone in x86
<geist>
yah linux considers the kvm timesource to be the best AFAICT
<heat>
yes
<heat>
you can also just use it to get the tsc frequency
<geist>
right
<geist>
we dont use it as a time base in zircon, but we do read the TSC freq out of it
Skyz has quit [Quit: Client closed]
<heat>
also a good tip is to forget periodic interrupts, those suck
<geist>
honestly i would say the opposite
<geist>
far easier to start with periodic, and it's not as terrible as folks make it out to be
<heat>
you can build a timer struct/class that keeps a list of pending clock events and sets the oneshot to the closest event
<geist>
sure. but when just getting started that's extra complexity for no real gain
<heat>
it's easier but then you get used to a bad design
<geist>
disagree. thats an area where you can abstract it the same way, you just replace t with a 'better' design later
<heat>
and suddenly your scheduler and timing are mixed together
<kazinsal>
yeah you could totally do it with periodics and then dump in a one to one replacement that uses one shots or whatever later so long as your function interface doesn't suck
<geist>
while that is a thing i just dont think that's an area that's a huge disaster. you'll end up rewriting that a few times anyway, and it's fairly localized
<heat>
if you take the time to abstract it why not do a slightly different code path that does exactly what you want?
mahmutov has joined #osdev
<geist>
because they're literally just getting started
<geist>
and when getting started, it's a lot f times better to get something basically working, then the best solution
<geist>
secondly, the whole periodic timer thing isn't nearly as bad as you think. it's not modern, but it's totally sufficient
<nur>
I am getting overwhelmed
<kazinsal>
and this is why the simple solution is the best one to start with
<heat>
nur, disregard me
<geist>
precisely why i'm saying just set it up to periodic, roll a counter. super simple, will work for years
<heat>
do what geist said
<nur>
right
<geist>
can use it as a time base when just getting started, though it only ticks a 10 or 1ms intervals
<geist>
but that's also totally sufficient. dont have to fiddle with TSC, time, etc
<geist>
here's my suggestion: figure out how to get the PIT ticking at 100 or 1000hz. get the PIC working so you can take an interrupt
<geist>
every time it ticks bump a global volatile variable that is the current time (in units of 10 or 1ms)
<geist>
now you killed two birds with one stone with maybe 20 lines of code
<nur>
nice!
<geist>
now, knowing which 20 lines is the hard part, but my point is it's very simpl and it works on *all* PCs
<nur>
and qemus
<geist>
then you can replace it later with something better as long as you keep that fairly localized behind apis
<heat>
nur, qemu emulates a PC
<geist>
like 'get_time()' or 'set_timer()' etc for later when you want to build a software timer queue (which you will eventually)
<geist>
the key is dont expose the internals of the PIT/etc outside of the timer module so you can upgrade it later, as heat is saying
<nur>
right
<geist>
but that's a general software engineering thing anyway
<heat>
my biggest regret is that I never fully understood how the PIT and PIC work
<heat>
they're so cute and simple
<geist>
yah
<geist>
a bit esoteric, but there's only so much complexity there. simple and weird is at some point still less work than complex and clean
<nur>
thanks
<geist>
and also to the point there are a bazillion examples/tutorials for PIC and PIT
<geist>
though for some reason someone a while back put in a bunch of articles on osdev in some bizarre assembly
<geist>
there's always someone like that that has to do it in the hardest possible way to show off their chops
<clever>
i prefer doing it in C when possible, then it doesnt matter if your 32bit or 64bit
<geist>
"here's an example of writing to the VGA text mode in brainfuck"
<clever>
so you can use older 32bit cpu's without any downsides
<geist>
or different architectures
<clever>
when possible, i dont expect to find a PIT on arm
<geist>
indeed. however i *have* seen PIC and PIT on non x86 a bunch of times
<geist>
actually if you go into qemu-mips and select one of the machines, 'malta' i think?
<geist>
it's basically a PC with a mips on it
<clever>
weird, but also reminds me of a post i recently made on the rpi forums
<geist>
and maybe some of the CHRP and/or PREP stuff back in the 90s was somewhat PC centric
<clever>
somebody was claiming they should make a risc-v based board next, and the engineers replied claiming that they would have to re-learn all of the soc internals
<geist>
back in the90s where were lots of cheapo PC southbridges floating around, so it kinda made sense to toss one on your mobo for your random RISC machine
<clever>
my "fix" was to just swap the arm out for risc-v, but keep all of the other broadcom crap :P
<clever>
but obviously, its not something RPF can do, it would need to be done by broadcom
<geist>
right, and it wouldn't be performant yet since theres no really a riscv core that's as fast as an a72 yet
<geist>
maybe the new sifive performance cores
<geist>
on paper they look like they're finally in a7x class
<clever>
the rpi is also a weird edge case, where C drivers help a lot
<clever>
there is a lot of rpi specific drivers, but writing them in arm asm locks out using them on the VPU
<clever>
its an edge case, where you can access the hw from 2 different arches, and dont need to swap out the cpu to change arch
<GeDaMo>
I read about an open POWER processor and there was a mention of an Alibaba RISC-V where they had to add addressing modes for performance
<clever>
this also got linked in a recent rpi thread about ecc memory on the pi4b
<clever>
the claims in the thread, is that the pi4 has on-chip ecc ram
<clever>
so the ram is internally handling ecc, and then presenting a non-ecc api back to the ram controller
<clever>
the above pdf, also mentions power savings
<clever>
it claims that managing the ecc data increases the ram's power usage by ~5%, but the ability to correct the bit-flips, allows for a much slower refresh rate when in standby/self-refresh mode, resulting in far lower power usage
sprock has joined #osdev
dormito has quit [Ping timeout: 256 seconds]
GeDaMo has quit [Quit: Leaving.]
riverdc has quit [Quit: quitting]
riverdc has joined #osdev
andreas3- has quit [Remote host closed the connection]