<griddle>
should those be something the kernel cares much about?
<wikan>
yes if i use it for /etc :D
<wikan>
if i will
<griddle>
on linux /etc/ is parsed by userspace applications and pushed up to the kernel by various means
<griddle>
not sure if you plan on a userspace of some kind
<wikan>
well, I learned Linux a lot. Used BSD, Windows, BeOS.
<wikan>
I just want my os to work "my way"
<wikan>
don't want to write unix-like or dos-like
<wikan>
so it is hard to explain what I have in my head
<griddle>
well good luck to you friend
<wikan>
it is just small os idea :)
<griddle>
they all start that way
<griddle>
mine started as a "kernel" I used when playing around with the KVM api directly
<griddle>
3 years ago ;)
<wikan>
i am still much more of a beginner
<wikan>
i feel like a 14yo kid
<griddle>
I doubt it
<griddle>
starting an os is a lot of work and is a huge design space
<griddle>
makes sense to think about it for a long time, as it's hard to change it significantly 2 years down the line
<wikan>
i've been thinking about it for 10 years
<wikan>
but only working on the idea for maybe a few months
<griddle>
most of my time in my kernel was spent getting stuff to work at the very start. Once stuff works (memory management, disk access, etc) iterating is really swift
<griddle>
even then, it took 2 months to get RISC-V working on top of the existing system
<klange>
64-bit port over a lunch break, aarch64 in a weekend...
<griddle>
design is the hardest part of programming :)
<wikan>
good i don't need windows :D
<wikan>
no X system required
<wikan>
that would be next thing to write lol
<griddle>
klange: reading the list of arm MSRs takes more than a weekend
<griddle>
too many letters
<griddle>
TPIDRRO_EL2
<klange>
You don't need too many to get your foot in the door on aarch64.
<wikan>
is osdev a source of your knowledge/docs/specs etc?
<klange>
And I had a good reference as a friend of mine is rather well known in the space of Apple M1 bringup ;)
ripmalware_ has joined #osdev
<griddle>
:)
<griddle>
yeah I imagine knowing an arm expert helps
<griddle>
that arm expert in particular :0
ripmalware__ has quit [Ping timeout: 240 seconds]
<PapaFrog>
Have you ever forgotten writing code so badly you run it through a plagiarism checker just to be sure?
<griddle>
i forgot writing my entire x86_64 boot process
<clever>
i once remembered writing an entire feature, but couldnt find a trace of the code on any of my systems
<clever>
my only explanation is that i wrote it in my dreams, lol
<wikan>
hmmm. maybe i could use lucid dreaming to design something
<griddle>
work/life balance is important folks :)
<wikan>
am I the only one without a degree?
<griddle>
klange: if I already have a portable kernel (I support x86_64 and rv64) how much of a pita is getting aarch64 working?
* kof123
<--
<griddle>
I just want to dev on my M1 w/o the TCG slowdown
* wikan
sends thanks to everybody and waves to say goodbye
<bslsk05>
www.usenix.org: AIFM: High-Performance, Application-Integrated Far Memory | USENIX
<geist>
insert <why would anyone ever want $NEW_THING> answer
<mrvn>
griddle: again: laptop? You plan to carry around a bunch of laptops all interconnected for far memory?
<griddle>
the answer in this case is virtualizing "infinite" PGAS
<geist>
anyway finally got around to adding proper feature bit detection, and actually reading all these things out of various machines
<klange>
geist: My understanding was it was actually tested _in QEMU_ before showing up in real hardware?
<geist>
was surprised to notice that
<geist>
yah, probably so
<geist>
finally decoding all the xsave size bits and whatnot. it's not as bad as i thought
<griddle>
the practical reason it's in laptops is probably because intel is lazy and doesn't want 2 mmu impls
<mrvn>
klange: officially by intel or just some user?
<griddle>
linux will probably not enable it
<griddle>
for power reasons or whatever
<griddle>
since it doesn't seem to give a new huge page size
<moon-child>
I don't know if it's actually in laptops, just guessed it might be
<geist>
the practical reason those things get killed in consumer hardware right now is the big.LITTLE nonsense they have now
<moon-child>
since there were some icelake mobile processors
<geist>
ie, AVX512 getting killed in performance cores because the gracemont efficiency cores dont have it
* geist
rolls eyes
<griddle>
yeah
<geist>
come on intel, do a better job
<griddle>
they managed to put avx512 in KNL cores
<moon-child>
yeppp :/
<klange>
mrvn: Supposedly, officially by Intel.
<griddle>
idk what they are smoking over there
liz has quit [Ping timeout: 244 seconds]
<mrvn>
geist: we need more work for supporting big.LITTLE with different feature sets.
<griddle>
not like anyone wrote code to use avx512 anywho
<geist>
they're smoking 'some VP says cram these two pieces of tech together now'
<moon-child>
griddle: why do you think no one wrote code for avx512?
<griddle>
cause nobody paid for icc :^)
<moon-child>
I mean, m1 does fine for itself. 14 pipes on firestorm, just 6 or 7 on icestorm, but same semantics
<griddle>
but also, nobody likes 16 character op codes
<griddle>
p sure the pipeline is smaller on the intel cores as well
<griddle>
don't think that matters from a programmer perspective though
<mrvn>
I've been thinking along the lines of compilers supporting function overloads with march=avx512. You could put those overloads into different physical pages at identical offsets and then map the right pages into the address space when swapping cores.
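No production compiler implements this page-remapping scheme; the closest thing that exists today is function multi-versioning, where gcc/clang emit one clone per ISA and dispatch once at load time via an ifunc, rather than on every core migration. A minimal sketch, assuming an x86_64 gcc with multi-versioning support:

```c
/* Per-ISA clones via GCC's target_clones attribute. Resolution happens
 * once at load time through an ifunc, not per core migration as in the
 * page-remapping idea above -- shown only as the nearest existing
 * compiler mechanism. */
#include <stdio.h>

__attribute__((target_clones("avx512f", "avx2", "default")))
int sum(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)   /* each clone gets vectorized for its ISA */
        s += v[i];
    return s;
}

int main(void)
{
    int v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%d\n", sum(v, 8));    /* dispatches to the best clone for this CPU */
    return 0;
}
```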
<geist>
i was a little disappointed to see that M2 apparently didn't pick up SVE yet
<geist>
yeah M1s are doing okay, but the NEON 128 bit fpu is looking a bit weak right now
<griddle>
M2 is based on A15 I think
<geist>
though apple is good about just diverting attention elsewhere
<moon-child>
m2 seems like a pretty marginal upgrade over m1. Might be in m3 or 'm2 max'
<griddle>
they are about 2y outdated
<griddle>
p sure the only arch change is nested virt?
<geist>
yah, it's whenever they officially pick up ARMv9, where it's mandatory to support SVE
<moon-child>
geist: m1 is similar performance to avx2. avx2 is double the vector width, but m1 has twice the ports, so it balances
<mrvn>
So before migrating the kernel would wait for the code to leave the march= pages and then remap them for big or LITTLE.
<mrvn>
insane?
<geist>
i have been meaning to look at the new cpuid leaf that describes big.little
<geist>
probably documented in intels stuff now
<griddle>
yeah, also, big.LITTLE's usage is incredibly boring
<bslsk05>
dl.acm.org: Don't shoot down TLB shootdowns! | Proceedings of the Fifteenth European Conference on Computer Systems
<geist>
maybe not poll, but you can have async yes. that's intrinsically how arm64 works
<griddle>
clever: on topic
<clever>
so you can issue a shootdown, sleep all threads that depend on it, and then schedule something entirely different
<clever>
so the cpu doesnt stall waiting for it
<griddle>
I feel like shootdowns are in the awkward latency area of "too long to spin" but also "not long enough to reschedule"
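A minimal sketch of the async scheme clever describes, with an atomic acknowledgement count so the initiator can block instead of spinning; all the helpers here (send_ipi_all_but_self, sleep_on, wake_up, local_invlpg) and the vector number are hypothetical:

```c
#include <stdatomic.h>

/* Hypothetical kernel hooks, declared so the sketch is self-contained. */
void send_ipi_all_but_self(int vector, void *arg);
void local_invlpg(unsigned long vaddr);
void sleep_on(atomic_int *counter);   /* park until counter reaches zero */
void wake_up(atomic_int *counter);

#define IPI_TLB_SHOOTDOWN 0xf0        /* made-up vector number */

struct shootdown {
    unsigned long vaddr;
    atomic_int pending;               /* CPUs that have not yet invalidated */
};

/* Initiator: post the request, invalidate locally, then block the
 * dependent threads and let the scheduler run something else. */
void shootdown_async(struct shootdown *req, int ncpus)
{
    atomic_store(&req->pending, ncpus - 1);
    send_ipi_all_but_self(IPI_TLB_SHOOTDOWN, req);
    local_invlpg(req->vaddr);
    sleep_on(&req->pending);          /* no spinning while IPIs complete */
}

/* IPI handler on each remote CPU: invalidate, then ack. */
void shootdown_ipi(struct shootdown *req)
{
    local_invlpg(req->vaddr);
    if (atomic_fetch_sub(&req->pending, 1) == 1)
        wake_up(&req->pending);       /* last CPU releases the sleepers */
}
```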
<mrvn>
clever: why stop other threads?
<moon-child>
this is why we should have \inf hyperthreads
<moon-child>
make the cpu do the scheduling!
<clever>
mrvn: any thread within that virtual space, where you did munmap
<moon-child>
(kidding, mostly)
<mrvn>
clever: unless they have a synchronization point, the time when the unmap (aka shootdown) happens is indeterminate.
<griddle>
didn't intel try to do the hardware context switching
<griddle>
but nobody liked it
<griddle>
(intel did the typical limited table size thing they often do)
<mrvn>
griddle: only because it was slower
<griddle>
im sure intel also limited the entries to 16 or some other uselessly small number
<griddle>
like they did with MPX
<mrvn>
griddle: you can have as many task gates as you like
<griddle>
ah
<griddle>
my bad. It's also only around on IA32 :)
<griddle>
which honestly would have been an awesome architecture
<griddle>
outside of the intel-isms
<griddle>
(mixed up IA32 and IA64 in my head again)
<geist>
hardware tasking was more or less effectively dead on the vine. probably by 386 and definitely by 486 i dont think too many systems used it if they wanted to be efficient
<geist>
aside from the 'you need it for #NMI or #DF' stuff
<geist>
and yeah i think early linux used it, but then it wasn't yet optimized
<mrvn>
geist: I wonder if modern CPUs wouldn't have made it faster than switching manually.
<griddle>
I also feel like it's a problem because the kernel might want some feature that the hardware doesn't provide
<geist>
i did actually implement some code to use it a while back and tested it on some modern hardware. it was *incredibly* slow
<geist>
but then they explicitly say not to use it so it's clearly unoptimized
<griddle>
like, sure the hardware can do it faster than software, but it's also hardware
<mrvn>
griddle: they solved that for xsave
<geist>
funny i'm literally writing that code right now to parse all of those feature bits
<geist>
that's when i noticed the 5 level paging, since the cpuid leaf that reports the vaddr size says 57
<griddle>
put professional reverse engineer on your resume
<mrvn>
griddle: you configure what to save in XCR0 and EDX:EAX
<griddle>
Sure it works in that sense
<geist>
yep, and that cpuid leaf i'm parsing right now tells you, for all the optional things you can save in it, how much space they occupy and at what offset in the saved bits
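That enumeration is CPUID leaf 0xD: subleaf 0 reports the supported XCR0 bits and total area size, and each subleaf i >= 2 reports the size and (non-compacted) offset of state component i. A userspace sketch using gcc's <cpuid.h>:

```c
/* Walk CPUID leaf 0xD to print the size and offset of each XSAVE state
 * component. Hedged: the EBX offsets only apply to the non-compacted
 * save format. Build with gcc/clang on x86. */
#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t eax, ebx, ecx, edx;

    /* Subleaf 0: which XCR0 bits the CPU supports, and total area size. */
    __cpuid_count(0xD, 0, eax, ebx, ecx, edx);
    uint64_t supported = ((uint64_t)edx << 32) | eax;
    printf("XCR0 supported mask: %#llx, max area size: %u bytes\n",
           (unsigned long long)supported, ecx);

    /* Subleaves 2..62: size and offset of each optional component. */
    for (unsigned i = 2; i < 63; i++) {
        if (!(supported & (1ull << i)))
            continue;
        __cpuid_count(0xD, i, eax, ebx, ecx, edx);
        printf("component %2u: %5u bytes at offset %u\n", i, eax, ebx);
    }
    return 0;
}
```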
<griddle>
but lets say you want to optimize some IPC or something between two processes
<griddle>
letting the hardware do the switching takes a lot of control away from the kernel
<mrvn>
geist: the task gate should do a burst write/read of all the state as opposed to the software doing lots of little push/pop. Should give some benefit.
<griddle>
and it's the 99th-percentile tail latencies that kill performance
<geist>
mrvn: sure, except it doesnt. but thats because they have de facto killed the feature
<mrvn>
geist: sure. I doubt it does any bursts at all.
<geist>
worse it seems to probably dump the pipeline and run through a series of slow ass microcode
<geist>
and who knows if/how it's interruptable or whatnot
lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]
<geist>
when i was timing it i was seeing like orders of magnitude worse performance using HW task switching than the equivalent instruction sequence on modern hardware
<geist>
on classic hardware (ie, a 386) it was merely slower
<mrvn>
geist: what puzzles me a bit is that you have XSAVE, which is basically the same thing but as an opcode.
<geist>
yes and no. xsave has a bunch of saved internal state to optimize what it writes out
<mrvn>
why doesn't a task gate fall into the xsave code path?
<geist>
ie, doesn't write out or load bits if it knows the registers are zeros and whatnot
<geist>
go ask intel.
<griddle>
i'm sure it's MMU related
<geist>
but again the task gate is dead, so it's an irrelevant discussion
<geist>
as in it's literally unimplemented for 64bit code, etc etc
<griddle>
the complexity of the MMU went through the roof when they moved past i386
<griddle>
from a hardware perspective, you probably also don't want to have to maintain and test 2 codepaths to do context switching
<griddle>
(interrupts and task switch)
<mrvn>
It could be nice to have a hardware scheduler, something trivial like round-robin. Make a circular linked list of tasks and the CPU runs them as hyperthreads.
<griddle>
though the other question is why this can't just be crammed into SMT
<griddle>
mrvn: isn't this kinda what gpus do?
<mrvn>
whenever a task blocks it could switch to the next
<mrvn>
griddle: I believe so.
<griddle>
a problem might come up with how intel does ASID in the TLB
<griddle>
pretty sure hyperthreads share a TLB
<griddle>
so if you cram a bunch of hyperthreads into a single HART, you'll need to extend ASID
<mrvn>
it's also a lot of registers if you have more hyperthreads
<griddle>
what does the core do if all threads are blocked on memory ops?
<mrvn>
griddle: I wouldn't assume you could have more hyperthreads than ASID.
<mrvn>
griddle: wait for memory.
<griddle>
last time intel had more than 2 hyperthreads was KNL
<griddle>
they needed to do barrel processing
<clever>
barrel processing?
<griddle>
round-robin each cycle between threads
<griddle>
each hyperthread is effectively 4x slower
<clever>
ah, thats very much like the shader core on the rpi
<mrvn>
But these would be virtual hyperthreads. So even with 2 real ones you would put 128 threads in the ring.
<clever>
where the pipeline is 4 clocks long, and is ALWAYS running 4 different threads
<griddle>
but, if your application is memory bound (each KNL core had avx512, so that was the desired usage model)
<mrvn>
griddle: there are still opcodes that don't use memory.
<griddle>
true
<clever>
but the shader core is more restricted, in that all 4 threads must be running the same opcode, and any stalling is shared across all 4 threads
<griddle>
those are expensive
<griddle>
it's not every cycle, it's more like every N cycles
<griddle>
not sure what N is
<mrvn>
griddle: those would become cheap. There is no waiting for a branch condition or speculative execution. By the time the thread runs the next opcode the last one will have finished no matter what.
<griddle>
I have access to a few KNL boxes thru my lab. Running coremark on it
<mrvn>
If you have 64 virtual threads and 4 hyperthreads then anything that finished in 16 cycles has no wait.
<griddle>
4406, which is what my kernel gets inside a RISC-V VM inside an aarch64 vm
<griddle>
mrvn: what might be the policy on switching to a new virtual thread?
<mrvn>
griddle: as you suggested every cycle.
<griddle>
I think part of the reason why we haven't seen something like this is that the kernel can make better sense of the system than the hardware can, right?
<clever>
one trick barrel processing gives with the QPU is that you never need to stall because 2 opcodes depend on each other too early
<griddle>
it's pretty rare that you have that much CPU over-provisioning
<clever>
because each opcode takes 4 clock cycles to run, and has fully cleared the pipeline before the next begins
<mrvn>
griddle: don't think so. The problem is that to switch 4 threads every cycle you have to have all the state of 64 threads in the CPU at all times.
<mrvn>
64 sets of regular registers, 64 sets of AVX512 registers. That gets big.
<griddle>
would probably pull an intel and kill avx512 again :)
<clever>
mrvn: i think the QPU has 192 copies of all of the "scalar" registers, the vector stuff is shared
<mrvn>
clever: yeah, but that's probably less than 64x AVX512
<griddle>
but like, the kernel can determine usage patterns of an app, priority levels (which may be a new representation for your OS), etc
<clever>
mrvn: yeah, its only something like 192 x 32 x 32bit regs
<griddle>
hardware can know about stall times and whatnot, but if you want to encode something new, you need new hardware
<griddle>
forgive me, but what is the QPU again?
<mrvn>
griddle: I always found that to be mostly pointless, and when you actually need it, more wrong than right
<clever>
griddle: QPU is the shader core on all rpi models
<griddle>
ahh okay
<griddle>
what is the Q?
<clever>
quad processing unit
<clever>
its got a lot of quad'ing going on
<clever>
4 threads/reg-banks, sharing 1 pipeline, in a barrel processing manner
<griddle>
ah so its a marketing name
<clever>
with the restriction that all 4 threads are running the same opcode
<griddle>
not particularly descriptive :)
<mrvn>
4 values in a vector?
<clever>
mrvn: 16 i think
<clever>
i'm fuzzy on the exact details
<mrvn>
4 squared, double the power
<griddle>
basically, the "warp" size is 4
h4zel has quit [Ping timeout: 268 seconds]
<clever>
nearly all of the docs treat it as a scalar core
<mrvn>
Note 16 is exactly what you need for 3D graphics.
<clever>
but behind the scenes, its a 16 lane vector, and runs your program 16 times in parallel
liz has joined #osdev
<griddle>
16 b/c 4x4 projection matrix?
<mrvn>
that would be the one
<clever>
i dont think its doing that
<clever>
each lane is computing a separate vertex
<clever>
so for every 4 clock cycles, it makes 1 opcode of progress, on 4 separate vertices, and then things get a bit fuzzy
<clever>
the way i originally learned it, is that 4 pipelines are sharing 1 opcode decoder, taking turns decoding an opcode
<clever>
because each only has to decode an opcode every 4 cycles
<clever>
but that doesnt fit with the 16 lane nature, and running the same opcode in all 16 lanes
<griddle>
gpus are scary to me
<griddle>
you often need so much support around them to do anything
<griddle>
a gpu driver often requires you have some kind of compiler as well
<griddle>
or at least some kind of abstract command queue you can compile
<bslsk05>
github.com: lk-overlay/v3d.c at master · librerpi/lk-overlay · GitHub
<clever>
moon-child: this says to use vertex 0, 1, and 2, from the vertex data on line 335-386
<clever>
triangle_fan would let you reuse indices
<clever>
it also has a non-index mode, where it just uses every vertex in the list
<clever>
with that example code, you just want to increase the length of the vertex data, the primitiveList, and the size passed in ~3 places, and boom, more polygons
<bslsk05>
github.com: gl/core.c at master · cleverca22/gl · GitHub
<clever>
moon-child: an example of implementing opengl ontop of this
<griddle>
you're telling me you didn't write a glsl compiler
<griddle>
smh
<clever>
griddle: i never got to that step, all shaders were hand-written in asm, and then pre-assembled
<griddle>
:)
<griddle>
impressive regardless
<clever>
griddle: originally, i was trying to implement all of opengl from scratch, in a race, and hit a brick wall when i couldnt figure out vertex shaders
<clever>
i later re-visited that code to get 3d working in baremetal
<griddle>
decided i'm gonna get my aarch64
<griddle>
... port working
<geist>
ugh. intel really screwed up the big.little thing in cpuid. it looks like it's cpuid leaf 0x1a
<geist>
instead of describing which cores are big and little, they simply defined an 8 bit field in eax (bits 31:24) that says 'this is an atom' or 'this is a core'
<geist>
with yet another long ass future list of what the bits are supposed to mean. goddamnit intel
<clever>
geist: ive heard that big.little has screwed with anti-cheat software, which thinks your VM is leaking and bans you
<geist>
it doesn't even seem to be laid out interestingly: 10h is reserved, 20h means atom, 30h is reserved, 40h is core
<geist>
it's not defined as 'everything else is reserved'; no documentation as to what they're going to do with that
<geist>
ie, not future proof at all. does this mean in gen 13 cores they're going to use 0x41? or 0x50? why 20h and 40h?
<bslsk05>
www.intel.com: Game Dev Guide for 12th Gen Intel Core Processor Hybrid Architecture
SpikeHeron has quit [Quit: WeeChat 3.6]
<geist>
gosh i can't think of a more half assed implementation
<geist>
sigh.
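For reference, a hedged sketch of what detection on this leaf looks like: check the hybrid flag (CPUID leaf 7, EDX bit 15), then read the running core's type from leaf 0x1A EAX[31:24]. As geist notes, it only describes the core you happen to be on, so you'd have to run it on every CPU to learn the topology:

```c
/* Hybrid-core detection per the public CPUID enumeration:
 * leaf 7 EDX[15] = hybrid part, leaf 0x1A EAX[31:24] = core type
 * of the *current* CPU (0x20 = Atom, 0x40 = Core). */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned a, b, c, d;

    __cpuid_count(7, 0, a, b, c, d);
    if (!(d & (1u << 15))) {
        puts("not a hybrid part");
        return 0;
    }

    __cpuid_count(0x1A, 0, a, b, c, d);
    switch (a >> 24) {            /* core type lives in EAX[31:24] */
    case 0x20: puts("this core is an Atom (E-core)"); break;
    case 0x40: puts("this core is a Core (P-core)");  break;
    default:   puts("reserved/unknown core type");    break;
    }
    return 0;
}
```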
SpikeHeron has joined #osdev
<griddle>
ah intel
<griddle>
all of that implies things can be multiple
<griddle>
atom and core
<geist>
yeah. and that now breaks the implicit assumption that you've been able to make up until now that from *any* cpu in the system you can determine the topology of the rest of the system
<geist>
which i guess was always doomed to fail, but it had a good 30 year run i guess
<geist>
the arm and riscv folks are laughing on the sidelines, because they've never had this luxury to be taken away
<griddle>
CPUID is kinda a goofy feature anyways
<geist>
you wanna know big.little on arm? you gotta be told, in the form of something outside of the cpu
<griddle>
DTB isn't better
<griddle>
but still
<geist>
right
<griddle>
it's better for actual external hardware
<griddle>
it's something that could be burned into rom
<geist>
i was hoping leaf 0x1a would have something like 'there are N clusters, here's how you iterate through them'
<heat>
you need machine mode or the dtb for a cpuid-like thing in riscv
<geist>
and then a mechanism to describe each cluster of cores and what their apic id range is
<geist>
would be perfect
<griddle>
"sounds like something that could go in the acpi tables" - intel
<geist>
there is already a bunch of precedent for this on other cpuid leaves, alas
<geist>
so that's a question: are there more acpi tables for this?
<griddle>
idk
<griddle>
ask microsoft
<heat>
great question
<griddle>
they got special treatment w/ 12th gen
<geist>
i'm bitching about this because i will have to add this exact nonsense to fuchsia as soon as i get ahold of a 12th gen intel
zaquest has quit [Remote host closed the connection]
<geist>
and so far we've been relying on 'cpu 0 + acpi is enough info to discover everything about topology'
<griddle>
fuchsia doesnt support machines big enough for numa yet right?
<geist>
at least they defined another feature bit elsewhere that says 'this is a hybrid cpu'
<geist>
so you at least know if you should be checking leaf 0x1a
<geist>
yeah not really. we parse the SRAT and SLIT table but dont currently do anything with it
<geist>
but we do make a good attempt to discover the full topology, which we then feed into the scheduler
<bslsk05>
uefi.org: Advanced Configuration and Power Interface (ACPI) Specification — ACPI Specification 6.4 documentation
h4zel has joined #osdev
<geist>
heat: oh that's nice. also huh didnt notice the MSI frame table in the MADT
<geist>
that describes how the GICv2 MSI works.
zaquest has joined #osdev
<geist>
as a side note the GIC CPU fields for ARM in the acpi spec *does* have a field that describes big.little
<geist>
though it's not that meaningful. basically just an 8 bit number whose absolute value is meaningless, but tells you which cpus are more uber than other ones
<geist>
oh hey, there's a new MADT table entry for 'multiprocessor wakeup structure'. looks like precisely a description of how to use the raspberry pi 4 style wakeup
<griddle>
think we will ever get aarch64 cores which are socketed?
<geist>
well, technically they exist, but i assume you mean in a consumer format
<griddle>
I feel like so much of arm is designed in a way that assumes you won't change the CPU on a motherboard
<griddle>
My lab just bought one of the big ampere machines, but still
<heat>
PRM is the new thing intel and microsoft are pushing to replace some good chunks of SMM
lkurusa has joined #osdev
<griddle>
I'm gonna head out for the night. Been nice chatting yall
<heat>
<3
<geist>
kk!
<gog>
hi
<heat>
greetings gog
<ddevault>
when I unmap a page table, do I need to invalidate its entire address space?
<ddevault>
yes, or reload cr3
<ddevault>
reload cr3 sounds better
mavhq has joined #osdev
<geist>
what precisely do you mean by unmap a page table?
<geist>
you mean because there are no more pages in a range, remove the table that held it, but otherwise keep the page table *structure*?
<geist>
(sometimes folks say page table when they mean the whole tree of tables, and vice versa)
<ddevault>
I mean any of the page tables (PDPT, PD, PT)
<ddevault>
and yeah, empty or not, the whole address range they describe becomes invalid
pretty_dumm_guy has joined #osdev
<geist>
yah you dont have to dump the entire thing. Basically on x86 you simply have to invalidate a page that the page table covered
<geist>
it's mentioned in the manual, but effectively what's going on is the cpu is allowed to keep a page table walker cache (caching inner 'pointers' between page tables), but it must invalidate the path leading to the page table when you invalidate anything that it 'covers'
<geist>
so if you are unmapping something and you unmap the last thing in a page table, then if you invlpg that address you're free to just remove the page table from the layer above
<geist>
since the invlpg will throw away the page table walker cache as well
<geist>
AMD has some optional features where you can turn off that behavior so you can hypothetically take more control into your own hands and explicitly invalidate the page table walker cache, but i wouldn't recommend it
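geist's rule as a minimal x86_64 sketch: clear the PTE, invlpg (which also evicts the walker-cache path covering the address), and free the table if it is now empty. pte_t and pt_free are hypothetical stand-ins for a real kernel's types:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;
#define ENTRIES_PER_TABLE 512

void pt_free(pte_t *pt);             /* hypothetical page-table allocator hook */

static inline void invlpg(uintptr_t va)
{
    __asm__ volatile("invlpg (%0)" :: "r"(va) : "memory");
}

static bool table_is_empty(const pte_t *pt)
{
    for (int i = 0; i < ENTRIES_PER_TABLE; i++)
        if (pt[i])
            return false;
    return true;
}

void unmap_page(pte_t *pt, pte_t *pd_entry, uintptr_t va)
{
    pt[(va >> 12) & 511] = 0;        /* clear the leaf PTE */
    invlpg(va);                      /* drops the TLB entry *and* the
                                        walker-cache path covering it */
    if (table_is_empty(pt)) {
        *pd_entry = 0;               /* unlink the now-empty page table */
        pt_free(pt);
    }
}
```

As mrvn notes below, past some number of pages a full cr3 reload beats per-page invlpg, with the caveat that global entries survive the reload and still need invlpg.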
<mrvn>
So INVLPG will always throw away the L4 table?
<mrvn>
or just the one entry to the page?
fwg has quit [Quit: .oO( zzZzZzz ...]
<mrvn>
ddevault: if you unmap a range then at some point reloading cr3 is faster than INVLPG every page.
<ddevault>
aye
<ddevault>
this was my conclusion
<mrvn>
beware of global pages, you have to INVLPG them as reloading CR3 doesn't wipe them.
<ddevault>
global pages? ack
<mrvn>
Entries with the global bit set
<mrvn>
So basically everything in kernel space.
<ddevault>
aye
Matt|home has joined #osdev
<mrvn>
do you use PCID?
<ddevault>
not yet
<bslsk05>
en.wikipedia.org: Intel High Definition Audio - Wikipedia
<gog>
each codec can implement support for different types of streams and exposes them to the bus
<PapaFrog>
I wish there was a damn consensus when it comes to some hardware.
<zid`>
It turns out though that it's not as portable as you'd hope, as the chips each support their own 'extras', and have different pin wirings and stuff
<zid`>
but it's much better than them all being randomly totally different
<zid`>
(soundblaster era)
<PapaFrog>
I'm tempted to play with the speaker. lol
wootehfoot has joined #osdev
<GeDaMo>
1 bit should be enough for anyone :P
<gog>
yes, the PIT and cpu-bound I/O is all the sound hardware you need
<gog>
and very careful timing
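A hedged sketch of gog's point: PIT channel 2 as a square-wave generator gating the PC speaker, assuming ring-0 port I/O on x86; outb/inb are written out so the snippet stands alone:

```c
#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile("outb %0, %1" :: "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
    uint8_t v;
    __asm__ volatile("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

void speaker_tone(uint32_t hz)
{
    uint16_t div = 1193182 / hz;       /* PIT input clock / frequency   */
    outb(0x43, 0xB6);                  /* channel 2, lo/hi, square wave */
    outb(0x42, div & 0xFF);            /* divisor low byte              */
    outb(0x42, div >> 8);              /* divisor high byte             */
    outb(0x61, inb(0x61) | 0x03);      /* timer-2 gate + speaker enable */
}

void speaker_off(void)
{
    outb(0x61, inb(0x61) & ~0x03);
}
```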
<mrvn>
GeDaMo: welcome to cmov computing.
<geist>
mrvn: invlpg will throw away the page, and all of the page table cache leading up to it
<geist>
thus why it's safe to just throw away a page table as long as it has no entries
<geist>
that's why i was asking questions about what is precisely going on
<geist>
if you want to en-masse toss a bunch of page table entries and not individually invlpg them, then yes you can toss the entire cr3 (though as you say global pages are different)
<geist>
but *generally* aside from special cases most mmu code does one page at a time, and thus when you're removing a page table it's because you just unmapped the last page from it
<geist>
and if you invlpged it, then you can simply toss the page table, and you dont have to reload cr3
<mrvn>
geist: but what does it cache? The whole 4k page of each level or individual entries?
<geist>
neither. you're talking about the page table cache?
<geist>
more specifially the page table walker cache?
<mrvn>
the page walker cache
<geist>
it stores basically 'this virtual address's range is at page table X physically'
<geist>
basically short circuits walking the entire page table by jumping directly to the end of the walk
<geist>
so when you invlpg it also throws away any page table cache entry that refers to the page table leaf you just invlpged
<geist>
thus forcing the page table tree to be fully walked next time it loads a TLB entry
<geist>
in that space
<mrvn>
I would have thought it's levels of cache. If it can't find the L4 table in cache it looks for the L3 table, then the L2 table, and lastly it looks up the L1 table entry from scratch.
<geist>
nah the whole point is to short circuit the entire walk
<geist>
note the x86 manual doesn't describe how it works, so it's also possible it does what you say, but the rules as written are: you dont have to worry about evicting it if you use invlpg
<geist>
ARM64 has the same thing and the manual describes it in fairly intricate detail
<mrvn>
It would. But when you move past that 2MB the next access misses and then has to walk from scratch. My way it would still have the L3 table in cache and only have to walk the last step.
<geist>
so i'm assuming the x86's work basically the same way
* geist
shrugs
<geist>
it would be less useful because of the mechanism i just described: evicting the page table walker cache
<geist>
would have to walk up the tree and evict everything in the walker cache
<mrvn>
That's why I wondered how much it would evict.
<geist>
such that the inner nodes would be far less useful, they'd get tossed all the time
<geist>
if it just stores leaf nodes in the walker cache it only has to evict at most 1
<mrvn>
probably why they aren't doing it that way.
<geist>
right
<mrvn>
Drawback is that you get a full page table walk every 2MB.
<geist>
indeed, however it can store as many walker entries as it has cache for
<geist>
so it may track a large amount
<mrvn>
Every entry caches 2MB. Except when you have huge pages then it's 1GB or 512GB.
<geist>
right
<geist>
only real reason i know about it is like all things on ARM64 this is fully exposed. software is responsible for maintaining it
<geist>
also recent AMD cpus have a feature bit you can hit that enables the explicit maintenance of it, but AFAICT linux doesn't use that feature, so it's probably dead on the vine
<geist>
idea is if you know you aren't invlpging the last entry in a page table, dont evict the walker
<geist>
or conversely, only evict the walker when you know you're removing the last entry, which is what you have to do on ARM64
lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]
<mrvn>
And since you need to track that anyway to free the PT that is basically free.
<mrvn>
can you reload CR3 without evicting the walker?
<geist>
i dont think so. also i *assume* it gets evicted even when you have global pages
<geist>
since it'd only be able to mark the walker cache entry as global if it somehow knew that every single page in the page table is also global, etc
<mrvn>
Every time you change a page from RW to RO or vice versa you don't have to change the walker.
<geist>
right
<mrvn>
Or populate a mmaped region
<geist>
obviously if you did something (this would be odd) like copy a page table from one to another for the purposes of moving the page table to somewhere else physically
<geist>
you'd have to evict the page table walker cache and not the pages themselves
ZipCPU has joined #osdev
<geist>
say you were defragging physical ram
<mrvn>
if you replace a page table or free large amounts of memory you have to kill the walker. But that is really the minority.
<geist>
anyway PT walker caches are i think a really big reason the page tables themselves aren't *that* much of an overhead
<geist>
would be interesting to see what kinda hit rate they get. i dunno if that's exposed in any of the profile vars
jafarlihi has joined #osdev
<mrvn>
totally. Doing 4 (soon 5) entry lookups would be much slower.
<jafarlihi>
geist: Hey! Do you know if there's DHCPv4 client impl in Fuchsia? I can find DHCPv4 server and DHCPv6 client but no DHCPv4 client.
<mrvn>
Hmm, why aren't page table lookups in the L1 cache? :)
<geist>
there should be
<geist>
fuchsia gets ip addresses from dhcp servers all the time
<mrvn>
or are they?
<geist>
i think the cpu cache also is free to cache the page tables yes
<geist>
on ARM it's a bunch of control bits in the TTBR and whatnot
<mrvn>
Would be hard to do with virtual address indexing
<geist>
but on ARM at least that's why there are some memory ordering issues when dealing with modifying page table entries and whether or not the page table walker 'sees' it
<geist>
since in ARM at least the walker is considered basically its own cpu. from the point of view of the data cache, it is independently walking through it in more or less the same way a cpu core does
<geist>
and thus you need memory barriers and whatnot at particular spots for the walker to see what you just wrote, etc
<mrvn>
geist: that says the walker isn't using the cache, so you have to flush writes out of it for them to become visible.
<geist>
no memory barriers are not a cache flush
<geist>
they're simply forcing the cpu to drain its write buffer and get stuff into the cache subsystem
<geist>
weakly ordered cpus are a bitch!
<geist>
it gets even more complicated when you enable the A and D bit feature on ARM. now you have to do everything with careful atomics, since the page walker can be modifying page table entries in a weakly ordered way
<geist>
i haven't yet written that code for fuchsia, but it's on my list of things to wade through this year
<mrvn>
only if you have threads or are modifying another thread's tables.
wootehfoot has quit [Ping timeout: 252 seconds]
<mrvn>
I never turned on the hardware A/D bits since I believe the RPis don't (all?) have them.
<geist>
no, not at all. think of the walker itself as another cpu
<geist>
it's also asynchronously modifying page table entries
<geist>
there are a huge pile of rules about it but it makes the page table code much more complicated, since you have to use atomics for all of them
<mrvn>
geist: is there a barrier the walker respects?
<geist>
well, it's less of the barrier respecting it and more making sure the cpu is coherent relative to the walker at points where it matters
<mrvn>
or rather one that respects the walker and waits for it to finish writing
<geist>
same as any other weakly ordered shared data accesses, basically
<mrvn>
I assume the walker is in the inner shareable domain, right?
<geist>
yes
<geist>
the ARM ARM defines this is fairly intricate detail, precisely what semantics the walker uses when it accesses or modifies page tables
<geist>
based on that the cpu has a series of rules to follow to keep itself in sync. without A/D bit the walker is RO, and never modifies entries
<geist>
so the rules are far simpler. that's the gist of it
<mrvn>
So barrier, modify tables, barrier. The walker can't modify any (user) tables while I'm in kernel.
<mrvn>
atomics maybe for kernel tables
<geist>
sure it can. other cpus' walkers can
<mrvn>
geist: no threading here. The process can't run on another core.
<geist>
okay then.
<mrvn>
With threading it's hell, yes.
<mrvn>
Do you know of any kernel that has non-shared thread local storage? Memory only mapped in one thread.
<geist>
not that i know of
fwg has joined #osdev
<mrvn>
Seems like it would be a useful thing as many page table operations would get a lot cheaper.
tsraoien has joined #osdev
<mrvn>
if (addr >= 0xC00..00) { /* no TLB shootdown for thread private mappings */ }
<gorgonical>
Okay I think I'm losing my mind
<gorgonical>
User space processes should, in general, have interrupts enabled, right?
<dh`>
having to switch the MMU context to switch threads makes that much more expensive and defeats a lot of the point of having threads
<bslsk05>
'How Many Glass Panes Will a Bullet Go Through? - The Slow Mo Guys' by The Slow Mo Guys (00:15:17)
<zid`>
gorgonical: depends if you want that cpu to be able to service IRQs or not
<zid`>
(you probably do)
<j`ey>
gorgonical: yes because how else to preempt?
<gorgonical>
Right. So I'm trying to understand exactly when that should be enabled. Looking through Linux's code (and netbsd's a little) I can't figure out where they actually set the user process to have interrupts enabled
<zid`>
iret
<zid`>
set it in eflags
<zid`>
well, iretq and rflags I guess
<gorgonical>
On riscv its sret and the SPIE flag anyway
<gorgonical>
But the same idea
<gorgonical>
But I don't see anywhere that SPIE is actually set in the status reg it gets restored from
<mrvn>
dh`: I thought the point of threads was to run the code on multiple cores. :)
<geist>
yep. it's in eflags, and the IF bit is not writable by ring 3 code (though it's visible)
<geist>
gorgonical: so you're specifically interested in the riscv solution?
<gorgonical>
Yes
<gorgonical>
I know it *should* be set, but I want to understand where/how Linux actually sets it, since we're aiming for broad compatibility
<mrvn>
gorgonical: you set the flag when you first create the thread's context and then it's always saved on kernel entry and restored on exit.
<zid`>
It's probably set when it drops into init, re the linux kernel?
<geist>
so the sstatus register has a saved copy when it enters an exception. moves the bits
<geist>
so when setting up the cpu for entering into user space, its much like x86 in that you arrange for interrupts to be enabled when you eret to it
<geist>
in the case of riscv it's not on the stack, it's in the sstatus register itself
<geist>
in another field
<gorgonical>
mrvn: yes so in copy_thread when they create kthreads it does get explicitly set there, but otherwise its cloned from parent regs
<dh`>
mrvn: just that much you can do with separate processes
<gorgonical>
geist: right, in spie
<gorgonical>
so in theory the init_process should have it set
<mrvn>
dh`: My point was that with different cores you just load the page table on each. Makes no difference if it's the same or slightly different ones.
<geist>
maybe not even that. depends on how the OS initially switches into user space
<mrvn>
dh`: You only pay a price if you switch a core from one thread to another thread of the same process.
<geist>
for the very first switch it may just set things up manually and then eret
<bslsk05>
github.com: lk/arch.c at master · littlekernel/lk · GitHub
<gorgonical>
hmm
<dh`>
oh, true.
<mrvn>
gorgonical: I think you have to set it every time you create a thread.
<geist>
it just sets up sstatus so that when it eventually srets (at the bottom of the function) interrupts get flipped on
<gorgonical>
geist: I am thinking that whatever gets set has to be spie due to the swapping semantics on exception/sret
<geist>
yes.
<mrvn>
gorgonical: you don't copy the parent's eflags as they are somewhere lost on the kernel stack when you do the thread creation.
<geist>
mrvn: it doesn't work that way on riscv
<geist>
this is riscv, things are slightly different here. though functionally it's the same thing, you're probably just confusing them
<gorgonical>
oh I'm real dumb, I think I just found it
<geist>
found what where?
<gorgonical>
Kitten combines the create/start code and linux breaks them up
<dh`>
riscv is like mips, there's one register that masks sources and another with a master switch
<mrvn>
geist: you don't have the flags in a banked register and save it to the stack?
<gorgonical>
start_thread does explicitly set SR_PIE
<geist>
mrvn: not necessarily. that's my point, your suggestions are assuming that the hardware mechanism works a certain way
<gorgonical>
mrvn: every priv level has a register to bank in
<geist>
gorgonical: yep. so on the initial switch to the thread there's no existing stack frame to return from, so you simply set the sstatus's PIE and then sret
<mrvn>
I guess you could not save it assuming the banked register doesn't get touched and you won't switch tasks. Then it's still the right value on exit.
<dh`>
according to 4.1.1 supervisor-level interrupts are always enabled when in user mode
<geist>
from then on out when you enter the kernel from that thread the sstatus is *probably* saved on the stack and you dont have to do it again
<geist>
dh`: yeah that's probably true
<geist>
it may be that it's just implicitly enabled on riscv and setting PIE does nothing when switching from supervisor to user
<geist>
i always found this part of the riscv spec to be especially confusing.
<geist>
not because it's complicated, but because it's poorly written
<dh`>
according to my riscv spec there's SPIE and UPIE
<dh`>
since the user interrupt stuff has been withdrawn, I assume the PIE bit you're talking about is the SPIE bit
<gorgonical>
I am not aware of a board that actually supports u* regs
<dh`>
no
<gorgonical>
I wasn't aware they withdrew it
<dh`>
the scheme they invented doesn't really work so it got punted
<dh`>
i think that's the state of things
catern has joined #osdev
<dh`>
"when a SRET instruction is executed, SIE is set to SPIE, then SPIE is set to 1"
<dh`>
so when you trap from supervisor mode to supervisor mode, that bit controls the master interrupt switch
<dh`>
but in user mode the master switch is apparently always on
<geist>
so that begs the question: before entering user space, does it make sense to set SPIE, or does it not matter?
<dh`>
I think it doesn't matter
<gorgonical>
It definitely seems to matter in qemu. I was getting bug_on triggers for thread migration
<geist>
i think the answer is it probably doesn't matter because it's about to be implicitly enabled, and then when it comes back from user mode SIE will get cleared
<geist>
mrvn: anyway the big difference in riscv vs x86 or ARM is that the previous interrupt disable state is saved into the same control register you already have. there's no backup copy; the bits are just copied from one field to another
<gorgonical>
The board manual for the sifive doesn't say that sret sets spie to 1, fwiw
<dh`>
but in general the trap handler should restore what was there in the previous state, because if the trap came from the kernel and you mess with it things will go off the rails
<geist>
so it's slightly different than x86 pushing eflags on the stack, or arm copying things into SPSR
<geist>
right, so in general a trap handler should push *status on the stack and restore it before *retting
jafarlihi has quit [Ping timeout: 245 seconds]
<dh`>
yes
<geist>
[ms]status [ms]ret
<gorgonical>
dh`: Oh you are right though the priv spec says spie *should* be set to 1. wtf lol
<geist>
but when entering user space for the first time you dont have to set up a stack frame, you can simply set up sstatus and then sret
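As a minimal riscv64 sketch of that first entry (function name made up; CSR bits per the privileged spec): set SPP to user and SPIE in sstatus (though per dh`'s reading of 4.1.1 the SPIE part may be moot, it is harmless), point sepc at the entry point, switch to the user stack, and sret:

```c
#include <stdint.h>

#define SSTATUS_SPP  (1UL << 8)   /* previous privilege: 0 = U-mode */
#define SSTATUS_SPIE (1UL << 5)   /* previous interrupt-enable bit  */

void enter_user(uintptr_t entry, uintptr_t user_sp)
{
    uint64_t s;
    __asm__ volatile("csrr %0, sstatus" : "=r"(s));
    s &= ~SSTATUS_SPP;            /* sret will drop to U-mode...    */
    s |=  SSTATUS_SPIE;           /* ...with interrupts enabled     */
    __asm__ volatile("csrw sstatus, %0" :: "r"(s));
    __asm__ volatile("csrw sepc, %0" :: "r"(entry));
    __asm__ volatile("mv sp, %0\n\tsret" :: "r"(user_sp) : "memory");
    __builtin_unreachable();      /* sret does not return here      */
}
```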
<dh`>
and for entering usermode the first time, the best thing to do usually is initialize a trap frame and call the return path
<geist>
can do that too
<geist>
it's a bit more annoying if you're already on that stack, etc, which is why i dont do it that way in LK
<dh`>
helps to avoid either forgetting things or leaking kernel data
<geist>
indeed
<dh`>
in OS/161 typically you initialize a trapframe on your stack, point at it, and jump to the return code even though that frame won't be in the same place as one generated by a trap
<geist>
yeah, makes sense
<dh`>
this is to some extent up to what students decide to do, but there are limits
<mrvn>
geist: interrupts don't get disabled on interrupt entry?
<dh`>
e.g. you can't malloc the frame because there's no useful way to avoid leaking it :-)
<geist>
they do
<mrvn>
but that would overwrite the bit in the register so you can't know the previous state
<dh`>
mrvn: on a trap the SIE bit is copied into the SPIE bit and then the SIE bit is set to 0
<geist>
right. it saves the previous interrupt state (and the cpu mode it was in) but it just moves it into the same register
<geist>
[ms]status. it's fairly close to CPSR on arm
<mrvn>
sounds exactly like SPSR on ARM.
<dh`>
the riscv supervisor stuff has nothing like the design quality of the riscv base :-|
<geist>
except instead of making a copy into SPSR, it just copies into fields within CPSR
<geist>
so it's not exactly like SPSR
<dh`>
it is much more like mips than arm
<geist>
that's the point i was trying to make, it's not copying the register into a saved one, it copies from one part of the register to the other
<dh`>
but not that much like mips either
<geist>
so the saved state is always 'live'
<mrvn>
On ARM64 you have 4 banked status registers, right?
<mrvn>
one per EL
<geist>
yes. on riscv you have 2 (or 3 if you think of the virtualization extensions)
<geist>
because user mode doesn't have a '*status' register
<dh`>
if you remember the r3000 status register, there's three bit pairs for interrupt and user/kernel state at the bottom of the register, and traps/returns shift them left/right respectively
<mrvn>
Who has interrupts enabled inside their kernel?
<geist>
that's the part where the *status register is *not* like CPSR on arm. it's closer to SCTLR in the sense that it's not user visible
<dh`>
so basically there's the user state, the state after trapping into the kernel, and a third set for a nested trap within the TLB refill handler
* geist
raises hands
<dh`>
it is more like that
<geist>
yah
<dh`>
mrvn: unless your kernel is very micro you need to have at least some interrupts on to avoid dropping some
<mrvn>
not afraid of getting too many interrupts and running out of stack?
<geist>
this is a solved problem like 50 years ago mrvn
<geist>
you can control the amount of nesting you get
<geist>
enabling interrupts within the kernel does not mean you *always* enable interrupts within the kernel
<geist>
just in most of the code
<dh`>
mrvn: usually only one at a time, though if you have interrupt priority levels you might allow one per level at a time
<geist>
also keep in mind most modern arches are pretty stupid in terms of having exactly 2 or maybe 3 interrupt levels. more sophisticated arches of yore were much more friendly with multiple nested interrupts via different irq levels
<geist>
68k, VAX (where 68k copied it from most likely), etc
<geist>
though cortex-m class hardware looks suspiciously similar to VAX
<dh`>
traditionally, if you don't react quite rapidly to serial port interrupts you drop characters
<mrvn>
geist: Do you enable interrupts only in code that will block or always up to a certain limit?
<geist>
more like enable it by default and disable it in code where you can't take another one, including interrupt handlers themselves
<geist>
that's a fairly standard reentrant, preemptive kernel design
<dh`>
or even nonpreemptive
<dh`>
traditional unix is nonpreemptive but enables everything else in the kernel
<mrvn>
dh`: serial? Seriously? That generally has a 16 byte fifo and you can get an interrupt at 7/8th to 1/8th full. At 115200 BAUD that's forever.
<geist>
oh my god dont get me started about the woes of serial ports
<dh`>
mrvn: "traditionally"
<dh`>
fifos on UARTs only started to appear in the 90s
<mrvn>
3 decades ago :)
<geist>
and it's still common for arm soc makers to put 1 byte fifos
<geist>
*right now*
<dh`>
kernel design hasn't changed much in the past 25 years
<geist>
with the assumption that 'software will just get to it quickly enough' or 'use dma'
<mrvn>
wow, I've never used anything but a 16650 clone.
<geist>
then you dont know what you're missing
<geist>
a) 8250 derivatives suck. it's a terrible design
<geist>
and b) there are worse
<mrvn>
must have been lucky with the ARMs I bought.
<geist>
there are far nicer uarts to work with out there that aren't based on 8250 designs
<geist>
and there are even worse ones. the 'console uart' on the raspberry pi is a good example of an even worse one
<geist>
it's intentionally stupid because it's supposed to just be used for slow transfers
<dh`>
anyway, if you want to run 19200 bps on a 1992 machine you can't fuck around with your interrupt latency
<geist>
it even shares an irq with some other hardware because broadcom couldn't be bothered
<mrvn>
1 byte FIFO is 0.039ms. Still not too bad.
<geist>
yah i remember back in the 386 days you actually wanted to go buy a 16450 card or your cpu couldn't keep up
<geist>
but like i said even on modern hardware it can be a challenge if you have an exceptionally dumb serial port
<geist>
enough that you gotta be careful, and extended irq disablement can cause you to miss windows
<mrvn>
geist: yeah, and I was laughing at you 386 users and doing 230400 BAUD on my serial.
<geist>
that's a common thing btw: running serial ports at a few megabits
<mrvn>
AmigaOS hardly ever disabled interrupts.
<geist>
then even 16 byte fifos start to look pretty small
<geist>
mrvn: it's also possible they used 68k's native irq level stuff
<mrvn>
geist: obviously. :) m68k is a lot better there than 386.
<geist>
68k is not a particularly good cpu for bit banging, but it does have a decent irq handling mechanism
<dh`>
wait a sec. 19200 bps is 2400 bytes/sec and that's 0.4 ms per character
<mrvn>
dh`: my number was for 115200
<dh`>
ah oops
justmatt has joined #osdev
<geist>
my general rule of thumb is 115200 is about 10 characters per ms
<geist>
since it's approx 10k chars/sec
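For the record, the arithmetic behind that rule of thumb, assuming 8N1 framing (1 start + 8 data + 1 stop bit per character):

\[
\frac{115200\ \text{bits/s}}{10\ \text{bits/char}} = 11520\ \text{chars/s} \approx 11.5\ \text{chars/ms}
\]

so "about 10 characters per ms" is the conservative round-down.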
<mrvn>
At 19200 BAUD (0.4ms) that's nearly half the time slice each process gets on AmigaOS + multiuser/dynamic priorities patches
<dh`>
anyway on a 1992 machine you've got say 25000 cycles per ms
<geist>
also remember lots of embedded things are say 25Mhz or so now
<geist>
though a cortex-m at 25mhz would run rings around a 486 or 68k at 25mhz
wootehfoot has joined #osdev
<mrvn>
I was so used to a 1000Hz timer for the scheduler with AmigaOS and then I tried Linux with its 100Hz. *blahh*
<geist>
omg amigaos is so amazing!
<geist>
why did any of us ever survive?
<mrvn>
It was.
<geist>
dont trigger me or i'll start going off on how great VAX is again
<mrvn>
and don't ask me how we are still living
<geist>
and then you'll be sorry!
<dh`>
so for 19200 you get about 10000 cycles, and to not drop a character you have to finish whatever you're doing, reenable interrupts, and take the interrupt before that budget runs out
<mrvn>
I watched a VAX running BSD scrolling its console once.
<mrvn>
blink, blink, blink
<mrvn>
dh`: yeah, definitely want to enable interrupts in kernel, at least for some levels.
<mrvn>
On the other hand you had hardware flow control. So if you are late the serial just pauses.
<geist>
mrvn: well which vax was it?
<geist>
there was a huge array of them. how was the terminal connected?
<mrvn>
geist: no idea, it was like 15 years ago.
<geist>
i once saw an amiga in the dumpster
<geist>
ergo all amigas are dumpsters
<dh`>
the serial _might_ just pause if all the flow control bits actually work
<dh`>
I once had an rs232 cable that was down to one wire
<dh`>
(by luck, the wire that had remained attached was the data line)
<mrvn>
so GND was floating?
<dh`>
yeah
<dh`>
eventually it got resoldered, or maybe just thrown out
<geist>
keep in mind i think 8250 doesn't do full hw flow control. one of the reasons why i dont like them
<mrvn>
but you need recv and send
<geist>
iirc they just allow you to use hw flow, but don't fully implement it
<mrvn>
geist: like many ARMs and all the USB serial dongles I've used.
<heat>
>dont trigger me or i'll start going off on how great VAX is again
<geist>
huh? usb serial dongles aren't 8250
<heat>
dont trigger me or i'll start going off on how great itanium is again
<geist>
see!
<dh`>
well, gnd was probably connected to the outer shield around the rim, right? so it may have also been connected
<dh`>
but certainly that cable did not support RTS/CTS
<mrvn>
dh`: I think it wouldn't have worked otherwise.
<dh`>
dunno, sometimes things work even when they have no right to whatsoever
<mrvn>
dh`: It must have never had RTS/CTS. If you cut those lines and don't short the pins the serial never works.
<dh`>
it probably never had rts/cts
<dh`>
most RS-232 cables didn't
<mrvn>
geist: I have no idea what's inside the USB dongles but I only get 4 wires out: 5V, GND, send, recv.
<geist>
sure, but that's not 8250
<geist>
8250 is the programming interface
<heat>
what's the difference between an 8250 and a 16550?
<heat>
i get the two confused
<mrvn>
heat: the fifo?
GeDaMo has quit [Quit: There is as yet insufficient data for a meaningful answer.]
<heat>
i dunno, you tell me
<mrvn>
The 16450(A) UART, commonly used in IBM PC/AT-series computers, improved on the 8250 by permitting higher serial line speeds.
<mrvn>
With the introduction of multitasking operating systems on PC hardware, such as OS/2, Windows NT or various flavours of UNIX, the short time available to serve character-by-character interrupt requests became a problem, therefore the IBM PS/2 serial ports introduced the 16550(A) UARTs that had a built-in 16 byte FIFO or buffer memory to collect incoming characters.
<dh`>
double the 82, obviously it must be an improved model
<mrvn>
So 8250: 1 byte fifo, 16450 -> more speed, 16550 -> 16 byte fifo
<heat>
ah ok so all "8250" drivers that touch the fifo are in reality 16550 drivers
<mrvn>
There is also something about the BAUD generator. iirc the 16650 can generate its own BAUD.
wootehfoot has quit [Ping timeout: 252 seconds]
wootehfoot has joined #osdev
<mrvn>
The BAUD rate generator was an 8250 vs. 8251 thing; the 8251 doesn't have one.
<mrvn>
Hey, rejoice, Intel EVO is better at virus protection. You know, because everything runs faster (except viruses apparently) and, well, stuff.
<mrvn>
snake oil V2, only with intel evo
<heat>
my laptop is INTEL EVO POWERED BY CORE VPRO CERTIFIED
<geist>
also there are de facto 16650s and 16750s and whatnot
<geist>
but basically i call them all 8250s, the newer ones are extensions to it
<Bitweasil>
I like big serial FIFOs...
amine has joined #osdev
<Bitweasil>
Makes life far easier.
<mrvn>
Bitweasil: like 16 bytes?
<mrvn>
or DMA capable serials?
<Bitweasil>
I was thinking more the FTDI USB ones, I think they've got 256 or 512 byte FIFOs.
<Bitweasil>
I was shoving a lot of data around at 3Mbaud for a while, and that was quite useful.
<dh`>
why not just use arcnet?
* dh`
hides
<Bitweasil>
I was talking to firmware stuff on a Minnowboard Max or Pi4, they had serial ports...
<Bitweasil>
And far easier to deal with serial than anything fancier, especially when I didn't really tell the OS I'd taken over the serial port from it.
<heat>
i want a minnowboard
<mrvn>
clever: can you have the serial start DMA on the RPi?
<heat>
mostly just to hack on firmware
<clever>
mrvn: the uart fifo's have a dreq signal, that can turn the dma on/off
<mrvn>
thought so
<Bitweasil>
heat, Max or the base one?
<clever>
mrvn: you then program the dma to copy to/from the fifo register, with addr inc disabled
<clever>
and set an axi burst size that fits whatever the dreq trigger is
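
Roughly what clever describes, as a BCM2835 DMA control block; the field layout follows the public peripherals datasheet, but the DREQ number and bus address here are assumptions to check against the PERMAP table:

    #include <stdint.h>

    /* control blocks must be 32-byte aligned; the DMA engine reads them directly */
    struct dma_cb {
        uint32_t ti, src, dst, len, stride, next;
        uint32_t pad[2];
    } __attribute__((aligned(32)));

    #define TI_DEST_DREQ  (1u << 6)    /* pace writes off the peripheral's DREQ */
    #define TI_SRC_INC    (1u << 8)    /* increment through the source buffer */
    #define TI_PERMAP(x)  ((uint32_t)(x) << 16)
    #define DREQ_UART_TX  12           /* assumed mapping, check the DREQ table */

    void uart_dma_tx(struct dma_cb *cb, uint32_t buf_bus, uint32_t n)
    {
        cb->ti  = TI_SRC_INC | TI_DEST_DREQ | TI_PERMAP(DREQ_UART_TX);
        cb->src = buf_bus;             /* bus address of the buffer */
        cb->dst = 0x7e201000;          /* PL011 data register, bus address */
        cb->len = n;
        cb->stride = 0;
        cb->next = 0;
        /* then point the channel's CONBLK_AD at cb and set CS.ACTIVE */
    }
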
<heat>
Bitweasil, something newer
<heat>
apparently it's the turbot now
<clever>
as far as i know, the dma cant detect an over/underflow, and the axi port cant stall
<Bitweasil>
I think that's the Max with a slightly faster chip on it.
<clever>
so if the dma reads/writes too much, bytes will be lost/faked
<Bitweasil>
... if you're US based, I can toss mine in a box, I'm not using it for anything anymore.
<mrvn>
clever: but you know exactly how much to read write by the trigger level you set
<Bitweasil>
I was using it as a light desktop, but I've got other boards for that now.
<clever>
mrvn: i think the dreq is using a hard-set trigger level, not the irq fifo level
<Bitweasil>
brb, coffee underflow error.
<clever>
mrvn: the rp2040 dreq is far better than the broadcom dreq
fwg has quit [Quit: .oO( zzZzZzz ...]
xenos1984 has quit [Read error: Connection reset by peer]
<heat>
Bitweasil, I'm not :(
<clever>
rp2040 dreq, will hold the dreq line active, for one clock cycle, for every byte that is added to the fifo, and the dma block then counts how many cycles dreq has been active
<clever>
so dma knows exactly how many bytes can be read, and can fire off a perfect axi burst
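
On the rp2040 side the pico-sdk exposes exactly this: hand the channel a DREQ source and it paces itself. A minimal RX sketch assuming uart0 is already set up with uart_init:

    #include <stdint.h>
    #include "hardware/dma.h"
    #include "hardware/uart.h"

    void uart_dma_rx(uint8_t *dst, uint32_t n)
    {
        uart0_hw->dmacr = UART_UARTDMACR_RXDMAE_BITS;   /* let the UART drive DREQ */

        int ch = dma_claim_unused_channel(true);
        dma_channel_config c = dma_channel_get_default_config(ch);
        channel_config_set_transfer_data_size(&c, DMA_SIZE_8);
        channel_config_set_read_increment(&c, false);   /* fixed FIFO register */
        channel_config_set_write_increment(&c, true);   /* walk the buffer */
        channel_config_set_dreq(&c, DREQ_UART0_RX);     /* DREQ does the pacing */
        dma_channel_configure(ch, &c, dst, &uart0_hw->dr, n, true /* start now */);
    }
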
<mrvn>
clever: not surprising with the customary good-enough-to-work-around-it Broadcom quality.
<clever>
mrvn: i think it was more about axi burst size and acceptable latencies
<clever>
broadcom dreq is just a "level is over X" signal, and you must then do a burst of X reads
<heat>
the minnowboards are all kinda old anyway
<heat>
i want an open firmware machine to play around with :(
<heat>
not literal open firmware though
<heat>
i'm not interested in that crap
<mrvn>
clever: that should be enough. DMA should do that in no time, even before the next char is received/sent.
gildasio has quit [Remote host closed the connection]
opal has quit [Remote host closed the connection]
<heat>
i'm a crap connoisseur, UEFI only
<clever>
mrvn: but if you want to allow an 8 byte dma burst, you cant enable the dma until 8 bytes are in the fifo
opal has joined #osdev
<clever>
mrvn: triggering with 1 byte in the fifo, would result in 7 bytes of junk, due to a fifo underrun
gildasio has joined #osdev
<clever>
and thats where the rp2040 dma is better, its aware of how much can actually be read
<mrvn>
clever: but if you don't expect 8 bytes to come in why are you setting up DMA? :)
<clever>
but the rp2040 design, doesnt deal with clock domains
gildasio has quit [Remote host closed the connection]
<clever>
mrvn: what if you want to receive 1234 bytes, and your dma is configured to do 8 byte bursts?
<mrvn>
clever: the rp2040 can trigger DMA after a timeout with less bytes, right?
<clever>
rp2040 dma doesnt use timeouts, it will trigger a dma copy with even 1 byte in the fifo
gildasio has joined #osdev
<mrvn>
clever: then at the end you poll the last few bytes.
<clever>
yeah
<mrvn>
isn't 1 byte DMA rather wasteful?
<clever>
i think it usually operates in 32bit chunks on the rp2040
<mrvn>
bus width?
<clever>
32bits on the 2040
<clever>
i think the bigger problem, is the clock domains
<clever>
on the rp2040, everything is in a single clock domain, so if the uart holds dreq high for 5 clocks, the dma counts +5, and knows it can read 5 times
<clever>
but on the broadcom SoC's, the uart and dma are in different clock domains
<clever>
so its harder to give an exact count like that
<mrvn>
you could toggle the dreq or pull it down for a few cycles between chars.
<mrvn>
Use an edge trigger and you only have to pulse it every time a char comes in
<mrvn>
So no, I don't think the clock domains are a real problem.
<clever>
what if the dma is in a slower clock domain
<clever>
and it misses a pulse because its clock is too slow?
<Bitweasil>
heat, fair enough. Yeah, export is a pain.
<mrvn>
clever: slower than the BAUD rate?
<clever>
mrvn: dreq is also used by internal things, like the 2d compositor
<mrvn>
or even 1/4 the baud rate if you raise the signal for half a char.
<mrvn>
clever: now that is a bigger problem
<mrvn>
clever: an edge trigger is damn fast regardless of the clock, though. Even a mini pulse would latch the trigger high till you read it and reset.
<mrvn>
You just can't send a second pulse before it's reset.
<clever>
yep, but you can only count 1 edge per clock
<clever>
exactly
<clever>
so the dma will lose track of how many reads it should issue
<mrvn>
but again BAUD rate speed vs DMA speed. No contest.
<clever>
DSI is one of the dreq sources
<clever>
thats 4 lanes of DDR 500mhz traffic
gog has quit [Ping timeout: 252 seconds]
<clever>
4000mbit
<clever>
what were you saying about baud rate?
gog` has quit [Remote host closed the connection]
gog` has joined #osdev
<mrvn>
you can also feed the edge into a clock-less adder directly. You can add GHz pulses easily and the DMA then only has to read out the adder before it overflows.
<clever>
yeah, that could potentially be done
<clever>
i feel like the PLL's are using that kind of hw
<clever>
drive the PLL output directly into a clockless adder, and when the count hits $divider, reset and emit 1 pulse
<clever>
then phase-compare that slower clock, with the reference, and feedback loop
gog has joined #osdev
<clever>
but, even that, has limits
<mrvn>
I've done that for some arduino project. Use the pulse to generate a clock signal going into a counter, and the carry-out pin on the counter is connected to a pin on the arduino with an IRQ set up.
<clever>
the adder/compare stage cant run over 3ghz on the rpi
<clever>
there is a dedicated /2 that is much dumber logic, for >3GHz speeds
<mrvn>
Only trigger an interrupt every 512 pulses.
<clever>
so for low speeds, you do PLL/divider==crystal
<clever>
but for high speeds, you instead do PLL/2/divider==crystal
<clever>
the bcm2835 datasheets also mention, it has 2 sets of dma controllers
<clever>
the full dma controllers, have a 256 bit bus
<clever>
while the lite dma, is only a 128bit bus
<clever>
they also differ in fifo depth
<mrvn>
but the uart only has 32bit, right?
gildasio has quit [Ping timeout: 268 seconds]
xenos1984 has joined #osdev
<clever>
and it warns that if you do too big of an axi burst read, the fifo can fill, and the reads will stall and jam that entire axi path up
<clever>
and if the writes conflict with that path, the whole system will deadlock
<clever>
mrvn: yeah, peripherals are on a dedicated 32bit only bus, and it says you can do a 4x burst on peripherals, and it will happily fill the 128bit fifo without issue
<clever>
> The Lite engine will have about half the bandwidth of a normal DMA engine, and are intended for low bandwith peripheral servicing.
gildasio has joined #osdev
<clever>
mrvn: i really need to get around to actually testing out the dma engines; linux using dma on the rpi fails when running under my open firmware
opal has quit [Ping timeout: 268 seconds]
<clever>
and that greatly hampers performance
psykose has joined #osdev
<dzwdz>
is there a name for the subset of libc that doesn't interact with the kernel? memcpy, strlen, snprintf, etc
<mrvn>
not syscall?
<Bitweasil>
Not sure either...
psykose has quit [Remote host closed the connection]
psykose has joined #osdev
<gog>
you can't necessarily guarantee that those never syscall
<dzwdz>
but they can be implemented without syscalls, their point isn't interacting with the OS
<mrvn>
right, memcpy calls the DMA syscall :)
<dzwdz>
as opposed to e.g. fread
<heat>
freestanding?
<dzwdz>
as in "freestanding libc"?
<heat>
it's not quite a term for that but it's close enough I think
<heat>
freestanding parts of the libc
<mrvn>
since when does freestanding have strlen?
<dzwdz>
i suppose that works, but that's a mouthful
wootehfoot has quit [Quit: Leaving]
<heat>
why would you care?
<dzwdz>
because i'm considering splitting my libc in two
<dzwdz>
well it already kinda is, but i'm considering making it more explicit
<mrvn>
I have a libstring basically
<clever>
dzwdz: newlib is designed like that, with the libc half being entirely free-standing, and then libgloss deals with the syscall half
<clever>
and the user i think is supposed to replace gloss with their own thing
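
A sketch of what the no-syscall half looks like in that newlib/libgloss spirit: string routines with zero OS dependencies, built with -ffreestanding and linked into both the kernel and userspace:

    /* string.c: freestanding by construction, nothing here may syscall */
    #include <stddef.h>

    size_t strlen(const char *s)
    {
        const char *p = s;
        while (*p)
            p++;
        return (size_t)(p - s);
    }

    void *memcpy(void *restrict dst, const void *restrict src, size_t n)
    {
        char *d = dst;
        const char *s = src;
        while (n--)
            *d++ = *s++;
        return dst;
    }
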
<heat>
i'm not a fan of that
<heat>
i'm a strong believer that a kernel should have its own libc
<mrvn>
I have the same problem because I have no ELF loader in my kernel. Everything is just linked together even if some of it is user space apps.
<heat>
anyway, that was not the original point
<heat>
disabling SSE, AVX codegen is trivial
lkurusa has joined #osdev
<heat>
it's less trivial when you add optimized versions of your routines
<heat>
i.e memcpy which uses SSE and AVX
<j`ey>
heat: cant you just disable avx or whatnot for the core kernel code?
<mrvn>
or dumping the state of a thread including fpu context
<heat>
you will not be able to use it in the kernel so you'll end up maintaining two routines
<heat>
j`ey, as per the travis of the g "define some way in the build system to mark modules as 'may use fpu' and 'no fpu' and then segregate modules accordingly. Would work well except for shared bits like libc (printf for example)."
<mrvn>
I really don't want to add a full soft-float implementation to the kernel just to printf() the FPU registers in a crash dump.
<mrvn>
heat: If I segregated the modules into fpu and no-fpu then how do I link them together or load them?
<heat>
why would you print your fpu registers in floating point format?
<mrvn>
heat: so I can see it's 1.024
<heat>
that's not useful
<heat>
how can you know its 1.024 and not a random SIMD bit
<mrvn>
heat: I show hex and decimal
<heat>
mrvn, applets would run with fpu saving and restoring, core code would run with no fpu
<heat>
boom, problem solved
<heat>
core code never uses the fpu, applets use the fpu
<mrvn>
heat: that doesn't help with 23:47 < heat> i.e memcpy which uses SSE and AVX
<heat>
of course it doesn't
<mrvn>
heat: or do you want a kernel memcpy and a user memcpy?
<heat>
yes
<heat>
the solution isn't "enable the FPU inside the kernel"
<heat>
it's "don't use SSE code inside the kernel"
<heat>
thus making code sharing a bit dubious
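
A sketch of the two-routine outcome heat is describing, keyed off a _KERNEL-style macro (griddle mentions the same convention a bit further down); the kernel build would also pass -mno-sse -mno-avx or -mgeneral-regs-only, and __memcpy_simd is a hypothetical stand-in for the SIMD userspace path:

    #include <stddef.h>

    void *memcpy(void *dst, const void *src, size_t n)
    {
    #ifdef _KERNEL
        /* touches no SIMD state; rep movsb is fast on ERMSB-era x86 */
        void *ret = dst;
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
        return ret;
    #else
        /* userspace is free to use an SSE/AVX-optimized version */
        extern void *__memcpy_simd(void *, const void *, size_t); /* hypothetical */
        return __memcpy_simd(dst, src, n);
    #endif
    }
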
<mrvn>
I have one rare case where I want a fast memcpy. When looking for huge pages I have to copy up to 2MB of memory. Might even be worth using DMA for that.
griddle has joined #osdev
<heat>
at the end of the day, with a "proper libc", how much code will you share?
<bslsk05>
github.com: Onyx/Makefile at master · heatd/Onyx · GitHub
<heat>
all of this because I wanted <type_traits>
<mrvn>
heat: What I would like for a shared printf would be some "#pragma enable-fpu" and "#pragma disable-fpu"
<griddle>
Came in late, are you talking about using the same code for the kernel and libc's printf and string routines?
<mrvn>
if you reach the "%f" part of printf do the fpu things there. kernel never goes there.
<clever>
that reminds me, LK's printf has a global float support flag
<mrvn>
griddle: as an example, yes
<griddle>
Hmm
<clever>
GLOBAL_DEFINES += WITH_NO_FP=1
<clever>
if i disable printf fpu support, then %f just prints a literal %f
<clever>
and i think it sanely skips that entry in the va_args
<clever>
but the gcc still does all of the float math
<griddle>
for shared code I have a macro `_KERNEL` that I just check for. Still have to compile shared code twice :^)
<mrvn>
clever: va_arg does FPU stuff when you pull a double out of the va_list at any point in the function.
<mrvn>
clever: if that is #ifdef-ed out then no fpu stuff
<clever>
*looks*
<geist>
something like that
<mrvn>
i.e. the %f case gets replaced by printing a literal %f
<clever>
#if WITH_NO_FP
<clever>
#define FLOAT_PRINTF 0
<geist>
it calls into an inner function that generates the double string, but it ifdefs out the call to it, etc
<griddle>
I mean, printf could print floats w/o float hardware right
<geist>
that's still an issue in a mixed float/no-float build. haven't decided what to do about that
<geist>
griddle: that's the *real* answer. and it's of course a total bitch
<clever>
s = double_to_string(num_buffer, sizeof(num_buffer), d, flags);
nyah has quit [Ping timeout: 240 seconds]
<clever>
geist: yeah, this function is skipped
<mrvn>
The problem is the argument parsing though. You could easily print float/double in hex format if you can get at the value.
<geist>
but it still has to deal with the calling convention
<clever>
double d = va_arg(ap, double);
<clever>
as is this one
<griddle>
I think the real answer ought to be that the fpu registers should be read only in the kernel
<griddle>
imo
<clever>
so i think it's not eating the va_arg entries, and everything desyncs?
<geist>
which may involve passing floats via floating point registers, and then even if you do all the work outside of the fpu you still have to deal with varargs and marshalling float args (or not)
<griddle>
allowing the kernel to use xmmN or whatever means you have to include the old state in your trapframe
<mrvn>
clever: if you have va_arg(ap, double) in the function then (on x86-64) it checks al for the vector-arg count and saves the fpu regs to the buffer. Different things happen on ARM but it blows up at function entry.
<clever>
mrvn: in the past, i did have linux userland printf blowing up at function entry, because the FPU was disabled when linux started
<mrvn>
So it doesn't matter if the format string actually has "%f" in it. It always saves the fpu regs and fails
<mrvn>
clever: exactly.
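
A self-contained sketch of that tradeoff in the LK WITH_NO_FP style (the host printf stands in for real output): with the float path compiled out, %f prints literally but the argument is never consumed, so later conversions in the same call read the wrong slots, which is the desync clever suspects:

    #include <stdarg.h>
    #include <stdio.h>

    void tiny_printf(const char *fmt, ...)
    {
        va_list ap;
        va_start(ap, fmt);
        for (; *fmt; fmt++) {
            if (*fmt != '%') {
                putchar(*fmt);
                continue;
            }
            char c = *++fmt;
            if (!c)
                break;
            switch (c) {
            case 'd':
                printf("%d", va_arg(ap, int));
                break;
            case 'f':
    #if !WITH_NO_FP
                printf("%f", va_arg(ap, double));
    #else
                fputs("%f", stdout);  /* arg not consumed: later args desync */
    #endif
                break;
            }
        }
        va_end(ap);
    }
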
<clever>
mrvn: LK's lazy FPU context switching had left the FPU disabled when i exec()'d linux, and linux then just assumed the FPU doesn't work
<clever>
and then userland tried to use the FPU, and SIGILL!
<heat>
what?
<griddle>
yeah the var args calling convention requires saving all registers into memory in the order of register usage in the base calling convention
<clever>
heat: LK disables the FPU when context switching, and leaves the FPU regs in a random state, from whatever thread last used it
<mrvn>
Isn't there some hardware mode that allows reading FPU regs?
<clever>
heat: upon an FPU exception, it then does the context switch for the FPU state, and retries that op
<griddle>
lazy fpu?
<heat>
how is disabled = not work ?
<clever>
heat: linux assumes that if it's disabled upon entry, it's disabled for a reason
<mrvn>
griddle: that faults and then turns the FPU fully on
<clever>
and just leaves it disabled
<mrvn>
clever: not on ARM
<heat>
that's so cursed
<clever>
mrvn: yes, ive run into this exact problem on arm32
<mrvn>
clever: but the FPU is off after the bootloader
<clever>
mrvn: there are 2 separate flags, an on/off and a trap/don't-trap, i believe
* clever
gets link
<mrvn>
clever: might also differ pre and post neon
<bslsk05>
github.com: lk/faults.c at master · littlekernel/lk · GitHub
<clever>
i'd still say that's ugly, that it's trapped via the undefined-opcode exception just because an enable flag was turned off
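
For reference, the lazy-FPU scheme under discussion reduces to something like this sketch; the fpu_* arch hooks (CR0.TS on x86, CPACR on arm32) and the thread structure are assumed:

    struct fpu_state;                 /* arch-specific register file */

    struct thread {
        struct fpu_state *fpu;
    };

    extern struct thread *current_thread;
    extern void fpu_enable(void), fpu_disable(void);
    extern void fpu_save(struct fpu_state *), fpu_restore(struct fpu_state *);

    static struct thread *fpu_owner;  /* whose regs are live in the FPU */

    void on_context_switch(void)
    {
        fpu_disable();                /* cheap: no state is moved here */
    }

    void on_fpu_trap(void)            /* #NM on x86, undefined instr on arm32 */
    {
        fpu_enable();
        if (fpu_owner != current_thread) {
            if (fpu_owner)
                fpu_save(fpu_owner->fpu);
            fpu_restore(current_thread->fpu);
            fpu_owner = current_thread;
        }
        /* returning retries the instruction that trapped */
    }
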
<mrvn>
geist: supporting soft float and hard float is worse.
<geist>
yep. arm64 of course cleans all of this up and you get a nicely broken out exception syndrome
<clever>
my rough understanding, is that hardfloat can pass floats via FPU regs
<mrvn>
fpu isn't optional on arm64, right?
<clever>
while softfloat can pass floats via the stack only
<clever>
but, softfloat can still use the hw fpu, if you choose to, on a per-.o basis
<mrvn>
clever: hard to pass args in float registers if you don't have an FPU
<geist>
mrvn: it is, but not in practice. it's possible to build an extremely low end armv8 core with fpu optional, but i've never seen one in practice
<mrvn>
geist: armv6
<clever>
mrvn: thats the special bit, you can still use the FPU with soft-float!
<geist>
mrvn: yes?
<mrvn>
clever: sure, if you have one you can.
<clever>
(i believe)
<mrvn>
geist: ARMv6 frequently has no fpu
<geist>
sure, but you explicitly asked about arm64
<clever>
mrvn: so you could compile one .o file with the fpu enabled, and another with the fpu disabled, and then have a runtime if() to call the right variant of the functions
<mrvn>
geist: oh, sorry
<geist>
and arm64 == armv8. and thus my answer re: armv8
<clever>
mrvn: and the rest of the codebase assumes no fpu, and uses floats on the stack
<mrvn>
clever: that's what I did on the RPI1
<clever>
hardfloat instead puts args in the fpu regs, so the ABI is different, and every function has to agree on the new ABI
<mrvn>
RPi2 I think had neon FPU
<clever>
pi1 also has an FPU
<clever>
its just smaller, half the number of regs
<mrvn>
clever: but an older one
<clever>
yeah
<clever>
and most people building for armv6 disabled FPU usage
<mrvn>
neon was a big step forward in speed
<clever>
because it was rare for v6 to have an FPU
<clever>
that's what made the pi1 weird and in need of a special build
<mrvn>
Raspbian used soft-float
<mrvn>
Debian ARM used neon
<mrvn>
So even though you had an FPU it didn't have enough registers for the Debian ARM ABI.
<clever>
yeah
<clever>
i think armv7 had twice the fpu regs
<mrvn>
I need one of those black-light fly buzzers.
<geist>
there were a ton of variants of vfp, with varying levels of floating point regs, it was a real nightmare
<geist>
by the later ends of v6 and v7 the de facto standard fpu was vfp3, and the de facto subset you could compile for was called vfpv3-d16
<geist>
which explicitly limited itself to 16 double precision floating point regs. code compiled for that would work on all vfpv3s, even if they had say -d32 implemented
<clever>
this is one area where i think the VPU is far more sane compared to arm/x86: there are no dedicated float registers!
<clever>
any 32bit register can be either an int32 or a float32
<clever>
the type, just depends on which opcode you use to interact with it
<geist>
the calling convention explicitly threw out d16-d31 on calls, so code that didn't know it existed was okay, etc
<clever>
load/store doesnt care about the type
<mrvn>
clever: something they learned from SIMD
<geist>
yep. i remember the SPU processors on a cell processor were like that too
<clever>
one thing i still dont fully understand, where is the line drawn between vector and float stuff in arm?
<geist>
simply 128 128-bit vector registers, integer or float. and there were scalar versions of all the instructions
<clever>
VFP implies both vector and float?
<geist>
yes. there was some previous floating point standard thing on arm called FPE i think, but it was the early days
<geist>
VFP was an early attempt at vector bits on ARM. however, it wasn't vector in the simd sense
<clever>
so is VFP both vector and float?
<geist>
iirc it was vector in the 'repeat this N times' sense
<geist>
i rarely saw it used, not sure most compilers knew how to do it except maybe ARM's, but i'm sure one could write some asm to make use of it
<clever>
ive looked at some arm64 vector code in gnuradio, and it was basically just support for loading a float[4] i think
<geist>
yah arm64 is basically an extension of NEON, and NEON in the armv7 days extended and largely replaced VFPv3
<clever>
so you could then do `float a[4], b[4], c[4]; for (int i=0; i<4; i++) { a[i] = OP(b[i], c[i]); }`
<clever>
but there were so few vector registers that it's very load/store heavy
<mrvn>
clever: even better if you mark them as vectors and then do a = b + c;
<mrvn>
pre vectorizer solution
<geist>
that's why i was saying the vfpv3-d16 was the common subset of all the modern arm32 stuff. that calling convention and whatnot was the minimum standard hard fpu, so it was fairly common to use it as the baseline
<clever>
mrvn: they were all wrapped up in intrinsics
<geist>
and then individual code could be aware of additional registers and/or instructions
<clever>
i just wrote it as scalar, to make it more understandable
<clever>
let me find the source...
<geist>
and d0-d15 were the only registers used to pass args and be saved
<geist>
vfpv3-d32 was extended by NEON, which i think added even more registers
<geist>
the v0-v31 regs, iirc, or maybe v0-v15? i forget
<moon-child>
32 64-bit regs
<moon-child>
and they get paired up into 16 128-bit regs
<geist>
there is still a mess of fpu context subsets in the cortex-m world, but in that universe it's much more common to compile the whole thing for a given subset
<bslsk05>
github.com: volk/volk_32f_x2_dot_prod_32f.h at main · gnuradio/volk · GitHub
<clever>
float32x4x4_t is just a struct, that holds a float[8] i think
<geist>
moon-child: yeah arm32 floating point regs had a bad pairing thing that was confusing as heck. a thing they fixed in arm64, where they are a strict subset
<geist>
ie, s0 = d0 = q0
<clever>
or was it float[16]
<clever>
i think 16
<geist>
vs [s0,s1] = d0, [s2,s3] = d1 like you got with arm32
<clever>
but at the hardware level, its only float[4]
scaleww has joined #osdev
<clever>
and the float[16] is purely to give you a bit of pipelining, for when you dont use too many regs
<clever>
so you can create a virtual vector that is bigger
<clever>
the other thing i notice here, is the lack of a vectorized sum
<clever>
so this winds up creating 4 sums, from each lane
<clever>
and then has to finish the job in scalar mode
<mrvn>
clever: ARM64 has that
<clever>
SVE right?
<clever>
i think thats hw support for virtual vectors of a larger size
<clever>
while this code is purely software level support
<geist>
yeah SVE is a new extension to NEON/ASIMD that lets you define up to i think 2048-bit vectors
<geist>
and then somewhat dynamically declare how wide you want to do your work in
<clever>
and in theory, future chips can run the same job in fewer clocks
<clever>
without having to rewrite your asm every time
<geist>
that's exactly right
<clever>
the volk code for example, is only using 1/4th of the vector registers, because it only has float[4] at the hw level
<geist>
i think 128 bits is the minimum and 256 probably what most will implement, but iirc the fujitsu cpu for some supercomputer is using 512-bit hardware vectors
<clever>
and its trying to let the pipeline do its job, by operating on a float[4][4]
<clever>
but its just doing 4 vector loads, 4 vector mults, and then 4 vector stores
<clever>
so when the vector core doubles in size next year, you have to rewrite this to work on chunks of float[4][4][2]
<mrvn>
changing vector size was costly though I think.
<clever>
as an example:
<clever>
float a[n], b[n]; float sum = 0; for (int i=0; i<n; i++) { sum += a[i] * b[i]; }
<mrvn>
clever: it reminds me of "rep" on x86
<clever>
mrvn: for this code the vector width doesn't really matter, just do as many mults in parallel as you can, sum them all up, and you're done
<clever>
and the best the above volk code can do, is load 16 a's, load 16 b's, load 16 accumulators, then do a fused mult+add, acc += a*b;
<clever>
then store 16 accumulators, and repeat
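
That accumulate loop written as a NEON sketch (aarch64 assumed, since vaddvq_f32, the horizontal add mrvn pointed at above, is arm64-only):

    #include <arm_neon.h>
    #include <stddef.h>

    float dot(const float *a, const float *b, size_t n)
    {
        float32x4_t acc = vdupq_n_f32(0.0f);
        size_t i = 0;
        for (; i + 4 <= n; i += 4)          /* fused multiply-add, 4 lanes at a time */
            acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
        float sum = vaddvq_f32(acc);        /* sum the 4 lanes */
        for (; i < n; i++)                  /* scalar tail */
            sum += a[i] * b[i];
        return sum;
    }
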
<mrvn>
clever: with the virtual vectors can you keep anything in registers between operations? Like a+b+c+d+e+f would store 4 temporaries to ram
<clever>
ive not read the SVE specs yet
<mrvn>
I don't want each "+" to go over the whole vector but do the 5 "+" on one register load and then repeat for the vector size.
<clever>
VPU vectors are instead always 16 lanes wide, and have a REP flag that can repeat an op a power-of-2 times between 1 and 64, or use a scalar reg (no power-of-2 limit)
<mrvn>
clever: yeah, but that needs to store the temporaries
<clever>
for the VPU, it doesnt, the vector registers can hold the entire int16_t[1024] at once
<clever>
2 of them infact
<clever>
its got a whole 4096 bytes of vector registers
<mrvn>
clever: that allows doing a simple accumulate like my example. 2 regs are rather limited for more complex cases though.
<clever>
and you can do the entire load of the int16_t[1024] in a single opcode, which can axi burst properly and saturate the dram bus
<mrvn>
clever: I think the virtual vector size extension allows you to implement a loop over the vector size, advancing by the actual vector size of the hardware each loop. So this years cpu adds 16 each loop, next years cpu adds 32 for the same opcodes.
<clever>
yeah, i could see how that might work
<clever>
just query how big the vector is, and allow the reg# to come from a scalar
<clever>
so you can index into register[n]
<mrvn>
and load/store in the native register size, whatever that is
<clever>
that is also possible on the VPU, you can use immediate+register as a coord into the vector register bank
<clever>
mrvn: if true, then SVE is less of a virtual vector size, and more of a way to query how big the VFP really is, and iterate over the registers based on a scalar reg, so you can use more regs
<clever>
so instead of hard-coding it to do something 4x, you instead have a loop within a loop
<mrvn>
clever: yeah, more of a self adjusting loop size
<clever>
yep
<mrvn>
but that's what you want.
<clever>
volk is doing similar
<clever>
its creating a virtual 16 lane vector, by running the 4lane vector 4x
<mrvn>
You have 8 regs and the size is variable
<clever>
and then it dynamically changes how many loops of the 16lane vector it runs
<clever>
and then does a scalar loop at the end, to deal with the leftover
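
With SVE the same loop self-adjusts, and the predicate even absorbs that scalar tail; a sketch against the ACLE intrinsics as I understand them (compile for armv8-a+sve):

    #include <arm_sve.h>
    #include <stddef.h>

    float dot_sve(const float *a, const float *b, size_t n)
    {
        svfloat32_t acc = svdup_f32(0.0f);
        for (size_t i = 0; i < n; i += svcntw()) {  /* advance by the hw vector length */
            svbool_t pg = svwhilelt_b32(i, n);      /* predicate masks off the tail */
            acc = svmla_m(pg, acc, svld1(pg, a + i), svld1(pg, b + i));
        }
        return svaddv(svptrue_b32(), acc);          /* horizontal sum */
    }
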
<mrvn>
They should have done that with AVX and AVX512 would run the same code twice as fast
<clever>
part of what you need there though, is for opcodes to be far more async
<mrvn>
how do you mean?
<clever>
so you can issue 4 vector loads back2back, and the cpu wont stall out on just the first one
<clever>
while the 1st is doing a fetch, the 2nd, 3rd, and 4th should add to the fetch queue
<clever>
to keep the bus saturated
<mrvn>
clever: the cpus loop unrolling (branch predictor) does that
<clever>
yeah
<mrvn>
does anyone prefetch anymore?
<heat>
yes
<heat>
micro-optimized string.h code for instance
<clever>
i saw a blog post before, about how a memcpy loop, had a prefetch opcode in it
<mrvn>
I remember on alpha they had extra opcodes just for saying "I'm going to load <address> in the next loop"
<clever>
so as its copying some bytes, the cpu can be pre-fetching future bytes
<clever>
yeah, it was that kind of thing
justmatt has quit []
<clever>
a purely async opcode, that produces no results, but primes the d-cache
<clever>
however, there was a fatal hw bug: if there was a TLB miss, the prefetch just gives up
<clever>
and the fetch happens later, at dcache miss, which stalls the cpu
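
The pattern from that blog post boils down to something like this sketch; the four-lines-ahead distance is a per-core tuning guess, and as just noted the hint can be silently dropped, e.g. on a TLB miss:

    #include <stddef.h>
    #include <string.h>

    /* assumes n is a multiple of 64 to keep the sketch short */
    void copy_prefetched(void *dst, const void *src, size_t n)
    {
        char *d = dst;
        const char *s = src;
        for (size_t i = 0; i < n; i += 64) {
            __builtin_prefetch(s + i + 256);  /* prime a line ~4 iterations ahead */
            memcpy(d + i, s + i, 64);
        }
    }
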
<mrvn>
is that still a thing with AVX or just for regular register-sized data?
<clever>
the blog post i read, was about one of the xbox models
<clever>
and they "fixed" the problem by just using larger pages
<mrvn>
ppc?
<clever>
so a TLB entry covers more bytes
<mrvn>
clever: it always sucks when your code suddenly drops to half speed because your data lands on a page boundary.
<clever>
yep
<clever>
and they fixed it by just having fewer page boundaries
<bslsk05>
randomascii.wordpress.com: 11 mm in 1.25 nanoseconds | Random ASCII – tech blog of Bruce Dawson
<geist>
i dont see the prefetch stuff as much on modern arm64 stuff, but i think it's assumed that the more modern cores are better at prefetching on their own anyway
<griddle>
Is it best to read other kernels to learn how arm64 works? The docs aren't fantastic from what I can tell
<geist>
on dumber arm32 cores from 10-15 years ago i remember it being quite essential to get good looping performance for memcpies and whatnot
<clever>
geist: what happens if you do say a load opcode, but then never use the resulting register?
<mrvn>
geist: they just fill the pipeline with the next loop's fetches before the first loop finishes. Add to that the fetch predictor...
<geist>
griddle: perhaps, but it's complicated enough that i doubt you'll learn from just reading code
<griddle>
lots of the docs kinda feel like I should already know how it all works
<clever>
geist: could the cpu core keep on going without stalling for that cache miss?
<heat>
griddle, arm64 docs are garbage but mindlessly reading code won't get you anywhere
<clever>
and that is effectively a prefetch opcode
<heat>
well, not garbage
<geist>
clever: probably
<heat>
but written for hardware engineers maybe
<griddle>
well I'd be implementing the port on my end as well
<geist>
sure, it's a combination of everything. also we can help if you have questions
<mrvn>
clever: if you avoid a register-register dependency, or load into the zero register
<geist>
though we'll generally refer you to the manual for lots of things after pointing you in the right direction
<griddle>
Appreciate it
<griddle>
Yeah I'll hack at it for a bit
<mrvn>
geist: is load into zero register allowed?
<geist>
on arm64?
<mrvn>
yep
<geist>
depends on if the xzr encoding is reused for something else
<griddle>
the kernel already builds for arm64, but doesn't work :) I set that up to test if my kernel was "portable" after also abstracting for a risc-v port
<mrvn>
seems like the perfect place to hide a pre-fetch. Load and throw away.
<geist>
in some instructions it is; it acts as the stand-in for the PC or SP register. possibly here the xzr register encoding refers to SP
<geist>
i do vaguely remember some discussion on the riscv irc channel or forum or whatnot as to whether or not a cpu is allowed to elide a load into the x0 register there. i forget the answer
<griddle>
rv doesn't have side effect regs, so I figure it's fine to do that
<geist>
but i suspect the answer on arm is even if it was allowed, it probably wouldn't encode as such
<mrvn>
geist: pre-fetch could be totally optional.
<geist>
since loads *do* have side effects, in the sense that it can page fault, etc
<mrvn>
geist: yeah. If you have speculative execution you would pre-fetch to probably use the same path as speculative load
<mrvn>
(and add one more side channel vector)
<geist>
re: prefetching there are actually instructions for that with a fair amount of flexibility
<geist>
so you'd almost certainly just use those
<clever>
mrvn: and thats the cause of another bug from the randomascii blog
<clever>
mrvn: there was a special opcode, that would prefetch in a non-coherent way, disabling the write-back, so it wouldnt steal the cacheline from another core
<clever>
and speculative execution would still do the non-coherent load, and poison the resulting cache-line
<clever>
and then the whole cache-line is just lost upon eviction
<mrvn>
... and what if it's a speculative pre-fetch?
<clever>
so, despite that being gated behind an if statement, it still had an effect
griddle_ has joined #osdev
<clever>
mrvn: yeah, a speculative execution of a non-coherent prefetch, resulted in a non-coherent cache line
<clever>
so basically, it was: bool noncoherent=false; if (noncoherent) non_coherent_prefetch(foo);
<clever>
and despite the fact that it was never true, it was still non-coherent!
griddle has quit [Ping timeout: 268 seconds]
griddle_ has left #osdev [#osdev]
griddle_ has joined #osdev
griddle_ has left #osdev [#osdev]
<clever>
the nasty part, is that this was corrupting the malloc state, causing it to assert() out
<clever>
but the core dump then read ram, and claimed the assert couldnt have possibly fired
griddle has joined #osdev
griddle has left #osdev [#osdev]
griddle has joined #osdev
<geist>
anyway FWIW i just checked and indeed you can load into xzr. there's no special encoding on the target/source register. there *is* special encoding on the base register, the 'xzr' encoding (31) is interpreted as the SP instead of an xN register
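
So the throwaway load mrvn floated is at least encodable; a one-line sketch, with the caveat geist gave earlier: unlike a real prfm hint, this load can still fault.

    /* load and discard: touches the cache line for *p without naming a result */
    static inline void touch(const void *p)
    {
        asm volatile("ldr xzr, [%0]" : : "r"(p) : "memory");
    }
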
<clever>
mrvn: so the coredump says the assert cant have fired, but the stack trace says it did!!
<mrvn>
clever: much more fun is the MIPS cpu resetting on a certain opcode sequence, and binutils emitting that sequence for common code after a minor update.
<clever>
mrvn: ouch
lkurusa has quit [Quit: I probably fell asleep (or went out). Who will ever know.]