<heat_>
i know alignas but i don't use it for some reason
<heat_>
i just forget it exists
<geist>
and yeah i think he wrote that and then we argued about getting it back, learned that you could apply it to an else clause, and everyone was happy
<heat_>
it's a pretty good commit message, just that slight mistake at the end
<heat_>
and i love good commit messages
<heat_>
some maintainers in linux really want fully imperative commit messages but i find that really awkward
<heat_>
like "In the kernel, we generally run with pretty much all warnings turned on, and warnings set to error" wouldn't pass, could maybe be "The kernel generally runs with pretty much all warnings turned on, and warnings set to error"
<heat_>
but i like saying we :( i think it reads nicer too
<geist>
agreed. i had this discussion before at work. i'm a huge fan of 'we' in comments and commits
<geist>
i have bumped into fully imperative folks, but thankfully they aren't big at work on my team at least
<geist>
i'm not sure the entire rationale behind it
<zid>
The!
<zid>
passive only
<zid>
is it almost monday again yet, I need my honzuki fix already
<heat_>
geist, btw what do you mean with cache coherency?
<heat_>
i-cache d-cache coherency seemed pretty well described in the ISA manuals at least
<geist>
well, how you can get away without needing to mark MMIO regions as non-cached, and how DMA devices are coherent
<geist>
ie, basically just like x86
<geist>
yah i mean coherent with respect to the rest of the world
<geist>
if you look at something like the u74 manual it's all about the different AXI ports, and how they're responsible for different ranges of the aspace
<geist>
and how 'physical memory attributes' are applied as a result. basically kinda MTRR like, but fixed
<heat_>
yeah the PMP thing?
<geist>
no, not PMP. it's lower level than that
<geist>
PMP only modifies protection. PMA is the default cacheability/etc of a range of physical address space, based on which AXI port it uses
<geist>
but then you ask how are devices that use DMA coherent with the L1/L2 caches? the idea is you wire them up *through* the cpu's 'front port'
<geist>
basically if a DMA coherent device triggers a read/write, it should go through the cpu cluster, which can then inspect the transfers and do all the appropriate L1/L2 stuff, then forward the transfers out of the cpu's main memory AXI port
<geist>
this is confirmed in the JH7110 data sheet's quite detailed block diagram which shows all the busses
<heat_>
but is the PMA thing a hardware thing or a software table you set up in M mode?
<heat_>
cuz afaik x86 firmware needs to properly set up the MTRR, there's no hardware automagic there
agent314 has joined #osdev
<heat_>
ugh just saw on your CL there's a fence.i sbi call for other CPUs
<heat_>
i like it when i naturally code things like other CPUs would do it and suddenly SBI has yet another call for something weird
<heat_>
if they hate the M mode <-> S mode transitions so much they could've just skipped M mode :)
netbsduser`` has quit [Ping timeout: 260 seconds]
craigo has quit [Remote host closed the connection]
craigo has joined #osdev
MiningMarsh has joined #osdev
<geist>
yeah you gotta fence all over the place
<geist>
the PMA thing is hardware. hard coded
<geist>
but, the Svpbmt extension (which i have a CL in flight) lets you add additional override bits per page table entry
<geist>
to be more restrictive than the base hardware
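(For reference, a minimal sketch of what those per-PTE override bits look like, going by the Svpbmt description in the privileged spec; the macro names and the helper are made up here.)

```c
/* Svpbmt puts a 2-bit PBMT field in bits 62:61 of an Sv39/Sv48 PTE.
 * 0 = use the hard-wired PMA, 1 = NC, 2 = IO. Names are my own. */
#include <stdint.h>

#define PTE_PBMT_SHIFT  61
#define PTE_PBMT_PMA    (0ULL << PTE_PBMT_SHIFT)  /* default: whatever the PMA says */
#define PTE_PBMT_NC     (1ULL << PTE_PBMT_SHIFT)  /* non-cacheable, idempotent */
#define PTE_PBMT_IO     (2ULL << PTE_PBMT_SHIFT)  /* non-cacheable, strongly-ordered I/O */

/* e.g. build a PTE for an MMIO page: PPN goes in bits 10+, plus the IO override */
static inline uint64_t mmio_pte(uint64_t ppn, uint64_t base_flags)
{
    return (ppn << 10) | base_flags | PTE_PBMT_IO;
}
```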
<geist>
re fence.i there's an extension in the discussion phase i think to make it more like ARM where i cache flushes are intrinsically cross cpu and per cache line
<heat_>
isn't there an extension that reworks fences?
<geist>
which isn't really a tremendous leap, since you already have d cache coherency on a line at a time
<heat_>
i think i saw that somewhere...
<geist>
there's a Sinval extension that reworks sfence.vma
<geist>
breaks it into a start, flush flush flush, end-sync like mechanism for moar performance
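(Rough sketch of that start / invalidate / end pattern, as I understand the ratified Svinval extension being referred to here; assumes a toolchain that accepts the Svinval mnemonics, and the function itself is hypothetical.)

```c
#include <stdint.h>
#include <stddef.h>

static void tlb_invalidate_range(uintptr_t va, size_t npages)
{
    /* order prior page-table stores before the invalidations */
    __asm__ volatile("sfence.w.inval" ::: "memory");

    /* queue up the per-page invalidations (rs2 = x0 -> all ASIDs) */
    for (size_t i = 0; i < npages; i++)
        __asm__ volatile("sinval.vma %0, zero"
                         :: "r"(va + i * 4096) : "memory");

    /* order the invalidations before subsequent implicit page-table walks */
    __asm__ volatile("sfence.inval.ir" ::: "memory");
}
```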
<heat_>
oh, like arm?
<geist>
the fence.i stuff is super pessimal though
<geist>
yeah there's lots of extensions in flight that are trying to drag riscv over into the arm way of doing things
<heat_>
what's wrong with fence.i and why is it super PESSIMAL
<geist>
presumably lots of companies are trying to think about taking what knowledge they have and switching to riscv cores, and they dont want to rethink the world so would rather just drop it in
<heat_>
A SOLARIS ICACHE FENCE IF I'VE EVER SEEN ONE
<geist>
fence.i shoots down the *entire* icache
<geist>
and is only local to the current cpu
<geist>
hence needing to fence.i for the local core, then IPI all other cores
<geist>
also the side effect of that is it must be done from supervisor mode, so if you're writing some sort of JITting user space thing you have to make a syscall to sync the icache after blatting out some code
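(What that ends up looking like from S-mode, roughly: fence.i only covers the local hart, and everyone else goes through the SBI RFENCE extension. Constants are from the SBI spec as I remember it; the wrapper names are hypothetical.)

```c
#include <stdint.h>

#define SBI_EXT_RFENCE      0x52464E43UL   /* "RFNC" */
#define SBI_RFENCE_FENCE_I  0

static void sbi_remote_fence_i(unsigned long hart_mask, unsigned long hart_mask_base)
{
    register unsigned long a0 __asm__("a0") = hart_mask;
    register unsigned long a1 __asm__("a1") = hart_mask_base;
    register unsigned long a6 __asm__("a6") = SBI_RFENCE_FENCE_I;
    register unsigned long a7 __asm__("a7") = SBI_EXT_RFENCE;
    __asm__ volatile("ecall" : "+r"(a0), "+r"(a1) : "r"(a6), "r"(a7) : "memory");
}

static void sync_icache_all_harts(void)
{
    __asm__ volatile("fence.i" ::: "memory");  /* local hart only */
    sbi_remote_fence_i(0, -1UL);               /* base of -1 = all harts, per the spec */
}
```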
<heat_>
yeah i know about localness but that's not really riscv specific
<heat_>
x86 needs similar logic, but with a serializing insn vs an explicit icache flush instruction
<geist>
well, sort of. other cores also discover it so you dont need to IPI for them
<geist>
and ARM just broadcasts the icache flush to all cores
<heat_>
what cores?
<geist>
SMP, other cores but the one you're on
<heat_>
(as in, architecture)
<geist>
on x86 the i and d cache are coherent, except in the very narrow case where you need to branch to clear the speculation or whatnot. legacy nonsense
<geist>
and thus other cores will just pick up the change you have, provided they at least branch
<heat_>
erm, i don't think they are?
<heat_>
the intel manual explicitly requires a serializing instruction on every core *except* the one you're modifying .text on
<geist>
ARM/riscv/power/etc the i and d cache are not coherent at all, so you have to run an instruction to flush and sync the caches. fine. ARM though you do it per cache line and it intrinsically broadcasts to all cores, so you just need to locally do it and you're fine
<heat_>
because the legacy self-modifying code detection garbage takes care of that for the one poking instructions
<geist>
re: x86 sure, but that is still coherent. you're just observing the prefetching and whatnot and needing to disable that
<geist>
you need to dump the pipeline so the cpu knows to fetch fresh versions, from the fully coherent cache
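(Sketch of the x86 cross-modifying-code dance being described: the writer just stores the new bytes, since the caches are coherent, and every other core runs a serializing instruction, cpuid here, before it may execute them. smp_call_on_each_cpu() is a stand-in for whatever IPI-broadcast helper a kernel would have.)

```c
#include <stdint.h>
#include <string.h>

/* hypothetical kernel helper: run fn on every CPU and wait */
extern void smp_call_on_each_cpu(void (*fn)(void *), void *arg);

static void serialize_cpu(void *unused)
{
    (void)unused;
    uint32_t eax = 0, ebx, ecx, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx) :: "memory");
}

void patch_text(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);                      /* i/d caches stay coherent on x86 */
    smp_call_on_each_cpu(serialize_cpu, NULL);  /* serialize everyone else */
}
```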
<formerly-twitter>
ey mofoz
<formerly-twitter>
Simplify/streamline pipes a little bit:
<formerly-twitter>
Some notable results:
<formerly-twitter>
- -30% latency on a 486DX2/66 doing 1 byte ping-pong within a single process.
<formerly-twitter>
now, who can guess which year is the commit from
<formerly-twitter>
hint: it's netbsd
<heat_>
2023
<heat_>
omg it was committed 6 hours ago
<heat_>
hahaha
<heat_>
- 2.5x less lock contention during "make cleandir" of src on a 48 CPU machine.
<heat_>
this is in the *same* commit message
<heat_>
geist, is it architecturally guaranteed for the i cache and dcache to be coherent or is it a uarch detail?
<heat_>
i interpreted the intel/amd details as generically as possible. i.e you may be clearing the prefetch queue, or you may be clearing the whole icache
<formerly-twitter>
to be fair the commit itself makes sense
<formerly-twitter>
on the face of it
<geist>
heat_: in x86? i think it's architecturally guaranteed
<geist>
primarily because when caches got added back in the 386/486 days there was just one of them, so to be compatible you kept it compatible
<moon-child>
x86 is i/d coherent but at the level of traces
* heat_
nods
<heat_>
formerly-twitter, i'm surprised there's no VAX benchmark!
<moon-child>
architecturally. If you modify some code within the current trace, it's undefined, but if you modify some code and then jump to it, you're fine
<formerly-twitter>
ikr
<formerly-twitter>
shite commit message 2/10 would revert
<moon-child>
one thing I wonder about arm/etc is
<heat_>
do you think their vm regression tests run on cranor's original CPU?
<moon-child>
suppose you atomically modify two instructions and you _don't_ sync. It might see them or it might not; fine. But can it tear and see just one but not the other?
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
gbowne1 has quit [Remote host closed the connection]
skipwich has quit [Quit: DISCONNECT]
skipwich has joined #osdev
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
agent314 has quit [Ping timeout: 240 seconds]
<geist>
depends if the instruction crosses a cache line
<geist>
arm64, no, because they cant
<geist>
arm32 thumb? yes. riscv? yes
heat_ has quit [Ping timeout: 252 seconds]
<geist>
and in those arches, like x86, branching into the middle of an instruction has all the usual side effects of whatever that is interpreted as
<moon-child>
you can't do atomic updates across multiple cache lines anyway so it's beside the point
<moon-child>
but I'm wondering if e.g. there's some internal queues where it can insert the first instruction (old version), then evict the line from cache, then reinsert the line and then get the second instruction (new version)
<geist>
well, so for example when you flush the cache for ARM, you have to flush the icache and then run an isb instruction
<geist>
and the isb instruction is an instruction fetch barrier, to guarantee the cpu can't fetch across it
<geist>
basically an explicit synchronization point
<moon-child>
if you patch just one instruction, presumably you'll always get either the old instruction or the new instruction, regardless of whether you synchronise before you get to it
<moon-child>
so the question is, if I atomically patch two instructions at once, on arm, am I guaranteed to get either both old or both new?
<geist>
no
<geist>
though depending on which cache line it's on, etc, you can defacto guarantee because the arch is defined to work a pretty specific way
<geist>
i think ARM calls that something like CONSTRAINED UNPREDICTABLE or something
<geist>
ie, it's undefined, but really only one of N things can happen and you can basically predict it
<moon-child>
what is etc? And again I don't see the significance of cache line since you can't do atomic updates across cache lines anyway
<geist>
etcetera
<geist>
like, and other similar things
<moon-child>
I know what it means, I'm asking what the things are
<geist>
well, i mean like if you update two instructions on the same cache line then the cpu will fetch them both at the same time
<geist>
but if the two instructions are on two cache lines, then they may be fetched at different points in time
<moon-child>
you can't do atomic updates across cache lines _anyway_ so the point is moot. Again
<moon-child>
anyway...what about this (with a fence between each thing)? Update first instruction to jump to slow path; update second instruction to jump to slow path; update first instruction to do the actual thing I want; update second instruction to do the actual thing I want
<geist>
okay. i dunno exactly what you're getting at but i think you understand it as well as I
<moon-child>
a related question is: with no explicit flush, if a core executes the new version of the instruction at a given address, can it ever execute the old version after?
admiral_frost has quit [Quit: It's time]
<geist>
i dont see how, no
<geist>
oh well, yes on ARM yes
<geist>
because it could fetch the old version from the L2, then if the new version still sits in the dcache's L1, you can load the L1 icache multiple times, from the stale L2
<geist>
if the L1d doesn't flush in the interim
<geist>
so yeah you can
<geist>
the problem on ARM is the inner L1i and L1d can be out of sync, and until you flush the L1d into the L2 (or other point of unification) then if the L1i goes to fetch data it'll still fetch it from the L2 or further out
<moon-child>
I mean if you write the instruction from another core
<geist>
that's precisely why when you synchronise i and d on arm you have to first flush the L1d to the point of unification (PoU) and then flush the i
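(Sketch of that arm64 sequence: clean each d-cache line to the PoU, dsb, broadcast-invalidate the matching icache lines, dsb, then isb to resync the local fetch. Hard-coding a 64-byte line for brevity; real code would read CTR_EL0.)

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64

void sync_icache_range(uintptr_t start, size_t len)
{
    uintptr_t p;

    for (p = start & ~(uintptr_t)(CACHE_LINE - 1); p < start + len; p += CACHE_LINE)
        __asm__ volatile("dc cvau, %0" :: "r"(p) : "memory");  /* clean d-line to PoU */
    __asm__ volatile("dsb ish" ::: "memory");

    for (p = start & ~(uintptr_t)(CACHE_LINE - 1); p < start + len; p += CACHE_LINE)
        __asm__ volatile("ic ivau, %0" :: "r"(p) : "memory");  /* broadcast i-invalidate */
    __asm__ volatile("dsb ish" ::: "memory");

    __asm__ volatile("isb" ::: "memory");  /* resync this core's instruction fetch */
}
```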
<clever>
2023-10-14 00:58:51 < geist> ie, it's undefined, but really only one of N things can happen and you can basically predict it
<clever>
geist: something else i think ive heard, is that the spec doesnt say what will happen, but its very predictable what a specific implementation will do
<clever>
but because its not in the spec, another implementation (A72 vs A76) is free to do something different
<clever>
and you should not rely on what is happening, and avoid triggering it
<geist>
true but the arch does pretty specifically state how the cache layers work
<geist>
with PoU, PoC, etc
<geist>
so the range of possible outcomes is dictated by the scope of how the caches are allowed to work
<clever>
yeah
<geist>
if you write from another core it's probably going to work, because if the local cpu had a copy of the line in the L1d it'll get evicted and pulled out to at least some sort of unified layer, L2/L3 etc
<geist>
and then a local fetch on the L1i will fetch from that
<geist>
after you invalidate the cache line on the i cache of course
<geist>
you always have to do that, there's no way around it
gildasio has quit [Remote host closed the connection]
gildasio has joined #osdev
vai has joined #osdev
<vai>
morning
<clever>
geist: oh, cortex-a7, there is a flag in ACTLR, to turn coherency on/off
<clever>
when off, the load/store exclusive opcodes trigger an undefined opcode exception
<clever>
but, how does the L1 and L2 behave in that case?
<geist>
i guess it just doesn't
project10 has quit [Quit: Ping timeout (120 seconds)]
project10 has joined #osdev
admiral_frost has joined #osdev
bliminse has quit [Quit: leaving]
project10 has quit [Quit: Ping timeout (120 seconds)]
zxrom has quit [Ping timeout: 252 seconds]
zxrom has joined #osdev
Yoofie has quit [Ping timeout: 255 seconds]
project10 has joined #osdev
Yoofie has joined #osdev
bliminse has joined #osdev
Burgundy has joined #osdev
MarchHare has quit [Ping timeout: 260 seconds]
grumbler_ has joined #osdev
admiral_frost has quit [Ping timeout: 255 seconds]
Yoofie8 has joined #osdev
Yoofie has quit [Ping timeout: 255 seconds]
Yoofie8 is now known as Yoofie
netbsduser`` has joined #osdev
grumbler_ has quit [Ping timeout: 255 seconds]
admiral_frost has joined #osdev
admiral_frost has quit [Remote host closed the connection]
netbsduser`` has quit [Ping timeout: 255 seconds]
[itchyjunk] has quit [Read error: Connection reset by peer]
<heat>
nikolar, -smp 24 adds 24 cores, then the other options specify the "topology"
<heat>
but it's all emulated anyway, it doesn't affect anything, just cpuid and acpi tables
<nikolar>
can you just leave it out, since the rest fully determines the number
<heat>
sure you can
<heat>
but AFAIK if you leave it out QEMU defaults to 24 cores
<heat>
i.e if you're on a hyperthreaded system and controlling thread cpu affinity properly then doing cores=N,threads=2 may make sense, for the host CPU to properly manage hyperthreaded CPUs
<heat>
s/host/guest/
<nikolar>
Makes sense
<zid>
can I -smp 4096
<heat>
actually i don't think so
<heat>
yeah: "Invalid SMP CPUs 4096. The max CPUs supported by machine 'pc-i440fx-8.1' is 255"
<zid>
ah acpi or whatever issues, makes sense
<heat>
q35 supports 1024, probably because of the x2APIC
<zid>
what happens if I do do -smp 255, it.. runs each as a thread and linux sorts out the runtime?
<geist>
yah default -smp 24 is just a 24 socket system
<geist>
something like -smp 24,cores=4,threads=2 is more interesting
<geist>
also fun you can have it emulate something like a tri-thread x86 machine, or 4 way SMT
<heat>
surely 24 core, not socket?
<heat>
you can also make up fun topologies using -numa
<geist>
that would make like 4 sockets of 4 cores a piece with 2 threads per
<geist>
basically it'll try to evenly subdivide the first number with all the rest
<geist>
or 3 sockets. etc
<geist>
oh but for the first one? yeah i think it's 24 sockets
<geist>
i remember having to explicitly set sockets=1 when emulating windows, because it has socket limits
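(Roughly what the two spellings look like on the command line, going from memory of the qemu-system-x86_64 syntax:)

```
# let QEMU subdivide: 24 CPUs at cores=4,threads=2 works out to 3 sockets
qemu-system-x86_64 -smp 24,cores=4,threads=2 [...]

# or spell the whole topology out, e.g. to keep Windows happy with one socket
qemu-system-x86_64 -smp 8,sockets=1,cores=4,threads=2 [...]
```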
Burgundy has joined #osdev
<geist>
it of course makes no difference at all to linux if it's just N sockets with 1 core a piece, vs 1 socket with N cores, since the topology is still the same: flat
<geist>
hypothetically i guess it could tweak some parameters and change the weight given to transferring threads between cores, but i think it's all relative to a multi level hierarchy, and if it's only 1 level deep the weights are all the same relative weight
<heat>
yeah
<heat>
i've thought about it and i think it makes a lot more sense to schedule per numa node?
<heat>
though i would guess most machines have a rough numa - socket correspondence
<heat>
but still, for stuff like zen you can have 2+ NUMA nodes per socket, and in that case it does make sense to keep a thread roughly local
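(Something like this is how you'd fake that layout in QEMU, if I have the -numa syntax right; node sizes and CPU counts are just for illustration, and -numa node,cpus=... is the older spelling.)

```
# one socket, two NUMA nodes (vaguely Zen-like), 8 CPUs and 4G of RAM per node
qemu-system-x86_64 -m 8G \
    -smp 16,sockets=1,cores=16,threads=1 \
    -object memory-backend-ram,id=m0,size=4G \
    -object memory-backend-ram,id=m1,size=4G \
    -numa node,nodeid=0,memdev=m0,cpus=0-7 \
    -numa node,nodeid=1,memdev=m1,cpus=8-15
```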
irl25519 has joined #osdev
irl25519 has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
<geist>
yeah i think most systems just think of it as a hierarchy tree with varying distances based on how many levels of tree you have to traverse
<heat>
btw, just tested, and indeed it's just a single socket
<geist>
what is, just smp N?
MarchHare has joined #osdev
<geist>
(note it's different per arch, but i assume we're talking about x86 here)
<heat>
yeah -smp N
<geist>
i had noticed the exact opposite at least at some point in the past, enough that i have a script to start windows with -smp 4,socket=1
<geist>
but maybe the default changed
<heat>
at least that's what freebsd lscpu told me, but i assume their topo code is correct
<geist>
random band youtube just discovered for me: Descartes A Kant
<geist>
but yeah you're right, it's now cores per socket. huh. guess that changed over the years
<geist>
makes sense anyway, that's how PCs work nowadays
GeDaMo has quit [Quit: That's it, you people have stood in my way long enough! I'm going to clown college!]
<gog>
well i'm playing a game with proton on linux
<gog>
which is fine
<gog>
it's not great
<gog>
the game is great
<gog>
wine is ok
<heat>
i think my problem is that the compositor usually tends to fuck things up
<heat>
but proton is great
<heat>
valve is poggers
<Ermine>
I trust linux only with stuff like Tetris and Klondike
<gog>
meow
<gog>
i love karlach
<gog>
she's best girl
<heat>
>tetris
<heat>
least russian russian
gog` has joined #osdev
gog has quit [Quit: byee]
<heat>
gog`
<zid>
eve used to work great in linux under wine, except it'd crash if you tabbed out
<zid>
so we just used to run it on its own virtual desktop with no WM
<zid>
DISPLAY=":1" wine ./eve.exe
<heat>
you know, the problem with linux desktop is that everything works great EXCEPT
<zid>
the problem with the linux desktop is the desktop part
<Ermine>
I learned that tetris was developed by soviet guy only relatively recently
<Ermine>
heat: except everything
<zid>
He went hypercapitalist afterwards!
<zid>
He and henk share the rights, and 'the tetris company's just a patent troll company
<zid>
if you make a tetris clone, they will use their patents to C&D you
<Ermine>
Re-iterating fate of linuks desktop discussion again?
vai has quit [Ping timeout: 252 seconds]
<zid>
arika's tgm3 couldn't release as is because ttc refused to let them release a game without the 'official' tetris ruleset (the crappy teleporting pieces to make bizarre combos system) rather than the good arika/sega/etc system, so it has two modes, classic and 'world'
<zid>
and tgm4 is fully completed but has not been released ever because they can't get *any* licence out of ttc for it