<muurkha>
it shows examples of the ungodly hacks in a RISC-V context
<muurkha>
aha, and the RI5CY user manual https://pulp-platform.org/docs/ri5cy_user_manual.pdf explains how the DSP-style zero-overhead hardware looping support interacts with context switches: the loop state is exposed as CSRs. the OR10N paper didn't seem to explain that
<jrtc27>
and not gated behind a feature bit, so your OS has to be aware of it otherwise bad things happen if something uses it
<jrtc27>
(boom if multiple use it, plus side-channels galore)
<muurkha>
I don't think these chips support memory protection, so if one task wanted to exfiltrate information to another I think it could just write into its memory
<muurkha>
so I don't think they're worried about side-channels
<muurkha>
it's interesting to think about what kind of design would allow you to add per-task state like these hardware looping registers in a way that you wouldn't need to patch the OS task switching code every time you added a new one
anup_patel has quit [Remote host closed the connection]
jacklsw has quit [Quit: Back to the real world]
<dh`>
that link (the dan bernstein one) sounds kind of nutty as one might perhaps expect ("take this freedom away"?)
<dh`>
but, I think he's unaware of any number of reasons why huge vectors might in fact slow the whole system down
<dh`>
meanwhile I've been more-or-less intentionally not following the RVV stuff so I don't have an opinion
elastic_dog has quit [Ping timeout: 240 seconds]
Leopold has joined #riscv
Leopold has quit [Remote host closed the connection]
jacklsw has joined #riscv
elastic_dog has joined #riscv
<muurkha>
dh`: hmm, are you aware of such reasons, or are you just speculating that they might exist?
<muurkha>
this is suboptimal output from objdump -d when presented with a not-yet-relocated la:
<muurkha>
aa: 00000517  auipc a0,0x0
<muurkha>
ae: 00050513  mv a0,a0
<jrtc27>
what's your point?
<jrtc27>
that's how RISC ISAs present
<jrtc27>
and x86 would instead just say mov 0(%rip), %rXX
<jrtc27>
uh lea even
<muurkha>
maybe objdump should say addi a0, a0, 0 in this case
<jrtc27>
there's an LLVM patch to do that
<jrtc27>
I'm not sure I entirely agree with it
<jrtc27>
unless you can go all the way and make it say addi a0, a0, %pcrel_lo(1b)
Leopold has joined #riscv
<jrtc27>
but regardless, objdump -dr and move on
<muurkha>
that would be a lot better but objdump doesn't normally go around looking at relocations
<muurkha>
does it?
<jrtc27>
-r does
<jrtc27>
no reason -d couldn't
<jrtc27>
like how -d already looks up symbols
<muurkha>
hmm, good point
<muurkha>
I'd never used -r, thanks
<muurkha>
it just seems a little unnecessarily obscurantist to use the pseudo-instruction in the disassembly output; probably I'm not the only noob who forgets that mv is actually addi...0
<muurkha>
which you still have to remember if you're reading this:
<muurkha>
ae: 00050513  mv a0,a0
<muurkha>
ae: R_RISCV_RELAX  *ABS*
<muurkha>
ae: R_RISCV_PCREL_LO12_I  .L0
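A minimal Python sketch of why the word 00050513 shows up as mv: it is an ordinary I-type addi whose 12-bit immediate is still zero because the relocation has not been applied yet. The helper names decode_i_type and is_addi_zero are hypothetical and exist only for this illustration; the field layout is the standard RV32I I-type format.

```python
def decode_i_type(word):
    """Split a 32-bit RISC-V I-type instruction word into its fields."""
    word &= 0xFFFFFFFF
    opcode = word & 0x7F
    rd = (word >> 7) & 0x1F
    funct3 = (word >> 12) & 0x7
    rs1 = (word >> 15) & 0x1F
    imm = word >> 20
    if imm & 0x800:          # sign-extend the 12-bit immediate
        imm -= 0x1000
    return opcode, rd, funct3, rs1, imm


def is_addi_zero(word):
    """True for 'addi rd, rs1, 0', the encoding disassemblers usually print as mv."""
    opcode, rd, funct3, _rs1, imm = decode_i_type(word)
    # opcode 0x13 is OP-IMM, funct3 0 selects addi; rd == x0 would be a nop/hint instead
    return opcode == 0x13 and funct3 == 0 and imm == 0 and rd != 0


# The word from the disassembly above: addi a0, a0, 0 (a0 is x10)
print(decode_i_type(0x00050513))
print(is_addi_zero(0x00050513))
```

Once the linker fills in the low 12 bits, the immediate is generally nonzero and the same instruction would no longer match is_addi_zero.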
<jrtc27>
well it'd have to look at the relocation to decide not to emit mv
<muurkha>
oh, I was thinking it could just look at the fact that the source and destination register are the same
<muurkha>
although I guess they wouldn't have to be
Armand has joined #riscv
<muurkha>
in theory you could auipc or lui into one register and then fix up the low 12 bits in some other register, and maybe that would even be a useful thing to do if you had two or more constants to load in the same 12-bit area, but do any compilers actually do that?
<muurkha>
oh well. pretty trivial
ldevulder has joined #riscv
Leopold has quit [Quit: No Ping reply in 180 seconds.]
Leopold has joined #riscv
JanC has quit [Read error: Connection reset by peer]
JanC has joined #riscv
JanC has quit [Ping timeout: 240 seconds]
JanC_ has joined #riscv
JanC_ is now known as JanC
Leopold has quit [Remote host closed the connection]
Leopold has joined #riscv
rsalveti has quit [Quit: Connection closed for inactivity]
<patersonc[m]>
Isn't the world full of "non-standard" devices?
<bjdooks>
are those the special ones where they managed to make the close-coupled memory match on VA instead of PA?
frkzoid has quit [Read error: Connection reset by peer]
freakazoid332 has joined #riscv
frkzoid has joined #riscv
frkazoid333 has quit [Ping timeout: 240 seconds]
unlord has quit [Ping timeout: 240 seconds]
ema has joined #riscv
unlord has joined #riscv
MaxGanzII has joined #riscv
Leopold has quit [Quit: No Ping reply in 180 seconds.]
Leopold has joined #riscv
Wickram has quit [Ping timeout: 240 seconds]
heat_ has joined #riscv
DynamiteDan has quit [Excess Flood]
DynamiteDan has joined #riscv
DynamiteDan has quit [Excess Flood]
DynamiteDan has joined #riscv
Leopold has quit [Ping timeout: 240 seconds]
Leopold has joined #riscv
wingsorc has quit [Ping timeout: 240 seconds]
<prabhakarlad>
bjdooks: "close-coupled memory match on VA instead of PA" has nothing to do with CMO.
<prabhakarlad>
the patch series pointed to by patersonc[m] supports non-coherent non-standard systems through function pointers. I vaguely remember you were working on such a platform and wanted function pointers to handle CMO too?
Leopold has quit [Quit: No Ping reply in 180 seconds.]
Tenkawa has joined #riscv
Leopold has joined #riscv
uzix is now known as mahk
FL4SHK has quit [Ping timeout: 240 seconds]
FL4SHK has joined #riscv
Wickram has joined #riscv
aburgess has joined #riscv
GreaseMonkey has quit [Quit: HYDRA IRC LOL]
DynamiteDan has quit []
DynamiteDan has joined #riscv
DynamiteDan has quit [Excess Flood]
DynamiteDan has joined #riscv
prabhakarlad has quit [Quit: Client closed]
knolle has joined #riscv
Wickram has quit [Quit: WeeChat 3.8]
Wickram has joined #riscv
shoragan has quit [Server closed connection]
shoragan has joined #riscv
Leopold_ has joined #riscv
meta-coder has joined #riscv
Leopold has quit [Ping timeout: 240 seconds]
BootLayer has quit [Quit: Leaving]
sauce has quit [Server closed connection]
sauce has joined #riscv
MaxGanzII has quit [Remote host closed the connection]
MaxGanzII has joined #riscv
prabhakarlad has joined #riscv
greaser|q has joined #riscv
greaser|q has quit [Client Quit]
Wickram has quit [Quit: WeeChat 3.8]
greaser|q has joined #riscv
awita has joined #riscv
prabhakarlad has quit [Quit: Client closed]
BootLayer has joined #riscv
Armand has quit [Ping timeout: 240 seconds]
handsome_feng has quit [Quit: Connection closed for inactivity]
<Larhzu>
Hello! There is a plan to add a RISC-V filter to the .xz compression format. A filter converts pc-relative addresses in immediates to absolute addresses. This makes the data more repetitive.
<Larhzu>
(1) Seems that Clang/LLVM 16 can generate code where AUIPC and the paired instruction(s) are not adjacent. Many variations are possible. There can even be a conditional branch between AUIPC and its paired instruction(s).
<Larhzu>
Around 99 % are adjacent pairs so filtering only those works well for now. Any guesses about future compilers? If non-adjacent pairs are expected to become much more common (like 10 % of all uses) then the filter might need to catch the simplest forms of split pairs too.
<Larhzu>
(2) The spec has a table about 48-bit and longer instructions but it's not frozen. I wonder if the table is still the best guess. It can help a little if the filter can sync to the instruction stream.
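A rough Python sketch of how a filter could use that table to step through the instruction stream, assuming the length encoding from the unprivileged spec stays as currently drafted (the 48-bit and longer rows are the non-frozen part). The function name insn_length is hypothetical.

```python
def insn_length(halfword):
    """Return the instruction length in bytes implied by the first 16-bit parcel.

    Follows the (partly non-frozen) length-encoding table of the RISC-V
    unprivileged spec. Returns None for the reserved >= 192-bit encodings.
    """
    if halfword & 0b11 != 0b11:
        return 2                      # compressed (C extension)
    if halfword & 0b11100 != 0b11100:
        return 4                      # standard 32-bit instruction
    if halfword & 0b111111 == 0b011111:
        return 6                      # 48-bit
    if halfword & 0b1111111 == 0b0111111:
        return 8                      # 64-bit
    nnn = (halfword >> 12) & 0b111
    if nnn != 0b111:
        return 10 + 2 * nnn           # (80 + 16*nnn)-bit
    return None                       # reserved


# Low halfwords of the auipc/addi words seen earlier, plus a compressed nop
for parcel in (0x0517, 0x0513, 0x0001):
    print(hex(parcel), insn_length(parcel))
```

A filter that is applied to non-code data as well cannot rely on this alone, since arbitrary bytes also decode to some length.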
jacklsw has joined #riscv
<palmer>
Larhzu: you should be able to get non-contiguous high/low relocations out of GCC as well, it's just not the default because we've found it generates slightly worse code
<Larhzu>
palmer: OK. :-) I guess in the future it might depend on -mtune too.
jacklsw has quit [Client Quit]
<Larhzu>
Like if the processor can fuse auipc+jalr or auipc+ld or not.
jacklsw has joined #riscv
wingsorc has joined #riscv
jacklsw has quit [Client Quit]
<jrtc27>
probably most of the non-contiguous cases are things like `1: auipc a0, %pcrel_hi(foo); ld a1, %pcrel_lo(1b)(a0); addi a1, a1, 1; sd a1, %pcrel_lo(1b)(a0)`
jacklsw has joined #riscv
vagrantc has joined #riscv
<jrtc27>
ie reusing the auipc rather than rematerialising it
<jrtc27>
that was the main motivation in LLVM for properly modelling things
<muurkha>
yeah, I was wondering last night about whether that might happen, and that's a more plausible case than what I'd thought of
<Larhzu>
I had objdumped files from llvm-16_16.0.4-1~exp1_riscv64.deb and found various interesting cases. :)
<muurkha>
also, and I don't know how likely this is, you could imagine a superscalar platform that didn't do op fusion, so that `auipc s0,1; addi s0,s0,248; auipc s1,2; addi s1,s1,40` would be slower than if you interleaved the pairs to get more ILP
sauce has quit []
jacklsw has quit [Client Quit]
<jrtc27>
the last case definitely looks like a bug to me
jacklsw has joined #riscv
<muurkha>
the dead loads in the paste?
<jrtc27>
they're not dead
<jrtc27>
but it's spilling an auipc rather than rematerialising
<jrtc27>
(auipc isn't *quite* rematerialisable, but in effect it is in the way it's used here so long as you're careful)
<Larhzu>
The original auipc result goes directly to stack without any other use. Unfortunately I didn't keep the exact filenames anywhere but it's from that Debian package.
<muurkha>
oh, right, because the second instruction of each pair uses the result of the previous one
<Larhzu>
So I suppose the auipc should just appear later right before the ld instructions.
<muurkha>
that seems like it would be better
sauce has joined #riscv
<Larhzu>
From filtering point of view, auipc is a complex instruction because one needs the paired instruction too. Converting auipc alone is off-by-4096 half the time, and it's good to convert the lowest 12 bits too.
<Larhzu>
If there is auipc-ld-addi-sd sequence, handling just the auipc-ld part is very good already.
aburgess has quit [Ping timeout: 260 seconds]
<Larhzu>
A filter is small and dumb code. ARM64 filter, without comments and empty lines, is under 40 lines. RISC-V prototypes are 75-120 lines.
<Larhzu>
So simple and good enough is the goal.
<Larhzu>
But if future output from compilers will have more auipc-auipc-ld-ld or similar cases then perhaps a filter should handle them. An improved filter can be made later but it's a bit annoying.
<muurkha>
I wonder if you could usefully filter just the instruction with the 12-bit offset
<muurkha>
l*, s*, jalr
<muurkha>
because that's where most of the entropy will be
<Larhzu>
No because the 12-bit offset is relative to the pc of auipc, not the pc of l* or s*.
<Larhzu>
To filter the pair (load, store, jalr, addi) one has to know the pc of the related auipc.
<muurkha>
that's true, but as you said, in 99% of cases it's the previous instruction
<Larhzu>
And to filter auipc, one has to know the lowest 12 bits because otherwise the auipc conversion will be off-by-4096 half the time.
<Larhzu>
Current filter prototypes treat auipc+inst2 as a fused pair (like 8-byte instruction) so either both are converted or neither.
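A minimal Python sketch of the fused-pair idea, under loud assumptions: it is not the actual .xz filter or its stored format, it only handles an I-type partner (load, addi, jalr; stores use the split S-type immediate), and it wraps addresses at 32 bits for simplicity. All function names are hypothetical. It shows why both halves are needed: the low 12 bits are sign-extended, so whenever they are negative the auipc immediate alone points one 4096-byte page too high.

```python
MASK32 = 0xFFFFFFFF


def sext12(x):
    """Sign-extend a 12-bit field."""
    x &= 0xFFF
    return x - 0x1000 if x & 0x800 else x


def pair_target(pc, auipc, inst2):
    """Address formed by an auipc located at `pc` plus its I-type partner."""
    hi = auipc & 0xFFFFF000            # U-type immediate, already in bits 31:12
    lo = sext12(inst2 >> 20)           # I-type immediate
    return (pc + hi + lo) & MASK32


def split_hi_lo(value):
    """Split value so that (hi << 12) + sext12(lo) == value (mod 2**32)."""
    lo = value & 0xFFF
    hi = ((value + 0x800) >> 12) & 0xFFFFF   # +0x800 compensates for a negative lo
    return hi, lo


def to_absolute(pc, auipc, inst2):
    """Encoder direction: rewrite both immediates to hold the absolute target."""
    hi, lo = split_hi_lo(pair_target(pc, auipc, inst2))
    return (auipc & 0xFFF) | (hi << 12), (inst2 & 0xFFFFF) | (lo << 20)


def to_relative(pc, auipc, inst2):
    """Decoder direction: turn the stored absolute target back into pc-relative."""
    absolute = ((auipc & 0xFFFFF000) + sext12(inst2 >> 20)) & MASK32
    hi, lo = split_hi_lo((absolute - pc) & MASK32)
    return (auipc & 0xFFF) | (hi << 12), (inst2 & 0xFFFFF) | (lo << 20)


# auipc a0, 0x1 ; addi a0, a0, -8 placed at pc 0x10000
pc, auipc, addi = 0x10000, 0x00001517, 0xFF850513
print(hex(pair_target(pc, auipc, addi)))       # real target of the pair
print(hex((pc + (auipc & 0xFFFFF000)) & MASK32))  # auipc alone: one page too high
```

Two references to the same symbol from different pc values come out of to_absolute as identical byte patterns, which is the repetition the compressor benefits from.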
<muurkha>
well, to do it correctly in every case, one has to
jacklsw has quit [Quit: Back to the real life]
<Larhzu>
Adding lookahead of a few instructions is possible but the first tries didn't give good results. If the filter is applied to the whole executable and not just .text then false positives in non-code data are a problem.
jacklsw has joined #riscv
jacklsw has quit [Client Quit]
guerby__ is now known as guerby
<Larhzu>
Trying to filter only .text would be good but ELF section headers are at the end of the executable, which isn't nice for streamable compression. Program headers fairly accurately tell the location of the executable sections on x86-64 but not on RISC-V or ARM64.
<muurkha>
a drawback of the pre-filtering approach to improving compressibility is that you can't just decide not to use an encoding in a given case because it doesn't help compression, I guess
<Larhzu>
There are ideas about smarter filtering. To use section headers, one has to buffer a lot to allow the compression tool to work in pipes.
<Larhzu>
Or figure out some way to cheat, for example, detecting what is executable code and what isn't without ELF headers.
<Larhzu>
Perhaps the filter development should wait a bit to see how compiler outputs evolve.
<muurkha>
RISC-V is 13 years old tho
<muurkha>
and one of the first things they did was get a GCC target working
<Larhzu>
Linker relaxation in psABI doesn't allow auipc+ld to become auipc+c.ld. In big executables there are a few places where it would be possible (the immediate would fit) but it's not common. From filtering point of view I kind of hope such relaxation won't be allowed.
<muurkha>
it's not as if the current LLVM and GCC support is an early prototype that will be replaced by something much better next week
<Larhzu>
RISC-V is both old and young at the same time.
<jrtc27>
who says you can't turn auipc+ld into auipc+c.ld?
<jrtc27>
from a psABI perspective you can
<jrtc27>
whether GNU ld bothers to is a different matter
<Larhzu>
It's not explicitly listed under "Linker Relaxation Types". So perhaps my interpretation was just too strict.
<jrtc27>
it already turns auipc+jalr into jal or c.jal
<jrtc27>
(or c.j for the x0 rather than ra case)
<Larhzu>
auipc+jalr to c.jal and to c.j *are* explicitly listed.
<jrtc27>
we should probably kill that section...
<jrtc27>
and just say "to a semantically-equivalent sequence" or similar
<jrtc27>
enumerating every possibility, whether performed or not, is a fool's errand
<jrtc27>
and only call out some of the interesting ones
awita has quit [Ping timeout: 246 seconds]
<jrtc27>
like GP-based relaxation
<jrtc27>
and, when added, GOT->non-GOT
<jrtc27>
which are both semantically equivalent in the normal case, but do have some implications for certain use cases
<Larhzu>
muurkha: It's not about them being replaced. A new processor might have best performance with different instruction scheduling, including putting something else in the middle of auipc+inst2. It would be a new -mtune=foo.
<jrtc27>
(former for if GP isn't set correctly, e.g. at program startup when you're trying to set GP in the first place, and latter for early-boot code of OSes that have yet to set up virtual memory)
<Larhzu>
jrtc27: OK, thanks, this is useful info to me. :-)
heat_ has quit [Read error: Connection reset by peer]
heat has joined #riscv
Armand has joined #riscv
ikke has quit [Quit: WeeChat 3.8]
<dh`>
muurkha: real reasons use/overuse of vectors might slow things down headline with "if you use these monster registers they have to be saved and loaded by the kernel all over the place"
Andre_Z has joined #riscv
<dh`>
also, more speculatively, I'd expect that if you make your vector ops too dense other functional units end up going idle, which doesn't necessarily result in better overall throughput
<dh`>
meanwhile iirc from last night djb's reasoning ignored the fact that these things usually happen in loops and there's an icache