<palmer>
IIUC it's the same as the K1, just clocked faster
<courmisch>
and more RAM
<ja_02>
courmisch so I'll drop [PATCH v3 4/8] RISC-V: Check Zicclsm to set unaligned access speed
<courmisch>
uh?
<ja_02>
conchuod sry
<ja_02>
i was hoping that there could be a way to skip speed probing as it's quite slow
<conchuod>
Ye, I guess. It was only really useful before you added the speed probing.
<ja_02>
wdym?
<bjdooks>
I finally got around to updating ivan's opensbi build updates and sending them
crest has quit [Changing host]
crest has joined #riscv
another| has quit [Remote host closed the connection]
la_mettrie has quit [Quit: WeeChat 4.2.1]
<conchuod>
ja_02: That the Zicclsm test was only a useful idea before the speed probing was added, when all that was being done was testing for a trap.
<conchuod>
Because it does tell you if misaligned accesses are supported
another has joined #riscv
<ja_02>
ah i understand now. the supported part is fast though so it doesn't really matter
<courmisch>
using indirect branches to call VSETVLI zero, zero with different immediate vector types... seems about as slow as using VSETVL. Somewhat unsurprisingly
<courmisch>
so only way to optimise functions with small fixed-size inputs will be to replicate them entirely for each hardware VLEN :(
<courmisch>
(unless the function only uses a single VTYPE, but that's not so commonly possible)
<courmisch>
unlord, haasn ^ but my benchmarks were not too thorough
<unlord>
courmisch: hmm, I was seeing benefits on the dav1d stuff I thought
<courmisch>
but that will barely work if you don't loop
<courmisch>
and it just doesn't work if you don't loop and you need to change element size
<courmisch>
well, it depends on the ratio of instructions per vsetvl, obviously
khem has joined #riscv
<sorear>
there's always "wait a year and see if performance characteristics of common cores have changed", I very much doubt c908 will have staying power
<courmisch>
this is much more of an x60 than c908 problem
<courmisch>
c908 has the issue, but it's kind of irrelevant because it's got the smallest hardware vector size that we care about
<courmisch>
thus code that *works* at all on Zvl128b should be optimal for C908 anyway.
<courmisch>
But it won't be optimal for Zvl256b or larger. That's why I call it an x60 problem
<courmisch>
sorear: and I agree that we should wait and see, but I seem to be alone in voicing that opinion in FFmpeg, dav1d, etc
<sorear>
do we no longer care about dynamic LMUL for variable-length vector stuff on c908?
Stat_headcrabed has joined #riscv
Stat_headcrabed has quit [Client Quit]
_catircservices has quit [Quit: Bridge terminating on SIGTERM]
_catircservices has joined #riscv
_catircservices has quit [Client Quit]
<courmisch>
if the variable length is iterated a lot, then computing VTYPE dynamically and using VSETVL is almost free, that's not much of an issue
_catircservices has joined #riscv
<courmisch>
the problem now is small fixed-size inputs
<courmisch>
for that we cannot use VSETVL, it's too slow
<sorear>
vsetvl is actually slower than vsetvli on existing hardware? i didn't expect that to be a problem until next year and was worried the optimizations using it would all have to be reverted
<courmisch>
it seems to be, but maybe I'm measuring wrong
<unlord>
courmisch: most of the DSP functions I've been looking at in dav1d have loops after the vsetvl*
<courmisch>
unlord: but don't they need to change SEW?
<unlord>
some don't
<courmisch>
for instance, if you need to clip s16 to u8, you have to change SEW (and therefore VTYPE)
<unlord>
the blend ones for example
<courmisch>
yeah in that case it's easy
<courmisch>
but it's not always possible
<sorear>
I wonder if any of the "clip s16 to u8" cases could be profitably converted into "clip s16 to s8" with a -128 offset transform
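
The offset transform sorear suggests can at least be checked arithmetically. Below is a minimal sketch in Python, not vector code: the helper names (`sat_s16`, `clip_u8`, `clip_s8`, `clip_u8_via_s8`) are hypothetical, and it assumes the -128 offset is applied with a saturating 16-bit subtraction so that very negative inputs do not wrap. It only shows that the two clips agree for every s16 input; whether the rewrite is actually profitable on any given core is not something this demonstrates.

```python
S16_MIN, S16_MAX = -32768, 32767

def sat_s16(x):
    # Saturate to the signed 16-bit range (assumed: the offset is applied
    # with a saturating subtract, otherwise x - 128 could wrap for x < -32640).
    return max(S16_MIN, min(S16_MAX, x))

def clip_u8(x):
    # Direct "clip s16 to u8".
    return max(0, min(255, x))

def clip_s8(x):
    # "clip s16 to s8".
    return max(-128, min(127, x))

def clip_u8_via_s8(x):
    # Offset by -128, clip to s8, then map back to u8.
    # Flipping the top bit of the byte is the same as adding 128 here.
    s8 = clip_s8(sat_s16(x - 128))
    return (s8 & 0xFF) ^ 0x80

# Exhaustive check over every s16 input.
mismatches = [x for x in range(S16_MIN, S16_MAX + 1)
              if clip_u8(x) != clip_u8_via_s8(x)]
```

If the identity holds, `mismatches` is empty. The final bit flip is shown only because it maps the s8 result back to u8 without needing a wider element; it is one possible way to undo the offset, not necessarily what any existing code does.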