sorear changed the topic of #riscv to: RISC-V instruction set architecture | https://riscv.org | Logs: https://libera.irclog.whitequark.org/riscv | Matrix: #riscv:catircservices.org
hightower2 has quit [Remote host closed the connection]
jmdaemon has joined #riscv
hightower2 has joined #riscv
vagrantc has quit [Quit: leaving]
hightower3 has joined #riscv
hightower2 has quit [Ping timeout: 264 seconds]
maxinux has quit [Quit: Brb]
maxinux has joined #riscv
kaaliakahn2 has joined #riscv
kaaliakahn has quit [Read error: Connection reset by peer]
stazthebox has quit [Quit: Ping timeout (120 seconds)]
stazthebox has joined #riscv
hightower3 has quit [Remote host closed the connection]
hightower3 has joined #riscv
KREYREN has quit [Remote host closed the connection]
KREYREN has joined #riscv
epony has quit [Remote host closed the connection]
epony has joined #riscv
hightower3 has quit [Remote host closed the connection]
aburgess has quit [Remote host closed the connection]
aburgess has joined #riscv
Pokey has quit [Ping timeout: 264 seconds]
jmdaemon has quit [Ping timeout: 264 seconds]
<palmer> does anyone have more QEMU vector slowdown examples? I'm proposing a GSoC project: https://gitlab.com/qemu-project/qemu/-/issues/2137
jmdaemon has joined #riscv
<unlord> palmer: sure, I've got RVV optimizations in dav1d (with benchmarking code); I can run them in QEMU and see the slowdown
<palmer> Cool, thanks. Presumably these also run faster on k230? That'd be really nice to see, as it kind of validates the workload
<unlord> instead of being 7x to 8x faster, it runs 40% the speed of scalar in QEMU
<unlord> palmer: yes, there are perf numbers from the K230 in the commit messages, e.g., https://code.videolan.org/videolan/dav1d/-/merge_requests/1463/diffs?commit_id=d2b59409f4b2ab65ec1ba3c8cab90bca4b35a2e8
jmdaemon has quit [Ping timeout: 264 seconds]
heat has quit [Ping timeout: 264 seconds]
<palmer> awesome, I just added a link to the QEMU tracker
<unlord> perfect!
maxinux has quit [Quit: Brb]
maxinux has joined #riscv
EchelonX has quit [Quit: Leaving]
<unlord> palmer: your comment in 2137 is not accurate. The numbers linked in the commit message are speedups from C -> RVV code on hardware
<unlord> the delta from C -> RVV in emulation is more like 1x -> 0.48x
<unlord> so half the throughput
<palmer> ah, sorry, I misunderstood it
<unlord> no worries
<palmer> should be fixed
<unlord> see it, thanks
<palmer> you might be able to edit it? IDK how QEMU's gitlab permissions work...
<unlord> palmer: I'll drop a link to my FOSDEM slides when I finish making them
shamoe has quit [Quit: Connection closed for inactivity]
jmdaemon has joined #riscv
KombuchaKip has quit [Quit: Leaving.]
KombuchaKip has joined #riscv
jmdaemon has quit [Ping timeout: 252 seconds]
Stat_headcrabed has joined #riscv
Stat_headcrabed has quit [Ping timeout: 268 seconds]
jacklsw has joined #riscv
jmdaemon has joined #riscv
Tenkawa has quit [Quit: Was I really ever here?]
BootLayer has joined #riscv
mwette has quit [Read error: Connection reset by peer]
Stat_headcrabed has joined #riscv
Stat_headcrabed has quit [Ping timeout: 256 seconds]
epony has quit [Remote host closed the connection]
epony has joined #riscv
KREYREN has quit [Quit: Leaving]
Stat_headcrabed has joined #riscv
Stat_headcrabed has quit [Ping timeout: 260 seconds]
davidlt has joined #riscv
maxinux has quit [Quit: Brb]
ntwk has quit [Ping timeout: 276 seconds]
foton has quit [Ping timeout: 260 seconds]
TMM_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
TMM_ has joined #riscv
foton has joined #riscv
davidlt has quit [Ping timeout: 268 seconds]
BootLayer has quit [Quit: Leaving]
frkzoid has quit [Ping timeout: 260 seconds]
Stat_headcrabed has joined #riscv
crossdev has joined #riscv
crossdev has quit [Remote host closed the connection]
crossdev has joined #riscv
notgull has quit [Ping timeout: 256 seconds]
Stat_headcrabed has quit [Ping timeout: 260 seconds]
Stat_headcrabed has joined #riscv
notgull has joined #riscv
epony has quit [Remote host closed the connection]
frkazoid333 has joined #riscv
Stat_headcrabed has quit [Remote host closed the connection]
Stat_headcrabed has joined #riscv
Stat_headcrabed has quit [Ping timeout: 268 seconds]
jobol has joined #riscv
davidlt has joined #riscv
Stat_headcrabed has joined #riscv
Stat_headcrabed has quit [Ping timeout: 276 seconds]
Stat_headcrabed has joined #riscv
Stat_headcrabed has quit [Remote host closed the connection]
crossdev has quit [Remote host closed the connection]
crossdev has joined #riscv
mark4o has joined #riscv
Pokey has joined #riscv
markh has quit [Ping timeout: 260 seconds]
mark4o is now known as markh
ezulian has joined #riscv
crossdev has quit [Remote host closed the connection]
crossdev has joined #riscv
pabs3 has quit [Quit: Don't rest until all the world is paved in moss and greenery.]
pabs3 has joined #riscv
MaxGanzII__ has joined #riscv
jacklsw has quit [Ping timeout: 264 seconds]
Leopold has quit [Remote host closed the connection]
Leopold has joined #riscv
prabhakar has joined #riscv
prabhakarlad has joined #riscv
prabhakarlad has quit [Ping timeout: 250 seconds]
prabhakar has quit [Ping timeout: 268 seconds]
heat has joined #riscv
lagash has quit [Ping timeout: 256 seconds]
unnick has quit [Ping timeout: 255 seconds]
felixonmars_ has joined #riscv
felixonmars has quit [Remote host closed the connection]
felixonmars_ is now known as felixonmars
davidlt has quit [Ping timeout: 268 seconds]
prabhakar has joined #riscv
prabhakarlad has joined #riscv
<unlord> palmer: so [fixing the missing headers and] running the example code at https://gitlab.com/qemu-project/qemu/-/issues/2137 I'm getting a speed up on HW if it is not vectorized
<unlord> not sure this is really the motivating example you want
crabbedhaloablut has quit []
crossdev has quit [Ping timeout: 252 seconds]
crabbedhaloablut has joined #riscv
heat has quit [Read error: Connection reset by peer]
heat_ has joined #riscv
psydroid has joined #riscv
paulk has quit [Quit: WeeChat 3.0]
davidlt has joined #riscv
davidlt has quit [Remote host closed the connection]
davidlt has joined #riscv
lagash has joined #riscv
paulk has joined #riscv
paulk has joined #riscv
MaxGanzII__ has quit [Remote host closed the connection]
Tenkawa has joined #riscv
MaxGanzII has joined #riscv
crossdev has joined #riscv
MaxGanzII has quit [Remote host closed the connection]
MaxGanzII has joined #riscv
MaxGanzII has quit [Remote host closed the connection]
epony has joined #riscv
MaxGanzII has joined #riscv
mlw has quit [Ping timeout: 268 seconds]
mlw has joined #riscv
ezulian has quit [Quit: ezulian]
ezulian has joined #riscv
prabhakarlad has quit [Quit: Client closed]
prabhakar has quit [Quit: Connection closed]
MaxGanzII has quit [Remote host closed the connection]
prabhakar has joined #riscv
prabhakarlad has joined #riscv
<unlord> palmer: the dav1d RVV code has landed, I cannot edit the description on https://gitlab.com/qemu-project/qemu/-/issues/2137 but when you get a chance please update the link to this commit: https://code.videolan.org/videolan/dav1d/-/commit/219befef
jacklsw has joined #riscv
muurkha has left #riscv [#riscv]
_whitelogger has joined #riscv
billchenchina has joined #riscv
dogukan has joined #riscv
MaxGanzII has joined #riscv
JanC has quit [Ping timeout: 255 seconds]
JanC has joined #riscv
prabhakarlad has quit [Ping timeout: 250 seconds]
MaxGanzII has quit [Remote host closed the connection]
ntwk has joined #riscv
MaxGanzII has joined #riscv
MaxGanzII has quit [Remote host closed the connection]
dogukan has quit [Quit: Konversation terminated!]
MaxGanzII has joined #riscv
jacklsw has quit [Ping timeout: 260 seconds]
heat_ is now known as heat
maxinux has joined #riscv
unnick has joined #riscv
MaxGanzII has quit [Remote host closed the connection]
MaxGanzII has joined #riscv
MaxGanzII has quit [Ping timeout: 255 seconds]
maxinux has quit [Quit: Brb]
<palmer> unlord: odd, so maybe we just have some bad codegen. I guess there's an extra factor of 10x badness in there on QEMU, but maybe that doesn't matter if it's just something like scatter-gather that goes slow on HW too.
<palmer> (also, I updated the description)
<palmer> Patrick is running the fuzzer in some mode that looks for performance issues like this, so hopefully we'll have some better examples soon...
<unlord> I guess I could try these binaries with qemu-user-riscv64 and see the output
shamoe has joined #riscv
<palmer> oh, my numbers were all user-mode
<palmer> that's how we run SPEC and the compiler test suites, so that's what's really been hammering folks around here
BootLayer has joined #riscv
<unlord> yeah, I am seeing even worse than you posted, roughly 275x slower!
<unlord> palmer: feel free to toss this link into the issue as well https://paste.debian.net/1305885/
billchenchina has quit [Remote host closed the connection]
davidlt has quit [Remote host closed the connection]
davidlt has joined #riscv
prabhakarlad has joined #riscv
foxbat has quit [Read error: Connection reset by peer]
heat_ has joined #riscv
heat has quit [Read error: Connection reset by peer]
crossdev has quit [Ping timeout: 268 seconds]
foxbat has joined #riscv
vagrantc has joined #riscv
crossdev has joined #riscv
dzaima[m] has joined #riscv
<dzaima[m]> just built latest qemu (took a bit due to https://github.com/llvm/llvm-project/issues/75168) - a simple vlmax e8,m8 vle8.v takes ~1020ns on a vlen=128 config - 8ns/element
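The per-element figure above follows from the standard RVV relation VLMAX = LMUL * VLEN / SEW. A minimal sketch of that arithmetic, using only the numbers quoted in the message (the function and variable names are illustrative, not from any tool):

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements in one vector register group: VLMAX = LMUL * VLEN / SEW."""
    return lmul * vlen_bits // sew_bits


# Configuration quoted in the message: e8, m8, VLEN=128.
elements = vlmax(vlen_bits=128, sew_bits=8, lmul=8)

# ~1020 ns was the reported time for a single vle8.v at VLMAX.
ns_per_element = 1020 / elements

print(elements, round(ns_per_element, 2))  # 128 7.97
```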
<courmisch> palmer: full FFmpeg checkasm bench takes half an hour on K230, and I have not bothered to try on QEMU. On individual tests, I see RVV being like 60% slower than C in QEMU.
<courmisch> while on hardware, I see anywhere from 2x to 8x speedup from C to RVV
<palmer> ya, that's what unlord was saying too
<courmisch> FFmpeg has many more cases and RVV instruction coverage than dav1d as of now
<courmisch> definitely totally not to brag
<palmer> ;)
<palmer> if you know of anything that's specifically slow then please just point me at it (or post on the QEMU bug or whatever), Patrick's going to try and get some reduced test cases from toolchain land but I don't hack on FFmpeg so I don't really know what's up over there
MaxGanzII has joined #riscv
<palmer> I think getting all this fixed will probably be too big for an intern project, but hopefully there's some low hanging fruit left we can deal with
<courmisch> I mean, it's basically "git clone ...; cd ffmpeg ; ./configure; make tests/checkasm/checkasm ; tests/checkasm/checkasm --bench"
<gurki> you might want to delve into the specific libraries ffmpeg uses and use them directly if you want to do benchmarking
<gurki> ffmpeg is little more than a (admittedly very sophisticated) wrapper for these
<gurki> encoding/decoding libraries*
<gurki> palmer: if you have problems with rvv you might want to run a linpack
<gurki> thats more of a hpc thing, but is a very good metric whether you get as much performance from simd as expected
<courmisch> uh
<courmisch> are you for real?
<courmisch> did you crunch the numbers on native implementations vs external libraries in FFmpeg?
<gurki> do you get significantly more performance when using e.g. libx265 within ffmpeg instead of externally when you ignore all the boilerplate video handling parts?
<gurki> this would surprise me, but im happy to be convinced otherwise by numbers
MaxGanzII has quit [Remote host closed the connection]
<gurki> i did _not_ mean to belittle ffmpeg efforts btw.
<gurki> sorry if i gave that impression.
MaxGanzII has joined #riscv
<courmisch> ffmpeg is little more than a wrapper for these [specific libraries]
<courmisch> ^ literally what you wrote
<courmisch> clearly you have no clue what you are on about, as any cursory look at the code base would invalidate such a statement
<gurki> https://trac.ffmpeg.org/wiki/CompilationGuide/Ubuntu <- i just checked whether things significantly changed. this still reads like they essentially use a lot of external libs for the heavy encoding/decoding lifting
<gurki> you are correct about "gurki is not familiar with the code itself".
<gurki> im a mere (happy!) user as far as ffmpeg is concerned.
<courmisch> and libx265 doesn't have *any* RISC-V optimisation as of yet
<gurki> so your statement is that there are quite a bunch of rv specific optimizations which have yet to be ported to these libraries, but there are some optimizations for the internal parts?
<gurki> (no offense, genuinely trying to grasp your point)
<courmisch> I think you're missing the point
<courmisch> there are a few things where FFmpeg uses external libraries
<courmisch> but by and large it does stuff natively. Otherwise gst, mpv and VLC wouldn't care to use FFmpeg
<courmisch> even h264 and h265 are decoded natively, only encoding is delegated to x26x
ldevulder_ has joined #riscv
mwette has joined #riscv
ldevulder has quit [Ping timeout: 268 seconds]
mlw has quit [Ping timeout: 252 seconds]
<unlord> courmisch: hey, give it some time. dav1d just got RISC-V 4 hours ago!
<courmisch> VLC got RVV optimisations 2 years ago
<unlord> courmisch: don't ask me, I created that dav1d MR in 2022
<unlord> but yeah, we should probably add the FFmpeg checkasm to that QEMU issue, more tests are definitely better for anyone working on codegen in QEMU
<courmisch> I don't really see the point in optimising QEMU RVV.
<courmisch> eww
mlw has joined #riscv
<courmisch> doesn't this assume that any RISC-V board is a QEMU VM?
<unlord> only if it has that file
<courmisch> isn't that file present if OpenSBI is present?
<jrtc27> any kernel config that enables the sbi console should have it, yes
<courmisch> which it pretty much always is
<jrtc27> IIRC there was a period it wasn't due to legacy sbi support being dropped prior to the new dbcn extension being added?
mwette has quit [Ping timeout: 252 seconds]
prabhakarlad has quit [Ping timeout: 250 seconds]
<unlord> courmisch: this is really a temporary measure because of how slow QEMU is
<palmer> unlord: we should probably just give the QEMU virt board some m{vendor,arch,impl}id values, that'd be a more reliable way to detect this kind of thing
<unlord> palmer: that sounds reasonable
___nick___ has joined #riscv
ldevulder_ has quit [Ping timeout: 268 seconds]
___nick___ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
___nick___ has joined #riscv
___nick___ has quit [Client Quit]
crossdev has quit [Ping timeout: 252 seconds]
___nick___ has joined #riscv
ntwk has quit [Read error: Connection reset by peer]
bitoff has joined #riscv
jfsimon1981 has joined #riscv
ldevulder_ has joined #riscv
heat_ is now known as heat
KombuchaKip has quit [Quit: Leaving.]
KombuchaKip has joined #riscv
BootLayer has quit [Quit: Leaving]
crossdev has joined #riscv
<courmisch> unlord: your "temporary measure" is as good as always returning false
<courmisch> it's effectively checking if the kernel was compiled with support for the SBI console (which should always be the case). It does not distinguish QEMU from real hardware
<courmisch> palmer: that won't help userspace
<palmer> you can get them via hwprobe
<courmisch> okay but even then, if you emulate real hardware, you have to fake the values, so that's pointless
<courmisch> at least usermode QEMU can be detected by reading uname(&utsname.machine)
<courmisch> which should return the real ISA, as opposed to riscv
<courmisch> (obviously won't help system emulation)
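The uname idea can be sketched as below. This is only a heuristic under the assumption stated in the messages above, namely that user-mode emulation reports the host machine string. The function name is hypothetical, and the assumption may not hold in every setup (for example, a binfmt_misc chroot may report the emulated architecture instead).

```python
import platform


def looks_like_foreign_host(machine: str) -> bool:
    """Heuristic only: a RISC-V binary seeing a non-RISC-V machine string
    is presumably running under user-mode emulation (assumption)."""
    return not machine.startswith("riscv")


# platform.machine() reports the same field as uname's `machine`.
print(platform.machine(), looks_like_foreign_host(platform.machine()))
```

As noted in the discussion, this cannot help with full-system emulation, where the guest kernel itself reports a RISC-V machine string.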
vagrantc has quit [Ping timeout: 260 seconds]
shamoe has quit [Quit: Connection closed for inactivity]
maylay has quit [Ping timeout: 252 seconds]
davidlt has quit [Ping timeout: 252 seconds]
vagrantc has joined #riscv
crossdev has quit [Remote host closed the connection]
vagrantc has quit [Ping timeout: 264 seconds]
EchelonX has joined #riscv
vagrantc has joined #riscv
ezulian has quit [Ping timeout: 268 seconds]
maylay has joined #riscv
Andre_Z has joined #riscv
ntwk has joined #riscv
___nick___ has quit [Ping timeout: 252 seconds]
shamoe has joined #riscv
jfsimon1981 has quit [Remote host closed the connection]
<sorear> I can't speak for the system call but "uname -a" returns the emulated architecture in a chroot running with binfmt_misc and qemu, many things would fail to compile otherwise
<sorear> traumatize the longer-serving people here by making them remember HTIF...
<conchuod> maybe a silly question, but outside of the virt machine do you really care about detecting whether something is qemu or not?
<sorear> the topic was "detecting fast V" and "let's use qemu as a proxy for V being much slower than scalars", or possibly the opposite
<jrtc27> HTIF lives on as the interface used for shutting down QEMU
<jrtc27> by virtue of conforming to syscon-power's interface
<sorear> I do think that hwprobe should grow at least a few flags of the form "segmented loads are as fast as unit-stride", "vrgather is as fast as arithmetic at LMUL=1", "masked operations with tu mu are as cheap as unmasked" all of which differ widely between current hw implementations and affect sw optimization
<sorear> I am aware that getting that data in a way that satisfies kernel and firmware stakeholders will be a nigh-insurmountable challenge, but there's a clear userspace need for every library to not reinvent the runtime benchmarking wheel
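The "runtime benchmarking wheel" that libraries would otherwise reinvent usually looks something like the following: time each candidate implementation once at start-up and keep the fastest. This is a generic sketch, not code from dav1d, FFmpeg, or the kernel. All names here are hypothetical.

```python
import time


def _time_once(fn, arg) -> float:
    """Wall-clock duration of a single call."""
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start


def pick_fastest(candidates: dict, arg, repeats: int = 5) -> str:
    """Return the name of the candidate with the lowest best-of-N time."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # Best-of-N reduces noise from scheduling and cache warm-up.
        elapsed = min(_time_once(fn, arg) for _ in range(repeats))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


# Stand-ins for a "scalar" and a "vector" code path.
candidates = {
    "cheap": lambda n: None,
    "expensive": lambda n: sum(range(n)),
}
chosen = pick_fastest(candidates, 200_000)
print(chosen)
```

Performance flags exposed through hwprobe, as proposed above, would let a library skip this measurement step and select an implementation directly.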
<palmer> sorear: ya, I think we're going to need a bunch of vector performance flags. The only hardware I know of is the K230, if there's other stuff we can probably start to look into the differences and see what makes sense to be generic
<palmer> ya, we can at least get the uABI sorted out and then deal with the probing later ;)
<sorear> gurki: fp _also_ sucks in qemu, unless integer linpack is a thing you'll get a severely biased view of vector perf
<jrtc27> 2024 may well be the year of V 1.0, SG2380 (in the Milk-V Oasis) claims to be shipping Q3
<gurki> sorear: thank you for explaining the underlying issue in a way a gurki understands :)
<gurki> i did not expect that
<sorear> when I was actively working on the qemu riscv port fp instructions all generated helper calls that used the berkeley-softfloat library. I think there was an effort to use actual float instructions in the JITing in at least some cases, but qemu has always prioritized correctness over speed and there's only so much you can do to optimize implementing a different platform's NaN and flags rules
crossdev has joined #riscv
crossdev has quit [Remote host closed the connection]
crossdev has joined #riscv
jobol has quit [Quit: Leaving]
<geist> yah the V bits in particular seem to be fairly impossible to JIT natively, due to the variable width stuff
<geist> looks like basically every V instruction falls to a helper that does a big for loop for every element
<geist> some internal folks at work that were working with linux on riscv on qemu have found it's much slower to enable V than to run with a machine without it
esv has quit [Remote host closed the connection]
<geist> (though looks like i just repeated basically what everyone has been saying. i should read scrollback before blabbing :) )
esv has joined #riscv
Andre_Z has quit [Quit: Leaving.]
<sorear> It's not like you couldn't vectorize that for loop. Maybe use tb_flags to distinguish between VL=VLMAX, where you can unroll inline and use unpredicated traditional SIMD, and VL<VLMAX where you really need lengths or predication
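A toy model of the two paths being discussed: a per-element helper loop that honours `vl` and an optional mask, and a whole-register fast path taken only when `vl` equals VLMAX and no mask applies. This is not QEMU's code. The function names, the list-based register model, and the choice of tail-undisturbed behaviour are all assumptions made for illustration.

```python
def vadd_helper(vd, vs1, vs2, vl, mask=None, sew_bits=8):
    """Per-element loop: only active body elements are written.
    Tail and masked-off elements are left undisturbed (one allowed policy)."""
    wrap = (1 << sew_bits) - 1
    for i in range(vl):
        if mask is None or mask[i]:
            vd[i] = (vs1[i] + vs2[i]) & wrap
    return vd


def vadd_fast(vd, vs1, vs2, sew_bits=8):
    """Whole-register path: valid only when vl == VLMAX and unmasked."""
    wrap = (1 << sew_bits) - 1
    vd[:] = [(a + b) & wrap for a, b in zip(vs1, vs2)]
    return vd


def vadd(vd, vs1, vs2, vl, mask=None):
    """Dispatch on vl, standing in for a translation-time flag check."""
    if vl == len(vd) and mask is None:
        return vadd_fast(vd, vs1, vs2)
    return vadd_helper(vd, vs1, vs2, vl, mask)


full = vadd([9, 9, 9, 9], [1, 2, 3, 255], [10, 20, 30, 1], vl=4)
partial = vadd([9, 9, 9, 9], [1, 2, 3, 255], [10, 20, 30, 1], vl=2)
print(full, partial)
```

In a real translator the fast path would map to host SIMD instructions rather than a list comprehension. The point of the sketch is only the dispatch condition.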
crossdev has quit [Remote host closed the connection]
psydroid has quit [Quit: KVIrc 5.0.0 Aria http://www.kvirc.net/]
bitoff has quit [Ping timeout: 256 seconds]
Zeroday_ has joined #riscv
Zer0day1984 has quit [Ping timeout: 256 seconds]
Zeroday_ is now known as Zer0day1984
bitoff has joined #riscv
<sorear> how does risc-v external debug achieve a usable speed? reading all registers naively requires hundreds of roundtrips between the debugger and the debug module. are USB roundtrips reliably sub-ms? do we assume the existence of debug transport hardware that can do the roundtrippy bits at µs hardware speed? does lazy register and memory access work better than it sounds?
<geist> yah, trouble is you already have to have made it into a helper function at that point
<geist> so you're already out of JIT land, but you're right, could at least optimize that loop
<geist> this is where maybe some templaty C++ stuff would help since the loop is basically repeated for every opcode
<jrtc27> how much memory are you trying to read via external debug?
<jrtc27> normal guiding principle is steer clear of bare metal debugging on any arch where possible
<jrtc27> ie embrace a crummy uart printf :)
<sorear> if you're targeting SVE or AVX512, most single-width LMUL=1 instructions can be turned into a single host instruction
<geist> well, not really, because its based on what vlen was previously set to
<sorear> jrtc27: I'm imagining "enough memory to do the LOC + backtrace + locals most graphical debuggers do on a breakpoint" but I don't have a good idea of the problem space so that might not be the best answer
<geist> vlen could be like 3 or 7 or something
<jrtc27> my experience of debugging a soft core with the horrendously slow DMI-over-JTAG is it really isn't that bad
<sorear> you mean vl, and vsetvli would populate a mask register corresponding to vl
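A sketch of the mask idea, assuming a host with per-lane predicate registers (as in AVX-512 or SVE): the emulated `vsetvli` would compute a bitmask with the low `vl` lanes set, and later vector operations would use it as their predicate. The helper name is hypothetical.

```python
def vl_to_mask(vl: int, lanes: int) -> int:
    """Bitmask with the low `vl` lanes enabled, one bit per lane."""
    if not 0 <= vl <= lanes:
        raise ValueError("vl must be between 0 and the lane count")
    return (1 << vl) - 1


# vl = 3 on a 16-lane host register enables only lanes 0..2.
print(format(vl_to_mask(3, 16), "016b"))  # 0000000000000111
```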
<jrtc27> admittedly not a graphical debugger, just boring tui gdb
<jrtc27> but LOC comes from the debug symbols on the host
<jrtc27> backtrace is two pointer-sized memory reads per frame
<jrtc27> locals you normally do lazily
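The "two pointer-sized memory reads per frame" estimate corresponds to a frame-pointer walk like the one below. The layout is an assumption: RV64 code built with frame pointers, with the saved return address at `fp - 8` and the caller's frame pointer at `fp - 16`. Real debuggers often unwind from DWARF CFI instead. `read_word` is a hypothetical stand-in for a read over the debug link.

```python
def backtrace(read_word, fp, max_frames=64, word=8):
    """Walk frame records. Returns (return_addresses, number_of_reads)."""
    return_addresses, reads = [], 0
    while fp and len(return_addresses) < max_frames:
        ra = read_word(fp - word)           # saved return address
        prev_fp = read_word(fp - 2 * word)  # caller's frame pointer
        reads += 2
        return_addresses.append(ra)
        fp = prev_fp
    return return_addresses, reads


# Fake target memory holding three frames; a zero fp ends the chain.
memory = {
    0x1000 - 8: 0x80000100, 0x1000 - 16: 0x2000,
    0x2000 - 8: 0x80000200, 0x2000 - 16: 0x3000,
    0x3000 - 8: 0x80000300, 0x3000 - 16: 0,
}
ras, reads = backtrace(memory.__getitem__, 0x1000)
print([hex(ra) for ra in ras], reads)
```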
<geist> hmm, i suppose yes. you could set the host mask register
<jrtc27> also I remember the speed being totally fine when doing bare metal debugging of a HiFive Unmatched
<jrtc27> but that was less extensive
<sorear> while "it's fine" is relevant information, i'm primarily asking "why is it fine"
<jrtc27> because it's not *that* much data
<jrtc27> I would imagine
<jrtc27> you can do 10s or 100s of KiB/s IIRC for some slow 100 MHz FPGA
<jrtc27> admittedly that's for writing to memory, not the slower back-and-forth for registers, but still
<jrtc27> and, FWIW, USB 2.0 uses 0.125ms frames
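A rough lower bound on the latency in question, assuming each debugger roundtrip costs at least one 0.125 ms USB 2.0 frame as quoted above. The roundtrip count of 300 is a hypothetical stand-in for "hundreds", and real adapters may need more than one frame per roundtrip or may batch several operations into one transfer.

```python
FRAME_MS = 0.125  # USB 2.0 frame length quoted in the discussion


def min_latency_ms(roundtrips: int, frames_per_roundtrip: int = 1) -> float:
    """Lower-bound latency if every roundtrip waits for whole frames."""
    return roundtrips * frames_per_roundtrip * FRAME_MS


print(min_latency_ms(300))     # 37.5
print(min_latency_ms(300, 2))  # 75.0
```

Under these assumptions, even a naive register dump stays in the tens of milliseconds, which is consistent with the "it's fine" reports above.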
TMM_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
TMM_ has joined #riscv