<palmer>
unlord: odd, so maybe we just have some bad codegen. I guess there's an extra factor of 10x badness in there on QEMU, but maybe that doesn't matter if it's just something like scatter-gather that goes slow on HW too.
<palmer>
(also, I updated the description)
<palmer>
Patrick is running the fuzzer in some mode that looks for performance issues like this, so hopefully we'll have some better examples soon...
<unlord>
I guess I could try these binaries with qemu-user-riscv64 and see the output
<palmer>
oh, my numbers were all user-mode
<palmer>
that's how we run SPEC and the compiler test suites, so that's what's really been hammering folks around here
<unlord>
yeah, I am seeing even worse than you posted, roughly 275x slower!
<courmisch>
palmer: full FFmpeg checkasm bench takes half an hour on K230, and I have not bothered to try on QEMU. On individual tests, I see RVV being like 60% slower than C in QEMU.
<courmisch>
while on hardware, I see anywhere from 2x to 8x speedup from C to RVV
<palmer>
ya, that's what unlord was saying too
<courmisch>
FFmpeg has many more test cases and much broader RVV instruction coverage than dav1d as of now
<courmisch>
definitely totally not to brag
<palmer>
;)
<palmer>
if you know of anything that's specifically slow then please just point me at it (or post on the QEMU bug or whatever). Patrick's going to try and get some reduced test cases from toolchain land, but I don't hack on FFmpeg so I don't really know what's up over there
<palmer>
I think getting all this fixed will probably be too big for an intern project, but hopefully there's some low-hanging fruit left we can deal with
<courmisch>
I mean, it's basically "git clone ...; cd ffmpeg ; ./configure; make tests/checkasm/checkasm ; tests/checkasm/checkasm --bench"
<gurki>
you might want to delve into the specific libraries ffmpeg uses and use them directly if you want to do benchmarking
<gurki>
ffmpeg is little more than a (admittedly very sophisticated) wrapper for these
<gurki>
encoding/decoding libraries*
<gurki>
palmer: if you have problems with rvv you might want to run a linpack
<gurki>
thats more of a hpc thing, but is a very good metric whether you get as much performance from simd as expected
<courmisch>
uh
<courmisch>
are you for real?
<courmisch>
did you crunch the numbers on native implementations vs external libraries in FFmpeg?
<gurki>
do you get significantly more performance when using e.g. libx265 within ffmpeg instead of externally when you ignore all the boilerplate video handling parts?
<gurki>
this would surprise me, but im happy to be convinced otherwise by numbers
<gurki>
i did _not_ mean to belittle ffmpeg efforts btw.
<gurki>
sorry if i gave that impression.
<courmisch>
ffmpeg is little more than a wrapper for these [specific libraries]
<courmisch>
^ literally what you wrote
<courmisch>
clearly you have no clue what you are on about, as any cursory look at the code base would invalidate such statement
<gurki>
https://trac.ffmpeg.org/wiki/CompilationGuide/Ubuntu <- i just checked whether things significantly changed. this still reads like they essentially use a lot of external libs for the heavy encoding/decoding lifting
<gurki>
you are correct about "gurki is not familiar with the code itself".
<gurki>
im a mere (happy!) user as far as ffmpeg is concerned.
<courmisch>
and libx265 doesn't have *any* RISC-V optimisation as of yet
<gurki>
so your statement is that there are quite a bunch of rv specific optimizations which have yet to be ported to these libraries, but there are some optimizations for the internal parts?
<gurki>
(no offense, genuinely trying to grasp your point)
<courmisch>
I think you're missing the point
<courmisch>
there are a few things where ffmpeg uses external libraries
<courmisch>
but by and large it does stuff natively. Otherwise gst, mpv and VLC wouldn't care to use FFmpeg
<courmisch>
even h264 and h265 are decoded natively, only encoding is delegated to x26x
<unlord>
courmisch: hey, give it some time. dav1d just got RISC-V 4 hours ago!
<courmisch>
VLC got RVV optimisations 2 years ago
<unlord>
courmisch: don't ask me, I created that dav1d MR in 2022
<unlord>
but yeah, we should probably add the FFmpeg checkasm to that QEMU issue, more tests are definitely better for anyone working on codegen in QEMU
<courmisch>
I don't really see the point in optimising QEMU RVV.
<courmisch>
doesn't this assume that any RISC-V board is a QEMU VM?
<unlord>
only if it has that file
<courmisch>
isn't that file present if OpenSBI is present?
<jrtc27>
any kernel config that enables the sbi console should have it, yes
<courmisch>
which it pretty much always is
<jrtc27>
IIRC there was a period it wasn't due to legacy sbi support being dropped prior to the new dbcn extension being added?
<unlord>
courmisch: this is really a temporary measure because of how slow QEMU is
<palmer>
unlord: we should probably just give the QEMU virt board some m{vendor,arch,impl}id values, that'd be a more reliable way to detect this kind of thing
<courmisch>
unlord: your "temporary measure" is as good as always returning false
<courmisch>
it's effectively checking if the kernel was compiled with support for the SBI console (which should always be the case). It does not distinguish QEMU from real hardware
<courmisch>
palmer: that won't help userspace
<palmer>
you can get them via hwprobe
<courmisch>
okay but even then, if you emulate real hardware, you have to fake the values, so that's pointless
<courmisch>
at least usermode QEMU can be detected by reading uname(&utsname.machine)
<courmisch>
which should return the real ISA, as opposed to riscv
<courmisch>
(obviously won't help system emulation)
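courmisch's uname() check can be sketched as below; note sorear's follow-up that qemu-user (at least under binfmt_misc) fills in the *emulated* architecture, so whether this distinguishes emulation from hardware is exactly what's in dispute. Function names are illustrative.

```c
#include <string.h>
#include <sys/utsname.h>

/* Return the kernel-reported machine string ("x86_64", "riscv64", ...). */
static const char *machine_string(void)
{
    static struct utsname u;
    if (uname(&u) != 0)
        return "";
    return u.machine;
}

/* courmisch's proposed test: does the machine field claim to be RISC-V?
   Per sorear, qemu-user reports the target ISA here, so this may be true
   under emulation as well. */
static int machine_is_riscv(void)
{
    return strncmp(machine_string(), "riscv", 5) == 0;
}
```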
<sorear>
I can't speak for the system call but "uname -a" returns the emulated architecture in a chroot running with binfmt_misc and qemu, many things would fail to compile otherwise
<sorear>
traumatize the longer-serving people here by making them remember HTIF...
<conchuod>
maybe a silly question, but outside of the virt machine do you really care about detecting whether something is qemu or not?
<sorear>
the topic was "detecting fast V" and "let's use qemu as a proxy for V being much slower than scalars", or possibly the opposite
<jrtc27>
HTIF lives on as the interface used for shutting down QEMU
<jrtc27>
by virtue of conforming to syscon-power's interface
<sorear>
I do think that hwprobe should grow at least a few flags of the form "segmented loads are as fast as unit-stride", "vrgather is as fast as arithmetic at LMUL=1", "masked operations with tu mu are as cheap as unmasked" all of which differ widely between current hw implementations and affect sw optimization
<sorear>
I am aware that getting that data in a way that satisfies kernel and firmware stakeholders will be a nigh-insurmountable challenge, but there's a clear userspace need for every library to not reinvent the runtime benchmarking wheel
<palmer>
sorear: ya, I think we're going to need a bunch of vector performance flags. The only hardware I know of is the K230; if there's other stuff we can probably start to look into the differences and see what makes sense to be generic
<palmer>
ya, we can at least get the uABI sorted out and then deal with the probing later ;)
<sorear>
gurki: fp _also_ sucks in qemu, unless integer linpack is a thing you'll get a severely biased view of vector perf
<jrtc27>
2024 may well be the year of V 1.0, SG2380 (in the Milk-V Oasis) claims to be shipping Q3
<gurki>
sorear: thank you for explaining the underlying issue in a way a gurki understands :)
<gurki>
i did not expect that
<sorear>
when I was actively working on the qemu riscv port fp instructions all generated helper calls that used the berkeley-softfloat library. I think there was an effort to use actual float instructions in the JITing in at least some cases, but qemu has always prioritized correctness over speed and there's only so much you can do to optimize implementing a different platform's NaN and flags rules
<geist>
yah the V bits in particular seem to be fairly impossible to JIT natively, due to the variable width stuff
<geist>
looks like basically every V instruction falls to a helper that does a big for loop for every element
<geist>
some internal folks at work that were working with linux on riscv on qemu have found it's much slower to enable V than to run with a machine without it
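The shape geist is describing can be sketched like this: in QEMU, a guest vector instruction leaves JITed code and calls a C helper that loops over the elements one at a time, so every guest element costs at least one host call-and-loop iteration. Names here are illustrative, not QEMU's actual helpers.

```c
#include <stdint.h>

/* Hypothetical sketch of a QEMU-style RVV helper: vadd.vv on 32-bit
   elements, executed as a plain scalar loop over the active length vl. */
static void helper_vadd_vv_w(int32_t *vd, const int32_t *vs1,
                             const int32_t *vs2, uint32_t vl)
{
    for (uint32_t i = 0; i < vl; i++) {
        vd[i] = vs1[i] + vs2[i];  /* one scalar op per guest element */
    }
}
```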
<geist>
(though looks like i just repeated basically what everyone has been saying. i should read scrollback before blabbing :) )
<sorear>
It's not like you couldn't vectorize that for loop. Maybe use tb_flags to distinguish between VL=VLMAX, where you can unroll inline and use unpredicated traditional SIMD, and VL<VLMAX where you really need lengths or predication
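sorear's specialization can be sketched as the dispatch below: when vl == VLMAX the whole register group is live, so a translator could emit one unmasked host SIMD op (the fixed-trip-count loop stands in for it); only vl < VLMAX needs lengths or predication. VLMAX and the function name are hypothetical.

```c
#include <stdint.h>

#define VLMAX 8  /* hypothetical: e.g. VLEN=256, SEW=32, LMUL=1 */

static void vadd_vl_specialized(int32_t *vd, const int32_t *vs1,
                                const int32_t *vs2, uint32_t vl)
{
    if (vl == VLMAX) {
        /* Fast path: full vector, unrollable, no predication needed;
           a JIT could emit a single unmasked host SIMD add here. */
        for (uint32_t i = 0; i < VLMAX; i++)
            vd[i] = vs1[i] + vs2[i];
    } else {
        /* Slow path: partial vector, needs masking or length handling. */
        for (uint32_t i = 0; i < vl; i++)
            vd[i] = vs1[i] + vs2[i];
    }
}
```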
<sorear>
how does risc-v external debug achieve a usable speed? reading all registers naively requires hundreds of roundtrips between the debugger and the debug module. are USB roundtrips reliably sub-ms? do we assume the existence of debug transport hardware that can do the roundtrippy bits at µs hardware speed? does lazy register and memory access work better than it sounds?
<geist>
yah, trouble is you already have to have made it into a helper function at that point
<geist>
so you're already out of JIT land, but you're right, could at least optimize that loop
<geist>
this is where maybe some templaty C++ stuff would help since the loop is basically repeated for every opcode
<jrtc27>
how much memory are you trying to read via external debug?
<jrtc27>
normal guiding principle is steer clear of bare metal debugging on any arch where possible
<jrtc27>
ie embrace a crummy uart printf :)
<sorear>
if you're targeting SVE or AVX512, most single-width LMUL=1 instructions can be turned into a single host instruction
<geist>
well, not really, because it's based on what vlen was previously set to
<sorear>
jrtc27: I'm imagining "enough memory to do the LOC + backtrace + locals most graphical debuggers do on a breakpoint" but I don't have a good idea of the problem space so that might not be the best answer
<geist>
vlen could be like 3 or 7 or something
<jrtc27>
my experience of debugging a soft core with the horrendously slow DMI-over-JTAG is it really isn't that bad
<sorear>
you mean vl, and vsetvli would populate a mask register corresponding to vl
<jrtc27>
admittedly not a graphical debugger, just boring tui gdb
<jrtc27>
but LOC comes from the debug symbols on the host
<jrtc27>
backtrace is two pointer-sized memory reads per frame
<jrtc27>
locals you normally do lazily
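jrtc27's "two pointer-sized memory reads per frame" can be made concrete with a toy model: with the RISC-V frame-pointer convention the return address sits at fp-8 and the caller's fp at fp-16, so each backtrace level costs exactly two debug-module reads. `dbg_read_word()` and the flat `target_mem[]` are hypothetical stand-ins for the debug transport.

```c
#include <stdint.h>

/* Hypothetical target memory and debug-module read primitive. */
static uint64_t target_mem[64];

static uint64_t dbg_read_word(uint64_t addr)
{
    return target_mem[addr / 8];
}

/* Walk the frame-pointer chain: two round trips per frame, stopping at a
   NULL saved fp or after max frames. Returns the number of frames found. */
static int walk_stack(uint64_t fp, uint64_t *pcs, int max)
{
    int n = 0;
    while (fp != 0 && n < max) {
        pcs[n++] = dbg_read_word(fp - 8);   /* saved ra */
        fp       = dbg_read_word(fp - 16);  /* caller's saved fp */
    }
    return n;
}
```

Even at sub-ms USB round trips, a ten-deep backtrace is ~20 reads, which is why it feels fine in practice.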
<geist>
hmm, i suppose yes. you could set the host mask register
<jrtc27>
also I remember the speed being totally fine when doing bare metal debugging of a HiFive Unmatched
<jrtc27>
but that was less extensive
<sorear>
while "it's fine" is relevant information, i'm primarily asking "why is it fine"
<jrtc27>
because it's not *that* much data
<jrtc27>
I would imagine
<jrtc27>
you can do 10s or 100s of KiB/s IIRC for some slow 100 MHz FPGA
<jrtc27>
admittedly that's for writing to memory, not the slower back-and-forth for registers, but still