sorear[m] changed the topic of #riscv to: Matrix users: #riscv:libera.chat will be ending operation NET Jul 25; please test #riscv:catircservices.org as a replacement | RISC-V instruction set architecture | https://riscv.org | Logs: https://libera.irclog.whitequark.org/riscv
Tenkawa has quit [Quit: Was I really ever here?]
<dzaima[m]> slightly more analysis - clang can and does autovectorize the loop in question when it's by itself (https://godbolt.org/z/T9xKqqfjo), but just does not in the full context where I put it (and I didn't bother verifying it as my ARM "test bench" is my phone, debugging/profiling which is slightly annoying); anyway, very off-topic
<sorear> I wouldn't say that microarchitectural design space exploration is "very" off-topic
<muurkha> or even slightly
<dzaima[m]> well, that was just compiler autovectorization debugging
<dzaima[m]> for something that's probably more on-topic, here's some.. self-advertisement: I've been working on https://dzaima.github.io/intrinsics-viewer/, a listing of (rvv1.0) C intrinsics with C-like pseudocode description & expectable output assembly; might be helpful for learning, but of course most likely not as accurate as the spec itself
<muurkha> oh cool!
TMM_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
TMM_ has joined #riscv
knolle has quit [Ping timeout: 245 seconds]
zBeeble42 has joined #riscv
zBeeble24 has quit [Ping timeout: 264 seconds]
rvalles has quit [Ping timeout: 250 seconds]
rvalles has joined #riscv
jacklsw has joined #riscv
jacklsw has quit [Ping timeout: 246 seconds]
<sorear> https://gist.github.com/sorear/be81ce2725cdc8917973e8d64f40d0b1 if you want to see some appalling ideas with rvv, part of me wants to not explain anything and leave it as a puzzle
<muurkha> heh
<sorear> if you have a can-do mindset any instruction can be a mask instruction
<sorear> i forgot a vmin in the second and don't want to think about how much is wrong in the first
BootLayer has joined #riscv
billchenchina has joined #riscv
MarvelousWololo has quit [Read error: Connection reset by peer]
crabbedhaloablut has joined #riscv
<courmisch> sorear: sorry; European bed time. NF as in how many segments? I think 2 is most common (e.g. complex float numbers used in audio). 4 and 8 also occur. NPOT not so much
<courmisch> sorear: as for ELEN, audio is almost exclusively 32-bit (single precision), non-HDR video is a mix of 8- and 16-bit; HDR video is only 16-bit
<muurkha> 32-bit audio?
<courmisch> no, single precision
<muurkha> single precision is normally 32 bits, isn't it? Even if a few of them are in the exponent
<courmisch> almost all audio is single precision. JACK, PulseAudio and PipeWire use it. Most multimedia apps too
<courmisch> most decoders and filters are written for single precision too
<muurkha> are you saying that almost all audio is floating point?
<courmisch> in processing, yes
<muurkha> interesting, I thought that was an unusual niche format
<courmisch> of course what comes from ADC and goes to DAC is integer
<muurkha> It wouldn't have to be; I mean ADPCM is more like floating point, right?
<courmisch> FWIW, I think Microsoft's WASAPI also uses float internally
<muurkha> I think most audio file formats are integer, which is probably why I think of that as being normal
<courmisch> ADPCM is only found in pre-MP3 era video games
<muurkha> it was more widely used than that at the time
<courmisch> telecommunication used G.711, not ADPCM, though they are vaguely similar
<muurkha> and S1 MP3 players use ADPCM for their voice recording by default
<muurkha> sure, I guess μlaw and a-law are kind of floaty
<courmisch> ADPCM doesn't strike me as a voice codec. AFAICT, ADPCM is mostly a trick to reduce bit rate at higher sampling rates. But voice doesn't require high sampling rates in the first place
<courmisch> or at least the industry thought so, ignoring that more than half the population is female or children, with higher pitch voices, but I blame the previous generation
<muurkha> well, the S1 MP3 is built around a Z80, so "higher sampling rates" is a matter of perspective :)
<muurkha> I was going to say "was" but then I remembered that I searched a few weeks ago and apparently they're still for sale
<courmisch> I mean music rates, 44.1kHz or 48kHz
<courmisch> voice is usually 8kHz or if you are fancy 16kHz
<muurkha> yes, IIRC it uses 16ksps ADPCM
<muurkha> I had some files I recorded on one but I don't know where they are
<muurkha> not sure why they didn't use something like GSM full rate
<muurkha> anyway I think there's still a lot of integer PCM data flowing around between programs, even if Pulse does support floats
<muurkha> audio data I mean
<courmisch> you can feed PCM to any sane audio output HAL, but it will be converted to float, unless you have monopoly over the audio output device
<muurkha> you don't need float to do mixing, if that's what you mean, or even dithering
<muurkha> people like floats for mixing because they don't clip
<courmisch> no, but that ship has sailed
<courmisch> nobody does audio processing in integer
<courmisch> unless it's some niche embedded code for an FPU-less device, ofc
<muurkha> there are more programs in cloud and earth, courmisch, than are dreamed of in your philosophy
<muurkha> have you used Audacity? Audacity is totally comfortable doing everything in integers
<muurkha> and that's what it does by default
<muurkha> not just if you import from integer WAV files or CD-DA. Even if you import from AAC or MP3!
<muurkha> Try importing some random YouTube video into Audacity and select Effects → Amplify and tell it to amplify it by 60dB. You'll get clipping all over the place, exactly the way you don't with float
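A minimal C sketch of the clipping behaviour described above, assuming 16-bit integer samples versus float samples normalised to roughly [-1.0, 1.0). A 60 dB amplitude gain is a factor of 1000. The function names `amplify_s16` and `amplify_f32` are hypothetical, chosen for illustration, and are not taken from Audacity or any other program.

```c
#include <assert.h>
#include <stdint.h>

/* Apply a linear gain to one 16-bit sample. The product is computed in
 * 32 bits and then clipped to the int16 range, which is where the
 * audible distortion comes from. */
static int16_t amplify_s16(int16_t sample, int32_t gain)
{
    assert(gain >= 0 && gain <= 65535);   /* keeps the product inside int32 */
    int32_t wide = (int32_t)sample * gain;
    if (wide > INT16_MAX) return INT16_MAX;
    if (wide < INT16_MIN) return INT16_MIN;
    return (int16_t)wide;
}

/* The same gain on a float sample. Nothing clips here: the value simply
 * leaves the nominal [-1, 1) range and can be scaled back down later. */
static float amplify_f32(float sample, float gain)
{
    return sample * gain;
}
```

With a gain of 1000, an integer sample of 100 saturates at 32767 and the original value cannot be recovered, whereas a float sample can be amplified and attenuated again with at most rounding error.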
<courmisch> of course if your source is an audio CD or a microphone you'll get PCM. I'm just saying internal processing is utterly dominated by single precision.
<courmisch> Audacity can do silly things to please the so-called "audiophiles". Unless they write their own decoders, they'll get single precision for most sources, and unless they bypass JACK/PA/PW, their output will be postprocessed in single precision
<muurkha> I don't think it's to please the audiophiles, though who knows; IEEE-754 single-precision floating point has more bits of mantissa than CD-DA
<muurkha> I just think it's silly to describe Audacity as "some niche embedded code for an FPU-less device" ;)
<courmisch> the only people who would get pissy about possibly losing precision due to 32-bit float are audiophiles
<muurkha> losing precision?
<courmisch> and those will insist on fixed-point 32-bit processing
<courmisch> muurkha: the "problem" with float processing is that you have rounding errors, and they're not even the same between implementations
<muurkha> I've never heard of anyone using fixed-point 32-bit processing
<courmisch> well, you can't really do processing at 16-bit
<muurkha> depends
<muurkha> yes, irreproducibility of floating-point is a problem, but reproducibility is not a major objective for audio processing I think
<courmisch> and open-source decoders with integer paths typically output fixed-point 32-bit on that path
<muurkha> interesting, I don't think I've seen that. but there's an enormous amount I haven't seen
<courmisch> muurkha: I agree, but audiophiles seem to live in an alternate reality where the human ear can tell the difference
<muurkha> I thought the hip audiophile thing was 24/96: 24-bit precision at 96ksps
<courmisch> yes, lossless FLAC 24-bit 96kHz
<courmisch> but that's just the storage format
<muurkha> but 32-bit floats have a 24-bit significand
<muurkha> so I don't think you're going to lose precision by going to 32-bit floats
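The significand argument can be checked exhaustively. This sketch assumes `float` is IEEE-754 binary32 and uses hypothetical helper names. It converts every signed 24-bit sample value to float and back, and also shows a 25-bit value that does not survive.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an integer to float and back. The volatile store forces the
 * value to be rounded to single precision even on targets that keep
 * intermediates in wider registers. */
static int32_t via_float(int32_t s)
{
    volatile float f = (float)s;
    return (int32_t)f;
}

/* Returns 1 if every signed 24-bit sample value survives the round trip. */
static int s24_roundtrips_through_float(void)
{
    for (int32_t s = -(1 << 23); s < (1 << 23); s++) {
        if (via_float(s) != s)
            return 0;
    }
    return 1;
}
```

Under the binary32 assumption the loop finds no mismatch, while 16777217 (2^24 + 1) comes back as 16777216.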
billchenchina has quit [Quit: Leaving]
<courmisch> yes they do, don't try reason when it comes to audiophilia
<muurkha> but yes it's true that on mixing you have rounding errors (but who cares)
<muurkha> I mean 24 bits of precision is a lot of headroom
<courmisch> a good human ear cannot even sense 12th bit error
<muurkha> and floats give you more headroom
<muurkha> 12 bits is only 72 dB, right?
<muurkha> I think there are probably circumstances where you can detect a sound that's 72 dB quieter than another sound that's far away in frequency
<muurkha> like, if the loud sound is 20 Hz and the quiet sound is 3500 Hz, I think it should be easy
<courmisch> and IIRC, blind test couldn't differentiate 16-bit and 24-bit recordings
<courmisch> the only logical reason to use 24-bit recording is HQ postprocessing. But audiophilia is not about logic.
<muurkha> especially if it's 3500 Hz beating with an equal-amplitude sound at 3600 Hz
<courmisch> hmm, not sure how the frequencies were spaced, and I could be off by a bit or two
<courmisch> but the point is that 24-bit is mostly useless (compared to 16-bit)
<muurkha> you'd need a quiet listening room probably
<courmisch> and integer processing is a waste of SW developer time, unless your target CPU has no FPU
<muurkha> or good over-ear headphones
<muurkha> because otherwise even slight background noise will overwhelm the signal of interest
<courmisch> and ditto 96kHz/192kHz vs 48kHz. Waste of bits, unless you're going to downrate the sound to make your techno music.
<muurkha> yeah, the benefit of 96ksps is that you have an extra octave of headroom for your antialiasing filters
<muurkha> but once you have it in the digital domain you might as well downsample it with the kinds of astoundingly good filters that are easy to do digitally
<muurkha> unless you're listening to bats or something
<muurkha> there are some really interesting DSP algorithms you probably aren't aware of that depend on exact pole-zero cancellation and so don't work reliably in floating-point
<muurkha> Hogenauer filters are probably the most popular ones, and you're undoubtedly surrounded by equipment with Hogenauer filters deeply embedded in it even if you aren't aware of it
<muurkha> but there are others
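For readers unfamiliar with Hogenauer (CIC) filters, here is a small sketch of a decimator: integrators at the input rate, then combs at the output rate. It assumes three stages, a differential delay of 1, and 32-bit registers; the type and function names are hypothetical. The integrators overflow constantly by design, and the result stays exact only because unsigned arithmetic wraps modulo 2^32 and the combs cancel the overflow exactly. Floating point gives no such guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CIC_STAGES 3

typedef struct {
    uint32_t integ[CIC_STAGES];  /* integrator accumulators (wrap freely) */
    uint32_t comb[CIC_STAGES];   /* previous comb inputs */
    unsigned phase;              /* input samples since the last output */
    unsigned ratio;              /* decimation ratio R */
} cic_t;

static void cic_init(cic_t *c, unsigned ratio)
{
    assert(ratio > 0);
    *c = (cic_t){0};
    c->ratio = ratio;
}

/* Push one input sample. Returns 1 and writes *out once every `ratio`
 * inputs, otherwise returns 0 and leaves *out untouched. */
static int cic_push(cic_t *c, int32_t in, int32_t *out)
{
    uint32_t v = (uint32_t)in;
    for (int i = 0; i < CIC_STAGES; i++) {
        c->integ[i] += v;            /* wraps modulo 2^32, on purpose */
        v = c->integ[i];
    }
    if (++c->phase < c->ratio)
        return 0;
    c->phase = 0;
    for (int i = 0; i < CIC_STAGES; i++) {
        uint32_t prev = c->comb[i];
        c->comb[i] = v;
        v -= prev;                   /* modular difference cancels the wrap */
    }
    *out = (int32_t)v;               /* two's-complement reinterpretation */
    return 1;
}

/* Feed a constant input n times and return the most recent output. */
static int32_t cic_last_output(int32_t x, unsigned ratio, size_t n)
{
    cic_t c;
    int32_t out = 0;
    cic_init(&c, ratio);
    for (size_t i = 0; i < n; i++)
        cic_push(&c, x, &out);
    return out;
}
```

The DC gain of this structure is R^N, so a constant input of 5 with R = 4 settles at 5 * 64 = 320, and it stays at 320 long after the third integrator has wrapped many times.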
<muurkha> there's a plot on https://en.wikipedia.org/wiki/Psychoacoustics#Masking_effects showing how masking drops off with Δf
<muurkha> I was hoping to find a better plot
<courmisch> sorear: problem with that non-segmented strategy: it only works one way. There's no widening left shift as far as I can see
bjoto has quit [Ping timeout: 246 seconds]
<dzaima[m]> yeah, afaik you have to do that as separate widening & shifting; so all together 4 ops - 2 widens, a shift, and a bitwise-or or blend or something
<courmisch> yikes
bjoto has joined #riscv
<dzaima[m]> there's a widening shift left in some proposed extension - https://github.com/riscv/riscv-crypto/blob/v20230620/doc/vector/insns/vwsll.adoc but that's of course an extension, and also isn't yet ratified
junaid_ has joined #riscv
junaid_ has quit [Client Quit]
Andre_Z has joined #riscv
Tenkawa has joined #riscv
junaid_ has joined #riscv
pecastro has joined #riscv
awita has joined #riscv
raghavgururajan has joined #riscv
BootLayer has quit [Quit: Leaving]
cwebber`` is now known as cwebber
cwebber has quit [Changing host]
cwebber has joined #riscv
unsigned has quit [Quit: .]
BootLayer has joined #riscv
ximet43 has quit [Read error: Connection reset by peer]
ximet43 has joined #riscv
Andre_Z has quit [Quit: Leaving.]
<sorear> you can use vwadd.vw
<sorear> instead of vwsll you can try using vwmul
<sorear> vwmaccu.vx _almost_ does what you want but it requires one of the inputs to be pre-widened
<sorear> given that Zbb has PACK this is a kinda obvious omission from Zvbb !
<sorear> or ignore the problem. stores are a fair amount rarer than loads and having them slow is less of a problem
<dzaima[m]> oh yeah vwadd.vw is nice for merging widen&or. vwmul doesn't work though, as you'd need the constant multiplicand to be wider than the half-width type
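A scalar C model of the three-step interleave being discussed, with one loop iteration standing in for one vector element. The mapping of each step to an RVV 1.0 instruction in the comments (`vzext.vf2`, `vsll.vi`, `vwaddu.wv`) is an interpretation of the conversation rather than something stated in it, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine two 8-bit streams into 16-bit elements: out[i] = lo[i] | hi[i] << 8.
 * On a little-endian machine the bytes of `out` are lo0, hi0, lo1, hi1, ... */
static void interleave_u8_pairs(uint16_t *out, const uint8_t *lo,
                                const uint8_t *hi, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t wide = hi[i];               /* step 1: zero-extend (vzext.vf2) */
        wide = (uint16_t)(wide << 8);        /* step 2: shift at 16 bits (vsll.vi) */
        out[i] = (uint16_t)(wide + lo[i]);   /* step 3: wide + narrow add (vwaddu.wv) */
    }
}
```

The add in step 3 is equivalent to a bitwise OR because the low byte of `wide` is always zero after the shift, which is why a widening add can fold the second widen and the OR into a single operation.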
Danidada has joined #riscv
<sorear> does ffmpeg have any mechanism for "feature X exists on uarch Y but is slow, don't use it"? it's not like intel hasn't done this more than once
<sorear> looking at you maskmov
<sorear> (anyone like my hell loops?)
junaid_ has quit [Remote host closed the connection]
<sorear> is ffmpeg likely to use Zvfh (_Float16) anywhere?
<dzaima[m]> very "fun" is that AVX2 maskmov store is still very slow on Zen 4, despite an AVX512 masked store being equivalent and fast
* cousteau_ wonders if C compilers are likely to support a `short float` type, and stuff like scanf("%hf")
<sorear> what would "short float" do now that ISO has standardized _Float16?
<courmisch> sorear: there are such flags for SSE2, SSE3, SSSE3 and AVX, yes
<courmisch> but currently, there'd be no ways to populate them on RV
<courmisch> sorear: multimedia doesn't use half-precision for anything AFAIK, nor the other AI variants of 16-bit floats.
<cousteau_> sorear: I guess the same as `bool` or `complex`
<courmisch> sorear: maybe GPUs have some use for video filtering? maybe check the DRM chroma list. but that would probably go via GL or VK rather than the vector unit
<cousteau_> except that, uh, they're not macros/typedefs
<cousteau_> *it's not
<sorear> _Float16 isn't a typedef
<cousteau_> no, I meant that `bool` is
<cousteau_> but `short float` wouldn't
<courmisch> sorear: I am a bit suspicious of using a multiply accumulate for a shift-and-or; multiplications tend to be slower
<sorear> ISO/IEC TS 18661-3:2015 doesn't define printf/scanf codes, it has a new "strfromf16" function...
<cousteau_> so you can `typedef _Bool bool;` and `#define complex _Complex`, but you can't `typedef _Float16 short float;`
<sorear> courmisch: you're already writing single-purpose code for the c910...?
<courmisch> no, I'm writing RVV 1.0 code then postprocessing it to benchmark on C910
<courmisch> for lack of a better alternative
<cousteau_> but in any case, I'd expect compilers to support `short float` at least as an extension, and maybe in a future as a base type
<sorear> interesting that ffmpeg went with exclusive use of array-of-structs, I don't think SSE supports segmented operations well
<courmisch> interleaved is the de facto standard for audio
<sorear> I wouldn't expect that, every type that has been added since "long long" in C99 has been a single keyword
<courmisch> I have seen planar audio, but it's really rare
<sorear> externally, sure, but I meant for internal variables in the AAC codec or whatever you were working on
<courmisch> I think those were complex numbers, though I have not checked the "upper layers"; just ported the reference C
<sorear> deplanarize it at the same time you convert from float32 to int16 for the DAC
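A sketch of what doing both steps in one pass could look like: planar float32 channels in, interleaved int16 frames out. It assumes samples normalised to [-1.0, 1.0), a scale factor of 32768, rounding half away from zero, and no dithering. The function names are hypothetical and this is not FFmpeg's or any audio server's actual conversion routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one normalised float sample to int16 with clipping. */
static int16_t f32_to_s16(float x)
{
    if (x != x)                       /* NaN: emit silence */
        return 0;
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) return INT16_MAX;
    if (scaled <= -32768.0f) return INT16_MIN;
    /* round half away from zero; the cast truncates toward zero */
    return (int16_t)(scaled < 0.0f ? scaled - 0.5f : scaled + 0.5f);
}

/* planes[c][f] is sample f of channel c; dst receives frames laid out as
 * ch0, ch1, ..., ch0, ch1, ... */
static void planar_f32_to_interleaved_s16(int16_t *dst,
                                          const float *const *planes,
                                          size_t channels, size_t frames)
{
    for (size_t f = 0; f < frames; f++)
        for (size_t c = 0; c < channels; c++)
            dst[f * channels + c] = f32_to_s16(planes[c][f]);
}
```

Because the output index is `f * channels + c`, the same loop handles any channel count, including non-power-of-two layouts.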
<courmisch> that would be inside JACKd/PA/PW outside the reach of FFmpeg
<courmisch> I don't think PA and PW even support planar audio
<courmisch> (could be wrong)
<courmisch> but converting FFmpeg and all the apps using FFmpeg is essentially impossible at this point
<sorear> how do you handle NPOT channels?
<courmisch> you mean surround? that's also interleaved. I haven't touched the filtering code yet though
<courmisch> typically, surround is pass-through anyway, thanks to Dolby / DTS insanity
<cousteau_> sorear: well, C99 was the last standard I really understood so...
<courmisch> C99 doesn't have atomics and alignment. Useless.
<courmisch> sorear: I am a bit confused why so many opcodes were taken for segment support if it's not implemented properly by anybody
<sorear> huh, c23 has scanf codes for float, double, long double, _Decimal32, _Decimal64, and _Decimal128 but not the sized binary float types
<courmisch> we'll see what SiFive did, whenever they actually release a CPU
<sorear> courmisch: i think you should test more than one implementation before declaring that it's not implemented properly by anybody
aerkiaga has joined #riscv
<sorear> (and the spec doesn't actually say anything about which operations are required to be fast)
<courmisch> sorear: sure. But for comparison, arm only supports 2-4 segments. They went out of their way and consumed one more bit
<courmisch> also if the plan is to enable compilers to autovectorise, segmented loads are probably unavoidable
<sorear> sure. and you did say that a trivial loop with a segmented load and 1.5 FLOPs per loaded value was only slightly slower than the scalar loop
<sorear> it's not like you're hitting m-mode emulation code here, like what happens when you try to take advantage of required support for misaligned memory access in the unprivileged spec on any sifive cpu
<courmisch> yes, it's not catastrophically slow
<courmisch> but if it's slower than scalar, even if it's faster than manually shuffling bits
<courmisch> it's still kinda useless. We'll just have to see what anybody other than T-Head did
<sorear> can you put in a tune flag for avoiding/using zvlsseg, to use 0.7.1 terminology?
<sorear> if anything I'm surprised t-head implemented zvlsseg if they made it no faster than independent strided loads
<courmisch> you mean FFmpeg CPU flag?
<courmisch> if __riscv_hwprobe hypothetically defined a corresponding flag, yes
<cousteau_> may I ask why the sudden interest in ffmpeg? I've seen it popping up a lot here recently
<cousteau_> is this because something in the vector extension may be of interest for ffmpeg audio/video codecs, and there's some collaboration going on, or intended to be?
<courmisch> it's one of the most assembler-intensive of the popular OSS libraries
<cousteau_> oh I see
<cousteau_> so it's being used as an example?
<courmisch> meanwhile OpenSSL and Nettle probably can't do much until the crypto vector stuff is out
<courmisch> I haven't seen FFmpeg pop up much recently here. It's just the same segmented load discussion that's been going on for a few days
<courmisch> and neither FFmpeg nor any other multimedia libraries are included in RISE :(
<cousteau_> I'm no crypto expert but is it really that important? I saw a lot of noise about Zvk* (whereas Zk* seems to be pretty established by now)
<courmisch> AFAIU, there are two parts to Zvk. One where they just vectorise the Zk stuff like CLMUL
<courmisch> and one where they implement specific algorithm rounds directly in the CPU
<courmisch> the latter should obviously be a hell of a lot faster than Zk
<courmisch> not that I'd know
jacklsw has joined #riscv
<cousteau_> I see
cousteau_ is now known as cousteau
<cousteau> the way you put it, I'd expect that first part to just be the same instructions implemented differently
<cousteau> (although maybe there are some minor differences other than the ABI that do matter, such as timing... part of Zk* is "data-independent timing" so I guess that matters)
<courmisch> that's like comparing I and V
<courmisch> V does not really add anything that I or M can't do. It's just faster if used right
MarvelousWololo has joined #riscv
<cousteau> courmisch: ah. I misunderstood that Zvk would just accelerate instructions that Zk already did
joev1 has quit [Ping timeout: 252 seconds]
joev1 has joined #riscv
<dzaima[m]> fwiw, current status of autovectorization of the original question ffmpeg loop: https://godbolt.org/z/xPecPnYna - clang uses vlseg2e32 whereas gcc does vrgather-s, clang always has a scalar loop for the tail, gcc sucks at eliminating setvl-s, both unroll 2x despite usually not doing that in my experience (might make sense if for gcc splitting up higher-lmul gathers, but no it's still LMUL=1 which is its usual LMUL; clang's at LMUL=2 which is also its usual LMUL)
<cousteau> I think RISC-V should add carry-less addition and subtraction in addition to multiplication
<cousteau> as a joke
<cousteau> see how long it takes people to realize that's just XOR
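To spell the joke out: carry-less arithmetic treats integers as polynomials over GF(2), so addition and subtraction are both XOR, and carry-less multiplication is shift-and-XOR. Below is a portable C sketch with hypothetical function names. It returns only the low 32 bits of the product, which is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Addition without carries. Subtraction is the same operation. */
static uint32_t carryless_add(uint32_t a, uint32_t b)
{
    return a ^ b;
}

/* Low 32 bits of the carry-less product of a and b: for every set bit i
 * of b, XOR in a shifted left by i. */
static uint32_t clmul32(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u)
            acc = carryless_add(acc, a << i);
    }
    return acc;
}
```

For example, 3 times 3 is 9 in ordinary arithmetic but 5 carry-lessly, since (x + 1)^2 = x^2 + 1 over GF(2).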
frkzoid has joined #riscv
frkazoid333 has quit [Ping timeout: 264 seconds]
<sorear> clmul is Zbc, not Zk*
<sorear> Zk* implements round function bits, but it's limited by your datapath width ... Zkned needs 16 instructions for an AES round on rv32ik, 4 on rv64ik
<sorear> Zkt is not really fit for purpose because it doesn't specify that loads and stores are independent of the value loaded or stored, so if you're treating it as a manual you can't store keymat in memory
<sorear> only useful for secrets born in registers (via Zkr) and dying in registers
<sorear> courmisch: so no ffmpeg mechanism for setting flags other than architectural features on the basis of vendor/arch/impl, runtime benchmarking, or configuration?
<courmisch> sorear: you can set a flag based on whatever, but well, you need to detect the correct value somehow
<courmisch> dzaima[m]: interesting. Well I doubt that vrgather is fast. It sounds like the one instruction that can't really be fast
<sorear> I mean it's easy enough to detect "vlseg2e32 is slower than vle64 + vnsrl + vnsrl" if you have access to rdcycle
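The non-segmented alternative mentioned here can be modelled in scalar C: treat each pair of 32-bit elements as one 64-bit element, then narrow twice with different shift amounts. The function name is hypothetical and the instruction names in the comments are an interpretation. The 64-bit value is composed explicitly so the model does not depend on host endianness; a little-endian 64-bit load would produce the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split an interleaved stream a0,b0,a1,b1,... into even[] = a and odd[] = b. */
static void deinterleave_u32_pairs(uint32_t *even, uint32_t *odd,
                                   const uint32_t *src, size_t pairs)
{
    for (size_t i = 0; i < pairs; i++) {
        /* what a little-endian 64-bit element load (vle64.v) would see */
        uint64_t wide = (uint64_t)src[2 * i] |
                        ((uint64_t)src[2 * i + 1] << 32);
        even[i] = (uint32_t)(wide >> 0);    /* narrowing shift by 0 (vnsrl) */
        odd[i]  = (uint32_t)(wide >> 32);   /* narrowing shift by 32 (vnsrl) */
    }
}
```

Note that the shift by 32 does not fit the 5-bit immediate of `vnsrl.wi`, so a real implementation would presumably use the `.wx` form with the shift amount in a scalar register.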
<courmisch> sorear: "if you have access to rdcycle" well, I think the kernel is removing that access
<dzaima[m]> yeah. gcc autovectorization is very early stages afaict, so that's likely just it not being particularly complete
<sorear> although it might not be safe to assume that the behavior is the same for all element sizes, and linux is phasing out the initial scounteren setup
<courmisch> dzaima[m]: IMO, segmented loads are correct here. It's Clang that's using gather where it's easily avoided
<dzaima[m]> @libera_courmisch:catircservices.org: no, clang uses vlseg2e32 whereas gcc uses vrgather
<sorear> gcc autovectorization actually uses VL so in my book it's way ahead of clang
<courmisch> dzaima[m]: ah ok, nvmd
<dzaima[m]> (ugh, bridge doesn't convert user links properly matrix→irc)
<courmisch> sorear: seems more like a case of neither of them being mature, with different problems
<dzaima[m]> clang's presumably using the same core as ARM SVE which doesn't have VL; gcc is a new vectorizer afaik
<cousteau> sorear: isn't clmul also part of Zbk* ?
<cousteau> (probably Zbkc)
<sorear> maybe
<sorear> clang's IR has %evl arguments on every vector-predication intrinsic function which aren't being generated on any architecture...
<cousteau> apparently it used to be called "Zkg"
<courmisch> dzaima[m]: SVE doesn't have VL, but it has WHILExx doing almost the same though. But maybe they determined that unrolling was faster than using VL style due to lack of LMUL
Narrat has joined #riscv
Tenkawa has quit [Quit: Was I really ever here?]
<dzaima[m]> hmm yeah; gcc also appears to use whilelo: https://godbolt.org/z/9szqGoxKG
<courmisch> and GCC is not really using VL on RVV
<courmisch> it's hard-coding the AVL to 8 for no obvious reasons
<courmisch> ah, maybe missing restrict?
<courmisch> nah, I don't know
<dzaima[m]> that may be because of.. --param=riscv-autovec-preference=fixed-vlmax - change that to scalable and it should probably scale, at the expense of even more output
<courmisch> not sure if becoming a compiler developer is insanity or job security
<courmisch> probably both
<dzaima[m]> gcc's scalable code is usually a lot better though; the deinterleaving hits a pretty bad case
<courmisch> I guess LLVM just took it from NEON, not SVE
<courmisch> that would actually explain the unrolling
<dzaima[m]> this is the first case I've seen LLVM unroll though
<courmisch> doesn't it use NEON with 2x unroll on Armv8 for this function?
<dzaima[m]> on NEON clang does unroll usually, yes
<dzaima[m]> but for rvv it doesn't unroll even the simplest loops - https://godbolt.org/z/q1YqG3noc
<courmisch> somebody should teach it vmv.v.i
<dzaima[m]> that's a float 1.0, not integer
<dzaima[m]> it does indeed vmv.v.i v8, 1 for an integer 1
<sorear> it's still painful to look at scalar epilog loops for rvv
<dzaima[m]> yeah
<courmisch> so hmm, is there any way to load a vector *backward* other than vlse with a negative stride?
<courmisch> a loop over vslide, but that's probably even slower
<dzaima[m]> vrgather is always an option
<courmisch> vid.v; vneg; vrgather, hmm
<dzaima[m]> probably would want to have separate tail handling though, otherwise the index generation has to be in the loop
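A scalar model of the gather-based reversal, one chunk at a time. The index vector is vl-1, vl-2, ..., 0, which in RVV terms would be `vid.v` followed by a reverse subtract from vl-1 (`vrsub.vx`) rather than a plain negate. `MAX_VL` and the function names are hypothetical. As a simplification, the model treats indices at or beyond vl as out of range, whereas the real `vrgather.vv` checks against VLMAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VL 64   /* hypothetical upper bound on elements per chunk */

/* Model of vrgather.vv: out[i] = src[idx[i]], or 0 if the index is out of range. */
static void gather_u32(uint32_t *out, const uint32_t *src,
                       const uint32_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        out[i] = idx[i] < vl ? src[idx[i]] : 0;
}

/* Reverse one chunk of vl elements via an index vector and a gather. */
static void reverse_chunk_u32(uint32_t *out, const uint32_t *src, size_t vl)
{
    uint32_t idx[MAX_VL];
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++)
        idx[i] = (uint32_t)((vl - 1) - i);   /* vid.v, then vrsub.vx with vl-1 */
    gather_u32(out, src, idx, vl);
}
```

The indices depend on vl, which is the reason for the tail-handling remark: a shorter final chunk needs a different index vector, so either the tail is handled separately or the indices are regenerated inside the loop.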
<sorear> panpipe is definitely going to optimize strides of -1,0,1 into unit-stride memory operations btw
<sorear> mostly because it's extremely easy to generate a variable-but-1 stride in FORTRAN
<sorear> i think that everything in lapack is a strided load but I haven't checked the actual binary
jacklsw has quit [Ping timeout: 272 seconds]
<sorear> how _is_ vrgather performance on c910?
<dzaima[m]> here's a vrgather-based array reverse impl, both in-place and out-of-place in one, in autogenerated C intrinsics, that I wrote as an experiment, and appears to possibly work with some simple tests in qemu - https://godbolt.org/z/Edf8addjK
<sorear> suddenly reminded of / wondering when we're going to be able to inline memcpy as 3 instructions for len<=128
<sorear> vsetivli;vle8;vse8 compares rather favorably to auipc;jalr ...
<sorear> maybe I'll turn the json parser into something testable
<courmisch> is it legal to use an odd register number as a wide operand if LMUL is fractional?
<courmisch> sorear: implying you don't have a C910 lying around? I can run a test case if you have one
<dzaima[m]> clang does some... very funky things with memcpy: https://godbolt.org/z/b3qM78TaY
<sorear> courmisch: i ordered a lp4a but it won't be here for two weeks
<sorear> courmisch: I would say so - my understanding of the rules on register number is that they are based on EMUL, not LMUL
<sorear> dzaima[m]: SLP was a mistake
<dzaima[m]> I can hack up some vrgather tests if desired; I'm also quite curious how it'd perform (and also don't have any risc-v hardware)
<sorear> it's not clear from the photos and the pdfs how I'm supposed to run my own S/M-mode code on the lp4a but I'll probably figure it out when I have it
<courmisch> mine has broken fastboot
<courmisch> I had to use unofficial U-boot's fastboot to flash it
<courmisch> to run your own S mode code, you can just replace the kernel image in u-boot, I think. Custom M mode, I dunno
joev1 has quit [Ping timeout: 245 seconds]
<sorear> what hardware do you use to flash it?
<courmisch> USB C cable
joev1 has joined #riscv
<courmisch> since it's built-in MMC, you can't do the usual dd from your desktop computer
<courmisch> but judging by your timeline, you just got the first nonbeta while I have the last beta
<courmisch> in principle, you should be able to overwrite the u-boot partition from within Linux. But don't come crying to me if you brick the device...
<sorear> haven't done anything with u-boot before
<sorear> does "unofficial U-boot's fastboot" imply that you replaced the U-boot image on yours?
<courmisch> from U0 TTL serial port, I just typed "fastboot usb 0" + Enter, and that got me a somewhat working fastboot gadget
<courmisch> again, that's not the official method, which is to hold the BOOT button while attaching power (never worked for me)
<courmisch> sorear: no, there is a built-in fastboot implementation inside u-boot, and Sipeed didn't proactively remove it
TMM__ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
TMM_ has joined #riscv
<sorear> courmisch: do you know how the beta/official thing works? do you have a beta version?
Narrat has quit [Quit: They say a little knowledge is a dangerous thing, but it's not one half so bad as a lot of ignorance.]
<courmisch> sorear: I guess I have a beta because it shipped before they announced the release, but :shrug:
<sorear> i think i accidentally got a beta, the only one listed on the store when i ordered was the "8+8"
<sorear> if it was _clearly labeled_ as a beta i would have waited a few days
<sorear> online docs say that the release versions have dip switches to boot from sd card, would be a nice simple way to run M-mode code without worrying about bricking it
<courmisch> when I bought, they were only selling the beta specs. So I guess I have a beta from inventory, even though beta was officially only for preorders
<sorear> fairly clear that the fastboot "boot" image includes opensbi, which means I can replace the M-mode code, although there's no obvious way to restore if it won't load u-boot
<courmisch> is it so that only u-boot SPL and OpenSBI get M, while "normal" u-boot runs at S mode?
<sorear> hey, if you are writing in a dialect where you can translate to xtheadv for testing, would it make sense to include that in the build system for users?
<sorear> (SPL, opensbi, normal uboot) that is my understanding, yes
<courmisch> and OpenSBI monopolises M mode, and protects itself with PMP from S mode?
<courmisch> I'm very unfamiliar with the privileged RISC-V stuff
<sorear> I'll probably patch opensbi to include a backdoor, then not touch the opensbi on flash again
<sorear> that or - any idea how to use the jtag?
<courmisch> to paraphrase my go-to-colleague for electronic questions, using JTAG would require some very fine soldering
<courmisch> tl;dr: no
<sorear> oh, you have to solder on your own connector? :/
<courmisch> that's what he said. I took his word at face value because my electronic-fu is nonexistent
<courmisch> like managing to get a serial console from GPIO is an achievement for me
<courmisch> I'm not using any dialect. Just making tons of macros to convince gas to assemble V 0.7.1, XTheadBA and XTheadBB instead of V 1.0, Zba and Zbb
<courmisch> then a pile of hacks directly in the code for the stuff that just can't be done that way
BootLayer has quit [Quit: Leaving]
joev1 has quit [Ping timeout: 245 seconds]
joev1 has joined #riscv
rory_be has quit [Read error: Connection reset by peer]
rory_be has joined #riscv
joev1 has quit [Ping timeout: 272 seconds]
joev1 has joined #riscv
<sorear> courmisch: any expectation of committing those hacks?
esv_ has joined #riscv
esv__ has joined #riscv
esv has quit [Ping timeout: 245 seconds]
esv_ has quit [Ping timeout: 245 seconds]
esv has joined #riscv
esv__ has quit [Ping timeout: 245 seconds]
<courmisch> sorear: nothing grandiose there https://git.remlab.net/gitweb/?p=ffmpeg.git;a=shortlog;h=refs/heads/thead
<sorear> how feasible is it to do benchmarking on startup for microarchitectural characteristics? using rdcycle, rdtime, clock_gettime, perf_event_open, whatever
esv has quit [Ping timeout: 272 seconds]
<courmisch> if you mean the stock firmware, it's some heavily patched 5.10
<courmisch> so rdcycle and rdtime work
<courmisch> perf doesn't
<whitequark[cis]> courmisch: pic of very fine soldering?
<courmisch> I recommend to upgrade the firmware to the last vendor release, but that's still 5.10 anyway
<courmisch> whitequark[cis]: he said it would be needed; we didn't actually do it
<sorear> perf should work iff rdcycle doesn't, so you might be able to try perf_event_open and fall back to rdcycle on -ENOSYS
<sorear> still stupid
<courmisch> I have a MR on FFmpeg for that, yes
<sorear> for checkasm
<whitequark[cis]> courmisch: I'm just curious what the footprint is
<courmisch> whitequark[cis]: sorry :shrug:
<whitequark[cis]> (people's classification of soldering jobs varies immensely; for some anything SMD is too hard, some shrug off doing pads that are 300x300 micron)
<courmisch> sorear: that's the only use of rdcycle in the FFmpeg code base
<sorear> with the usual caveat that I barely have a clue what ffmpeg is, I don't think your checkasm MR is usable for function selection during normal usage of the library/application?
<courmisch> by all means, if you want to play with RVV, pick whatever project you are comfortable with
<gurki> I'd expect some basic math lib to yield more meaningful results
<gurki> linpack if you want some benchmark numbers
<sorear> I'm suggesting that instead of just checking hwprobe, check hwprobe and also run some small benchmark routines to select the fastest X on your machine
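A minimal sketch of the startup selection sorear describes, assuming the candidates share one signature and that a short timing loop with clock_gettime is representative enough. The names copy_fn, copy_bytewise, copy_unrolled, now_ns and pick_fastest are hypothetical and are not taken from FFmpeg; the hwprobe check that would decide which candidates are eligible is left out.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Stand-in candidates; real ones might be a segmented-load variant
 * and a plain-load variant of the same routine. */
static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

static void copy_unrolled(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; i++)
        dst[i] = src[i];
}

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Time each candidate a few times, keep its best run, and return the
 * candidate whose best run was shortest. */
static copy_fn pick_fastest(const copy_fn *cands, size_t count)
{
    enum { BUF = 4096, REPS = 64 };
    static uint8_t src[BUF], dst[BUF];
    copy_fn best = NULL;
    uint64_t best_ns = UINT64_MAX;

    assert(count > 0);
    for (size_t c = 0; c < count; c++) {
        uint64_t fastest = UINT64_MAX;
        for (int r = 0; r < REPS; r++) {
            uint64_t t0 = now_ns();
            cands[c](dst, src, BUF);
            uint64_t dt = now_ns() - t0;
            if (dt < fastest)
                fastest = dt;
        }
        if (fastest < best_ns) {
            best_ns = fastest;
            best = cands[c];
        }
    }
    return best;
}
```

A caller would run pick_fastest once at init and store the returned pointer. Taking the minimum over several runs, rather than the mean, is one way to reduce noise from interrupts and cold caches.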
<courmisch> sorear: if/when there is a confirmed RVV 1.0 implementation that has decent segmented loads and stores, and one that doesn't, and if by that point, the kernel has no flag in hwprobe, then yes
<courmisch> in the meantime, it is urgent to wait, as the proverb goes in my mother tongue
<sorear> if everyone makes that decision there won't be a hwprobe flag
<sorear> but yes, not urgent
<courmisch> I don't really think I could convince palmer or conchuod to add a flag for a hypothetical property
<sorear> i think my chances of convincing them with words are significantly lower than my chances of getting a patch accepted
<sorear> although there's no point in writing such a patch until I can test it on hardware
esv has joined #riscv
Andre_Z has joined #riscv
<sorear> re. small pads, right now I don't have soldering skills or soldering equipment so anything too small for me to hand-tape or hand-wire-wrap is too small
<conchuod> Isn't the whole point of a patches changelog to persuade people with words as to somethings merit?
<conchuod> s/es/'s/
<dzaima[m]> ok so I actually wrote a whole perf test suite for vrgather: https://gist.github.com/dzaima/189ca8d5cfd59e866895f5e945483bbd; 930 lines of autogenerated assembly (slightly manually tweaked to remove unwanted setvls) isn't particularly nice though
<conchuod> I'm not sure as to what you want a flag for though, I've been playing video games all day and not reading here :)
<courmisch> conchuod: whether segmented vector loads & stores are fast or not
<courmisch> dzaima[m]: the assembly has a very "disassembly of compiler-generated code" feel to it
<dzaima[m]> it is indeed the output of clang (the input C to clang was autogenerated)
<conchuod> Ah right.
<conchuod> I won't claim to have any opinion at this point :)
<dzaima[m]> sweet, thanks!
GenTooMan has quit [Ping timeout: 240 seconds]
<sorear> I think your definition of VLMAX is off by a factor of 8; VL counts elements at the current SEW, VLEN is in bits
<courmisch> should probably use rdcycle for bench rather than the clock
* sorear trying to figure out exactly what the units are
<dzaima[m]> oh yeah, vlmax is just the number of bits in the vectors in use
<dzaima[m]> units should be nanoseconds per gather invocation
<dzaima[m]> changing the measured unit can be done in the u64 measurement() function
<sorear> so it looks like 2.00/cycle throughput if it's running at stock 1.85GHz
<sorear> 3 cycle latency
<dzaima[m]> yeah
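The unit discussion above can be sketched as follows, assuming the RVV 1.0 definition VLMAX = LMUL * VLEN / SEW and the stock 1.85GHz clock sorear mentions. The helper names vlmax_elems and ns_to_cycles are hypothetical and do not appear in the gist; VLEN=128 in the usage below is only an example value.

```c
#include <assert.h>

/* VLMAX in elements: VLEN (bits) * LMUL / SEW (bits).
 * LMUL is passed as a fraction so that 1/2, 1/4 and 1/8 also work. */
static unsigned vlmax_elems(unsigned vlen_bits, unsigned sew_bits,
                            unsigned lmul_num, unsigned lmul_den)
{
    assert(sew_bits > 0 && lmul_den > 0);
    return vlen_bits * lmul_num / (sew_bits * lmul_den);
}

/* Convert a measured time per invocation (ns) into cycles for an
 * assumed clock frequency in GHz (cycles per ns). */
static double ns_to_cycles(double ns_per_call, double clock_ghz)
{
    return ns_per_call * clock_ghz;
}
```

With these, vlmax_elems(128, 8, 1, 1) gives 16 elements rather than 128 bits, and a measurement of 10 ns per gather at 1.85GHz converts to 18.5 cycles.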
<sorear> dzaima[m]: your qsort comparison function is wrong and producing undefined behavior
<sorear> (b<a) - (a<b) I think?
<dzaima[m]> ..right, I never remember which things need a full comparison and which just a less-than
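A minimal sketch of the comparator shape sorear suggests, assuming the benchmark sorts uint64_t timing samples in ascending order. The names cmp_u64 and median_u64 are hypothetical and not taken from the gist.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparator for qsort: negative, zero or positive.
 * Returning only a less-than flag gives qsort an inconsistent ordering,
 * and subtracting the 64-bit values directly could overflow int. */
static int cmp_u64(const void *pa, const void *pb)
{
    uint64_t a = *(const uint64_t *)pa;
    uint64_t b = *(const uint64_t *)pb;
    return (b < a) - (a < b);
}

/* Sort the samples in place and return the middle one. */
static uint64_t median_u64(uint64_t *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}
```

The expression (b < a) - (a < b) yields -1, 0 or 1 using only comparisons, so it stays correct over the full uint64_t range.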
<sorear> so 32 cycles for all LMUL=8 gathers, 8 cycles for all LMUL=4, 1/2 cycle (t) 3 cycle (l) for LMUL=1, LMUL=2 too noisy to interpret, maybe a fixed sort will help
GenTooMan has joined #riscv
awita has quit [Remote host closed the connection]
<sorear> no effect from pattern although I wonder if VL would affect it
<courmisch> so far most stuff seems to be fastest with LMUL=4
<courmisch> not clear why it gets significantly slower at LMUL=8
<courmisch> it's also kinda annoying, because I'd been hoping that the reasonable thing to do would be to just maximise LMUL for any given function
<courmisch> (unless operating at fixed block size, of course)
<courmisch> and it would be a huge mess to try to guess the correct LMUL depending on CPU at run-time
<sorear> 2.5 (t), 4 (l) or so
<dzaima[m]> updated the gist to add a tested_vl variable that can be changed in the code
matoro_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
matoro has joined #riscv
frkazoid333 has joined #riscv
frkzoid has quit [Ping timeout: 272 seconds]
unsigned has joined #riscv
Andre_Z has quit [Quit: Leaving.]
crabbedhaloablut has quit []
frkazoid333 has quit [Ping timeout: 245 seconds]
pecastro has quit [Ping timeout: 272 seconds]
frkazoid333 has joined #riscv
elastic_dog has quit [Ping timeout: 260 seconds]
elastic_dog has joined #riscv
<muurkha> dzaima[m]: I feel like gcc autovectorization was in very early stages 15 years ago; do you mean it still sucks or that it's very early stages for GCC RISC-V autovectorization support?
<dzaima[m]> RISC-V specifically
<muurkha> whitequark[cis]: I wonder if this represents a minor bug in the Matrix bridge? 16:27 < dzaima[m]> @libera_courmisch:catircservices.org: no, clang uses vlseg2e32 whereas gcc
<muurkha> that doesn't look reasonably formatted for IRC
<dzaima[m]> some testing: @libera_muurkha:catircservices.org @dzaima:matrix.org
<muurkha> courmisch: I'm wondering if maybe the market demand for autovectorization is going away because for the pieces of code where it really matters, someone will write them in assembly manually?
<muurkha> sorear: "panpipe" <3
<sorear> i already reported that glitch elsewhere because it was less than topical here, that may have been a mistake, the response I got was to say that should have been impossible if it were a Matrix reply, then nothing when I pointed out it was never a reply
<dzaima[m]> hmm how do replies get translated?
<dzaima[m]> ok reply target just gets dropped