#riscv on 2023-07-16 — irc logs at libera.irclog.whitequark.org

2023-07-13 23:36 sorear[m] changed the topic of #riscv to: Matrix users: #riscv:libera.chat will be ending operation NET Jul 25; please test #riscv:catircservices.org as a replacement | RISC-V instruction set architecture | https://riscv.org | Logs: https://libera.irclog.whitequark.org/riscv

00:19 Tenkawa has quit [Quit: Was I really ever here?]

00:53 <dzaima[m]> slightly more analysis - clang can and does autovectorize the loop in question when it's by itself (https://godbolt.org/z/T9xKqqfjo), but just does not in the full context where I put it (and I didn't bother verifying it as my ARM "test bench" is my phone, debugging/profiling which is slightly annoying); anyway, very off-topic

00:54 <sorear> I wouldn't say that microarchitectural design space exploration is "very" off-topic

01:04 <muurkha> or even slightly

01:05 <dzaima[m]> well, that was just compiler autovectorization debugging

01:09 <dzaima[m]> for something that's probably more on-topic, here's some.. self-advertisement: I've been working on https://dzaima.github.io/intrinsics-viewer/, a listing of (rvv1.0) C intrinsics with C-like pseudocode description & expectable output assembly; might be helpful for learning, but of course most likely not as accurate as the spec itself

01:20 <muurkha> oh cool!

01:31 TMM_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

01:31 TMM_ has joined #riscv

02:11 knolle has quit [Ping timeout: 245 seconds]

02:12 zBeeble42 has joined #riscv

02:14 zBeeble24 has quit [Ping timeout: 264 seconds]

02:49 rvalles has quit [Ping timeout: 250 seconds]

03:02 rvalles has joined #riscv

03:07 jacklsw has joined #riscv

03:43 jacklsw has quit [Ping timeout: 246 seconds]

04:39 <sorear> https://gist.github.com/sorear/be81ce2725cdc8917973e8d64f40d0b1 if you want to see some appalling ideas with rvv, part of me wants to not explain anything and leave it as a puzzle

04:39 <muurkha> heh

04:40 <sorear> if you have a can-do mindset any instruction can be a mask instruction

04:42 <sorear> i forgot a vmin in the second and don't want to think about how much is wrong in the first

04:54 BootLayer has joined #riscv

05:02 billchenchina has joined #riscv

05:42 MarvelousWololo has quit [Read error: Connection reset by peer]

06:38 crabbedhaloablut has joined #riscv

07:12 <courmisch> sorear: sorry; European bed time. NF as in how many segments? I think 2 is most common (e.g. complex float numbers used in audio). 4 and 8 also occur. NPOT not so much

07:14 <courmisch> sorear: as for ELEN, audio is almost exclusively 32-bit (single precision), non-HDR video is a mix of 8- and 16-bit; HDR video is only 16-bit

07:14 <muurkha> 32-bit audio?

07:14 <courmisch> no, single precision

07:15 <muurkha> single precision is normally 32 bits, isn't it? Even if a few of them are in the exponent

07:15 <courmisch> almost all audio is single precision. JACK, PulseAudio and PipeWire use it. Most multimedia apps too

07:15 <courmisch> most decoders and filters are written for single precision too

07:15 <muurkha> are you saying that almost all audio is floating point?

07:15 <courmisch> in processing, yes

07:16 <muurkha> interesting, I thought that was an unusual niche formt

07:16 <muurkha> *format

07:16 <courmisch> of course what comes from ADC and goes to DAC is integer

07:17 <muurkha> It wouldn't have to be; I mean ADPCM is more like floating point, right?

07:17 <courmisch> FWIW, I think Microsoft's WASAPI also uses float internally

07:18 <muurkha> I think most audio file formats are integer, which is probably why I think of that as being normal

07:18 <courmisch> ADPCM is only found in pre-MP3 era video games

07:18 <muurkha> it was more widely used than that at the time

07:19 <courmisch> telecommunication used G.711, not ADPCM, though they are vaguely similar

07:19 <muurkha> and S1 MP3 players use ADPCM for their voice recording by default

07:20 <muurkha> sure, I guess μlaw and a-law are kind of floaty

07:23 <courmisch> ADPCM doesn't strike me as a voice codec. AFAICT, ADPCM is mostly a trick to reduce bit rate at higher sampling rates. But voice doesn't require high sampling rates in the first place

07:25 <courmisch> or at least the industry thought so, ignoring that more than half the population is female or children, with higher pitch voices, but I blame the previous generation

07:25 <muurkha> well, the S1 MP3 is built around a Z80, so "higher sampling rates" is a matter of perspective :)

07:26 <muurkha> I was going to say "was" but then I remembered that I searched a few weeks ago and apparently they're still for sale

07:26 <courmisch> I mean music rates, 44.1kHz or 48kHz

07:26 <courmisch> voice is usually 8kHz or if you are fancy 16kHz

07:26 <muurkha> yes, IIRC it uses 16ksps ADPCM

07:27 <muurkha> I had some files I recorded on one but I don't know where they are

07:27 <muurkha> not sure why they didn't use something like GSM full rate

07:28 <muurkha> anyway I think there's still a lot of integer PCM data flowing around between programs, even if Pulse does support floats

07:29 <muurkha> audio data I mean

07:29 <courmisch> you can feed PCM to any sane audio output HAL, but it will be converted to float, unless you have monopoly over the audio output device

07:29 <muurkha> you don't need float to do mixing, if that's what you mean, or even dithering

07:30 <muurkha> people like floats for mixing because they don't clip
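
A rough illustration of the clipping point, as a sketch only: the helper names below are made up for this example and do not come from any audio library. Integer mixing has to saturate at full scale, while float mixing can exceed 1.0 and be attenuated later without losing the waveform.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def mix_int16(a, b):
    """Mix two int16 sample lists, saturating to the int16 range."""
    return [max(INT16_MIN, min(INT16_MAX, x + y)) for x, y in zip(a, b)]

def mix_float(a, b):
    """Mix two float sample lists; values beyond +/-1.0 are kept, not clipped."""
    return [x + y for x, y in zip(a, b)]

loud_int = [30000, -30000, 1000]
mixed_int = mix_int16(loud_int, loud_int)        # first two samples saturate

loud_float = [s / 32768 for s in loud_int]
mixed_float = mix_float(loud_float, loud_float)  # exceeds 1.0 but keeps its shape

# Attenuating the float mix by half recovers the original samples exactly;
# the saturated integer mix cannot be repaired the same way.
recovered = [round(s * 0.5 * 32768) for s in mixed_float]
```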

07:30 <courmisch> no, but that ship has sailed

07:30 <courmisch> nobody does audio processing in integer

07:30 <courmisch> unless it's some niche embedded code for an FPU-less device, ofc

07:32 <muurkha> there are more programs in cloud and earth, courmisch, than are dreamed of in your philosophy

07:33 <muurkha> have you used Audacity? Audacity is totally comfortable doing everything in integers

07:33 <muurkha> and that's what it does by default

07:34 <muurkha> not just if you import from integer WAV files or CD-DA. Even if you import from AAC or MP3!

07:35 <muurkha> Try importing some random YouTube video into Audacity and select Effects → Amplify and tell it to amplify it by 60dB. You'll get clipping all over the place, exactly the way you don't with float

07:35 <courmisch> of course if your source is an audio CD or a microphone you'll get PCM. I'm just saying internal processing is utterly dominated by single precision.

07:36 <courmisch> Audacity can do silly things to please the so-called "audiophiles". Unless they write their own decoders, they'll get single precision for most sources, and unless they bypass JACK/PA/PW, their output will be postprocessed in single precision

07:38 <muurkha> I don't think it's to please the audiophiles, though who knows; IEEE-754 single-precision floating point has more bits of mantissa than CD-DA

07:38 <muurkha> I just think it's silly to describe Audacity as "some niche embedded code for an FPU-less device" ;)

07:39 <courmisch> the only people who would get pissy about possibly losing precision due to 32-bit float are audiophiles

07:39 <muurkha> losing precision?

07:39 <courmisch> and those will insist on fixed-point 32-bit processing

07:40 <courmisch> muurkha: the "problem" with float processing is that you have rounding errors, and they're not even the same between implementations

07:40 <muurkha> I've never heard of anyone using fixed-point 32-bit processing

07:40 <courmisch> well, you can't really do processing at 16-bit

07:40 <muurkha> depends

07:40 <muurkha> yes, irreproducibility of floating-point is a problem, but reproducibility is not a major objective for audio processing I think

07:41 <courmisch> and open-source decoders with integer paths typically output fixed-point 32-bit on that path

07:41 <muurkha> interesting, I don't think I've seen that. but there's an enormous amount I haven't seen

07:41 <courmisch> muurkha: I agree, but audiophiles seem to live in an alternate reality where the human ear can tell the difference

07:42 <muurkha> I thought the hip audiophile thing was 24/96: 24-bit precision at 96ksps

07:42 <courmisch> yes, lossless FLAC 24-bit 96kHz

07:42 <courmisch> but that's just the storage format

07:42 <muurkha> but 32-bit floats have a 24-bit significand

07:43 <muurkha> so I don't think you're going to lose precision by going to 32-bit floats
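
The significand argument can be checked with a small sketch. It assumes samples are signed 24-bit integers and uses `struct` to round through IEEE-754 single precision; the helper names are illustrative.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def survives_float32(sample):
    """True if an integer sample round-trips through float32 unchanged."""
    return int(to_float32(float(sample))) == sample

lo, hi = -(1 << 23), (1 << 23) - 1          # signed 24-bit range
edge_cases = [lo, lo + 1, -1, 0, 1, hi - 1, hi]
all_exact = all(survives_float32(s) for s in edge_cases)

# 2**24 + 1 needs 25 significant bits, so it no longer fits.
beyond_24_bits = survives_float32((1 << 24) + 1)
```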

07:43 billchenchina has quit [Quit: Leaving]

07:43 <courmisch> yes they do, don't try reason when it comes to audiophilia

07:43 <muurkha> but yes it's true that on mixing you have rounding errors (but who cares)

07:43 <muurkha> I mean 24 bits of precision is a lot of headroom

07:44 <courmisch> a good human ear cannot even sense 12th bit error

07:44 <muurkha> and floats give you more headroom

07:44 <muurkha> 12 bits is only 72 dB, right?
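
The 72 dB figure follows from the usual rule of roughly 6.02 dB per bit. A minimal worked calculation (the function name is just for illustration):

```python
import math

def dynamic_range_db(bits):
    """Ratio between full scale and one LSB step, in dB: 20*log10(2**bits)."""
    return 20 * math.log10(2 ** bits)

for bits in (12, 16, 24):
    print(bits, "bits:", round(dynamic_range_db(bits), 1), "dB")
```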

07:45 <muurkha> I think there are probably circumstances where you can detect a sound that's 72 dB quieter than another sound that's far away in frequency

07:45 <muurkha> like, if the loud sound is 20 Hz and the quiet sound is 3500 Hz, I think it should be easy

07:45 <courmisch> and IIRC, blind test couldn't differentiate 16-bit and 24-bit recordings

07:46 <courmisch> the only logical reason to use 24-bit recording is HQ postprocessing. But audiophilia is not about logic.

07:46 <muurkha> especially if it's 3500 Hz beating with an equal-amplitude sound at 3600 Hz

07:47 <courmisch> hmm, not sure how the frequencies were spaced, and I could be off by a bit or two

07:47 <courmisch> but the point is that 24-bit is mostly useless (compared to 16-bit)

07:47 <muurkha> you'd need a quiet listening room probably

07:47 <courmisch> and integer processing is a waste of SW developer time, unless your target CPU has no FPU

07:48 <muurkha> or good over-ear headphones

07:48 <muurkha> because otherwise even slight background noise will overwhelm the signal of interest

07:50 <courmisch> and ditto 96kHz/192kHz vs 48kHz. Waste of bits, unless you're going to downrate the sound to make your techno music.

07:50 <muurkha> yeah, the benefit of 96ksps is that you have an extra octave of headroom for your antialiasing filters

07:51 <muurkha> but once you have it in the digital domain you might as well downsample it with the kinds of astoundingly good filters that are easy to do digitally

07:51 <muurkha> unless you're listening to bats or something

07:52 <muurkha> there are some really interesting DSP algorithms you probably aren't aware of that depend on exact pole-zero cancellation and so don't work reliably in floating-point

07:53 <muurkha> Hogenauer filters are probably the most popular ones, and you're undoubtedly surrounded by equipment with Hogenauer filters deeply embedded in it even if you aren't aware of it

07:53 <muurkha> but there are others
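
To make the pole-zero cancellation point concrete, here is a minimal integer model of a Hogenauer (CIC) decimator. It is a sketch under stated assumptions: differential delay of 1, and parameter names (`rate`, `stages`, `width`) chosen for this example. The integrators are allowed to wrap around, and the combs cancel that wraparound exactly because all arithmetic is modulo `2**width`; floating point offers no such guarantee.

```python
def cic_decimate(samples, rate, stages, width=32):
    """CIC decimator using wrapping integer arithmetic.

    Integrators run at the input rate and may overflow; the comb stages,
    run at the output rate, cancel that overflow exactly as long as the
    true result fits in `width` bits.
    """
    mask = (1 << width) - 1
    integrators = [0] * stages
    comb_delay = [0] * stages
    out = []
    for n, x in enumerate(samples):
        acc = x & mask
        for i in range(stages):
            integrators[i] = (integrators[i] + acc) & mask
            acc = integrators[i]
        if (n + 1) % rate == 0:              # keep every `rate`-th sample
            for i in range(stages):
                # comb: current value minus the previous decimated value
                acc, comb_delay[i] = (acc - comb_delay[i]) & mask, acc
            out.append(acc)
    return out

ones = [1] * 400
out = cic_decimate(ones, rate=8, stages=3, width=16)
# The third integrator wraps past 16 bits long before the input ends, yet
# after the start-up transient the output settles to rate**stages == 512.
```

With only 16-bit state the result still matches a run with much wider registers, as long as the final value fits.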

07:59 <muurkha> there's a plot on https://en.wikipedia.org/wiki/Psychoacoustics#Masking_effects showing how masking drops off with Δf

08:00 <muurkha> I was hoping to find a better plot

08:28 knolle has joined #riscv

08:42 BootLayer has quit [Quit: Leaving]

08:44 aburgess_ has joined #riscv

08:44 Trifton_ has joined #riscv

08:45 zjason` has joined #riscv

08:47 Noisytoot has quit [Killed (erbium.libera.chat (Nickname regained by services))]

08:47 Noisytoot has joined #riscv

08:47 TMM__ has joined #riscv

08:48 Leopold_ has joined #riscv

08:48 Esmil_ has joined #riscv

08:48 matoro_ has joined #riscv

08:49 joev1 has joined #riscv

08:49 troglodi1o has joined #riscv

08:53 rvalles has quit [*.net *.split]

08:53 TMM_ has quit [*.net *.split]

08:53 matoro has quit [*.net *.split]

08:53 joev has quit [*.net *.split]

08:53 aburgess has quit [*.net *.split]

08:53 kilobyte_ch has quit [*.net *.split]

08:53 Pierce[m] has quit [*.net *.split]

08:53 Leopold has quit [*.net *.split]

08:53 zjason has quit [*.net *.split]

08:53 FL4SHK has quit [*.net *.split]

08:53 pabs3 has quit [*.net *.split]

08:53 troglodito has quit [*.net *.split]

08:53 Trifton2 has quit [*.net *.split]

08:53 Esmil has quit [*.net *.split]

08:53 whitequark[cis] has quit [*.net *.split]

08:53 _catircservices has quit [*.net *.split]

08:53 FL4SHK has joined #riscv

09:00 _catircservices has joined #riscv

09:00 whitequark[cis] has joined #riscv

09:00 Pierce[m] has joined #riscv

09:00 gordonDrogon has quit [Ping timeout: 250 seconds]

09:00 pabs3 has joined #riscv

09:00 kilobyte_ch has joined #riscv

09:01 rvalles has joined #riscv

09:02 bjoto has quit [Ping timeout: 246 seconds]

09:04 bjoto has joined #riscv

09:09 bjoto has quit [Ping timeout: 258 seconds]

09:11 bjoto has joined #riscv

09:26 gordonDrogon has joined #riscv

09:54 BootLayer has joined #riscv

09:58 bjoto has quit [Ping timeout: 264 seconds]

09:59 bjoto has joined #riscv

10:03 <courmisch> sorear: problem with that non-segmented strategy: it only works one way. There's no widening left shift as far as I can see

10:05 bjoto has quit [Ping timeout: 246 seconds]

10:06 <dzaima[m]> yeah, afaik you have to do that as separate widening & shifting; so all together 4 ops - 2 widens, a shift, and a bitwise-or or blend or something
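
The four-operation sequence can be modelled in plain Python. This is an arithmetic sketch assuming 16-bit inputs packed into 32-bit lanes; it models the data flow only and is not RVV intrinsics code.

```python
def zext_widen(v):
    """Model of a widening zero-extend from 16 to 32 bits."""
    return [x & 0xFFFF for x in v]

def interleave16(lo, hi):
    """Pack lo[i] into bits 0..15 and hi[i] into bits 16..31 of each lane."""
    wide_lo = zext_widen(lo)                                # widen #1
    wide_hi = zext_widen(hi)                                # widen #2
    shifted = [(x << 16) & 0xFFFFFFFF for x in wide_hi]     # shift
    return [a | b for a, b in zip(wide_lo, shifted)]        # bitwise-or

packed = interleave16([0x1234, 0xFFFF], [0xABCD, 0x0001])
```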

10:07 <courmisch> yikes

10:08 bjoto has joined #riscv

10:10 <dzaima[m]> there's a widening shift left in some proposed extension - https://github.com/riscv/riscv-crypto/blob/v20230620/doc/vector/insns/vwsll.adoc but that's of course an extension, and also isn't yet ratified

10:19 whitequark has left #riscv [#riscv]

10:49 jrtc27[m] has quit [Quit: Bridge terminating on SIGTERM]

10:49 [0x4A6F][m] has quit [Quit: Bridge terminating on SIGTERM]

10:49 _catircservices has quit [Quit: Bridge terminating on SIGTERM]

10:49 sorear[m] has quit [Quit: Bridge terminating on SIGTERM]

10:49 psydroid[m] has quit [Quit: Bridge terminating on SIGTERM]

10:49 Pierce[m] has quit [Quit: Bridge terminating on SIGTERM]

10:49 whitequark[cis] has quit [Quit: Bridge terminating on SIGTERM]

10:49 dzaima[m] has quit [Quit: Bridge terminating on SIGTERM]

10:49 _catircservices has joined #riscv

10:49 sorear[m] has joined #riscv

10:50 _catircservices has quit [Remote host closed the connection]

10:50 sorear[m] has quit [Remote host closed the connection]

10:50 _catircservices has joined #riscv

10:50 sorear[m] has joined #riscv

10:51 whitequark[cis] has joined #riscv

10:51 jrtc27[m] has joined #riscv

10:51 dzaima[m] has joined #riscv

10:52 Pierce[m] has joined #riscv

10:52 [0x4A6F][m] has joined #riscv

10:52 Pierce[m] has quit [Remote host closed the connection]

10:52 whitequark[cis] has quit [Remote host closed the connection]

10:52 jrtc27[m] has quit [Remote host closed the connection]

10:52 [0x4A6F][m] has quit [Remote host closed the connection]

10:52 sorear[m] has quit [Remote host closed the connection]

10:52 _catircservices has quit [Remote host closed the connection]

10:52 dzaima[m] has quit [Remote host closed the connection]

10:52 _catircservices has joined #riscv

10:52 sorear[m] has joined #riscv

10:53 whitequark[cis] has joined #riscv

10:53 jrtc27[m] has joined #riscv

10:53 dzaima[m] has joined #riscv

10:53 Pierce[m] has joined #riscv

10:53 [0x4A6F][m] has joined #riscv

10:53 psydroid[m] has joined #riscv

10:56 _catircservices has quit [Remote host closed the connection]

10:56 [0x4A6F][m] has quit [Remote host closed the connection]

10:56 jrtc27[m] has quit [Remote host closed the connection]

10:56 psydroid[m] has quit [Remote host closed the connection]

10:56 whitequark[cis] has quit [Remote host closed the connection]

10:56 sorear[m] has quit [Remote host closed the connection]

10:56 Pierce[m] has quit [Remote host closed the connection]

10:56 dzaima[m] has quit [Remote host closed the connection]

10:56 _catircservices has joined #riscv

10:56 sorear[m] has joined #riscv

10:57 whitequark[cis] has joined #riscv

10:58 jrtc27[m] has joined #riscv

10:58 dzaima[m] has joined #riscv

10:58 Pierce[m] has joined #riscv

10:58 [0x4A6F][m] has joined #riscv

10:58 psydroid[m] has joined #riscv

11:08 sorear[m] has quit [Quit: Bridge terminating on SIGTERM]

11:08 _catircservices has quit [Quit: Bridge terminating on SIGTERM]

11:08 whitequark[cis] has quit [Quit: Bridge terminating on SIGTERM]

11:08 dzaima[m] has quit [Quit: Bridge terminating on SIGTERM]

11:08 jrtc27[m] has quit [Quit: Bridge terminating on SIGTERM]

11:08 psydroid[m] has quit [Quit: Bridge terminating on SIGTERM]

11:08 Pierce[m] has quit [Quit: Bridge terminating on SIGTERM]

11:08 [0x4A6F][m] has quit [Quit: Bridge terminating on SIGTERM]

11:08 _catircservices has joined #riscv

11:09 sorear[m] has joined #riscv

11:10 whitequark[cis] has joined #riscv

11:10 jrtc27[m] has joined #riscv

11:10 dzaima[m] has joined #riscv

11:10 Pierce[m] has joined #riscv

11:10 [0x4A6F][m] has joined #riscv

11:10 psydroid[m] has joined #riscv

11:15 sorear[m] has quit [Remote host closed the connection]

11:15 _catircservices has quit [Remote host closed the connection]

11:15 Pierce[m] has quit [Remote host closed the connection]

11:15 dzaima[m] has quit [Remote host closed the connection]

11:15 [0x4A6F][m] has quit [Remote host closed the connection]

11:15 psydroid[m] has quit [Remote host closed the connection]

11:15 jrtc27[m] has quit [Remote host closed the connection]

11:15 whitequark[cis] has quit [Remote host closed the connection]

11:15 _catircservices has joined #riscv

11:15 sorear[m] has joined #riscv

11:16 whitequark[cis] has joined #riscv

11:16 jrtc27[m] has joined #riscv

11:16 dzaima[m] has joined #riscv

11:16 Pierce[m] has joined #riscv

11:16 [0x4A6F][m] has joined #riscv

11:16 psydroid[m] has joined #riscv

11:21 junaid_ has joined #riscv

11:22 junaid_ has quit [Client Quit]

11:29 Andre_Z has joined #riscv

11:41 Tenkawa has joined #riscv

11:43 junaid_ has joined #riscv

12:08 pecastro has joined #riscv

12:24 awita has joined #riscv

12:30 raghavgururajan has joined #riscv

13:11 BootLayer has quit [Quit: Leaving]

13:57 cwebber`` is now known as cwebber

13:57 cwebber has quit [Changing host]

13:57 cwebber has joined #riscv

14:10 unsigned has quit [Quit: .]

14:21 BootLayer has joined #riscv

14:26 ximet43 has quit [Read error: Connection reset by peer]

14:28 ximet43 has joined #riscv

14:41 Andre_Z has quit [Quit: Leaving.]

14:43 <sorear> you can use vwadd.vw

14:44 <sorear> instead of vwsll you can try using vwmul

14:46 <sorear> vwmaccu.vx _almost_ does what you want but it requires one of the inputs to be pre-widened

14:46 <sorear> given that Zbb has PACK this is a kinda obvious omission from Zvbb !

14:46 <sorear> or ignore the problem. stores are a fair amount rarer than loads and having them slow is less of a problem

14:46 <dzaima[m]> oh yeah vwadd.vw is nice for merging widen&or. vwmul doesn't work though, as you'd need the constant multiplicand to be wider than the half-width type
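
A sketch of the arithmetic behind these two remarks, again as a plain-integer model rather than real vector code. The first function folds "widen lo" and "or" into one wide-plus-narrow add. The second shows why a widening multiply by 2^16 is out of reach with a 16-bit multiplicand, together with one workaround that is not from the chat and is included here only as an assumption-labelled illustration: `lo + hi + 0xFFFF*hi` equals `lo + 65536*hi`.

```python
MASK16, MASK32 = 0xFFFF, 0xFFFFFFFF

def widening_add_wv(wide, narrow):
    """32-bit lanes plus zero-extended 16-bit lanes (wide + narrow add)."""
    return [(w + (n & MASK16)) & MASK32 for w, n in zip(wide, narrow)]

def interleave_add(lo, hi):
    """Widen and shift hi, then add narrow lo straight into the wide lanes."""
    shifted = [(h & MASK16) << 16 for h in hi]     # widen + shift (2 ops)
    return widening_add_wv(shifted, lo)            # add == or: low 16 bits are 0

def interleave_mul_trick(lo, hi):
    """Hypothetical workaround: widening add, then widening multiply-accumulate.

    65536 does not fit in a 16-bit multiplicand, but 0xFFFF does, and
    (lo + hi) + 0xFFFF * hi == lo + 65536 * hi.
    """
    acc = [(l & MASK16) + (h & MASK16) for l, h in zip(lo, hi)]
    return [(a + MASK16 * (h & MASK16)) & MASK32 for a, h in zip(acc, hi)]

lo, hi = [0x1234, 0xFFFF], [0xABCD, 0xFFFF]
```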

14:47 Danidada has joined #riscv

14:47 <sorear> does ffmpeg have any mechanism for "feature X exists on uarch Y but is slow, don't use it"? it's not like intel hasn't done this more than once

14:47 <sorear> looking at you maskmov

14:48 <sorear> (anyone like my hell loops?)

14:48 junaid_ has quit [Remote host closed the connection]

14:49 <sorear> is ffmpeg likely to use Zvfh (_Float16) anywhere?

14:50 <dzaima[m]> very "fun" is that AVX2 maskmov store is still very slow on Zen 4, despite an AVX512 masked store being equivalent and fast

14:52 * cousteau_ wonders if C compilers are likely to support a `short float` type, and stuff like scanf("%hf")

14:54 <sorear> what would "short float" do now that ISO has standardized _Float16?

14:58 <courmisch> sorear: there are such flags for SSE2, SSE3, SSSE3 and AVX, yes

14:58 <courmisch> but currently, there'd be no ways to populate them on RV

14:59 <courmisch> sorear: multimedia doesn't use half-precision for anything AFAIK, nor the other AI variants of 16-bit floats.

15:01 <cousteau_> sorear: I guess the same as `bool` or `complex`

15:01 <courmisch> sorear: maybe GPUs have some use for video filtering? maybe check the DRM chroma list. but that would probably go via GL or VK rather than the vector unit

15:01 <cousteau_> except that, uh, they're not macros/typedefs

15:01 <cousteau_> *it's not

15:02 <sorear> _Float16 isn't a typedef

15:02 <cousteau_> no, I meant that `bool` is

15:02 <cousteau_> but `short float` wouldn't

15:02 <courmisch> sorear: I am a bit suspicious of using a multiply accumulate for a shift-and-or; multiplications tend to be slower

15:02 <sorear> ISO 18661-3-2015 doesn't define printf/scanf codes, it has a new "strfromf16" function...

15:02 <cousteau_> so you can `typedef _Bool bool;` and `#define complex _Complex`, but you can't `typedef _Float16 short float;`

15:03 <sorear> courmisch: you're already writing single-purpose code for the c910...?

15:03 <courmisch> no, I'm writing RVV 1.0 code then postprocessing it to benchmark on C910

15:03 <courmisch> for lack of a better alternative

15:03 <cousteau_> but in any case, I'd expect compilers to support `short float` at least as an extension, and maybe in a future as a base type

15:04 <sorear> interesting that ffmpeg went with exclusive use of array-of-structs, I don't think SSE supports segmented operations well

15:04 <courmisch> interleaved is the de facto standard for audio

15:04 <sorear> I wouldn't expect that, every type that has been added since "long long" in C99 has been a single keyword

15:04 <courmisch> I have seen planar audio, but it's really rare

15:05 <sorear> externally, sure, but I meant for internal variables in the AAC codec or whatever you were working on

15:05 <courmisch> I think those were complex numbers, though I have not checked the "upper layers"; just ported the reference C

15:05 <sorear> deplanarize it at the same time you convert from float32 to int16 for the DAC
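
A minimal sketch of that combined step, assuming planar channels arrive as lists of floats in [-1.0, 1.0]. The function name and the choice of scaling by 32767 are assumptions for illustration; real audio stacks differ on the exact scale factor.

```python
def planar_float_to_interleaved_s16(channels):
    """Interleave planar float channels into one int16 list, clipping to full scale."""
    out = []
    for frame in zip(*channels):             # one sample from each channel
        for s in frame:
            v = round(s * 32767)             # assumed scale factor
            out.append(max(-32768, min(32767, v)))
    return out

left = [0.0, 1.0]
right = [-1.0, 2.0]                          # 2.0 is out of range and gets clipped
pcm = planar_float_to_interleaved_s16([left, right])
```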

15:06 <courmisch> that would be inside JACKd/PA/PW outside the reach of FFmpeg

15:07 <courmisch> I don't think PA and PW even support planar audio

15:07 <courmisch> (could be wrong)

15:07 <courmisch> but converting FFmpeg and all the apps using FFmpeg is essentially impossible at this point

15:08 <sorear> how do you handle NPOT channels?

15:08 <courmisch> you mean surround? that's also interleaved. I haven't touched the filtering code yet though

15:09 <courmisch> typically, surround is pass-through anyway, thanks to Dolby / DTS insanity

15:10 <cousteau_> sorear: well, C99 was the last standard I really understood so...

15:11 <courmisch> C99 doesn't have atomics and alignment. Useless.

15:16 <courmisch> sorear: I am a bit confused why so many opcodes were taken for segment support if it's not implemented properly by anybody

15:16 <sorear> huh, c23 has scanf codes for float, double, long double, _Decimal32, _Decimal64, and _Decimal128 but not the sized binary float types

15:16 <courmisch> we'll see what SiFive did, whenever they actually release a CPU

15:16 <sorear> courmisch: i think you should test more than one implementation before declaring that it's not implemented properly by anybody

15:17 aerkiaga has joined #riscv

15:17 <sorear> (and the spec doesn't actually say anything about which operations are required to be fast)

15:18 <courmisch> sorear: sure. But for comparison, arm only supports 2-4 segments. They went out of their way and consumed one more bit

15:19 <courmisch> also if the plan is to enable compilers to autovectorise, segmented loads are probably unavoidable

15:20 <sorear> sure. and you did say that a trivial loop with a segmented load and 1.5 FLOPs per loaded value was only slightly slower than the scalar loop

15:21 <sorear> it's not like you're hitting m-mode emulation code here, like what happens when you try to take advantage of required support for misaligned memory access in the unprivileged spec on any sifive cpu

15:21 <courmisch> yes, it's not catastrophically slow

15:22 <courmisch> but if it's slower than scalar, even if it's faster than manually shuffling bits

15:23 <courmisch> it's still kinda useless. We'll just have to see what anybody other than T-Head did

15:23 <sorear> can you put in a tune flag for avoiding/using zvlsseg, to use 0.7.1 terminology?

15:24 <sorear> if anything I'm surprised t-head implemented zvlsseg if they made it no faster than independent strided loads

15:24 <courmisch> you mean FFmpeg CPU flag?

15:24 <courmisch> if __riscv_hwprobe hypothetically defined a corresponding flag, yes

15:25 <cousteau_> may I ask why the sudden interest in ffmpeg? I've seen it popping up a lot here recently

15:25 <cousteau_> is this because something in the vector extension may be of interest for ffmpeg audio/video codecs, and there's some collaboration going on, or intended to be?

15:25 <courmisch> it's one of the most assembler-intensive of the popular OSS libraries

15:26 <cousteau_> oh I see

15:26 <cousteau_> so it's being used as an example?

15:26 <courmisch> meanwhile OpenSSL and Nettle probably can't do much until the crypto vector stuff is out

15:28 <courmisch> I haven't seen FFmpeg pop up much recently here. It's just the same segmented load discussion that's been going on for a few days

15:28 <courmisch> and neither FFmpeg nor any other multimedia libraries are included in RISE :(

15:28 <cousteau_> I'm no crypto expert but is it really that important? I saw a lot of noise about Zvk* (whereas Zk* seems to be pretty established by now)

15:32 <courmisch> AFAIU, there are two parts to Zvk. One where they just vectorise the Zk stuff like CLMUL

15:32 <courmisch> and one where they implement specific algorithm rounds directly in the CPU

15:32 <courmisch> the latter should obviously be a hell of a lot faster than Zk

15:32 <courmisch> not that I'd know

15:35 jacklsw has joined #riscv

15:38 <cousteau_> I see

15:39 cousteau_ is now known as cousteau

15:39 <cousteau> the way you put it, I'd expect that first part to just be the same instructions implemented differently

15:40 <cousteau> (although maybe there are some minor differences other than the ABI that do matter, such as timing... part of Zk* is "data-independent timing" so I guess that matters)

15:40 <courmisch> that's like comparing I and V

15:40 <courmisch> V does not really add anything that I/ or M can't do. It's just faster if used right

15:42 MarvelousWololo has joined #riscv

15:45 <cousteau> courmisch: ah. I misunderstood that Zvk would just accelerate instructions that Zk already did

15:59 joev1 has quit [Ping timeout: 252 seconds]

16:00 joev1 has joined #riscv

16:04 <dzaima[m]> fwiw, current status of autovectorization of the original question ffmpeg loop: https://godbolt.org/z/xPecPnYna - clang uses vlseg2e32 whereas gcc does vrgather-s, clang always has a scalar loop for the tail, gcc sucks at eliminating setvl-s, both unroll 2x despite usually not doing that in my experience (might make sense if for gcc splitting up higher-lmul gathers, but no it's still LMUL=1 which is its usual LMUL; clang's at LMUL=2 which is also its usual LMUL)

16:05 <cousteau> I think RISC-V should add carry-less addition and subtraction in addition to multiplication

16:05 <cousteau> as a joke

16:06 <cousteau> see how long it takes people to realize that's just XOR
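
The joke checks out. In the sketch below (names are illustrative), "carry-less addition" is written the long way, adding bit by bit and discarding every carry, and it agrees with XOR; a bitwise carry-less multiply is included for comparison.

```python
def carryless_add(a, b, width=32):
    """Add bit by bit and drop every carry."""
    result = 0
    for i in range(width):
        bit_sum = ((a >> i) & 1) + ((b >> i) & 1)   # 0, 1 or 2
        result |= (bit_sum & 1) << i                # keep the sum bit only
    return result

def clmul(a, b):
    """Carry-less multiply: shift-and-add where the 'add' is XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

pairs = [(0, 0), (5, 3), (0xFFFF, 0x0F0F), (0xDEADBEEF, 0x12345678)]
same_as_xor = all(carryless_add(x, y) == x ^ y for x, y in pairs)
```

Carry-less subtraction comes out the same way, since dropping borrows also leaves only the bitwise difference.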

16:13 frkzoid has joined #riscv

16:14 frkazoid333 has quit [Ping timeout: 264 seconds]

16:16 <sorear> clmul is Zbc, not Zk*

16:18 <sorear> Zk* implements round function bits, but it's limited by your datapath width ... Zkned needs 16 instructions for an AES round on rv32ik, 4 on rv64ik

16:18 <sorear> Zkt is not really fit for purpose because it doesn't specify that loads and stores are independent of the value loaded or stored, so if you're treating it as a manual you can't store keymat in memory

16:19 <sorear> only useful for secrets born in registers (via Zkr) and dying in registers

16:21 <sorear> courmisch: so no ffmpeg mechanism for setting flags other than architectural features on the basis of vendor/arch/impl, runtime benchmarking, or configuration?

16:23 <courmisch> sorear: you can set a flag based on whatever, but well, you need to detect the correct value somehow
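
One way to "detect the correct value" is to time candidate implementations at start-up and remember the winner. The sketch below is generic Python using a wall-clock timer, not FFmpeg's mechanism and not `rdcycle`; every name in it is hypothetical.

```python
import time

def pick_fastest(variants, sample_input, repeats=5):
    """Time each candidate on sample_input and return the name of the fastest."""
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter_ns()
            fn(sample_input)
            elapsed = min(elapsed, time.perf_counter_ns() - start)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

def deinterleave_slices(xs):
    return xs[0::2], xs[1::2]

def deinterleave_loop(xs):
    evens, odds = [], []
    for i in range(0, len(xs) - 1, 2):
        evens.append(xs[i])
        odds.append(xs[i + 1])
    return evens, odds

variants = {"slices": deinterleave_slices, "loop": deinterleave_loop}
data = list(range(10000))
chosen = pick_fastest(variants, data)       # result depends on the machine
```

Taking the minimum over a few repeats reduces noise from scheduling; which variant wins is machine-dependent, which is the point.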

16:25 <courmisch> dzaima[m]: interesting. Well I doubt that vrgather is fast. It sounds like the one instruction that can't really be fast

16:26 <sorear> I mean it's easy enough to detect "vlseg2e32 is slower than vle64 + vnsrl + vnsrl" if you have access to rdcycle
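
The alternative sequence mentioned here can be modelled as follows: load interleaved 32-bit pairs as 64-bit elements, then take two narrowing right shifts (by 0 and by 32). This is a data-flow sketch that assumes little-endian layout; the helper names only echo the instructions they model.

```python
def load_e64(words32):
    """View pairs of little-endian 32-bit words as 64-bit elements."""
    return [lo | (hi << 32) for lo, hi in zip(words32[0::2], words32[1::2])]

def narrowing_srl(wide, shift):
    """Shift each 64-bit element right and keep the low 32 bits."""
    return [(w >> shift) & 0xFFFFFFFF for w in wide]

interleaved = [10, 11, 20, 21, 30, 31]      # re0, im0, re1, im1, ...
wide = load_e64(interleaved)
evens = narrowing_srl(wide, 0)              # first field of each pair
odds = narrowing_srl(wide, 32)              # second field of each pair
```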

16:26 <courmisch> sorear: "if you have access to rdcycle" well, I think the kernel is removing that access

16:26 <dzaima[m]> yeah. gcc autovectorization is very early stages afaict, so that's likely just it not being particularly complete

16:26 <sorear> although it might not be safe to assume that the behavior is the same for all element sizes, and linux is phasing out the initial scounteren setup

16:27 <courmisch> dzaima[m]: IMO, segmented loads are correct here. It's Clang that's using gather where it's easily avoided

16:27 <dzaima[m]> @libera_courmisch:catircservices.org: no, clang uses vlseg2e32 whereas gcc uses vrgather

16:27 <sorear> gcc autovectorization actually uses VL so in my book it's way ahead of clang

16:27 <courmisch> dzaima[m]: ah ok, nvmd

16:29 <dzaima[m]> (ugh, bridge doesn't convert user links properly matrix→irc)

16:29 <courmisch> sorear: seems more like a case of neither of them being mature, with different problems

16:30 <dzaima[m]> clang's presumably using the same core as ARM SVE which doesn't have VL; gcc is a new vectorizer afaik

16:31 <cousteau> sorear: isn't clmul also part of Zbk* ?

16:31 <cousteau> (probably Zbkc)

16:31 <sorear> maybe

16:32 <sorear> clang's IR has %evl arguments on every vector-predication intrinsic function which aren't being generated on any architecture...

16:34 <cousteau> https://github.com/riscv/riscv-crypto/blob/master/doc/scalar/riscv-crypto-scalar-zbkc.adoc

16:35 <cousteau> apparently it used to be called "Zkg"

16:41 <courmisch> dzaima[m]: SVE doesn't have VL, but it has WHILExx doing almost the same though. But maybe they determined that unrolling was faster than using VL style due to lack of LMUL

16:43 Narrat has joined #riscv

16:44 Tenkawa has quit [Quit: Was I really ever here?]

16:46 <dzaima[m]> hmm yeah; gcc also appears to use whilelo: https://godbolt.org/z/9szqGoxKG

16:49 <courmisch> and GCC is not really using VL on RVV

16:49 <courmisch> it's hard-coding the AVL to 8 for no obvious reasons

16:50 <courmisch> ah, maybe missing restrict?

16:50 <courmisch> nah, I don't know

16:50 <dzaima[m]> that may be because of.. --param=riscv-autovec-preference=fixed-vlmax - change that to scalable and it should probably scale, at the expense of even more output

16:52 <courmisch> not sure if becoming a compiler developer is insanity or job security

16:52 <courmisch> probably both

16:52 <dzaima[m]> gcc's scalable code is usually a lot better though; the deinterleaving hits a pretty bad case

16:53 <courmisch> I guess LLVM just took it from NEON, not SVE

16:53 <courmisch> that would actually explain the unrolling

16:54 <dzaima[m]> this is the first case I've seen LLVM unroll though

16:54 <courmisch> doesn't it use NEON with 2x unroll on Armv8 for this function?

16:55 <dzaima[m]> on NEON clang does unroll usually, yes

16:56 <dzaima[m]> but for rvv it doesn't unroll even the simplest loops - https://godbolt.org/z/q1YqG3noc

16:57 <courmisch> somebody should teach it vmv.v.i

16:57 <dzaima[m]> that's a float 1.0, not integer

16:58 <dzaima[m]> it does indeed vmv.v.i v8, 1 for an integer 1

16:59 <sorear> it's still painful to look at scalar epilog loops for rvv

16:59 <dzaima[m]> yeah

17:00 <courmisch> so hmm, is there any way to load a vector *backward* other than vlse with a negative stride?

17:01 <courmisch> a loop over vslide, but that's probably even slower

17:01 <dzaima[m]> vrgather is always an option

17:02 <courmisch> vid.v; vneg; vrgather, hmm
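
A plain-Python model of the gather-based reversal. It assumes vl equals VLMAX and 16-bit index elements, and it offsets the negated index by vl-1 (a reverse subtract), since a bare negation wraps to a huge unsigned index and gathers read 0 for out-of-range indices. Function names mirror the instructions they model but are otherwise illustrative.

```python
SEW = 16                                     # assumed index element width

def vid(vl):
    """Element indices 0, 1, ..., vl-1."""
    return list(range(vl))

def vrsub_vx(v, x):
    """Reverse subtract: x - v[i], wrapped to SEW bits like an unsigned lane."""
    return [(x - e) & ((1 << SEW) - 1) for e in v]

def vrgather(src, idx):
    """out[i] = src[idx[i]], or 0 when the index is out of range."""
    vlmax = len(src)
    return [src[j] if j < vlmax else 0 for j in idx]

def reverse(src):
    idx = vrsub_vx(vid(len(src)), len(src) - 1)    # (vl-1) - i
    return vrgather(src, idx)

reversed_vec = reverse([1, 2, 3, 4])
bare_negation = vrgather([1, 2, 3, 4], vrsub_vx(vid(4), 0))  # only lane 0 is in range
```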

17:02 <dzaima[m]> probably would want to have separate tail handling though, otherwise the index generation has to be in the loop

17:08 <sorear> panpipe is definitely going to optimize strides of -1,0,1 into unit-stride memory operations btw

17:09 <sorear> mostly because it's extremely easy to generate a variable-but-1 stride in FORTRAN

17:09 <sorear> i think that everything in lapack is a strided load but I haven't checked the actual binary

17:11 jacklsw has quit [Ping timeout: 272 seconds]

17:12 <sorear> how _is_ vrgather performance on c910?

17:12 <dzaima[m]> here's a vrgather-based array reverse impl, both in-place and out-of-place in one, in autogenerated C intrinsics, that I wrote as an experiment, and appears to possibly work with some simple tests in qemu - https://godbolt.org/z/Edf8addjK

17:13 <sorear> suddenly reminded of / wondering when we're going to be able to inline memcpy as 3 instructions for len<=128

17:14 <sorear> vsetivli;vle8;vse8 compares rather favorably to auipc;jalr ...

17:14 <sorear> maybe I'll turn the json parser into something testable

17:17 <courmisch> is it legal to use an odd register number as a wide operand if LMUL is fractional?

17:19 <courmisch> sorear: implying you don't have a C910 lying around? I can run a test case if you have one

17:20 <dzaima[m]> clang does some... very funky things with memcpy: https://godbolt.org/z/b3qM78TaY

17:25 <sorear> courmisch: i ordered a lp4a but it won't be here for two weeks

17:26 <sorear> courmisch: I would say so - my understanding of the rules on register number is that they are based on EMUL, not LMUL
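
The EMUL-based rule sorear describes can be sketched as a small check. This assumes alignment is the only constraint being tested (the spec's source/destination overlap rules are ignored), and the function names are hypothetical.

```python
from fractions import Fraction

def emul_for_wide_operand(lmul):
    """The 2*SEW operand of a widening op occupies EMUL = 2 * LMUL registers."""
    return 2 * Fraction(lmul)

def register_number_ok(vreg, emul):
    """A register group needs alignment only when EMUL > 1 (multiple of EMUL)."""
    emul = Fraction(emul)
    if not 0 <= vreg < 32:
        return False
    if emul <= 1:
        return True  # single register (or a fraction of one): any number is fine
    return vreg % int(emul) == 0
```

Under this model, with LMUL=1/2 the wide operand has EMUL=1, so an odd register number such as v5 passes; with LMUL=1 the wide operand has EMUL=2 and v5 is rejected.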

17:28 <sorear> dzaima[m]: SLP was a mistake

17:29 <dzaima[m]> I can hack up some vrgather tests if desired; I'm also quite curious how it'd perform (and also don't have any risc-v hardware)

17:30 <sorear> it's not clear from the photos and the pdfs how I'm supposed to run my own S/M-mode code on the lp4a but I'll probably figure it out when I have it

17:38 <courmisch> mine has broken fastboot

17:38 <courmisch> I had to use unofficial U-boot's fastboot to flash it

17:38 <courmisch> to run your own S mode code, you can just replace the kernel image in u-boot, I think. Custom M mode, I dunno

17:38 joev1 has quit [Ping timeout: 245 seconds]

17:38 <sorear> what hardware do you use to flash it?

17:38 <courmisch> USB C cable

17:39 joev1 has joined #riscv

17:39 <courmisch> since it's built-in MMC, you can't do the usual dd from your desktop computer

17:40 <courmisch> but judging by your timeline, you just got the first nonbeta while I have the last beta

17:41 <courmisch> in principle, you should be able to overwrite the u-boot partition from within Linux. But don't come crying to me if you brick the device...

17:41 <sorear> haven't done anything with u-boot before

17:42 <sorear> does "unofficial U-boot's fastboot" imply that you replaced the U-boot image on yours?

17:42 <courmisch> from U0 TTL serial port, I just typed "fastboot usb 0" + Enter, and that got me a somewhat working fastboot gadget

17:43 <courmisch> again, that's not the official method, which is to hold the BOOT button while attaching power (never worked for me)

17:44 <courmisch> sorear: no, there is a built-in fastboot implementation inside u-boot, and Sipeed didn't proactively remove it

17:52 TMM__ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

17:53 TMM_ has joined #riscv

17:53 <sorear> courmisch: do you know how the beta/official thing works? do you have a beta version?

18:00 Narrat has quit [Quit: They say a little knowledge is a dangerous thing, but it's not one half so bad as a lot of ignorance.]

18:11 <courmisch> sorear: I guess I have a beta because it shipped before they announced the release, but :shrug:

18:12 <sorear> i think i accidentally got a beta, the only one listed on the store when i ordered was the "8+8"

18:12 <sorear> if it was _clearly labeled_ as a beta i would have waited a few days

18:14 <sorear> online docs say that the release versions have dip switches to boot from sd card, would be a nice simple way to run M-mode code without worrying about bricking it

18:14 <courmisch> when I bought, they were only selling the beta specs. So I guess I have a beta from inventory, even though beta was officially only for preorders

18:14 <sorear> fairly clear that the fastboot "boot" image includes opensbi, which means I can replace the M-mode code, although there's no obvious way to restore if it won't load u-boot

18:15 <courmisch> is it so that only u-boot SPL and OpenSBI get M, while "normal" u-boot runs at S mode?

18:16 <sorear> hey, if you are writing in a dialect where you can translate to xtheadv for testing, would it make sense to include that in the build system for users?

18:16 <sorear> (SPL, opensbi, normal uboot) that is my understanding, yes

18:16 <courmisch> and OpenSBI monopolises M mode, and protects itself with PMP from S mode?

18:17 <courmisch> I'm very unfamiliar with the privileged RISC-V stuff

18:17 <sorear> I'll probably patch opensbi to include a backdoor, then not touch the opensbi on flash again

18:18 <sorear> that or - any idea how to use the jtag?

18:19 <courmisch> to paraphrase my go-to-colleague for electronic questions, using JTAG would require some very fine soldering

18:19 <courmisch> tl;dr: no

18:20 <sorear> oh, you have to solder on your own connector? :/

18:20 <courmisch> that's what he said. I took his word at face value because my electronic-fu is nonexistent

18:21 <courmisch> like managing to get a serial console from GPIO is an achievement for me

18:23 <courmisch> I'm not using any dialect. Just making tons of macros to convince gas to assemble V 0.7.1, XTheadBA and XTheadBB instead of V 1.0, Zba and Zbb

18:24 <courmisch> then a pile of hacks directly in the code for the stuff that just can't be done that way

18:32 BootLayer has quit [Quit: Leaving]

18:54 joev1 has quit [Ping timeout: 245 seconds]

18:55 joev1 has joined #riscv

18:59 rory_be has quit [Read error: Connection reset by peer]

19:00 rory_be has joined #riscv

19:04 joev1 has quit [Ping timeout: 272 seconds]

19:05 joev1 has joined #riscv

19:06 <sorear> courmisch: any expectation of committing those hacks?

19:09 esv_ has joined #riscv

19:11 esv__ has joined #riscv

19:11 esv has quit [Ping timeout: 245 seconds]

19:14 esv_ has quit [Ping timeout: 245 seconds]

19:15 esv has joined #riscv

19:17 esv__ has quit [Ping timeout: 245 seconds]

19:18 <courmisch> sorear: nothing grandiose there https://git.remlab.net/gitweb/?p=ffmpeg.git;a=shortlog;h=refs/heads/thead

19:21 <sorear> how feasible is it to do benchmarking on startup for microarchitectural characteristics? using rdcycle, rdtime, clock_gettime, perf_event_open, whatever

19:24 esv has quit [Ping timeout: 272 seconds]

19:24 <courmisch> if you mean the stock firmware, it's some heavily patched 5.10

19:24 <courmisch> so rdcycle and rdtime work

19:24 <courmisch> perf doesn't

19:24 <whitequark[cis]> courmisch: pic of very fine soldering?

19:24 <courmisch> I recommend upgrading the firmware to the last vendor release, but that's still 5.10 anyway

19:24 <courmisch> whitequark[cis]: he said it would be needed; we didn't actually do it

19:24 <sorear> perf should work iff rdcycle doesn't, so you might be able to try perf_event_open and fall back to rdcycle on -ENOSYS

19:25 <sorear> still stupid

19:25 <courmisch> I have a MR on FFmpeg for that, yes

19:25 <sorear> for checkasm

19:25 <whitequark[cis]> courmisch: I'm just curious what the footprint is

19:25 <courmisch> whitequark[cis]: sorry :shrug:

19:26 <whitequark[cis]> (people's classification of soldering jobs varies immensely; for some anything SMD is too hard, some shrug off doing pads that are 300x300 micron)

19:26 <courmisch> sorear: that's the only use of rdcycle in the FFmpeg code base

19:26 <sorear> with the usual caveat that I barely have a clue what ffmpeg is, I don't think your checkasm MR is usable for function selection during normal usage of the library/application?

19:27 <courmisch> by all means, if you want to play with RVV, pick whatever project you are comfortable with

19:28 <gurki> I'd expect some basic math lib to yield more meaningful results

19:28 <gurki> linpack if you want some benchmark numbers

19:28 <sorear> I'm suggesting that instead of just checking hwprobe, check hwprobe and also run some small benchmark routines to select the fastest X on your machine
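
The startup-benchmark idea can be sketched as follows. This is only an illustration of the selection logic, assuming `time.perf_counter_ns` as the clock (a real implementation would more likely use rdcycle or perf_event_open, as discussed above); `pick_fastest`, `sum_loop`, and `CANDIDATES` are hypothetical names, and the two summation routines stand in for alternative implementations of one function.

```python
import time

def pick_fastest(candidates, sample_input, repeats=5):
    """Return the name of the candidate with the lowest best-of-N run time."""
    best_name, best_ns = None, None
    for name, func in candidates.items():
        elapsed = []
        for _ in range(repeats):
            start = time.perf_counter_ns()
            func(sample_input)
            elapsed.append(time.perf_counter_ns() - start)
        ns = min(elapsed)  # taking the minimum filters out scheduling noise
        if best_ns is None or ns < best_ns:
            best_name, best_ns = name, ns
    return best_name

def sum_loop(xs):
    """Deliberately naive alternative to the built-in sum."""
    total = 0
    for x in xs:
        total += x
    return total

# Two interchangeable implementations of the same operation.
CANDIDATES = {"loop": sum_loop, "builtin": sum}
```

A caller would run `pick_fastest(CANDIDATES, list(range(1000)))` once at startup and dispatch through the returned name afterwards; which name wins depends on the machine.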

19:30 <courmisch> sorear: if/when there is a confirmed RVV 1.0 implementation that has decent segmented loads and stores, and one that doesn't, and if by that point, the kernel has no flag in hwprobe, then yes

19:30 <courmisch> in the meantime, it is urgent to wait, as the proverb goes in my mother tongue

19:31 <sorear> if everyone makes that decision there won't be a hwprobe flag

19:31 <sorear> but yes, not urgent

19:32 <courmisch> I don't really think I could convince palmer or conchuod to add a flag for a hypothetical property

19:33 <sorear> i think my chances of convincing them with words are significantly lower than my chances of getting a patch accepted

19:34 <sorear> although there's no point in writing such a patch until I can test it on hardware

19:34 esv has joined #riscv

19:35 Andre_Z has joined #riscv

19:35 <sorear> re. small pads, right now I don't have soldering skills or soldering equipment so anything too small for me to hand-tape or hand-wire-wrap is too small

19:37 <conchuod> Isn't the whole point of a patches changelog to persuade people with words as to somethings merit?

19:37 <conchuod> s/es/'s/

19:37 <dzaima[m]> ok so I actually wrote a whole perf test suite for vrgather: https://gist.github.com/dzaima/189ca8d5cfd59e866895f5e945483bbd; 930 lines of autogenerated assembly (slightly manually tweaked to remove unwanted setvls) isn't particularly nice though

19:38 <conchuod> I'm not sure as to what you want a flag for though, I've been playing video games all day and not reading here :)

19:38 <courmisch> conchuod: whether segmented vector loads & stores are fast or not

19:39 <courmisch> dzaima[m]: the assembly has a very "disassembly of compiler-generated code" feel to it

19:40 <dzaima[m]> it is indeed the output of clang (the input C to clang was autogenerated)

19:40 <conchuod> Ah right.

19:44 <conchuod> I won't claim to have any opinion at this point :)

19:48 <courmisch> dzaima[m]: https://www.remlab.net/files/divers/vrgather.txt

19:49 <dzaima[m]> sweet, thanks!

19:51 GenTooMan has quit [Ping timeout: 240 seconds]

19:55 <sorear> I think your definition of VLMAX is off by 8, VL is elements at the current SEW, VLEN is bits

19:55 <courmisch> should probably use rdcycle for bench rather than the clock

19:56 * sorear trying to figure out exactly what the units are

19:56 <dzaima[m]> oh yeah, vlmax is just the number of bits in the vectors in use
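
The distinction sorear draws can be written out: VLMAX counts elements (LMUL * VLEN / SEW), while the quantity in the test output is the bit width of the register group (LMUL * VLEN). A small sketch, assuming VLEN=128 purely for illustration; the function names are hypothetical.

```python
from fractions import Fraction

def vlmax_elements(vlen_bits, sew_bits, lmul):
    """VLMAX in elements: LMUL * VLEN / SEW."""
    return int(Fraction(lmul) * vlen_bits / sew_bits)

def group_bits(vlen_bits, lmul):
    """Total bits in a register group: LMUL * VLEN (not VLMAX)."""
    return int(Fraction(lmul) * vlen_bits)
```

With SEW=8 the two differ by exactly a factor of 8: under the VLEN=128 assumption, LMUL=8 gives 1024 bits but a VLMAX of 128 elements.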

19:56 <dzaima[m]> units should be nanoseconds per gather invocation

19:58 <dzaima[m]> changing the measured unit can be done in the u64 measurement() function

19:58 <sorear> so it looks like 2.00/cycle throughput if it's running at stock 1.85GHz

19:59 <sorear> 3 cycle latency
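
The conversion behind these figures is cycles = nanoseconds * clock in GHz. A sketch with made-up inputs: the 0.27 ns and 1.62 ns values below are illustrative numbers chosen to land near the quoted results, not values taken from the linked measurements, and the 1.85 GHz clock is the stock frequency sorear assumes.

```python
def cycles_per_op(ns_per_op, clock_ghz):
    """Convert a per-operation time in nanoseconds to clock cycles."""
    return ns_per_op * clock_ghz

def ops_per_cycle(ns_per_op, clock_ghz):
    """Throughput expressed as operations completed per cycle."""
    return 1.0 / cycles_per_op(ns_per_op, clock_ghz)

CLOCK_GHZ = 1.85          # assumed stock clock
THROUGHPUT_NS = 0.27      # hypothetical ns per gather, independent gathers
LATENCY_NS = 1.62         # hypothetical ns per gather, dependent chain
```

Here `ops_per_cycle(THROUGHPUT_NS, CLOCK_GHZ)` comes out close to 2 and `cycles_per_op(LATENCY_NS, CLOCK_GHZ)` close to 3.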

20:00 <dzaima[m]> yeah

20:00 <sorear> dzaima[m]: your qsort comparison function is wrong and producing undefined behavior

20:01 <sorear> (b<a) - (a<b) I think?

20:02 <dzaima[m]> ..right, I never remember which things need a full comparison and which just a less-than
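
qsort expects a three-way comparator: negative, zero, or positive for less, equal, or greater. Python's `functools.cmp_to_key` uses the same convention, so sorear's expression can be tried out there. This is a sketch of the expression only, not of the code in the gist, and `three_way` is a hypothetical name.

```python
from functools import cmp_to_key

def three_way(a, b):
    """Return -1 if a < b, 0 if equal, 1 if a > b, as qsort-style comparators do."""
    # Booleans subtract as integers: True - False == 1.
    return (b < a) - (a < b)

def sort_samples(samples):
    """Sort measurements using the three-way comparator."""
    return sorted(samples, key=cmp_to_key(three_way))
```

A comparator that returns only `a < b` never yields a negative result, so it cannot report all three outcomes consistently.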

20:06 <sorear> so 32 cycles for all LMUL=8 gathers, 8 cycles for all LMUL=4, 1/2 cycle (t) 3 cycle (l) for LMUL=1, LMUL=2 too noisy to interpret, maybe a fixed sort will help

20:07 GenTooMan has joined #riscv

20:07 awita has quit [Remote host closed the connection]

20:07 <sorear> no effect from pattern although I wonder if VL would affect it

20:07 <courmisch> so far most stuff seems to be fastest with LMUL=4

20:08 <courmisch> not clear why it gets significantly slower at LMUL=8

20:08 <courmisch> it's also kinda annoying, because I'd been hoping that the reasonable thing to do would be to just maximise LMUL for any given function

20:09 <courmisch> (unless operating at fixed block size, of course)

20:09 <courmisch> and it would be a huge mess to try to guess the correct LMUL depending on CPU at run-time

20:10 <sorear> 2.5 (t), 4 (l) or so

20:14 <dzaima[m]> updated the gist to add a tested_vl variable that can be changed in the code

21:00 matoro_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

21:00 matoro has joined #riscv

21:10 frkazoid333 has joined #riscv

21:12 frkzoid has quit [Ping timeout: 272 seconds]

22:15 unsigned has joined #riscv

22:32 Andre_Z has quit [Quit: Leaving.]

23:09 crabbedhaloablut has quit []

23:19 frkazoid333 has quit [Ping timeout: 245 seconds]

23:19 pecastro has quit [Ping timeout: 272 seconds]

23:26 frkazoid333 has joined #riscv

23:33 elastic_dog has quit [Ping timeout: 260 seconds]

23:45 elastic_dog has joined #riscv

23:52 <muurkha> dzaima[m]: I feel like gcc autovectorization was in very early stages 15 years ago; do you mean it still sucks or that it's very early stages for GCC RISC-V autovectorization support?

23:53 <dzaima[m]> RISC-V specifically

23:53 <muurkha> whitequark[cis]: I wonder if this represents a minor bug in the Matrix bridge? 16:27 < dzaima[m]> @libera_courmisch:catircservices.org: no, clang uses vlseg2e32 whereas gcc

23:54 <muurkha> that doesn't look reasonably formatted for IRC

23:55 <dzaima[m]> some testing: @libera_muurkha:catircservices.org @dzaima:matrix.org

23:56 <muurkha> courmisch: I'm wondering if maybe the market demand for autovectorization is going away because for the pieces of code where it really matters, someone will write them in assembly manually?

23:57 <muurkha> sorear: "panpipe" <3

23:58 <dzaima[m]> the bridge issue is https://github.com/matrix-org/matrix-appservice-irc/issues/541 probably

23:58 <sorear> i already reported that glitch elsewhere because it was less than topical here, that may have been a mistake, the response I got was to say that should have been impossible if it were a Matrix reply, then nothing when I pointed out it was never a reply

23:59 <dzaima[m]> hmm how do replies get translated?

23:59 <dzaima[m]> ok reply target just gets dropped