<dzaima[m]> slightly more analysis - clang can and does autovectorize the loop in question when it's by itself, but just does not in the full context where I put it (and I didn't bother verifying it as my ARM "test bench" is my phone, debugging/profiling which is slightly annoying); anyway, very off-topic
<sorear> I wouldn't say that microarchitectural design space exploration is "very" off-topic
<muurkha> or even slightly
<dzaima[m]> well, that was just compiler autovectorization debugging
<dzaima[m]> for something that's probably more on-topic, here's some.. self-advertisement: I've been working on a listing of (rvv1.0) C intrinsics with C-like pseudocode description & expectable output assembly; might be helpful for learning, but of course most likely not as accurate as the spec itself
<muurkha> oh cool!
<sorear> if you want to see some appalling ideas with rvv, part of me wants to not explain anything and leave it as a puzzle
<muurkha> heh
<sorear> if you have a can-do mindset any instruction can be a mask instruction
<sorear> i forgot a vmin in the second and don't want to think about how much is wrong in the first
<courmisch> sorear: sorry; European bed time. NF as in how many segments? I think 2 is most common (e.g. complex float numbers used in audio). 4 and 8 also occur. NPOT not so much
<courmisch> sorear: as for ELEN, audio is almost exclusively 32-bit (single precision), non-HDR video is a mix of 8- and 16-bit; HDR video is only 16-bit
<muurkha> 32-bit audio?
<courmisch> no, single precision
<muurkha> single precision is normally 32 bits, isn't it? Even if a few of them are in the exponent
<courmisch> almost all audio is single precision. JACK, PulseAudio and PipeWire use it. Most multimedia apps too
<courmisch> most decoders and filters are written for single precision too
<muurkha> are you saying that almost all audio is floating point?
<courmisch> in processing, yes
<muurkha> interesting, I thought that was an unusual niche formt
<muurkha> *format
<courmisch> of course what comes from ADC and goes to DAC is integer
<muurkha> It wouldn't have to be; I mean ADPCM is more like floating point, right?
<courmisch> FWIW, I think Microsoft's WASAPI also uses float internally
<muurkha> I think most audio file formats are integer, which is probably why I think of that as being normal
<courmisch> ADPCM is only found in pre-MP3 era video games
<muurkha> it was more widely used than that at the time
<courmisch> telecommunication used G.711, not ADPCM, though they are vaguely similar
<muurkha> and S1 MP3 players use ADPCM for their voice recording by default
<muurkha> sure, I guess μlaw and a-law are kind of floaty
<courmisch> ADPCM doesn't strike me as a voice codec. AFAICT, ADPCM is mostly a trick to reduce bit rate at higher sampling rates. But voice doesn't require high sampling rates in the first place
<courmisch> or at least the industry thought so, ignoring that more than half the population is female or children, with higher pitch voices, but I blame the previous generation
<muurkha> well, the S1 MP3 is built around a Z80, so "higher sampling rates" is a matter of perspective :)
<muurkha> I was going to say "was" but then I remembered that I searched a few weeks ago and apparently they're still for sale
<courmisch> I mean music rates, 44.1kHz or 48kHz
<courmisch> voice is usually 8kHz or if you are fancy 16kHz
<muurkha> yes, IIRC it uses 16ksps ADPCM
<muurkha> I had some files I recorded on one but I don't know where they are
<muurkha> not sure why they didn't use something like GSM full rate
<muurkha> anyway I think there's still a lot of integer PCM data flowing around between programs, even if Pulse does support floats
<muurkha> audio data I mean
<courmisch> you can feed PCM to any sane audio output HAL, but it will be converted to float, unless you have monopoly over the audio output device
<muurkha> you don't need float to do mixing, if that's what you mean, or even dithering
<muurkha> people like floats for mixing because they don't clip
<courmisch> no, but that ship has sailed
<courmisch> nobody does audio processing in integer
<courmisch> unless it's some niche embedded code for an FPU-less device, ofc
<muurkha> there are more programs in cloud and earth, courmisch, than are dreamed of in your philosophy
<muurkha> have you used Audacity? Audacity is totally comfortable doing everything in integers
<muurkha> and that's what it does by default
<muurkha> not just if you import from integer WAV files or CD-DA. Even if you import from AAC or MP3!
<muurkha> Try importing some random YouTube video into Audacity and select Effects → Amplify and tell it to amplify it by 60dB. You'll get clipping all over the place, exactly the way you don't with float
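The clipping difference described above can be sketched in a few lines of C. This is only an illustration, not Audacity's code: the helper names are hypothetical, samples are expressed in 16-bit LSB units, and a 60 dB gain is taken as an amplitude factor of 1000.

```c
#include <stdint.h>

/* Hypothetical helper: apply a gain to a 16-bit PCM sample, saturating at
 * full scale. Everything beyond the 16-bit range collapses to the same value. */
static int16_t gain_s16(int16_t x, int32_t factor)
{
    int32_t y = (int32_t)x * factor;   /* fits in 32 bits for factor <= 65536 */
    if (y > INT16_MAX) y = INT16_MAX;
    if (y < INT16_MIN) y = INT16_MIN;
    return (int16_t)y;
}

/* Hypothetical helper: the same gain on a float sample. Nothing is clamped,
 * so a later attenuation can bring the signal back. */
static float gain_f32(float x, float factor)
{
    return x * factor;
}
```

With these, `gain_s16(100, 1000)` and `gain_s16(200, 1000)` both come out at full scale, so the two samples can no longer be told apart, while `gain_f32(100.0f, 1000.0f)` and `gain_f32(200.0f, 1000.0f)` stay distinct and can be scaled back down afterwards.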
<courmisch> of course if your source is an audio CD or a microphone you'll get PCM. I'm just saying internal processing is utterly dominated by single precision.
<courmisch> Audacity can do silly things to please the so-called "audiophiles". Unless they write their own decoders, they'll get single precision for most sources, and unless they bypass JACK/PA/PW, their output will be postprocessed in single precision
<muurkha> I don't think it's to please the audiophiles, though who knows; IEEE-754 single-precision floating point has more bits of mantissa than CD-DA
<muurkha> I just think it's silly to describe Audacity as "some niche embedded code for an FPU-less device" ;)
<courmisch> the only people who would get pissy about possibly losing precision due to 32-bit float are audiophiles
<muurkha> losing precision?
<courmisch> and those will insist on fixed-point 32-bit processing
<courmisch> muurkha: the "problem" with float processing is that you have rounding errors, and they're not even the same between implementations
<muurkha> I've never heard of anyone using fixed-point 32-bit processing
<courmisch> well, you can't really do processing at 16-bit
<muurkha> depends
<muurkha> yes, irreproducibility of floating-point is a problem, but reproducibility is not a major objective for audio processing I think
<courmisch> and open-source decoders with integer paths typically output fixed-point 32-bit on that path
<muurkha> interesting, I don't think I've seen that. but there's an enormous amount I haven't seen
<courmisch> muurkha: I agree, but audiophiles seem to live in an alternate reality where the human ear can tell the difference
<muurkha> I thought the hip audiophile thing was 24/96: 24-bit precision at 96ksps
<courmisch> yes, lossless FLAC 24-bit 96kHz
<courmisch> but that's just the storage format
<muurkha> but 32-bit floats have a 24-bit significand
<muurkha> so I don't think you're going to lose precision by going to 32-bit floats
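The 24-bit significand argument can be checked by brute force. The sketch below assumes `float` is IEEE-754 binary32 and that conversions are not kept in excess precision; the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an integer sample survives int -> float -> int unchanged. */
static int roundtrips_via_float(int32_t sample)
{
    float f = (float)sample;
    return (int32_t)f == sample;
}

/* Tries every signed 24-bit sample value and counts the ones that do not
 * round-trip. With a 24-bit significand the count is expected to be zero. */
static long count_24bit_mismatches(void)
{
    long bad = 0;
    assert(sizeof(float) == 4);  /* assumption: binary32 float */
    for (int32_t s = -(1 << 23); s < (1 << 23); s++)
        bad += !roundtrips_via_float(s);
    return bad;
}
```

The first value that needs a 25th bit, `(1 << 24) + 1`, is where the round trip stops being exact.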
<courmisch> yes they do, don't try reason when it comes to audiophilia
<muurkha> but yes it's true that on mixing you have rounding errors (but who cares)
<muurkha> I mean 24 bits of precision is a lot of headroom
<courmisch> a good human ear cannot even sense 12th bit error
<muurkha> and floats give you more headroom
<muurkha> 12 bits is only 72 dB, right?
<muurkha> I think there are probably circumstances where you can detect a sound that's 72 dB quieter than another sound that's far away in frequency
<muurkha> like, if the loud sound is 20 Hz and the quiet sound is 3500 Hz, I think it should be easy
<courmisch> and IIRC, blind tests couldn't differentiate 16-bit and 24-bit recordings
<courmisch> the only logical reason to use 24-bit recording is HQ postprocessing. But audiophilia is not about logic.
<muurkha> especially if it's 3500 Hz beating with an equal-amplitude sound at 3600 Hz
<courmisch> hmm, not sure how the frequencies were spaced, and I could be off by a bit or two
<courmisch> but the point is that 24-bit is mostly useless (compared to 16-bit)
<muurkha> you'd need a quiet listening room probably
<courmisch> and integer processing is a waste of SW developer time, unless your target CPU has no FPU
<muurkha> or good over-ear headphones
<muurkha> because otherwise even slight background noise will overwhelm the signal of interest
<courmisch> and ditto 96kHz/192kHz vs 48kHz. Waste of bits, unless you're going to downrate the sound to make your techno music.
<muurkha> yeah, the benefit of 96ksps is that you have an extra octave of headroom for your antialiasing filters
<muurkha> but once you have it in the digital domain you might as well downsample it with the kinds of astoundingly good filters that are easy to do digitally
<muurkha> unless you're listening to bats or something
<muurkha> there are some really interesting DSP algorithms you probably aren't aware of that depend on exact pole-zero cancellation and so don't work reliably in floating-point
<muurkha> Hogenauer filters are probably the most popular ones, and you're undoubtedly surrounded by equipment with Hogenauer filters deeply embedded in it even if you aren't aware of it
<muurkha> but there are others
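For readers who have not met Hogenauer (CIC) filters, here is a minimal single-stage decimator sketch in C. It is an illustration under assumptions, not any particular product's implementation: the struct and function names are hypothetical, and real designs cascade several stages. The point is that the integrator runs freely and wraps modulo 2^32, yet the comb's subtraction cancels the wraparound exactly, which is the property that does not carry over to floating point.

```c
#include <stddef.h>
#include <stdint.h>

/* State of a one-stage CIC decimator (hypothetical layout). */
struct cic1 {
    uint32_t integ;    /* integrator: running sum, wraps modulo 2^32 */
    uint32_t delayed;  /* integrator value at the previous output sample */
};

/* Consumes n_out * ratio input samples and writes n_out outputs. Each output
 * is the sum of `ratio` consecutive inputs, computed as a difference of two
 * integrator snapshots. The result is exact as long as the true sum fits in
 * 32 bits, regardless of how often the integrator has overflowed. */
static void cic1_decimate(struct cic1 *st, const int16_t *in, int32_t *out,
                          size_t n_out, unsigned ratio)
{
    for (size_t o = 0; o < n_out; o++) {
        for (unsigned k = 0; k < ratio; k++)
            st->integ += (uint32_t)(int32_t)in[o * ratio + k];
        /* Comb: unsigned subtraction is modular, so wraparound cancels.
         * The cast back to int32_t is the usual two's-complement view. */
        out[o] = (int32_t)(st->integ - st->delayed);
        st->delayed = st->integ;
    }
}
```

Starting the state near the top of the range, for example `{0xFFFFFFF0u, 0xFFFFFFF0u}`, forces an overflow on the first few samples and still yields the right sums.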
<muurkha> there's a plot on showing how masking drops off with Δf
<muurkha> I was hoping to find a better plot
<courmisch> sorear: problem with that non-segmented strategy: it only works one way. There's no widening left shift as far as I can see
<sorear> you can use vwadd.vw
<sorear> instead of vwsll you can try using vwmul
<sorear> vwmaccu.vx _almost_ does what you want but it requires one of the inputs to be pre-widened
<sorear> given that Zbb has PACK this is a kinda obvious omission from Zvbb !
<sorear> or ignore the problem. stores are a fair amount rarer than loads and having them slow is less of a problem
<dzaima[m]> oh yeah vwadd.vw is nice for merging widen&or. vwmul doesn't work though, as you'd need the constant multiplicand to be wider than the half-width type
<sorear> does ffmpeg have any mechanism for "feature X exists on uarch Y but is slow, don't use it"? it's not like intel hasn't done this more than once
<sorear> looking at you maskmov
<sorear> (anyone like my hell loops?)
<sorear> is ffmpeg likely to use Zvfh (_Float16) anywhere?
<dzaima[m]> very "fun" is that AVX2 maskmov store is still very slow on Zen 4, despite an AVX512 masked store being equivalent and fast
* cousteau_ wonders if C compilers are likely to support a `short float` type, and stuff like scanf("%hf")
<sorear> what would "short float" do now that ISO has standardized _Float16?
<courmisch> sorear: there are such flags for SSE2, SSE3, SSSE3 and AVX, yes
<courmisch> but currently, there'd be no ways to populate them on RV
<courmisch> sorear: multimedia doesn't use half-precision for anything AFAIK, nor the other AI variants of 16-bit floats.
<cousteau_> sorear: I guess the same as `bool` or `complex`
<courmisch> sorear: maybe GPUs have some use for video filtering? maybe check the DRM chroma list. but that would probably go via GL or VK rather than the vector unit
<cousteau_> except that, uh, they're not macros/typedefs
<cousteau_> *it's not
<sorear> _Float16 isn't a typedef
<cousteau_> no, I meant that `bool` is
<cousteau_> but `short float` wouldn't
<courmisch> sorear: I am a bit suspicious of using a multiply accumulate for a shift-and-or; multiplications tend to be slower
<sorear> ISO 18661-3-2015 doesn't define printf/scanf codes, it has a new "strfromf16" function...
<cousteau_> so you can `typedef _Bool bool;` and `#define complex _Complex`, but you can't `typedef _Float16 short float;`
<sorear> courmisch: you're already writing single-purpose code for the c910...?
<courmisch> no, I'm writing RVV 1.0 code then postprocessing it to benchmark on C910
<courmisch> for lack of a better alternative
<cousteau_> but in any case, I'd expect compilers to support `short float` at least as an extension, and maybe in a future as a base type
<sorear> interesting that ffmpeg went with exclusive use of array-of-structs, I don't think SSE supports segmented operations well
<courmisch> interleaved is the de facto standard for audio
<sorear> I wouldn't expect that, every type that has been added since "long long" in C99 has been a single keyword
<courmisch> I have seen planar audio, but it's really rare
<sorear> externally, sure, but I meant for internal variables in the AAC codec or whatever you were working on
<courmisch> I think those were complex numbers, though I have not checked the "upper layers"; just ported the reference C
<sorear> deplanarize it at the same time you convert from float32 to int16 for the DAC
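A scalar sketch of that idea, fusing the planar-to-interleaved step with the float32-to-int16 conversion, might look like the following. The function names, the stereo-only layout, the symmetric 32767 scale factor and the round-half-away-from-zero choice are all assumptions for illustration; real sound servers also dither, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert one float sample in [-1.0, 1.0] to 16-bit PCM, clamping anything
 * outside that range. NaN is mapped to silence. */
static int16_t f32_to_s16(float x)
{
    if (x != x) return 0;                 /* NaN */
    if (x >= 1.0f) return INT16_MAX;
    if (x <= -1.0f) return -INT16_MAX;
    float scaled = x * 32767.0f;
    /* round half away from zero without needing libm */
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Read two planar float channels, write one interleaved L/R int16 stream.
 * `out` must have room for 2 * frames samples. */
static void planar_f32_to_interleaved_s16(const float *left, const float *right,
                                          int16_t *out, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        out[2 * i]     = f32_to_s16(left[i]);
        out[2 * i + 1] = f32_to_s16(right[i]);
    }
}
```

On RVV this shape would map to two unit-stride loads and one segmented (or manually zipped) store, which is the store-side cost discussed earlier in the log.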
<courmisch> that would be inside JACKd/PA/PW outside the reach of FFmpeg
<courmisch> I don't think PA and PW even support planar audio
<courmisch> (could be wrong)
<courmisch> but converting FFmpeg and all the apps using FFmpeg is essentially impossible at this point
<sorear> how do you handle NPOT channels?
<courmisch> you mean surround? that's also interleaved. I haven't touched the filtering code yet though
<courmisch> typically, surround is pass-through anyway, thanks to Dolby / DTS insanity
<cousteau_> sorear: well, C99 was the last standard I really understood so...
<courmisch> C99 doesn't have atomics and alignment. Useless.
<courmisch> sorear: I am a bit confused why so many opcodes were taken for segment support if it's not implemented properly by anybody
<sorear> huh, c23 has scanf codes for float, double, long double, _Decimal32, _Decimal64, and _Decimal128 but not the sized binary float types
<courmisch> we'll see what SiFive did, whenever they actually release a CPU
<sorear> courmisch: i think you should test more than one implementation before declaring that it's not implemented properly by anybody
<sorear> (and the spec doesn't actually say anything about which operations are required to be fast)
<courmisch> sorear: sure. But for comparison, arm only supports 2-4 segments. They went out of their way and consumed one more bit
<courmisch> also if the plan is to enable compilers to autovectorise, segmented loads are probably unavoidable
<sorear> sure. and you did say that a trivial loop with a segmented load and 1.5 FLOPs per loaded value was only slightly slower than the scalar loop
<sorear> it's not like you're hitting m-mode emulation code here, like what happens when you try to take advantage of required support for misaligned memory access in the unprivileged spec on any sifive cpu
<courmisch> yes, it's not catastrophically slow
<courmisch> but if it's slower than scalar, even if it's faster than manually shuffling bits
<courmisch> it's still kinda useless. We'll just have to see what anybody other than T-Head did
<sorear> can you put in a tune flag for avoiding/using zvlsseg, to use 0.7.1 terminology?
<sorear> if anything I'm surprised t-head implemented zvlsseg if they made it no faster than independent strided loads
<courmisch> you mean FFmpeg CPU flag?
<courmisch> if __riscv_hwprobe hypothetically defined a corresponding flag, yes
<cousteau_> may I ask why the sudden interest in ffmpeg? I've seen it popping up a lot here recently
<cousteau_> is this because something in the vector extension may be of interest for ffmpeg audio/video codecs, and there's some collaboration going on, or intended to be?
<courmisch> it's one of the most assembler-intensive of the popular OSS libraries
<cousteau_> oh I see
<cousteau_> so it's being used as an example?
<courmisch> meanwhile OpenSSL and Nettle probably can't do much until the crypto vector stuff is out
<courmisch> I haven't seen FFmpeg pop up much recently here. It's just the same segmented load discussion that's been going on for a few days
<courmisch> and neither FFmpeg nor any other multimedia libraries are included in RISE :(
<cousteau_> I'm no crypto expert but is it really that important? I saw a lot of noise about Zvk* (whereas Zk* seems to be pretty established by now)
<courmisch> AFAIU, there are two parts to Zvk. One where they just vectorise the Zk stuff like CLMUL
<courmisch> and one where they implement specific algorithm rounds directly in the CPU
<courmisch> the latter should obviously be a hell of a lot faster than Zk
<courmisch> not that I'd know
<cousteau_> I see
cousteau_ is now known as cousteau
<cousteau> the way you put it, I'd expect that first part to just be the same instructions implemented differently
<cousteau> (although maybe there are some minor differences other than the ABI that do matter, such as timing... part of Zk* is "data-independent timing" so I guess that matters)
<courmisch> that's like comparing I and V
<courmisch> V does not really add anything that I/ or M can't do. It's just faster if used right
<cousteau> courmisch: ah. I misunderstood that Zvk would just accelerate instructions that Zk already did
<dzaima[m]> fwiw, current status of autovectorization of the original question ffmpeg loop: - clang uses vlseg2e32 whereas gcc does vrgather-s, clang always has a scalar loop for the tail, gcc sucks at eliminating setvl-s, both unroll 2x despite usually not doing that in my experience (might make sense if for gcc splitting up higher-lmul gathers, but no it's still LMUL=1 which is its usual LMUL; clang's at LMUL=2 which is
<dzaima[m]> also its usual LMUL)
<cousteau> I think RISC-V should add carry-less addition and subtraction in addition to multiplication
<cousteau> as a joke
<cousteau> see how long it takes people to realize that's just XOR
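The joke holds up in code. Below is a portable bit-by-bit sketch, not the hardware algorithm: `cladd` and `clmul_lo` are names invented for the example, and `clmul_lo` is meant to compute the same low 64 bits that a 64-bit carry-less multiply instruction would return.

```c
#include <stdint.h>

/* "Carry-less addition": add each bit position modulo 2 with no carry out.
 * That is XOR, and carry-less subtraction is the same operation. */
static uint64_t cladd(uint64_t a, uint64_t b)
{
    return a ^ b;
}

/* Low 64 bits of the carry-less product: shift-and-add long multiplication
 * with the additions replaced by cladd. */
static uint64_t clmul_lo(uint64_t a, uint64_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            acc = cladd(acc, a << i);
    return acc;
}
```

For instance, 3 times 3 is 5 here rather than 9, because (x + 1)^2 = x^2 + 1 when coefficients are taken modulo 2.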
<sorear> clmul is Zbc, not Zk*
<sorear> Zk* implements round function bits, but it's limited by your datapath width ... Zkned needs 16 instructions for an AES round on rv32ik, 4 on rv64ik
<sorear> Zkt is not really fit for purpose because it doesn't specify that loads and stores are independent of the value loaded or stored, so if you're treating it as a manual you can't store keymat in memory
<sorear> only useful for secrets born in registers (via Zkr) and dying in registers
<sorear> courmisch: so no ffmpeg mechanism for setting flags other than architectural features on the basis of vendor/arch/impl, runtime benchmarking, or configuration?
<courmisch> sorear: you can set a flag based on whatever, but well, you need to detect the correct value somehow
<courmisch> dzaima[m]: interesting. Well I doubt that vrgather is fast. It sounds like the one instruction that can't really be fast
<sorear> I mean it's easy enough to detect "vlseg2e32 is slower than vle64 + vnsrl + vnsrl" if you have access to rdcycle
<courmisch> sorear: "if you have access to rdcycle" well, I think the kernel is removing that access
<dzaima[m]> yeah. gcc autovectorization is very early stages afaict, so that's likely just it not being particularly complete
<sorear> although it might not be safe to assume that the behavior is the same for all element sizes, and linux is phasing out the initial scounteren setup
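For reference, the "vle64 + vnsrl + vnsrl" alternative to vlseg2e32 mentioned above can be modelled in scalar C. This is a sketch of the data movement only, with a hypothetical function name; the instruction names in the comments are how I read the sequence, and building the 64-bit value by hand stands in for a little-endian 64-bit load.

```c
#include <stddef.h>
#include <stdint.h>

/* Deinterleave `pairs` pairs of 32-bit values into two planar arrays by
 * viewing each pair as one 64-bit element and narrowing it twice. */
static void deinterleave2_u32(const uint32_t *src, uint32_t *even,
                              uint32_t *odd, size_t pairs)
{
    for (size_t i = 0; i < pairs; i++) {
        /* models vle64.v: first 32-bit field lands in the low half */
        uint64_t wide = (uint64_t)src[2 * i] | ((uint64_t)src[2 * i + 1] << 32);
        even[i] = (uint32_t)(wide >> 0);   /* models vnsrl with shift 0 */
        odd[i]  = (uint32_t)(wide >> 32);  /* models vnsrl with shift 32 */
    }
}
```

A startup micro-benchmark would time this formulation against the segmented load on the actual hardware; the model only shows that both produce the same result.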
<courmisch> dzaima[m]: IMO, segmented loads are correct here. It's Clang that's using gather where it's easily avoided
<dzaima[m]> no, clang uses vlseg2e32 whereas gcc uses vrgather
<sorear> gcc autovectorization actually uses VL so in my book it's way ahead of clang
<courmisch> dzaima[m]: ah ok, nvmd
<dzaima[m]> (ugh, bridge doesn't convert user links properly matrix→irc)
<courmisch> sorear: seems more like a case of neither of them being mature, with different problems
<dzaima[m]> clang's presumably using the same core as ARM SVE which doesn't have VL; gcc is a new vectorizer afaik
<cousteau> sorear: isn't clmul also part of Zbk* ?
<cousteau> (probably Zbkc)
<sorear> maybe
<sorear> clang's IR has %evl arguments on every vector-predication intrinsic function which aren't being generated on any architecture...
<cousteau> apparently it used to be called "Zkg"
<courmisch> dzaima[m]: SVE doesn't have VL, but it has WHILExx doing almost the same though. But maybe they determined that unrolling was faster than using VL style due to lack of LMUL
<dzaima[m]> hmm yeah; gcc also appears to use whilelo:
<courmisch> and GCC is not really using VL on RVV
<courmisch> it's hard-coding the AVL to 8 for no obvious reasons
<courmisch> ah, maybe missing restrict?
<courmisch> nah, I don't know
<dzaima[m]> that may be because of.. --param=riscv-autovec-preference=fixed-vlmax - change that to scalable and it should probably scale, at the expense of even more output
<courmisch> not sure if becoming a compiler developer is insanity or job security
<courmisch> probably both
<dzaima[m]> gcc's scalable code is usually a lot better though; the deinterleaving hits a pretty bad case
<courmisch> I guess LLVM just took it from NEON, not SVE
<courmisch> that would actually explain the unrolling
<dzaima[m]> this is the first case I've seen LLVM unroll though
<courmisch> doesn't it use NEON with 2x unroll on Armv8 for this function?
<dzaima[m]> on NEON clang does unroll usually, yes
<dzaima[m]> but for rvv it doesn't unroll even the simplest loops -
<courmisch> somebody should teach it vmv.v.i
<dzaima[m]> that's a float 1.0, not integer
<dzaima[m]> it does indeed vmv.v.i v8, 1 for an integer 1
<sorear> it's still painful to look at scalar epilog loops for rvv
<dzaima[m]> yeah
<courmisch> so hmm, is there any way to load a vector *backward* other than vlse with a negative stride?
<courmisch> a loop over vslide, but that's probably even slower
<dzaima[m]> vrgather is always an option
<courmisch> vid.v; vneg; vrgather, hmm
<dzaima[m]> probably would want to have separate tail handling though, otherwise the index generation has to be in the loop
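A scalar model of the gather-based reversal may make the index arithmetic clearer. This is a sketch with invented names, not dzaima's implementation: `model_vrgather` imitates the documented vrgather.vv behaviour (out-of-range indices yield 0), and the loop regenerates the indices for every chunk, which is the variant without separate tail handling. Note that the indices are vl-1-i, so a plain negate of vid.v is not enough; a reverse-subtract from vl-1 is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MODEL_VLMAX_CAP = 64 };  /* arbitrary limit for this model */

/* Scalar model of vrgather.vv over the first vl elements. */
static void model_vrgather(const uint32_t *src, const uint32_t *idx,
                           uint32_t *dst, size_t vl, size_t vlmax)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = idx[i] < vlmax ? src[idx[i]] : 0;
}

/* Out-of-place reverse of n elements in chunks of at most vlmax.
 * src and dst must not overlap. */
static void reverse_u32(const uint32_t *src, uint32_t *dst, size_t n,
                        size_t vlmax)
{
    uint32_t idx[MODEL_VLMAX_CAP];
    size_t done = 0;
    assert(vlmax >= 1 && vlmax <= MODEL_VLMAX_CAP);
    while (done < n) {
        size_t vl = (n - done < vlmax) ? n - done : vlmax;  /* like vsetvli */
        for (size_t i = 0; i < vl; i++)
            idx[i] = (uint32_t)(vl - 1 - i);                /* vid.v, then vl-1 minus it */
        /* take the last vl unread source elements, reverse them in-register */
        model_vrgather(src + (n - done - vl), idx, dst + done, vl, vlmax);
        done += vl;
    }
}
```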
<sorear> panpipe is definitely going to optimize strides of -1,0,1 into unit-stride memory operations btw
<sorear> mostly because it's extremely easy to generate a variable-but-1 stride in FORTRAN
<sorear> i think that everything in lapack is a strided load but I haven't checked the actual binary
jacklsw has quit [Ping timeout: 272 seconds]
<sorear> how _is_ vrgather performance on c910?
<dzaima[m]> here's a vrgather-based array reverse impl, both in-place and out-of-place in one, in autogenerated C intrinsics, that I wrote as an experiment, and appears to possibly work with some simple tests in qemu -
<sorear> suddenly reminded of / wondering when we're going to be able to inline memcpy as 3 instructions for len<=128
<sorear> vsetivli;vle8;vse8 compares rather favorably to auipc;jalr ...
<sorear> maybe I'll turn the json parser into something testable
<courmisch> is it legal to use an odd register number as a wide operand if LMUL is fractional?
<courmisch> sorear: implying you don't have a C910 lying around? I can run a test case if you have one
<dzaima[m]> clang does some... very funky things with memcpy:
<sorear> courmisch: i ordered a lp4a but it won't be here for two weeks
<sorear> courmisch: I would say so - my understanding of the rules on register number is that they are based on EMUL, not LMUL
<sorear> dzaima[m]: SLP was a mistake
<dzaima[m]> I can hack up some vrgather tests if desired; I'm also quite curious how it'd perform (and also don't have any risc-v hardware)
<sorear> it's not clear from the photos and the pdfs how I'm supposed to run my own S/M-mode code on the lp4a but I'll probably figure it out when I have it
<courmisch> mine has broken fastboot
<courmisch> I had to use unofficial U-boot's fastboot to flash it
<courmisch> to run your own S mode code, you can just replace the kernel image in u-boot, I think. Custom M mode, I dunno
<sorear> what hardware do you use to flash it?
<courmisch> USB C cable
<courmisch> since it's built-in MMC, you can't do the usual dd from your desktop computer
<courmisch> but judging by your timeline, you just got the first nonbeta while I have the last beta
<courmisch> in principle, you should be able to overwrite the u-boot partition from within Linux. But don't come crying to me if you brick the device...
<sorear> haven't done anything with u-boot before
<sorear> does "unofficial U-boot's fastboot" imply that you replaced the U-boot image on yours?
<courmisch> from U0 TTL serial port, I just typed "fastboot usb 0" + Enter, and that got me a somewhat working fastboot gadget
<courmisch> again, that's not the official method, which is to hold the BOOT button while attaching power (never worked for me)
<courmisch> sorear: no, there is a built-in fastboot implementation inside u-boot, and Sipeed didn't proactively remove it
<sorear> courmisch: do you know how the beta/official thing works? do you have a beta version?
<courmisch> sorear: I guess I have a beta because it shipped before they announced the release, but :shrug:
<sorear> i think i accidentally got a beta, the only one listed on the store when i ordered was the "8+8"
<sorear> if it was _clearly labeled_ as a beta i would have waited a few days
<sorear> online docs say that the release versions have dip switches to boot from sd card, would be a nice simple way to run M-mode code without worrying about bricking it
<courmisch> when I bought, they were only selling the beta specs. So I guess I have a beta from inventory, even though beta was officially only for preorders
<sorear> fairly clear that the fastboot "boot" image includes opensbi, which means I can replace the M-mode code, although there's no obvious way to restore if it won't load u-boot
<courmisch> is it so that only u-boot SPL and OpenSBI get M, while "normal" u-boot runs at S mode?
<sorear> hey, if you are writing in a dialect where you can translate to xtheadv for testing, would it make sense to include that in the build system for users?
<sorear> (SPL, opensbi, normal uboot) that is my understanding, yes
<courmisch> and OpenSBI monopolises M mode, and protects itself with PMP from S mode?
<courmisch> I'm very unfamiliar with the privileged RISC-V stuff
<sorear> I'll probably patch opensbi to include a backdoor, then not touch the opensbi on flash again
<sorear> that or - any idea how to use the jtag?
<courmisch> to paraphrase my go-to-colleague for electronic questions, using JTAG would require some very fine soldering
<courmisch> tl;dr: no
<sorear> oh, you have to solder on your own connector? :/
<courmisch> that's what he said. I took his word at face value because my electronic-fu is nonexistent
<courmisch> like managing to get a serial console from GPIO is an achievement for me
<courmisch> I'm not using any dialect. Just making tons of macros to convince gas to assemble V 0.7.1, XTheadBA and XTheadBB instead of V 1.0, Zba and Zbb
<courmisch> then a pile of hacks directly in the code for the stuff that just can't be done that way
<sorear> courmisch: any expectation of committing those hacks?
<courmisch> sorear: nothing grandiose there;a=shortlog;h=refs/heads/thead
<sorear> how feasible is it to do benchmarking on startup for microarchitectural characteristics? using rdcycle, rdtime, clock_gettime, perf_event_open, whatever
<courmisch> if you mean the stock firmware, it's some heavily patched 5.10
<courmisch> so rdcycle and rdtime work
<courmisch> perf doesn't
<whitequark[cis]> courmisch: pic of very fine soldering?
<courmisch> I recommend to upgrade the firmware to the last vendor release, but that's still 5.10 anyway
<courmisch> whitequark[cis]: he said it would be needed; we didn't actually do it
<sorear> perf should work iff rdcycle doesn't, so you might be able to try perf_event_open and fall back to rdcycle on -ENOSYS
<sorear> still stupid
<courmisch> I have a MR on FFmpeg for that, yes
<sorear> for checkasm
<whitequark[cis]> courmisch: I'm just curious what the footprint is
<courmisch> whitequark[cis]: sorry :shrug:
<whitequark[cis]> (people's classification of soldering jobs varies immensely; for some anything SMD is too hard, some shrug off doing pads that are 300x300 micron)
<courmisch> sorear: that's the only use of rdcycle in the FFmpeg code base
<sorear> with the usual caveat that I barely have a clue what ffmpeg is, I don't think your checkasm MR is usable for function selection during normal usage of the library/application?
<courmisch> by all means, if you want to play with RVV, pick whatever project you are comfortable with
<gurki> id expect some basic math lib to yield more meaningfull results
<gurki> linpack if you want some benchmark numbers
<sorear> I'm suggesting that instead of just checking hwprobe, check hwprobe and also run some small benchmark routines to select the fastest X on your machine
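The suggestion of timing a few candidates at startup and keeping the fastest can be sketched in portable C. Everything here is hypothetical scaffolding rather than FFmpeg's mechanism: the two candidates are plain scalar loops standing in for, say, a segmented-load and a non-segmented version of the same routine, and `clock()` stands in for rdcycle or perf_event_open.

```c
#include <stddef.h>
#include <time.h>

typedef long (*sum_fn)(const int *, size_t);

/* Candidate A: straightforward loop. */
static long sum_forward(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Candidate B: two accumulators, same result. */
static long sum_unrolled(const int *v, size_t n)
{
    long a = 0, b = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a += v[i];
        b += v[i + 1];
    }
    if (i < n)
        a += v[i];
    return a + b;
}

/* Time `reps` calls of one candidate; the sink keeps the calls from being
 * optimised away. */
static clock_t time_candidate(sum_fn fn, const int *v, size_t n, int reps,
                              volatile long *sink)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        *sink += fn(v, n);
    return clock() - t0;
}

/* Return whichever candidate measured fastest on this machine. */
static sum_fn pick_fastest(const sum_fn *cands, size_t count,
                           const int *v, size_t n)
{
    volatile long sink = 0;
    sum_fn best = cands[0];
    clock_t best_t = time_candidate(cands[0], v, n, 200, &sink);
    for (size_t i = 1; i < count; i++) {
        clock_t t = time_candidate(cands[i], v, n, 200, &sink);
        if (t < best_t) {
            best_t = t;
            best = cands[i];
        }
    }
    return best;
}
```

A library would call `pick_fastest` once during initialisation and store the returned pointer in its dispatch table; as noted in the discussion, the result may differ by element size, so one measurement per variant is probably the safer design.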
<courmisch> sorear: if/when there is a confirmed RVV 1.0 implementation that has decent segmented loads and stores, and one that doesn't, and if by that point, the kernel has no flag in hwprobe, then yes
<courmisch> in the mean time, it is urgent to wait as the proverb goes in my mother tongue
<sorear> if everyone makes that decision there won't be a hwprobe flag
<sorear> but yes, not urgent
<courmisch> I don't really think I could convince palmer or conchuod to add a flag for a hypothetical property
<sorear> i think my chances of convincing them with words are significantly lower than my chances of getting a patch accepted
<sorear> although there's no point in writing such a patch until I can test it on hardware
<sorear> re. small pads, right now I don't have soldering skills or soldering equipment so anything too small for me to hand-tape or hand-wire-wrap is too small
<conchuod> Isn't the whole point of a patches changelog to persuade people with words as to somethings merit?
<conchuod> s/es/'s/
<dzaima[m]> ok so I actually wrote a whole perf test suite for vrgather; 930 lines of autogenerated assembly (slightly manually tweaked to remove unwanted setvls) isn't particularly nice though
<conchuod> I'm not sure as to what you want a flag for though, I've been playing video games all day and not reading here :)
<courmisch> conchuod: whether segmented vector loads & stores are fast or not
<courmisch> dzaima[m]: the assembler has a very disassembly of compiler-generated code feel to it
<dzaima[m]> it is indeed the output of clang (the input C to clang was autogenerated)
<conchuod> Ah right.
<conchuod> I won't claim to have any opinion at this point :)
<dzaima[m]> sweet, thanks!
<sorear> I think your definition of VLMAX is off by 8, VL is elements at the current SEW, VLEN is bits
<courmisch> should probably use rdcycle for bench rather than the clock
* sorear trying to figure out exactly what the units are
<dzaima[m]> oh yeah, vlmax is just the number of bits in the vectors in use
<dzaima[m]> units should be nanoseconds per gather invocation
<dzaima[m]> changing the measured unit can be done in the u64 measurement() function
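To keep the two quantities from the VLMAX exchange apart: VLMAX counts elements and depends on SEW, while the size of the register group in bits does not. A small sketch, with hypothetical function names and LMUL passed as a fraction so that fractional settings work:

```c
#include <assert.h>

/* VLMAX in elements: LMUL * VLEN / SEW. */
static unsigned vlmax_elems(unsigned vlen_bits, unsigned sew_bits,
                            unsigned lmul_num, unsigned lmul_den)
{
    assert(sew_bits >= 8 && lmul_num >= 1 && lmul_den >= 1);
    return vlen_bits * lmul_num / (sew_bits * lmul_den);
}

/* Bits covered by the register group, independent of SEW. */
static unsigned group_bits(unsigned vlen_bits, unsigned lmul_num,
                           unsigned lmul_den)
{
    assert(lmul_num >= 1 && lmul_den >= 1);
    return vlen_bits * lmul_num / lmul_den;
}
```

Taking VLEN=128 purely as an example value, LMUL=8 at SEW=8 gives `vlmax_elems(128, 8, 8, 1)` = 128 elements, while `group_bits(128, 8, 1)` = 1024 bits; the two differ by exactly the factor of 8 bits per element.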
<sorear> so it looks like 2.00/cycle throughput if it's running at stock 1.85GHz
<sorear> 3 cycle latency
<dzaima[m]> yeah
<sorear> dzaima[m]: your qsort comparison function is wrong and producing undefined behavior
<sorear> (b<a) - (a<b) I think?
<dzaima[m]> ..right, I never remember which things need a full comparison and which just a less-than
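For context, qsort needs a three-way comparator: negative, zero, or positive. Returning only a less-than result is not a consistent ordering. The expression suggested above fits that contract without the overflow risk of subtracting the values. The sketch below assumes the measurements are `uint64_t` (the log mentions a u64 measurement function); the names are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Three-way comparison for ascending order: -1 if a < b, 0 if equal,
 * 1 if a > b. Avoids `a - b`, which would wrap for 64-bit values. */
static int cmp_u64(const void *pa, const void *pb)
{
    uint64_t a = *(const uint64_t *)pa;
    uint64_t b = *(const uint64_t *)pb;
    return (b < a) - (a < b);
}

/* Sort an array of measurements in place. */
static void sort_u64(uint64_t *v, size_t n)
{
    qsort(v, n, sizeof v[0], cmp_u64);
}
```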
<sorear> so 32 cycles for all LMUL=8 gathers, 8 cycles for all LMUL=4, 1/2 cycle (t) 3 cycle (l) for LMUL=1, LMUL=2 too noisy to interpret, maybe a fixed sort will help
<sorear> no effect from pattern although I wonder if VL would affect it
<courmisch> so far most stuff seems to be fastest with LMUL=4
<courmisch> not clear why it gets significantly slower at LMUL=8
<courmisch> it's also kinda annoying, because I'd been hoping that the reasonable thing to do would be to just maximise LMUL for any given function
<courmisch> (unless operating at fixed block size, of course)
<courmisch> and it would be a huge mess to try to guess the correct LMUL depending on CPU at run-time
<sorear> 2.5 (t), 4 (l) or so
<dzaima[m]> updated the gist to add a tested_vl variable that can be changed in the code
<muurkha> dzaima[m]: I feel like gcc autovectorization was in very early stages 15 years ago; do you mean it still sucks or that it's very early stages for GCC RISC-V autovectorization support?
<dzaima[m]> RISC-V specifically
<muurkha> whitequark[cis]: I wonder if this represents a minor bug in the Matrix bridge? 16:27 < dzaima[m]> no, clang uses vlseg2e32 whereas gcc
<muurkha> that doesn't look reasonably formatted for IRC
<dzaima[m]> some testing:
<muurkha> courmisch: I'm wondering if maybe the market demand for autovectorization is going away because for the pieces of code where it really matters, someone will write them in assembly manually?
<muurkha> sorear: "panpipe" <3
<sorear> i already reported that glitch elsewhere because it was less than topical here, that may have been a mistake, the response I got was to say that should have been impossible if it were a Matrix reply, then nothing when I pointed out it was never a reply
<dzaima[m]> hmm how do replies get translated?
<dzaima[m]> ok reply target just gets dropped