<re_irc>
<@yatekii:matrix.org> pardon my ignorance, what is lowering and what does this mean? :) can I read on it somewhere?
<re_irc>
<@therealprof:matrix.org> yatekii: It means assembler code generation from an intermediate compiler code representation generated in this case by the Rust compiler.
<re_irc>
<@therealprof:matrix.org> In this particular case we found out that for saturating additions of 8/16-bit signed integers, the assembly generated by LLVM for the Rust code would turn into a specialized DSP instruction on supported MCUs/CPUs, but LLVM wouldn't do the same for unsigned integers, which is what this PR adds.
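For a concrete picture, here is a minimal sketch of the kind of Rust code in question, using the standard `saturating_add` (the instruction names in the comments follow this discussion):

```rust
// On DSP-capable targets (e.g. thumbv7em), LLVM can lower these scalar
// saturating adds to single DSP instructions: the signed case already
// worked, and the PR adds the unsigned lowering.
#[inline(never)]
pub fn sat_add_i16(a: i16, b: i16) -> i16 {
    a.saturating_add(b) // already lowered to a DSP instruction (QADD16-style)
}

#[inline(never)]
pub fn sat_add_u16(a: u16, b: u16) -> u16 {
    a.saturating_add(b) // the case the PR teaches LLVM to lower (UQADD16-style)
}
```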
<re_irc>
<@yatekii:matrix.org> ah nice =)
<re_irc>
<@yatekii:matrix.org> thanks for implementing and explaining! :D
<re_irc>
<@adamgreig:matrix.org> are there any non-vector instructions left that won't be lowered?
<re_irc>
<@adamgreig:matrix.org> i still dunno what to do about the vector stuff now
<re_irc>
<@adamgreig:matrix.org> was very up for doing it in cortex-m, the intrinsics are easy and it seems like a packed struct of I8x4 or I16x2 or U8x4 or U16x2 would be easy and would work, but
<re_irc>
<@adamgreig:matrix.org> it seems like maybe it's going to happen in stdsimd?
<re_irc>
<@adamgreig:matrix.org> or, maybe, stdarch, not sure why the arm dsp simd stuff is in stdarch not stdsimd
<re_irc>
<@adamgreig:matrix.org> I worded that poorly, I guess really I just wonder if this completes all the obvious lowerings llvm should have for v7e
<re_irc>
<@adamgreig:matrix.org> I'm assuming it won't easily infer vector/simd stuff, but of course it does still use the vector instructions when operating on scalar but smaller-than-word datatypes (like qadd16 or whatever)
<re_irc>
<@therealprof:matrix.org> Sorry busy with family activities... Can reply soon with more details.
<re_irc>
<@adamgreig:matrix.org> no rush.
neceve has joined #rust-embedded
fabic has joined #rust-embedded
<re_irc>
<@adamgreig:matrix.org> therealprof: aaah, the new intrinsics in stdarch _are_ in core, but not in stable core because they haven't made it through the pipeline yet, but if you look at the nightly docs they're present: https://doc.rust-lang.org/nightly/core/arch/arm/index.html
<re_irc>
<@adamgreig:matrix.org> so maybe it would still be worth putting in cortex-m, idk
<re_irc>
<@therealprof:matrix.org> Intrinsics are rather boring IMHO.
<re_irc>
<@therealprof:matrix.org> The state of DSP extensions in LLVM is rather desolate.
<re_irc>
<@therealprof:matrix.org> They're modelled as instructions and there's a mapping from the intrinsics but then you're confined to the special types.
<re_irc>
<@adamgreig:matrix.org> at least working and stable and safe intrinsics would let people write efficient platform-specific code and know it will run using vector ops
<re_irc>
<@adamgreig:matrix.org> right now if i wanna write some dsp code on cortex-m i basically have to use unstable asm and do it all by hand, or implement all this stuff i've been looking at in cortex-m, or I guess use the nightly-only core::arch::arm intrinsics now they're in nightly but with no immediate path to stabilisation
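The "unstable asm" option, sketched (nightly-only `#![feature(asm)]` as things stood at the time; packing two i16 lanes into each u32 operand is the assumed convention here):

```rust
#![feature(asm)] // nightly-only at the time of this discussion

/// Hand-written QADD16: lane-wise saturating add of two i16 lanes
/// packed into each u32 operand.
fn qadd16(a: u32, b: u32) -> u32 {
    let r: u32;
    unsafe {
        asm!("qadd16 {0}, {1}, {2}", out(reg) r, in(reg) a, in(reg) b);
    }
    r
}
```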
<re_irc>
<@adamgreig:matrix.org> so... actually, lots of options
<re_irc>
<@therealprof:matrix.org> There's a special lowering pass for scalars if DSP instructions are present but they don't do SIMD, only single values (which is why we're seeing the instructions multiple times when doing operations on multiple values).
<re_irc>
<@adamgreig:matrix.org> yea, that makes sense and is better than not using the instructions for sure
<re_irc>
<@adamgreig:matrix.org> but it would be nice if it could detect the potential for autovectorisation
<re_irc>
<@therealprof:matrix.org> Yeah. The thing is LLVM handles vector types internally.
<re_irc>
<@adamgreig:matrix.org> the problem is it's probably worse in all those things we tried anyway: two i16 arguments will be passed into a function using two registers because of the ABI, right?
<re_irc>
<@adamgreig:matrix.org> so it would be more instructions to pack them, run the qadd16, then unpack them for output
<re_irc>
<@therealprof:matrix.org> There's also an autovectorizer pass, but since the DSP "vectors" are not modelled, they can't be used at all.
<re_irc>
<@therealprof:matrix.org> adamgreig: Well, there's the experimental simd types in Rust which would allow you to actually declare vectors instead of scalars.
<re_irc>
<@adamgreig:matrix.org> yea, that's what I mean, though they're morally equivalent to `#[repr(packed)] struct I16x2(pub i16, pub i16)`
<re_irc>
<@therealprof:matrix.org> They don't work that well at the moment, but even if they did it wouldn't help at all, since the lack of DSP support means that LLVM will turn them into scalars automatically before lowering.
<re_irc>
<@adamgreig:matrix.org> therealprof: but you use them with the intrinsics that are also part of experimental simd
<re_irc>
<@adamgreig:matrix.org> they don't even implement + and * and so on afaict
<re_irc>
<@adamgreig:matrix.org> unless we're talking about different things here: I'm looking at the core::arch::arm::dsp::int16x2, but maybe there's a generic high-level int16x2 as well that's meant to be portable?
<re_irc>
<@therealprof:matrix.org> adamgreig: I think that's the broader idea, yes.
<re_irc>
<@adamgreig:matrix.org> unfortunately they don't do any 32-bit types
<re_irc>
<@therealprof:matrix.org> LLVM does support that already.
<re_irc>
<@adamgreig:matrix.org> no 16x2 or 8x4
<re_irc>
<@therealprof:matrix.org> adamgreig: Same as LLVM. 🙄
<re_irc>
<@adamgreig:matrix.org> I see 🙃 though i've noticed their SimdU16 type is const-generic over the number of lanes so you could have SimdU16<2> I suppose
<re_irc>
<@adamgreig:matrix.org> indeed they have `pub type i16x4 = SimdI16<4>;` but not `pub type i16x2 = SimdI16<2>;` so it would be super easy to add if nothing else
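Which would amount to nothing more than this (a sketch; the 8-bit type names assume stdsimd mirrors its SimdI16/SimdU16 naming):

```rust
// Hypothetical one-line additions next to stdsimd's existing aliases
// such as `pub type i16x4 = SimdI16<4>;`:
pub type i16x2 = SimdI16<2>;
pub type u16x2 = SimdU16<2>;
pub type i8x4 = SimdI8<4>; // assumes SimdI8/SimdU8 exist with the same shape
pub type u8x4 = SimdU8<4>;
```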
<re_irc>
<@therealprof:matrix.org> In LLVM the types are fully generic already. So you can call e.g. `<2 x i16> @llvm.sadd.sat.v2i16(<2 x i16>, <2 x i16>)`
<re_irc>
<@adamgreig:matrix.org> does that lower to the right thing for us?
<re_irc>
<@therealprof:matrix.org> It lowers to correct code, but since v2i16 is not a natively supported type it can't lower to a single add instruction.
<re_irc>
<@therealprof:matrix.org> Currently it would lower to:
<re_irc>
<@adamgreig:matrix.org> so even if stdsimd added u16x2 and u8x4 and so forth for us, which would implement Add and Mul and all that, LLVM wouldn't lower it to the right thing anyway
<re_irc>
<@therealprof:matrix.org> Note the additional shiftery just to separate and recombine the values. 🙄
<re_irc>
<@adamgreig:matrix.org> but ultimately that would be a very nice way to be able to write SIMD stuff for v7e?
<re_irc>
<@therealprof:matrix.org> No, for any architecture.
<re_irc>
<@therealprof:matrix.org> 😉
<re_irc>
<@adamgreig:matrix.org> sorry, yes, for any arch, _including_ v7e
<re_irc>
<@adamgreig:matrix.org> whereas currently stdsimd doesn't work at all for v7e, both because it doesn't define any 32-bit SIMD types and also because even if it did, LLVM wouldn't lower them to the SIMD instructions
<re_irc>
<@therealprof:matrix.org> It would use the available capabilities. You could also use a `v4i16` which might use a single instruction on NEON and 2 instructions on DSP automatically.
<re_irc>
<@adamgreig:matrix.org> it would be cool to be able to test your SIMD code on x86 and then run it on v7e and know it was doing the right thing both times
<re_irc>
<@adamgreig:matrix.org> yea, I see
<re_irc>
<@therealprof:matrix.org> I think the NEON code would already do the right thing but I haven't tested specifically. 😉
<re_irc>
<@adamgreig:matrix.org> nicer than the stdarch stuff where we'd get arm-specific intrinsics and packed types eventually, but the types are opaque and can only be used with those intrinsics
rjframe has quit [Remote host closed the connection]
rjframe has joined #rust-embedded
<re_irc>
<@adamgreig:matrix.org> so maybe worth asking stdsimd people to add the 32-bit types anyway, since at least the llvm codegen would work even if it's not actually simd yet?
<re_irc>
<@therealprof:matrix.org> It's actually brilliant because you can define vectors of arbitrary size and the lowering would automatically use all available resources and registers to the fullest.
<re_irc>
<@therealprof:matrix.org> adamgreig: The only problem is: Currently the code would be lowered to significantly slower assembly than not using them.
<re_irc>
<@adamgreig:matrix.org> compared to passing around each i16 as a u32, sure
<re_irc>
<@adamgreig:matrix.org> but identical to passing them around packed, right?
<re_irc>
<@adamgreig:matrix.org> not sure how a buffer of i16 would be treated either...
<re_irc>
<@adamgreig:matrix.org> but nevertheless we'd need it in stdsimd eventually and in theory there's no reason for them to not include the typedef?
<re_irc>
<@therealprof:matrix.org> adamgreig: Yes, no harm in that.
<re_irc>
<@adamgreig:matrix.org> heh, talking of cross, guess what's in stdsimd's CI
<re_irc>
<@therealprof:matrix.org> One cool thing about the Rust representation is that Rust is free to model structs however they desire. So in theory Rust could detect that you're trying to model a coordinates or color values or whatever and automatically chose a vector format and pass the data around like that.
<re_irc>
<@therealprof:matrix.org> I'm having a hard time making sure the ARM instruction build code understands the concept of a v2i16 and v4i8 value though...
<re_irc>
... count doesn't match vector breakdown!"' failed.
<re_irc>
<@therealprof:matrix.org> is where I'm currently stuck.
rjframe has quit [Ping timeout: 258 seconds]
<re_irc>
<@therealprof:matrix.org> I think I do have the lowering itself pretty much working already.
<re_irc>
<@therealprof:matrix.org> I think the DSP instructions are pretty much all there so it's really just a matter of instruction selection (i.e. what goes in) and lowering (what comes out).
<re_irc>
<@therealprof:matrix.org> If instruction selection doesn't know about v2i16 it will break up `<2 x i16>` into `apply this insn 2 times on i16`. It might be possible to rummage through the DAG while lowering and recombine those but yuck...
<re_irc>
<@therealprof:matrix.org> I think there's also a lot of low hanging fruit in the lowering of scalar operations using DSP. Those sign-extend and truncation operations also seem quite superfluous to me, and only seem to happen because it thinks the scalar operation is actually executed on the full register rather than a fraction of it with the rest being simply ignored.
<re_irc>
<@therealprof:matrix.org> Some mappings from IR calls to DSP instructions are also very rudimentary and could use a little love.
<re_irc>
<@adamgreig:matrix.org> given the stuff you're looking at in llvm now would benefit all the 32-bit armv6/v7/v8 applications processors as well as v7e-m and v8-m it sounds like it would have a pretty widespread positive impact
<re_irc>
<@therealprof:matrix.org> Yeah, let's see whether I can convince them that we can do better than "hey, we could use some of those fancy instructions for some scalar operations". 😉
<re_irc>
<@therealprof:matrix.org> I think I'll add lowering tests for those types to all relevant test cases. Maybe that'll do the trick as an eye opener.
fabic has quit [Ping timeout: 268 seconds]
<re_irc>
<@adamgreig:matrix.org> calling core::arch::arm::__qadd16(a: i16x2, b: i16x2) does lower to qadd16 at least
<re_irc>
<@adamgreig:matrix.org> but it lowers to an FFI call to `llvm.arm.qadd16` so you'd hope so, lol
<re_irc>
<@adamgreig:matrix.org> but you have to core::mem::transmute to get in/out of a core::arch::arm::i16x2 type as far as I can tell
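Concretely, something like this (a sketch; the opaque type appears as `int16x2_t` in nightly stdarch, and the feature gate name may have changed since):

```rust
#![feature(stdsimd)] // nightly-only at the time of this discussion
use core::arch::arm::{__qadd16, int16x2_t};
use core::mem::transmute;

// int16x2_t exposes no public constructor or accessors, hence the
// transmutes mentioned above to get values in and out.
fn saturating_add_pairwise(a: [i16; 2], b: [i16; 2]) -> [i16; 2] {
    unsafe {
        let r: int16x2_t = __qadd16(transmute(a), transmute(b));
        transmute(r)
    }
}
```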
<re_irc>
<@therealprof:matrix.org> Yeah, the intrinsics work but are iffy to use since you have to come up with the correct types.
<re_irc>
<@adamgreig:matrix.org> yea
<re_irc>
<@therealprof:matrix.org> I think having those types would go a long way already since one could also impl the standard core math functions for them and they would already turn into useful code for all architectures.
<re_irc>
<@adamgreig:matrix.org> which types exactly, in stdsimd?
<re_irc>
<@therealprof:matrix.org> Yes.
<re_irc>
<@adamgreig:matrix.org> they should already impl all the standard core math functions
<re_irc>
<@adamgreig:matrix.org> uh, no idea at all what's going on with the disassembly for that though
<re_irc>
<@adamgreig:matrix.org> something seems badly wrong because `cargo objdump` is saying that function is `movs r1, #2; movt r1, #4; str r1, [r0]; bx lr`
<re_irc>
<@therealprof:matrix.org> Hm.
<re_irc>
<@adamgreig:matrix.org> meanwhile binaryninja thinks it's defined twice, with the first definition being `vaddw.s8 q9, q0, d2; andvs r0, r1, r4, lsl #2; strlt r4, [r0, #0x770]` and the second being `movs r1, #2; movt r1, #4; str r1, [r0]; bx lr`
<re_irc>
<@adamgreig:matrix.org> that first disassembly obviously makes no sense
<re_irc>
<@adamgreig:matrix.org> (clearly it's disassembling the same bytes into different instructions in both cases, and the vaddw is for non-M-class, so maybe it's showing me two interpretations? dunno what the ELF says)
<re_irc>
<@thalesfragoso:matrix.org> are you calling with constants? maybe try `#[inline(never)]`
<re_irc>
<@adamgreig:matrix.org> oh interesting, I did inline(never) but I am also calling with constants
<re_irc>
<@adamgreig:matrix.org> time to make up some i16s I guess
<re_irc>
<@thalesfragoso:matrix.org> weird that it would still do that with inline(never), does the result make sense?
<re_irc>
<@adamgreig:matrix.org> (ha ha, yes, well spotted, I was indeed calling it with [1, 2] twice, and so it returns [2, 4] correctly)
<re_irc>
<@adamgreig:matrix.org> llvm outsmarts me again
<re_irc>
<@thalesfragoso:matrix.org> even with inline never, crazy
<re_irc>
<@therealprof:matrix.org> It's weird that it doesn't show up in the IR?
<re_irc>
<@adamgreig:matrix.org> I generated the IR without it having to be inside a binary crate
<re_irc>
<@adamgreig:matrix.org> ok, got it now, it generates `ldrsh.w, qadd16, strh, qadd16, strh, bx lr`
<re_irc>
<@adamgreig:matrix.org> so yea, should give the correct result but is suboptimal execution
<re_irc>
<@therealprof:matrix.org> Well, that's far from the worst code that could be generated.
<re_irc>
<@adamgreig:matrix.org> yea
<re_irc>
<@therealprof:matrix.org> Basically it's splitting the vector operation into two scalar operations and then the lowering pass picks it up and turns it into a scalar DSP operation.
<re_irc>
<@therealprof:matrix.org> If you drop down to thumbv7m it should get way worse.
<re_irc>
<@adamgreig:matrix.org> if I let it inline, starting with r0 and r1 loaded, I get `lsrs r2, r1, #16; lsrs r3, r0, #16; qadd16 r0, r1, r0; qadd16 r2, r2, r3; uxth r0, r0; uxth r1, r2;`
<re_irc>
<@adamgreig:matrix.org> checking v7 but yea i'm sure it would be
<re_irc>
<@therealprof:matrix.org> You could also try a regular add which should generate way worse code for `u16x2` than two `u16`.
<re_irc>
<@adamgreig:matrix.org> indeed, it's all over the place, total mess on v7 non-e
<re_irc>
<@adamgreig:matrix.org> I won't paste it but it's like 30 instructions
<re_irc>
<@therealprof:matrix.org> That's another thing which is missing: good scalarising when no vector instructions are available. At the moment LLVM kind of assumes that it will always end up in vectors and thus be beneficial to keep things in that form rather than deconstruct them.
<re_irc>
<@therealprof:matrix.org> So if we were to add some kind of vectorisation it would probably also take the architecture into account.
<re_irc>
<@dirbaio:matrix.org> fun maths question: I got a SPI peripheral that has 2 8-bit dividers, so that `spi_clk = sys_clk / div1 / div2`, where `divX` is in `1..=255`
<re_irc>
<@dirbaio:matrix.org> so seems like I can first calculate `ratio = sys_clk / spi_clk`, then find `div1, div2` such that `div1*div2 = ratio`
<re_irc>
<@dirbaio:matrix.org> but I can't think of an easy way to do that :S
<re_irc>
<@dirbaio:matrix.org> for example if `ratio=10_000` then exact solutions are `100*100, 125*80, 200*50, 250*40`
<re_irc>
<@dirbaio:matrix.org> is there some simple and fast algo to do this? it smells like prime decomposition and stuff :S
<re_irc>
<@dirbaio:matrix.org> all I can think is trial and error, though that's 256 iterations..
<re_irc>
<@adamgreig:matrix.org> Integer factorisation is what you're looking for
<re_irc>
<@dirbaio:matrix.org> yeah to enumerate the divisors and shit
<re_irc>
<@newam:matrix.org> Wouldn't some sort of a binary search work there? Could still do a trial-and-error but probably cut down on the execution time.
<re_irc>
<@dirbaio:matrix.org> but that might be slower than just trying the 256 possible div1's
<re_irc>
<@adamgreig:matrix.org> Unfortunately integer factorisation is one of the hard problems where we hope there's no polynomial-time algorithm, lol
<re_irc>
<@dirbaio:matrix.org> I mean factorization will probably be slower, even if I end up testing less divisors the "constant" cost of testing one will be higher
<re_irc>
<@dirbaio:matrix.org> this is for the rpi pico
<re_irc>
<@dirbaio:matrix.org> so cortex m0, no divide 🤣
<re_irc>
<@adamgreig:matrix.org> Slower than what?
<re_irc>
<@dirbaio:matrix.org> for div1 in 1..=255 {
    let div2 = ratio / div1;
    // ...
}
<re_irc>
<@dirbaio:matrix.org> I think it can be rewritten without divides or multiplies 🤔
<re_irc>
<@dirbaio:matrix.org> so just 255 iterations, with no expensive math
<re_irc>
<@dirbaio:matrix.org> but.. can it be done faster? :D
<re_irc>
<@dirbaio:matrix.org> is it even worth it to try to match it exactly? 🤣
<re_irc>
<@dirbaio:matrix.org> or maybe the API should have the user pass in the dividers instead of the desired clk?
<re_irc>
<@dirbaio:matrix.org> most hals seem to have users pass in desired clks and do the calculations internally
<re_irc>
<@adamgreig:matrix.org> Heh, passing dividers or target freq is certainly a divisive question in hal design
<re_irc>
<@dirbaio:matrix.org> target freq seems much friendlier indeed..
<re_irc>
<@adamgreig:matrix.org> I don't know that there is a more efficient way to factor integers, it's quite intensively studied
<re_irc>
<@adamgreig:matrix.org> I guess it's different if your objective is to minimise frequency error rather than strictly find factors though
<re_irc>
<@dirbaio:matrix.org> yea.. it might be impossible to match target exactly
<re_irc>
<@dirbaio:matrix.org> in that case I just want closest
<re_irc>
<@adamgreig:matrix.org> Closest or less-than?
<re_irc>
<@dirbaio:matrix.org> maybe "closest greatest"? so the SPI never runs faster than requested freq?
<re_irc>
<@adamgreig:matrix.org> Yea
<re_irc>
<@dirbaio:matrix.org> so, find `div1, div2` such that `div1*div2 >= ratio` and `div1*div2` is closest to `ratio`
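Spelled out, that brute-force search might look like this (a sketch; the rp2040-specific encoding quirks that come up just below, div1 being even and div2 stored as div2-1, are deliberately left out):

```rust
/// Find (div1, div2), each in 1..=255, such that div1 * div2 >= ratio
/// and the product is as close to ratio as possible, so the SPI clock
/// never runs faster than requested.
fn find_dividers(ratio: u32) -> (u32, u32) {
    let mut best = (255, 255);
    for div1 in 1..=255 {
        // Smallest div2 with div1 * div2 >= ratio (ceiling division).
        let div2 = ((ratio + div1 - 1) / div1).max(1);
        if div2 > 255 {
            continue; // this div1 can't slow the clock down enough
        }
        if div1 * div2 < best.0 * best.1 {
            best = (div1, div2);
        }
    }
    best
}
```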
<re_irc>
<@dirbaio:matrix.org> this is stupid, why didn't they do a single 16bit divider
<re_irc>
<@dirbaio:matrix.org> a 16bit counter is the same number of gates as 2x 8bit counters 😠
<re_irc>
<@dirbaio:matrix.org> everything about the pico's spi peripheral is a big bag of WTF
<re_irc>
<@adamgreig:matrix.org> 16 bit counter is way more gates than two 8bit counters
<re_irc>
<@dirbaio:matrix.org> wut, doesn't it scale linearly?
<re_irc>
<@adamgreig:matrix.org> Each bit has to compute a function of all previous bits
<re_irc>
<@dirbaio:matrix.org> in all cases, or with fancy carry lookahead?
<re_irc>
<@adamgreig:matrix.org> In all cases where you have a synchronous counter
<re_irc>
<@adamgreig:matrix.org> If you have fancy carry lookahead you can reduce some of the gates needed, but the propagation time increases, so you can't run as high a frequency
<re_irc>
<@adamgreig:matrix.org> So either way your 16 bit counter takes more gates and/or runs slower than two 8 bit
<re_irc>
<@adamgreig:matrix.org> (slower meaning slower maximum permissible clock)
<re_irc>
<@dirbaio:matrix.org> huh I guess I have no idea about HDL then :D
<re_irc>
<@thalesfragoso:matrix.org> But we're talking about dividers, no?
<re_irc>
<@adamgreig:matrix.org> Generally you make dividers using counters
<re_irc>
<@adamgreig:matrix.org> (not always, it does depend and there are other ways)
<re_irc>
<@thalesfragoso:matrix.org> oh, yeah, right
<re_irc>
<@dirbaio:matrix.org> LOL I think I'm overengineering this... but it's O(255) without any multiplies or divides in the loop!
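One way to manage that (a sketch of the idea, not necessarily the actual code): walk div1 downwards and keep the product as a running value, so the loop body only ever adds and subtracts:

```rust
/// Same search as above, but with no divide or multiply in the loop:
/// maintain p == div1 * div2 incrementally. Since ceil(ratio / div1)
/// only grows as div1 shrinks, div2 never needs to move backwards,
/// making the whole search O(255 + 255) additions/subtractions.
fn find_dividers_no_div(ratio: u32) -> (u32, u32) {
    let (mut div1, mut div2) = (255u32, 1u32);
    let mut p = 255; // invariant: p == div1 * div2
    let (mut best, mut best_p) = ((255, 255), u32::MAX);
    while div1 >= 1 {
        // Grow div2 until the product reaches the target ratio.
        while p < ratio && div2 < 255 {
            div2 += 1;
            p += div1;
        }
        if p >= ratio && p < best_p {
            best = (div1, div2);
            best_p = p;
        }
        div1 -= 1;
        p -= div2; // product for the decremented div1
    }
    best
}
```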
<re_irc>
<@adamgreig:matrix.org> Btw can the divider not divide by 256?
<re_irc>
<@adamgreig:matrix.org> Normally I've seen you write x-1 to divide by X, so range is 1 to 256
<re_irc>
<@dirbaio:matrix.org> yeah I was simplifying
<re_irc>
<@dirbaio:matrix.org> div1 must be 2..254 and even
<re_irc>
<@dirbaio:matrix.org> div2 must be 1..256 (you write div2-1 to the reg)
<re_irc>
<@adamgreig:matrix.org> And even, weird
<re_irc>
<@adamgreig:matrix.org> Wonder if it's actually a 7 bit counter and a fixed /2 too or something
<re_irc>
<@dirbaio:matrix.org> datasheet says "Clock prescale divisor. Must be an even number from 2-254, depending on the frequency of SSPCLK. The least significant bit always returns zero on reads."
<re_irc>
<@dirbaio:matrix.org> so probably yes
<re_irc>
<@dirbaio:matrix.org> and div1 doesn't do the `-1` thing but div2 does 🤷‍♂️
<re_irc>
<@dirbaio:matrix.org> wohoo display at 64mhz
<re_irc>
<@dirbaio:matrix.org> I just realized with all the sane SPI speeds you only need one divider really... epic fail :D
<re_irc>
<@dirbaio:matrix.org> sysclk is 133mhz so with div1=2 and div2=256 you can get down to 259khz
<re_irc>
<@dirbaio:matrix.org> why would anyone want an SPI slower than that lol
mikehcox has joined #rust-embedded
<re_irc>
<@firefrommoonlight:matrix.org> Have y'all used opamps on battery-powered devices before? How did you do it? Get one with an "enable" or "shutdown" pin?
<re_irc>
<@firefrommoonlight:matrix.org> I'm troubleshooting battery life, and noticed a smoking gun in an IR camera of the op amps
<re_irc>
<@firefrommoonlight:matrix.org> dirbaio: It's nice you have the 2 prescalers. For example, most STM32s (all?) use a single one with crude factors, so you'll generally only get roughly into the desired range.
neceve has quit [Ping timeout: 255 seconds]
mikehcox has quit [Quit: Client closed]
<re_irc>
<@adamgreig:matrix.org> Opamps vary wildly in power consumption down to like 1uA, so depending on whether you want it on all the time or not maybe you just get a lower power one?