emerent has quit [Killed (platinum.libera.chat (Nickname regained by services))]
emerent_ is now known as emerent
<re_irc>
<@newam:matrix.org> Wow Curve25519 is _fast_.
<re_irc>
<@newam:matrix.org> NIST P-256 with the ST hardware public key accelerator takes an eye-watering 5,249,000 cycles (109 milliseconds at the highest clock speed for the STM32WL) to sign a message.
<re_irc>
<@newam:matrix.org> Using the `no_std` implementation of Curve25519 signing from `ed25519-dalek` (`{ version = "=1.0.1", default-features = false, features = ["u32_backend", "rand"] }`, all optimizations, but with debug assertions and overflow checks) takes only 866,802 cycles (18 milliseconds at highest clock speed for the STM32WL).
<re_irc>
<@newam:matrix.org> ...the only downside is the 80KiB of flash that `ed25519-dalek` eats up 😕
<re_irc>
<@newam:matrix.org> and on the verification side:
<re_irc>
<@newam:matrix.org> SW w/ ed25519: 2,081,189 cycles, 43ms @ max of 48MHz
<re_irc>
<@newam:matrix.org> HW PKA w/ P256: 10,498,000 cycles, 219ms @ max of 48MHz
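A minimal sketch of the software signing path being benchmarked above, assuming the quoted `ed25519-dalek = "=1.0.1"` dependency with the `u32_backend` and `rand` features; the RNG source, trait paths (rand 0.7), and message are placeholders, not the actual benchmark code:

```rust
#![no_std]

use ed25519_dalek::{Keypair, Signature, Signer, Verifier};
use rand::{CryptoRng, RngCore};

// On the STM32WL the CSPRNG would be seeded from the hardware RNG peripheral.
fn sign_and_verify<R: CryptoRng + RngCore>(csprng: &mut R) {
    let keypair: Keypair = Keypair::generate(csprng);
    let message: &[u8] = b"benchmark payload"; // placeholder message
    let signature: Signature = keypair.sign(message);
    assert!(keypair.verify(message, &signature).is_ok());
}
```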
fabic has quit [Ping timeout: 248 seconds]
<re_irc>
<@thejpster:matrix.org> jamesmunns: I mean, I have Eben's phone number...
<re_irc>
<@disasm-ewg:matrix.org> almindor: Added dkhayes117 as an outside collaborator with proper access rights
<re_irc>
<@wcpannell:matrix.org> Is the output of cargo run supposed to be different from cargo rustc? When I do "cargo rustc --release -- --emit asm", the .s files show my const fn is evaluated at compile time and replaced with a numeric literal. When I do "cargo run --release" and step through it, the const fn runs at run time. I haven't checked the...
<re_irc>
... binaries made with just "cargo build --release" yet
<re_irc>
<@wcpannell:matrix.org> hmm. after a clean release build, the objdump of the binary is slightly different from the --emit asm .s files. It uses different regs, and some ops are moved around in the GPIO setup code right before the "call" to the const fn, but nothing majorly different.
<re_irc>
<@wcpannell:matrix.org> except that the "call" in the .s file is just moving a literal and storing it. where the binary has a wall of text doing software division
<re_irc>
<@wcpannell:matrix.org> so the const fn did get inlined
<re_irc>
<@adamgreig:matrix.org> I think `const fn` are not _guaranteed_ to be run at compile time, it's just that they may be used in `const` contexts; any runtime function might be optimised to compile-time if the compiler can do it and decides to, regardless of whether it's const
<re_irc>
<@adamgreig:matrix.org> maybe it would help if you #[inline] those methods? I suspect when `set_baudrate` is being compiled, it doesn't know it's being called with constant integer arguments, and `set_baudrate` isn't being inlined into the binary because it's another crate and not marked `#[inline]`
<re_irc>
<@adamgreig:matrix.org> so while `bus_to_baudrate_divisor` is getting inlined into `set_baudrate` (because they're in the same library), it still needs to emit code for runtime division, because it doesn't know it's only getting called with constant arguments by your binary crate
<re_irc>
<@adamgreig:matrix.org> (maybe turning on/up LTO will let it make this link-time optimisation, I'm not sure)
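A small sketch of the distinction being described here, reusing the function names from the discussion but with made-up signatures:

```rust
// Hypothetical signatures, loosely based on the names mentioned above.
const fn bus_to_baudrate_divisor(bus_hz: u32, baud: u32) -> u32 {
    bus_hz / baud
}

// Forced to compile time: this is a const context, so rustc must evaluate the
// call and no division (or software-division routine) ends up in the binary.
const DIVISOR: u32 = bus_to_baudrate_divisor(48_000_000, 115_200);

// Not forced: an ordinary runtime call. Whether it folds to a constant is up to
// the optimizer, and across crate boundaries that usually needs #[inline] on the
// callee or LTO (e.g. `lto = true` in the release profile).
#[inline]
pub fn set_baudrate(bus_hz: u32, baud: u32) -> u32 {
    bus_to_baudrate_divisor(bus_hz, baud)
}
```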
<re_irc>
<@yatekii:matrix.org> hmm is it bad to have cargo workspaces across different build targets (arm/x86)?
<re_irc>
<@9names:matrix.org> it's certainly not fun, but does that make it bad?
<re_irc>
<@yatekii:matrix.org> I get endless dependency cycles across projects
<re_irc>
<@yatekii:matrix.org> and rust-analyzer seems to choke hard on it, not sure
<re_irc>
<@yatekii:matrix.org> so two arguments against it
<re_irc>
<@yatekii:matrix.org> but if I do not use a single cargo workspace, using xtask becomes a nuisance
<re_irc>
<@jamesmunns:matrix.org> newam: I feel like someone... maybe the folks from SoloKeys, have a different 25519 impl that doesn't require the huge lookup table?
<re_irc>
<@jamesmunns:matrix.org> Maybe CC Nicolas Stalder | SoloKeys?
<re_irc>
<@newam:matrix.org> jamesmunns: That is awesome, thank you!
<re_irc>
<@jamesmunns:matrix.org> No idea on the completeness of that, but I know Nicolas has been working on it for a while
<re_irc>
<@jamesmunns:matrix.org> and can probably give better usage guidance
<re_irc>
<@newam:matrix.org> > One reason the current ed25519-dalek library in its current state is not usable for microcontrollers is that it includes ~40kB of pre-computed data to speed things up.
<re_irc>
<@newam:matrix.org> Ah, that explains the ~80KiB extra binary size.
<re_irc>
<@dirbaio:matrix.org> it's suuuuuuper fast
<re_irc>
<@dirbaio:matrix.org> only x25519 though, no ed25519 :|
<re_irc>
<@newam:matrix.org> Speaking of all this encryption, does anyone know if there is a good resource for learning more on an embedded scale?
<re_irc>
<@newam:matrix.org> I keep learning piecemeal because most of the resources I find are targeted towards people working on higher level software.
fabic has quit [Ping timeout: 248 seconds]
<re_irc>
<@nickray:solokeys.com> yes, https://github.com/ycrypto/salty/ uses assembly for the field ops, it's pretty speedy. i'd like to eventually upstream as backend for dalek, but there's push back/reluctance.
<re_irc>
<@nickray:solokeys.com> emil's assembly impl (for curve25519) is slightly faster than haase's, which salty uses. could be swapped out if someone feels the need.
<re_irc>
<@nickray:solokeys.com> for p256 i wrapped emil's *insanely* fast assembly-all-the-way impl in https://github.com/ycrypto/p256-cortex-m4. this one is just nuts.
<re_irc>
<@nickray:solokeys.com> generally the philosophy in ycrypto is to use pure rust for mid (group) and high (signatures etc.) level, but use the best known assembly for low (base field) level. - to complement rustcrypto/dalek and the other conceptual/pure rust work for microcontrollers. and upstream perhaps, if we find ways to do this...
<re_irc>
... neatly.
<re_irc>
<@dirbaio:matrix.org> ooh coooool
<re_irc>
<@dirbaio:matrix.org> I didn't see it had asm too
<re_irc>
<@nickray:solokeys.com> i feel like "this is the way" for crypto on microcontrollers. looking forward to stable inline assembly to clean it up more.
<re_irc>
<@dirbaio:matrix.org> all-asm is too hardcore yeah :\
<re_irc>
<@nickray:solokeys.com> not according to emil 😅
<re_irc>
<@dirbaio:matrix.org> I guess you can get slightly faster with all-asm
<re_irc>
<@dirbaio:matrix.org> emil's field op functions use a weird abi where one field element is passed in r0-r8
<re_irc>
<@dirbaio:matrix.org> I don't think there's a way to call that from rust
<re_irc>
<@dirbaio:matrix.org> and that makes it slightly faster because you can remove unused loads/stores
<re_irc>
<@nickray:solokeys.com> it'd be nice if the compiler actively used umaal etc., which is tricky. but a lot of the final optimisations are register level, which i don't think a non-specialized compiler can figure out (research project)
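For contrast, the usual way such assembly is made callable from Rust is an `extern "C"` wrapper using the standard AAPCS convention, where field elements are passed by pointer; an ABI that spreads one element across r0-r8 cannot be declared this way. The symbol name and layout below are hypothetical, not taken from salty or p256-cortex-m4:

```rust
// Hypothetical FFI declaration: AAPCS passes at most four arguments in r0-r3,
// so 256-bit field elements go by pointer rather than spread across r0-r8.
extern "C" {
    fn fe25519_mul(result: *mut [u32; 8], a: *const [u32; 8], b: *const [u32; 8]);
}

fn field_mul(a: &[u32; 8], b: &[u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    // SAFETY: all three pointers are valid for the duration of the call.
    unsafe { fe25519_mul(&mut out, a, b) };
    out
}
```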
<re_irc>
<@newam:matrix.org> > i'd like to eventually upstream as backend for dalek, but there's push back/reluctance.
<re_irc>
<@newam:matrix.org> Unfortunately it seems like ed25519-dalek is unmaintained these days :/
<re_irc>
<@nickray:solokeys.com> i wouldn't say that. it forked and both forks are active.
<re_irc>
<@newam:matrix.org> Ah
<re_irc>
<@newam:matrix.org> How fast is that assembly p256 implementation, got any cycle counts?
<re_irc>
<@nickray:solokeys.com> hoping to contract emil for some more curve base fields, and grow ycrypto. if anybody else is interested in this, happy to collab.
<re_irc>
<@newam:matrix.org> Ah, it's that one, thanks :D
<re_irc>
<@newam:matrix.org> wow that's a lot faster than the hardware accelerator provided by ST...
<re_irc>
<@dirbaio:matrix.org> yeah
<re_irc>
<@nickray:solokeys.com> yeah, beats the lpc55 demo impl in C for the lpc55 accelerator too.
<re_irc>
<@dirbaio:matrix.org> it beats the cc310 accelerator in nrf52 chips too
<re_irc>
<@dirbaio:matrix.org> it's insane 🤣
<re_irc>
<@dkhayes117:matrix.org> What is the current state of inline-asm nowadays?
<re_irc>
<@adamgreig:matrix.org> what were they doing with the hardware accel... lol
<re_irc>
<@adamgreig:matrix.org> dkhayes117: it's really lovely, but I don't think there's any ETA on it being stable, so you have to either use nightly, or do what crates like c-m-rt do and use a nightly compiler to build a static lib that you distribute and link against for stable users
<re_irc>
<@adamgreig:matrix.org> that said, there's also been progress on bringing the cortex-m dsp intrinsics into core::arch, which might make it easier to not need asm for some things
<re_irc>
<@adamgreig:matrix.org> (currently still nightly-only though)
<re_irc>
<@dkhayes117:matrix.org> I've used the new `asm!` syntax, I like it much better than `llvm_asm!`
<re_irc>
<@nickray:solokeys.com> corporate C 🤪 in earnest, i think you'd need bigger curves to beat optimal register use and umaal, to take advantage of 64bit mult in an accelerator
<re_irc>
<@adamgreig:matrix.org> yea, the new `asm!` is the one that will get stabilised sooner or later, and it's so nicely designed
<re_irc>
<@nickray:solokeys.com> > dkhayes117: it's really lovely, but I don't think there's any ETA on it being stable, so you have to either use nightly, or do what crates like c-m-rt do and use a nightly compiler to build a static lib that you distribute and link against for stable users
<re_irc>
<@nickray:solokeys.com> this is what we do in ycrypto, it's honestly not that bad either.
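A minimal sketch of that pre-built static library approach, assuming the nightly-built archive is shipped inside the crate; the directory and library names are placeholders:

```rust
// build.rs: link a static library that was assembled ahead of time with a
// nightly compiler, so downstream users on stable never compile the inline asm.
fn main() {
    let crate_dir = std::env::var("CARGO_MANIFEST_DIR").unwrap();
    println!("cargo:rustc-link-search=native={}/bin", crate_dir);
    println!("cargo:rustc-link-lib=static=asm_routines");
}
```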
<re_irc>
<@adamgreig:matrix.org> yea. I can't wait for stable inline asm, but in the meantime there's not many disadvantages to this
<re_irc>
<@adamgreig:matrix.org> can't be inlined is the main problem we have
<re_irc>
<@adamgreig:matrix.org> well-
<re_irc>
<@adamgreig:matrix.org> it _can_ be inlined with the linker plugin stuff, even
<re_irc>
<@dirbaio:matrix.org> a few months ago I started writing all-asm ed25519 on top of emil's asm
<re_irc>
<@dirbaio:matrix.org> gave up halfway 🤣
<re_irc>
<@dirbaio:matrix.org> when I saw I needed sha512 :c
<re_irc>
<@wcpannell:matrix.org> adamgreig: I'll have to check the disasm in the morning, away from computer for the day. IIRC the set_baudrate args were literals and everything in the app (except cortex-m inline asm and the software division) got inlined into main. lto = true in release profile, so that should be fat lto. So it's doing...
<re_irc>
... stuff, just not the same as --emit asm
<re_irc>
<@dirbaio:matrix.org> W.C. Pannell: if you run rustc directly, you're not including many flags that cargo would include
<re_irc>
<@dirbaio:matrix.org> rustc doesn't read `Cargo.toml` or `.cargo/config.toml`
<re_irc>
<@wcpannell:matrix.org> I was under the impression that using "cargo rustc blahblah" instead of calling rustc directly did use all the config.
<re_irc>
<@dirbaio:matrix.org> ah okay, no idea about that one, sorry :)
<re_irc>
<@dirbaio:matrix.org> I thought you were calling bare rustc
<re_irc>
<@wcpannell:matrix.org> I think it does because it cranked the lto when i added it to the profile. Idk. I'll dig in more in the morning.
<re_irc>
<@newam:matrix.org> Ok so with the rust overhead on `P256-Cortex-M4` (e.g. passing a CryptoRng impl which takes 412 cycles instead of pre-generating)
<re_irc>
<@therealprof:matrix.org> Nicolas Stalder | SoloKeys: LLVM can lower such instructions if the compiler generates the proper LLVM instructions. It is quite a bit of an uphill battle though; all of the reviewers are mostly interested in getting support for the latest and greatest architectures in, and don't care too much about our lowly...
<re_irc>
... micros, despite the higher ups also profiting quite a bit from lowly instruction support. There's quite a bit of low hanging fruit still to be picked...
<re_irc>
<@newam:matrix.org> dirbaio: I think this might actually be the instruction cache size difference between our targets, the NRF Emil tested on has a 2K icache, my STM32WL has a 1K icache.
<re_irc>
<@therealprof:matrix.org> The fun part with the ARM ISA is that the registers are shared between the different extensions, so basically you can use any "special" instruction as-is without needing to ensure the data is in the right set of registers, which makes actually utilizing the instructions a whole lot easier.
<re_irc>
<@newam:matrix.org> At the very least my benchmarks for rust overhead are inaccurate since our targets are pretty dissimilar in a few ways that matter for performance.
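For readers who haven't met it, UMAAL is the multiply-accumulate-accumulate instruction these field implementations lean on. A hypothetical standalone helper in the new `asm!` syntax (DSP-extension targets such as Cortex-M4/M7 only, and nightly-only at the time of this discussion) might look like:

```rust
/// Computes the 64-bit value `lo + hi + a * b` and writes it back as hi:lo,
/// which is exactly what a single UMAAL instruction does.
#[cfg(target_arch = "arm")]
unsafe fn umaal(lo: &mut u32, hi: &mut u32, a: u32, b: u32) {
    core::arch::asm!(
        "umaal {lo}, {hi}, {a}, {b}",
        lo = inout(reg) *lo,
        hi = inout(reg) *hi,
        a = in(reg) a,
        b = in(reg) b,
        options(nomem, nostack),
    );
}
```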
fabic has joined #rust-embedded
fabic has quit [Ping timeout: 272 seconds]
SanchayanMaity has quit [Ping timeout: 258 seconds]
dreamcat4 has quit [Ping timeout: 256 seconds]
<re_irc>
<@newam:matrix.org> Nicolas Stalder | SoloKeys: Sent a couple PRs your way, thanks for these crates, they're great!
<re_irc>
<@newam:matrix.org> Going to sleep a few million cycles faster is going to help my power budget a ton 😀
edm has joined #rust-embedded
SanchayanMaity has joined #rust-embedded
dreamcat4 has joined #rust-embedded
troth has quit [Quit: Leaving.]
<re_irc>
<@windfisch42:matrix.org> Heya! I'm trying to implement the embedded_hal::blocking::spi::Write trait for my struct, but it fails due to "conflicting implementation"; it seems that Write<T> is implemented for each implementor of the Default trait, and rust tells me that "someone else could implement Default on your struct"...
<re_irc>
<@windfisch42:matrix.org> how can I implement that Write trait for my struct without this error?
<re_irc>
<@adamgreig:matrix.org> that thread has some discussion, those default impls will be removed eventually
<re_irc>
<@windfisch42:matrix.org> ooof, thanks. so i'd need to add a patch line to my Cargo.toml?
<re_irc>
<@adamgreig:matrix.org> (or, they are removed already in master, but I don't think backported)
<re_irc>
<@adamgreig:matrix.org> no, you shouldn't need to patch
<re_irc>
<@adamgreig:matrix.org> assuming you're currently impl'ing for a generic T, instead impl for a concrete u8 and/or u16
<re_irc>
<@windfisch42:matrix.org> how do I fix this then? embedded_hal wasn't pulled in by myself, but by display-interface-spi and/or stm32f1xx_hal
<re_irc>
<@windfisch42:matrix.org> ah oh.
<re_irc>
<@windfisch42:matrix.org> ah nice, that seems to fix it. thanks!
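A sketch of the fix being described, assuming embedded-hal 0.2 and a made-up `MyDisplayBus` type: implement the trait for a concrete word type instead of a generic `T`, so it can no longer overlap with the `Default`-based blanket impl:

```rust
use embedded_hal::blocking::spi::Write;

// Hypothetical type standing in for Windfisch's struct.
struct MyDisplayBus;

impl Write<u8> for MyDisplayBus {
    type Error = core::convert::Infallible;

    fn write(&mut self, words: &[u8]) -> Result<(), Self::Error> {
        // push `words` out over the bus here
        let _ = words;
        Ok(())
    }
}
```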
<re_irc>
<@windfisch42:matrix.org> btw, maybe what I am trying to do already exists: Is there an implementation (adapter) that gives me Write<T> for WriteDma<T>?
<re_irc>
<@windfisch42:matrix.org> I have a library that wants blocking::spi::Write, but I'd like to allocate a DMA buffer and use DMA for the blocking write operation. The operation shall still block, but if it gets interrupted, the DMA should continue to write
<re_irc>
<@newam:matrix.org> Windfisch: For situations like that I just make a different type that implements `blocking::spi::Write` with DMAs.
<re_irc>
<@newam:matrix.org> I think that's the intended use-case, but I may be wrong about that.
<re_irc>
<@windfisch42:matrix.org> that's what I am doing right now, i was hoping that someone else already did that work :D
<re_irc>
<@newam:matrix.org> Ahhh gotcha
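The "different type" approach newam describes might look roughly like the following; every name here (`DmaSpi`, `DmaWrite`) is hypothetical rather than an existing HAL API:

```rust
use embedded_hal::blocking::spi::Write;

/// Hypothetical trait over whatever the HAL exposes for a DMA-capable SPI TX channel.
trait DmaWrite {
    type Error;
    fn start(&mut self, words: &[u8]) -> Result<(), Self::Error>;
    fn is_done(&self) -> bool;
}

/// Wrapper exposing blocking `Write` while the transfer is driven by DMA, so the
/// transfer keeps running even if an interrupt preempts the busy-wait loop.
struct DmaSpi<TX>(TX);

impl<TX: DmaWrite> Write<u8> for DmaSpi<TX> {
    type Error = TX::Error;

    fn write(&mut self, words: &[u8]) -> Result<(), Self::Error> {
        self.0.start(words)?;
        while !self.0.is_done() {} // block until the DMA transfer completes
        Ok(())
    }
}
```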
cr19011 has joined #rust-embedded
cr1901 has quit [Killed (NickServ (GHOST command used by cr19011!~William@2601:8d:8600:911:456:81f2:40b3:f921))]
<re_irc>
<@u007d:matrix.org> For the life of me I can't find these alleged header pins :). I suppose I'm not reading the schematic correctly--can someone help me understand which header pins (and bonus marks for which chip package pins) the `PWMx`s are on?