azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/azonenberg/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
<d1b2> <Darius> send them some hot stack traces and hope they fix it? 🙂
<azonenberg> darius: i sent them screenshots of the profiler
<azonenberg> and hinted that if they were so kind as to send me source and/or debug symbols i'd have an easier time
<d1b2> <Darius> hehe
<azonenberg> But this is not the first time i've done something like this
<azonenberg> I got a massive discount on my sonnet seat after sending them a "pull request" for their solver that included an AVX implementation of a couple of inner loops providing a 40-70% speedup depending on workload lol
<azonenberg> I want the computer to be waiting for me all the time, if i'm waiting for it that's a problem which needs to be fixed
<azonenberg> either by optimizing the code, throwing hardware at it, or both
<azonenberg> If anybody else is curious the hot loop is at 0x13fe3a3 in libpicoipp.so.1.0.1, latest release from their debian/ubuntu package repository
<azonenberg> 11 instructions of what looks like simple bytewise memory copies
<azonenberg> looks like it's shuffling/packing data from one format to another or something
<azonenberg> I haven't reversed it fully but the loop is entirely mov instructions other than the inc / cmp / jmp at the end
<azonenberg> so it screams "vectorize me"
<azonenberg> also the instructions are in the wrong order, idk if the cpu is able to correct for this efficiently or if it will end up having constant pipeline stalls
<azonenberg> but it's four copies of mov r, mem / mov mem, r
<azonenberg> for different addresses
<azonenberg> the second mov can't dual issue because it's blocked on the first
<azonenberg> and all of the copies are single byte wide
<azonenberg> Looking a bit more, it seems the loop is 2-way unrolled
<azonenberg> and it's copying data from one input buffer to two output ones, un-interleaving
<azonenberg> rdi+rsi*4+1 and rdi+rsi*4 + 3 go to r8+rsi*2 and r8+rsi*2 + 1
<azonenberg> and rdi+rsi*4 and rdi+rsi*4 + 2 go to r9+rsi*2 and r9+rsi*2 + 1
<azonenberg> that should be easy to convert to a vector load, some shuffles, and a vector store
<azonenberg> Basically it's moving even bytes to one buffer and odd bytes to the other
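The access pattern described above can be modelled in a few lines of Python. This is only a reference sketch, not the code in libpicoipp: the function names are hypothetical, and it assumes that rsi advances by one per pass of the 2-way unrolled loop and that the input length is a multiple of 4.

```python
def deinterleave_scalar(src):
    """Byte-at-a-time model of the loop as described in the log.

    Assumption: i plays the role of rsi and steps by 1 per iteration.
    """
    if len(src) % 4:
        raise ValueError("this sketch only handles lengths that are a multiple of 4")
    odd_out = bytearray(len(src) // 2)   # stands in for the buffer at r8
    even_out = bytearray(len(src) // 2)  # stands in for the buffer at r9
    for i in range(len(src) // 4):
        odd_out[2 * i] = src[4 * i + 1]       # rdi+rsi*4+1 -> r8+rsi*2
        odd_out[2 * i + 1] = src[4 * i + 3]   # rdi+rsi*4+3 -> r8+rsi*2+1
        even_out[2 * i] = src[4 * i]          # rdi+rsi*4   -> r9+rsi*2
        even_out[2 * i + 1] = src[4 * i + 2]  # rdi+rsi*4+2 -> r9+rsi*2+1
    return bytes(odd_out), bytes(even_out)


def deinterleave_bulk(src):
    """Same result using strided slices: odd bytes first, even bytes second."""
    return bytes(src[1::2]), bytes(src[0::2])


sample = bytes(range(16))
odd, even = deinterleave_scalar(sample)
```

A model like this is mainly useful as a golden reference when checking that a vectorized replacement produces bytewise identical output.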
<azonenberg> well, I "only" got a 1.78x speedup in an isolated test (haven't tried patching the binary yet but I've confirmed it gives bytewise identical results to the original code)
<azonenberg> Patch complete. No increase in WFM/s on my workstation, i guess i'm bottlenecked on USB or firmware?
<azonenberg> but ps6000d now uses about 20% less CPU to do the same workload
<azonenberg> So i still count that as a win
<_whitenotifier-1> [stm32-cpp] azonenberg pushed 1 commit to master [+0/-0/±3] https://git.io/JitS5
<_whitenotifier-1> [stm32-cpp] azonenberg 7d7212f - Initial STM32F7 EMAC support
<d1b2> <dannas> @azonenberg Oh, I wanna do binary patching too! Saw your twitter post with the details. Looks like great fun! Before that I thought: Hey, how can he replace that loop without altering the size of the function. But then I saw that you made a jump to the end of the elf file.
<azonenberg> dannas: yeah a little trick i use for making code caves is to patch the PT_NOTE program header (segment), present in most ELF files
<azonenberg> so no need to add a new program header
<azonenberg> the note is where the build ID information and other things i can safely discard lives - just metadata that's not used at run time
<azonenberg> so i patch the PT_NOTE to point just past the end of the binary
<azonenberg> then put my new version there, then add a single jmp in the original code. no need for a call because the ret at the end of my function returns to the caller of the original function, assuming i replace a whole function and not just a loop
<azonenberg> which is what i did here
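The code-cave trick described above might look roughly like the following. This is a sketch under stated assumptions, not the actual patching tool: the function names are hypothetical, only little-endian ELF64 is handled, and it assumes the PT_NOTE entry is retyped to PT_LOAD so the loader maps the appended bytes (the log does not spell that step out). The caller is assumed to pick a page-aligned `vaddr` that does not collide with any existing segment.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")          # ELF64 program header, 56 bytes
PT_LOAD, PT_NOTE = 1, 4
PF_X, PF_R = 1, 4
PAGE = 0x1000


def repurpose_note_segment(image, payload, vaddr):
    """Append payload past the end of the file and point the PT_NOTE entry at it.

    image is a bytearray holding the whole ELF file. Returns the file offset
    of the appended payload.
    """
    if image[:4] != b"\x7fELF" or image[4] != 2 or image[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    if vaddr % PAGE:
        raise ValueError("vaddr must be page aligned")
    fields = EHDR.unpack_from(image, 0)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    for i in range(e_phnum):
        entry = e_phoff + i * e_phentsize
        if PHDR.unpack_from(image, entry)[0] != PT_NOTE:
            continue
        image.extend(bytes(-len(image) % PAGE))  # pad so offset and vaddr agree modulo the page size
        file_off = len(image)
        image.extend(payload)
        # Reuse the existing entry, so no new program header is needed.
        PHDR.pack_into(image, entry, PT_LOAD, PF_R | PF_X, file_off,
                       vaddr, vaddr, len(payload), len(payload), PAGE)
        return file_off
    raise ValueError("no PT_NOTE program header found")


def encode_jmp_rel32(src_vaddr, dst_vaddr):
    """x86-64 `jmp rel32`: opcode 0xE9, displacement relative to the next instruction."""
    return b"\xe9" + struct.pack("<i", dst_vaddr - (src_vaddr + 5))
```

The bytes from `encode_jmp_rel32` would then be written over the start of the original function. Because it is a jmp and not a call, the `ret` at the end of the replacement returns straight to whoever called the original.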
<azonenberg> i wrote this tool originally for patching Sonnet, I have / had a patched version of the solver that was substantially faster
<azonenberg> they actually DID vectorize but it was done with SSE, probably in the early 2000s, then never updated for AVX
<azonenberg> so i corrected that :p
<azonenberg> and there were some other optimizations i did too
<azonenberg> I say "had" because i wrote the patches against 17.56 and am now running 18.52 and haven't updated the patches with the new pointers for the current version of the solver yet
<azonenberg> I am working on upstreaming them but they're not in 18.52, i'm told they're working on integrating my fixes in v19
<azonenberg> (i submitted them late in the v18 dev cycle and they didn't want to do it that late)
<azonenberg> i actually patched three different loops there which was fun
<azonenberg> it's a massive application but those were the hottest loops for that test case. i've since done other simulations that hit other hot paths
<azonenberg> so there's a lot more room to tweak
<azonenberg> but now that i've kinda pointed them in the right direction i'm hoping i won't need to be doing hand patching too much longer there
<azonenberg> my goal with all of these things is to upstream the patch. Can be dicey as the RE is technically a eula violation
<azonenberg> so it's best if you have a good working relationship with the vendor going into it
<d1b2> <dannas> I'll remember the PT_NOTE trick for the future!
<bvernoux> azonenberg, have you reported this to them so they can update their code in a new version?
<azonenberg> bvernoux: yeah i sent both vendors the patches
<bvernoux> OK GREAT
<azonenberg> So i just realized
<azonenberg> I already *have* a 6 GHz tone generator, it just doesn't have all of the fancy options of the SSG for modulation and leveling and easy control
<azonenberg> the picovna can do sinewave output
<azonenberg> Which means i can proof of concept the signal processing for my crazy VNA idea right now if i buy some doublers
<miek> what part are you going to use for the power splitter?
<azonenberg> So I plan on actually building two versions of it
<azonenberg> the first will be a benchtop PoC using SMA connectorized components
<azonenberg> That one will use ZFRSC-183-S+
<azonenberg> the second will be SMT with electronically controlled muxes attached via usb or ethernet
<azonenberg> so that I can have a single glscopeclient session controlling the scope, doubler module, and generator
<azonenberg> As a proof of concept I'll be making a little console application that manually instructs you which cables to move around
<azonenberg> And for the very early testing I'll only run out to 6 GHz because that's the range of the VNA
<azonenberg> Which means i can use the VNA to ground truth my experiment
<azonenberg> For the very early testing there will also be no calibration pass with the DUT removed
<azonenberg> which means there's no correction for phase/amplitude mismatch between splitter outputs, scope channels, or cables between splitter and scope
<azonenberg> that will be added later