<d1b2>
<Darius> send them some hot stack traces and hope they fix it? 🙂
<azonenberg>
darius: i sent them screenshots of the profiler
<azonenberg>
and hinted that if they were so kind as to send me source and/or debug symbols i'd have an easier time
<d1b2>
<Darius> hehe
<azonenberg>
But this is not the first time i've done something like this
<azonenberg>
I got a massive discount on my sonnet seat after sending them a "pull request" for their solver that included an AVX implementation of a couple of inner loops providing a 40-70% speedup depending on workload lol
<azonenberg>
I want the computer to be waiting for me all the time, if i'm waiting for it that's a problem which needs to be fixed
<azonenberg>
either by optimizing the code, throwing hardware at it, or both
<azonenberg>
If anybody else is curious the hot loop is at 0x13fe3a3 in libpicoipp.so.1.0.1, latest release from their debian/ubuntu package repository
<azonenberg>
11 instructions of what looks like simple bytewise memory copies
<azonenberg>
looks like it's shuffling/packing data from one format to another or something
<azonenberg>
I haven't reversed it fully but the loop is entirely mov instructions other than the inc / cmp / jmp at the end
<azonenberg>
so it screams "vectorize me"
<azonenberg>
also the instructions are in the wrong order, idk if the cpu is able to correct for this efficiently or if it will end up having constant pipeline stalls
<azonenberg>
but it's four copies of mov r, mem / mov mem, r
<azonenberg>
for different addresses
<azonenberg>
the second mov can't dual issue because it's blocked on the first
<azonenberg>
and all of the copies are single byte wide
<azonenberg>
Looking a bit more, it seems the loop is 2-way unrolled
<azonenberg>
and it's copying data from one input buffer to two output ones, un-interleaving
<azonenberg>
rdi+rsi*4+1 and rdi+rsi*4 + 3 go to r8+rsi*2 and r8+rsi*2 + 1
<azonenberg>
and rdi+rsi*4 and rdi+rsi*4 + 2 go to r9+rsi*2 and r9+rsi*2 + 1
<azonenberg>
that should be easy to convert to a vector load, some shuffles, and a vector store
<azonenberg>
Basically it's moving even bytes to one buffer and odd bytes to the other
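[Editor's sketch] The log does not show the actual replacement routine, so the following is only a minimal illustration of the "vector load, some shuffles, and a vector store" idea, assuming plain SSE2 rather than whatever instruction set the real patch used. The function names `deinterleave_scalar` and `deinterleave_vec` are hypothetical. Following the addresses quoted above, odd bytes go to one output buffer and even bytes to the other; any tail that does not fill a 32-byte chunk falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: even input bytes go to even_out, odd input bytes to
 * odd_out. n is the number of input bytes and is assumed to be even. */
static void deinterleave_scalar(const uint8_t *in, uint8_t *odd_out,
                                uint8_t *even_out, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        even_out[i / 2] = in[i];
        odd_out[i / 2]  = in[i + 1];
    }
}

/* Vectorized variant (hypothetical, SSE2 only): handles 32 input bytes per
 * iteration, then hands the remainder to the scalar loop. */
static void deinterleave_vec(const uint8_t *in, uint8_t *odd_out,
                             uint8_t *even_out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i low_mask = _mm_set1_epi16(0x00FF);
    for (; i + 32 <= n; i += 32) {
        __m128i a = _mm_loadu_si128((const __m128i *)(in + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(in + i + 16));

        /* Even bytes are the low byte of each little-endian 16-bit lane. */
        __m128i even = _mm_packus_epi16(_mm_and_si128(a, low_mask),
                                        _mm_and_si128(b, low_mask));
        /* Odd bytes are the high byte; shift them down before packing. */
        __m128i odd = _mm_packus_epi16(_mm_srli_epi16(a, 8),
                                       _mm_srli_epi16(b, 8));

        _mm_storeu_si128((__m128i *)(even_out + i / 2), even);
        _mm_storeu_si128((__m128i *)(odd_out + i / 2), odd);
    }
#endif
    /* Remaining bytes (or everything, on targets without SSE2). */
    deinterleave_scalar(in + i, odd_out + i / 2, even_out + i / 2, n - i);
}
```

Comparing the two functions byte for byte on the same input is the same kind of check described later in the log (bytewise identical results before patching).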
Degi_ has joined #scopehal
Degi has quit [Ping timeout: 258 seconds]
Degi_ is now known as Degi
<azonenberg>
well, I "only" got a 1.78x speedup in an isolated test (haven't tried patching the binary yet but I've confirmed it gives bytewise identical results to the original code)
<azonenberg>
Patch complete. No increase in WFM/s on my workstation, i guess i'm bottlenecked on USB or firmware?
<azonenberg>
but ps6000d now uses about 20% less CPU to do the same workload
<azonenberg>
So i still count that as a win
<_whitenotifier-1>
[stm32-cpp] azonenberg pushed 1 commit to master [+0/-0/±3] https://git.io/JitS5
<_whitenotifier-1>
[stm32-cpp] azonenberg 7d7212f - Initial STM32F7 EMAC support
someone-else has joined #scopehal
<d1b2>
<dannas> @azonenberg Oh, I wanna do binary patching too! Saw your twitter post with the details. Looks like great fun! Before that I thought: Hey, how can he replace that loop without altering the size of the function. But then I saw that you made a jump to the end of the elf file.
bvernoux1 has quit [Quit: Leaving]
bvernoux has joined #scopehal
<azonenberg>
dannas: yeah a little trick i use for making code caves is to patch the PT_NOTE segment, present in most elf files
<azonenberg>
so no need to add a new program header
<azonenberg>
the note is where the build ID information and other things i can safely discard lives - just metadata that's not used at run time
<azonenberg>
so i patch the PT_NOTE to point just past the end of the binary
<azonenberg>
then put my new version there, then add a single jmp in the original code. no need for a call because the ret at the end of my function returns to the caller of the original function, assuming i replace a whole function and not just a loop
<azonenberg>
which is what i did here
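[Editor's sketch] The tool itself is not shown in the log, so this is only a rough illustration of the two steps described above, under stated assumptions: a 64-bit ELF, a 4 KiB page size, and the commonly described variant in which the PT_NOTE program header entry is rewritten as a readable, executable PT_LOAD that maps bytes appended at the end of the file. The struct `phdr64` mirrors the layout of `Elf64_Phdr` from `<elf.h>`; it and the helpers `claim_note_as_cave` and `encode_jmp_rel32` are hypothetical names, not part of any real tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as Elf64_Phdr in <elf.h>, redefined so the sketch is
 * self-contained. */
typedef struct {
    uint32_t p_type;
    uint32_t p_flags;
    uint64_t p_offset;
    uint64_t p_vaddr;
    uint64_t p_paddr;
    uint64_t p_filesz;
    uint64_t p_memsz;
    uint64_t p_align;
} phdr64;

/* Values of PT_LOAD, PT_NOTE, PF_X and PF_R from the ELF specification. */
enum { K_PT_LOAD = 1, K_PT_NOTE = 4, K_PF_X = 1, K_PF_R = 4 };

/* Turn the first PT_NOTE entry into an R+X PT_LOAD covering cave_size bytes
 * appended at file offset cave_off. cave_vaddr must be otherwise unused and
 * congruent to cave_off modulo the page size (assumed 4 KiB here).
 * Returns the index of the patched entry, or -1 on failure. */
static int claim_note_as_cave(phdr64 *ph, size_t count, uint64_t cave_off,
                              uint64_t cave_vaddr, uint64_t cave_size)
{
    const uint64_t page = 0x1000;
    if ((cave_off % page) != (cave_vaddr % page))
        return -1;
    for (size_t i = 0; i < count; i++) {
        if (ph[i].p_type != K_PT_NOTE)
            continue;
        ph[i].p_type   = K_PT_LOAD;
        ph[i].p_flags  = K_PF_R | K_PF_X;
        ph[i].p_offset = cave_off;
        ph[i].p_vaddr  = cave_vaddr;
        ph[i].p_paddr  = cave_vaddr;
        ph[i].p_filesz = cave_size;
        ph[i].p_memsz  = cave_size;
        ph[i].p_align  = page;
        return (int)i;
    }
    return -1;
}

/* Encode an x86-64 `jmp rel32` (opcode E9) located at from_vaddr that lands
 * on to_vaddr. The displacement is relative to the end of the 5-byte
 * instruction. Returns 0 on success, -1 if the target is out of range. */
static int encode_jmp_rel32(uint8_t out[5], uint64_t from_vaddr,
                            uint64_t to_vaddr)
{
    int64_t disp = (int64_t)(to_vaddr - (from_vaddr + 5));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;
    uint32_t d = (uint32_t)(int32_t)disp;
    out[0] = 0xE9;
    out[1] = (uint8_t)(d & 0xFF);
    out[2] = (uint8_t)((d >> 8) & 0xFF);
    out[3] = (uint8_t)((d >> 16) & 0xFF);
    out[4] = (uint8_t)((d >> 24) & 0xFF);
    return 0;
}
```

In this sketch the five jmp bytes would overwrite the start of the original function, and the replacement function would be written at `cave_off`. Because the jump is a plain jmp and not a call, the replacement's final ret returns straight to whoever called the original function, as described above.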
<azonenberg>
i wrote this tool originally for patching Sonnet, I have / had a patched version of the solver that was substantially faster
<azonenberg>
they actually DID vectorize but it was probably done with SSE in the early 2000s, then never updated for AVX
<azonenberg>
so i corrected that :p
<azonenberg>
and there were some other optimizations i did too
<azonenberg>
I say "had" because i wrote the patches against 17.56 and am now running 18.52 and haven't updated the patches with the new pointers for the current version of the solver yet
<azonenberg>
I am working on upstreaming them but they're not in 18.52, i'm told they're working on integrating my fixes in v19
<azonenberg>
(i submitted them late in the v18 dev cycle and they didn't want to do it that late)
<azonenberg>
i actually patched three different loops there which was fun
<azonenberg>
it's a massive application but those were the hottest loops for that test case. i've since done other simulations that hit other hot paths
<azonenberg>
so there's a lot more room to tweak
<azonenberg>
but now that i've kinda pointed them in the right direction i'm hoping i won't need to be doing hand patching too much longer there
<azonenberg>
my goal with all of these things is to upstream the patch. Can be dicey as the RE is technically a eula violation
<azonenberg>
so it's best if you have a good working relationship with the vendor going into it
<d1b2>
<dannas> I'll remember the PT_NOTE trick for the future!
<bvernoux>
azonenberg, have you reported this to them so they can update their code in a new version?
someone-else has quit [Quit: Connection closed]
massi has quit [Remote host closed the connection]
someone-else has joined #scopehal
GenTooMan has quit [Quit: Leaving]
GenTooMan has joined #scopehal
<azonenberg>
bvernoux: yeah i sent both vendors the patches
<bvernoux>
OK GREAT
bvernoux1 has joined #scopehal
bvernoux has quit [Ping timeout: 260 seconds]
JSharp has quit [*.net *.split]
bvernoux1 has quit [Read error: Connection reset by peer]
JSharp has joined #scopehal
someone-else has quit [Quit: Connection closed]
<azonenberg>
So i just realized
<azonenberg>
I already *have* a 6 GHz tone generator, it just doesn't have all of the fancy options of the SSG for modulation and leveling and easy control
<azonenberg>
the picovna can do sinewave output
<azonenberg>
Which means i can proof of concept the signal processing for my crazy VNA idea right now if i buy some doublers
<miek>
what part are you going to use for the power splitter?
<azonenberg>
So I plan on actually building two versions of it
<azonenberg>
the first will be a benchtop PoC using SMA connectorized components
<azonenberg>
That one will use ZFRSC-183-S+
<azonenberg>
the second will be SMT with electronically controlled muxes attached via usb or ethernet
<azonenberg>
so that I can have a single glscopeclient session controlling the scope, doubler module, and generator
<azonenberg>
As a proof of concept I'll be making a little console application that manually instructs you which cables to move around
<azonenberg>
And for the very early testing I'll only run out to 6 GHz because that's the range of the VNA
<azonenberg>
Which means i can use the VNA to ground truth my experiment
<azonenberg>
For the very early testing there will also be no calibration with no DUT
<azonenberg>
which means there's no correction for phase/amplitude mismatch between splitter outputs, scope channels, or cables between splitter and scope