#scopehal on 2021-10-22 — irc logs at libera.irclog.whitequark.org

2021-05-22 06:58 azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/azonenberg/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal

01:55 <d1b2> <Darius> send them some hot stack traces and hope they fix it? 🙂

01:57 <azonenberg> darius: i sent them screenshots of the profiler

01:57 <azonenberg> and hinted that if they were so kind as to send me source and/or debug symbols i'd have an easier time

01:58 <d1b2> <Darius> hehe

01:59 <azonenberg> But this is not the first time i've done something like this

01:59 <azonenberg> I got a massive discount on my sonnet seat after sending them a "pull request" for their solver that included an AVX implementation of a couple of inner loops providing a 40-70% speedup depending on workload lol

02:00 <azonenberg> I want the computer to be waiting for me all the time, if i'm waiting for it that's a problem which needs to be fixed

02:00 <azonenberg> either by optimizing the code, throwing hardware at it, or both

02:01 <azonenberg> If anybody else is curious the hot loop is at 0x13fe3a3 in libpicoipp.so.1.0.1, latest release from their debian/ubuntu package repository

02:02 <azonenberg> 11 instructions of what looks like simple bytewise memory copies

02:02 <azonenberg> looks like it's shuffling/packing data from one format to another or something

02:03 <azonenberg> I haven't reversed it fully but the loop is entirely mov instructions other than the inc / cmp / jmp at the end

02:03 <azonenberg> so it screams "vectorize me"

02:03 <azonenberg> also the instructions are in the wrong order, idk if the cpu is able to correct for this efficiently or if it will end up having constant pipeline stalls

02:04 <azonenberg> but it's four copies of mov r, mem / mov mem, r

02:04 <azonenberg> for different addresses

02:04 <azonenberg> the second mov can't dual issue because it's blocked on the first

02:04 <azonenberg> and all of the copies are single byte wide

02:08 <azonenberg> Looking a bit more, it seems the loop is 2-way unrolled

02:08 <azonenberg> and it's copying data from one input buffer to two output ones, un-interleaving

02:09 <azonenberg> rdi+rsi*4+1 and rdi+rsi*4 + 3 go to r8+rsi*2 and r8+rsi*2 + 1

02:10 <azonenberg> and rdi+rsi*4 and rdi+rsi*4 + 2 go to r9+rsi*2 and r9+rsi*2 + 1

02:10 <azonenberg> that should be easy to convert to a vector load, some shuffles, and a vector store

02:11 <azonenberg> Basically it's moving even bytes to one buffer and odd bytes to the other
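
The even/odd split described above can be sketched in C as follows. This is a minimal illustration, not the vendor's code and not the patch azonenberg actually wrote: the function names are hypothetical, and the SSE2 mask/shift/pack approach is only one way to do the "vector load, some shuffles, vector store" (the log does not say which instructions the real patch used). The SSE2 path is compiled only when the compiler defines `__SSE2__`; otherwise the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Scalar reference: byte 2*i goes to even[i], byte 2*i+1 goes to odd[i].
   n_pairs is the number of (even, odd) byte pairs available in `in`. */
static void deinterleave_scalar(const uint8_t *in, uint8_t *even,
                                uint8_t *odd, size_t n_pairs)
{
    for (size_t i = 0; i < n_pairs; i++) {
        even[i] = in[2 * i];
        odd[i]  = in[2 * i + 1];
    }
}

/* Same result as the scalar loop, but 16 pairs (32 input bytes) per
   iteration when SSE2 is available. Hypothetical helper for illustration. */
static void deinterleave(const uint8_t *in, uint8_t *even,
                         uint8_t *odd, size_t n_pairs)
{
    assert(in != NULL && even != NULL && odd != NULL);
    size_t i = 0;
#ifdef __SSE2__
    const __m128i low_byte = _mm_set1_epi16(0x00FF);
    for (; i + 16 <= n_pairs; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(in + 2 * i));
        __m128i b = _mm_loadu_si128((const __m128i *)(in + 2 * i + 16));
        /* Each 16-bit lane holds one pair: low byte = even, high byte = odd.
           Mask or shift to isolate one of them, then pack lanes to bytes. */
        __m128i e = _mm_packus_epi16(_mm_and_si128(a, low_byte),
                                     _mm_and_si128(b, low_byte));
        __m128i o = _mm_packus_epi16(_mm_srli_epi16(a, 8),
                                     _mm_srli_epi16(b, 8));
        _mm_storeu_si128((__m128i *)(even + i), e);
        _mm_storeu_si128((__m128i *)(odd + i), o);
    }
#endif
    /* Tail (or the whole buffer when SSE2 is unavailable). */
    deinterleave_scalar(in + 2 * i, even + i, odd + i, n_pairs - i);
}
```

The pack step cannot saturate here because every lane is already in the range 0..255 after the mask or shift, so the output is bytewise identical to the scalar loop.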

02:33 Degi_ has joined #scopehal

02:33 Degi has quit [Ping timeout: 258 seconds]

02:33 Degi_ is now known as Degi

04:58 <azonenberg> well, I "only" got a 1.78x speedup in an isolated test (haven't tried patching the binary yet but I've confirmed it gives bytewise identical results to the original code)

07:36 <azonenberg> Patch complete. No increase in WFM/s on my workstation, i guess i'm bottlenecked on USB or firmware?

07:37 <azonenberg> but ps6000d now uses about 20% less CPU to do the same workload

07:37 <azonenberg> So i still count that as a win

10:03 <_whitenotifier-1> [stm32-cpp] azonenberg pushed 1 commit to master [+0/-0/±3] https://git.io/JitS5

10:03 <_whitenotifier-1> [stm32-cpp] azonenberg 7d7212f - Initial STM32F7 EMAC support

10:23 someone-else has joined #scopehal

10:31 <d1b2> <dannas> @azonenberg Oh, I wanna do binary patching too! Saw your twitter post with the details. Looks like great fun! Before that I thought: Hey, how can he replace that loop without altering the size of the function. But then I saw that you made a jump to the end of the elf file.

10:35 bvernoux1 has quit [Quit: Leaving]

10:35 bvernoux has joined #scopehal

10:37 <azonenberg> dannas: yeah a little trick i use for making code caves is to patch the PT_NOTE segment, present in most elf files

10:37 <azonenberg> so no need to add a new program header

10:37 <azonenberg> the note is where the build ID information and other things i can safely discard live - just metadata that's not used at run time

10:37 <azonenberg> so i patch the PT_NOTE to point just past the end of the binary

10:38 <azonenberg> then put my new version there, then add a single jmp in the original code. no need for a call because the ret at the end of my function returns to the caller of the original function, assuming i replace a whole function and not just a loop

10:38 <azonenberg> which is what i did here
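
The code-cave trick described above can be sketched in C roughly like this. It is not azonenberg's tool: the struct and function names (`Ehdr64`, `Phdr64`, `retarget_note`, `encode_jmp_rel32`) and the `free_vaddr_base` parameter are hypothetical. Two things are assumptions that the log does not spell out: the entry is retyped to PT_LOAD with read and execute flags so the loader maps the appended bytes, and the host is little-endian like the ELF64 file being patched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal ELF64 layouts, same field order as Elf64_Ehdr / Elf64_Phdr. */
typedef struct {
    unsigned char e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version;
    uint64_t e_entry, e_phoff, e_shoff;
    uint32_t e_flags;
    uint16_t e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx;
} Ehdr64;

typedef struct {
    uint32_t p_type, p_flags;
    uint64_t p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align;
} Phdr64;

enum { PT_LOAD_ID = 1, PT_NOTE_ID = 4, PF_X_BIT = 1, PF_R_BIT = 4 };

/* Repoints the first PT_NOTE program header at cave_size bytes that the
   caller will append at offset file_size. free_vaddr_base is a page-aligned
   address assumed to be unused by every existing segment.
   Returns the cave's virtual address, or 0 on failure. */
static uint64_t retarget_note(uint8_t *image, uint64_t file_size,
                              uint64_t free_vaddr_base, uint64_t cave_size)
{
    Ehdr64 eh;
    if (file_size < sizeof eh) return 0;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, "\177ELF", 4) != 0 || eh.e_ident[4] != 2)
        return 0;                               /* not ELFCLASS64 */
    if (eh.e_phentsize < sizeof(Phdr64) ||
        eh.e_phoff + (uint64_t)eh.e_phnum * eh.e_phentsize > file_size)
        return 0;

    for (uint16_t i = 0; i < eh.e_phnum; i++) {
        uint8_t *slot = image + eh.e_phoff + (uint64_t)i * eh.e_phentsize;
        Phdr64 ph;
        memcpy(&ph, slot, sizeof ph);
        if (ph.p_type != PT_NOTE_ID) continue;

        /* Loadable segments need p_vaddr and p_offset congruent mod page size. */
        uint64_t vaddr = free_vaddr_base + (file_size % 0x1000);
        ph.p_type   = PT_LOAD_ID;               /* assumption, see lead-in */
        ph.p_flags  = PF_R_BIT | PF_X_BIT;
        ph.p_offset = file_size;                /* just past the old end of file */
        ph.p_vaddr  = ph.p_paddr = vaddr;
        ph.p_filesz = ph.p_memsz = cave_size;
        ph.p_align  = 0x1000;
        memcpy(slot, &ph, sizeof ph);
        return vaddr;
    }
    return 0;                                   /* no PT_NOTE entry found */
}

/* Writes a 5-byte x86 `jmp rel32` (opcode 0xE9). `from` is the address of the
   jmp itself; the displacement is relative to the following instruction. */
static void encode_jmp_rel32(uint8_t dst[5], uint64_t from, uint64_t to)
{
    int64_t rel = (int64_t)(to - (from + 5));
    assert(rel >= INT32_MIN && rel <= INT32_MAX);
    uint32_t r = (uint32_t)(int32_t)rel;
    dst[0] = 0xE9;
    for (int i = 0; i < 4; i++)
        dst[1 + i] = (uint8_t)(r >> (8 * i));
}
```

With the header repointed, the replacement function is appended to the file at the old end, and `encode_jmp_rel32` overwrites the first five bytes of the original function. Because it is a jmp and not a call, the return address on the stack still belongs to the original caller, which is why the replacement's own ret is enough.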

10:39 <azonenberg> i wrote this tool originally for patching Sonnet, I have / had a patched version of the solver that was substantially faster

10:39 <azonenberg> they actually DID vectorize but it was done with SSE, probably in the early 2000s, then never updated for AVX

10:39 <azonenberg> so i corrected that :p

10:39 <azonenberg> and there were some other optimizations i did too

10:40 <azonenberg> I say "had" because i wrote the patches against 17.56 and am now running 18.52 and haven't updated the patches with the new pointers for the current version of the solver yet

10:40 <azonenberg> I am working on upstreaming them but they're not in 18.52, i'm told they're working on integrating my fixes in v19

10:40 <azonenberg> (i submitted them late in the v18 dev cycle and they didn't want to do it that late)

10:41 <azonenberg> i actually patched three different loops there which was fun

10:42 <azonenberg> it's a massive application but those were the hottest loops for that test case. i've since done other simulations that hit other hot paths

10:42 <azonenberg> so there's a lot more room to tweak

10:42 <azonenberg> but now that i've kinda pointed them in the right direction i'm hoping i won't need to be doing hand patching too much longer there

10:42 <azonenberg> my goal with all of these things is to upstream the patch. Can be dicey as the RE is technically a eula violation

10:43 <azonenberg> so it's best if you have a good working relationship with the vendor going into it

11:06 <d1b2> <dannas> I'll remember the PT_NOTE trick for the future!

11:38 <bvernoux> azonenberg, have you reported this to them so they can update their code in a new version?

12:27 someone-else has quit [Quit: Connection closed]

12:57 massi has quit [Remote host closed the connection]

13:34 someone-else has joined #scopehal

15:13 GenTooMan has quit [Quit: Leaving]

15:13 GenTooMan has joined #scopehal

16:59 <azonenberg> bvernoux: yeah i sent both vendors the patches

17:04 <bvernoux> OK GREAT

17:46 bvernoux1 has joined #scopehal

17:49 bvernoux has quit [Ping timeout: 260 seconds]

21:31 JSharp has quit [*.net *.split]

21:31 bvernoux1 has quit [Read error: Connection reset by peer]

21:33 JSharp has joined #scopehal

23:06 someone-else has quit [Quit: Connection closed]

23:23 <azonenberg> So i just realized

23:24 <azonenberg> I already *have* a 6 GHz tone generator, it just doesn't have all of the fancy options of the SSG for modulation and leveling and easy control

23:24 <azonenberg> the picovna can do sinewave output

23:24 <azonenberg> Which means i can proof of concept the signal processing for my crazy VNA idea right now if i buy some doublers

23:45 <miek> what part are you going to use for the power splitter?

23:52 <azonenberg> So I plan on actually building two versions of it

23:52 <azonenberg> the first will be a benchtop PoC using SMA connectorized components

23:53 <azonenberg> That one will use ZFRSC-183-S+

23:53 <azonenberg> the second will be SMT with electronically controlled muxes attached via usb or ethernet

23:53 <azonenberg> so that I can have a single glscopeclient session controlling the scope, doubler module, and generator

23:53 <azonenberg> As a proof of concept I'll be making a little console application that manually instructs you which cables to move around

23:54 <azonenberg> And for the very early testing I'll only run out to 6 GHz because that's the range of the VNA

23:54 <azonenberg> Which means i can use the VNA to ground truth my experiment

23:55 <azonenberg> For the very early testing there will also be no calibration with no DUT

23:55 <azonenberg> which means there's no correction for phase/amplitude mismatch between splitter outputs, scope channels, or cables between splitter and scope

23:55 <azonenberg> that will be added later