<d1b2>
<azonenberg> Yeah so the two things i have planned are a) lighting up the second fiber on the xen box and b) lighting up the second fibers on the storage nodes so i can have one set of ports for client accesses while replication uses a second one
<d1b2>
<azonenberg> right now replication competes with client accesses for a single 10G pipe
<d1b2>
<azonenberg> that's a pointwise 1D multiplication
<d1b2>
<azonenberg> not a matrix mult
<d1b2>
<johnsel> there are some other places where tensor cores may still speed up things
<d1b2>
<azonenberg> i.e. out[i] = a[i] * b[i]
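For concreteness, a rough CPU-side sketch of that operation (hypothetical function name, not the actual filter code): every element is independent, so on the GPU it maps to one shader invocation per index, with no matrix structure for a tensor unit to exploit.

    #include <cstddef>

    // Pointwise 1D multiply: one independent multiply per element, O(n) work.
    // On the GPU this becomes one shader invocation per index i.
    void PointwiseMultiply(const float* a, const float* b, float* out, size_t n)
    {
        for(size_t i = 0; i < n; i++)
            out[i] = a[i] * b[i];
    }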
<d1b2>
<azonenberg> i looked at the tensor accelerator stuff a while back and precision mostly seemed too low to be useful to us. and matrix mult is not an operation we do much of in DSP
<d1b2>
<johnsel> it can be rewritten
<d1b2>
<azonenberg> anyway near term i am far more interested in getting more stuff into shaders at all than in optimizing existing ones
<d1b2>
<johnsel> and precision too low? it supports fp32?
<d1b2>
<azonenberg> ah i might be thinking of a different extension, i was thinking of ones that were mostly like int8/fp16
<d1b2>
<azonenberg> anyway for a lot of filter graphs the bottleneck is transferring data host-device-host and back a bunch
<d1b2>
<johnsel> sure I would obviously fix both problems at once
<d1b2>
<azonenberg> so minimizing that, i.e. moving trivial math onto the gpu so we dont have to go to the cpu to do it and add a round trip, is more of a priority
<d1b2>
<johnsel> it would be good to collect data on which filters are slow
<d1b2>
<azonenberg> Yeah performance tuning is something i will probably focus more on for v0.2
<d1b2>
<azonenberg> there are some architectural things i think we could use to get a lot of boost with some rewriting of things
<d1b2>
<johnsel> I do want to implement tensor cores regardless as it's my own interest
<d1b2>
<azonenberg> but i want to focus on making things work for v0.1
<d1b2>
<azonenberg> it's been too long, we need to just get a release out there
<d1b2>
<johnsel> sure, it's not something I expect to PR soon, I just want to dig deep into Vulkan and this seems like a useful and interesting way to do so
<d1b2>
<johnsel> Tensor cores also do huge calculations in a single clock so you can draw out lots of TFlops
<d1b2>
<johnsel> anyway if you have data on filter execution times/priority feel free to share it; if not I will collect it myself when I continue with this
<d1b2>
<azonenberg> i dont have data handy right now, i generally collected it for specific datasets to benchmark a particular filter i was optimizing
<d1b2>
<azonenberg> one thing to note is that a lot of our code is memory bandwidth bound, not flops bound
<d1b2>
<azonenberg> the FIR filter with large tap sizes being one of the few exceptions
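As a rough illustration of why the FIR case ends up compute bound (a generic direct-form FIR, not the scopehal implementation): each output sample costs one multiply-accumulate per tap, so arithmetic grows with the tap count while the amount of data read stays roughly constant.

    #include <cstddef>
    #include <vector>

    // Generic direct-form FIR: ntaps multiply-accumulates per output sample,
    // so large tap counts make the filter flops-bound rather than bandwidth-bound.
    void FirFilter(const std::vector<float>& in, const std::vector<float>& taps,
                   std::vector<float>& out)
    {
        size_t ntaps = taps.size();
        out.assign(in.size() >= ntaps ? in.size() - ntaps + 1 : 0, 0.0f);
        for(size_t i = 0; i < out.size(); i++)
        {
            float acc = 0;
            for(size_t j = 0; j < ntaps; j++)
                acc += taps[j] * in[i + j];
            out[i] = acc;
        }
    }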
<d1b2>
<azonenberg> also many of the more complex protocol decodes get bottlenecked on the (not very parallelizable) clock recovery PLL filter, followed by decodes that dont use much CPU because i've reduced a string of voltages down into bits and bytes i can process in far fewer cycles
<d1b2>
<azonenberg> The eye pattern is currently semi AVX but i think it might be possible to do on GPU with some effort
<d1b2>
<azonenberg> that's definitely one i want to optimize
<d1b2>
<azonenberg> FFT is already pretty well tuned i think
<d1b2>
<azonenberg> channel emulation and de-embed i think are pushed pretty far, however there is room to do a more fundamental algorithmic optimization by using overlap-add to do multiple smaller FFTs instead of one big one
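A sketch of the overlap-add idea, assuming hypothetical fft()/ifft() helpers standing in for whatever FFT backend is used (not the existing de-embed code): the record is filtered in blocks of length L with FFTs of size L + M - 1, and the overlapping tails of each block are summed.

    #include <algorithm>
    #include <complex>
    #include <cstddef>
    #include <vector>

    // Hypothetical helpers assumed for this sketch (e.g. wrapping an FFT library);
    // fft() is assumed to zero-pad its input to length n.
    std::vector<std::complex<float>> fft(const std::vector<float>& x, size_t n);
    std::vector<float> ifft(const std::vector<std::complex<float>>& X, size_t n);

    // Overlap-add convolution: many small FFTs of size N = L + M - 1 instead of
    // one FFT over the whole record, summing the overlapping tails of each block.
    std::vector<float> OverlapAddFilter(const std::vector<float>& signal,
                                        const std::vector<float>& kernel, size_t L)
    {
        const size_t M = kernel.size();
        const size_t N = L + M - 1;
        const auto K = fft(kernel, N);                  // kernel spectrum, computed once

        std::vector<float> out(signal.size() + M - 1, 0.0f);
        for(size_t start = 0; start < signal.size(); start += L)
        {
            size_t len = std::min(L, signal.size() - start);
            std::vector<float> block(signal.begin() + start, signal.begin() + start + len);
            auto B = fft(block, N);
            for(size_t k = 0; k < B.size(); k++)
                B[k] *= K[k];                           // pointwise multiply in frequency domain
            auto y = ifft(B, N);
            for(size_t i = 0; i < N && start + i < out.size(); i++)
                out[start + i] += y[i];                 // add the overlapping tail
        }
        return out;
    }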
<d1b2>
<azonenberg> also for clock recovery i want to explore multi-instancing the PLL
<d1b2>
<azonenberg> i.e. divide the waveform up into say 16 sub-regions and multithread the CDR
<d1b2>
<azonenberg> knowing there might be a small discontinuity at the region boundaries
<d1b2>
<azonenberg> this might not be suitable for SI work so it'd be an opt-in
<d1b2>
<azonenberg> but would probably be good enough for protocol decoding
<d1b2>
<azonenberg> it might even be possible to get good enough for SI, TBD
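A minimal sketch of that multi-instanced CDR idea, assuming a hypothetical serial RecoverClock() helper rather than the real scopehal CDR filter: split the waveform into regions, run each on its own thread, and accept possible jitter where regions meet.

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Hypothetical types/helper for this sketch; not the actual scopehal CDR API.
    struct Edge { double timestamp; };
    std::vector<Edge> RecoverClock(const std::vector<float>& samples,
                                   size_t start, size_t end);    // serial PLL-based CDR

    // Run the serial CDR on N sub-regions in parallel and concatenate the results,
    // accepting a small phase discontinuity at each region boundary.
    std::vector<Edge> ParallelCDR(const std::vector<float>& samples, size_t nRegions = 16)
    {
        size_t chunk = (samples.size() + nRegions - 1) / nRegions;
        std::vector<std::future<std::vector<Edge>>> jobs;
        for(size_t r = 0; r < nRegions; r++)
        {
            size_t start = r * chunk;
            size_t end = std::min(samples.size(), start + chunk);
            if(start >= end)
                break;
            jobs.push_back(std::async(std::launch::async,
                [&samples, start, end] { return RecoverClock(samples, start, end); }));
        }

        std::vector<Edge> edges;
        for(auto& j : jobs)
        {
            auto part = j.get();
            edges.insert(edges.end(), part.begin(), part.end());  // boundary edges may jitter
        }
        return edges;
    }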
<d1b2>
<azonenberg> @johnsel anyway if you want to play with shaders, top priority is probably looking at the rendering shader to see if you can track down the issue @246tnt is having
<d1b2>
<azonenberg> i consider that a release blocker
Degi has quit [Ping timeout: 256 seconds]
Degi has joined #scopehal
<d1b2>
<johnsel> Yes, on most platforms. I am also interested in optimizing more for NVidia's Jetson Orin modules, which have a different performance profile. They may or may not be providing some hardware for my scope prototype 🙂
<d1b2>
<johnsel> Note that I haven't profiled anything yet, but based on my experience optimizing neural networks they're really worth the effort if you can rewrite the algorithms to leverage them properly. Of course PyTorch and the like already do most of the hard work for you, especially nowadays, but vectorizing and optimizing memory access patterns definitely made a lot of difference
<d1b2>
<johnsel> Noted
<d1b2>
<johnsel> Yeah, that is an interesting problem (re: the nco/pll). Definitely not my area of expertise and a big bottleneck for the more interesting problems
<d1b2>
<johnsel> I actually think the FFT, channel emulation, and de-embed would benefit most from vectorizing and parallelizing; the tensor cores output an 8x8x8 mult+acc operation in the same time, but the complex numbers make it less desirable.
<d1b2>
<johnsel> if I thought I would get somewhere I would take a look but I really don't unfortunately haha
<d1b2>
<246tnt> Interestingly I get a segfault when quitting 🤔
bvernoux has joined #scopehal
<d1b2>
<azonenberg> i've seen that a few times related to the measurement window
<d1b2>
<azonenberg> havent root caused it
<d1b2>
<azonenberg> segfaults on quit are annoying but not release blocking unless they also occur when closing the session without quitting
<d1b2>
<246tnt> Here I have it without even starting or adding any instrument, just start and quit.
<d1b2>
<246tnt> Could also be a mesa bug of course, but it happens with anv and lvp (I know it doesn't run/display with lvp but it starts at least).
<_whitenotifier-3>
[scopehal-apps] smunaut 29c94d9 - VulkanWindow: Releave descriptor pool after imgui cleanup The descriptor pool needs to remain valid until we're done with ImGUI. Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
<_whitenotifier-3>
[scopehal-apps] azonenberg c8226bc - Merge pull request #662 from smunaut/fix-vulkan-descriptorpool-release VulkanWindow: Releave descriptor pool after imgui cleanup
<azonenberg>
Merged
<azonenberg>
Yay for squishing bugs even if not that one
<_whitenotifier-3>
[scopehal-apps] azonenberg ee54e41 - Fixed bug where m_nodeGroupMap would not be cleared when a filter graph group was deleted. Fixes #661.
<d1b2>
<azonenberg> if you do confirm that's the problem let me spend some time on a proper fix
<d1b2>
<azonenberg> I'm probably the only person who properly understands this shader at this point lol
<d1b2>
<246tnt> There's also some sign-related weirdness (like uint iend and then if (iend <= 0) ...)
<d1b2>
<246tnt> And some signedness that doesn't match between the cpp struct and the layout in the shader.
<d1b2>
<246tnt> But they don't seem to be the issue here.
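For illustration only (a made-up snippet, not the actual shader code; GLSL has the same semantics as C here), this is the kind of thing the unsigned comparison can hide:

    #include <cstdint>

    // With an unsigned index, "iend <= 0" can only ever mean "iend == 0", and any
    // computation that should have gone negative has already wrapped to a huge value.
    void Example(uint32_t start, uint32_t len)
    {
        uint32_t iend = start - len;   // wraps around instead of going negative when len > start
        if(iend <= 0)                  // effectively iend == 0; the wrapped case slips through
            return;
        // ... a loop bounded by iend would then run (almost) 2^32 times in the wrapped case
    }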
<d1b2>
<azonenberg> we should definitely verify that though
<d1b2>
<azonenberg> long term we are going to need to do a lot of retooling to support >4GB (>1Gpoint) waveforms
<d1b2>
<azonenberg> since vulkan doesnt let you allocate >4GB at a time
<d1b2>
<azonenberg> but we're a ways from being able to handle stuff that large efficiently anyway
<d1b2>
<azonenberg> (some of this stuff may be vestigial from previous shaders using opengl or opencl that had different addressing models and full 64 bit indexes etc)
<d1b2>
<246tnt> So I'm pretty sure the problem is indeed the shared g_done ... but my attempts to fix it with atomic have all failed so far :/
<d1b2>
<246tnt> That's my best working attempt ATM: https://pastebin.com/vfrue3Ts - with this I've so far been unable to reproduce the issue.
<d1b2>
<azonenberg> If g_done is the issue then i'll spend some time poking on it later today
<d1b2>
<azonenberg> i want to avoid using atomics as they're slow
<d1b2>
<azonenberg> i'd rather have, for example, the first thread in the block set g_done and the others just set a local "not proceeding" variable
<d1b2>
<azonenberg> it might also just be that we need a memory barrier (but this could be slow enough its even worse)
<d1b2>
<246tnt> So there might be something else going on ...
<d1b2>
<azonenberg> Will look later, busy right now. but i'm starting to suspect it's a memory ordering issue where something might not be getting flushed to all threads or something
<d1b2>
<246tnt> Anyone tried to use GL_EXT_debug_printf ? Wondering how to use it with mesa ...
<azonenberg>
i have used the vulkan version, yes. you need to use the vulkan validation layer to access the print messages
<azonenberg>
they go to stdout once you enable it in vkconfig
<azonenberg>
under "gpu base" select "debug printf"
<azonenberg>
tnt: i'm getting vulkan validation errors related to your last PR btw
<azonenberg>
need to investigate
<d1b2>
<246tnt> Mmm, interesting. I get no error. I did get some errors before the PR (that's how I narrowed down the issue) but now the only one I get is vkCreateSwapchainKHR() called with minImageCount = 2
bvernoux has quit [Read error: Connection reset by peer]