azonenberg changed the topic of #scopehal to: ngscopeclient, libscopehal, and libscopeprotocols development and testing | https://github.com/ngscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
<d1b2> <azonenberg> Yeah so the two things i have planned are a) lighting up the second fiber on the xen box and b) lighting up the second fibers on the storage nodes so i can have one set of ports for client accesses while replication uses a second one
<d1b2> <azonenberg> right now replication competes with client accesses for a single 10G pipe
<d1b2> <johnsel> https://www.jedec.org/document_search?search_api_views_fulltext=jesd79-4%20ddr4 the summaries say nothing about increased speed
<d1b2> <azonenberg> which limits your write speed to something like 2.5 Gbps
<d1b2> <azonenberg> while reads, if you are on a 40G pipe, can get up to 15-20
<d1b2> <johnsel> it's not that important though
<d1b2> <johnsel> sounds good
<d1b2> <johnsel> we can/should also investigate the initial apt-get
<d1b2> <johnsel> the build itself is now ~5 min but we also have 4+ min of VM deploy on top of that
<d1b2> <johnsel> lol
<d1b2> <johnsel> read closely
<d1b2> <david.rysk> the September 2012 revision I found by googling includes 1600, 1866, 2133, 2400, 2666, 3200 but a lot of stuff is "TBD"
<d1b2> <johnsel> yes I think you are right
<d1b2> <johnsel> it is supported but not very well
<d1b2> <johnsel> very few actual sticks on the market
<d1b2> <johnsel> most are XMP profiles
<d1b2> <david.rysk> there's plenty on the market if you look for ECC
<d1b2> <david.rysk> consumer.... yeahhhh
<d1b2> <johnsel> apparently my motherboard also supports ecc
<d1b2> <david.rysk> does your CPU too?
<d1b2> <david.rysk> most AMD CPUs that are not APUs do
<d1b2> <johnsel> I think so but checking
<d1b2> <johnsel> 5800x
<d1b2> <david.rysk> yeah it does
<d1b2> <david.rysk> just have to be UDIMMs
<d1b2> <david.rysk> it's interesting though, you're finally seeing gaming ECC memory meant for the new gen Threadripper
<d1b2> <azonenberg> ecc ram with rgb leds? that will be a sight lol
<d1b2> <azonenberg> I'm completely out of the consumer / gaming space these days aside from GPUs
<d1b2> <azonenberg> i havent done a build without a xeon in idk how many years
<d1b2> <david.rysk> I did an i9-13900k build for a friend some months ago
<d1b2> <david.rysk> used a W680 based board which does ECC with that CPU
<d1b2> <david.rysk> A slightly more expensive version of the board I used comes with an IPMI card
<d1b2> <david.rysk> Asrock makes some nice "low end server" boards with IPMI as well
<d1b2> <johnsel> Asrock is such a shitty brand though
<d1b2> <david.rysk> apparently they've greatly improved over the past few years
<d1b2> <johnsel> it's a very small company with terrible service
<d1b2> <david.rysk> and are no longer shitty like they were some years back
<d1b2> <azonenberg> interesting. every build i've done in the past 5ish years has been supermicro based
<d1b2> <johnsel> well perhaps, but they told me they weren't interested in the fact my motherboard would not boot with certain RAM
<d1b2> <johnsel> because all it would give them is "information" and "maybe a new bios update will fix it at some point"
<d1b2> <johnsel> and they were such a small company and they couldn't help everyone
<d1b2> <david.rysk> I used ASUS before, but then that whole scandal happened
<d1b2> <david.rysk> with poor voltage control leading to 7800x3d CPUs getting (literally) blown up
<d1b2> <johnsel> I mean every brand can have such a thing with however many products they release
<d1b2> <johnsel> it's unfortunate, but NVIDIA, AMD, and every other brand have had major incidents at some point
<d1b2> <johnsel> Samsung phones that blew up in people's pockets
<d1b2> <david.rysk> I went with Asrock for my latest build because they were the only vendor with onboard USB4
<d1b2> <david.rysk> (which means I could use a Thunderscope)
<d1b2> <johnsel> Yeah, that's the shame: they're nicely featured and very price competitive
<d1b2> <david.rysk> Other vendors have USB4 with an expensive add-in card
<d1b2> <johnsel> as long as you have no problems with the board they're great
<d1b2> <david.rysk> I bought one in 2015 or so and it came DOA
<d1b2> <david.rysk> lol
<d1b2> <johnsel> well, DOA they can still handle at least
<d1b2> <johnsel> that's just RMA and replacement
<d1b2> <johnsel> nobody will question that
<d1b2> <johnsel> but like my situation they basically just told me to buy new RAM
<d1b2> <johnsel> which is ridiculous to say to somebody who just bought new RAM + their motherboard
<d1b2> <johnsel> Yep I bought the Asus B550 Creator for the same reason
<d1b2> <johnsel> although its Thunderbolt implementation is a little flaky as well
<d1b2> <johnsel> sometimes I need to shut down the system because the tb pcie host will be gone entirely
<d1b2> <johnsel> but apparently this is par for the course with TB
<d1b2> <david.rysk> with a lot of boards the TBT PCIe host will only appear when a TBT device is plugged in
<d1b2> <johnsel> that's not it
<d1b2> <david.rysk> I just returned and got an Asus board instead at the time
juri_ has joined #scopehal
juri_ has quit [Ping timeout: 276 seconds]
juri_ has joined #scopehal
<d1b2> <johnsel> @azonenberg off the top of your head where do we have large matrix-matrix mults? Other than vkFFT?
<d1b2> <johnsel> (ideally already in a shader)
<d1b2> <johnsel> mm actually doesn't matter if it's in a shader or not, just need the vulkan references
<d1b2> <azonenberg> i am not aware of any matrix mults other than the 2x2 rotation matrix used in the de-embed filter
<d1b2> <johnsel> yeah I saw that one
<d1b2> <johnsel> and
<d1b2> <johnsel> MultiplyFilter::RefreshVectorVector
<d1b2> <azonenberg> that's a pointwise 1D multiplication
<d1b2> <azonenberg> not a matrix mult
<d1b2> <johnsel> there are some other places where tensor cores may still speed up things
<d1b2> <azonenberg> i.e. out[i] = a[i] * b[i]
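That kind of pointwise operation maps onto a compute shader almost one-to-one. A minimal GLSL sketch (the buffer bindings, push constant, and workgroup size are illustrative, not the actual MultiplyFilter shader):

```glsl
#version 450
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer OutBuf { float outSamples[]; };
layout(std430, binding = 1) readonly buffer ABuf { float a[]; };
layout(std430, binding = 2) readonly buffer BBuf { float b[]; };

layout(push_constant) uniform Args { uint size; };

void main()
{
    uint i = gl_GlobalInvocationID.x;
    if(i >= size)
        return;                     // guard the partial last workgroup
    outSamples[i] = a[i] * b[i];    // the pointwise out[i] = a[i] * b[i]
}
```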
<d1b2> <azonenberg> i looked at the tensor accelerator stuff a while back and precision mostly seemed too low to be useful to us. and matrix mult is not an operation we do much of in DSP
<d1b2> <johnsel> it can be rewritten
<d1b2> <azonenberg> anyway near term i am far more interested in getting more stuff into shaders at all than in optimizing existing ones
<d1b2> <johnsel> and precision too low? it supports fp32?
<d1b2> <azonenberg> ah i might be thinking of a different extension; i was thinking of ones that were mostly int8/fp16
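For what it's worth, the Vulkan-side tensor core access is GL_KHR_cooperative_matrix (VK_KHR_cooperative_matrix on the API side), and it does expose fp32 accumulation, typically with fp16 inputs. A heavily simplified sketch of one 16x16 tile multiply-accumulate; the supported type/size combinations must be queried from the device, and the bindings and strides here are illustrative:

```glsl
#version 450
#extension GL_KHR_cooperative_matrix : require
#extension GL_KHR_memory_scope_semantics : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 32) in;   // one subgroup cooperating on a 16x16 tile

layout(std430, binding = 0) readonly buffer ABuf { float16_t a[]; };
layout(std430, binding = 1) readonly buffer BBuf { float16_t b[]; };
layout(std430, binding = 2) buffer CBuf { float c[]; };

void main()
{
    // fp16 inputs with an fp32 accumulator: the combination most tensor units offer
    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseA> matA;
    coopmat<float16_t, gl_ScopeSubgroup, 16, 16, gl_MatrixUseB> matB;
    coopmat<float, gl_ScopeSubgroup, 16, 16, gl_MatrixUseAccumulator> acc =
        coopmat<float, gl_ScopeSubgroup, 16, 16, gl_MatrixUseAccumulator>(0.0);

    coopMatLoad(matA, a, 0, 16, gl_CooperativeMatrixLayoutRowMajor);
    coopMatLoad(matB, b, 0, 16, gl_CooperativeMatrixLayoutRowMajor);

    acc = coopMatMulAdd(matA, matB, acc);   // one tile MAC on the tensor units

    coopMatStore(acc, c, 0, 16, gl_CooperativeMatrixLayoutRowMajor);
}
```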
<d1b2> <azonenberg> anyway for a lot of filter graphs the bottleneck is transferring data host-device-host and back a bunch
<d1b2> <johnsel> sure I would obviously fix both problems at once
<d1b2> <azonenberg> so minimizing that, i.e. moving trivial math onto the gpu so we dont have to go to the cpu to do it and add a round trip, is more of a priority
<d1b2> <johnsel> it would be good to collect data on which filters are slow
<d1b2> <azonenberg> Yeah performance tuning is something i will probably focus more on for v0.2
<d1b2> <azonenberg> there are some architectural things i think we could use to get a lot of boost with some rewriting of things
<d1b2> <johnsel> I do want to implement tensor cores regardless as it's my own interest
<d1b2> <azonenberg> but i want to focus on making things work for v0.1
<d1b2> <azonenberg> it's been too long, we need to just get a release out there
<d1b2> <johnsel> sure, it's not something I expect to PR soon, I just want to dig deep into Vulkan and this seems like a useful and interesting way to do so
<d1b2> <johnsel> Tensor cores also do huge calculations in a single clock, so you can draw out lots of TFLOPS
<d1b2> <johnsel> free extra speed
<d1b2> <johnsel> anyway if you have data on filter execution times/priority feel free to share that if not I will collect it myself when I continue with this
<d1b2> <azonenberg> i dont have data handy right now, i generally collected it for specific datasets to benchmark a particular filter i was optimizing
<d1b2> <azonenberg> one thing to note is that a lot of our code is memory bandwidth bound, not flops bound
<d1b2> <azonenberg> the FIR filter with large tap sizes being one of the few exceptions
<d1b2> <azonenberg> also many of the more complex protocol decodes get bottlenecked on the (not very parallelizable) clock recovery PLL filter, followed by decodes that don't use much CPU because i've reduced a string of voltages down into bits and bytes i can process in far fewer cycles
<d1b2> <azonenberg> The eye pattern is currently semi AVX but i think it might be possible to do on GPU with some effort
<d1b2> <azonenberg> that's definitely one i want to optimize
<d1b2> <azonenberg> FFT is already pretty well tuned i think
<d1b2> <azonenberg> channel emulation and de-embed are pushed pretty far i think, but there's room for a more fundamental algorithmic optimization: using overlap-add to do multiple smaller FFTs instead of one big one
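For reference, the overlap-add identity behind that: split the input $x$ into length-$L$ blocks $x_k[n] = x[n + kL]$, $0 \le n < L$; each block convolved with the $M$-tap impulse response $h$ needs only an FFT of size $N \ge L + M - 1$, and the shifted partial results (whose $(M-1)$-sample tails overlap) sum back to the full convolution:

$$y[n] = (x * h)[n] = \sum_{k} (x_k * h)[n - kL]$$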
<d1b2> <azonenberg> also for clock recovery i want to explore multi-instancing the PLL
<d1b2> <azonenberg> i.e. divide the waveform up into say 16 sub-regions and multithread the CDR
<d1b2> <azonenberg> knowing there might be a small discontinuity at the region boundaries
<d1b2> <azonenberg> this might not be suitable for SI work so it'd be an opt-in
<d1b2> <azonenberg> but would probably be good enough for protocol decoding
<d1b2> <azonenberg> it might even be possible to get good enough for SI, TBD
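A toy GLSL sketch of that multi-instancing idea, with one invocation running an independent serial loop per sub-region. The first-order PLL, region count, and loop gain are all made up for illustration; the region boundaries are exactly where the discontinuity mentioned above appears:

```glsl
#version 450
layout(local_size_x = 16) in;   // one PLL instance per sub-region, single workgroup dispatch

layout(std430, binding = 0) readonly buffer InBuf { float samples[]; };
layout(std430, binding = 1) buffer OutBuf { uint edgeCount[]; };

layout(push_constant) uniform Args
{
    uint totalLen;          // samples in the whole waveform
    float nominalPeriod;    // initial clock period estimate, in samples
};

void main()
{
    uint region = gl_GlobalInvocationID.x;
    uint regionLen = totalLen / 16u;            // remainder samples ignored for brevity
    uint start = region * regionLen;
    uint end = min(start + regionLen, totalLen);

    float period = nominalPeriod;
    float phase = 0.0;
    uint edges = 0u;

    // serial CDR over this region only; no state is shared across regions
    for(uint i = start + 1u; i < end; i ++)
    {
        phase += 1.0;
        bool crossing = (samples[i - 1u] < 0.0) != (samples[i] < 0.0);
        if(crossing)
        {
            period += 0.01 * (phase - period);  // nudge toward observed edge spacing
            phase = 0.0;
            edges ++;
        }
    }
    edgeCount[region] = edges;
}
```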
<d1b2> <azonenberg> @johnsel anyway if you want to play with shaders, top priority is probably looking at the rendering shader to see if you can track down the issue @246tnt is having
<d1b2> <azonenberg> i consider that a release blocker
Degi has quit [Ping timeout: 256 seconds]
Degi has joined #scopehal
<d1b2> <johnsel> Yes, on most platforms. I am also interested in optimizing more for NVidia's Jetson Orin modules, which have a different performance profile. They may or may not be providing some hardware for my scope prototype 🙂
<d1b2> <johnsel> Note that I haven't profiled anything yet, but based on my experience optimizing neural networks they're really worth the effort if you can rewrite the algorithms to leverage them properly. Of course PyTorch and the like already do most of the hard work for you, especially nowadays, but vectorizing and optimizing memory access patterns definitely made a lot of difference
<d1b2> <johnsel> Noted
<d1b2> <johnsel> Yeah, that is an interesting problem (re: the nco/pll). Definitely not my area of expertise and a big bottleneck for the more interesting problems
<d1b2> <johnsel> I actually think the FFT and channel emulation and de-embed would benefit most from vectorizing and parallelizing; the tensor cores output an 8x8x8 mult+acc operation in the same time, but the complex numbers make it less desirable.
<d1b2> <johnsel> if I thought I would get somewhere I would take a look but I really don't unfortunately haha
<d1b2> <246tnt> Interestingly I get a segfault when quitting 🤔
bvernoux has joined #scopehal
<d1b2> <azonenberg> i've seen that a few times related to the measurement window
<d1b2> <azonenberg> havent root caused it
<d1b2> <azonenberg> segfaults on quit are annoying but not release blocking unless they also occur when closing the session without quitting
<d1b2> <246tnt> Here I get it without even starting anything or adding any instrument, just start and quit.
<d1b2> <246tnt> Could also be a mesa bug of course, but it happens with anv and lvp (I know it doesn't run/display with lvp but it starts at least).
<d1b2> <246tnt> Ok, found the bug.
<azonenberg> woo
<azonenberg> what was it
<azonenberg> (the segfault or the shader issue?)
<_whitenotifier-3> [scopehal-apps] smunaut opened pull request #662: VulkanWindow: Releave descriptor pool after imgui cleanup - https://github.com/ngscopeclient/scopehal-apps/pull/662
<tnt> The segfault only.
<_whitenotifier-3> [scopehal-apps] azonenberg closed pull request #662: VulkanWindow: Releave descriptor pool after imgui cleanup - https://github.com/ngscopeclient/scopehal-apps/pull/662
<_whitenotifier-3> [scopehal-apps] azonenberg pushed 2 commits to master [+0/-0/±2] https://github.com/ngscopeclient/scopehal-apps/compare/ff094878d472...c8226bc062d4
<_whitenotifier-3> [scopehal-apps] smunaut 29c94d9 - VulkanWindow: Releave descriptor pool after imgui cleanup The descriptor pool needs to remain valid until we're done with ImGUI. Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
<_whitenotifier-3> [scopehal-apps] azonenberg c8226bc - Merge pull request #662 from smunaut/fix-vulkan-descriptorpool-release VulkanWindow: Releave descriptor pool after imgui cleanup
<azonenberg> Merged
<azonenberg> Yay for squishing bugs even if not that one
<_whitenotifier-3> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/ngscopeclient/scopehal-apps/compare/c8226bc062d4...ee54e41037e8
<_whitenotifier-3> [scopehal-apps] azonenberg ee54e41 - Fixed bug where m_nodeGroupMap would not be cleared when a filter graph group was deleted. Fixes #661.
<_whitenotifier-3> [scopehal-apps] azonenberg closed issue #661: Deleting a filter graph group containing nodes segfaults - https://github.com/ngscopeclient/scopehal-apps/issues/661
<d1b2> <azonenberg> and i mean we already hit one mesa bug lol
<d1b2> <azonenberg> So while i'm not saying our code is flawless don't rule out the possibility :p
<d1b2> <246tnt> I think I might have a lead. Concurrent writes to the same shared location might be UB.
<d1b2> <azonenberg> We should not be concurrently hitting the same shared location
<d1b2> <246tnt> g_done can be written by several threads when the end of the waveform is in view.
<d1b2> <azonenberg> aha
<d1b2> <azonenberg> i wonder if we might need an atomic for that
<d1b2> <246tnt> And only allowing thread local_id_y==0 seems to fix the issue ... (that's an ugly hack, but just trying to confirm the theory).
<d1b2> <azonenberg> or that
<d1b2> <azonenberg> anyway sounds promising, keep digging
<d1b2> <246tnt> Yup will do after lunch 😅
<d1b2> <azonenberg> if you do confirm that's the problem let me spend some time on a proper fix
<d1b2> <azonenberg> I'm probably the only person who properly understands this shader at this point lol
<d1b2> <246tnt> There's also some sign-related weirdness (like uint iend and then if (iend <= 0) ...)
<d1b2> <246tnt> And some signedness that doesn't match between the C++ struct and the layout in the shader.
<d1b2> <246tnt> But they don't seem to be the issue here.
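To spell out why that comparison is suspect: an unsigned value can never be negative, so the test degenerates to an equality check and any underflow wraps around instead. A hypothetical fragment (not the actual shader code):

```glsl
// hypothetical illustration of the pattern in question
void checkRange(uint istart, uint len)
{
    uint iend = istart + len - 1u;  // wraps to 0xFFFFFFFF when len == 0
    if(iend <= 0)                   // for a uint, only ever true when iend == 0
        return;

    // with a signed intermediate, the empty/underflowed case is caught as intended
    int iendSigned = int(istart) + int(len) - 1;
    if(iendSigned <= 0)
        return;
}
```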
<d1b2> <azonenberg> we should definitely verify that though
<d1b2> <azonenberg> long term we are going to need to do a lot of retooling to support >4GB (>1Gpoint) waveforms
<d1b2> <azonenberg> since vulkan doesnt let you allocate >4GB at a time
<d1b2> <azonenberg> but we're a ways from being able to handle stuff that large efficiently anyway
<d1b2> <azonenberg> (some of this stuff may be vestigial from previous shaders using opengl or opencl that had different addressing models and full 64 bit indexes etc)
<d1b2> <246tnt> So I'm pretty sure the problem is indeed the shared g_done ... but my attempts to fix it with atomic have all failed so far :/
<d1b2> <246tnt> That's my best working attempt ATM: https://pastebin.com/vfrue3Ts (with this I've so far been unable to reproduce the issue).
<d1b2> <azonenberg> If g_done is the issue then i'll spend some time poking on it later today
<d1b2> <azonenberg> i want to avoid using atomics as they're slow
<d1b2> <azonenberg> i'd rather have, for example, the first thread in the block set g_done and the others just set a local "not proceeding" variable
<d1b2> <azonenberg> it might also just be that we need a memory barrier (but this could be slow enough it's even worse)
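A sketch of that single-writer idea, assuming (importantly) that the lane with local_id_y == 0 observes the same end-of-waveform condition as the lanes it speaks for; processNextBlock() is a hypothetical stand-in for the real per-thread rendering work:

```glsl
#version 450
layout(local_size_x = 1, local_size_y = 4) in;

shared uint g_done;

// stand-in for the real work; returns true when the end of the waveform is hit
bool processNextBlock()
{
    return true;
}

void main()
{
    // single-writer initialization, then sync before first use
    if(gl_LocalInvocationIndex == 0u)
        g_done = 0u;
    barrier();

    bool proceeding = true;
    while(true)
    {
        if(proceeding)
        {
            if(processNextBlock())
            {
                proceeding = false;         // local flag, no shared write

                // only one lane publishes the flag, so the concurrent
                // same-value writes (the suspected UB) go away
                if(gl_LocalInvocationID.y == 0u)
                    g_done = 1u;
            }
        }

        // every invocation reaches these on every iteration, so the barrier is
        // legal, and the g_done write is made visible to the whole workgroup
        memoryBarrierShared();
        barrier();

        if(g_done != 0u)
            break;                          // uniform exit: all lanes read the same value
    }
}
```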
<d1b2> <246tnt> AFAIU this also should have fixed the issue ... but it doesn't .... https://pastebin.com/RhbxYSSB
<d1b2> <246tnt> So there might be something else going on ...
<d1b2> <azonenberg> Will look later, busy right now. but i'm starting to suspect its a memory ordering issue where something might not be getting flushed to all threads or something
<d1b2> <246tnt> Anyone tried to use GL_EXT_debug_printf? Wondering how to use it with mesa ...
<azonenberg> i have used the vulkan version, yes. you need to use the vulkan validation layer to access the print messages
<azonenberg> they go to stdout once you enable it in vkconfig
<azonenberg> under "gpu base" select "debug printf"
<azonenberg> tnt: i'm getting vulkan validation errors related to your last PR btw
<azonenberg> need to investigate
<d1b2> <246tnt> Mmm, interesting. I get no error. I did get some errors before the PR (that's how I narrowed down the issue) but now the only one I get is vkCreateSwapchainKHR() called with minImageCount = 2
bvernoux has quit [Read error: Connection reset by peer]
josuah has joined #scopehal