<d1b2>
<azonenberg> Yeah so the two things i have planned are a) lighting up the second fiber on the xen box and b) lighting up the second fibers on the storage nodes so i can have one set of ports for client accesses while replication uses a second one
<d1b2>
<azonenberg> right now replication competes with client accesses for a single 10G pipe
<d1b2>
<azonenberg> that's a pointwise 1D multiplication
<d1b2>
<azonenberg> not a matrix mult
<d1b2>
<johnsel> there are some other places where tensor cores may still speed up things
<d1b2>
<azonenberg> i.e. out[i] = a[i] * b[i]
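For concreteness, a rough CPU-side sketch of that operation (hypothetical function name, not the actual filter code): every element is independent, so on the GPU it maps to one shader invocation per index, with no matrix structure for a tensor unit to exploit.

    #include <cstddef>

    // Pointwise 1D multiply: one independent multiply per element, O(n) work.
    // On the GPU this becomes one shader invocation per index i.
    void PointwiseMultiply(const float* a, const float* b, float* out, size_t n)
    {
        for(size_t i = 0; i < n; i++)
            out[i] = a[i] * b[i];
    }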
<d1b2>
<azonenberg> i looked at the tensor accelerator stuff a while back and precision mostly seemed too low to be useful to us. and matrix mult is not an operation we do much of in DSP
<d1b2>
<johnsel> it can be rewritten
<d1b2>
<azonenberg> anyway near term i am far more interested in getting more stuff into shaders at all than in optimizing existing ones
<d1b2>
<johnsel> and precision too low? it supports fp32?
<d1b2>
<azonenberg> ah i might be thinking of a different extension, i was thinking of ones that were mostly like int8/fp16
<d1b2>
<azonenberg> anyway for a lot of filter graphs the bottleneck is transferring data host-device-host and back a bunch
<d1b2>
<johnsel> sure I would obviously fix both problems at once
<d1b2>
<azonenberg> so minimizing that, i.e. moving trivial math onto the gpu so we dont have to go to the cpu to do it and add a round trip, is more of a priority
<d1b2>
<johnsel> it would be good to collect data on which filters are slow
<d1b2>
<azonenberg> Yeah performance tuning is something i will probably focus more on for v0.2
<d1b2>
<azonenberg> there are some architectural things i think we could use to get a lot of boost with some rewriting of things
<d1b2>
<johnsel> I do want to implement tensor cores regardless as it's my own interest
<d1b2>
<azonenberg> but i want to focus on making things work for v0.1
<d1b2>
<azonenberg> it's been too long, we need to just get a release out there
<d1b2>
<johnsel> sure, it's not something I expect to PR soon, I just want to dig deep into Vulkan and this seems like a useful and interesting way to do so
<d1b2>
<johnsel> Tensor cores also do huge calculations in a single clock so you can draw out lots of TFlops
<d1b2>
<johnsel> anyway if you have data on filter execution times/priority feel free to share it; if not I will collect it myself when I continue with this
<d1b2>
<azonenberg> i dont have data handy right now, i generally collected it for specific datasets to benchmark a particular filter i was optimizing
<d1b2>
<azonenberg> one thing to note is that a lot of our code is memory bandwidth bound, not flops bound
<d1b2>
<azonenberg> the FIR filter with large tap sizes being one of the few exceptions
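As a rough illustration of why the FIR case ends up compute bound (a generic direct-form FIR, not the scopehal implementation): each output sample costs one multiply-accumulate per tap, so arithmetic grows with the tap count while the amount of data read stays roughly constant.

    #include <cstddef>
    #include <vector>

    // Generic direct-form FIR: ntaps multiply-accumulates per output sample,
    // so large tap counts make the filter flops-bound rather than bandwidth-bound.
    void FirFilter(const std::vector<float>& in, const std::vector<float>& taps,
                   std::vector<float>& out)
    {
        size_t ntaps = taps.size();
        out.assign(in.size() >= ntaps ? in.size() - ntaps + 1 : 0, 0.0f);
        for(size_t i = 0; i < out.size(); i++)
        {
            float acc = 0;
            for(size_t j = 0; j < ntaps; j++)
                acc += taps[j] * in[i + j];
            out[i] = acc;
        }
    }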
<d1b2>
<azonenberg> also many of the more complex protocol decodes get bottlenecked on the (not very parallelizable) clock recovery PLL filter, followed by decodes that dont use much CPU because i've reduced a string of voltages down into bits and bytes i can process in far fewer cycles
<d1b2>
<azonenberg> The eye pattern is currently semi AVX but i think it might be possible to do on GPU with some effort
<d1b2>
<azonenberg> that's definitely one i want to optimize
<d1b2>
<azonenberg> FFT is already pretty well tuned i think
<d1b2>
<azonenberg> channel emulation and de-embed i think are pushed pretty far, however there is room to do a more fundamental algorithmic optimization by using overlap-add to do multiple smaller FFTs instead of one big one
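A sketch of the overlap-add idea, assuming hypothetical fft()/ifft() helpers standing in for whatever FFT backend is used (not the existing de-embed code): the record is filtered in blocks of length L with FFTs of size L + M - 1, and the overlapping tails of each block are summed.

    #include <algorithm>
    #include <complex>
    #include <cstddef>
    #include <vector>

    // Hypothetical helpers assumed for this sketch (e.g. wrapping an FFT library);
    // fft() is assumed to zero-pad its input to length n.
    std::vector<std::complex<float>> fft(const std::vector<float>& x, size_t n);
    std::vector<float> ifft(const std::vector<std::complex<float>>& X, size_t n);

    // Overlap-add convolution: many small FFTs of size N = L + M - 1 instead of
    // one FFT over the whole record, summing the overlapping tails of each block.
    std::vector<float> OverlapAddFilter(const std::vector<float>& signal,
                                        const std::vector<float>& kernel, size_t L)
    {
        const size_t M = kernel.size();
        const size_t N = L + M - 1;
        const auto K = fft(kernel, N);                  // kernel spectrum, computed once

        std::vector<float> out(signal.size() + M - 1, 0.0f);
        for(size_t start = 0; start < signal.size(); start += L)
        {
            size_t len = std::min(L, signal.size() - start);
            std::vector<float> block(signal.begin() + start, signal.begin() + start + len);
            auto B = fft(block, N);
            for(size_t k = 0; k < B.size(); k++)
                B[k] *= K[k];                           // pointwise multiply in frequency domain
            auto y = ifft(B, N);
            for(size_t i = 0; i < N && start + i < out.size(); i++)
                out[start + i] += y[i];                 // add the overlapping tail
        }
        return out;
    }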
<d1b2>
<azonenberg> also for clock recovery i want to explore multi-instancing the PLL
<d1b2>
<azonenberg> i.e. divide the waveform up into say 16 sub-regions and multithread the CDR
<d1b2>
<azonenberg> knowing there might be a small discontinuity at the region boundaries
<d1b2>
<azonenberg> this might not be suitable for SI work so it'd be an opt-in
<d1b2>
<azonenberg> but would probably be good enough for protocol decoding
<d1b2>
<azonenberg> it might even be possible to get good enough for SI, TBD
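A minimal sketch of that multi-instanced CDR idea, assuming a hypothetical serial RecoverClock() helper rather than the real scopehal CDR filter: split the waveform into regions, run each on its own thread, and accept possible jitter where regions meet.

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Hypothetical types/helper for this sketch; not the actual scopehal CDR API.
    struct Edge { double timestamp; };
    std::vector<Edge> RecoverClock(const std::vector<float>& samples,
                                   size_t start, size_t end);    // serial PLL-based CDR

    // Run the serial CDR on N sub-regions in parallel and concatenate the results,
    // accepting a small phase discontinuity at each region boundary.
    std::vector<Edge> ParallelCDR(const std::vector<float>& samples, size_t nRegions = 16)
    {
        size_t chunk = (samples.size() + nRegions - 1) / nRegions;
        std::vector<std::future<std::vector<Edge>>> jobs;
        for(size_t r = 0; r < nRegions; r++)
        {
            size_t start = r * chunk;
            size_t end = std::min(samples.size(), start + chunk);
            if(start >= end)
                break;
            jobs.push_back(std::async(std::launch::async,
                [&samples, start, end] { return RecoverClock(samples, start, end); }));
        }

        std::vector<Edge> edges;
        for(auto& j : jobs)
        {
            auto part = j.get();
            edges.insert(edges.end(), part.begin(), part.end());  // boundary edges may jitter
        }
        return edges;
    }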
<d1b2>
<azonenberg> @johnsel anyway if you want to play with shaders, top priority is probably looking at the rendering shader to see if you can track down the issue @246tnt is having
<d1b2>
<azonenberg> i consider that a release blocker
Degi has quit [Ping timeout: 256 seconds]
Degi has joined #scopehal
<d1b2>
<johnsel> Yes, on most platforms. I am also interested in optimizing more for NVidia's Jetson Orin modules, which have a different performance profile. They may or may not be providing some hardware for my scope prototype 🙂
<d1b2>
<johnsel> Note that I haven't profiled anything yet, but based on my experience optimizing neural networks they're really worth the effort if you can rewrite the algorithms to leverage them properly. Of course PyTorch and the like already do most of the hard work for you, especially nowadays, but vectorizing and optimizing memory access patterns definitely made a lot of difference
<d1b2>
<johnsel> Noted
<d1b2>
<johnsel> Yeah, that is an interesting problem (re: the nco/pll). Definitely not my area of expertise and a big bottleneck for the more interesting problems
<d1b2>
<johnsel> I actually think the FFT, channel emulation, and de-embed would benefit most from vectorizing and parallelizing; the tensor cores output an 8x8x8 mult+acc operation in the same time, but the complex numbers make it less desirable.
<d1b2>
<johnsel> if I thought I would get somewhere I would take a look but I really don't unfortunately haha
<d1b2>
<246tnt> Interestingly I get a segfault when quitting 🤔
bvernoux has joined #scopehal
<d1b2>
<azonenberg> i've seen that a few times related to the measurement window
<d1b2>
<azonenberg> havent root caused it
<d1b2>
<azonenberg> segfaults on quit are annoying but not release blocking unless they also occur when closing the session without quitting
<d1b2>
<246tnt> Here I have it without even starting or adding any instrument, just start and quit.
<d1b2>
<246tnt> Could also be a mesa bug of course, but it happens with anv and lvp (I know it doesn't run/display with lvp but it starts at least).
<_whitenotifier-3>
[scopehal-apps] smunaut 29c94d9 - VulkanWindow: Releave descriptor pool after imgui cleanup The descriptor pool needs to remain valid until we're done with ImGUI. Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
<_whitenotifier-3>
[scopehal-apps] azonenberg c8226bc - Merge pull request #662 from smunaut/fix-vulkan-descriptorpool-release VulkanWindow: Releave descriptor pool after imgui cleanup
<azonenberg>
Merged
<azonenberg>
Yay for squishing bugs even if not that one
<_whitenotifier-3>
[scopehal-apps] azonenberg ee54e41 - Fixed bug where m_nodeGroupMap would not be cleared when a filter graph group was deleted. Fixes #661.
<d1b2>
<azonenberg> if you do confirm that's the problem let me spend some time on a proper fix
<d1b2>
<azonenberg> I'm probably the only person who properly understands this shader at this point lol
<d1b2>
<246tnt> There's also some sign-related weirdness (like uint iend and then if (iend <= 0) ...)
<d1b2>
<246tnt> And some signedness that doesn't match between the cpp struct and the layout in the shader.
<d1b2>
<246tnt> But they don't seem to be the issue here.
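For illustration only (a made-up snippet, not the actual shader code; GLSL has the same semantics as C here), this is the kind of thing the unsigned comparison can hide:

    #include <cstdint>

    // With an unsigned index, "iend <= 0" can only ever mean "iend == 0", and any
    // computation that should have gone negative has already wrapped to a huge value.
    void Example(uint32_t start, uint32_t len)
    {
        uint32_t iend = start - len;   // wraps around instead of going negative when len > start
        if(iend <= 0)                  // effectively iend == 0; the wrapped case slips through
            return;
        // ... a loop bounded by iend would then run (almost) 2^32 times in the wrapped case
    }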
<d1b2>
<azonenberg> we should definitely verify that though
<d1b2>
<azonenberg> long term we are going to need to do a lot of retooling to support >4GB (>1Gpoint) waveforms
<d1b2>
<azonenberg> since vulkan doesnt let you allocate >4GB at a time
<d1b2>
<azonenberg> but we're a ways from being able to handle stuff that large efficiently anyway
<d1b2>
<azonenberg> (some of this stuff may be vestigial from previous shaders using opengl or opencl that had different addressing models and full 64 bit indexes etc)
<d1b2>
<246tnt> So I'm pretty sure the problem is indeed the shared g_done ... but my attempts to fix it with atomic have all failed so far :/
<d1b2>
<246tnt> That's my best working attempt ATM: https://pastebin.com/vfrue3Ts - with this I've so far been unable to reproduce the issue.
<d1b2>
<azonenberg> If g_done is the issue then i'll spend some time poking on it later today
<d1b2>
<azonenberg> i want to avoid using atomics as they're slow
<d1b2>
<azonenberg> i'd rather have, for example, the first thread in the block set g_done and the others just set a local "not proceeding" variable
<d1b2>
<azonenberg> it might also just be that we need a memory barrier (but this could be slow enough its even worse)
<d1b2>
<246tnt> So there might be something else going on ...
<d1b2>
<azonenberg> Will look later, busy right now. but i'm starting to suspect it's a memory ordering issue where something might not be getting flushed to all threads or something
<d1b2>
<246tnt> Anyone tried to use GL_EXT_debug_printf ? Wondering how to use it with mesa ...
<azonenberg>
i have used the vulkan version, yes. you need to use the vulkan validation layer to access the print messages
<azonenberg>
they go to stdout once you enable it in vkconfig
<azonenberg>
under "gpu base" select "debug printf"
<azonenberg>
tnt: i'm getting vulkan validation errors related to your last PR btw
<azonenberg>
need to investigate
<d1b2>
<246tnt> Mmm, interesting. I get no error. I did get some errors before the PR (that's how I narrowed down the issue) but now the only one I get is vkCreateSwapchainKHR() called with minImageCount = 2
bvernoux has quit [Read error: Connection reset by peer]