azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/glscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
Degi has quit [Ping timeout: 244 seconds]
Degi has joined #scopehal
<azonenberg> So I'm thinking about porting the "threshold" filter to GPU
<azonenberg> without hysteresis, that's trivial
<azonenberg> With hysteresis, it gets more fun
<d1b2> <johnsel> cuz that ain't parallel
<azonenberg> Not trivially parallel, no. But I have an idea for a three-pass algorithm that is likely still faster than a serial one for the common case of a large waveform with millions of points
<azonenberg> my tentative thought is to split the waveform into a fairly small number of blocks, say 1024 or something. enough to get a decent number of GPU threads
<azonenberg> for each thread, scan your block from the beginning. If a sample is between the low and high hysteresis thresholds, the previous state is unknown, so don't generate any output
<azonenberg> once you get below the low or above the high threshold, you're in a well defined state and can threshold normally until you reach the end of your block
<azonenberg> Record where in your block you actually started thresholding. Now you have a list of gaps (the unresolved start of each block)
<d1b2> <johnsel> I'm not sure how you process the waveforms right now, do you process the entire dataset or just incoming data?
<azonenberg> Filters are all invoked on waveform objects, where a waveform corresponds to the data the scope returned for a single trigger event
<azonenberg> anyway... next, make a second pass. Most blocks should have a well defined state at the end, so the subsequent block now has a high/low state at sample -1 and you can finish thresholding the start of that block
<azonenberg> finally, you're left with any remaining gaps if there were entire blocks where the signal did not toggle at all
<azonenberg> for this, either revert to CPU or run a single shader thread to do a linear sweep
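(A minimal CPU-side sketch of the three-pass scheme described above, under the assumption that each "block" maps to one GPU thread; the real version would be a Vulkan compute shader, and all names here are illustrative rather than actual scopehal code.)

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative CPU model of the three-pass hysteresis threshold sketched above.
// Each "block" corresponds to one GPU thread's slice of the waveform.
// Output convention: 0/1 = resolved digital state, -1 = not yet resolved.
void HysteresisThreshold(
    const std::vector<float>& samples,
    std::vector<int8_t>& out,
    float lo, float hi,
    size_t blockSize)
{
    size_t n = samples.size();
    out.assign(n, -1);
    std::vector<int8_t> endState((n + blockSize - 1) / blockSize, -1);

    // Pass 1: each block scans forward. Until the signal goes below lo or above
    // hi, the state is unknown and no output is generated; after that the block
    // thresholds normally to its end, and records its final state.
    for(size_t b = 0; b * blockSize < n; b++)
    {
        int8_t state = -1;
        size_t start = b * blockSize;
        size_t end = std::min(start + blockSize, n);
        for(size_t i = start; i < end; i++)
        {
            if(samples[i] <= lo)
                state = 0;
            else if(samples[i] >= hi)
                state = 1;
            if(state >= 0)
                out[i] = state;
        }
        endState[b] = state;
    }

    // Pass 2: the unresolved prefix of each block is filled from the previous
    // block's end state, when that state is known.
    for(size_t b = 1; b * blockSize < n; b++)
    {
        int8_t state = endState[b - 1];
        size_t start = b * blockSize;
        size_t end = std::min(start + blockSize, n);
        for(size_t i = start; (i < end) && (out[i] < 0) && (state >= 0); i++)
            out[i] = state;
    }

    // Pass 3 (not shown): samples still marked -1 sit in runs of whole blocks
    // that never toggled; resolve them with a single serial sweep (CPU or one
    // shader thread), as described above.
}
```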
<azonenberg> anyway, i'm building pipeline caching first before i do that
<azonenberg> johnsel: this is the tricky scenario of parallel computing
<azonenberg> when your problem is not embarrassingly parallel (zero dependencies between tasks done to different data elements)
<azonenberg> but still has enough parallelism that you can get a significant speedup vs serial for most cases if you carefully design your algorithm
<d1b2> <johnsel> yeah and we think about it a lot differently, I would build something that would work on a stream because I see all the overhead limits getting removed in a few years
<d1b2> <johnsel> but we'd end up in a whole other discussion then
<d1b2> <johnsel> anyway one thing that might be useful is you can probably throw away a lot of samples regardless
<d1b2> <johnsel> though that still depends on how much control you have (or take) right now over distribution and sampling
<d1b2> <johnsel> although presumably the threshold filter does more than the most basic thresholding
<d1b2> <johnsel> I really should take the time to actually work through the code
<azonenberg> The threshold filter right now has two options: single level (embarrassingly parallel) and hysteresis (not so, but likely still parallelizable for typical input)
<azonenberg> Everything works on waveforms, you can stream by having waveforms represent consecutive blocks of samples if you want
<azonenberg> but i do not see that going away any time soon
<azonenberg> To give you an idea, right now we struggle to keep up with ~1.5 Gbps of data. I think we can get up a fair bit past there if we leave GTK
<azonenberg> a single AD9213 ADC as planned for one of my future scopes puts out 10 Gsps * 12 bits = 120 Gbps of data *per channel*
<azonenberg> times a four channel scope is almost half a Tbps of waveform data
<d1b2> <johnsel> Sure, but whether you are working on small blocks of data or large ones is a pretty basic assumption
<azonenberg> and that's not even particularly high sample rate as scopes go. My lecroy scopes can do 40 Gsps
<d1b2> <johnsel> and I'm somewhat familiar with large datastreams, the last product I worked on dealt with 6 2D + 6 3D cameras
<azonenberg> And so when the scopes are spitting out data 3+ orders of magnitude faster than we can process it, i do not see continuous streaming being viable any time soon, if ever. maybe in 20 years we'll be able to process a few hundred Gbps of waveform, but then the scopes will be faster
<d1b2> <johnsel> though I did not touch most of those algorithms
<azonenberg> that's so much less data it's not even funny, especially if compressed
<azonenberg> a 4K RGB24 image is 3840 * 2160 * 24 bits = 199 Mbits, at 60 FPS that's about 11 Gbps of data. so the sum of all your cameras in 4Kp60 uncompressed is maybe about as much data as a single decent scope channel
<azonenberg> if they're 1080p30 it's not even close
<azonenberg> for comparison a single large scope waveform in my benchmark dataset is 512M samples * 16 bits (padded) = 8 Gbits of data for a single waveform in native ADC format, or 16 Gbits in fp32
<d1b2> <johnsel> they were 1080p, but they were processed into point clouds; regardless, it's still in the same order of magnitude as what you are choking on right now
<azonenberg> now to be fair the scope is not capable of spitting them out at 60 FPS
<azonenberg> and yeah. same OOM. And like i said our limiting factor right now is mostly the GUI actually, for simple datasets
<azonenberg> i think we can push a fair bit faster once we leave GTK. we're looking at like 30% GPU load for a lot of my benchmarks
<d1b2> <johnsel> that's still a long way off from where you want to be though
<azonenberg> Yes, agreed
<azonenberg> Which is why we'll need to optimize. And perhaps also explore multi GPU
<d1b2> <johnsel> and since you wouldn't only want to be processing a single trigger event, you'd have to stop re-processing the same data at some point
<d1b2> <johnsel> or it's impossible
<d1b2> <johnsel> and I never claimed to be an expert with all the solutions, just some familiarity with the types of issues you run into
<azonenberg> Yeah. There are definitely limits with how well some stuff can parallelize
<azonenberg> But the good news is, the less parallel stuff tends to be higher level data
<d1b2> <johnsel> especially if you keep doing it over and over and over
<azonenberg> which is working on digital bits and not raw analog samples
<azonenberg> so your data rate is lower
<azonenberg> Which is why i am particularly interested in a way to do parallel clock recovery so that I can sample and threshold the analog waveforms to a vector of digital bits as quickly as i possibly can
<azonenberg> and then descramble those bits to e.g. 8b10b symbols
<azonenberg> and now all of a sudden we're working with much smaller quantities of data
<d1b2> <johnsel> Anyway the viewpoint why I think it's potentially more worthwhile than it might seem is because there is lots of movement in direct DMA from ethernet to GPU (w/ decompression) as well as the CXL which is basically 'high speed memory access for everyone (including FPGAs) on the same system/motherboard'
<d1b2> <johnsel> specifically the 3.0 version of CXL w/ memory sharing with cache coherency
<d1b2> <johnsel> that is not necessarily an argument for changing your approach now, but those are the current forward movements on the HPC markets and consumer GPU
<mxshift> Direct DMA between NIC<->GPU is definitely making its way to more general use. Still only helps if your data is sourced or sinked off-machine.
<d1b2> <johnsel> there are lots of reasons why it might not help you and you will slow down, hence why you really want a processing pipeline that can process large and small chunks, and decide what is best to use when. But that's a fundamental architectural decision to either go for the approach that you want as large as possible waveforms and process them in one go and present a varying window into that processed memory (which is much simpler, for sure), or
<d1b2> that you decouple them more and try to build something leveraging everything you can to never process things twice, compression for those times you do put things over IO, etc
<d1b2> <johnsel> the IO overhead is big, so it's really not an unreasonable approach to just say let's minimize that and try to parallelize as much after that
<d1b2> <johnsel> but it's not optimal, for sure
<d1b2> <johnsel> especially once more and more techniques become 'free' from the HPC and Gaming end of things where they run into the same problem
<d1b2> <johnsel> the M1 will be a very interesting case to see how much effect IO can have with its shared memory
<d1b2> <johnsel> the PS5 (Ratchet & Clank) is also an interesting showcase of what super fast firehosing data can do for you
<d1b2> <johnsel> and those things will definitely be far more commonplace in say 5 years
<d1b2> <johnsel> and there are many non hw-techniques like using data compression to minimize the impact of IO overhead
<d1b2> <johnsel> I mean 32 or 64 GT = 32 or 64 GT
<d1b2> <johnsel> GPU memory is 1 terabyte/s right now
<d1b2> <johnsel> which is an order of magnitude and then some still
<d1b2> <johnsel> but it's much simpler to buy a terabyte of RAM than to buy a terabyte of GPU mem
<d1b2> <johnsel> and if you go multi-gpu, then fuck that PCIe bus is in there anyway
<d1b2> <johnsel> so you just will have to deal with it anyway
<d1b2> <johnsel> and yes there are multi-gpu methods that use high speed interconnects, but that is precisely where this CXL is competing against, without vendor lock-in and wait lists and good luck using that under Vulkan
<mxshift> Always comes back to latency and throughput, and measuring IO size and placement to hit the throughput and latency you want/need
<d1b2> <johnsel> well we want as much for as little as possible, that's a given :p
<mxshift> Streaming data into GPUs can work but it's a very different case with different characteristics from operating entirely in GDDR
<d1b2> <johnsel> but there's also the reality that this data inherently arrives to us as a stream; chunking it up into a couple of GBs and then processing it has already traded off a lot of latency
<d1b2> <johnsel> yeah it's very different, but since we're revisiting basic architecture, now is definitely the time to think about how strongly we feel about that decision being the right one
<mxshift> Nothing about GPU behavior is stream based until you get very high level
<d1b2> <johnsel> I disagree, nothing about GPU behavior is frame or chunk based until you get very high level
<d1b2> <johnsel> vulkan doesn't even have frames anymore, you even have to define those yourself
<d1b2> <johnsel> anyway I'm happy to defer, but I'm more happy to pit the approaches against each other in the new grassroots app discussions when it's still cheap to decide
<azonenberg> The block based waveform processing architecture is extremely fundamental to how all of the filter graph blocks etc are built
<azonenberg> If we ever add streaming, it would likely be just a hint to the block based filters
<azonenberg> that says "hey FYI, the next waveform happened right after the last one so it's OK to preserve state across them"
<azonenberg> but the other thing to consider is, very often you want to go back in time and look at older historical data etc
<azonenberg> which really makes more sense in a block based flow than a stream based one
<d1b2> <johnsel> Sure I wouldn't know how to do it differently than with blocks
<d1b2> <johnsel> I don't mean a literal C++ stream object
<d1b2> <johnsel> just the fundamental code style of thinking in streams
<d1b2> <johnsel> I mean not even code style, the sw architecture
<d1b2> <johnsel> I think you can simplify it to: do you expect large blocks that almost never move, or small blocks that move around as little as possible
<d1b2> <johnsel> and do you expect a single processing 'stream' or potentially multiple with different local state
<azonenberg> So first off, it's not a stream. it's a filter graph
<azonenberg> very often one processing block will have output consumed by multiple downstream blocks
<azonenberg> it's a DAG not a linear chain
<azonenberg> There is potential for optimization if I have two consecutive filter blocks that *do not* render the intermediate result on screen to skip a redundant memory save/load
<azonenberg> but actually concatenating the shaders would be nontrivial
<d1b2> <johnsel> I don't think what I mean is getting across and I'm not sure I can formulate it better right now. We can chalk it up to my lack of communication skills. The filter graph is an obvious given
<mxshift> It's less about streaming vs blocks and more about data movement. Getting data in and out of GPU RAM is far from free and can be done in ways that optimize either for latency or for throughput.
<azonenberg> Yeah. What I'm trying to do as much as possible is push raw ADC codes from the scope's socket into the GPU
<azonenberg> convert to fp32 on the GPU, do as much as possible of the heavy lifting there (even if slower than CPU, to avoid a transfer step)
<azonenberg> render on the GPU
<azonenberg> and only pull off to the CPU if i have a complex filter block that doesn't parallelize usefully
<mxshift> But there is always a minimal amount of processing from the socket to get it into a form for input into the filter graph
<d1b2> <johnsel> now we're getting there mxshift
<d1b2> <johnsel> and GPU memory is stupid expensive when all you do is present a few lines out of those GBs
<mxshift> So NIC->GPU doesn't help unless you do that scope driver work on the GPU
<azonenberg> johnsel: yeah but we're not doing that. the rendering shader touches every sample on screen to do intensity grading
<d1b2> <johnsel> it's just a really simple memory management strategy, I think we can agree to that no?
<d1b2> <johnsel> but you don't have to do that more than once
<azonenberg> we're not even using opengl/vulkan polygon geometry for the rendering. it's a custom rasterizer
<d1b2> <johnsel> you can cache that
<azonenberg> cache?? every frame it has to be redrawn. even if you're just panning the image, samples are coming on/off screen and moving from one pixel to another
<azonenberg> if zooming, more so
<azonenberg> and of course if new waveform data is coming in then you have to update the data in memory with that
<d1b2> <johnsel> you can cache the output
<d1b2> <johnsel> of your rasterizer
<d1b2> <johnsel> in the simplest form
<azonenberg> We only call the compute shader if the image changes
<d1b2> <johnsel> if it has been visible once
<azonenberg> i.e. if the data has changed or we've changed zoom or position
<d1b2> <johnsel> right and all the rest of the time you are wasting GPU memory holding data you don't need
<d1b2> <johnsel> because sometimes it might change
CobaltBlue has joined #scopehal
<d1b2> <johnsel> when for many of those changes you don't need all the data anyway
<d1b2> <johnsel> and if you do, you still don't need it all the rest of the time
<azonenberg> the primary case we're optimizing for is lots of data flowing through fairly rapidly
<d1b2> <johnsel> now it's not that big of a deal, but one approach scales, the other doesn't
<azonenberg> where each waveform gets processed, displayed, and then is moved into history
<d1b2> <johnsel> exactly, so why do you keep that data at hand?
<d1b2> <johnsel> surely you can keep the last X samples only
<azonenberg> We do that already, to an extent. i'm tweaking cache settings
<d1b2> <johnsel> sure X billion
<azonenberg> but essentially the current waveform and filters processing it etc live in GPU memory
<d1b2> <johnsel> right so it becomes a 'where do you put the responsibility'
<azonenberg> and we push anything older out to history which is just normal CPU RAM by default (or pinned memory)
<azonenberg> or even memory mapped files
<mxshift> That's a forward flow process through the filter graph. You _can_ do a backwards demand scheme which can reduce how much data needs to be touched with lots of handwaving
<azonenberg> We have support for all of the above and i'm still fine tuning some optimization settings to figure out exactly when to process things
<d1b2> <johnsel> but that is handled at the driver level right?
<d1b2> <johnsel> the scope driver I mean
<azonenberg> no, scope drivers just spit out waveforms every trigger
<d1b2> <johnsel> and then everything flows off of that
<azonenberg> the filter graph is managed way downstream and may consume output from multiple scope drivers, files, waveform creation filters, etc
<azonenberg> every time a new set of waveforms has arrived from all connected instruments, the graph is re-evaluated
<d1b2> <johnsel> alright well perhaps we're nearer than it seemed then
<azonenberg> Right now with the WIP vulkan rewrite, we do keep some historical waveforms in GPU memory
<azonenberg> this is just a matter of calling SetUsageHint() on the buffer at the appropriate time when pushing it into history
<azonenberg> we also have memory pools to reuse buffers that scrolled off history, because vulkan memory allocation is expensive
<azonenberg> the challenge is not bloating too much GPU memory usage, but also not spending too much time in memcpy's and reallocations
<azonenberg> it's a difficult balance and one i'm still tuning
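(A toy sketch of the buffer-recycling idea just described; SetUsageHint() is mentioned above, but the pool class and everything else here is made up for illustration and is not the actual scopehal AcceleratorBuffer machinery.)

```cpp
#include <cstddef>
#include <map>

// Hypothetical sketch of the "recycle buffers that scrolled off history" idea.
// BufferHandle stands in for whatever wraps the VkBuffer + device memory; none
// of these names are the actual scopehal types or the real SetUsageHint() API.
struct BufferHandle
{
    size_t size = 0;
    // VkBuffer, VkDeviceMemory, usage hint, ... in a real implementation
};

class WaveformBufferPool
{
public:
    // Grab a buffer of at least `size` bytes, reusing a retired one if possible,
    // since fresh Vulkan allocations are expensive.
    BufferHandle Acquire(size_t size)
    {
        auto it = m_free.lower_bound(size);
        if(it != m_free.end())
        {
            BufferHandle buf = it->second;
            m_free.erase(it);
            return buf;
        }
        return AllocateNew(size);
    }

    // Return a buffer whose waveform just scrolled off history. A real pool
    // would also cap total pooled bytes so GPU memory usage doesn't bloat,
    // which is exactly the balance being tuned above.
    void Release(const BufferHandle& buf)
    {
        m_free.emplace(buf.size, buf);
    }

private:
    BufferHandle AllocateNew(size_t size)
    {
        // vkCreateBuffer / vkAllocateMemory / vkBindBufferMemory would go here.
        BufferHandle buf;
        buf.size = size;
        return buf;
    }

    std::multimap<size_t, BufferHandle> m_free;
};
```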
<d1b2> <johnsel> yeah but not if you would want to leverage e.g. RDMA
<d1b2> <johnsel> so all I would say is let us not forget those things
<d1b2> <johnsel> and also look at compressed data transfers
<d1b2> <johnsel> GPUs can decompress certain formats suuuper fast
<azonenberg> We're currently limited to the formats that the scope is able to generate
<azonenberg> Their APIs and interchange formats are often not optimized for gigabit streaming
<d1b2> <johnsel> not when you move it around from GPU to CPU
<azonenberg> We need random access to the waveform data and I am highly skeptical any kind of lossless compression would be faster than just reading raw float[]'s
<d1b2> <johnsel> hmm yes lossless....
<azonenberg> if anything, what might be worth looking at is using fp16 for cases where the underlying precision of the waveform source (ADC bit depth etc) is such that fp32 is more precision than we need
<azonenberg> but fp32 is a lot nicer wrt not losing precision at intermediate processing steps, with fp16 we'd have to pay a lot more attention to numerical stability and rounding error etc
<d1b2> <johnsel> there are huge performance implications though
<d1b2> <johnsel> though I even think they are worse for fp16
<azonenberg> for going to fp16? yes, that is definitely on the longer term list of things to explore
<azonenberg> fp16 is hugely faster, that's why it exists
<d1b2> <johnsel> or it's 64
<azonenberg> fp64 is slow especially on consumer GPUs. it's mostly used in scientific applications and the vendors deliberately cripple fp64 performance on consumer GPUs so you have to buy quadro/tesla cards
<azonenberg> for that work
<d1b2> <johnsel> at this point I"m talking out of my ass
<azonenberg> fp16 is used all over the place in games and ML
<d1b2> <johnsel> I just remember something being weird about it
<azonenberg> it's a question of doing the numerical analysis to determine if we can get away with it and get acceptable accuracy
<azonenberg> the key thing is, fp16 halves the memory requirements
<d1b2> <johnsel> on that note are you aware of the funky AI formats?
<azonenberg> half the VRAM and, more importantly, half the bandwidth
<azonenberg> fp16 would make sense, i think, for data coming from an 8 bit scope
<azonenberg> but not for anything bigger
<azonenberg> at this point i dont want to put in the effort as we have bigger issues
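(A quick way to sanity-check the "8 bit yes, anything bigger not quite" intuition above: an integer ADC code survives a round trip through IEEE fp16 only if it fits in the 11-bit significand. The helper below is illustrative, not project code.)

```cpp
#include <cstdio>

// IEEE fp16 has an 11-bit significand (10 stored bits plus the implicit one),
// so an integer ADC code is stored exactly only if, after stripping trailing
// zero bits, it is below 2048. Range (max 65504) is not the limit for ADC codes.
static bool ExactInFp16(unsigned code)
{
    if(code == 0)
        return true;
    while((code & 1) == 0)
        code >>= 1;
    return code < 2048;
}

int main()
{
    std::printf("8-bit full scale  (255):  %s\n", ExactInFp16(255) ? "exact" : "rounded");
    std::printf("12-bit full scale (4095): %s\n", ExactInFp16(4095) ? "exact" : "rounded");
    // 255 is exact; 4095 is not, so 12-bit codes can already lose LSBs in fp16.
    return 0;
}
```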
<d1b2> <johnsel> I was thinking about bfloat16
<_whitenotifier-7> [scopehal] azonenberg pushed 2 commits to master [+2/-0/±2] https://github.com/glscopeclient/scopehal/compare/dbb5b158487a...531a9ac99e0b
<_whitenotifier-7> [scopehal] azonenberg 27e105f - Initial skeleton of PipelineCacheManager. Doesn't do anything useful yet.
<_whitenotifier-7> [scopehal] azonenberg 531a9ac - ComputePipeline: lazy initialization first time pipeline is bound. Fixes #678.
<_whitenotifier-7> [scopehal] azonenberg closed issue #678: ComputePipeline: lazy initialization - https://github.com/glscopeclient/scopehal/issues/678
<d1b2> <johnsel> not relevant though, but interesting nonetheless
<d1b2> <johnsel> or perhaps relevant but not useful I think
<d1b2> <johnsel> do you plan to implement tensor core support?
<d1b2> <johnsel> I'm sure I know an algorithm that will benefit from that
<d1b2> <johnsel> Peak FP64 Tensor Core: 19.5 TFLOPS | Peak FP32: 19.5 TFLOPS | Peak FP16: 78 TFLOPS | Peak FP16 Tensor Core: 312 TFLOPS (624 TFLOPS with sparsity)
<d1b2> <johnsel> A100 for reference
<azonenberg> It looks like it is possible to do in vulkan via VK_NV_cooperative_matrix
<azonenberg> But i don't know how useful that would be for our use case
<d1b2> <johnsel> well an order of magnitude faster if you have some matrixes to multiply
<d1b2> <johnsel> perhaps it's a fun exercise for me to do to better familiarize myself with both Vulkan and the codebase
<azonenberg> Yeah I just dont know if any of the work we're doing is at all related to matrix multiplication
<azonenberg> S-parameter de-embedding is multiplying a 2-element vector by a 2x2 matrix
<azonenberg> but that's way too small to benefit from tensor cores
<d1b2> <johnsel> for a whole block?
<d1b2> <johnsel> that you will have to cut up
<d1b2> <johnsel> sounds vaguely familiar
<azonenberg> The de-embed algorithm we use is to resample the channel response to match the number of samples in the transformed input
<azonenberg> FFT the input
<azonenberg> multiply each complex FFT point by the interpolated channel response as a 2D matrix (scale by mag, rotate by angle)
<azonenberg> then inverse FFT
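(The per-bin step of the flow just described, written out in plain C++ on std::complex rather than as the actual shader; it assumes the channel response has already been interpolated to each FFT bin, and for the de-embed direction the magnitude/phase would be those of the inverse response.)

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Per-bin step described above, after the forward FFT and before the inverse
// FFT: each complex point is scaled by the interpolated channel magnitude and
// rotated by its phase (i.e. multiplied by the complex channel response).
void ApplyChannelResponse(
    std::vector<std::complex<float>>& spectrum,    // forward FFT of the input
    const std::vector<float>& mag,                 // |S21| interpolated per bin
    const std::vector<float>& angle)               // arg(S21) in radians, per bin
{
    for(size_t i = 0; i < spectrum.size(); i++)
        spectrum[i] *= std::polar(mag[i], angle[i]);
}
```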
<d1b2> <johnsel> sure but you can still do a few of those together
<azonenberg> the cooperative matrix extensions expect all threads to work together on one big matrix
<azonenberg> Also... next developer call is scheduled for 10:00 Pacific time (13:00 Eastern, 19:00 CET) on Monday the 12th
<d1b2> <johnsel> yes I think it goes back to our previous convo actually (which is coincidental), but it's a good example of what I meant
<d1b2> <johnsel> there's a 10x speedup to be gained if you can provide it the arrays to work on
<azonenberg> Yeah i just dont know if we have math that actually will map well to that currently
<d1b2> <johnsel> in matrix format, and process the output
<azonenberg> anyway... ping lain, johnsel, miek, mxshift re dev call, if you're interested
<d1b2> <johnsel> fft does as far as I am aware
<azonenberg> We're using vkFFT for FFT. i have no idea how it works under the hood
<d1b2> <johnsel> but hence why I said it would be a good thing for me to play around with
<azonenberg> but that is an issue to raise w/ the vkfft dev team not here
<azonenberg> i have no intention of maintaining a fft lib
<azonenberg> we moved away from clfft in part so we wouldn't have to
<d1b2> <johnsel> I only picked fft because I think it easily maps
<d1b2> <johnsel> I'll see what else there is if there's another algorithm that might be more useful to have
<d1b2> <johnsel> I will join in to the call
<azonenberg> We'll make more announcements closer to the time. Louis is hosting and will have the zoom link
<d1b2> <johnsel> though I have to warn everyone it will be the second time I have a convo in English the past 3? 4? years
<azonenberg> (I don't want to paste it in the open channel in case a bot finds it and decides to spam us - anyone is free to join but you'll have to ask)
<d1b2> <johnsel> by speaking that is
<azonenberg> Feel free to just listen in
<d1b2> <johnsel> I'll do my best to be understandable
<d1b2> <johnsel> But I appreciate the sentiment
<d1b2> <johnsel> unrelated: I hate PSpice
Johnsel has quit [Ping timeout: 240 seconds]
Johnsel has joined #scopehal
massi has joined #scopehal
<azonenberg> Working on Vulkan pipeline caching now
<azonenberg> For those of you who aren't familiar, Vulkan shaders are compiled at build time from your source language to SPIR-V bytecode, but then they get JITted at run time to a GPU specific native blob
<azonenberg> this takes a fair bit of time, ~5 ms for even a very simple shader
<azonenberg> so doing this a lot adds up
<azonenberg> there is a mechanism to cache that blob within or across runs of the app
<azonenberg> In current (not yet pushed) code I have caching working within a single run of the app
<azonenberg> but the cache is not persisted to disk
<azonenberg> the challenge there is ensuring that you invalidate the cache properly if, for example, you upgrade your GPU driver
<azonenberg> or the bytecode changes
<azonenberg> So i'm starting to work on that but have more to do
<azonenberg> vkFFT is going to need caching too, but this will be even more extensive
<azonenberg> because vkFFT does JIT generation of GLSL source code which then gets compiled to SPIR-V and then from that to the native blob
<azonenberg> (so we're actually toting around a shader compiler just like OpenGL etc would)
<azonenberg> And it's a different API but both will live in the same cache manager object
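(The standard Vulkan mechanism being described, as a minimal sketch: serialize the driver-specific blob with vkGetPipelineCacheData and restore it via pInitialData. Error handling and the metadata/invalidation checks discussed below are omitted; this is not the actual PipelineCacheManager code.)

```cpp
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.h>

// Serialize the driver-specific pipeline cache blob so it can be written to
// disk alongside driver/device metadata.
std::vector<uint8_t> SavePipelineCache(VkDevice device, VkPipelineCache cache)
{
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);
    std::vector<uint8_t> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    return blob;
}

// Restore a previously saved blob; the driver validates and falls back to an
// empty cache if the data is unusable.
VkPipelineCache LoadPipelineCache(VkDevice device, const std::vector<uint8_t>& blob)
{
    VkPipelineCacheCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = blob.size();
    info.pInitialData = blob.data();

    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache;
}
```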
<electronic_eel> is caching on disk across runs that important?
<electronic_eel> i mean if the jit for one shader takes 5ms, and you have a hundred and jit them all at startup, then starting up glscopeclient takes half a second longer. no big deal in my view
<electronic_eel> or if you run the jit just when a specific shader is used the first time, then it just makes the first waveform using some shader appearing a few milliseconds later
<azonenberg> electronic_eel: it's a bigger deal with the FFT which takes a lot longer
<azonenberg> or more complex shaders in general
<azonenberg> the fft feels like hundreds of ms
<azonenberg> havent profiled it yet, was going to do that after i take care of some other stuff
<electronic_eel> but the "hundreds of ms" only happens once, when you initialize the filter graph and include something new that needs fft, right?
<electronic_eel> so when you change some setting, the fft is still in cache, and then you wouldn't have to wait again
<azonenberg> every new size of FFT
<azonenberg> the shaders are all jitted
<azonenberg> and unrolled and such
<azonenberg> that is the single biggest thing to cache, doing the rest of the shaders is a "because we can and it's not much more work"
<azonenberg> let me check...
<electronic_eel> hmm, not really convinced yet. the disk cache adds quite a lot of causes for errors. like driver update, vulkan update, vkfft update, jit compiler update and so on.
<azonenberg> That's why the disk cache needs to store metadata in each cache entry for that
<azonenberg> the jit compiler is part of the driver
<azonenberg> vkfft has a version number in the header
<electronic_eel> and i don't think comparing released versions is good enough, there could be a bug somewhere, you patch it and recompile, and then you still got the old data
<azonenberg> its just gonna be a bit of yaml
<azonenberg> In rare cases like that, you can manually flush the cache, but i don't see us doing vkfft development much if ever
<electronic_eel> so maybe get a sha256sum of the binaries?
<azonenberg> vulkan actually includes a device uuid for this purpose
<azonenberg> which is supposed to change every driver release, and is typically a hash over the device ID, driver version, etc
<azonenberg> but good practice is to check the uuid as well as the driver version in case a particular driver doesn't always bump the uuid
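(A sketch of that validation step: compare the recorded pipeline cache UUID and driver/device identifiers against VkPhysicalDeviceProperties before trusting a persisted blob. The CacheHeader layout here is hypothetical; the real cache metadata format may differ.)

```cpp
#include <cstdint>
#include <cstring>
#include <vulkan/vulkan.h>

// Hypothetical on-disk metadata saved next to each cache blob.
struct CacheHeader
{
    uint8_t  pipelineCacheUUID[VK_UUID_SIZE];
    uint32_t driverVersion;
    uint32_t vendorID;
    uint32_t deviceID;
};

// Only trust a persisted blob if both the UUID and the driver/device identifiers
// recorded at save time match the running device, as discussed above.
bool CacheHeaderMatchesDevice(const CacheHeader& hdr, VkPhysicalDevice phys)
{
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(phys, &props);

    return (0 == memcmp(hdr.pipelineCacheUUID, props.pipelineCacheUUID, VK_UUID_SIZE))
        && (hdr.driverVersion == props.driverVersion)
        && (hdr.vendorID == props.vendorID)
        && (hdr.deviceID == props.deviceID);
}
```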
<azonenberg> also ok, 300-400 ms to initialize vkFFT for half a million points
<azonenberg> thats not even particularly large
<azonenberg> another test, for the de-embed filter
<azonenberg> 300-400 ms *each* to initialize for forward and reverse directions
<azonenberg> so 600-800 ms every time you change memory depth
<azonenberg> that is absolutely something to cache as much as you possibly can
<electronic_eel> hmm, yeah 800 ms is annoying
<azonenberg> it's to the point that i'm thinking of prefetching your scope's known memory depths in a background thread as soon as the app starts
<azonenberg> so by the time you create a fft or de-embed filter, they're ready to go
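(A hypothetical illustration of that prefetch idea; the function and parameter names are made up, and the plan-building callback stands in for whatever wraps vkFFT plan creation.)

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Warm the FFT plan cache for the scope's known memory depths on a background
// thread, so a plan is ready by the time an FFT or de-embed filter is created.
void PrefetchFFTPlans(
    const std::vector<size_t>& knownMemoryDepths,
    std::function<void(size_t npoints, bool inverse)> buildPlan)
{
    std::thread worker(
        [knownMemoryDepths, buildPlan]()
        {
            for(size_t depth : knownMemoryDepths)
            {
                buildPlan(depth, false);    // forward FFT plan
                buildPlan(depth, true);     // inverse FFT plan
            }
        });
    worker.detach();    // plans land in the shared cache before a filter needs them
}
```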
<electronic_eel> that would be another idea. then you wouldn't need the disk cache
<azonenberg> No
<azonenberg> the disk cache is to keep that from happening every app start
<azonenberg> ~every game using vulkan has an on disk cache, the mechanism exists for a reason
<d1b2> <Darius> caching shaders is very common for games too, definitely a normal thing these days
<d1b2> <Darius> heh
<azonenberg> Exactly
<azonenberg> ok so it's ~50ms to initialize vkfft with the cache
<azonenberg> not great, but could be a lot worse
<azonenberg> (still ram only)
<azonenberg> the cache data blobs are ~38K 32-bit words of data
<azonenberg> i might want to rethink my plan of using yaml as the container format for the cache
<azonenberg> that is a bit large to stick in a text based file format
<azonenberg> (this is for a 1M point FFT, presumably different sizes will be different)
<_whitenotifier-7> [scopehal] azonenberg pushed 4 commits to master [+3/-0/±23] https://github.com/glscopeclient/scopehal/compare/531a9ac99e0b...7e97c7dc436c
<_whitenotifier-7> [scopehal] azonenberg d606dba - Added RAM-only (nonpersistent) implementation of pipeline caching. See #677.
<_whitenotifier-7> [scopehal] azonenberg 862d42e - Initial RAM-only (nonpersistent) implementation of VkFFT plan caching. See #676.
<_whitenotifier-7> [scopehal] azonenberg a158c13 - Refactoring: moved CRC32 to global function and not in Filter class, it now takes a uint8_t* instead of a vector
<_whitenotifier-7> [scopehal] azonenberg 7e97c7d - Added pipeline caching for vkFFT including disk persistence. Fixes #676.
<_whitenotifier-7> [scopehal] azonenberg closed issue #676: VulkanFFTPlan: caching of plans - https://github.com/glscopeclient/scopehal/issues/676
<_whitenotifier-7> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-2/±3] https://github.com/glscopeclient/scopehal-apps/compare/6a1f29590140...a6c694ec3825
<_whitenotifier-7> [scopehal-apps] azonenberg a6c694e - Updated to latest scopehal. Removed FileSystem utilities and put them in scopehal.
<azonenberg> ok now vkFFT caching to disk is done, now i just have to do that for the regular shaders
<azonenberg> (vkFFT gives me the blob as a raw memory array while with vulkan you have a PipelineCache object that you have to serialize separately)
<miek> these Rosenberger 32K243-40ML5 connectors are very nice :)
<azonenberg> miek: they are indeed. using them for something?
<miek> yeah, i'm going to be building some USB3 test fixtures with them. so far i've just done a simple test to try them out on the JLC stackup: https://github.com/miek/usb3-fixtures/tree/main/microstrip-test#results
<azonenberg> nice
<azonenberg> what's the bw of your tdr?
<miek> the sampler is 18GHz, and i think it's a nominal 40ps step
<azonenberg> So fairly comparable to my leo bodnar pulse generator + 16 GHz scope then
<azonenberg> the dip looks similar to what i've seen on connector transitions, which is why i asked
<azonenberg> to make sure i understand that right, you've got a mismatch at one side and matched on the other?
<azonenberg> are you seeing any residual mismatch at the connector at all?
<miek> yeah, that's right - in the design i used their footprint diagram unchanged on one side & shrunk the signal trace to match the microstrip on the other side
<miek> not as far as i can see, if there is anything it's lost in the noise of the fr4 weave
<_whitenotifier-7> [scopehal-apps] azonenberg edited issue #482: Make sure vkFFT dependencies are correctly detected by CMake for all supported/upcoming platforms - https://github.com/glscopeclient/scopehal-apps/issues/482
<miek> at least i'm pretty sure that's what the waviness is during the microstrip part, on the board the weave is a few degrees off alignment to the trace so i think it makes sense to see it have those slow variations
<_whitenotifier-7> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±8] https://github.com/glscopeclient/scopehal/compare/7e97c7dc436c...7e16e3b690bb
<_whitenotifier-7> [scopehal] azonenberg 0b7d764 - vkFFT version is now included in cache file header
<_whitenotifier-7> [scopehal] azonenberg 7e16e3b - Implemented serialization of pipeline caching for Vulkan compute. Fixes #677.
<_whitenotifier-7> [scopehal] azonenberg closed issue #677: ComputePipeline: caching of pipelines - https://github.com/glscopeclient/scopehal/issues/677
<_whitenotifier-7> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal-apps/compare/a6c694ec3825...12cfaab9eb1c
<_whitenotifier-7> [scopehal-apps] azonenberg 12cfaab - Updated to latest scopehal with pipeline caching
<azonenberg> yes that looks like weave effect if you're rotated slightly
<_whitenotifier-7> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal/compare/7e16e3b690bb...98db1cb0162f
<_whitenotifier-7> [scopehal] azonenberg 98db1cb - Filter: don't cache invalid voltage range if channel data is null
<miek> oh and i borrowed your footprint, so thank you for that! :)
<azonenberg> FWIW, depending on stackup
<azonenberg> i've sometimes found it necessary to add a small ground plane cutout to get a good match with it
<azonenberg> it's subtle but there
<_whitenotifier-7> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal/compare/98db1cb0162f...dad31706600b
<_whitenotifier-7> [scopehal] azonenberg dad3170 - AcceleratorBuffer: don't copy old GPU buffer if it was empty
<_whitenotifier-7> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal/compare/dad31706600b...f0e8bf7b983d
<_whitenotifier-7> [scopehal] azonenberg f0e8bf7 - DeEmbedFilter: make sure output is actually marked as modified on the GPU
<_whitenotifier-7> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal-apps/compare/12cfaab9eb1c...2307ef81eb99
<_whitenotifier-7> [scopehal-apps] azonenberg 2307ef8 - Updated to latest scopehal
<azonenberg> miek: here's that connector plus a relatively unoptimized SMA-J-P-H-ST-EM1
<azonenberg> glscopeclient channel emulation with 6 GHz PicoVNA and 26.5 GHz FieldFox measurements of the same board (the latter by derek kozel from gnuradio)
<azonenberg> this was before i got the 8.5 GHz PicoVNA
<azonenberg> but you can see how the lower BW measurement loses a ton of detail
<azonenberg> you can also see my microstrip on this test board is about 10% above the nominal 50 ohms
massi has quit [Remote host closed the connection]
<_whitenotifier-7> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±4] https://github.com/glscopeclient/scopehal/compare/f0e8bf7b983d...8a8f7f137f8d
<_whitenotifier-7> [scopehal] azonenberg 8a8f7f1 - Pipeline cache now uses file modification timestamp for shaders
<_whitenotifier-7> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://github.com/glscopeclient/scopehal-apps/compare/2307ef81eb99...2fb613c56e6f
<_whitenotifier-7> [scopehal-apps] azonenberg 2fb613c - Don't copy shaders unless they've been modified, to avoid needlessly invalidating caches every compile
<azonenberg> And then opened https://github.com/DTolm/VkFFT/issues/85. Which should give us even more fft init performance
<_whitenotifier-7> [scopehal-apps] azonenberg commented on issue #412: Add history cap in memory units - https://github.com/glscopeclient/scopehal-apps/issues/412#issuecomment-1238435136
<_whitenotifier-7> [scopehal-apps] azonenberg closed issue #412: Add history cap in memory units - https://github.com/glscopeclient/scopehal-apps/issues/412
<_whitenotifier-7> [scopehal-apps] azonenberg commented on issue #311: Swap older history waveforms out to disk - https://github.com/glscopeclient/scopehal-apps/issues/311#issuecomment-1238436491
<azonenberg> So I was going to assemble the AKL-AV1 v0.5 prototype today
<azonenberg> but i cant find the bottom solder stencil
<azonenberg> looking in the bin for the project i see everything else i expected to find component wise, and the top stencil
<azonenberg> no bottom stencil
<azonenberg> checked all over with no luck, so for now i'm giving up
<azonenberg> ordering a new one
<azonenberg> there goes $20ish and a bit of time
<azonenberg> But it's not like i have any shortage of other stuff to do (including assembling the latest PT5)
<azonenberg> So I guess that's next on the agenda
<azonenberg> Same PCB as the previous test, but a new LPF
<azonenberg> And then the AD4 works OK-ish but i really want to fine tune the input matching on the amplifier better
<azonenberg> So that's ongoing research
Johnsel has quit [Remote host closed the connection]