<azonenberg>
So I'm thinking about porting the "threshold" filter to GPU
<azonenberg>
without hysteresis, that's trivial
<azonenberg>
With hysteresis, it gets more fun
<d1b2>
<johnsel> cuz that ain't parallel
<azonenberg>
Not trivially parallel, no. But I have an idea for a three-pass algorithm that is likely still faster than a serial one for the common case of a large waveform with millions of points
<azonenberg>
my tentative thought is to split the waveform into a fairly small number of blocks, say 1024 or something. enough to get a decent number of GPU threads
<azonenberg>
for each thread, scan your block from the beginning. If between the low/high hysteresis threshold, previous state is unknown so don't generate any output
<azonenberg>
once you get below the low or above the high threshold, you're in a well defined state and can threshold until you reach the end of your block
<azonenberg>
Record where in your block you started producing output. Now you have a list of gaps (the undefined region at the start of each block)
<d1b2>
<johnsel> I'm not sure how you process the waveforms right now, do you process the entire dataset or just incoming data?
<azonenberg>
Filters are all invoked on waveform objects, where a waveform corresponds to the data the scope returned for a single trigger event
<azonenberg>
anyway... next, make a second pass. Most blocks should have a well defined state at the end, so the subsequent block now has a high/low state at sample -1 and you can finish thresholding the start of that block
<azonenberg>
finally, you're left with any remaining gaps if there were entire blocks where the signal did not toggle at all
<azonenberg>
for this, either revert to CPU or run a single shader thread to do a linear sweep
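A minimal CPU-side sketch of the per-block first pass described above (the real implementation would be a Vulkan compute shader; the names and types here are illustrative, not the actual scopehal API):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct BlockResult
    {
        size_t firstDefined; // index of first sample with a well defined state (== end if none)
        bool   lastState;    // logic state at the end of the block (valid if firstDefined < end)
    };

    // Pass 1: threshold one block [start, end) with hysteresis. Samples before the
    // first crossing of loThresh/hiThresh are left unresolved; pass 2 fills them in
    // using the previous block's lastState.
    BlockResult ThresholdBlockFirstPass(
        const std::vector<float>& samples,
        std::vector<uint8_t>& out,
        size_t start, size_t end,
        float loThresh, float hiThresh)
    {
        BlockResult r{end, false};
        bool state = false;
        bool known = false;
        for(size_t i = start; i < end; i++)
        {
            float v = samples[i];
            if(v >= hiThresh)      { state = true;  known = true; }
            else if(v <= loThresh) { state = false; known = true; }
            // else: inside the hysteresis band, keep the previous state

            if(known)
            {
                if(r.firstDefined == end)
                    r.firstDefined = i;  // start of the well defined region; [start, i) is a gap
                out[i] = state ? 1 : 0;
            }
        }
        r.lastState = state;
        return r;
    }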
<azonenberg>
anyway, i'm building pipeline caching first before i do that
<azonenberg>
johnsel: this is the tricky scenario of parallel computing
<azonenberg>
when your problem is not embarrassingly parallel (zero dependencies between tasks done to different data elements)
<azonenberg>
but still has enough parallelism that you can get a significant speedup vs serial for most cases if you carefully design your algorithm
<d1b2>
<johnsel> yeah and we think about it a lot differently, I would build something that would work on a stream because I see all the overhead limits getting removed in a few years
<d1b2>
<johnsel> but we'd end up in a whole other discussion then
<d1b2>
<johnsel> anyway one thing that might be useful is you can probably throw away a lot of samples regardless
<d1b2>
<johnsel> though that still depends on how much control you have (or take) right now over distribution and sampling
<d1b2>
<johnsel> although presumably the threshold filter has more than the most basic thresholding
<d1b2>
<johnsel> I really should take the time to actually work through the code
<azonenberg>
The threshold filter right now has two options: single level (embarrassingly parallel) and hysteresis (not so, but likely still parallelizable for typical input)
<azonenberg>
Everything works on waveforms, you can stream by having waveforms represent consecutive blocks of samples if you want
<azonenberg>
but i do not see that going away any time soon
<azonenberg>
To give you an idea, right now we struggle to keep up with ~1.5 Gbps of data. I think we can get a fair bit past that if we leave GTK
<azonenberg>
a single AD9213 ADC as planned for one of my future scopes puts out 10 Gsps * 12 bits = 120 Gbps of data *per channel*
<azonenberg>
times a four channel scope is almost half a Tbps of waveform data
<d1b2>
<johnsel> Sure, but it's a pretty basic assumption whether you are working on small blocks of data or large ones
<azonenberg>
and that's not even particularly high sample rate as scopes go. My lecroy scopes can do 40 Gsps
<d1b2>
<johnsel> and I'm somewhat familiar with large datastreams, the last product I worked on dealt with 6 2D + 6 3D cameras
<azonenberg>
And so when the scopes are spitting out data 3+ orders of magnitude faster than we can process it, i do not see continuous streaming being viable any time soon, if ever. maybe in 20 years we'll be able to process a few hundred Gbps of waveform, but then the scopes will be faster
<d1b2>
<johnsel> though I did not touch most of those algorithms
<azonenberg>
that's so much less data it's not even funny. especially if compressed
<azonenberg>
a 4K RGB24 image is 3840 * 2160 * 24 bits = 199 Mbits, at 60 FPS that's about 12 Gbps of data. so the sum of all your cameras in 4Kp60 uncompressed is maybe about as much data as a single decent scope channel
<azonenberg>
if they're 1080p30 it's not even close
<azonenberg>
for comparison a single large scope waveform in my benchmark dataset is 512M samples * 16 bits (padded) = 8 Gbits of data for a single waveform in native ADC format, or 16 Gbits in fp32
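For reference, a quick standalone check of the arithmetic quoted above (values approximate):

    #include <cstdio>

    int main()
    {
        const double adc_bps      = 10e9 * 12;          // AD9213: 10 Gsps * 12 bits = 120 Gbps per channel
        const double scope_bps    = adc_bps * 4;        // four channels ~= 480 Gbps, almost half a Tbps
        const double frame_bits   = 3840.0 * 2160 * 24; // 4K RGB24 ~= 199 Mbit per frame
        const double video_bps    = frame_bits * 60;    // ~= 12 Gbps at 60 FPS
        const double wfm_bits_adc = 512e6 * 16;         // 512M samples of padded 16-bit ADC codes = 8 Gbit
        const double wfm_bits_f32 = 512e6 * 32;         // same waveform as fp32 = 16 Gbit

        printf("ADC %.0f Gbps/ch, scope %.0f Gbps, 4Kp60 %.1f Gbps, waveform %.0f / %.0f Gbit\n",
            adc_bps / 1e9, scope_bps / 1e9, video_bps / 1e9, wfm_bits_adc / 1e9, wfm_bits_f32 / 1e9);
        return 0;
    }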
<d1b2>
<johnsel> they were 1080p, but they were processed into point clouds. regardless, it's still in the same order of magnitude as what you are choking on right now
<azonenberg>
now to be fair the scope is not capable of spitting them out at 60 FPS
<azonenberg>
and yeah. same OOM. And like i said our limiting factor right now is mostly the GUI actually, for simple datasets
<azonenberg>
i think we can push a fair bit faster once we leave GTK. we're looking at like 30% GPU load for a lot of my benchmarks
<d1b2>
<johnsel> that's still a long way off from where you want to be though
<azonenberg>
Yes, agreed
<azonenberg>
Which is why we'll need to optimize. And perhaps also explore multi GPU
<d1b2>
<johnsel> and since you wouldn't want to process only a single trigger event, at some point you'd have to stop re-processing the same data
<d1b2>
<johnsel> or it's impossible
<d1b2>
<johnsel> and I never claimed to be an expert with all the solutions, just some familiarity with the types of issues you run into
<azonenberg>
Yeah. There are definitely limits with how well some stuff can parallelize
<azonenberg>
But the good news is, the less parallel stuff tends to be higher level data
<d1b2>
<johnsel> especially if you keep doing it over and over and over
<azonenberg>
which is working on digital bits and not raw analog samples
<azonenberg>
so your data rate is lower
<azonenberg>
Which is why i am particularly interested in a way to do parallel clock recovery so that I can sample and threshold the analog waveforms to a vector of digital bits as quickly as i possibly can
<azonenberg>
and then descramble those bits to e.g. 8b10b symbols
<azonenberg>
and now all of a sudden we're working with much smaller quantities of data
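As a rough illustration of that data reduction (the link rate here is just an example assumption, not a number from the discussion):

    #include <cstdio>

    int main()
    {
        // Example: a 1.25 Gbps 8b10b serial link captured at 40 Gsps as fp32
        const double analog_bps  = 40e9 * 32;    // raw analog samples: 1.28 Tbps
        const double digital_bps = 1.25e9;       // after CDR + thresholding: 1.25 Gbps of bits
        const double symbol_Bps  = 1.25e9 / 10;  // after 8b10b decode: 125 Msymbols/s, one byte each

        printf("analog %.2f Tbps -> bits %.2f Gbps -> symbols %.0f MB/s\n",
            analog_bps / 1e12, digital_bps / 1e9, symbol_Bps / 1e6);
        return 0;
    }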
<d1b2>
<johnsel> Anyway, the reason I think it's potentially more worthwhile than it might seem is that there is lots of movement in direct DMA from ethernet to GPU (w/ decompression), as well as CXL, which is basically 'high speed memory access for everyone (including FPGAs) on the same system/motherboard'
<d1b2>
<johnsel> specifically the 3.0 version of CXL w/ memory sharing with cache coherency
<d1b2>
<johnsel> that is not necessarily an argument for changing your approach now, but those are the current forward movements on the HPC markets and consumer GPU
<mxshift>
Direct DMA between NIC<->GPU is definitely making its way to more general use. Still only helps if your data is sourced or sinked off-machine.
<d1b2>
<johnsel> there are lots of reasons why it might not help you and you will slow down, hence why you really want a processing pipeline that can process large and small chunks, and decide what is best to use when. But that's a fundamental architectural decision: either go for the approach where you want waveforms as large as possible, process them in one go, and present a varying window into that processed memory (which is much simpler, for sure), or
<d1b2>
decouple them more and try to build something that leverages everything you can to never process things twice, with compression for those times you do put things over IO, etc
<d1b2>
<johnsel> the IO overhead is big, so it's really not an unreasonable approach to just say let's minimize that and try to parallelize as much after that
<d1b2>
<johnsel> but it's not optimal, for sure
<d1b2>
<johnsel> especially once more and more techniques become 'free' from the HPC and Gaming end of things where they run into the same problem
<d1b2>
<johnsel> the M1 will be a very interesting case to see how much effect IO can have with its shared memory
<d1b2>
<johnsel> the PS5 (Ratchet & Clank) is also an interesting showcase of what super fast firehosing data can do for you
<d1b2>
<johnsel> and those things will definitely be far more commonplace in say 5 years
<d1b2>
<johnsel> and there are many non hw-techniques like using data compression to minimize the impact of IO overhead
<d1b2>
<johnsel> I mean 32 or 64 GT/s = 32 or 64 GT/s
<d1b2>
<johnsel> GPU memory is 1 terabyte/s right now
<d1b2>
<johnsel> which is an order of magnitude and then some still
<d1b2>
<johnsel> but it's much simpler to buy a terabyte of RAM than to buy a terabyte of GPU mem
<d1b2>
<johnsel> and if you go multi-gpu, then fuck that PCIe bus is in there anyway
<d1b2>
<johnsel> so you just will have to deal with it anyway
<d1b2>
<johnsel> and yes there are multi-gpu methods that use high speed interconnects, but that is precisely what CXL is competing against, without the vendor lock-in and wait lists, and good luck using that under Vulkan
<mxshift>
Always comes back to latency and throughput, and measuring for IO size and placement to hit the throughput and latency you want/need
<d1b2>
<johnsel> well we want as much for as little as possible, that's a given :p
<mxshift>
Streaming data into GPUs can work but it's a very different case with different characteristics from operating entirely in GDDR
<d1b2>
<johnsel> but there's also the reality that this data inherently arrives to us in stream form; chunking it up into a couple of GBs and then processing already trades off a lot of latency
<d1b2>
<johnsel> yeah it's very different, but since we're revisiting basic architecture, now is definitely the time to think about how strongly we feel about that decision being the right one
<mxshift>
Nothing about GPU behavior is stream based until you get very high level
<d1b2>
<johnsel> I disagree, nothing about GPU behavior is frame or chunk based until you get very high level
<d1b2>
<johnsel> vulkan doesn't even have frames anymore, you even have to define those yourself
<d1b2>
<johnsel> anyway I'm happy to defer, but I'm more happy to pit the approaches against each other in the new grassroots app discussions when it's still cheap to decide
<azonenberg>
The block based waveform processing architecture is extremely fundamental to how all of the filter graph blocks etc are built
<azonenberg>
If we ever add streaming, it would likely be just a hint to the block based filters
<azonenberg>
that says "hey FYI, the next waveform happened right after the last one so it's OK to preserve state across them"
<azonenberg>
but the other thing to consider is, very often you want to go back in time and look at older historical data etc
<azonenberg>
which really makes more sense in a block based flow than a stream based one
<d1b2>
<johnsel> Sure I wouldn't know how to do it differently than with blocks
<d1b2>
<johnsel> I don't mean a literal C++ stream object
<d1b2>
<johnsel> just the fundamental code style of thinking in streams
<d1b2>
<johnsel> I mean not even code style, the sw architecture
<d1b2>
<johnsel> I think you can simplify it to do you expect large blocks that move almost never, or small blocks that move around as little as possible
<d1b2>
<johnsel> and do you expect a single processing 'stream' or potentially multiple with different local state
<azonenberg>
So first off, it's not a stream. it's a filter graph
<azonenberg>
very often one processing block will have output consumed by multiple downstream blocks
<azonenberg>
it's a DAG not a linear chain
<azonenberg>
There is potential for optimization: if I have two consecutive filter blocks that *do not* render the intermediate result on screen, I can skip a redundant memory save/load
<azonenberg>
but actually concatenating the shaders would be nontrivial
<d1b2>
<johnsel> I don't think what I mean is getting across and I'm not sure I can formulate it better right now. We can chalk it up to my lack of communication skills. The filter graph is an obvious given
<mxshift>
It's less about streaming vs blocks and more about data movement. Getting data in and out of GPU RAM is far from free and can be done in ways that either optimize for latency or throughput.
<azonenberg>
Yeah. What I'm trying to do as much as possible is push raw ADC codes from the scope's socket into the GPU
<azonenberg>
convert to fp32 on the GPU, do as much as possible of the heavy lifting there (even if slower than CPU, to avoid a transfer step)
<azonenberg>
render on the GPU
<azonenberg>
and only pull off to the CPU if i have a complex filter block that doesn't parallelize usefully
<mxshift>
But there is always a minimal amount of processing from the socket to get it into a form for input into the filter graph
<d1b2>
<johnsel> now we're getting there mxshift
<d1b2>
<johnsel> and GPU memory is stupid expensive when all you do is present a few lines out of those GBs
<mxshift>
So NIC->GPU doesn't help unless you do that scope driver work on the GPU
<azonenberg>
johnsel: yeah but we're not doing that. the rendering shader touches every sample on screen to do intensity grading
<d1b2>
<johnsel> it's just a really simple memory management strategy, I think we can agree to that no?
<d1b2>
<johnsel> but you don't have to do that more than once
<azonenberg>
we're not even using opengl/vulkan polygon geometry for the rendering. it's a custom rasterizer
<d1b2>
<johnsel> you can cache that
<azonenberg>
cache?? every frame it has to be redrawn. even if you're just panning the image, samples are coming on/off screen and moving from one pixel to another
<azonenberg>
if zooming, more so
<azonenberg>
and of course if new waveform data is coming in then you have to update the data in memory with that
<d1b2>
<johnsel> you can cache the output
<d1b2>
<johnsel> of your rasterizer
<d1b2>
<johnsel> in the simplest form
<azonenberg>
We only call the compute shader if the image changes
<d1b2>
<johnsel> if it has been visible once
<azonenberg>
i.e. if the data has changed or we've changed zoom or position
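A minimal sketch of that dirty-check, with made-up field names (the actual glscopeclient state tracking differs):

    #include <cstdint>

    struct ViewState
    {
        uint64_t waveformRevision; // bumped whenever new waveform data lands
        double   pixelsPerSample;  // horizontal zoom
        double   offset;           // horizontal pan
    };

    // Re-run the rasterizing compute shader only when something visible changed
    bool NeedsRerender(const ViewState& prev, const ViewState& cur)
    {
        return cur.waveformRevision != prev.waveformRevision
            || cur.pixelsPerSample  != prev.pixelsPerSample
            || cur.offset           != prev.offset;
    }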
<d1b2>
<johnsel> right and all the rest of the time you are wasting GPU memory holding data you don't need
<d1b2>
<johnsel> because sometimes it might change
CobaltBlue has joined #scopehal
<d1b2>
<johnsel> when for many of those changes you don't need all the data anyway
<d1b2>
<johnsel> and if you do, you still don't need it all the rest of the time
<azonenberg>
the primary case we're optimizing for is lots of data flowing through fairly rapidly
<d1b2>
<johnsel> now it's not that big of a deal, but one approach scales, the other doesn't
<azonenberg>
where each waveform gets processed, displayed, and then is moved into history
<d1b2>
<johnsel> exactly, so why do you keep that data at hand?
<d1b2>
<johnsel> surely you can keep the last X samples only
<azonenberg>
We do that already, to an extent. i'm tweaking cache settings
<d1b2>
<johnsel> sure X billion
<azonenberg>
but essentially the current waveform and filters processing it etc live in GPU memory
<d1b2>
<johnsel> right so it becomes a 'where do you put the responsibility'
<azonenberg>
and we push anything older out to history which is just normal CPU RAM by default (or pinned memory)
<azonenberg>
or even memory mapped files
<mxshift>
That's a forward flow process through the filter graph. You _can_ do a backwards demand scheme which can reduce how much data needs to be touched with lots of handwaving
<azonenberg>
We have support for all of the above and i'm still fine tuning some optimization settings to figure out exactly when to process things
<d1b2>
<johnsel> but that is handled at the driver level right?
<d1b2>
<johnsel> the scope driver I mean
<azonenberg>
no, scope drivers just spit out waveforms every trigger
<d1b2>
<johnsel> and then everything flows off of that
<azonenberg>
the filter graph is managed way downstream and may consume output from multiple scope drivers, files, waveform creation filters, etc
<azonenberg>
every time a new set of waveforms has arrived from all connected instruments, the graph is re-evaluated
<d1b2>
<johnsel> alright well perhaps we're nearer than it seemed then
<azonenberg>
Right now with the WIP vulkan rewrite, we do keep some historical waveforms in GPU memory
<azonenberg>
this is just a matter of calling SetUsageHint() on the buffer at the appropriate time when pushing it into history
<azonenberg>
we also have memory pools to reuse buffers that scrolled off history, because vulkan memory allocation is expensive
<azonenberg>
the challenge is not bloating GPU memory usage too much, but also not spending too much time in memcpy's and reallocations
<azonenberg>
it's a difficult balance and one i'm still tuning
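A sketch of the history/pool idea being discussed; the enum, the SetUsageHint signature, and the pool API here are illustrative stand-ins, not the actual scopehal AcceleratorBuffer interface:

    #include <memory>
    #include <utility>
    #include <vector>

    enum class UsageHint
    {
        GpuResident,   // hot: the current waveform being processed and rendered
        CpuPreferred,  // warm: recent history in host RAM or pinned memory
        Archived       // cold: may be backed by a memory mapped file
    };

    struct WaveformBuffer
    {
        void SetUsageHint(UsageHint h) { m_hint = h; /* a real buffer would migrate its storage here */ }
        UsageHint m_hint = UsageHint::GpuResident;
    };

    // Reuse buffers that scrolled off history instead of paying for a fresh Vulkan
    // allocation, which is expensive.
    class BufferPool
    {
    public:
        std::shared_ptr<WaveformBuffer> Get()
        {
            if(!m_free.empty())
            {
                auto buf = m_free.back();
                m_free.pop_back();
                return buf;
            }
            return std::make_shared<WaveformBuffer>();
        }

        void Recycle(std::shared_ptr<WaveformBuffer> buf)
        { m_free.push_back(std::move(buf)); }

    private:
        std::vector<std::shared_ptr<WaveformBuffer>> m_free;
    };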
<d1b2>
<johnsel> yeah but not if you would want to leverage e.g. RDMA
<d1b2>
<johnsel> so all I would say is let us not forget those things
<d1b2>
<johnsel> and also look at compressed data transfers
<d1b2>
<johnsel> GPUs can decompress certain formats suuuper fast
<azonenberg>
We're currently limited to the formats that the scope is able to generate
<azonenberg>
Their APIs and interchange formats are often not optimized for gigabit streaming
<d1b2>
<johnsel> not when you move it around from GPU to CPU
<azonenberg>
We need random access to the waveform data and I am highly skeptical any kind of lossless compression would be faster than just reading raw float[]'s
<d1b2>
<johnsel> hmm yes lossless....
<azonenberg>
if anything, what might be worth looking at is using fp16 for cases where the underlying precision of the waveform source (ADC bit depth etc) is such that fp32 is more precision than we need
<azonenberg>
but fp32 is a lot nicer wrt not losing precision at intermediate processing steps, with fp16 we'd have to pay a lot more attention to numerical stability and rounding error etc
<d1b2>
<johnsel> there are huge performance implications though
<d1b2>
<johnsel> though I even think they are worse for fp16
<azonenberg>
for going to fp16? yes, that is definitely on the longer term list of things to explore
<azonenberg>
fp16 is hugely faster, that's why it exists
<d1b2>
<johnsel> or it's 64
<azonenberg>
fp64 is slow especially on consumer GPUs. it's mostly used in scientific applications and the vendors deliberately cripple fp64 performance on consumer GPUs so you have to buy quadro/tesla cards
<azonenberg>
for that work
<d1b2>
<johnsel> at this point I'm talking out of my ass
<azonenberg>
fp16 is used all over the place in games and ML
<d1b2>
<johnsel> I just remember something being weird about it
<azonenberg>
it's a question of doing the numerical analysis to determine if we can get away with it and get acceptable accuracy
<azonenberg>
the key thing is, fp16 halves the memory requirements
<d1b2>
<johnsel> on that note are you aware of the funky AI formats?
<azonenberg>
half the vram and, more importantly, half the bandwidth
<azonenberg>
fp16 would make sense, i think, for data coming from an 8 bit scope
<azonenberg>
but not for anything bigger
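One reason 8-bit data is the comfortable case: an IEEE half float carries 11 bits of significand, so it only represents consecutive integers exactly up to 2^11. A tiny sanity-check snippet:

    #include <cstdio>

    int main()
    {
        const int  significandBits = 11;                    // IEEE binary16: 10 stored + 1 implicit bit
        const long maxExactInt     = 1L << significandBits; // 2048

        printf("fp16 is exact for consecutive integers up to %ld\n", maxExactInt);
        printf("8-bit ADC codes (0..255) fit easily; 12-bit codes (0..4095) already do not\n");
        return 0;
    }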
<azonenberg>
at this point i don't want to put in the effort as we have bigger issues
<azonenberg>
It looks like it is possible to do in vulkan via VK_NV_cooperative_matrix
<azonenberg>
But i don't know how useful that would be for our use case
<d1b2>
<johnsel> well an order of magnitude faster if you have some matrixes to multiply
<d1b2>
<johnsel> perhaps it's a fun exercise for me to do to better familiarize myself with both Vulkan and the codebase
<azonenberg>
Yeah I just don't know if any of the work we're doing is at all related to matrix multiplication
<azonenberg>
S-parameter de-embedding is multiplying a 2-element vector by a 2x2 matrix
<azonenberg>
but that's way too small to benefit from tensor cores
<d1b2>
<johnsel> for a whole block?
<d1b2>
<johnsel> that you will have to cut up
<d1b2>
<johnsel> sounds vaguely familiar
<azonenberg>
The de-embed algorithm we use is to resample the channel response to match the number of samples in the transformed input
<azonenberg>
FFT the input
<azonenberg>
multiply each complex FFT point by the interpolated channel response as a 2D matrix (scale by mag, rotate by angle)
<azonenberg>
then inverse FFT
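A compact sketch of the frequency-domain step just described, assuming the forward FFT has already run (in scopehal that part is done on the GPU via vkFFT); the function and parameter names are illustrative:

    #include <complex>
    #include <cstddef>
    #include <vector>

    // Multiply each FFT bin by the interpolated channel response: scale by the
    // magnitude and rotate by the phase (equivalent to a 2x2 rotate+scale matrix
    // acting on the (re, im) vector). Pass invert=true to de-embed the channel.
    void ApplyChannelResponse(
        std::vector<std::complex<float>>& spectrum,   // FFT of the input waveform
        const std::vector<float>& channelMag,         // |S21| resampled to the FFT bin frequencies
        const std::vector<float>& channelPhase,       // angle(S21) at each bin, radians
        bool invert)
    {
        for(size_t i = 0; i < spectrum.size(); i++)
        {
            float mag   = invert ? 1.0f / channelMag[i] : channelMag[i];
            float phase = invert ? -channelPhase[i]     : channelPhase[i];
            spectrum[i] *= std::polar(mag, phase);
        }
        // ...then inverse FFT the spectrum back to the time domain...
    }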
<d1b2>
<johnsel> sure but you can still do a few of those together
<azonenberg>
the cooperative matrix extensions expect all threads to work together on one big matrix
<azonenberg>
Also... next developer call is scheduled for 10:00 Pacific time (13:00 Eastern, 19:00 CET) on Monday the 12th
<d1b2>
<johnsel> yes I think it goes back to our previous convo actually (which is coincidental), but it's a good example of what I meant
<d1b2>
<johnsel> there's a 10x speedup to be gained if you can provide it the arrays to work on
<azonenberg>
Yeah i just don't know if we have math that actually will map well to that currently
<d1b2>
<johnsel> in matrix format, and process the output
<azonenberg>
anyway... ping lain, johnsel, miek, mxshift re dev call, if you're interested
<d1b2>
<johnsel> fft does as far as I am aware
<azonenberg>
We're using vkFFT for FFT. i have no idea how it works under the hood
<d1b2>
<johnsel> but hence why I said it would be a good thing for me to play around with
<azonenberg>
but that is an issue to raise w/ the vkfft dev team not here
<azonenberg>
i have no intention of maintaining a fft lib
<azonenberg>
we moved away from clfft in part so we wouldn't have to
<d1b2>
<johnsel> I only picked fft because I think it easily maps
<d1b2>
<johnsel> I'll see what else there is if there's another algorithm that might be more useful to have
<d1b2>
<johnsel> I will join in to the call
<azonenberg>
We'll make more announcements closer to the time. Louis is hosting and will have the zoom link
<d1b2>
<johnsel> though I have to warn everyone it will be the second time I have a convo in English in the past 3? 4? years
<azonenberg>
(I don't want to paste it in the open channel in case a bot finds it and decides to spam us - anyone is free to join but you'll have to ask)
<d1b2>
<johnsel> by speaking that is
<azonenberg>
Feel free to just listen in
<d1b2>
<johnsel> I'll do my best to be understandable
<d1b2>
<johnsel> But I appreciate the sentiment
<d1b2>
<johnsel> unrelated: I hate PSpice
Johnsel has quit [Ping timeout: 240 seconds]
Johnsel has joined #scopehal
Johnsel has quit [Ping timeout: 244 seconds]
Johnsel has joined #scopehal
Johnsel has quit [Ping timeout: 244 seconds]
Johnsel has joined #scopehal
massi has joined #scopehal
<azonenberg>
Working on Vulkan pipeline caching now
<azonenberg>
For those of you who aren't familiar, Vulkan shaders are compiled at build time from your source language to SPIR-V bytecode, but then they get JITted at run time to a GPU specific native blob
<azonenberg>
this takes a fair bit of time, ~5 ms for even a very simple shader
<azonenberg>
so doing this a lot adds up
<azonenberg>
there is a mechanism to cache that blob within or across runs of the app
<azonenberg>
In current (not yet pushed) code I have caching working within a single run of the app
<azonenberg>
but the cache is not persisted to disk
<azonenberg>
the challenge there is ensuring that you invalidate the cache properly if, for example, you upgrade your GPU driver
<azonenberg>
or the bytecode changes
<azonenberg>
So i'm starting to work on that but have more to do
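For reference, the standard Vulkan mechanism for persisting the cache looks roughly like this (plain C API; the actual scopehal cache manager wraps vulkan.hpp and adds its own metadata):

    #include <cstdint>
    #include <vector>
    #include <vulkan/vulkan.h>

    // Create a pipeline cache, seeding it with a blob previously read from disk
    // (pass an empty vector on the first run).
    VkPipelineCache CreatePipelineCache(VkDevice device, const std::vector<uint8_t>& diskBlob)
    {
        VkPipelineCacheCreateInfo info = {};
        info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
        info.initialDataSize = diskBlob.size();
        info.pInitialData    = diskBlob.empty() ? nullptr : diskBlob.data();

        VkPipelineCache cache = VK_NULL_HANDLE;
        vkCreatePipelineCache(device, &info, nullptr, &cache);
        return cache;
    }

    // At shutdown, pull the (possibly grown) blob back out so it can be written to disk.
    std::vector<uint8_t> SerializePipelineCache(VkDevice device, VkPipelineCache cache)
    {
        size_t size = 0;
        vkGetPipelineCacheData(device, cache, &size, nullptr);
        std::vector<uint8_t> blob(size);
        vkGetPipelineCacheData(device, cache, &size, blob.data());
        return blob;
    }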
<azonenberg>
vkFFT is going to need caching too, but this will be even more extensive
<azonenberg>
because vkFFT does JIT generation of GLSL source code which then gets compiled to SPIR-V and then from that to the native blob
<azonenberg>
(so we're actually toting around a shader compiler just like OpenGL etc would)
<azonenberg>
And it's a different API but both will live in the same cache manager object
<electronic_eel>
is caching on disk across runs that important?
<electronic_eel>
i mean if the jit for one shader takes 5ms, and you have a hundred and jit them all at startup, then starting up glscopeclient takes half a second longer. no big deal in my view
<electronic_eel>
or if you run the jit just when a specific shader is used the first time, then it just makes the first waveform using some shader appearing a few milliseconds later
<azonenberg>
electronic_eel: it's a bigger deal with the FFT which takes a lot longer
<azonenberg>
or more complex shaders in general
<azonenberg>
the fft feels like hundreds of ms
<azonenberg>
haven't profiled it yet, was going to do that after i take care of some other stuff
<electronic_eel>
but the "hundreds of ms" only happens once, when you initialize the filter graph and include something new that needs fft, right?
<electronic_eel>
so when you change some setting, the fft is still in cache, and then you wouldn't have to wait again
<azonenberg>
every new size of FFT
<azonenberg>
the shaders are all jitted
<azonenberg>
and unrolled and such
<azonenberg>
that is the single biggest thing to cache, doing the rest of the shaders is a "because we can and it's not much more work"
<azonenberg>
let me check...
<electronic_eel>
hmm, not really convinced yet. the disk cache adds quite a lot of possible sources of errors, like driver update, vulkan update, vkfft update, jit compiler update and so on.
<azonenberg>
That's why the disk cache needs to store metadata in each cache entry for that
<azonenberg>
the jit compiler is part of the driver
<azonenberg>
vkfft has a version number in the header
<electronic_eel>
and i don't think comparing released versions is good enough, there could be a bug somewhere, you patch it and recompile, and then you still got the old data
<azonenberg>
it's just gonna be a bit of yaml
<azonenberg>
In rare cases like that, you can manually flush the cache, but i don't see us doing vkfft development much if ever
<electronic_eel>
so maybe get a sha256sum of the binaries?
<azonenberg>
vulkan actually includes a device uuid for this purpose
<azonenberg>
which is supposed to change every driver release and is typically a hash over the device id, driver version, etc
<azonenberg>
but good practice is to check the uuid as well as the driver version in case a particular driver doesn't always bump the uuid
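A sketch of that validity check using the stock Vulkan device properties (the metadata layout is illustrative; the real cache header is the bit of yaml mentioned above, and vkFFT entries would also need the vkFFT version checked):

    #include <cstring>
    #include <vulkan/vulkan.h>

    // Metadata saved alongside each on-disk cache entry
    struct CacheHeaderMeta
    {
        uint32_t vendorID;
        uint32_t deviceID;
        uint32_t driverVersion;
        uint8_t  uuid[VK_UUID_SIZE];
    };

    // Returns true if a cache blob written under 'saved' can be reused on the
    // device described by 'props' (from vkGetPhysicalDeviceProperties)
    bool CacheIsValid(const CacheHeaderMeta& saved, const VkPhysicalDeviceProperties& props)
    {
        return saved.vendorID      == props.vendorID
            && saved.deviceID      == props.deviceID
            && saved.driverVersion == props.driverVersion  // in case a driver forgets to bump the UUID
            && 0 == memcmp(saved.uuid, props.pipelineCacheUUID, VK_UUID_SIZE);
    }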
<azonenberg>
also ok, 300-400 ms to initialize vkFFT for half a million points
<azonenberg>
that's not even particularly large
<azonenberg>
another test, for the de-embed filter
<azonenberg>
300-400 ms *each* to initialize for forward and reverse directions
<azonenberg>
so 600-800 ms every time you change memory depth
<azonenberg>
that is absolutely something to cache as much as you possibly can
<electronic_eel>
hmm, yeah 800 ms is annoying
<azonenberg>
it's to the point that i'm thinking of prefetching your scope's known memory depths in a background thread as soon as the app starts
<azonenberg>
so by the time you create a fft or de-embed filter, they're ready to go
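A minimal sketch of that background prefetch; the plan-building callback stands in for whatever actually creates the vkFFT plan:

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <utility>
    #include <vector>

    // Warm the FFT plan / pipeline caches for every memory depth the connected
    // scope is known to support, so the first FFT or de-embed filter doesn't stall.
    void PrefetchFFTPlans(std::vector<size_t> knownDepths, std::function<void(size_t)> buildPlan)
    {
        std::thread(
            [depths = std::move(knownDepths), build = std::move(buildPlan)]()
            {
                for(size_t depth : depths)
                    build(depth);   // each call populates the cache for that FFT size
            }).detach();
    }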
<electronic_eel>
that would be another idea. then you wouldn't need the disk cache
<azonenberg>
No
<azonenberg>
the disk cache is to keep that from happening every app start
<azonenberg>
~every game using vulkan has an on disk cache, the mechanism exists for a reason
<d1b2>
<Darius> caching shaders is very common for games too, definitely a normal thing these days
<d1b2>
<Darius> heh
<azonenberg>
Exactly
<azonenberg>
ok so it's ~50ms to initialize vkfft with the cache
<azonenberg>
not great, but could be a lot worse
<azonenberg>
(still ram only)
<azonenberg>
the cache data blobs are ~38K 32-bit words of data
<azonenberg>
i might want to rethink my plan of using yaml as the container format for the cache
<azonenberg>
that is a bit large to stick in a text based file format
<azonenberg>
(this is for a 1M point FFT, presumably different sizes will be different)
<_whitenotifier-7>
[scopehal] azonenberg d606dba - Added RAM-only (nonpersistent) implementation of pipeline caching. See #677.
<_whitenotifier-7>
[scopehal] azonenberg 862d42e - Initial RAM-only (nonpersistent) implementation of VkFFT plan caching. See #676.
<_whitenotifier-7>
[scopehal] azonenberg a158c13 - Refactoring: moved CRC32 to global function and not in Filter class, it now takes a uint8_t* instead of a vector
<_whitenotifier-7>
[scopehal] azonenberg 7e97c7d - Added pipeline caching for vkFFT including disk persistence. Fixes #676.
<miek>
the sampler is 18GHz, and i think it's a nominal 40ps step
<azonenberg>
So fairly comparable to my leo bodnar pulse generator + 16 GHz scope then
<azonenberg>
the dip looks similar to what i've seen on connector transitions, which is why i asked
<azonenberg>
to make sure i understand that right, you've got a mismatch at one side and matched on the other?
<azonenberg>
are you seeing any residual mismatch at the connector at all?
<miek>
yeah, that's right - in the design i used their footprint diagram unchanged on one side & shrunk the signal trace to match the microstrip on the other side
<miek>
not as far as i can see, if there is anything it's lost in the noise of the fr4 weave
<miek>
at least i'm pretty sure that's what the waviness is during the microstrip part, on the board the weave is a few degrees off alignment to the trace so i think it makes sense to see it have those slow variations
<azonenberg>
miek: here's that connector plus a relatively unoptimized SMA-J-P-H-ST-EM1
<azonenberg>
glscopeclient channel emulation with 6 GHz PicoVNA and 26.5 GHz FieldFox measurements of the same board (the latter by derek kozel from gnuradio)
<azonenberg>
this was before i got the 8.5 GHz PicoVNA
<azonenberg>
but you can see how the lower BW measurement loses a ton of detail
<azonenberg>
you can also see my microstrip on this test board is about 10% above the nominal 50 ohms
massi has quit [Remote host closed the connection]