azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/glscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
<d1b2> <louis> azonenberg: About to PR the minimal zero-hold changes I was carrying
<d1b2> <louis> More generally, how do we want to deal with sparse waveform data that does not satisfy offset[i] + duration[i] == offset[i+1]? Because currently all(?) rendering pathways assume this
<d1b2> <louis> I would be loath to lose the optimization of avoiding copying the durations[], but clearly we need to in some cases to correctly render waveforms that don't have data at some points
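For reference, the invariant louis is describing: a sparse waveform is gap-free only if every sample ends exactly where the next one begins. A minimal standalone sketch of that check (names are illustrative, not scopehal's actual API):

```cpp
// Hypothetical sketch of the contiguity invariant under discussion:
// a sparse waveform is gap-free iff offset[i] + duration[i] == offset[i+1]
// for every adjacent pair of samples.
#include <cstdint>
#include <vector>

bool IsGapFree(const std::vector<int64_t>& offsets,
               const std::vector<int64_t>& durations)
{
    for(size_t i = 0; i + 1 < offsets.size(); i++)
    {
        if(offsets[i] + durations[i] != offsets[i + 1])
            return false;   // gap (or overlap) between sample i and i+1
    }
    return true;
}
```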
Degi has quit [Ping timeout: 244 seconds]
Degi_ has joined #scopehal
Degi_ is now known as Degi
<d1b2> <Darius> I ended up using PocketFFT at work (which is used by numpy/scipy/etc)
<azonenberg> I think it's OK to give up on that optimization near term
<azonenberg> because we don't usually have huge sparse waveforms
<azonenberg> at least analog/digital, vs protocol
<azonenberg> the one exception is things like a cdr pll output
<azonenberg> which is large and sparse but gap free
<d1b2> <louis> It isn't actually that complex in the shader to support both
<d1b2> <louis> More it's just annoying to have this combinatorial explosion of shader versions to handle in cpp
<d1b2> <louis> So wondering if it would be improved by rewriting that handling to be less manual
<_whitenotifier-7> [scopehal-apps] 602p opened pull request #504: Zero-hold rendering flag - https://github.com/glscopeclient/scopehal-apps/pull/504
<_whitenotifier-7> [scopehal] 602p opened pull request #698: Zero-hold rendering flag - https://github.com/glscopeclient/scopehal/pull/698
<d1b2> <louis> PR opened. If you're not hankering to merge right now I will take a stab at having sparse rendering take durations
<_whitenotifier-7> [scopehal-apps] 602p synchronize pull request #504: Zero-hold rendering flag - https://github.com/glscopeclient/scopehal-apps/pull/504
<_whitenotifier-7> [scopehal-apps] 602p edited pull request #504: Zero-hold rendering - https://github.com/glscopeclient/scopehal-apps/pull/504
<d1b2> <louis> At the moment the behaviour in that PR keeps the existing behaviour of disregarding duration, except when zero-hold rendering is requested.
<d1b2> <louis> Not sure what we want to do more generally since I'm not sure how well-defined interpolation is on a noncontinuous (i.e. offs[i] + dur[i] != offs[i+1]) waveform and I wanted to leave that logic as-is
<_whitenotifier-7> [scopehal-apps] 602p synchronize pull request #504: Zero-hold rendering - https://github.com/glscopeclient/scopehal-apps/pull/504
<azonenberg> yeah i think that probably makes the most sense
<azonenberg> zero hold has gaps, interpolation doesn't
<azonenberg> pass the duration buffer always, but the shader won't read from it unless doing zero hold?
<azonenberg> I'm not sure i like the idea of having it be a flag in the waveform though
<azonenberg> it makes more sense to be a view setting
<azonenberg> IMO
<d1b2> <louis> As PR'd, the flag is specified as a hint, with a note that it could be overridden by a per-view setting
<d1b2> <louis> I'm not attached to the flag, my wish is just that (IMO) the sparse measurement filters make more sense to be rendered sparse as a default so we need some way to communicate that
<d1b2> <louis> Maybe a flag for "the offsets and durations here are perfectly precise (relative to the input waveform) and not a product of a sample rate" that gets set on measurements, and then there could be a setting for the (default) rendering behavior of that kind of waveform
<d1b2> <louis> Since that's really the semantic to me
<azonenberg> hmm i'm not even sure about that
<azonenberg> like, if i'm looking at a FM modulation or something
<azonenberg> i think linear interpolation makes more sense
<d1b2> <louis> hm
<d1b2> <louis> At root my UX wish is that non-contiguous measurements [configurably] aren't interpolated by default; I don't really care how that's accomplished. But it seems to necessitate some flag for noncontiguity
<azonenberg> yeah
<azonenberg> i'm just thinking that for some cycle by cycle measurements it might make more sense
<azonenberg> but i guess to start, making all of them zero-hold makes the most sense?
<d1b2> <louis> Thinking about it more, part of my beef is that the current interpolation behaviour means that the line contradicts the data inside the sample, since interpolation is between the leftmost point of sample n and the leftmost point of sample n+1
<azonenberg> can you explain what you mean?
<azonenberg> interpolation is effectively putting a dot at each sample then connecting the dots
<d1b2> <louis> yes, but the samples have duration so they aren't dots really
<d1b2> <louis> the code in that PR does
<d1b2> <louis> current interpolation behaviour is
<d1b2> <louis> Maybe preferable for noncontiguous waveforms would be
<azonenberg> ah so you're saying we ignore duration and interpolate immediately. i see
<azonenberg> yes
<azonenberg> I think we ultimately will want a couple of different options we can select from
<d1b2> <louis> If we did this (bonus points for lighter coloring of the interpolated part) it would also solve my UX issue
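A hedged sketch of the hybrid behaviour louis is proposing here: hold each sample flat for its duration, then interpolate across any gap to the next sample. The Segment type and function name are illustrative only, not the PR's actual code:

```cpp
// Hypothetical sketch of the proposed rendering, as line segments in
// (time, value) space. Current behavior ignores durations and interpolates
// point-to-point; zero-hold holds for the duration and leaves gaps empty;
// this hybrid holds for the duration, then interpolates across the gap.
#include <cstdint>
#include <vector>

struct Segment { int64_t t0; float v0; int64_t t1; float v1; };

std::vector<Segment> HybridSegments(
    const std::vector<int64_t>& off,
    const std::vector<int64_t>& dur,
    const std::vector<float>& val)
{
    std::vector<Segment> segs;
    for(size_t i = 0; i < off.size(); i++)
    {
        int64_t end = off[i] + dur[i];
        segs.push_back({off[i], val[i], end, val[i]});          // flat hold
        if(i + 1 < off.size())
            segs.push_back({end, val[i], off[i+1], val[i+1]});  // interpolated gap
    }
    return segs;
}
```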
<azonenberg> for now, what you have is probably a good starting point. i'll look at the code and merge shortly
<azonenberg> before we do too much more on that i want to get it pulled into ngscopeclient
<azonenberg> I'm trying to avoid having to do too much UX work twice
<d1b2> <louis> roger
<azonenberg> essentially my thought is, at this point i think we're mostly in agreement that ngscopeclient is the future, but glscopeclient is now and people have to get their work done
<azonenberg> so fix any bugs in glscopeclient that are blocking people
<azonenberg> continue developing backend as normal
<d1b2> <louis> 👍
<azonenberg> but avoid adding new features to glscopeclient that we'll just have to port to ngscopeclient
<azonenberg> and have ngscopeclient be the focus of UI dev
<d1b2> <louis> yeah that PR satisfies my desire w/r/t glscopeclient. and i'll bikeshed the exact rendering in ngscope
<azonenberg> ok
<azonenberg> sounds like a plan
<azonenberg> i'll review and merge then get moving on getting the new renderer into ngscopeclient
<azonenberg> On that note, M asked me to prepare a brief for marketing people to help us come up with a new name
<azonenberg> since the new tool is a perfect opportunity to do the rebranding we've needed to do from day one :p
<azonenberg> So if you or anyone else wants to weigh in, dump ideas in the channel
Degi_ has joined #scopehal
Degi has quit [Ping timeout: 250 seconds]
Degi_ is now known as Degi
<_whitenotifier-7> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±9] https://github.com/glscopeclient/scopehal-apps/compare/e38ec20a0b19...7983c3fc0380
<_whitenotifier-7> [scopehal-apps] azonenberg 7983c3f - Added ToneMapAllWaveforms()
<azonenberg> So we definitely need to optimize our queue allocations at some point
<azonenberg> For example, moltenvk has a fairly small number of queues, and each one can do anything (they're drop-in equivalents)
<azonenberg> my recent nvidia card has three types of queue
<azonenberg> the first one (which we currently use for everything) has 16 queues, and can do anything
<azonenberg> the second has only 2 queues, and can only do host/device transfers. but we should use those for transfer as it's optimized for that (and it means we're not using queues that could be doing something else)
<azonenberg> and there's a third type with 8 queues, which can do anything but rendering
<azonenberg> (so we should be using those for a lot of the background processing)
<azonenberg> Right now we are using four queues that I don't have names for at the moment (I have to figure out where I allocate those)
<azonenberg> eight (one per thread) for filter graph execution
<azonenberg> one for actual rendering
<azonenberg> one for the rendering thread
<azonenberg> one for vkFFT that i think we can optimize out, as it's only used during initialization
<azonenberg> and one dedicated to host/device transfers
<azonenberg> So we're maxed out on my nvidia card and on some other cards we might even be running out
<azonenberg> But i think that's something to work on later
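For context, selecting a dedicated transfer-only queue family with the Vulkan C++ bindings might look roughly like the following. This is a sketch of the general technique, not scopehal's actual allocator:

```cpp
// Hedged sketch: find a queue family that can do transfers but not
// graphics or compute (the "2 transfer-only queues" case described above).
#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <optional>

std::optional<uint32_t> FindTransferOnlyFamily(vk::PhysicalDevice physDev)
{
    auto families = physDev.getQueueFamilyProperties();
    for(uint32_t i = 0; i < families.size(); i++)
    {
        auto flags = families[i].queueFlags;
        if( (flags & vk::QueueFlagBits::eTransfer) &&
            !(flags & vk::QueueFlagBits::eGraphics) &&
            !(flags & vk::QueueFlagBits::eCompute) )
            return i;   // dedicated transfer family
    }
    return std::nullopt;    // fall back to a general-purpose family
}
```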
bvernoux has joined #scopehal
<_whitenotifier-7> [scopehal-apps] bvernoux edited pull request #503: Fix GitHub CI Windows Build/Install issue (WIP) - https://github.com/glscopeclient/scopehal-apps/pull/503
<_whitenotifier-7> [scopehal-apps] bvernoux edited pull request #503: Fix GitHub CI Windows Build/Install issue (WIP) - https://github.com/glscopeclient/scopehal-apps/pull/503
massi has joined #scopehal
azonenberg has quit [Ping timeout: 260 seconds]
azonenberg has joined #scopehal
<_whitenotifier-7> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal/compare/6022d1756b94...7a135f408215
<_whitenotifier-7> [scopehal] azonenberg 7a135f4 - ComputePipeline: AddComputeMemoryBarrier is now static
<_whitenotifier-7> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±11] https://github.com/glscopeclient/scopehal-apps/compare/7983c3fc0380...62934e00b14b
<_whitenotifier-7> [scopehal-apps] azonenberg 62934e0 - Continued setup for renderer
massi has quit [Remote host closed the connection]
<_whitenotifier-7> [scopehal-apps] bvernoux opened issue #505: Windows GitHub CI MSI Build / PortableApps missing shaders waveform-compute.* - https://github.com/glscopeclient/scopehal-apps/issues/505
<_whitenotifier-7> [scopehal-apps] bvernoux commented on pull request #503: Fix GitHub CI Windows Build/Install issue (WIP) - https://github.com/glscopeclient/scopehal-apps/pull/503#issuecomment-1259765159
<_whitenotifier-7> [scopehal-apps] bvernoux closed pull request #503: Fix GitHub CI Windows Build/Install issue (WIP) - https://github.com/glscopeclient/scopehal-apps/pull/503
<azonenberg> Strange threading issue in the new renderer. vulkan validation complains that the queue for WaveformThread is being accessed by another thread simultaneously under certain conditions (like a window being resized)
<azonenberg> but i don't see how that could be possible because i declared the queue as a local variable within the thread method
<azonenberg> and i never pass it to anything that holds onto it
<azonenberg> so i can't imagine how the handle could ever end up in another thread
<azonenberg> adding a mutex to prevent updating the framebuffer while doing this seems to have solved it
<d1b2> <louis> Whoops, totally forgot to handle no int64 in my renderer patch. Will look at that this afternoon/evening
<azonenberg> so possibly i have to block all background processing on any queue when making a swapchain? if so this is not well documented
<bvernoux> Has anyone understood the weirdness of the CMake install for Windows, which does not want to copy all the shaders? Each time, the waveform-compute.* files are not copied
<azonenberg> (poke lain, since i think she wrote that code?)
<bvernoux> Why does this line
<bvernoux> not do the job?
<bvernoux> It is a simple directory copy of shaders which contains the famous waveform-compute.*
<azonenberg> louis: yeah that is needed to support a lot of older intel integrated cards off the top of my head
<azonenberg> i forget if the M1 gpu had native int64 or not
<_whitenotifier-7> [scopehal-apps] 602p synchronize pull request #504: Zero-hold rendering - https://github.com/glscopeclient/scopehal-apps/pull/504
<d1b2> <louis> OK, I think it's just that simple
<azonenberg> Will have a look and merge after work
<d1b2> <Mughees> @louis @azonenberg Is there a limit to the number of significant digits when outputting a float value? (fabs(uadin->m_samples[i]) * (float) din->m_timescale) / FS_PER_SECOND; evaluates to zero when output through the stats window. Graph view shows at least something... Can we increase it somehow? I am implementing an area-under-the-curve filter
<azonenberg> hmmm did you declare the units for the filter?
<azonenberg> it should be using Unit::PrettyPrint() iirc
<azonenberg> which will give you nice suffixes like ps/ns/fs etc
<d1b2> <Mughees> no..let me try
<azonenberg> if you're trying to output in volts or something i could see that being a problem
<azonenberg> also i'm not sure why you are dividing by fs per second
<azonenberg> the native output of the filter should be in fs
<azonenberg> then the unit should take care of converting that to whatever SI scale is appropriate
<azonenberg> yeah you're outputting in volts
<d1b2> <Mughees> yeah
<d1b2> <Mughees> should be Vs
<azonenberg> ... oh, yeah this is volt-seconds
<azonenberg> ok so you are starting to hit a problem with our existing units framework
<azonenberg> this is actually the thing that blocked me last time i looked at making an integration filter
<d1b2> <Mughees> hmmm
<d1b2> <Mughees> any quick expert suggestions 🙂
<azonenberg> It's an open problem at this point. We need a more general units framework that understands how to represent arbitrary algebraic combinations of existing units
<azonenberg> of base units*
<azonenberg> ping @louis as well
<d1b2> <Mughees> Can't we have something quick that simply displays values?
<azonenberg> I mean the quick and dirty hack is to extend the Unit class with a new volt-seconds unit
<azonenberg> and set your y axis to use that
<azonenberg> (this is separate from the issue about your value not showing up in the stat, i think)
<d1b2> <Mughees> but how would the zero problem be resolved?
<azonenberg> So that's a different problem
<d1b2> <Mughees> basically any value like 1e-11 outputs zero
<azonenberg> I would have to look at the code and/or poke at stuff in a debugger. I do know that there is some rounding involved, but in general the unit class should use SI suffixes when printing things
<azonenberg> and i think it knows down to femto
<azonenberg> so this *should* not be a problem
<azonenberg> It could also be a numerical stability issue somewhere in either the statistic or your filter implementation
<azonenberg> if you're summing a bunch of extremely small numbers that add to almost zero, fp32 intermediates may not be sufficient precision
<azonenberg> and/or you may have to use a more sophisticated summing algorithm that sorts the values and sums similar magnitude ones first etc
<d1b2> <Mughees> fp32 is between 1e-38 and 1e38
<azonenberg> yes but that's total range
<azonenberg> if you add say 1e-20 to 1.0 you get 1.0
<azonenberg> because the dynamic range is much smaller
<azonenberg> I'm not saying this is your problem but it's certainly something i'm suspicious of being a contributing factor
<d1b2> <Mughees> hmmmm....
<d1b2> <Mughees> will explore more
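A quick standalone demonstration of the absorption problem azonenberg describes: fp32 carries only about 7 significant decimal digits, so a tiny addend vanishes next to a large accumulator even though both values are individually well within range:

```cpp
#include <cstdio>

int main()
{
    float acc = 1.0f;
    acc += 1e-20f;          // far below the ULP of 1.0f; contributes nothing
    printf("%.9g\n", acc);  // prints 1, the 1e-20 was absorbed entirely
    return 0;
}
```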
Bird|otherbox has quit [Ping timeout: 248 seconds]
Bird|otherbox has joined #scopehal
<azonenberg> also random other interesting observation: i threw VTune at my LeCroy scope and it seems that at least the FFT block *does* use AVX
<azonenberg> But e.g. the SDA package seems to only use SSE
<d1b2> <louis> @Mughees if you crank up the magnitude of the sine wave and crank down the sample rate does it stop returning zero? That would indicate it's a very-small-float problem to me
<d1b2> <Mughees> yeah that is exactly what happens
<azonenberg> ok yeah so you have a numerical stability problem. Quick and dirty fix for smaller datasets is to do the integration in a fp64 temporary but that will still fail with larger datasets
<azonenberg> The proper option is to use a more numerically optimized summation algorithm
<azonenberg> https://en.wikipedia.org/wiki/Kahan_summation_algorithm may be worth looking at, although we'd have to make sure our optimization settings on the compiler didn't cause problems (at one point i think we had -ffast-math enabled)
<azonenberg> i forget if we still do
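For reference, the compensated summation from the linked article looks like this. Note that aggressive flags such as -ffast-math can legally delete the compensation term, which is exactly the concern raised above:

```cpp
// Minimal Kahan (compensated) summation sketch.
#include <vector>

float KahanSum(const std::vector<float>& x)
{
    float sum = 0.0f;
    float c   = 0.0f;           // running compensation for lost low-order bits
    for(float v : x)
    {
        float y = v - c;        // corrected addend
        float t = sum + y;      // low bits of y are lost here...
        c = (t - sum) - y;      // ...and recovered into c for the next pass
        sum = t;
    }
    return sum;
}
```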
<d1b2> <Mughees> ok.
<d1b2> <Mughees> in a tektronix demo video i saw that the area unit was in terms of nano units
<d1b2> <Mughees> seems like it was automatically judging the sampling rate and assigning an appropriate unit based on that
<d1b2> <Mughees> like nano volt-seconds
<azonenberg> So you have two different problems
<azonenberg> the first is numerical stability causing your result to be zero
<azonenberg> second is the unit issue
<azonenberg> For the near term, if we assume that you are only using this on inputs of type volt, we can just create a volt-second unit and have the units library automatically add SI scaling factors as needed
<azonenberg> it's not ideal but will get us something usable near term
<azonenberg> But you still need a more stable summation algorithm to have something to print
<d1b2> <Mughees> ok
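To illustrate the quick fix being discussed: scopehal's Unit::PrettyPrint() picks an SI prefix automatically, so a volt-second unit mostly needs a suffix plus the usual scaling. A standalone sketch of that prefix-picking behaviour (an assumption-laden toy, not scopehal's actual implementation):

```cpp
// Toy version of SI-prefix pretty printing for a hypothetical
// volt-second unit: pick a prefix so the mantissa lands in [1, 1000).
#include <cmath>
#include <cstdio>
#include <string>

std::string PrettyPrintVoltSeconds(double v)
{
    static const char* prefixes[] = {"f", "p", "n", "µ", "m", ""};
    int idx = 0;
    double scale = 1e15;                    // start at femto
    while(idx < 5 && std::fabs(v) * scale >= 1000)
    {
        scale /= 1000;
        idx++;
    }
    char buf[64];
    snprintf(buf, sizeof(buf), "%.3g %sVs", v * scale, prefixes[idx]);
    return buf;
}

int main()
{
    printf("%s\n", PrettyPrintVoltSeconds(1.5e-9).c_str());  // "1.5 nVs"
    return 0;
}
```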
<d1b2> <louis> I think mughees is saying that you could determine (somehow) the scale and then produce an output where 1.0f is 1 nano-volt-second instead of 1 volt-second
<d1b2> <louis> Might be something to think about when we do some hacking for algebraic units. We already have this with mV/V I think
<azonenberg> louis: mV is a hack for Y axis bathtub curves
<azonenberg> we need to solve the algebraic unit issue first
<azonenberg> for display scale, i dont see the need to worry about it, fp32 has plenty of range to use SI base units for the most part
<azonenberg> our problem is numerical stability not lack of range
bvernoux has quit [Quit: Leaving]
* lain begins attacking #685
<azonenberg> lain: one thing we should probably do, i forget if i mentioned in the issue comments
<azonenberg> is make use of other queue types
<lain> ah yeah
<azonenberg> right now we use the first compatible type for all of our queues
<azonenberg> my nvidia card has 16x general purpose/rendering queues that can do anything
<azonenberg> we currently allocate all 16 of them although i think i can optimize out 1-2 at least
<lain> I will keep that in mind, I read what you said about that earlier in the backlog
<azonenberg> then the 2 transfer-only queues and the 8 compute-focused ones go unused
<azonenberg> so e.g. g_vkTransferQueue should be put in a transfer-only queue type if one exists
<azonenberg> It's not a trivial problem given the range of GPUs out there and the different types of queues available
<lain> should we make an issue for it?
<azonenberg> i consider all of that part of 685
<azonenberg> but basically we need to make a single unified allocator
<lain> alright
<azonenberg> that understands all of the queue types available
<azonenberg> and when we request a new queue will decide the right type to use
<azonenberg> and it needs to understand how many we've already allocated from each family and such
<azonenberg> orthogonally we also probably want to optimize our code to not overuse queues
<azonenberg> how many queues does moltenvk provide?
<azonenberg> if you can pastebin a log of glscopeclient or ngscopeclient initializing on your mac with all of the gpu info that would help inform discussions
<lain> kk
<d1b2> <zyp> I think that depends on the gpu
<d1b2> <zyp> fwiw here's vulkaninfo from my hackintosh: https://paste.jvnv.net/view/ROzdl
<d1b2> <zyp> and my macbook: https://paste.jvnv.net/view/stceb
<azonenberg> ok so the first one has a total of four queues
<d1b2> <zyp> looks like both do
<azonenberg> as does the second
<azonenberg> So we definitely need to do some work to squeeze into less queues. this may mean adding mutexing or something to certain threads if we are limited on queues
<azonenberg> since a queue cannot be used from two threads concurrently
<azonenberg> Right now we have one thread for the backend waveform processing and rendering, one for the GUI, one for transfers, eight for the filter graph...
<azonenberg> we're using a total of 16 queues now
<azonenberg> one can definitely be optimized out, and we can cut the filter graph down from 8 threads to fewer
<azonenberg> four of the others i don't know where they came from and need to investigate
<d1b2> <zyp> by transfers you mean between regular and GPU memory?
<d1b2> <zyp> on the macbook with integrated graphics, there's no distinction there and hence no need to transfer
<azonenberg> Correct. so we can likely optimize out that queue entirely
<azonenberg> however there are a few cases where we use that queue for memory barriers too
<azonenberg> so we'd have to move those elsewhere
* lain currently on a side quest to fix catch2 on macos, since brew defaults to pulling in the latest stable (3.1.0), which has a breaking change: you have to include <catch2/catch_all.hpp> instead of <catch2/catch.hpp>
<lain> how do I see what variables are defined within cmake for a given package? I just want to add a -D to the CXX args so I can ifdef the includes appropriately
<azonenberg> re side quest, can we identify the version and change the include path or something?
<lain> that's what I'm doing :3
<azonenberg> you can look at cmakecache.txt to see all of the intermediate values
<lain> ah ok
<lain> I see "set(PACKAGE_VERSION "3.1.0")" in Catch2ConfigVersion.cmake, can I use that immediately after the find_package(Catch2 REQUIRED) call to check the catch2 package version?
<azonenberg> yes i think so
<lain> alrighty
<azonenberg> also lol my laptop's intel integrated card only has a single queue total
<azonenberg> that will be fun to deal with
<lain> ha
<lain> ok yeah, it becomes Catch2_VERSION after you include(Catch)
<lain> some day I'll understand cmake
<lain> ok, the catch2 v3.x fix is in my macos branch for now
<_whitenotifier-7> [scopehal] lainy commented on issue #685: Improve Vulkan queue allocator - https://github.com/glscopeclient/scopehal/issues/685#issuecomment-1260103490
<_whitenotifier-7> [scopehal-apps] azonenberg labeled issue #506: Figure out how to function on systems with relatively few Vulkan queues - https://github.com/glscopeclient/scopehal-apps/issues/506
<_whitenotifier-7> [scopehal-apps] azonenberg opened issue #506: Figure out how to function on systems with relatively few Vulkan queues - https://github.com/glscopeclient/scopehal-apps/issues/506
<azonenberg> lain: opened separate issue 506 for the "handle using only a handful of queues" scenario
<azonenberg> there are 1-2 queues i think i can optimize out of existence easily
<azonenberg> we can trivially single thread the filter graph (although for extra fun, we can multithread evaluation but use a single vulkan queue when doing that)
<azonenberg> we will probably need to add some kind of class/method that does conditional mutexing if two threads have collisions on the same queue ID
<azonenberg> i have ideas but haven't implemented any of it yet
<_whitenotifier-7> [scopehal] azonenberg assigned issue #681: AcceleratorBuffer: better unified memory support - https://github.com/glscopeclient/scopehal/issues/681
<azonenberg> lain: i also assigned the unified memory ticket to you since you've got access to a platform with unified memory and i don't
<azonenberg> and while it's not a first line "won't work at all on M1" portability issue, it would be a serious performance and memory usage impact
* monochroma wonders if unified memory optimization will help with low-end intel/amd integrated GPUs
<lain> likely
<lain> azonenberg: how do you feel about a global instance of a class that hands out queue handles and has a Dispatch() method which, if necessary, grabs the mutex for the given queue before actually dispatching
<lain> s/Dispatch/Submit/
<azonenberg> Hmmmm i'm not thrilled with that
<azonenberg> better idea: have a queue allocator class that hands out queue wrapper objects
<lain> eyy that was my next suggestion
<azonenberg> more specifically it gives out shared_ptr<QueueHandle> objects, that may have already been given out to someone else
<lain> yep, exactly
<azonenberg> a QueueHandle contains a vk::raii::Queue and a mutex
<lain> and that would provide the Dispatch method
<azonenberg> well it would be a submit method that takes a command buffer as an argument
<azonenberg> the one potential wrinkle is how we'd handle non-blocking submit operations
<lain> er yes, I keep swapping Submit<->Dispatch in my head for some reason
<azonenberg> i.e. if you don't block on the fence immediately after submitting
<azonenberg> you don't know when to release the mutex
<lain> hmm true
<azonenberg> I guess we can hold onto the fence
<azonenberg> and have the next submit call wait on it
<azonenberg> and have an atomic<bool> or something you can use to check
<azonenberg> idk, think about it
<lain> k
<azonenberg> i want to support nonblocking workflows
<azonenberg> having longer lived fences is probably a good thing wrt performance
<lain> I like that, I'll give it some more thought and start implementing it
<azonenberg> right now we create/destroy fences far more often than we need to
<azonenberg> note that if you are reusing a fence i think it has to be reset somehow
<azonenberg> check the api
<azonenberg> right now we destroy ours after a single use
<lain> will do
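Pulling the above together, a hedged sketch of the QueueHandle idea: a mutex-guarded wrapper around a vk::raii::Queue with a reusable fence, intended to be handed out as shared_ptr<QueueHandle> by the allocator. All names are illustrative and the nonblocking path is simplified; not the final scopehal API:

```cpp
#include <vulkan/vulkan_raii.hpp>
#include <cstdint>
#include <memory>
#include <mutex>

class QueueHandle
{
public:
    QueueHandle(vk::raii::Device& device, vk::raii::Queue&& queue)
        : m_device(device)
        , m_queue(std::move(queue))
        , m_fence(device, vk::FenceCreateInfo{})
        , m_pending(false)
    {}

    // Serializes access to the queue; waits out any previous nonblocking
    // submission before reusing the fence (fences must be reset for reuse).
    void Submit(vk::CommandBuffer cmdBuf)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if(m_pending)
        {
            (void)m_device.waitForFences({*m_fence}, VK_TRUE, UINT64_MAX);
            m_device.resetFences({*m_fence});
        }
        vk::SubmitInfo info({}, {}, cmdBuf);
        m_queue.submit(info, *m_fence);
        m_pending = true;
    }

protected:
    vk::raii::Device& m_device;
    vk::raii::Queue m_queue;
    vk::raii::Fence m_fence;    // long-lived fence, reset between submits
    std::mutex m_mutex;
    bool m_pending;
};
```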
<azonenberg> meanwhile i'm working on refactoring the fft code to not need a dedicated queue
<azonenberg> so that will relieve some pressure on limited queues
<lain> nice
<_whitenotifier-7> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±3] https://github.com/glscopeclient/scopehal/compare/7a135f408215...4d90bd6546b5
<_whitenotifier-7> [scopehal] azonenberg 4d90bd6 - Refactored vkFFT wrapper code to not need a dedicated queue
<azonenberg> lain: other ideas: the allocator object should keep track of how many users of a given queue slot there are somehow
<azonenberg> maybe instead of giving out shared_ptr's give out unique_ptr's and have them refcount the underlying vulkan object, idk
<azonenberg> but basically, if you can avoid locking a mutex because you know a priori that there's no other users of that queue, it's faster
<azonenberg> i don't know if that can be done in a thread-safe fashion faster than locking a mutex
<azonenberg> the other thing is, there should be a free list so that if we create a queue to do some background processing and destroy it, that queue becomes free and is eligible for reallocation
<azonenberg> basically, we should try to overlap multiple users in the same physical queue as little as possible
<azonenberg> to reduce contention
<lain> agree
<lain> I was going to suggest handing out shared_ptr's and keeping a weak_ptr, but I'm not sure making that thread safe is any faster than just always grabbing the mutex
<azonenberg> anyway, build it and we can make the implementation fast later
<azonenberg> the API should be the same
<lain> yep
<azonenberg> If it indeed ends up being a bottleneck
<azonenberg> optimally, you want to do as much work as possible in a single submit operation
<azonenberg> so fill up as much as you can in a single commandbuffer
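As a closing illustration of that point, recording several compute dispatches into one command buffer means a single submit covers them all. A sketch with hypothetical types, not scopehal code:

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <vector>

struct ComputePass { vk::Pipeline pipeline; uint32_t groupsX; };

// Record all passes into one command buffer: one Submit() for the batch,
// minimizing queue traffic and mutex contention.
void RecordBatch(vk::CommandBuffer cmdBuf, const std::vector<ComputePass>& passes)
{
    cmdBuf.begin(vk::CommandBufferBeginInfo{});
    for(const auto& pass : passes)
    {
        cmdBuf.bindPipeline(vk::PipelineBindPoint::eCompute, pass.pipeline);
        cmdBuf.dispatch(pass.groupsX, 1, 1);
        // (memory barriers between dependent passes omitted for brevity)
    }
    cmdBuf.end();
}
```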