azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/glscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
<_whitenotifier-9> [scopehal-apps] azonenberg pushed 3 commits to master [+0/-0/±4] https://github.com/glscopeclient/scopehal-apps/compare/9689bca00264...eb2ac09aff3f
<_whitenotifier-9> [scopehal-apps] azonenberg 27d491d - Fixed LoadState being created with invalid size
<_whitenotifier-9> [scopehal-apps] azonenberg 360096d - RightJustifiedText: fixed use of incorrect API for getting table column width
<_whitenotifier-9> [scopehal-apps] azonenberg eb2ac09 - FilterGraphEditor: added dynamic node and column size based on port text length
Degi_ has joined #scopehal
Degi has quit [Ping timeout: 248 seconds]
Degi_ is now known as Degi
<_whitenotifier-9> [scopehal-apps] vonnieda forked the repository - https://github.com/vonnieda
<_whitenotifier-9> [scopehal] vonnieda forked the repository - https://github.com/vonnieda
<azonenberg> ooooook so, I'm chasing a very interesting and annoying crash
<azonenberg> It happens when i'm using ngscopeclient to do this power supply testing, have not yet seen it anywhere else
<azonenberg> it happens often enough to be bothersome, but not often enough that it's easy to reproduce on cue
<azonenberg> the only way i know to trigger it is to keep building out my test setup and hope it strikes
<azonenberg> The actual segfault is in vk::raii::CommandBuffer::Dispatch()
<azonenberg> and it's always while tone mapping a waveform
<azonenberg> Having multiple different instruments, e.g. power supply and multimeter, and doing stuff in the filter graph editor appear to be contributing factors
<azonenberg> but I do not yet know for certain if they're necessary or only make it more likely
<azonenberg> It may or may not be related to a deadlock i've been seeing here and there but have not yet managed to find the origins of as it's never struck while i had gdb attached
<azonenberg> (braindump incoming)
<azonenberg> Connecting to load, power supply, multimeter: no issues
<azonenberg> Calculate power into load and trend it: no issues
<azonenberg> adjust vertical scale of plot: VkDeviceLostError
<azonenberg> that's a new one
<azonenberg> But it's still while tone mapping
<azonenberg> (this time i didn't have the validation layer attached... possibly related?)
<azonenberg> [0] 0x555556bfeef0, type: 4, name: WaveformThread.queue SYNC-HAZARD-WRITE-RACING-READ(ERROR / SPEC): msgNum: -860391127 - Validation Error: [ SYNC-HAZARD-WRITE-RACING-READ ] Object 0: handle = 0x555556bfeef0, name = WaveformThread.queue, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0xccb77929 | vkQueueSubmit: Hazard WRITE_RACING_READ for entry 0, VkCommandBuffer 0x7ffef8002070[WaveformThread.cmdbuf], Recorded access info
<azonenberg> d_usage: SYNC_COMPUTE_SHADER_SHADER_STORAGE_WRITE, command: vkCmdDispatch, seq_no: 1, reset_no: 3395). Access info (prior_usage: SYNC_COMPUTE_SHADER_SHADER_STORAGE_READ, read_barriers: VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT|VK_PIPELINE_STAGE_2_LATE_FRAGMENT_TESTS_BIT|VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT|VK_PIPELINE_STAGE_2_BOTTOM_OF_PIPE_BIT, queue: VkQueue 0x555557d78d30[g_mainWindow.render], submit: 12854, batch:
<azonenberg> _tag: 147804, command: vkCmdDispatch, seq_no: 1, command_buffer: VkCommandBuffer 0x555558dc1500[MainWindow.m_cmdBuffer], reset_no: 3387).
<azonenberg> hmmm
<azonenberg> i enabled QueueSubmit synchronization validation and it's finding this
<azonenberg> I thought it smelled racy
<azonenberg> So if i'm understanding this right, WaveformThread.queue is having WaveformThread.cmdbuf submitted to it
<azonenberg> and this is somehow racing g_mainWindow.render having MainWindow.m_cmdBuffer submitted to it
<azonenberg> More specifically, it appears that ToneMapAllWaveforms is trying to read from a memory buffer (i.e. the rasterized waveform density plot) that WaveformThread is still writing to (presumably in RenderAllWaveforms())
<azonenberg> I guess the easy solution is a single mutex to interlock access to the rasterized waveforms between those threads
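A minimal sketch of the "single mutex" idea, assuming a hypothetical `SharedRaster` class that stands in for the rasterized density plot; none of these names come from ngscopeclient. It only shows the host-side locking pattern. In a real Vulkan program the lock would have to be held until the GPU work touching the buffer has completed (for example by waiting on a fence), or the two queues would need a semaphore between them, because a host mutex by itself does not order GPU execution.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a rasterized waveform buffer shared by two threads.
class SharedRaster
{
public:
    explicit SharedRaster(std::size_t npixels) : m_pixels(npixels, 0.0f)
    {
        assert(npixels > 0);
    }

    // Writer side (the role the waveform rendering thread plays):
    // overwrite the whole raster while holding the lock.
    void Render(float value)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        for(auto& p : m_pixels)
            p = value;
    }

    // Reader side (the role tone mapping plays):
    // copy out a consistent snapshot while holding the same lock.
    std::vector<float> Snapshot() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_pixels;
    }

private:
    mutable std::mutex m_mutex;
    std::vector<float> m_pixels;
};
```

Because both paths take the same mutex, a reader can never observe a half-written raster; the cost is that rendering and tone mapping of that buffer are serialized.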
<azonenberg> Well, good find there lol. I just upgraded from the .224 to the .239 vulkan SDK and this option didn't use to be there
<azonenberg> (it's now listed as alpha)
<azonenberg> wonder how many new bugs will be found when new validation checks come out lol
<azonenberg> aaaand that bug is fixed but now there's more behind it lol
<azonenberg> yay
<azonenberg> new bug is g_mainWindow.render writing to a shader in m_cmdBuffer concurrently with vkCmdDrawIndexed in g_mainWindow.render
<azonenberg> sorry writing to a storage buffer from a shader
<azonenberg> i think we're about to get a lot more sync stuff added :p
<azonenberg> Ok so something related to the context menu, in particular context menu extending in a popup outside the main window, is related to this...
<_whitenotifier-9> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±4] https://github.com/glscopeclient/scopehal-apps/compare/eb2ac09aff3f...db6a059037bf
<_whitenotifier-9> [scopehal-apps] azonenberg db6a059 - Fixed several race conditions in rendering
<azonenberg> ok there's still at least one more bug causing a segfault when dragging waveform areas to a new plot
<azonenberg> but this is good progress
<azonenberg> And I still need to work on the drag-and-drop stuff so that I can have a "drag to new plot in existing waveform group" option
<azonenberg> Which does not currently exist
bvernoux has joined #scopehal
<tnt> Interestingly it seems to work better on my unsupported haswell now. It's still glitchy when there is no waveform displayed, but as soon as something is shown, it's fine. Also I get "i915 0000:00:02.0: GPU HANG" if I zoom out too much, but looking at issues, it might be a general issue with intel even the supported ones.
<whitequark> it's surprisingly easy to hang an intel gpu with scopeclient
<whitequark> it's not my domain of expertise but that feels kind of concerning? like, why should any normal graphical application be able to mount a DoS on your desktop?..
<tnt> Ah well here at least only scopehal locks up for a small time and then quits, the rest of the desktop is fine.
<_whitenotifier-9> [scopehal-apps] smunaut commented on issue #325: GPU hang on iris Plus driver - https://github.com/glscopeclient/scopehal-apps/issues/325#issuecomment-1465750493
<monochroma> i think we have had reports from intel/nvidia/amd GPU owners hanging/crashing, so my take has been, the GPU drivers and firmware are likely an awful ratsnest of vulns
<t4nk_freenode> I'm staying well clear of the software, whole system crashes upon a single try
<t4nk_freenode> ;)
<t4nk_freenode> don't have the knowledge to easily go back to where it did work for me either, unfortunately
<t4nk_freenode> made some strides on other projects though ;)
<azonenberg> tnt: So yeah i think some of these race conditions are the source of some of your crashes
<azonenberg> the i915 gpu hang i suspect might be the result of an overly aggressive timeout if you have a deep waveform and zoom out too far
<azonenberg> i.e. the rendering shader takes some tens of ms and the driver doesn't want to wait that long
<azonenberg> i'm not sure what the proper fix for that would be
<azonenberg> any HPC application is going to be capable of processing "too much" data for a hardcoded timeout
<azonenberg> i mean i guess it could be possible to break up the rendering shader into several smaller invocations so the driver doesn't choke on it
<azonenberg> but then you risk losing efficiency on beefier platforms like nvidia
<whitequark> I think you need a WCET for the shaders yeah
<miek> huh, that's a new one: "/home/mike/repos/scopehal-apps/lib/scopehal/FlowGraphNode.cpp:324:1: internal compiler error: Segmentation fault". haven't managed to reproduce it though
<azonenberg> yeah i just don't know how to do that because most of them are memory bound
<azonenberg> and trying to predict how long it will take to do a pass over X MB of data with unknown cache hotness etc is not trivial :p
<whitequark> maybe adjust the work size based on actual measured wall clock time?
<azonenberg> that assumes the first shader invocation didn't time out
<azonenberg> but maybe?
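One way the "adjust the work size based on measured wall clock time" suggestion could look, as a sketch only. `NextChunkSize` and all of its parameters are hypothetical, and nothing here is taken from the scopehal code. The idea is to scale the next chunk by the ratio of the time budget to the measured time, with growth limited to 2x per step and the result clamped to a fixed range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Choose the next chunk size so that one chunk is expected to take roughly
// targetMs, given that the previous chunk of lastChunk items took measuredMs.
std::size_t NextChunkSize(
    std::size_t lastChunk,
    double measuredMs,
    double targetMs,
    std::size_t minChunk,
    std::size_t maxChunk)
{
    assert(lastChunk > 0 && targetMs > 0 && minChunk > 0 && minChunk <= maxChunk);

    // No usable measurement: grow cautiously
    if(measuredMs <= 0)
        return std::min(maxChunk, lastChunk * 2);

    // Linear scaling assumes run time is roughly proportional to chunk size
    double scaled = static_cast<double>(lastChunk) * (targetMs / measuredMs);

    // Never more than double per step, so one optimistic sample cannot overshoot badly
    scaled = std::min(scaled, static_cast<double>(lastChunk) * 2.0);

    // Keep the result inside the allowed range
    scaled = std::max(scaled, static_cast<double>(minChunk));
    scaled = std::min(scaled, static_cast<double>(maxChunk));
    return static_cast<std::size_t>(scaled);
}
```

Starting from `minChunk` on the first invocation is one way to address the concern that the very first dispatch might already time out, at the price of a few slow frames while the size ramps up.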
<azonenberg> So, the way the actual rendering shader works is basically a software rasterizer in compute shader optimized for waveforms
<azonenberg> e.g. in most cases we don't waste memory BW on X axis point values because the sample rate is uniform, so it's cheaper to just multiply the sample index by the sample period to get the timestamp
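For a uniformly sampled waveform the timestamp computation described above reduces to one multiply and one add. A small sketch, where `SampleTime`, the integer time unit, and the `triggerOffset` parameter are illustrative assumptions rather than the library's API:

```cpp
#include <cstddef>
#include <cstdint>

// Time of sample `index` in a uniformly sampled waveform.
// period is the spacing between samples (1 / sample rate) in some integer
// time unit; triggerOffset is the time of sample 0 in the same unit.
constexpr int64_t SampleTime(std::size_t index, int64_t period, int64_t triggerOffset = 0)
{
    return triggerOffset + static_cast<int64_t>(index) * period;
}
```

With this, no per-sample X array has to be read from memory; only the Y values are streamed.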
<azonenberg> the actual shader runs one thread block per column of pixels
<azonenberg> since vertical lines are impossible, we basically run a slightly degenerate parallelized Bresenham to find the start/end Y values of the line segment between each pair of samples
<azonenberg> then alpha blend (in fast shared memory to avoid hitting GDDR) these 1 pixel wide x N pixel high rectangles to get a density map of the entire column of pixels
<azonenberg> and finally write to GDDR at the end
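A simplified CPU model of the per-column approach described above, to make the data flow concrete. This is not the actual shader: `RasterizeColumns` is a hypothetical name, samples are assumed to be pre-scaled to integer pixel rows, each column is assumed to hold a fixed number of samples, and "alpha blending" is reduced to a saturating add. The outer loop plays the role of one thread block per pixel column, and the local `column` buffer plays the role of shared memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a width*height density image in column-major order
// (index = col * height + row).
std::vector<float> RasterizeColumns(
    const std::vector<int>& ypix,      // sample values, already in pixel rows
    std::size_t samplesPerColumn,      // uniform X spacing
    std::size_t width,
    std::size_t height,
    float alpha)
{
    assert(samplesPerColumn > 0 && height > 0);
    std::vector<float> image(width * height, 0.0f);
    const int maxRow = static_cast<int>(height) - 1;

    for(std::size_t col = 0; col < width; col++)
    {
        // Per-column accumulator, standing in for fast shared memory
        std::vector<float> column(height, 0.0f);

        const std::size_t first = col * samplesPerColumn;
        for(std::size_t i = first; i < first + samplesPerColumn && i + 1 < ypix.size(); i++)
        {
            // Vertical extent of the segment between samples i and i+1
            int lo = std::min(ypix[i], ypix[i + 1]);
            int hi = std::max(ypix[i], ypix[i + 1]);
            if(hi < 0 || lo > maxRow)
                continue;                       // entirely off screen
            lo = std::max(lo, 0);
            hi = std::min(hi, maxRow);

            // Blend a 1-pixel-wide vertical strip into the accumulator
            for(int row = lo; row <= hi; row++)
                column[row] = std::min(1.0f, column[row] + alpha);
        }

        // One write-out per column, analogous to the final store to GDDR
        std::copy(column.begin(), column.end(),
                  image.begin() + static_cast<std::ptrdiff_t>(col * height));
    }
    return image;
}
```

Rows crossed by many segments accumulate more intensity, which is what produces the density map; keeping the accumulation in a small local buffer means global memory is written once per column.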
<azonenberg> What I might do is add a preference setting to cap this so that the shader stops after some user-specified max number of samples per column of pixels
<azonenberg> which will result in loss of information if you have say 50K samples per pixel and the first 5K are rendered, then you have a glitch in the rest
<azonenberg> but if you have a slower GPU it beats hangs
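The proposed cap can be sketched as a small helper that decides which samples a column will actually draw. `ColumnRange`, `CappedColumnRange`, and the `maxSamplesPerColumn` setting are hypothetical names for illustration; the log does not say how the preference would be implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct ColumnRange
{
    std::size_t first;      // index of the first sample drawn in this column
    std::size_t count;      // number of samples drawn
    bool truncated;         // true if samples were skipped because of the cap
};

// Limit the number of samples processed for one pixel column.
ColumnRange CappedColumnRange(
    std::size_t col,
    std::size_t samplesPerColumn,
    std::size_t maxSamplesPerColumn)
{
    assert(maxSamplesPerColumn > 0);
    ColumnRange r;
    r.first = col * samplesPerColumn;
    r.count = std::min(samplesPerColumn, maxSamplesPerColumn);
    r.truncated = samplesPerColumn > maxSamplesPerColumn;
    return r;
}
```

With 50K samples per pixel and a cap of 5K, each column draws only its first 5K samples, so a glitch in the remaining 45K would not be shown; the `truncated` flag gives the caller a way to warn about that.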
<azonenberg> this algorithm is massively faster than tessellating to triangles in software then feeding a giant batch of polygons to the normal geometry transformation pipeline, and also delivers nicer looking results in most cases
<azonenberg> but has the tradeoff that it requires a potentially slow shader invocation, vs having the slow part of drawing be done by the hardware rasterizer in the GPU which probably has different timeouts since the driver knows how many triangles are being drawn etc
<azonenberg> whitequark: also is it bad that any time i see someone say WCET I read it as WCEE?
bvernoux1 has joined #scopehal
bvernoux is now known as Guest622
bvernoux1 is now known as bvernoux
Guest622 has quit [Ping timeout: 250 seconds]
bvernoux has quit [Read error: Connection reset by peer]
fridtjof[m] has quit [Quit: You have been kicked for being idle]
bvernoux has joined #scopehal
bvernoux has quit [Quit: Leaving]
bvernoux has joined #scopehal