<_whitenotifier>
[scopehal] azonenberg 1b19016 - Set output clock frequency at startup. Deleted comment referencing invalid data file we've now fixed server side
<_whitenotifier>
[scopehal-apps] azonenberg ec15bd8 - Updated with patches from upstream imgui-node-editor. Fixes #572.
<_whitenotifier>
[scopehal-apps] azonenberg closed issue #572: Filter graph editor window disappears when a tooltip is being displayed if the graph editor window is not docked - https://github.com/glscopeclient/scopehal-apps/issues/572
<azonenberg>
So we do some weird draw list things in the filter graph editor that i think may be confusing it
<azonenberg>
it's throwing an assertion using the upstream version of imgui-node-editor
<azonenberg>
But if i comment out the assert, it draws correctly without any of the weird flickering we've been having even if undocked
<azonenberg>
so i merged those updated patches
<azonenberg>
we'll run this build until there's a more complete upstream fix
<d1b2>
<246tnt> Filter graph works fine indeed.
<d1b2>
<246tnt> This is the render one zoom click before my GPU hangs: https://i.imgur.com/4drAVze.jpg (still doing 60 fps and ~10 ms of actual work per frame). The GPU top also shows only about 50% usage and it's only running at one third to one half of its max frequency.
<azonenberg>
Definitely a bug somewhere but hard to diagnose these things
<azonenberg>
if i can get an intel gpu exhibiting the problem in my lab, maybe i can try to troubleshoot it
<d1b2>
<246tnt> So you have Intel GPUs that don't exhibit the issue?
<azonenberg>
What are exact steps to reproduce? is it simply opening the demo at certain zooms?
<tnt>
Yeah, just bring it to how it looks in that screenshot ... then zoom out one more time and it will show a few frames and then hang.
<tnt>
(No need for the eye thing, I was just trying to load the GPU a bit more for it to be a real worst case)
<azonenberg>
The eye doesn't load the GPU much if at all, it's currently not GPU accelerated
<azonenberg>
it's software integrated into a texture and then the only thing the gpu does is draw a textured polygon
<azonenberg>
This is almost certainly an issue related to the waveform rendering compute shader
<azonenberg>
the only question is if it's a mesa/intel gpu driver bug or a shader bug in ngscopeclient
<azonenberg>
Or perhaps both
<azonenberg>
tnt: ok so i was actually able to reproduce it zooming out far enough on my "Intel(R) UHD Graphics (CML GT2)"
<d1b2>
<246tnt> Trying to find out if those numbers mean anything ...
<d1b2>
<246tnt> are they the same for you ?
<d1b2>
<246tnt> 12:1:85dcfdfb
<azonenberg>
what numbers?
<d1b2>
<david.rysk> for some reason it is not linking with /usr/lib/libSPIRV-Tools-opt.so for me, so I get link errors
<d1b2>
<david.rysk> did spirv-tools split up the libs upstream in newer versions perhaps?
<d1b2>
<246tnt> @azonenberg in the dmesg error that pops up when it happens ?
<azonenberg>
oh in dmesg not stdout
<azonenberg>
gimme a sec
<azonenberg>
um i'm actually getting spammed by kernel soft lock errors i need to troubleshoot too lol
<d1b2>
<246tnt> Ok, the 12 for me is the GPU generation the 1 means I915_ENGINE_CLASS_RENDER
<azonenberg>
i'm updating my OS anyway will try on that box after a reboot
<azonenberg>
And see what happens once i've upgraded to bookworm and whether i still see the hang (and if ngscopeclient breaks in new interesting ways or what)
<d1b2>
<246tnt> And the last number is also not really providing much info: "Generate a semi-unique error code. The code is not meant to have meaning, The code's only purpose is to try to prevent false duplicated bug reports by grossly estimating a GPU error state."
<azonenberg>
lol
<azonenberg>
I mean you can try commenting out random parts of the rendering shader and see if it goes away :p
<azonenberg>
honestly i'm about out of other ideas
<d1b2>
<246tnt> So the only info that provides is that the bug didn't happen in the BLITTER or VIDEO or VIDEOENHANCE part of the GPU ... not really a surprise.
<azonenberg>
actually what might be useful is if you were to log arguments to the shader invocation that hung
<azonenberg>
x/y/z sizes and such
<d1b2>
<246tnt> waveform-compute.glsl ?
<azonenberg>
Yep
<azonenberg>
start by logging arguments
<azonenberg>
see if there are any clues there, maybe some upstream code is giving it bogus array sizes or something
<azonenberg>
i cant imagine it would only do that on intel igpu
<azonenberg>
but maybe intel is more picky than nvidia or something idk
<d1b2>
<246tnt> But that ends up cast to uint32_t
<azonenberg>
So, inside the shader
<azonenberg>
offset_samples is unsigned since it's literally a sample index
<azonenberg>
it should never be negative
<azonenberg>
we probably need to clamp that host side then
<azonenberg>
i'm in the middle of updating the OS on my intel gpu box (work laptop) but once it's back up will try developing a fix
<azonenberg>
Probably just a quick if(offset_samples < 0) wdata->m_config.offset_samples = 0
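A minimal sketch of the host-side clamp suggested here, assuming the offset is computed as a signed 64-bit value before being narrowed to the `uint32_t` the shader receives. The helper name `ClampOffsetSamples` and the `int64_t` input type are hypothetical and not taken from scopehal.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from scopehal): clamp a signed sample offset into
// the range of uint32_t before narrowing it for the shader.
inline uint32_t ClampOffsetSamples(int64_t offset_samples)
{
    // Assumed upper limit: the largest value a uint32_t can hold
    const int64_t maxval = UINT32_MAX;

    // Without this check, a negative offset wraps to a huge unsigned value
    // when cast, e.g. -1 becomes 4294967295.
    if(offset_samples < 0)
        return 0;
    if(offset_samples > maxval)
        return UINT32_MAX;
    return static_cast<uint32_t>(offset_samples);
}
```

With a clamp like this in place, a view that starts before the first sample hands the shader an offset of 0 instead of a wrapped value near 2^32.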
<d1b2>
<246tnt> yeah in the shader there is also: uint iend = uint(floor((gl_GlobalInvocationID.x + 1) / xscale)) + offset_samples; if(iend <= 0) g_done = true; But that's a <= 0 test on a uint 😅 (yeah, it can be 0 but not negative).
<azonenberg>
that can be fixed to test for ==0 but i dont think it could ever underflow there
<azonenberg>
since all values should be positive at that point?
<azonenberg>
let me double check though
<azonenberg>
we might need some stuff to end up being signed
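The point about the unsigned test can be illustrated on the CPU. The sketch below uses hypothetical helper names and plain C++ fixed-width types as stand-ins for the shader's `uint`; it is not the shader code itself. It shows that `<= 0` on an unsigned value is just `== 0`, and that an unsigned subtraction that "goes negative" wraps modulo 2^32 instead, which is why a signed type would be needed to detect that case.

```cpp
#include <cassert>
#include <cstdint>

// For an unsigned value, "<= 0" can only ever mean "== 0"
constexpr bool IsDoneUnsigned(uint32_t iend)
{ return iend <= 0; }

// Unsigned subtraction wraps modulo 2^32 rather than producing a negative number
constexpr uint32_t WrappingSub(uint32_t a, uint32_t b)
{ return a - b; }

// With a signed type, a negative end index is representable and testable
constexpr bool IsDoneSigned(int64_t iend)
{ return iend <= 0; }

// Compile-time checks of the behaviour described above
static_assert(IsDoneUnsigned(0), "zero is caught");
static_assert(WrappingSub(0, 1) == 4294967295u, "0 - 1 wraps to 2^32 - 1");
static_assert(!IsDoneUnsigned(WrappingSub(0, 1)), "wrapped value slips past the test");
static_assert(IsDoneSigned(-1), "signed test catches the negative case");
```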
<d1b2>
<246tnt> Yeah, btw, no worries, take your time with the update, I'm just posting what I find here as I explore, but I'm not in a hurry.
<azonenberg>
apt will take as long as it wants :p
<azonenberg>
Feel free to debug on your own and send a PR as well if you get somewhere
<d1b2>
<246tnt> I tried to use signed values and limit istart / iend but no luck. One thing I just noticed, is that I was able to zoom way out ... as long as I keep the "end" of the waveform "off screen" it didn't crash.
<azonenberg>
Yes, that tracks
<azonenberg>
the bug is caused by the shader trying to render points before the left edge of the waveform
<azonenberg>
i.e. negative sample indexes
<azonenberg>
and presumably not clamping correctly when it does that
<azonenberg>
it *should* simply no-op once it determines there's no samples to draw
<d1b2>
<246tnt> I meant the "end" as in the "2 us" mark in the case of the sample scope
<d1b2>
<246tnt> So I was looking from -8 us to 1.99 us or so.
<d1b2>
<246tnt> But if I shift a bit to the right, and go to bring the 2.1 us mark into view : crash.
<d1b2>
<246tnt> So it's more like it's reading past the end of the samples.
<azonenberg>
oh interesting thats not what i expected
<azonenberg>
well, keep poking i guess?
<azonenberg>
i think you're on the trail of the bug for sure
<d1b2>
<246tnt> 😁
<d1b2>
<246tnt> It's not an out of bound access to the buffer. g_done just never becomes true which .. hangs the GPU.
<azonenberg>
Yeah. it's probably bounds checking correctly but not ending the loop
<azonenberg>
likely not a true *infinite* loop
<azonenberg>
but 2^32 iterations or similar
<azonenberg>
We had a bug like that a while back (unrelated to this one, i think)
<azonenberg>
where in certain corner cases the done flag never got set
<d1b2>
<246tnt> I think having a "max iteration" safeguard would be useful in any case. Tanking the FPS is better than a crash 😅 Obviously fixing the actual issue would be good but just as a fallback for potential other future issues.
<azonenberg>
Yeah the challenge is not adding more instructions to the inner loop than necessary
<azonenberg>
as that ruins FPS for everyone
<azonenberg>
static bounds checks prior to start of the loop are good
<azonenberg>
there is actually a separate ticket for capping max iteration count on low end GPUs to avoid horrible slowdowns if you zoom out too far
<azonenberg>
Might be possible to integrate here
<azonenberg>
either way lets find and fix the bug
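One way to combine the two ideas above (bounds checks before the loop, plus an iteration cap that costs nothing per iteration) is to fold the cap into the loop bound. The sketch below is a simplified CPU-side model of a single thread walking a sample range, not the actual waveform shader, which terminates via a shared `g_done` flag. `WalkSamples`, `MAX_ITERATIONS`, and its value are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical cap; a real value would need tuning per GPU
constexpr uint32_t MAX_ITERATIONS = 1u << 20;

// CPU-side model of one thread walking samples [istart, iend).
// Returns the number of samples visited.
inline uint32_t WalkSamples(uint32_t istart, uint32_t iend, uint32_t numSamples)
{
    // Static checks before the loop: clamp to the buffer, bail out if empty
    iend = std::min(iend, numSamples);
    if(istart >= iend)
        return 0;

    // Fold the cap into the trip count so the loop body gains no extra branch
    const uint32_t count = std::min(iend - istart, MAX_ITERATIONS);

    uint32_t visited = 0;
    for(uint32_t i = 0; i < count; i++)
        visited++;    // stand-in for the per-sample rendering work
    return visited;
}
```

Because the trip count is fixed before the loop starts, a bogus range can at worst cost `MAX_ITERATIONS` iterations rather than something on the order of 2^32.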
<d1b2>
<246tnt> I don't get : uint i = istart + gl_GlobalInvocationID.y;
<d1b2>
<246tnt> Ok, nm, I guess that works, I thought that was 0 for the whole local workgroup but no, that's WorkgroupID.
<d1b2>
<246tnt> Not sure if it's a coincidence yet but I think iend-istart = 129 = ROWS_PER_BLOCK * 2 + 1 ....
bvernoux has quit [Read error: Connection reset by peer]