azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/glscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
Degi has quit [Ping timeout: 255 seconds]
Degi has joined #scopehal
<_whitenotifier> [scopehal-apps] aquarius20th forked the repository - https://github.com/aquarius20th
<_whitenotifier> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±7] https://github.com/glscopeclient/scopehal/compare/ef8a304eed30...d99b09282b57
<_whitenotifier> [scopehal] azonenberg d99b092 - Added integration time control
<_whitenotifier> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±4] https://github.com/glscopeclient/scopehal-apps/compare/49b446fccec5...139650c22a26
<_whitenotifier> [scopehal-apps] azonenberg 139650c - Added realtime BER support and integration control
bvernoux has joined #scopehal
electronic_eel_ has joined #scopehal
electronic_eel has quit [*.net *.split]
josuah has quit [*.net *.split]
josuah has joined #scopehal
<_whitenotifier> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal/compare/d99b09282b57...1b19016eb25b
<_whitenotifier> [scopehal] azonenberg 1b19016 - Set output clock frequency at startup. Deleted comment referencing invalid data file we've now fixed server side
<_whitenotifier> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://github.com/glscopeclient/scopehal-apps/compare/139650c22a26...ec15bd85dc37
<_whitenotifier> [scopehal-apps] azonenberg ec15bd8 - Updated with patches from upstream imgui-node-editor. Fixes #572.
<_whitenotifier> [scopehal-apps] azonenberg closed issue #572: Filter graph editor window disappears when a tooltip is being displayed if the graph editor window is not docked - https://github.com/glscopeclient/scopehal-apps/issues/572
<azonenberg> So we do some weird draw list things in the filter graph editor that i think may be confusing it
<azonenberg> it's throwing an assertion using the upstream version of imgui-node-editor
<azonenberg> But if i comment out the assert, it draws correctly without any of the weird flickering we've been having even if undocked
<azonenberg> so i merged those updated patches
<azonenberg> we'll run this build until there's a more complete upstream fix
<d1b2> <246tnt> Filter graph works fine indeed.
<d1b2> <246tnt> This is the render one zoom click before my GPU hangs : https://i.imgur.com/4drAVze.jpg still doing 60 fps and ~ 10ms of actual work per frame. The GPU top also shows only about 50% usage and it's only running at one third to one half of its max frequency.
electronic_eel_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
electronic_eel has joined #scopehal
<azonenberg> weird
<azonenberg> Definitely a bug somewhere but hard to diagnose these things
<azonenberg> if i can get an intel gpu exhibiting the problem in my lab, maybe i can try to troubleshoot it
<d1b2> <246tnt> So you have Intel GPUs that don't exhibit the issue?
<azonenberg> What are exact steps to reproduce? is it simply opening the demo at certain zooms?
<tnt> Yeah, just get it to look like it does in that screenshot ... then zoom out one more time and it will show a few frames and then hang.
<tnt> (No need for the eye thing, I was just trying to load the GPU a bit more for it to be really worst case)
<azonenberg> The eye doesn't load the GPU much if at all, it's currently not GPU accelerated
<azonenberg> it's software integrated into a texture and then the only thing the gpu does is draw a textured polygon
<azonenberg> This is almost certainly an issue related to the waveform rendering compute shader
<azonenberg> the only question is if it's a mesa/intel gpu driver bug or a shader bug in ngscopeclient
<azonenberg> Or perhaps both
<azonenberg> tnt: ok so i was actually able to reproduce it zooming out far enough on my "Intel(R) UHD Graphics (CML GT2"
<d1b2> <246tnt> Trying to find if those numbers means anything ...
<d1b2> <246tnt> are they the same for you ?
<d1b2> <246tnt> 12:1:85dcfdfb
<azonenberg> what numbers?
<d1b2> <david.rysk> for some reason it is not linking with /usr/lib/libSPIRV-Tools-opt.so for me, so I get link errors
<d1b2> <david.rysk> did spirv-tools split up the libs upstream in newer versions perhaps?
<d1b2> <246tnt> @azonenberg in the dmesg error that pops up when it happens ?
<azonenberg> oh in dmesg not stdout
<azonenberg> gimme a sec
<azonenberg> um i'm actually getting spammed by kernel soft lock errors i need to troubleshoot too lol
<d1b2> <246tnt> Ok, the 12 for me is the GPU generation the 1 means I915_ENGINE_CLASS_RENDER
<azonenberg> i'm updating my OS anyway will try on that box after a reboot
<azonenberg> And see what happens once i've upgraded to bookworm and whether i still see the hang (and if ngscopeclient breaks in new interesting ways or what)
<d1b2> <246tnt> And the last number is also not really providing much info: "Generate a semi-unique error code. The code is not meant to have meaning, The code's only purpose is to try to prevent false duplicated bug reports by grossly estimating a GPU error state."
<azonenberg> lol
<azonenberg> I mean you can try commenting out random parts of the rendering shader and see if it goes away :p
<azonenberg> honestly i'm about out of other ideas
<d1b2> <246tnt> So the only info that provides is that the bug didn't happen in the BLITTER or VIDEO or VIDEOENHANCE part of the GPU ... not really a surprise.
<azonenberg> actually what might be useful is if you were to log arguments to the shader invocation that hung
<azonenberg> x/y/z sizes and such
<d1b2> <246tnt> waveform-compute.glsl ?
<azonenberg> Yep
<azonenberg> start by logging arguments
<azonenberg> see if there are any clues there, maybe some upstream code is giving it bogus array sizes or something
<azonenberg> i cant imagine it would only do that on intel igpu
<azonenberg> but maybe intel is more picky than nvidia or something idk
florolf has quit [Server closed connection]
florolf has joined #scopehal
<d1b2> <246tnt> uint32_t offset_samples;
<d1b2> <246tnt> wdata->m_config.offset_samples : 4294966871
<azonenberg> ooooh that sounds fishy
<azonenberg> see if you can figure out where it came from? something obviously underflowed
<azonenberg> why only intel gpus though???
<d1b2> <246tnt> int64_t offset_samples = (group->m_xAxisOffset - pdat->m_triggerPhase) / pdat->m_timescale;
<d1b2> <246tnt> But that ends up cast to uint32_t
<azonenberg> So, inside the shader
<azonenberg> offset_samples is unsigned since it's literally a sample index
<azonenberg> it should never be negative
<azonenberg> we probably need to clamp that host side then
<azonenberg> i'm in the middle of updating the OS on my intel gpu box (work laptop) but once it's back up will try developing a fix
<azonenberg> Probably just a quick if(offset_samples < 0) wdata->m_config.offset_samples = 0
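The wraparound described above can be reproduced in isolation. This is a minimal C++ sketch, not the actual ngscopeclient code: the helper names `NarrowUnchecked` and `ClampOffsetSamples` are hypothetical, and it assumes the signed offset was -425, the value that wraps to the 4294966871 reported in the log.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: narrows the signed offset with no check,
// which is roughly what the implicit conversion to uint32_t does.
uint32_t NarrowUnchecked(int64_t offset_samples)
{
    // Conversion to an unsigned type is modulo 2^32, so -425 becomes 4294966871
    return static_cast<uint32_t>(offset_samples);
}

// Hypothetical helper: clamps on the host side before narrowing,
// along the lines of the "if(offset_samples < 0)" fix suggested above.
uint32_t ClampOffsetSamples(int64_t xAxisOffset, int64_t triggerPhase, int64_t timescale)
{
    int64_t offset_samples = (xAxisOffset - triggerPhase) / timescale;

    // A sample index can never be negative, so treat anything left of the waveform as index 0
    if(offset_samples < 0)
        return 0;

    // Assumption: also saturate at the top of the range instead of silently truncating
    if(offset_samples > UINT32_MAX)
        return UINT32_MAX;

    return static_cast<uint32_t>(offset_samples);
}
```

Whether clamping to zero is the right behaviour for the renderer is a separate question. The sketch only shows that the narrowing, not the division, is where the huge value comes from.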
<d1b2> <246tnt> yeah in the shader there is also: uint iend = uint(floor((gl_GlobalInvocationID.x + 1) / xscale)) + offset_samples; if(iend <= 0) g_done = true; But that's a <= 0 test on a uint 😅 (yeah it can be 0 but not negative).
<azonenberg> that can be fixed to test for ==0 but i dont think it could ever underflow there
<azonenberg> since all values should be positive at that point?
<azonenberg> let me double check though
<azonenberg> we might need some stuff to end up being signed
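To make the signed versus unsigned question concrete, here is a small C++ sketch that mirrors the shape of the quoted shader expression on the host. It is an illustration only: `DoneUnsigned` and `DoneSigned` are hypothetical names, `x` stands in for `gl_GlobalInvocationID.x`, and nothing here claims to be how the shader ends up being fixed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Unsigned version: a negative offset has already wrapped to a huge value,
// so "iend <= 0" is really "iend == 0" and almost never fires.
bool DoneUnsigned(uint32_t x, float xscale, uint32_t offset_samples)
{
    uint32_t iend = static_cast<uint32_t>(std::floor((x + 1) / xscale)) + offset_samples;
    return iend <= 0;
}

// Signed version: the end index is allowed to go negative,
// so the early-out triggers whenever the column lies left of sample 0.
bool DoneSigned(uint32_t x, float xscale, int64_t offset_samples)
{
    int64_t iend = static_cast<int64_t>(std::floor((x + 1) / xscale)) + offset_samples;
    return iend <= 0;
}
```

With an offset of -425 and `x = 0`, the signed version reports "done" while the unsigned version sees an end index of 4294966872 and carries on.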
<d1b2> <246tnt> Yeah, btw, no worries, take your time with the update, I'm just posting what I find here as I explore, but I'm not in a hurry.
<azonenberg> apt will take as long as it wants :p
<azonenberg> Feel free to debug on your own and send a PR as well if you get somewhere
<d1b2> <246tnt> I tried to use signed values and limit istart / iend but no luck. One thing I just noticed, is that I was able to zoom way out ... as long as I keep the "end" of the waveform "off screen" it didn't crash.
<azonenberg> Yes, that tracks
<azonenberg> the bug is caused by the shader trying to render points before the left edge of the waveform
<azonenberg> i.e. negative sample indexes
<azonenberg> and presumably not clamping correctly when it does that
<azonenberg> it *should* simply no-op once it determines there's no samples to draw
<d1b2> <246tnt> I meant the "end" as in the "2 us" mark in the case of the sample scope
<d1b2> <246tnt> So I was looking from -8 us to 1.99 us or so.
<d1b2> <246tnt> But if I shift a bit to the right, and go to bring the 2.1 us mark into view : crash.
<d1b2> <246tnt> So it's more like it's reading past the end of the samples.
<azonenberg> oh interesting thats not what i expected
<azonenberg> well, keep poking i guess?
<azonenberg> i think you're on the trail of the bug for sure
<d1b2> <246tnt> 😁
<d1b2> <246tnt> It's not an out of bound access to the buffer. g_done just never becomes true which .. hangs the GPU.
<azonenberg> Yeah. it's probably bounds checking correctly but not ending the loop
<azonenberg> likely not a true *infinite* loop
<azonenberg> but 2^32 iterations or similar
<azonenberg> We had a bug like that a while back (unrelated to this one, i think)
<azonenberg> where in certain corner cases the done flag never got set
<d1b2> <246tnt> I think having a "max iteration" safeguard would be useful in any case. Tanking the FPS is better than a crash 😅 Obviously fixing the actual issue would be good, but just as a fallback for potential other future issues.
<azonenberg> Yeah the challenge is not adding more instructions to the inner loop than necessary
<azonenberg> as that ruins FPS for everyone
<azonenberg> static bounds checks prior to start of the loop are good
<azonenberg> there is actually a separate ticket for capping max iteration count on low end GPUs to avoid horrible slowdowns if you zoom out too far
<azonenberg> Might be possible to integrate here
<azonenberg> either way lets find and fix the bug
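One way to reconcile the "max iteration" safeguard with the concern about extra inner-loop instructions is to fold the cap into the loop bound once, before the loop starts. The C++ sketch below shows that idea on the host. It is hypothetical throughout: `ProcessSamples`, `stride`, and `maxIterations` are illustrative names, and the real shader's loop structure may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one thread's sample loop.
// Returns how many samples were visited.
uint32_t ProcessSamples(uint32_t istart, uint32_t iend, uint32_t numSamples,
                        uint32_t stride, uint32_t maxIterations)
{
    // Static bounds checks, done once before the loop
    iend = std::min(iend, numSamples);
    if(stride == 0 || istart >= iend)
        return 0;

    // Work out the trip count up front and apply the cap to it,
    // so the loop itself still has a single comparison per iteration
    uint64_t span = iend - istart;
    uint64_t needed = (span + stride - 1) / stride;
    uint64_t iterations = std::min<uint64_t>(needed, maxIterations);

    uint32_t processed = 0;
    for(uint64_t n = 0; n < iterations; n++)
    {
        // Real per-sample work would go here, at index istart + n*stride
        processed++;
    }
    return processed;
}
```

Because the trip count is fixed before entry, a bogus `iend` can at worst cost `maxIterations` passes rather than something on the order of 2^32.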
<d1b2> <246tnt> I don't get : uint i = istart + gl_GlobalInvocationID.y;
<d1b2> <246tnt> Ok, nm, I guess that works, I thought that was 0 for the whole local workgroup but no, that's WorkgroupID.
<d1b2> <246tnt> Not sure if it's a coincidence yet but I think iend-istart = 129 = ROWS_PER_BLOCK * 2 + 1 ....
bvernoux has quit [Read error: Connection reset by peer]