#scopehal on 2023-09-01 — irc logs at libera.irclog.whitequark.org

2022-03-25 21:41 azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/glscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal

01:19 Degi has quit [Ping timeout: 255 seconds]

01:20 Degi has joined #scopehal

03:39 <_whitenotifier> [scopehal-apps] aquarius20th forked the repository - https://github.com/aquarius20th

04:12 <_whitenotifier> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±7] https://github.com/glscopeclient/scopehal/compare/ef8a304eed30...d99b09282b57

04:12 <_whitenotifier> [scopehal] azonenberg d99b092 - Added integration time control

04:13 <_whitenotifier> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±4] https://github.com/glscopeclient/scopehal-apps/compare/49b446fccec5...139650c22a26

04:13 <_whitenotifier> [scopehal-apps] azonenberg 139650c - Added realtime BER support and integration control

06:13 bvernoux has joined #scopehal

06:20 electronic_eel_ has joined #scopehal

06:25 electronic_eel has quit [*.net *.split]

06:25 josuah has quit [*.net *.split]

12:54 josuah has joined #scopehal

16:14 <_whitenotifier> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal/compare/d99b09282b57...1b19016eb25b

16:14 <_whitenotifier> [scopehal] azonenberg 1b19016 - Set output clock frequency at startup. Deleted comment referencing invalid data file we've now fixed server side

16:14 <_whitenotifier> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://github.com/glscopeclient/scopehal-apps/compare/139650c22a26...ec15bd85dc37

16:14 <_whitenotifier> [scopehal-apps] azonenberg ec15bd8 - Updated with patches from upstream imgui-node-editor. Fixes #572.

16:14 <_whitenotifier> [scopehal-apps] azonenberg closed issue #572: Filter graph editor window disappears when a tooltip is being displayed if the graph editor window is not docked - https://github.com/glscopeclient/scopehal-apps/issues/572

16:21 <azonenberg> So we do some weird draw list things in the filter graph editor that i think may be confusing it

16:21 <azonenberg> it's throwing an assertion using the upstream version of imgui-node-editor

16:22 <azonenberg> But if i comment out the assert, it draws correctly without any of the weird flickering we've been having even if undocked

16:22 <azonenberg> so i merged those updated patches

16:22 <azonenberg> we'll run this build until there's a more complete upstream fix

17:03 <d1b2> <246tnt> Filter graph works fine indeed.

17:05 <d1b2> <246tnt> This is the render one zoom click before my GPU hangs : https://i.imgur.com/4drAVze.jpg still doing 60 fps and ~ 10ms of actual work per frame. The GPU top also shows only about 50% usage and it's only running at one third to one half of its max frequency.

17:13 electronic_eel_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

17:13 electronic_eel has joined #scopehal

17:28 <azonenberg> weird

17:29 <azonenberg> Definitely a bug somewhere but hard to diagnose these things

17:29 <azonenberg> if i can get an intel gpu exhibiting the problem in my lab, maybe i can try to troubleshoot it

17:30 <d1b2> <246tnt> So you have Intel GPUs that don't exhibit the issue?

17:31 <azonenberg> What are exact steps to reproduce? is it simply opening the demo at certain zooms?

17:35 <tnt> Yeah, just bring it to how it looks in that screenshot ... then zoom out one more time and it will show a few frames and then hang.

17:36 <tnt> (No need for the eye thing, I was just trying to load the GPU a bit more for it to be really worst case)

17:36 <azonenberg> The eye doesn't load the GPU much if at all, it's currently not GPU accelerated

17:36 <azonenberg> it's software integrated into a texture and then the only thing the gpu does is draw a textured polygon

17:36 <azonenberg> This is almost certainly an issue related to the waveform rendering compute shader

17:37 <azonenberg> the only question is if it's a mesa/intel gpu driver bug or a shader bug in ngscopeclient

17:37 <azonenberg> Or perhaps both

17:46 <azonenberg> tnt: ok so i was actually able to reproduce it zooming out far enough on my "Intel(R) UHD Graphics (CML GT2)"

17:51 <d1b2> <246tnt> Trying to find if those numbers mean anything ...

17:52 <d1b2> <246tnt> are they the same for you ?

17:52 <d1b2> <246tnt> 12:1:85dcfdfb

17:52 <azonenberg> what numbers?

17:53 <d1b2> <david.rysk> for some reason it is not linking with /usr/lib/libSPIRV-Tools-opt.so for me, so I get link errors

17:53 <d1b2> <david.rysk> did spirv-tools split up the libs upstream in newer versions perhaps?

17:54 <d1b2> <246tnt> @azonenberg in the dmesg error that pops up when it happens ?

17:54 <azonenberg> oh in dmesg not stdout

17:54 <azonenberg> gimme a sec

17:55 <azonenberg> um i'm actually getting spammed by kernel soft lock errors i need to troubleshoot too lol

17:56 <d1b2> <246tnt> Ok, the 12 for me is the GPU generation the 1 means I915_ENGINE_CLASS_RENDER

17:56 <azonenberg> i'm updating my OS anyway will try on that box after a reboot

17:57 <azonenberg> And see what happens once i've upgraded to bookworm and whether i still see the hang (and if ngscopeclient breaks in new interesting ways or what)

17:58 <d1b2> <246tnt> And the last number is also not really providing much info: "Generate a semi-unique error code. The code is not meant to have meaning, The code's only purpose is to try to prevent false duplicated bug reports by grossly estimating a GPU error state."

17:58 <azonenberg> lol

17:59 <azonenberg> I mean you can try commenting out random parts of the rendering shader and see if it goes away :p

17:59 <azonenberg> honestly i'm about out of other ideas

18:00 <d1b2> <246tnt> So the only info that provided is that the bug didn't happen in the BLITTER or VIDEO or VIDEOENHANCE part of the GPU ... not really a surprise.

18:00 <azonenberg> actually what might be useful is if you were to log arguments to the shader invocation that hung

18:00 <azonenberg> x/y/z sizes and such

18:02 <d1b2> <246tnt> waveform-compute.glsl ?

18:03 <azonenberg> Yep

18:03 <azonenberg> start by logging arguments

18:03 <azonenberg> see if there are any clues there, maybe some upstream code is giving it bogus array sizes or something

18:03 <azonenberg> i cant imagine it would only do that on intel igpu

18:03 <azonenberg> but maybe intel is more picky than nvidia or something idk

18:19 florolf has quit [Server closed connection]

18:19 florolf has joined #scopehal

18:24 <d1b2> <246tnt> uint32_t offset_samples;

18:24 <d1b2> <246tnt> wdata->m_config.offset_samples : 4294966871

18:24 <azonenberg> ooooh that sounds fishy

18:25 <azonenberg> see if you can figure out where it came from? something obviously underflowed

18:25 <azonenberg> why only intel gpus though???

18:25 <d1b2> <246tnt> int64_t offset_samples = (group->m_xAxisOffset - pdat->m_triggerPhase) / pdat->m_timescale;

18:26 <d1b2> <246tnt> But that ends up cast to uint32_t

18:28 <azonenberg> So, inside the shader

18:29 <azonenberg> offset_samples is unsigned since it's literally a sample index

18:29 <azonenberg> it should never be negative

18:29 <azonenberg> we probably need to clamp that host side then

18:29 <azonenberg> i'm in the middle of updating the OS on my intel gpu box (work laptop) but once it's back up will try developing a fix

18:31 <azonenberg> Probably just a quick if(offset_samples < 0) wdata->m_config.offset_samples = 0
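
The logged value 4294966871 is exactly what -425 becomes when a signed 64-bit integer is narrowed to uint32_t (2^32 - 425). The following is a minimal sketch of the host-side clamp suggested above, not the actual scopehal code: the helper names `ClampOffsetSamples`, `WrapOffsetSamples`, and `DemoOffsetWrap` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of scopehal): convert the signed host-side
// offset to the unsigned value handed to the shader, clamping out-of-range
// inputs instead of letting them wrap.
inline uint32_t ClampOffsetSamples(int64_t offset_samples)
{
    if(offset_samples < 0)
        return 0;
    if(offset_samples > static_cast<int64_t>(UINT32_MAX))
        return UINT32_MAX;
    return static_cast<uint32_t>(offset_samples);
}

// What a plain narrowing conversion does: the value wraps modulo 2^32.
inline uint32_t WrapOffsetSamples(int64_t offset_samples)
{
    return static_cast<uint32_t>(offset_samples);
}

// Small self-check reproducing the number seen in the log.
inline void DemoOffsetWrap()
{
    assert(WrapOffsetSamples(-425) == 4294966871u);  // the value from the log
    assert(ClampOffsetSamples(-425) == 0u);          // clamped instead
    assert(ClampOffsetSamples(1000) == 1000u);       // in-range values pass through
}
```

Clamping to zero only hides the negative offset from the shader; whether that alone is enough to avoid the hang is not established at this point in the log.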

18:32 <d1b2> <246tnt> yeah in the shader there is also: uint iend = uint(floor((gl_GlobalInvocationID.x + 1) / xscale)) + offset_samples; if(iend <= 0) g_done = true; But that's a <= 0 test on a uint 😅 (yeah it can be 0 but not negative).

18:32 <azonenberg> that can be fixed to test for ==0 but i dont think it could ever underflow there

18:32 <azonenberg> since all values should be positive at that point?

18:33 <azonenberg> let me double check though

18:33 <azonenberg> we might need some stuff to end up being signed
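
To illustrate the point about the `<= 0` test, here is a rough C++ sketch of the same end-index arithmetic done in signed 64-bit, where a negative result stays visible, next to the unsigned variant, where it cannot. This is an assumption-laden stand-in for the GLSL, not the shader itself: `EndIndexSigned`, `NothingToDrawSigned`, and `NothingToDrawUnsigned` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the shader's end-index computation, carried out in
// signed 64-bit arithmetic so that a negative offset yields a negative index.
inline int64_t EndIndexSigned(uint32_t invocation_x, float xscale, int64_t offset_samples)
{
    assert(xscale > 0.0f);
    return static_cast<int64_t>(std::floor((invocation_x + 1) / xscale)) + offset_samples;
}

// With a signed index, "<= 0" really does catch negative results.
inline bool NothingToDrawSigned(int64_t iend)
{
    return iend <= 0;
}

// With an unsigned index, "<= 0" can only ever mean "== 0"; a wrapped
// negative value looks like a huge positive index and passes the test.
inline bool NothingToDrawUnsigned(uint32_t iend)
{
    return iend == 0;
}
```

For example, with `invocation_x = 0`, `xscale = 1.0f` and an offset of -425, the signed index is -424 and is caught, while the same value narrowed to `uint32_t` is 4294966872 and is not. As reported a little later in the log, switching to signed values did not by itself resolve the hang, so this only covers the comparison issue.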

18:39 <d1b2> <246tnt> Yeah, btw, no worries, take your time with the update, I'm just posting what I find here as I explore, but I'm not in a hurry.

18:40 <azonenberg> apt will take as long as it wants :p

18:40 <azonenberg> Feel free to debug on your own and send a PR as well if you get somewhere

18:40 <d1b2> <246tnt> I tried to use signed values and limit istart / iend but no luck. One thing I just noticed, is that I was able to zoom way out ... as long as I keep the "end" of the waveform "off screen" it didn't crash.

18:41 <azonenberg> Yes, that tracks

18:42 <azonenberg> the bug is caused by the shader trying to render points before the left edge of the waveform

18:42 <azonenberg> i.e. negative sample indexes

18:42 <azonenberg> and presumably not clamping correctly when it does that

18:42 <azonenberg> it *should* simply no-op once it determines there's no samples to draw

18:42 <d1b2> <246tnt> I meant the "end" as in the "2 us" mark in the case of the sample scope

18:43 <d1b2> <246tnt> So I was looking from -8 us to 1.99 us or so.

18:43 <d1b2> <246tnt> But if I shift a bit to the right and bring the 2.1 us mark into view: crash.

18:44 <d1b2> <246tnt> So it's more like it's reading past the end of the samples.

18:44 <azonenberg> oh interesting thats not what i expected

18:44 <azonenberg> well, keep poking i guess?

18:44 <azonenberg> i think you're on the trail of the bug for sure

18:44 <d1b2> <246tnt> 😁

18:52 <d1b2> <246tnt> It's not an out of bound access to the buffer. g_done just never becomes true which .. hangs the GPU.

18:54 <azonenberg> Yeah. it's probably bounds checking correctly but not ending the loop

18:54 <azonenberg> likely not a true *infinite* loop

18:55 <azonenberg> but 2^32 iterations or similar

18:55 <azonenberg> We had a bug like that a while back (unrelated to this one, i think)

18:55 <azonenberg> where in certain corner cases the done flag never got set

19:03 <d1b2> <246tnt> I think having a "max iteration" safeguard would be useful in any case. Tanking the FPS is better than a crash 😅 Obviously fixing the actual issue would be good but just as a fallback for potential other future issues.

19:05 <azonenberg> Yeah the challenge is not adding more instructions to the inner loop than necessary

19:05 <azonenberg> as that ruins FPS for everyone

19:05 <azonenberg> static bounds checks prior to start of the loop are good

19:05 <azonenberg> there is actually a separate ticket for capping max iteration count on low end GPUs to avoid horrible slowdowns if you zoom out too far

19:05 <azonenberg> Might be possible to integrate here

19:05 <azonenberg> either way lets find and fix the bug
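
One way to get a max-iteration safeguard without touching the inner loop is to clamp the sample range once, before the loop starts, so the loop's trip count is bounded by construction. The sketch below shows that idea in C++ under stated assumptions: `SampleRange`, `ClampRange`, `CountIterations`, and the `depth` and `max_iterations` parameters are all hypothetical and not taken from the shader.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Half-open range [begin, end) of sample indices to visit.
struct SampleRange
{
    uint32_t begin;
    uint32_t end;
};

// Hypothetical pre-loop bounds check: clamp the requested range to the
// waveform depth and to a maximum iteration budget. All of the checking
// happens here, once, so the loop body needs no extra comparisons.
inline SampleRange ClampRange(int64_t istart, int64_t iend, uint32_t depth, uint32_t max_iterations)
{
    int64_t lo = std::min<int64_t>(std::max<int64_t>(istart, 0), depth);
    int64_t hi = std::min<int64_t>(std::max<int64_t>(iend, lo), depth);
    hi = std::min<int64_t>(hi, lo + max_iterations);
    return { static_cast<uint32_t>(lo), static_cast<uint32_t>(hi) };
}

// Stand-in for the render loop: counts how many samples would be visited.
// The index is 64-bit so that "i += stride" cannot wrap past the end.
inline uint32_t CountIterations(SampleRange r, uint32_t stride)
{
    assert(stride > 0);
    uint32_t n = 0;
    for(uint64_t i = r.begin; i < r.end; i += stride)
        n++;
    return n;
}
```

With a 100-sample waveform, a request for [-425, 129) collapses to [0, 100), and a request that starts past the end collapses to an empty range, so the loop always terminates. The cost is that a capped range silently draws fewer samples, which matches the "tank the FPS or the detail rather than hang" trade-off discussed above.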

19:33 <d1b2> <246tnt> I don't get : uint i = istart + gl_GlobalInvocationID.y;

19:35 <d1b2> <246tnt> Ok, nm, I guess that works, I thought that was 0 for the whole local workgroup but no, that's WorkGroupID.
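
For reference, GLSL defines `gl_GlobalInvocationID` as `gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID`, so its `.y` component varies per thread within a workgroup whenever the local size in y is greater than 1. A small C++ model of that formula follows; the `UVec3` type and the workgroup size of (1, 64, 1) used in the example are assumptions chosen for illustration, not values read from the shader.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for GLSL's uvec3.
struct UVec3
{
    uint32_t x, y, z;
};

// Models the GLSL built-in relationship:
//   gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
inline UVec3 GlobalInvocationID(UVec3 workGroupID, UVec3 workGroupSize, UVec3 localInvocationID)
{
    // A local ID is always strictly less than the workgroup size on each axis.
    assert(localInvocationID.x < workGroupSize.x);
    assert(localInvocationID.y < workGroupSize.y);
    assert(localInvocationID.z < workGroupSize.z);
    return {
        workGroupID.x * workGroupSize.x + localInvocationID.x,
        workGroupID.y * workGroupSize.y + localInvocationID.y,
        workGroupID.z * workGroupSize.z + localInvocationID.z
    };
}
```

With an assumed local size of (1, 64, 1), workgroup (3, 0, 0) and local thread (0, 5, 0) give a global ID of (3, 5, 0): the x component selects the column and the y component is the per-thread offset, which is why adding it to `istart` gives each thread a different starting sample.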

19:50 <d1b2> <246tnt> Not sure if it's a coincidence yet but I think iend-istart = 129 = ROWS_PER_BLOCK * 2 + 1 ....

21:55 bvernoux has quit [Read error: Connection reset by peer]