<_whitenotifier>
[scopehal] azonenberg labeled issue #731: Consider deprecating blocking versions of PrepareForCpuAccess() / PrepareForGpuAccess() that use implicit global queue object - https://github.com/glscopeclient/scopehal/issues/731
<_whitenotifier>
[scopehal] azonenberg opened issue #731: Consider deprecating blocking versions of PrepareForCpuAccess() / PrepareForGpuAccess() that use implicit global queue object - https://github.com/glscopeclient/scopehal/issues/731
<lain>
azonenberg: looking at the issue of two FFTs causing deadlock... one FFT thread is blocked on the lock_guard in QueueHandle::SubmitAndBlock() via AcceleratorBuffer::CopyToCpu, the other is blocked on the lock_guard on g_vkTransferMutex in AcceleratorBuffer::CopyToCpu, hrm
<lain>
ok so the first one is blocking the second.. but what's blocking the first one
<d1b2>
<Darius> lock order reversal?
<d1b2>
<Darius> (ie not acquiring two different locks in the same order in two different places)
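(A minimal, generic C++ sketch of the lock order reversal pattern Darius is describing; the mutex names are illustrative stand-ins, not actual scopehal code:)

```cpp
#include <mutex>

std::mutex queueMutex;     // stands in for a QueueHandle's internal mutex
std::mutex transferMutex;  // stands in for g_vkTransferMutex

void threadA()
{
    std::lock_guard<std::mutex> l1(queueMutex);     // A: takes the queue lock first...
    std::lock_guard<std::mutex> l2(transferMutex);  // ...then the transfer lock
}

void threadB()
{
    std::lock_guard<std::mutex> l1(transferMutex);  // B: takes the transfer lock first...
    std::lock_guard<std::mutex> l2(queueMutex);     // ...then the queue lock: deadlocks if A already holds queueMutex
}
```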
<lain>
hmm
<azonenberg>
I continue to say that we should not be using AcceleratorBuffer::CopyToCpu and should work towards refactoring out g_vkTransfer* entirely
<azonenberg>
actually the other thing is, g_vkTransferMutex is now redundant
<azonenberg>
because g_vkTransferQueue is now a QueueHandle right?
<lain>
it's not entirely redundant because you also have a global command buffer iirc
<lain>
for transfers
<lain>
g_vkTransferCommandBuffer
<azonenberg>
Hmm
<azonenberg>
well
<azonenberg>
ok
<azonenberg>
But it would be redundant if we only used the nonblocking transfer calls that took a commandbuffer as an argument
<azonenberg>
at least within filters, that should probably be the rule
<lain>
yeah
<lain>
ok I think I see the deadlock now
<lain>
thread 12's FFT compute is executing on the same queue as g_vkTransferQueue, which has to be held with QueueLock until we change the Filter API
<lain>
so yeah
<azonenberg>
So yeah basically we need to just nuke g_vkTransferQueue from orbit?
<azonenberg>
or is there a less invasive way to solve this for the near term?
<lain>
specifically, out of interest, we have:
<lain>
thread 12 holds: QueueHandle::m_mutex 0x00000001044a66b8 (FFTFilter::Refresh), g_vkTransferMutex (AcceleratorBuffer::CopyToCpu), and is waiting on QueueHandle::m_mutex 0x000000010444e0a8 in AcceleratorBuffer::CopyToCpu
<lain>
thread 14 holds: QueueHandle::m_mutex 0x000000010444e0a8, and is waiting on g_vkTransferMutex (AcceleratorBuffer::CopyToCpu)
<lain>
so yes, lock order reversal as Darius suggested
<lain>
I think the filter API change is the big one tbh
<lain>
because this would still deadlock even if we were using the nonblocking transfer calls
<lain>
the problem is we're holding the compute queue lock for the duration of Filter::Refresh(), in which we also (for FFTFilter) call FindPeaks, which performs a transfer. we can't guarantee there's >1 queue available, so we shouldn't be holding the compute queue lock for all of Refresh() if we need to perform queue operations within it
<lain>
if instead we use QueueHandle in Filter::Refresh, each filter's queue->Submit() or queue->SubmitAndBlock() call would be effectively atomic
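(A rough sketch of the direction being discussed here; the signature and helper names below are assumptions for illustration, not the final scopehal API:)

```cpp
// Hypothetical sketch: the caller no longer holds a compute-queue lock
// around all of Refresh(). The filter instead receives a command buffer
// and a QueueHandle and submits everything through that one handle.
void FFTFilter::Refresh(vk::raii::CommandBuffer& cmdBuf, std::shared_ptr<QueueHandle> queue)
{
    // ... record the FFT compute dispatch into cmdBuf ...

    // The handle locks its own mutex only for the duration of this call
    queue->SubmitAndBlock(cmdBuf);

    // Readback for peak finding goes through the same handle, so this
    // thread never needs a second QueueHandle or the global transfer queue
    FindPeaks(cmdBuf, queue);  // hypothetical signature
}
```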
<azonenberg>
well what i'm saying is
<azonenberg>
FindPeaks should submit to ... oh yeah
<azonenberg>
actually no it should
<azonenberg>
so basically we need to allow the filter to submit the command buffer early
<azonenberg>
then read back results and do cpu side processing
<lain>
hm
<lain>
well, regardless, Filter::Refresh() holds the queue lock the entire time until we make the api change
<lain>
so I don't think that solves anything for single-queue systems?
<lain>
I'm just gonna go ahead and make the Filter::Refresh() change real quick
<lain>
it'll be fast
<azonenberg>
yes it does make the difference IMO
<azonenberg>
because we no longer reference multiple QueueHandles
<azonenberg>
we do the transfer on the same handle we submit the shader operation to
<azonenberg>
at that point there's no sync with another thread
<azonenberg>
the two refresh calls may not be able to execute in parallel
<azonenberg>
but that's a non-issue if we're short on queues anyway
<lain>
mmm
<lain>
we need both changes to make it ideal
<lain>
oh I see what you're saying, yeah
<azonenberg>
basically, if we never have a single thread use >1 queuehandle
<azonenberg>
we can never have such a deadlock
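(To illustrate why the one-handle-per-thread rule is sufficient, a simplified QueueHandle sketch; the details here are assumed for illustration, not the actual implementation:)

```cpp
#include <mutex>
#include <vulkan/vulkan_raii.hpp>

// Hypothetical simplified QueueHandle: each submission takes the handle's
// mutex only for the duration of the call. A thread that only ever touches
// one QueueHandle therefore never holds two queue mutexes at once, so the
// lock order reversal shown earlier cannot occur.
class QueueHandle
{
public:
    void SubmitAndBlock(vk::raii::CommandBuffer& cmdBuf)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        // ... submit cmdBuf to the underlying vk::raii::Queue and wait on a fence ...
    }

private:
    std::mutex m_mutex;
};
```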
<lain>
yeah
<azonenberg>
So in that case i think the correct solution is to implement #731 and remove the global transfer queue/mutex
<lain>
but we *also* need to implement #732
<lain>
well, ok
<lain>
there's two choices here: either we implement #732 and pass QueueHandles around, or we hold the queue lock for the duration of the filter execution like we do currently, and pass a vk::raii::Queue& for transfers
<lain>
thoughts?
<lain>
eh
<lain>
it's moot now, I just finished implementing #732 :P
<lain>
now able to run any number of FFTs on my M1 with no issues
<d1b2>
<Darius> yay I guessed right!
<lain>
azonenberg: may I push this to master, or would you prefer a PR?
<lain>
eh, it's working quite well here, and should fix many bugs, soooo yolo!
<_whitenotifier>
[scopehal-apps] lain 4350fa9 - Bump scopehal library
<_whitenotifier>
[scopehal-apps] lain eecdc02 - Merge branch 'master' of github.com:glscopeclient/scopehal-apps
Degi_ has joined #scopehal
Degi has quit [Ping timeout: 256 seconds]
Degi_ is now known as Degi
<azonenberg>
lain: So the #731 refactoring is still on the todo list but no longer a critical blocker?
<lain>
azonenberg: correct
<azonenberg>
ok great it probably makes sense to do that as part of a larger cleanup then
<azonenberg>
Anyway, i'm now working on packet tree support in the ngscopeclient protocol analyzer
<azonenberg>
Which had been on my list for a couple days but on hold since i didnt have scopesession load support or any DUTs handy that spoke some of the "easier" protocols that do packet merging
<azonenberg>
(i.e. i cant save sessions so i didnt want a protocol that was gonna need a dozen probe hookups and complex multilayer decodes)
<azonenberg>
so happens i'm using one of those protocols at work right now
<azonenberg>
So now i have test data and i'm making good progress
<azonenberg>
also found what i'm pretty sure is a memory leak in the protocol analyzer
<_whitenotifier>
[scopehal-apps] azonenberg b936b50 - ProtocolAnalyzerDialog: fixed bug where packets with <1 line of text showed no bytes in the data column. See #541.
<_whitenotifier>
[scopehal-apps] azonenberg a5a65d3 - Protocol analyzer now supports packet hierarchies for complex multi-packet transactions. See #541.
nelgau has joined #scopehal
nelgau_ has joined #scopehal
nelgau has quit [Ping timeout: 260 seconds]
<azonenberg>
lain: you borked the unit tests
<azonenberg>
we're getting CI failures
<azonenberg>
looks like you didnt refactor them to take a queuehandle
massi has joined #scopehal
nelgau_ has quit [Ping timeout: 260 seconds]
massi has quit [Remote host closed the connection]