azonenberg changed the topic of #scopehal to: ngscopeclient, libscopehal, and libscopeprotocols development and testing | https://github.com/ngscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
<d1b2> <johnsel> and now I can't get access to the account
<_whitenotifier-3> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <david.rysk> not a bad idea to have them and just forward them to your domains 😛
<d1b2> <johnsel> I have some kind of exchange hosting it now
<d1b2> <azonenberg> i mean i also still have a cell number from where i got my first phone as an undergraduate lol
<d1b2> <johnsel> I can't be arsed to deal with the admin of hosting a private email server anymore
<d1b2> <azonenberg> yeah spam blacklists have made that just about impossible
<d1b2> <johnsel> it's still doable but yeah
<_whitenotifier-e> [scopehal] d235j synchronize pull request #841: Refactor of Cmake scripts - https://github.com/ngscopeclient/scopehal/pull/841
<d1b2> <azonenberg> its enough of a pain trying to figure out all of the new SPF/DKIM/whatever records you need to keep adding to DNS when using third party mail servers :p
<_whitenotifier-3> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <johnsel> yep
<_whitenotifier-3> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <azonenberg> what happened to /opt/actions-runner/_work/scopehal-apps/scopehal-apps/build/Testing/Temporary/LastTest.log
<d1b2> <azonenberg> is that something we upload as an artifact?
<d1b2> <azonenberg> or is that debug info lost forever
<d1b2> <johnsel> that's gone
<d1b2> <johnsel> we can save it though
<d1b2> <azonenberg> can you fix the scripts to upload that somewhere or cat it or something?
<d1b2> <johnsel> assuming it doesn't contain anything we don't want uploaded
<d1b2> <david.rysk> at the moment I'm testing with the github hosted CI on my fork
<d1b2> <azonenberg> that should all be unit test output
<d1b2> <azonenberg> nothing sensitive
<d1b2> <johnsel> I'll upload it
<d1b2> <johnsel> we can figure out a way to make a cleaner report at some point
<d1b2> <azonenberg> interestingly the tests seem to have segfaulted
<d1b2> <david.rysk> oh can we bump nativefiledialog-extended?
<d1b2> <johnsel> right
<d1b2> <azonenberg> i wonder if we can get a binary and core dump out or something
<d1b2> <david.rysk> link?
<d1b2> <johnsel> I posted it above
<d1b2> <johnsel> it skipped that step because of the segfault
<d1b2> <johnsel> stupid
<d1b2> <azonenberg> On the list, i'm half paying attention while routing a pcb and spinning a cranky toddler on an office chair with the other hand lol
<_whitenotifier-e> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <david.rysk> @johnsel let me play with that, does this VM have a GPU or does it need Lavapipe to be installed?
<_whitenotifier-3> [scopehal-apps] Johnsel pushed 1 commit to master [+0/-0/±1] https://github.com/ngscopeclient/scopehal-apps/compare/bc14063cd369...a40211b729c9
<_whitenotifier-e> [scopehal-apps] Johnsel a40211b - ci: always upload artifacts, added LastTest.log as artifact
<d1b2> <azonenberg> all of the CI builder VMs should have a GTX 1650 pcie passthru
<d1b2> <johnsel> it has a gpu
<d1b2> <johnsel> it may or may not have vulkan and drivers properly installed
<d1b2> <johnsel> your pr 673 is run on it with the self-hosted config
<d1b2> <johnsel> feel free to pull master to get the changes I just made for debugging and to play with it
<d1b2> <azonenberg> conjecture: vulkan init fails then it crashes trying to access some vulkan state
<d1b2> <johnsel> your PR gets executed
<d1b2> <azonenberg> (likely due to not having an x server or something)
<d1b2> <johnsel> possible, though there should be one
<d1b2> <johnsel> it might not be attached to the nvidia gpu though
<d1b2> <johnsel> not sure if that matters, I was able to run vkcube
<d1b2> <azonenberg> @david.rysk if it will help, i can set you up with VPN access to the CI network and we can probably provision the builder so you can ssh to it
<d1b2> <johnsel> we can't
<d1b2> <johnsel> the networking is broken
<d1b2> <azonenberg> oh? interesting
<d1b2> <johnsel> I was going to redo it but can't with the resources/templates
<d1b2> <azonenberg> ah ok yeah we have that weird leaking of resources that arnt actually leaked
<d1b2> <johnsel> yep and I need you to clean up the old template as well
<_whitenotifier-e> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <johnsel> aside from the windows template
<d1b2> <johnsel> though right now it's still in use
<d1b2> <david.rysk> @azonenberg I probably can do all the testing I need on GH hosted runners
<d1b2> <david.rysk> which are more secure anyway
<d1b2> <azonenberg> @david.rysk i'm thinking specifically about getting accelerated vulkan working in our environment
<d1b2> <david.rysk> @azonenberg I mean if Lavapipe vulkan works, accelerated is not much harder.
<d1b2> <azonenberg> @johnsel ok so if i get a chance tonight after i put little miss crankypants to bed, i'll update xoa and xoa2 to the latest, install the latest patches on the host
<d1b2> <azonenberg> suspend all VMs, reboot the host, resume them
<d1b2> <azonenberg> and we'll see what happens to scopehal-ci-set after that
<_whitenotifier-3> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <johnsel> yes, you can shut down the ci vms though
<d1b2> <azonenberg> ok
<d1b2> <david.rysk> @azonenberg it really just involves having the gpu and driver installed at that point
<d1b2> <johnsel> and assuming the resources free up the debian gpu v2 template can stay
<d1b2> <david.rysk> anyway I'm working on getting GH hosted Mac (x86_64) build working, it's just a bunch of stupid stuff right now
<d1b2> <azonenberg> in any case tests should fail gracefully not segfault if vulkan goes haywire
<d1b2> <david.rysk> they have arm64 runners now
<d1b2> <azonenberg> oh cool that would be nice to add
<d1b2> <david.rysk> hmm are they available for FOSS though
<_whitenotifier-e> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <david.rysk> sorry about the noise, keep making typos
<d1b2> <david.rysk> > 500 mins/month Free for Public Repos
<d1b2> <david.rysk> oh wait, that's a third party offering lol
<d1b2> <azonenberg> lol
<d1b2> <azonenberg> anyway long term we want to be doing hardware in loop testing anyway
<d1b2> <azonenberg> we're going to end up with a mac mini on the rack at my place or something of that nature
<d1b2> <azonenberg> just focusing on the virtual environment first as it's easier to set up
<d1b2> <johnsel> try macos-13-arm64
<_whitenotifier-3> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <johnsel> it's weird though since the runner can access the internet presumably it should have an ip
<d1b2> <johnsel> even though xoa thinks it doesn't
<d1b2> <johnsel> presumably the management tools aren't picking it up
<_whitenotifier-e> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <azonenberg> huh
<d1b2> <david.rysk> seems to stick on "Waiting for a runner to pick up this job..."
<d1b2> <david.rysk> for macos-latest-arm64
<d1b2> <david.rysk> I'll need to test on real hardware to see why it doesn't pick up openmp
<d1b2> <azonenberg> On that note, openmp is something we will probably transition away from eventually since it has been annoying from time to time
<d1b2> <azonenberg> I dont have an immediate plan but most of the uses we have for it can be better served by application-managed threading or just pushing compute to the GPU
<_whitenotifier-e> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<_whitenotifier-e> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <johnsel> glfw init failed
<d1b2> <david.rysk> yeah you probably need lavapipe then
<d1b2> <david.rysk> and/or you need to set an env var for it to use it
<d1b2> <david.rysk> OR you need to passthru the GPU and install the GPU drivers
<d1b2> <johnsel> this vm should have both GPU and drivers
<d1b2> <johnsel> but it has multiple GPUs
<d1b2> <david.rysk> what's in, uh...
<d1b2> <david.rysk> /usr/share/vulkan/icd.d/
<d1b2> <johnsel> the physical nvidia gpu and some virtual xen gpu
<d1b2> <david.rysk> and what does vulkaninfo produce
<d1b2> <johnsel> can't access the machine right now
<d1b2> <david.rysk> could make the ci run those commands
<d1b2> <johnsel> I can't be arsed since I'll have to re-do the template anyway
<d1b2> <david.rysk> lol ok
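A quick way to check what the Vulkan loader inside a builder VM actually sees, without pulling in the full vulkaninfo tooling, is a tiny enumeration program like the sketch below. This is only an illustrative standalone C++ program using the standard Vulkan C API, not part of the CI scripts; the output format is made up, and whether it lists the NVIDIA GPU, the Xen virtual GPU, or only lavapipe depends on which ICD JSON files are present under /usr/share/vulkan/icd.d/.

    // icd_check.cpp -- list every physical device the Vulkan loader exposes.
    // Build with: g++ icd_check.cpp -lvulkan
    #include <vulkan/vulkan.h>
    #include <cstdio>
    #include <vector>

    int main()
    {
        VkApplicationInfo app{};
        app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
        app.pApplicationName = "icd-check";
        app.apiVersion = VK_API_VERSION_1_0;

        VkInstanceCreateInfo ci{};
        ci.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
        ci.pApplicationInfo = &app;

        VkInstance instance;
        if(vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS)
        {
            fprintf(stderr, "vkCreateInstance failed (no usable ICD?)\n");
            return 1;
        }

        uint32_t count = 0;
        vkEnumeratePhysicalDevices(instance, &count, nullptr);
        std::vector<VkPhysicalDevice> devices(count);
        vkEnumeratePhysicalDevices(instance, &count, devices.data());

        for(auto dev : devices)
        {
            VkPhysicalDeviceProperties props;
            vkGetPhysicalDeviceProperties(dev, &props);
            printf("device: %s (vendor 0x%04x, type %d)\n",
                props.deviceName, props.vendorID, (int)props.deviceType);
        }

        vkDestroyInstance(instance, nullptr);
        return 0;
    }

If only the software rasterizer should be used for a run, the loader's VK_ICD_FILENAMES environment variable can be pointed at the lavapipe ICD JSON to force that choice.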
<d1b2> <david.rysk> I'm working on the GH hosted templates right now
<d1b2> <johnsel> if you want to play with it you can do that
<d1b2> <david.rysk> though I'm working in the cmake-fixes branch which I shouldn't be lol
<d1b2> <azonenberg> @johnsel if you are redoing the template do you want to transition to bookworm?
<d1b2> <david.rysk> let me fix that
<d1b2> <azonenberg> since bullseye goes EOL in july
<d1b2> <johnsel> yeah I was thinking the same actually
<d1b2> <johnsel> since most of this misery came from the linux kernel not having sources
<d1b2> <johnsel> and having to use a backports kernel
<d1b2> <azonenberg> my work laptop and all of my other VMs are upgraded already
<_whitenotifier-3> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <azonenberg> my other machines were held back by a ceph bug that was fixed in november
<d1b2> <azonenberg> and i just havent wanted to reboot since then
<d1b2> <johnsel> did you run a dist-upgrade type thing?
<d1b2> <johnsel> I'm not familiar with debian
<d1b2> <azonenberg> i just bump to the new sources.list and full-upgrade
<d1b2> <david.rysk> @johnsel I recommend working in a branch for this CI work, as to not pollute master with crap
<_whitenotifier-3> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <johnsel> yeah we should do that at the same time
<d1b2> <johnsel> I did that on purpose to limit the amount of ci task backlog generated
<d1b2> <david.rysk> hahaha yeah
<d1b2> <johnsel> hehe
<d1b2> <david.rysk> my email box is full
<d1b2> <johnsel> though I guess I could've let it remain a PR now that I changed it so it can run those
<d1b2> <johnsel> I made an executive decision not to do that, I'm not going to change anything beyond what is changed now anyway
<d1b2> <johnsel> I don't think it's helpful to start changing things on 2 fronts
<d1b2> <johnsel> and the vm templates themselves need a little work first
<d1b2> <johnsel> I did set up msys2 and ehh wix on Windows
<d1b2> <david.rysk> what do you think of wix vs. alternatives
<d1b2> <johnsel> so that just needs a new cloudbase init script and then that can become self-hosted
<d1b2> <david.rysk> imo we should also have a portable install
<d1b2> <david.rysk> but I haven't looked at any of that
<d1b2> <johnsel> there is a portable version
<d1b2> <david.rysk> also we need to do macOS packaging. Do we have icon art?
<d1b2> <johnsel> though it has been broken for a while on linux
<d1b2> <johnsel> I believe the Windows portable is actually working
<d1b2> <david.rysk> for macOS icon art we need 16 × 16, 32 × 32, 48 × 48, 128 × 128, 256 × 256, 512 × 512 images
<d1b2> <david.rysk> and downscaling doesn't always look good
<d1b2> <david.rysk> also 1024x1024
<d1b2> <david.rysk> and 96x96
<d1b2> <johnsel> 1024x1024 jeez
<d1b2> <johnsel> that's more of a poster than an icon
<d1b2> <david.rysk> well it's meant for 2x scale use
<d1b2> <david.rysk> (HiDPI)
<d1b2> <johnsel> I know, but still
<d1b2> <johnsel> for your retina XDR display
<d1b2> <johnsel> or whatever they call it
<d1b2> <johnsel> liquid retina xdr even
<d1b2> <david.rysk> on some iPhones you also need 3x scale art
<d1b2> <azonenberg> no, we do not have an icon. let me find the ticket
<d1b2> <azonenberg> I don't know if @tetrikitty is interested / able
<d1b2> <azonenberg> if anyone else wants to draw one, we do need one :p
Degi_ has joined #scopehal
Degi has quit [Ping timeout: 252 seconds]
Degi_ is now known as Degi
<_whitenotifier-3> [scopehal] azonenberg opened issue #842: PCAP import filter - https://github.com/ngscopeclient/scopehal/issues/842
<_whitenotifier-3> [scopehal-apps] azonenberg commented on issue #669: File Dialog Bookmark does not work on OSX - https://github.com/ngscopeclient/scopehal-apps/issues/669#issuecomment-1913406177
<_whitenotifier-3> [scopehal-apps] azonenberg closed issue #669: File Dialog Bookmark does not work on OSX - https://github.com/ngscopeclient/scopehal-apps/issues/669
<_whitenotifier-3> [scopehal-apps] d235j commented on issue #621: macOS: Build with Homebrew-provided Vulkan dependencies - https://github.com/ngscopeclient/scopehal-apps/issues/621#issuecomment-1913407246
<_whitenotifier-3> [scopehal-apps] d235j commented on issue #622: macOS: ngscopeclient binary is missing RPATH load command - https://github.com/ngscopeclient/scopehal-apps/issues/622#issuecomment-1913407830
<d1b2> <david.rysk> @johnsel got mac and ubuntu CI working, now expanding ubuntu to include 20.04, 22.04, sdk, and no-sdk
<d1b2> <johnsel> nice
<d1b2> <johnsel> good work
<d1b2> <johnsel> I'm fighting the Xilinx DDR4 controller
<azonenberg> @johnsel so i wonder if it might make sense to have two classes of tests
<azonenberg> one that's "build in a non-gpu machine, dont run any tests, just checking for distro specific compile issues"
<azonenberg> as many of those as we want
<azonenberg> then one linux and one windows system with GPU that run all the tests
<d1b2> <johnsel> that makes sense yes
<azonenberg> (the idea being, given that the tests and drivers dont do much distro specific stuff, we're looking for bugs in our code and dont need to run those a zillion times)
<d1b2> <johnsel> I wanted to finish windows + linux first before thinking about that
<d1b2> <johnsel> there's the "do -we- have to run those" question
<d1b2> <johnsel> or do we outsource that to github
<azonenberg> That is entirely an option, assuming we dont run out of free build minutes
<azonenberg> Which is very much a possibility
<d1b2> <johnsel> if we want to self-host that is possible but we would have to make our templates
<d1b2> <johnsel> and with the current resource issue it's a definite no-go
<azonenberg> yeah I have cpu cycles to burn on my infrastructure; RAM is the only issue. Can we sequence runners so no more than N run concurrently?
<d1b2> <johnsel> I asked in the xcp-ng discord but those people are also entirely unhelpful
<azonenberg> (I'm busy on a board design and might bump the host reboot to tomorrow)
<d1b2> <johnsel> presumably when you buy a support contract that would change but f that
<azonenberg> yeah lol
<d1b2> <david.rysk> what are your thoughts on using lavapipe to run tests?
<d1b2> <johnsel> I'll have to think about it
<azonenberg> i mean if some corporate sponsor wants to buy me a xcp-ng commercial support contract i'll take advantage of it
<d1b2> <johnsel> we specifically bought GPUs because we want GPUs
<azonenberg> but i'm not spending my own money on it lol
<azonenberg> Yeah we want to replicate in an environment that's as realistic as possible
<azonenberg> also it'll be way faster
<d1b2> <johnsel> we'll also put real scopes in the network eventually
<azonenberg> Yeah exactly
<azonenberg> we want as close to real world deployments as we can
<d1b2> <johnsel> we really want this to function as a sort of testbed for what we could deploy as a solution eventually
<azonenberg> i'm not opposed to having one runner somewhere checking on llvmpipe and/or swiftshader just as "edge case vulkan implementation to find issues where we make assumptions about what the implementation is capable of"
<d1b2> <johnsel> with automated testing etc
<azonenberg> Yeah exactly
<d1b2> <johnsel> so yeah agreed with azonenberg, if you make it we can run them but we're not super interested in the output
<azonenberg> i mean i want to know if i broke something on another platform
<d1b2> <david.rysk> Are you still planning to allow any random person to make a PR and the selfhosted CI to run it?
<d1b2> <david.rysk> since that's one benefit of github CI
<d1b2> <johnsel> we have manual review for that
<azonenberg> Once approved by a maintainer, yes
<d1b2> <johnsel> you can't just run anything on ci
<d1b2> <david.rysk> my point is you won't need manual review for GH CI
<d1b2> <david.rysk> 😛
<d1b2> <david.rysk> just for the selfhosted runners
<d1b2> <johnsel> Well you need to review PRs anyway
<d1b2> <david.rysk> that allows for an automated first pass
<azonenberg> If and when we have a high enough volume of contributions that this becomes an issue we can reconsider
<d1b2> <johnsel> we can run other linux no problem on the self-hosted infra @david.rysk
<azonenberg> Yeah. And give them GPU or not as we see fit, as long as no more than two concurrent runners have GPUs
<d1b2> <johnsel> it's flexible enough that we can deploy something very quickly
<azonenberg> (but we'd likely run out of RAM before that became an issue)
<azonenberg> @johnsel what are your thoughts on making a new resource set for scopehal stuff and transitioning stuff over there to see if we can find and eliminate the leak that way?
<d1b2> <johnsel> Yep but adding another runner with a different OS is very simple, we use all the proper IaaS stuff that all distros support
<d1b2> <johnsel> that's a really good idea @azonenberg
<d1b2> <johnsel> although it wouldn't solve the root cause
<azonenberg> ok i'll do that after the reboot which i should do anyway because i'm due for patches
<d1b2> <david.rysk> my point is we use GH-hosted runners that don't require approval to catch stupid shit early
<azonenberg> so my thought is, if we suddenly see ghost usage after transitioning X over to it
<azonenberg> we know thats what did it
<d1b2> <david.rysk> then self-hosted runners that do require approval for more thorough testing
<d1b2> <johnsel> please also find time to click the "convert to template" button :p
<azonenberg> @david.rysk: what exactly would we be expecting the GH runners to catch if we don't manually review the output?
<azonenberg> ah right i can do that
<azonenberg> link the vm again?
<azonenberg> Converting
<d1b2> <johnsel> they're very slow though, and we'd be managing things in 2 places
<d1b2> <david.rysk> @azonenberg they will catch build errors and inform the PR submitter well ahead of when we are reviewing
<d1b2> <david.rysk> They're not to inform us but the PR submitter
<azonenberg> yeah the point is that i'm sick of waiting hours to get results from stuff
<azonenberg> the github runners are super slow
<azonenberg> and as we add more distros and extensive tests we are likely going to hit the cap on the free tier very soon
<d1b2> <david.rysk> We don't get many outside PRs and aren't merging them frequently, so I don't exactly mind them being slow for that use
<azonenberg> at which point we either pay for azure time or use my own hardware i'm already paying for
<d1b2> <david.rysk> We can only have them enabled for PRs
<d1b2> <david.rysk> And they only support Ubuntu, macOS, and Windows anyway, which we should support self-hosted too
<d1b2> <johnsel> That's a decent amount of management overhead, I'd rather work towards core contributors getting the ability to have their PRs run automatically
<d1b2> <johnsel> I think that makes more sense than going backwards
<d1b2> <david.rysk> hm what's the management overhead (on top of having extra stuff in the .yml)
<d1b2> <david.rysk> I mean if we're also running self-hosted runners with the same OS
<d1b2> <johnsel> they aren't entirely the same
<azonenberg> @johnsel FYI i made scopehal-ci-set2
<azonenberg> which should have the same perms as scopehal-ci-set and currently has no instances in it
<d1b2> <johnsel> I see it
<azonenberg> 2TB storage cap, 128GB RAM, 96 vcpus
<azonenberg> try slowly moving stuff into it and see if/when you see ghost usage
<azonenberg> this isnt meant to be a workaround for the issue so much as a way to identify what we did to trigger it
<d1b2> <johnsel> I'm not sure I can
<d1b2> <johnsel> yeah I can't do that
<d1b2> <johnsel> you need to
<d1b2> <johnsel> I can shut down the instances though
<azonenberg> ok yeah if you shut them down i can probably transition between sets myself
<azonenberg> then maybe we can at least see what the ghost is attached to
<azonenberg> and dig deeper
<d1b2> <johnsel> yes
<d1b2> <johnsel> done btw
<azonenberg> great ok gimme a few
<d1b2> <johnsel> I don't have networks and SRs
<d1b2> <johnsel> fyi
<azonenberg> in -set2?
<d1b2> <johnsel> correct
<d1b2> <johnsel> so that's scopehal, cidmz, and isos and the ci-sr
<d1b2> <johnsel> did you reboot yet, no right?
<azonenberg> i show ci-ceph-sr, installmedia-localisos, xcp-ng tools SRs attached
<azonenberg> to set2
<azonenberg> as well as scopehal and cidmz networks
<azonenberg> @david.rysk for background we are troubleshooting an issue in which the self hosted runners are in a resource pool that is showing ghost usage
<d1b2> <johnsel> my mistake
<d1b2> <johnsel> it pops up when I select a template
<azonenberg> i.e. quotas are being hit that show usage beyond the actual vms in the group
<d1b2> <johnsel> should we reboot first?
<d1b2> <johnsel> moving the vms may prevent us from knowing if that fixes it
<azonenberg> yeah probably a good idea i just wanted to patch a bunch of things at once
<azonenberg> to make the most of the downtime
<d1b2> <johnsel> yup no problem
<d1b2> <johnsel> just wanted to confer on what is best
<d1b2> <johnsel> as an FYI we should be able to get github based login working as well
<d1b2> <johnsel> to give others access to the ci infra
<azonenberg> oh cool
<azonenberg> ok so first things first
<azonenberg> xoa is upgrading from 5.85.0 to 5.89.0
<d1b2> <david.rysk> @azonenberg @johnsel I've been rewriting the Mac and Ubuntu GH-hosted templates
<d1b2> <david.rysk> Haven't touched Windows yet
<d1b2> <david.rysk> But setting up matrices to test all the various combinations of SDK and OS type
<azonenberg> monthly OS patching on xoa2 outer VM
<azonenberg> xen-orchestra updated to b1e879c release 5.90.0, building the update now
<azonenberg> Rebooting xoa2 VM
<d1b2> <david.rysk> fwiw installing texlive slows these runners down a lot
<azonenberg> yeah i'm thinking we want to bake that into the image
<azonenberg> vs installing every build
<d1b2> <johnsel> correct
<azonenberg> ok, xoa2 rebooted, same issue
<d1b2> <johnsel> interesting
<azonenberg> not surprised, but at least i got it updated and confirmed that wasn't the root cause
<d1b2> <johnsel> I think v2 already has texlive on there btw
<azonenberg> Now to update the host fully, finish a bit of work i'm doing in another instance, suspend all of my instances (including the one i'm IRCing from right now)
<azonenberg> and reboot it
<d1b2> <johnsel> Ah the xcp instance might still be the culprit
<azonenberg> yeah. i'm gonna do a full hard powerdown and up of the host, full hypervisor and dom0 reset
<azonenberg> and see what we get when it comes back
<azonenberg> gonna be a few mins while i wrap up what i'm doing in the other vm first
<azonenberg> ok instances going down and rebooting the host
Johnsel has quit [Quit: ZNC 1.8.2+deb3.1 - https://znc.in]
azonenberg has quit [Ping timeout: 260 seconds]
azonenberg has joined #scopehal
<azonenberg> ok back up and instances restarted
<azonenberg> Let's see what i see in your resource set
<azonenberg> No change
<d1b2> <johnsel> weird
<d1b2> <johnsel> move them over?
<azonenberg> WTAF, lol. There are two VMs in the resource set
<azonenberg> ci-orchestra with 4 vCPUs, 8GB RAM, 48GB disk
<azonenberg> ci_linux_builder_gpu_105 with 32 vCPUs, 60GB RAM, 48GB disk
<d1b2> <johnsel> yeah it's nuts
<azonenberg> And yet scopehal-ci-set is using 110 vCPUs and 149 GB of RAM in the accounting manager
<azonenberg> Which is of course physically impossible
<d1b2> <johnsel> I'd almost say that every template is still taking up its vCPUs and RAM
<azonenberg> i dont have enough ram for you to be using 149GB and still have my stuff running :p
<d1b2> <johnsel> right
<azonenberg> i wonder if "convert to template" is not clearing usage
<d1b2> <johnsel> that's what I said
<d1b2> <johnsel> Move the VMs over if you will
<azonenberg> what i mean is the template is not using resources
<azonenberg> it's the ghost of the VM that became the template
<d1b2> <johnsel> right
<azonenberg> ok let me see if i can figure out how to move them
<d1b2> <johnsel> I shut them down
<azonenberg> Moved orchestrator over. Now set2 is using 4 vCPUs and 8GB RAM as expected
<azonenberg> set1 is still using a ridiculous amount
<azonenberg> now to move the builder
<azonenberg> aaand yep. the original set is now using 74 vCPUs and 81 GB of ghost ram and 500GB of ghost storage
<azonenberg> with no vms in it
<d1b2> <johnsel> nuke it
<azonenberg> i was about to say, am i safe to delete the original set?
<d1b2> <johnsel> I guess we'll find out
<azonenberg> as a near term workaround we may just have to keep on making new self service sets to get rid of the leaks lol
<d1b2> <johnsel> (as far as I am aware, yes)
<azonenberg> there's no instances in it, lets see what happens
<azonenberg> Done
<d1b2> <johnsel> yep looks fine
<azonenberg> You're now using 36/96 vCPUs, 68/128GB RAM, 93G/2TB storage
<d1b2> <johnsel> which is correct
<azonenberg> Yeah. Next time you ask me to template something i'm going to try taking it out of the self service set first
<azonenberg> and see if that avoids the leak
<d1b2> <johnsel> You can do that
<d1b2> <johnsel> I was going to lower the amount of CPUs + RAM to 1 each
<d1b2> <johnsel> well if it is the templates at least we shouldn't have to deal with it too much
<azonenberg> Yeah
<d1b2> <johnsel> and you could just create a throwaway resource set
<azonenberg> I dont think it needs to be in a set at all unless you're adminning it
<azonenberg> i can just take it out of your set, template it, then give you access to the template
<azonenberg> i don't have resource sets for any of my vms
<d1b2> <johnsel> that might work
<azonenberg> we'll find out next time you need a template made
<d1b2> <johnsel> yep
<azonenberg> also, let me know if i can delete any of the old templates to save disk space at some point
<d1b2> <johnsel> sure once I have everything 100%
<d1b2> <johnsel> there's some old disks as well that I think are associated with templates
<azonenberg> although i am... not exactly short on storage space on the cluster now
<d1b2> <johnsel> but I can only see half of everything unfortunately with my account permissions so it's hard to tell what can be deleted
<azonenberg> (RAM is in shorter supply)
<d1b2> <johnsel> well, provisioned ram is in shorter supply
<d1b2> <johnsel> I'm sure most vms barely touch their ram
<azonenberg> Not sure about the builders, but the ones i've got for my infra are actually using a lot
<azonenberg> e.g. i get firefox oomkilled every day or two in my "general browsing" instance
<azonenberg> (which has 16GB)
<d1b2> <johnsel> that's pretty impressive
<d1b2> <johnsel> I have 80GB for my workstation
<d1b2> <johnsel> but I barely ever reach 40GB
<d1b2> <johnsel> then again I'm not overly interested in it
<azonenberg> this might have something to do with why it eats ram :p
<d1b2> <johnsel> I have 80GB not to have to think about it
<azonenberg> let's just say those windows have >1 tab each :p
<d1b2> <johnsel> yeah I do the same too
<d1b2> <johnsel> my usb controller crashed so I had to reboot
<d1b2> <johnsel> otherwise I'd have the same to show for you lol
<azonenberg> ok so lets see the CPU has 8 memory channels and is capable of ddr4 2933 but is currently loaded with 8x 32GB lrdimms of ddr4 2666
<azonenberg> mobo has 16 sockets and supports up to 4TB
<azonenberg> Looks like identical or near-identical dimms list for $134.99 on newegg now, so $1079.92 to double the ram. I'm not in a rush to do it, but good to have a number
<azonenberg> (and that way i'd never have to worry about memory capacity)
<azonenberg> I have plenty of CPU capacity, i'm averaging less than 25% across the board
<azonenberg> per core
<azonenberg> cumulative peak over the last 2 hours was around 900%
<azonenberg> so more memory = more instances = making more effective use of the cpu :p
<azonenberg> If i were to do a maintenance outage for hardware upgrades, i would likely throw a 40/100G NIC in there as well
<azonenberg> right now i have a single 10G link to the network core (of a dual port nic, one port lit up) with a .1q trunk shared by both VMs and storage access
<azonenberg> nearish term i'm thinking of transitioning some or all VM traffic to the second port of the nic leaving the first one exclusively for storage
<azonenberg> @johnsel are you now blocked on anything from me?
<azonenberg> or are all your near term needs wrt CI infra work met?
<d1b2> <johnsel> sounds good
<d1b2> <johnsel> and no I think I'm good now
<d1b2> <johnsel> you could increase my RAM back to 192
<d1b2> <johnsel> and CPUs to 128
<d1b2> <johnsel> at least I think those were the values
<d1b2> <johnsel> I'm testing with Windows right now
<d1b2> <azonenberg> done. but note that at this time, with whatever instances you have running
<d1b2> <azonenberg> i have 26GB of free ram
<d1b2> <azonenberg> that is not an ACL limit thats the actual amount i have free 🙂
<d1b2> <johnsel> Hmm
<d1b2> <azonenberg> your two builders have 60GB each lol
<d1b2> <azonenberg> Which is fine, i don't plan to start any more large instances in the near future
<d1b2> <johnsel> I'll work on minimizing that lol
<d1b2> <azonenberg> just be advised that you are close to maxing out the host 🙂
<d1b2> <johnsel> that's not sustainable if we want to run more distros
<d1b2> <azonenberg> Yeah
<d1b2> <johnsel> I'll keep it in mind
<d1b2> <johnsel> I thought you had about half reserved for CI
<d1b2> <azonenberg> (this is also why i wanted to double the ram on the host in the indefinite nearish future)
<d1b2> <azonenberg> I did lol
<d1b2> <johnsel> right
<d1b2> <azonenberg> i have 256G physical ram
<d1b2> <azonenberg> you're using 60G for each CI instance 🙂
<d1b2> <johnsel> thanks, sometimes I can't count
<d1b2> <azonenberg> experimentally, building with -j32 i only need about 20GB RAM for the actual build
<d1b2> <azonenberg> so 32 per node is likely very reasonable
<d1b2> <azonenberg> per instance*
<d1b2> <azonenberg> That would give you capacity to run 3-4 distros simultaneously on the current hardware
<d1b2> <johnsel> well if you use 20 then let's use 20
<d1b2> <azonenberg> and if i added another 256G, another eight
<d1b2> <johnsel> buttt
<d1b2> <azonenberg> no
<d1b2> <johnsel> I have a problem
<d1b2> <azonenberg> thats 20G used by the build
<d1b2> <azonenberg> above OS baseline
<d1b2> <johnsel> oh right
<d1b2> <johnsel> sorry it's 7:23am here
<d1b2> <johnsel> details are starting to get vague
<d1b2> <azonenberg> and you probably want a bit for disk cache and whatever in the OS
<d1b2> <johnsel> sure yeah
<d1b2> <azonenberg> 32 per builder i think is reasonable
<d1b2> <johnsel> we can do 24 or 32 gb
<d1b2> <johnsel> depending on how many other OS we want to run
<d1b2> <johnsel> but
<d1b2> <johnsel> I have a problem
<d1b2> <azonenberg> if that exceeds the capacity of our hardware, i'll throw hardware at it :p
<d1b2> <johnsel> can you figure out why it won't start?
<d1b2> <azonenberg> let's see
<d1b2> <johnsel> oh huh
<d1b2> <johnsel> it started now
<d1b2> <johnsel> but still the CI wasn't able to kick it off so please still look
<d1b2> <azonenberg> console showed no bootable device
<d1b2> <azonenberg> it was stuck on a bootloader/bios screen
<d1b2> <johnsel> yes but the vm itself did not start
<d1b2> <johnsel> that's another problem
<d1b2> <johnsel> but the vm went from starting back to off
<d1b2> <azonenberg> ok so do you want me to try manual starting and see what happens?
<d1b2> <johnsel> no need
<d1b2> <johnsel> I do wonder why it won't boot
<d1b2> <azonenberg> ok well i'll let you troubleshoot, i have to start the dishwasher and make a bit more progress on this PCB layout then get some sleep
<d1b2> <azonenberg> we can sync up tomorrow and see
<d1b2> <johnsel> sure thing
<d1b2> <johnsel> what pcb are you working on?
<d1b2> <johnsel> my next project is a clock generator for the ADC/JESD
<d1b2> <azonenberg> 100M/gig baseT (depending on how you stuff it) to baseT1 media converter
<d1b2> <johnsel> with an off-the-top-of-my-head 16GHz pll
<d1b2> <johnsel> should be fun
<d1b2> <johnsel> 15GHz
<d1b2> <azonenberg> For playing with automotive ethernet decodes in scopehal, as well as because we wanted some media converters for work to use on car gigs and don't seem to have any
<d1b2> <azonenberg> so we could buy some or i could make one for fun :p
<d1b2> <azonenberg> (and if i make it, i get low level phy register access and all kinds of things i probably wont get with a commercial one)
<d1b2> <azonenberg> then i have a quick and dirty oshpark ultrascale+ test board in the pipeline to see if my sketchy reballed aliexpress FPGA is still alive
<d1b2> <azonenberg> and the ultrascale+ based BERT
<d1b2> <johnsel> car gigs? that sounds interesting
<d1b2> <azonenberg> all in the pipeline at various stages of schematic capture (no layout for any of those, the media converter is ~2/3 done with layout)
<d1b2> <azonenberg> i mean we do embedded security at work in general
<d1b2> <azonenberg> we've done a fair number of automotive projects involving canbus and a few involving ethernet
<d1b2> <johnsel> aah for $dayjob
<d1b2> <johnsel> I thought you were doing some off the clock
<d1b2> <azonenberg> but we dont have any single pair ethernet dongles in the lab now
<d1b2> <azonenberg> every time we encountered it we bummed one off the client :p
<d1b2> <johnsel> lol
<d1b2> <azonenberg> So i figure i'll make the converter boards for fun, write a scopehal decode using them
<d1b2> <azonenberg> then leave them on my desk at the office for whoever needs one to use
<d1b2> <johnsel> do you work on request of automotive companies then I assume on these gigs?
<d1b2> <johnsel> if you can't answer that that's fine too
<d1b2> <johnsel> I don't want to pry
<d1b2> <azonenberg> I cant speak to specific projects (NDAs etc), but as a general rule we do projects on behalf of either OEMs, vendors for them, or prospective buyers
<d1b2> <johnsel> just find that they really need to with all the issues they have
<d1b2> <azonenberg> e.g. "I make X and want it tested before Y integrates it into a product"
<d1b2> <azonenberg> or "I make X, a product in itself, and want it tested"
<d1b2> <johnsel> that explains a lot actually
<d1b2> <azonenberg> or "I plan to buy a thousand X's across my company and want to do due diligence"
<d1b2> <johnsel> lots of integration issues in automotive
<d1b2> <azonenberg> Or even "I want to decide whether to buy X or Y and a security analysis of the two factors into our decision"
<d1b2> <johnsel> individual components are fairly well secured but then they tie them together and put an API key somewhere they shouldn't
<d1b2> <johnsel> Yeah sure that makes sense. I was just wondering if it was on request or self-directed research. It makes sense that it isn't the latter
<d1b2> <azonenberg> We do research on our own as well, either if it's something one of the guys finds interesting or if we as a company want to push on what we think is an unexplored market
<d1b2> <azonenberg> find a bunch of high profile bugs, publish them, suddenly all the players in that space want testing done :p
<d1b2> <johnsel> yep that's how the cookie crumbles
<d1b2> <azonenberg> But the actual billable work is all prearranged contracts
<d1b2> <azonenberg> we're not like bug bounty mercenaries
<d1b2> <johnsel> shame
<d1b2> <johnsel> you'd do well as one I think
<d1b2> <johnsel> from what I've seen though it makes little sense financially
<d1b2> <johnsel> maybe if you live in a third world country or something
<d1b2> <azonenberg> I like the stability of a salary, sales guys to find customers, project managers to yell at customers who don't give us the info we need to do our work, etc
<d1b2> <azonenberg> this is why i work for a consultancy and not solo
<d1b2> <azonenberg> and why i build T&M gear as a maybe-eventually-profitable sideline and not to earn a living
<d1b2> <johnsel> yep
<d1b2> <johnsel> weird, the windows vm works fine if I create it via the web ui
<d1b2> <azonenberg> But yeah, given the amount of reverse engineering involved in the job you can probably guess why i'm all in on scopehal lol
<d1b2> <azonenberg> lots of "i'm staring at a bus trying to figure out how to abuse something"
<d1b2> <azonenberg> one thing i plan to spend more time on in the indefinite future is visualizations and analysis / classification tools for getting the lay of the land
<d1b2> <johnsel> Yeah I definitely think RE-ing is one of the most interesting markets for scopehal and hw in general
<d1b2> <azonenberg> basically, i'm looking at an embedded device i know nothing about
<d1b2> <azonenberg> what's interesting?
<d1b2> <johnsel> yeah I have some ideas about that as well
<d1b2> <azonenberg> e.g. i get on an i2c bus, who's talking? how many devices are there? how often are they being accessed? what can you identify?
<d1b2> <johnsel> which reminds me that you still need to finish the file storage
<d1b2> <azonenberg> the s3? yes
<d1b2> <johnsel> and the scopesessions
<d1b2> <azonenberg> i had trouble getting the service to run and need to investigate that
<d1b2> <johnsel> so I can train my model
<d1b2> <azonenberg> oh. the file server is up and running
<d1b2> <azonenberg> I just dont have anything on it :p
<d1b2> <azonenberg> i need to clean up and upload some data to it
<d1b2> <johnsel> yes please
<d1b2> <azonenberg> i.e. re-save as latest scopesession format, make sure it doesnt segfault when loaded, make sure there's nothing sensitive or work related in it
<d1b2> <azonenberg> also use this as an opportunity to find any issues with data quality and things i might want to re-capture
<d1b2> <johnsel> it's a shame the Rigol needs to go back
<d1b2> <azonenberg> i think at least one of the sample ethernet waveforms has some truncated packets
<d1b2> <johnsel> I haven't had much time to capture sessions
<d1b2> <johnsel> I think with a few 100k points per type of signal it should be possible to train a model
<d1b2> <azonenberg> in general, i think i am probably going to do a lot of new data collection
<d1b2> <azonenberg> vs just cleaning up existing stuff
<d1b2> <johnsel> but I'm still thinking about how to deal with different clock speeds
<d1b2> <johnsel> well anything is welcome
<d1b2> <azonenberg> One of the first things i would like to see is a lowest level PHY classifier for an unknown signal
<d1b2> <johnsel> right now I don't have enough to do anything with
<d1b2> <johnsel> right, that's the plan
<d1b2> <azonenberg> what is the modulation (NRZ? MLT3? PAM4? PAM16?)
<d1b2> <azonenberg> and what is the symbol rate
<d1b2> <azonenberg> not even anything past that looking at line coding
<d1b2> <johnsel> yep start bottom up
<d1b2> <azonenberg> also, is it a baseband digital modulation at all
<d1b2> <johnsel> I actually think I can train a model to do much faster clock recovery as well
<d1b2> <azonenberg> or something analog/RF
<d1b2> <johnsel> yeah I do have a sdr, though it's not compatible with ngscopeclient
<d1b2> <azonenberg> well, it likely wont be useful for compliance testing or eye patterns but if it's good enough for protocol decoding and can run in parallel on a whole signal on a GPU, that would still be valuable
<d1b2> <azonenberg> yet
<d1b2> <azonenberg> we have a (WIP) UHD driver already and more will be coming
<d1b2> <johnsel> I might write a driver for it, the adi pluto
<d1b2> <azonenberg> i want to do IIO soon
<d1b2> <johnsel> that'd do it
<d1b2> <azonenberg> the antsdr supports both UHD and IIO depending on which firmware it's running
<d1b2> <azonenberg> UHD is being very unstable for me
<d1b2> <azonenberg> i am going to test with a legit ettus i'm borrowing from a friend and see if it's my code or the antsdr
<d1b2> <johnsel> but yeah if we collect this data I definitely can design and train a model
<d1b2> <azonenberg> if it's the SDR i'm gonna see if it's any more stable with IIO
<d1b2> <johnsel> or probably several models
<d1b2> <johnsel> that would be a very interesting expansion in ngscopeclient I think
<d1b2> <azonenberg> Yeah. I think the first step is "RF or baseband modulation?"
<d1b2> <johnsel> I know aartos are doing ai based signal recognition
<d1b2> <azonenberg> second is "what modulation and symbol rate"
<d1b2> <johnsel> and decoding as well I suspect
<d1b2> <johnsel> of wifi for one
<d1b2> <johnsel> bt
<d1b2> <johnsel> drone 5.8ghz
<d1b2> <azonenberg> I (and @lainpants ) would likely be very interested in anything that can do real time or close to real time RF classification as well
<d1b2> <johnsel> for sure real time
<d1b2> <azonenberg> i.e. given an RTSA sweep or something, what's talking?
<d1b2> <azonenberg> the more accurate you can get the better, if you can say this is specifically a huawei 5G base station with model number 12345 that's ideal :p
<d1b2> <johnsel> it all starts with having enough data
<d1b2> <azonenberg> but at least "this is LTE", "this is an unknown 433 MHz ISM band device transmitting 100 Kbps BFSK"
<d1b2> <johnsel> I mean I'd love for us to start working on a collection of signals and their decoded counterpart
<d1b2> <azonenberg> Yes. and eventually integrate that into unit tests
<d1b2> <azonenberg> i want to be able to provide a pcap and a scopesession of an ethernet or usb session, some metadata like "ignore the first 8 packets of the pcap the scope didn't trigger yet"
<d1b2> <johnsel> I am much more familiar with writing the dnn than the collection side of things
<d1b2> <azonenberg> and have it validate that our decode is correct
<d1b2> <azonenberg> so we can do automated regression testing of all kinds of filters and protocol decodes
<d1b2> <azonenberg> That's a long way out but that is the dream
<d1b2> <johnsel> Yeah it's definitely on my roadmap
<d1b2> <johnsel> Just need to get some base data to start with
<d1b2> <azonenberg> Yeah I'll work on it. Soon
<d1b2> <azonenberg> Too much on my plate lol
<d1b2> <johnsel> Yeah I know, it was a statement of fact not to stress you out or anything
<d1b2> <johnsel> I may switch to start with SDR
veegee has quit [Quit: Textual IRC Client: www.textualapp.com]
veegee has joined #scopehal
<d1b2> <246tnt> I think I understand what triggers the hang on intel. I'm not 100% sure if this should work and the driver is at fault or if that pattern of access is UB but at least I can work around it.
bvernoux has joined #scopehal
<d1b2> <azonenberg> So just to be clear (also + @david.rysk since we talked about this earlier)
<d1b2> <azonenberg> this is a separate bug
<d1b2> <azonenberg> Which I think we're tracking as https://github.com/ngscopeclient/scopehal-apps/issues/325
<d1b2> <azonenberg> and is not the same as "intel has max 64k thread blocks so we need to use 2D grid for >2M point waveforms" shader issue
<d1b2> <azonenberg> (which i need to file a ticket for as i don't think we have one)
<d1b2> <246tnt> Ah yeah, I wasn't referencing this 64k limit at all.
<d1b2> <azonenberg> yeah but there was some confusion
<d1b2> <246tnt> I didn't even read about it in the backlog 😅
<d1b2> <azonenberg> (i think that may have been private signal convo not this chat but anyway)
<d1b2> <azonenberg> Anyway so it sounds like we have two bugs
<d1b2> <david.rysk> when I tested on intel I only ran into the 64k limit, but the github issue makes this sound like a newer bug
<d1b2> <david.rysk> (meaning, tripped up by newer Intel drivers or hardware)
<_whitenotifier-3> [scopehal] azonenberg opened issue #843: Refactor all shaders to handle Vulkan implementations with 64K max thread blocks in X axis - https://github.com/ngscopeclient/scopehal/issues/843
<_whitenotifier-3> [scopehal-apps] azonenberg opened issue #675: Refactor all shaders to handle Vulkan implementations with 64K max thread blocks in X axis - https://github.com/ngscopeclient/scopehal-apps/issues/675
<azonenberg> Ok so, tickets filed for the 64k issue as it seems we didn't have one
<azonenberg> 843/675 are the same issue just frontend vs backend
<azonenberg> and 325 is an unrelated problem
<d1b2> <azonenberg> @246tnt anyway go on, what's the problem?
<d1b2> <johnsel> it's that race condition right
<d1b2> <johnsel> from a few months ago
<azonenberg> Also if this access pattern is UB i would expect even on nvidia the vulkan validation layer should complain about it?
<azonenberg> (if it fails that might be something we can get khronos to add checks for)
<d1b2> <david.rysk> @johnsel CI update: macOS working, Linux Ubuntu WIP, I'm splitting out docs since installing all the LaTeX dependencies seriously slows down the runners
<azonenberg> yeah if we are doing GH runners as a "quick check" we dont need to build docs or upload binary artifacts etc
<d1b2> <david.rysk> uploading binary artifacts is quick, docs aren't :p
<azonenberg> (we also don't need to build docs on every platform)
<d1b2> <david.rysk> Yeah I was thinking we just do it on latest Ubuntu
<azonenberg> and our linux selfhosted image should have / might already have texlive preinstalled on the base image
<azonenberg> that we clone to make each test vm
<d1b2> <johnsel> hmm
<d1b2> <johnsel> I think I've argued this before but it's stupid that we're building them for every app change anyway
<azonenberg> Yeah
<d1b2> <johnsel> let the owner repo build it on change instead
<azonenberg> there is definitely room to optimize how much we build for CI testing
<d1b2> <johnsel> we can link 'latest' artifacts
<d1b2> <246tnt> From my reading of memory access to shared variables, you're not supposed to do "incoherent" access to them. So you basically need barriers to make sure there aren't any pending writes before either reading or rewriting the same shared location. So what I think is happening is that, on the GPU, if you have something like: if (cond) { do_A; } else { do_B; } it will execute both do_A and do_B and just mask off any writes on the inactive
<d1b2> block. But if in both the do_A and do_B blocks you have writes to a shared variable, like say g_done, and some work items in a work group take the do_A path and some others take the do_B path, you can end up with a write to a shared variable that already has a pending write to it. Because some work items will have issued a write to g_done while executing do_A and some others will issue a write to it when executing do_B. And it seems to not matter that both
<d1b2> writes actually write true; it ends up screwing up the access.
<d1b2> <johnsel> so refs need not go out of date
<azonenberg> Interesting. So what's the solution, adding barriers?
<d1b2> <246tnt> So doing something like : if (cond) { do_A; }; barrier; if (~cond) { do_B; } Solves the issue.
<azonenberg> And only the rendering shader is impacted as far as you know right?
<azonenberg> (most of our other filter shaders are simple number crunching with few if any conditionals)
<d1b2> <246tnt> What I ended up doing is a bit different and just setting a local flag in do_A / do_B and performing the global write at the end. Which avoids adding a new barrier.
<_whitenotifier-3> [scopehal-apps] d235j synchronize pull request #673: Cmake cleanups - https://github.com/ngscopeclient/scopehal-apps/pull/673
<d1b2> <246tnt> (This code still adds a barrier but it's unrelated ... that one was always needed and always missing and doesn't solve the issue by itself)
<azonenberg> yeah makes sense
<azonenberg> and yeah the vulkan memory model definitely has footguns in it lol
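The affected code is the GLSL waveform rendering shader; the sketch below is only a CUDA-flavored analogy of the pattern described above, since the same concepts (work-group shared memory, barriers, divergent branches) exist there. The kernel names and data are made up; the point is the contrast between both divergent branches writing the shared g_done flag directly, versus each thread keeping a local flag and issuing one grouped write in the common path before a barrier.

    // Illustrative only: not the actual scopehal-apps shader, just the access
    // pattern under discussion, written as CUDA so it compiles standalone.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void racyPattern(const int* data, int n)
    {
        __shared__ bool g_done;
        if(threadIdx.x == 0)
            g_done = false;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Divergent branches that BOTH write the shared flag: some lanes write
        // it from path A, others from path B, with no barrier between the two
        // writes. This mirrors the access pattern that hung the Intel driver.
        if(i >= n || data[i] < 0)
            g_done = true;      // path A
        else if(data[i] == 0)
            g_done = true;      // path B
        __syncthreads();
        // ...the rest of the work group would consult g_done here...
    }

    __global__ void groupedPattern(const int* data, int n)
    {
        __shared__ bool g_done;
        if(threadIdx.x == 0)
            g_done = false;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Each thread only sets a private flag inside the divergent branches...
        bool done_local = false;
        if(i >= n || data[i] < 0)
            done_local = true;
        else if(data[i] == 0)
            done_local = true;

        // ...and the shared variable is written from one common code path,
        // with a barrier before any work item reads it or writes it again.
        if(done_local)
            g_done = true;
        __syncthreads();
        // ...the rest of the work group would consult g_done here...
    }

    int main()
    {
        int* d = nullptr;
        cudaMalloc(&d, 256 * sizeof(int));
        cudaMemset(d, 0, 256 * sizeof(int));
        racyPattern<<<1, 256>>>(d, 256);
        groupedPattern<<<1, 256>>>(d, 256);
        cudaDeviceSynchronize();
        cudaFree(d);
        printf("done\n");
        return 0;
    }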
<d1b2> <246tnt> I'll open a PR.
<azonenberg> Is this tested to the point you're ready to send a PR yet?
<azonenberg> ah
<d1b2> <246tnt> Well ... Usually I can trigger it in like ... 2 sec without the patch. And with the patch I played with it for a minute without it crashing.
<azonenberg> well congrats on finding what looks like a truly nasty and subtle bug lol
<d1b2> <246tnt> I was trying to reproduce the issue in a synthetic test program but unable to do so ATM.
<azonenberg> yeah it's probably dependent on some specific bits of intel microarchitecture and access patterns
<azonenberg> having multiple in flight writes to the same shared memory location with some particular uarch state confuses something
<azonenberg> might even only fail on one particular gpu uarch or something lol
<d1b2> <johnsel> not sure how useful this is to anyone but I have this bookmarked
<azonenberg> @johnsel so that's a *different* vulkan footgun
<azonenberg> sync within a pipeline from one shader or memory operation to another
<azonenberg> That is something the validation layers do a good job of catching
<azonenberg> what tnt found is related to memory ordering within different threads of a single shader
<d1b2> <johnsel> yes I understood, just thought I'd share since it's a decent enough summary
<azonenberg> yeah that page is a good reference and i use it from time to time
<azonenberg> Also speaking of, we do still have infrequent validation errors related to queue synchronization
<azonenberg> but i havent root caused them yet
<_whitenotifier-3> [scopehal-apps] smunaut opened pull request #676: Fix Hang on intel iGPUs - https://github.com/ngscopeclient/scopehal-apps/pull/676
<d1b2> <johnsel> vivado vivado vivado
<d1b2> <johnsel> whatever do you mean
<_whitenotifier-e> [scopehal-apps] azonenberg commented on pull request #676: Fix Hang on intel iGPUs - https://github.com/ngscopeclient/scopehal-apps/pull/676#issuecomment-1913657943
<azonenberg> Tested and works fine for me on nvidia
<d1b2> <azonenberg> @david.rysk can you test that PR on your intel box? if it works the same or better then i'll merge
<d1b2> <david.rysk> It would take me some time since I haven’t been using that particular box in a while and most of my remaining ones are Windows
<d1b2> <azonenberg> In that case i'll merge as is. it seems unlikely to cause regressions
<_whitenotifier-3> [scopehal-apps] azonenberg closed pull request #676: Fix Hang on intel iGPUs - https://github.com/ngscopeclient/scopehal-apps/pull/676
<_whitenotifier-e> [scopehal-apps] smunaut 8bb6522 - Waveform rendering shader: Add missing barrier for termination flag init Once the init of the termination flag is done, we need to make sure all threads see it before continuing. Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
<_whitenotifier-3> [scopehal-apps] azonenberg pushed 2 commits to master [+0/-0/±2] https://github.com/ngscopeclient/scopehal-apps/compare/a40211b729c9...0f265a9dd045
<_whitenotifier-3> [scopehal-apps] smunaut 0f265a9 - Waveform rendering shader: Group write to termination flag Instead of writing to the termination flag from different point in the code, which might lead to some writes being issued while there are already pending writes from other threads coming from other code path. This seems to be an issue on intel at least. So instead each thread keeps a local flag and then in the
<_whitenotifier-3> common execution path, that flag might trigger a write to the shared variable and immediately after a barrier will be issued before any work item uses the flag or try to issue another write to it. Fixes #325 Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
<_whitenotifier-3> [scopehal-apps] azonenberg closed issue #325: GPU hang on iris Plus driver - https://github.com/ngscopeclient/scopehal-apps/issues/325
<d1b2> <david.rysk> @azonenberg @johnsel would it be possible to keep test/fast or so fast enough to include in the GH hosted CI?
<d1b2> <david.rysk> I think I finally have the Ubuntu targets working (with a 5 build matrix testing standalone vs. repo SDK, and docs separately)
<d1b2> <david.rysk> (need SDK because glslc was added to Ubuntu in 22.10 or 23.04)
<d1b2> <david.rysk> also Mac (x86_64) has been working but I have to troubleshoot some detection on my own machine
<d1b2> <david.rysk> has been working on CI*
<d1b2> <david.rysk> I haven't touched Windows yet but that's the next big piece
<d1b2> <david.rysk> also throwing on caching of the standalone SDK and ccache, we'll see how much that speeds up repeated builds 🙂
<_whitenotifier-e> [scopehal] iromero91 forked the repository - https://github.com/iromero91
<cyborg_ar> fork it
<cyborg_ar> hp662xA driver is almost done....
<cyborg_ar> azonenberg: on PowerSupplyChannel.cpp, line 85. couldnt there be some way of refraining from sending a setpowervoltage command if the exact same value was sent in the previous frame? Im assuming refresh is what gets called every frame
<azonenberg> refresh gets called every filter graph update
<azonenberg> it's async to rendering
<azonenberg> but anyway, i think the proper solution is actually to do it in the driver
<azonenberg> make setpowervoltage be a no-op if the value being sent is the same as what was sent previously
<azonenberg> (since you can also set voltage via the API or GUI etc)
<azonenberg> just make it start as "unknown" then once you call it, save the last value you set
<azonenberg> and only actually send commands to the device if different
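A minimal sketch of that caching idea, assuming a hypothetical driver class rather than the real scopehal PSU driver API (the SetVoltageSetpoint method and the VSET command string are illustrative): start with the last-sent value unknown, then skip the write whenever the requested setpoint matches what was already sent.

    // Hypothetical driver sketch (not the actual scopehal API): remember the
    // last setpoint pushed to the instrument, start as "unknown", and turn the
    // set call into a no-op when the value has not changed.
    #include <cmath>
    #include <cstdio>
    #include <map>
    #include <optional>
    #include <string>

    class ExamplePSUDriver
    {
    public:
        void SetVoltageSetpoint(int channel, double volts)
        {
            auto& last = m_lastVoltageSent[channel];
            if(last && (std::fabs(*last - volts) < 1e-6))
                return;     // identical to what we already sent, skip the round trip

            SendCommand("VSET " + std::to_string(channel) + "," + std::to_string(volts));
            last = volts;   // cache what we actually sent
        }

    private:
        // Placeholder for the real transport write (queued in practice)
        void SendCommand(const std::string& cmd)
        { printf("TX: %s\n", cmd.c_str()); }

        // Last value sent per channel; empty optional means "unknown"
        std::map<int, std::optional<double>> m_lastVoltageSent;
    };

The cache would also need to be reset to "unknown" whenever the setpoint can change behind the driver's back (front panel, another client), which is where the readback questions that follow come in.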
<cyborg_ar> well i would like the gui one to work no matter what. there are also some wrinkles related to instruments not being able to accept every set point. this HP will round your SP to the nearest possible set unit on readback so it will always not match if you want to set in between two DAC units. if im caching the setpoint there is an argument for storing both the last sent and the last read back setpoints...
<azonenberg> Yeah. A lot of stuff we do cache readback results, but if it's something that can change every time we don't
<azonenberg> e.g. we cache the v/div on a scope in most (should be all?) drivers
<azonenberg> but we don't cache live readings from a meter for obvious reasons
<cyborg_ar> also im seeing there is no infrastructure for limits on settings, iirc the hp will raise a fault and ignore the command if you set an invalid value
<azonenberg> Yes. That's something we've been thinking about for a while
<azonenberg> but havent defined APIs for yet
<azonenberg> By all means open a ticket to track that
<azonenberg> for scopes we track limits on some settings like sample rate and memory depth
<azonenberg> but we dont have anywhere near all APIs exposing limits right now
<cyborg_ar> also, due to the way the power supply works, changing the vset or iset can affect the other, with the last set taking priority, since the SOA of the power supply is L shaped
<azonenberg> oh fun. So in that case what you might need to do is do a readback after sending a set command
<cyborg_ar> yep
<azonenberg> to see what the actual set points are
<cyborg_ar> but that busts the queueing
<azonenberg> Yeah. It's a nontrivial problem
<cyborg_ar> tbh queueing in psu setpoints doesnt make much sense
<azonenberg> one option there would be to implement clientside limiting in the driver
<azonenberg> making it aware of the SOA
<azonenberg> and the point of queueing is mostly so that when you apply a change in the GUI
<azonenberg> you dont want to block the rendering thread
<azonenberg> you want to push the command into a buffer and have it execute asynchronously
<azonenberg> imagine the instrument is over a VPN 500ms away
<azonenberg> you dont want the gui to lock up for a second getting a reply back
<azonenberg> so every chance we have to avoid round trips, we take
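A generic sketch of that queueing pattern, not the actual scopehal SCPITransport implementation (class and method names here are illustrative): the GUI thread appends write-only commands to a buffer and returns immediately, while a worker thread drains the buffer and absorbs the round-trip latency.

    // Minimal command queue: GUI thread never blocks on instrument I/O.
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    class CommandQueue
    {
    public:
        CommandQueue()
            : m_run(true)
            , m_worker(&CommandQueue::WorkerThread, this)
        {}

        ~CommandQueue()
        {
            {
                std::lock_guard<std::mutex> lk(m_mutex);
                m_run = false;
            }
            m_cv.notify_one();
            m_worker.join();
        }

        // Called from the GUI thread: appends and returns without touching the
        // (potentially slow) instrument link
        void QueueCommand(const std::string& cmd)
        {
            {
                std::lock_guard<std::mutex> lk(m_mutex);
                m_pending.push(cmd);
            }
            m_cv.notify_one();
        }

    private:
        void WorkerThread()
        {
            std::unique_lock<std::mutex> lk(m_mutex);
            while(m_run || !m_pending.empty())
            {
                m_cv.wait(lk, [this]{ return !m_run || !m_pending.empty(); });
                while(!m_pending.empty())
                {
                    std::string cmd = m_pending.front();
                    m_pending.pop();

                    // Do the slow network/GPIB I/O without holding the lock
                    lk.unlock();
                    SendToInstrument(cmd);
                    lk.lock();
                }
            }
        }

        // Stub standing in for the real transport write
        void SendToInstrument(const std::string& cmd) { (void)cmd; }

        std::mutex m_mutex;
        std::condition_variable m_cv;
        std::queue<std::string> m_pending;
        bool m_run;
        std::thread m_worker;
    };

Query-type commands that need a reply still have to synchronize with this queue (flush it, then hold the transport lock across the request/reply pair), which is exactly the part that gets awkward on GPIB below.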
<cyborg_ar> yeah its hard on GPIB because it is not packet based
<cyborg_ar> on packet based protocols you can keep track of which packet is a response to which
<azonenberg> i mean most scpi transports arent packet based. VICP kinda-sorta is
<azonenberg> VXI-11 sorta is but doesnt support pipelining
<d1b2> <johnsel> well, the underlying transport is
<cyborg_ar> if youre talking to 5 different instruments over ethernet you dont have to wait for one to respond before talking to the next one, on gpib i dont know if you can interleave that way, you might be able to
<azonenberg> Oh *that*. Yeah, transports support mutexing and if we are working with multiple instruments we may need to lock things when we are waiting for a reply
<azonenberg> there is some mutexing already to prevent the gui and backend threads from stepping on each other
<azonenberg> unsure if that is sufficient for multiple concurrent gpib connections
<azonenberg> in particular i dont know what happens with linux-gpib if you have two devices being accessed at once
<azonenberg> can you only have one handle open?
<azonenberg> do we need to implement that arbitration ourself?
<cyborg_ar> i need to check that, i'll write a simple second driver for another instrument and check...
<d1b2> <johnsel> there's a multidevice section
<d1b2> <johnsel> it looks like you might be able to e.g. set up each instrument to give a poll response and then parallel poll them
<d1b2> <johnsel> that's not really what we mean by multiple concurrent connections though
<azonenberg> Yeah we generally expect independent connections
<azonenberg> we may need to make some kind of abstraction layer for gpib that gives you virtual circuits to each instrument
<azonenberg> but this could be tricky for ancient non-scpi gear, where it's not trivial to tell what's a request vs a reply the way it is with the scpi format
<d1b2> <johnsel> While an asynchronous operation is in progress, most library functions will fail with an EOIP error
<d1b2> <johnsel> still not entirely what we want
<azonenberg> yeah
<azonenberg> anyway, near term maybe we should add a section to the docs
<azonenberg> saying that we do not currently support simultaneous connections to two or more devices on the same gpib network segment
<azonenberg> i.e. we can have as many instruments open as we want as long as no more than one at a time is on a single gpib channel
<azonenberg> if you have two different gpib interfaces you can run both simultaneously without issue
<d1b2> <johnsel> yeah
<d1b2> <johnsel> ibbna() changes the GPIB interface board used to access the device specified by ud
<azonenberg> anyway, so at this point my recommendation is file a ticket against scopehal to improve this in the future
<d1b2> <johnsel> I think you can just read/write to different devices
<d1b2> <johnsel> if you do it sync
<azonenberg> and add a note to documentation saying you can only have one concurrent driver accessing a single gpib interface board at once
<azonenberg> and yeah the challenge is doing it async :p
<azonenberg> and making it so drivers dont step on each other
<azonenberg> getting replies from one and requests from a different one stepping on each other
<d1b2> <johnsel> I don't think that is possible
<d1b2> <johnsel> there's a ATN line
<d1b2> <johnsel> so if the drivers run in the same thread I think the library takes care of talking to one device at a time
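For reference, a small sketch of the synchronous case using the linux-gpib ib API (board 0 and the two primary addresses are examples only; error handling is mostly omitted). Sequential request/reply pairs to two devices on the same board are the case expected to work:

    #include <cstdio>
    #include <gpib/ib.h>    // linux-gpib C API

    int main()
    {
        // Open handles to two devices on board 0, primary addresses 5 and 7.
        // T3s = 3 second timeout, send EOI on last byte, no EOS character.
        int psu   = ibdev(0, 5, 0, T3s, 1, 0);
        int meter = ibdev(0, 7, 0, T3s, 1, 0);
        if( (psu < 0) || (meter < 0) )
            return 1;

        char buf[256];

        // Synchronous, one request/reply at a time: the board only ever
        // addresses one talker/listener pair at any given moment.
        ibwrt(psu, "*IDN?", 5);
        ibrd(psu, buf, sizeof(buf)-1);
        buf[ibcnt] = '\0';
        printf("psu: %s\n", buf);

        ibwrt(meter, "*IDN?", 5);
        ibrd(meter, buf, sizeof(buf)-1);
        buf[ibcnt] = '\0';
        printf("meter: %s\n", buf);

        return 0;
    }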
<azonenberg> drivers dont have a thread
<azonenberg> some driver methods can be called from the gui thread
<azonenberg> then for psus, meters, etc there's a background thread polling for readback updates
<azonenberg> (in ngscopeclient, scopehal doesnt thread drivers at all)
<azonenberg> each driver has its own background thread, they're fully independent
<azonenberg> the scpitransport has a mutex that serializes access between these threads so a split request/reply or request/binary data readback works
<d1b2> <johnsel> yeah that might clash then
<azonenberg> this is also the layer that adds the queueing so the gui thread doesnt have to lock the mutex at all for write-only calls
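Roughly, the serialization being described looks like this (illustrative names, not the actual SCPITransport interface): the lock is held across the whole round trip so two driver threads cannot interleave their traffic on one connection.

    #include <mutex>
    #include <string>

    class ExampleTransport
    {
    public:
        // Write-only path: can go through the queue, no reply expected
        void SendCommand(const std::string& cmd);

        // Request/reply path: hold the lock for the full round trip so a
        // reply meant for one driver thread can't be read by another
        std::string SendCommandWithReply(const std::string& cmd)
        {
            std::lock_guard<std::mutex> lock(m_transportMutex);
            SendCommand(cmd);
            return ReadReply();
        }

    protected:
        std::string ReadReply();
        std::mutex m_transportMutex;
    };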
<azonenberg> What we might need to do for gpib is make a global gpib interface object, one per application instance
<azonenberg> that all gpib transports call out to
<azonenberg> and have it serialize requests
<azonenberg> but even this won't solve the problem of multiple ngscopeclient instances each trying to control a different subset of instruments
<azonenberg> we may just give up and say if you want to do that, move them to separate dongles or use a more modern transport
<d1b2> <johnsel> an array then I guess since you can have multiple?
<azonenberg> yeah. one global interface manager object per physical bus
<azonenberg> then each gpib transport instance would find the one for its bus and talk to it
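A sketch of the per-bus manager idea being proposed here (this is the plan under discussion, not existing scopehal code; the class and method names are invented):

    #include <map>
    #include <memory>
    #include <mutex>

    class GPIBBusManager
    {
    public:
        // All transports sharing this board funnel their round trips
        // through this mutex so requests/replies don't interleave
        std::mutex& GetBusMutex()
        { return m_busMutex; }

        // One manager per board index, created on first use
        static std::shared_ptr<GPIBBusManager> GetForBoard(int boardIndex)
        {
            static std::mutex s_tableMutex;
            static std::map<int, std::shared_ptr<GPIBBusManager>> s_managers;

            std::lock_guard<std::mutex> lock(s_tableMutex);
            auto it = s_managers.find(boardIndex);
            if(it == s_managers.end())
                it = s_managers.emplace(boardIndex, std::make_shared<GPIBBusManager>()).first;
            return it->second;
        }

    protected:
        // Note: this only arbitrates within one process; multiple
        // ngscopeclient instances sharing a bus would still conflict,
        // as noted above.
        std::mutex m_busMutex;
    };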
<d1b2> <johnsel> yeah that makes sense
<azonenberg> anyway, tl;dr this is nontrivial and we definitely need to file a ticket and plan to do it
<azonenberg> but its not happening today
<azonenberg> can you file it?
<d1b2> <johnsel> I was looking at another DDA as well but just the gpib alone makes me hesitant
<d1b2> <johnsel> it's so clunky compared to ethernet
<azonenberg> the DDA5 series have ethernet
<d1b2> <johnsel> sure I'll file it
<azonenberg> i cant speak to the older ones
<d1b2> <johnsel> yes that was an older 2GHz one
<azonenberg> dda5005a is a pentium 2 or 3 running windows 2000 or xp iirc
<d1b2> <johnsel> I did send the ebay guy an email what his lowest price is
<d1b2> <johnsel> xp indeed
<azonenberg> you can do full windows rdp to it, or use lecroy's vicp protocol
<azonenberg> which is one of the better scpi transports IMO
<azonenberg> in terms of supporting multiple outstanding operations, proper framing, etc
<d1b2> <johnsel> yes it's unfortunate but I may get another one eventually when I have proper budget
<d1b2> <johnsel> anyway I plan to do some videos first and go the beg vendors for kit route as well
<azonenberg> @johnsel also can you add a note to scopehal-docs under the gpib section
<azonenberg> just summarizing the state of this?
<azonenberg> basically, works fine if you are talking to only one device from one ngscopeclient session
<azonenberg> multiple devices in one session is planned but doesnt work yet
<azonenberg> multiple devices in multiple sessions likely wont happen
<_whitenotifier-3> [scopehal] iromero91 opened pull request #844: Add driver for HP 662xA power supplies - https://github.com/ngscopeclient/scopehal/pull/844
<cyborg_ar> :D
<cyborg_ar> i think i'll try the driver for the DL1540 now
<cyborg_ar> wanna see some waveforms
<_whitenotifier-e> [scopehal] azonenberg closed pull request #837: SiglentSCPIOscilloscope implement SDS5000X specific sample rate and sample depth - https://github.com/ngscopeclient/scopehal/pull/837
<_whitenotifier-3> [scopehal] azonenberg 1dd55af - Merge pull request #837 from codepox/master SiglentSCPIOscilloscope implement SDS5000X specific sample rate and sample depth
<_whitenotifier-e> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±2] https://github.com/ngscopeclient/scopehal/compare/bf1476fe97f8...1dd55aff4c5c
<_whitenotifier-e> [scopehal] Johnsel opened issue #845: Improved GPIB support by implementing global GPIB interface handlers - https://github.com/ngscopeclient/scopehal/issues/845
<d1b2> <johnsel> yes sir
<d1b2> <johnsel> or lady
<d1b2> <johnsel> cyborg_ar, help me out here, a board is not the usb/pcie interface right?
<cyborg_ar> hmm not sure
<d1b2> <johnsel> I think it is
<d1b2> <johnsel> can you show me your connection string?
<d1b2> <johnsel> nvm I got it already, it is
<_whitenotifier-e> [scopehal-docs] Johnsel opened pull request #78: Updated GPIB interface limitations - https://github.com/ngscopeclient/scopehal-docs/pull/78
<_whitenotifier-e> [scopehal-docs] Johnsel closed pull request #78: Updated GPIB interface limitations - https://github.com/ngscopeclient/scopehal-docs/pull/78
<_whitenotifier-3> [scopehal-docs] Johnsel 38b64e9 - Updated GPIB interface limitations (#78) * Update section-transports.tex Added some information about currently supported GPIB usage. * Update section-transports.tex Clarified a detail about gpib and multiple ngscopeclient sessions
<_whitenotifier-e> [scopehal-docs] Johnsel pushed 1 commit to master [+0/-0/±1] https://github.com/ngscopeclient/scopehal-docs/compare/c28db6f4dc97...38b64e993d3c
<d1b2> <johnsel> there
Johnsel has joined #scopehal
<d1b2> <johnsel> andrew: can you make me another template?
<azonenberg> kk
<d1b2> <johnsel> stares at resource usage screen
<azonenberg> ok so i'm gonna take it out of your resource set
<azonenberg> did it just disappear?
<d1b2> <johnsel> I saw 4 CPUs disappear yes
<azonenberg> and ram?
<azonenberg> its not templatized yet but is out of the set
<d1b2> <johnsel> I'm not sure
<azonenberg> ok now its templating
<d1b2> <johnsel> I didn't see that change
<d1b2> <johnsel> 60+8 seems reasonable though
<azonenberg> yeah
<azonenberg> new template added to your set
<d1b2> <johnsel> cool
<azonenberg> i think that might be the workaround
<azonenberg> don't make a vm into a template without taking it out of the set first
bvernoux has quit [Quit: Leaving]
<d1b2> <johnsel> yup
<d1b2> <johnsel> found the issue why it didn't boot btw
<azonenberg> oh?
<d1b2> <johnsel> yep should've been booting uefi but it was booting bios
<azonenberg> oops
<_whitenotifier-e> [scopehal] d235j opened pull request #846: Add missing includes masked by PCH - https://github.com/ngscopeclient/scopehal/pull/846
<_whitenotifier-3> [scopehal] azonenberg closed pull request #846: Add missing includes masked by PCH - https://github.com/ngscopeclient/scopehal/pull/846
<_whitenotifier-e> [scopehal] azonenberg 1c5994a - Merge pull request #846 from d235j/add-missing-includes Add missing includes masked by PCH
<_whitenotifier-3> [scopehal] azonenberg pushed 2 commits to master [+0/-0/±4] https://github.com/ngscopeclient/scopehal/compare/1dd55aff4c5c...1c5994a1fda2
<_whitenotifier-e> [scopehal] azonenberg synchronize pull request #841: Refactor of Cmake scripts - https://github.com/ngscopeclient/scopehal/pull/841
<_whitenotifier-3> [scopehal] d235j synchronize pull request #841: Refactor of Cmake scripts - https://github.com/ngscopeclient/scopehal/pull/841