<d1b2>
<azonenberg> its enough of a pain trying to figure out all of the new SPF/DKIM/whatever records you need to keep adding to DNS when using third party mail servers :p
<d1b2>
<azonenberg> what happened to /opt/actions-runner/_work/scopehal-apps/scopehal-apps/build/Testing/Temporary/LastTest.log
<d1b2>
<azonenberg> is that something we upload as an artifact?
<d1b2>
<azonenberg> or is that debug info lost forever
<d1b2>
<johnsel> that's gone
<d1b2>
<johnsel> we can save it though
<d1b2>
<azonenberg> can you fix the scripts to upload that somewhere or cat it or something?
<d1b2>
<johnsel> assuming it doesn't contain anything we don't want uploaded
<d1b2>
<david.rysk> at the moment I'm testing with the github hosted CI on my fork
<d1b2>
<azonenberg> that should all be unit test output
<d1b2>
<azonenberg> nothing sensitive
<d1b2>
<johnsel> I'll upload it
<d1b2>
<johnsel> we can figure out a way to make a cleaner report at some point
<d1b2>
<azonenberg> interestingly the tests seem to have segfaulted
<d1b2>
<david.rysk> oh can we bump nativefiledialog-extended?
<d1b2>
<johnsel> right
<d1b2>
<azonenberg> i wonder if we can get a binary and core dump out or something
<d1b2>
<david.rysk> link?
<d1b2>
<johnsel> I posted it above
<d1b2>
<johnsel> it skipped that step because of the segfault
<d1b2>
<johnsel> stupid
<d1b2>
<azonenberg> On the list, i'm half paying attention while routing a pcb and spinning a cranky toddler on an office chair with the other hand lol
<d1b2>
<azonenberg> all of the CI builder VMs should have a GTX 1650 pcie passthru
<d1b2>
<johnsel> it has a gpu
<d1b2>
<johnsel> it may or may not have vulkan and drivers properly installed
<d1b2>
<johnsel> your PR 673 is run on it with the self-hosted config
<d1b2>
<johnsel> feel free to pull master to get the changes I just made for debugging and to play with it
<d1b2>
<azonenberg> conjecture: vulkan init fails, then it crashes trying to access some vulkan state
<d1b2>
<johnsel> your PR gets executed
<d1b2>
<azonenberg> (likely due to not having an x server or something)
<d1b2>
<johnsel> possible, though there should be one
<d1b2>
<johnsel> it might not be attached to the nvidia gpu though
<d1b2>
<johnsel> not sure if that matters, I was able to run vkcube
<d1b2>
<azonenberg> @david.rysk if it will help, i can set you up with VPN access to the CI network and we can probably provision the builder so you can ssh to it
<d1b2>
<johnsel> we can't
<d1b2>
<johnsel> the networking is broken
<d1b2>
<azonenberg> oh? interesting
<d1b2>
<johnsel> I was going to redo it but can't with the resources/templates
<d1b2>
<azonenberg> ah ok yeah we have that weird leaking of resources that aren't actually leaked
<d1b2>
<johnsel> yep and I need you to clean up the old template as well
<d1b2>
<johnsel> though right now it's still in use
<d1b2>
<david.rysk> @azonenberg I probably can do all the testing I need on GH hosted runners
<d1b2>
<david.rysk> which are more secure anyway
<d1b2>
<azonenberg> @david.rysk i'm thinking specifically about getting accelerated vulkan working in our environment
<d1b2>
<david.rysk> @azonenberg I mean if Lavapipe vulkan works, accelerated is not much harder.
<d1b2>
<azonenberg> @johnsel ok so if i get a chance tonight after i put little miss crankypants to bed, i'll update xoa and xoa2 to the latest, install the latest patches on the host
<d1b2>
<azonenberg> suspend all VMs, reboot the host, resume them
<d1b2>
<azonenberg> and we'll see what happens to scopehal-ci-set after that
<d1b2>
<david.rysk> seems to stick on "Waiting for a runner to pick up this job..."
<d1b2>
<david.rysk> for macos-latest-arm64
<d1b2>
<david.rysk> I'll need to test on real hardware to see why it doesn't pick up openmp
<d1b2>
<azonenberg> On that note, openmp is something we will probably transition away from eventually since it has been annoying from time to time
<d1b2>
<azonenberg> I dont have an immediate plan but most of the uses we have for it can be better served by application-managed threading or just pushing compute to the GPU
<d1b2>
<david.rysk> @johnsel got mac and ubuntu CI working, now expanding ubuntu to include 20.04, 22.04, sdk, and no-sdk
<d1b2>
<johnsel> nice
<d1b2>
<johnsel> good work
<d1b2>
<johnsel> I'm fighting the Xilinx DDR4 controller
<azonenberg>
@johnsel so i wonder if it might make sense to have two classes of tests
<azonenberg>
one that's "build on a non-gpu machine, dont run any tests, just checking for distro specific compile issues"
<azonenberg>
as many of those as we want
<azonenberg>
then one linux and one windows system with GPU that run all the tests
<d1b2>
<johnsel> that makes sense yes
<azonenberg>
(the idea being, given that the tests and drivers dont do much distro specific stuff, we're looking for bugs in our code and dont need to run those a zillion times)
<d1b2>
<johnsel> I wanted to finish windows + linux first before thinking about that
<d1b2>
<johnsel> there's the "do -we- have to run those" question
<d1b2>
<johnsel> or do we outsource that to github
<azonenberg>
That is entirely an option, assuming we dont run out of free build minutes
<azonenberg>
Which is very much a possibility
<d1b2>
<johnsel> if we want to self-host that is possible but we would have to make our templates
<d1b2>
<johnsel> and with the current resource issue it's a definite no-go
<azonenberg>
yeah, I have cpu cycles to burn on my infrastructure; RAM is the only issue. Can we sequence runners so no more than N run concurrently?
<d1b2>
<johnsel> I asked in the xcp-ng discord but those people are also entirely unhelpful
<azonenberg>
(I'm busy on a board design and might bump the host reboot to tomorrow)
<d1b2>
<johnsel> presumably when you buy a support contract that would change but f that
<azonenberg>
yeah lol
<d1b2>
<david.rysk> what are your thoughts on using lavapipe to run tests?
<d1b2>
<johnsel> I'll have to think about it
<azonenberg>
i mean if some corporate sponsor wants to buy me a xcp-ng commercial support contract i'll take advantage of it
<d1b2>
<johnsel> we specifically bought GPUs because we want GPUs
<azonenberg>
but i'm not spending my own money on it lol
<azonenberg>
Yeah we want to replicate in an environment that's as realistic as possible
<azonenberg>
also it'll be way faster
<d1b2>
<johnsel> we'll also put real scopes in the network eventually
<azonenberg>
Yeah exactly
<azonenberg>
we want as close to real world deployments as we can
<d1b2>
<johnsel> we really want this to function as a sort of testbed for what we could deploy as a solution eventually
<azonenberg>
i'm not opposed to having one runner somewhere checking on llvmpipe and/or swiftshader just as "edge case vulkan implementation to find issues where we make assumptions about what the implementation is capable of"
<d1b2>
<johnsel> with automated testing etc
<azonenberg>
Yeah exactly
<d1b2>
<johnsel> so yeah agreed with azonenberg, if you make it we can run them but we're not super interested in the output
<azonenberg>
i mean i want to know if i broke something on another platform
<d1b2>
<david.rysk> Are you still planning to allow any random person to make a PR and the selfhosted CI to run it?
<d1b2>
<david.rysk> since that's one benefit of github CI
<d1b2>
<johnsel> we have manual review for that
<azonenberg>
Once approved by a maintainer, yes
<d1b2>
<johnsel> you can't just run anything on ci
<d1b2>
<david.rysk> my point is you won't need manual review for GH CI
<d1b2>
<david.rysk> 😛
<d1b2>
<david.rysk> just for the selfhosted runners
<d1b2>
<johnsel> Well you need to review PRs anyway
<d1b2>
<david.rysk> that allows for an automated first pass
<azonenberg>
If and when we have a high enough volume of contributions that this becomes an issue we can reconsider
<d1b2>
<johnsel> we can run other linux no problem on the self-hosted infra @david.rysk
<azonenberg>
Yeah. And give them GPU or not as we see fit, as long as no more than two concurrent runners have GPUs
<d1b2>
<johnsel> it's flexible enough that we can deploy something very quickly
<azonenberg>
(but we'd likely run out of RAM before that became an issue)
<azonenberg>
@johnsel what are your thoughts on making a new resource set for scopehal stuff and transitioning stuff over there to see if we can find and eliminate the leak that way?
<d1b2>
<johnsel> Yep but adding another runner with a different OS is very simple, we use all the proper IaaS stuff that all distros support
<d1b2>
<johnsel> that's a really good idea @azonenberg
<d1b2>
<johnsel> although it wouldn't solve the root cause
<azonenberg>
ok i'll do that after the reboot which i should do anyway because i'm due for patches
<d1b2>
<david.rysk> my point is we use GH-hosted runners that don't require approval to catch stupid shit early
<azonenberg>
so my thought is, if we suddenly see ghost usage after transitioning X over to it
<azonenberg>
we know thats what did it
<d1b2>
<david.rysk> then self-hosted runners that do require approval for more thorough testing
<d1b2>
<johnsel> please also find time to click the "convert to template" button :p
<azonenberg>
@david.rysk: what exactly would we be expecting the GH runners to catch if we don't manually review the output?
<d1b2>
<johnsel> they're very slow though, and we'd be managing things in 2 places
<d1b2>
<david.rysk> @azonenberg they will catch build errors and inform the PR submitter well ahead of when we are reviewing
<d1b2>
<david.rysk> They're not to inform us but the PR submitter
<azonenberg>
yeah the point is that i'm sick of waiting hours to get results from stuff
<azonenberg>
the github runners are super slow
<azonenberg>
and as we add more distros and extensive tests we are likely going to hit the cap on the free tier very soon
<d1b2>
<david.rysk> We don't get many outside PRs and aren't merging them frequently, so I don't exactly mind them being slow for that use
<azonenberg>
at which point we either pay for azure time or use my own hardware i'm already paying for
<d1b2>
<david.rysk> We can only have them enabled for PRs
<d1b2>
<david.rysk> And they only support Ubuntu, macOS, and Windows anyway, which we should support self-hosted too
<d1b2>
<johnsel> That's a decent amount of management overhead, I'd rather work towards core contributors getting the ability to have their PRs run automatically
<d1b2>
<johnsel> I think that makes more sense than going backwards
<d1b2>
<david.rysk> hm what's the management overhead (on top of having extra stuff in the .yml)
<d1b2>
<david.rysk> I mean if we're also running self-hosted runners with the same OS
<d1b2>
<johnsel> they aren't entirely the same
<azonenberg>
@johnsel FYI i made scopehal-ci-set2
<azonenberg>
which should have the same perms as scopehal-ci-set and currently has no instances in it
<d1b2>
<johnsel> I see it
<azonenberg>
2TB storage cap, 128GB RAM, 96 vcpus
<azonenberg>
try slowly moving stuff into it and see if/when you see ghost usage
<azonenberg>
this isnt meant to be a workaround for the issue so much as a way to identify what we did to trigger it
<d1b2>
<johnsel> I'm not sure I can
<d1b2>
<johnsel> yeah I can't do that
<d1b2>
<johnsel> you need to
<d1b2>
<johnsel> I can shut down the instances though
<azonenberg>
ok yeah if you shut them down i can probably transition between sets myself
<azonenberg>
then maybe we can at least see what the ghost is attached to
<azonenberg>
and dig deeper
<d1b2>
<johnsel> yes
<d1b2>
<johnsel> done btw
<azonenberg>
great ok gimme a few
<d1b2>
<johnsel> I don't have networks and SRs
<d1b2>
<johnsel> fyi
<azonenberg>
in -set2?
<d1b2>
<johnsel> correct
<d1b2>
<johnsel> so that's scopehal, cidmz, and isos and the ci-sr
<d1b2>
<johnsel> did you reboot yet, no right?
<azonenberg>
i show ci-ceph-sr, installmedia-localisos, xcp-ng tools SRs attached
<azonenberg>
to set2
<azonenberg>
as well as scopehal and cidmz networks
<azonenberg>
@david.rysk for background we are troubleshooting an issue in which the self hosted runners are in a resource pool that is showing ghost usage
<d1b2>
<johnsel> my mistake
<d1b2>
<johnsel> it pops up when I select a template
<azonenberg>
i.e. quotas are being hit that show usage beyond the actual vms in the group
<d1b2>
<johnsel> should we reboot first?
<d1b2>
<johnsel> moving the vms may prevent us from knowing if that fixes it
<azonenberg>
yeah probably a good idea, i just wanted to patch a bunch of things at once
<azonenberg>
to make the most of the downtime
<d1b2>
<johnsel> yup no problem
<d1b2>
<johnsel> just wanted to confer on what is best
<d1b2>
<johnsel> as an FYI we should be able to get github based login working as well
<d1b2>
<johnsel> to give others access to the ci infra
<azonenberg>
oh cool
<azonenberg>
ok so first things first
<azonenberg>
xoa is upgrading from 5.85.0 to 5.89.0
<d1b2>
<david.rysk> @azonenberg @johnsel I've been rewriting the Mac and Ubuntu GH-hosted templates
<d1b2>
<david.rysk> Haven't touched Windows yet
<d1b2>
<david.rysk> But setting up matrices to test all the various combinations of SDK and OS type
<azonenberg>
monthly OS patching on xoa2 outer VM
<azonenberg>
xen-orchestra updated to b1e879c release 5.90.0, building the update now
<azonenberg>
Rebooting xoa2 VM
<d1b2>
<david.rysk> fwiw installing texlive slows these runners down a lot
<azonenberg>
yeah i'm thinking we want to bake that into the image
<azonenberg>
vs installing every build
<d1b2>
<johnsel> correct
<azonenberg>
ok, xoa2 rebooted, same issue
<d1b2>
<johnsel> interesting
<azonenberg>
not surprised, but at least i got it updated and confirmed that wasn't the root cause
<d1b2>
<johnsel> I think v2 already has texlive on there btw
<azonenberg>
Now to update the host fully, finish a bit of work i'm doing in another instance, suspend all of my instances (including the one i'm IRCing from right now)
<azonenberg>
and reboot it
<d1b2>
<johnsel> Ah the xcp instance might still be the culprit
<azonenberg>
yeah. i'm gonna do a full hard powerdown and up of the host, full hypervisor and dom0 reset
<azonenberg>
and see what we get when it comes back
<azonenberg>
gonna be a few mins while i wrap up what i'm doing in the other vm first
<azonenberg>
ok instances going down and rebooting the host
Johnsel has quit [Quit: ZNC 1.8.2+deb3.1 - https://znc.in]
azonenberg has quit [Ping timeout: 260 seconds]
azonenberg has joined #scopehal
<azonenberg>
ok back up and instances restarted
<azonenberg>
Let's see what i see in your resource set
<azonenberg>
No change
<d1b2>
<johnsel> weird
<d1b2>
<johnsel> move them over?
<azonenberg>
WTAF, lol. There are two VMs in the resource set
<azonenberg>
ci-orchestra with 4 vCPUs, 8GB RAM, 48GB disk
<azonenberg>
ci_linux_builder_gpu_105 with 32 vCPUs, 60GB RAM, 48GB disk
<d1b2>
<johnsel> yeah it's nuts
<azonenberg>
And yet scopehal-ci-set is using 110 vCPUs and 149 GB of RAM in the accounting manager
<azonenberg>
Which is of course physically impossible
<d1b2>
<johnsel> I'd almost say that every template is still taking up its vCPUs and RAM
<azonenberg>
i dont have enough ram for you to be using 149GB and still have my stuff running :p
<d1b2>
<johnsel> right
<azonenberg>
i wonder if "convert to template" is not clearing usage
<d1b2>
<johnsel> that's what I said
<d1b2>
<johnsel> Move the VMs over if you will
<azonenberg>
what i mean is the template is not using resources
<azonenberg>
it's the ghost of the VM that became the template
<d1b2>
<johnsel> right
<azonenberg>
ok let me see if i can figure out how to move them
<d1b2>
<johnsel> I shut them down
<azonenberg>
Moved orchestrator over. Now set2 is using 4 vCPUs and 8GB RAM as expected
<azonenberg>
set1 is still using a ridiculous amount
<azonenberg>
now to move the builder
<azonenberg>
aaand yep. the original set is now using 74 vCPUs and 81 GB of ghost ram and 500GB of ghost storage
<azonenberg>
with no vms in it
<d1b2>
<johnsel> nuke it
<azonenberg>
i was about to say, am i safe to delete the original set?
<d1b2>
<johnsel> I guess we'll find out
<azonenberg>
as a near term workaround we may just have to keep on making new self service sets to get rid of the leaks lol
<d1b2>
<johnsel> (as far as I am aware, yes)
<azonenberg>
there's no instances in it, let's see what happens
<azonenberg>
Done
<d1b2>
<johnsel> yep looks fine
<azonenberg>
You're now using 36/96 vCPUs, 68/128GB RAM, 93G/2TB storage
<d1b2>
<johnsel> which is correct
<azonenberg>
Yeah. Next time you ask me to template something i'm going to try taking it out of the self service set first
<azonenberg>
and see if that avoids the leak
<d1b2>
<johnsel> You can do that
<d1b2>
<johnsel> I was going to lower the amount of CPUs + RAM to 1 each
<d1b2>
<johnsel> well if it is the templates at least we shouldn't have to deal with it too much
<azonenberg>
Yeah
<d1b2>
<johnsel> and you could just create a throwaway resource set
<azonenberg>
I dont think it needs to be in a set at all unless you're adminning it
<azonenberg>
i can just take it out of your set, template it, then give you access to the template
<azonenberg>
i don't have resource sets for any of my vms
<d1b2>
<johnsel> that might work
<azonenberg>
we'll find out next time you need a template made
<d1b2>
<johnsel> yep
<azonenberg>
also, let me know if i can delete any of the old templates to save disk space at some point
<d1b2>
<johnsel> sure once I have everything 100%
<d1b2>
<johnsel> there's some old disks as well that I think are associated with templates
<azonenberg>
although i am... not exactly short on storage space on the cluster now
<d1b2>
<johnsel> but I can only see half of everything unfortunately with my account permissions so it's hard to tell what can be deleted
<d1b2>
<johnsel> then again I'm not overly interested in it
<azonenberg>
this might have something to do with why it eats ram :p
<d1b2>
<johnsel> I have 80GB not to have to think about it
<azonenberg>
let's just say those windows have >1 tab each :p
<d1b2>
<johnsel> yeah I do the same too
<d1b2>
<johnsel> my usb controller crashed so I had to reboot
<d1b2>
<johnsel> otherwise I'd have the same to show for you lol
<azonenberg>
ok so let's see: the CPU has 8 memory channels and is capable of ddr4 2933, but is currently loaded with 8x 32GB lrdimms of ddr4 2666
<azonenberg>
mobo has 16 sockets and supports up to 4TB
<azonenberg>
Looks like identical or near-identical dimms list for $134.99 on newegg now, so $1079.92 to double the ram. I'm not in a rush to do it, but good to have a number
<azonenberg>
(and that way i'd never have to worry about memory capacity)
<azonenberg>
I have plenty of CPU capacity, i'm averaging less than 25% across the board
<azonenberg>
per core
<azonenberg>
cumulative peak over the last 2 hours was around 900%
<azonenberg>
so more memory = more instances = making more effective use of the cpu :p
<azonenberg>
If i were to do a maintenance outage for hardware upgrades, i would likely throw a 40/100G NIC in there as well
<azonenberg>
right now i have a single 10G link to the network core (of a dual port nic, one port lit up) with a .1q trunk shared by both VMs and storage access
<azonenberg>
nearish term i'm thinking of transitioning some or all VM traffic to the second port of the nic leaving the first one exclusively for storage
<azonenberg>
@johnsel are you now blocked on anything from me?
<azonenberg>
or are all your near term needs wrt CI infra work met?
<d1b2>
<johnsel> sounds good
<d1b2>
<johnsel> and no I think I'm good now
<d1b2>
<johnsel> you could increase my RAM back to 192
<d1b2>
<johnsel> and CPUs to 128
<d1b2>
<johnsel> at least I think those were the values
<d1b2>
<johnsel> I'm testing with Windows right now
<d1b2>
<azonenberg> done. but note that at this time, with whatever instances you have running
<d1b2>
<azonenberg> i have 26GB of free ram
<d1b2>
<azonenberg> that is not an ACL limit thats the actual amount i have free 🙂
<d1b2>
<johnsel> Hmm
<d1b2>
<azonenberg> your two builders have 60GB each lol
<d1b2>
<azonenberg> Which is fine, i don't plan to start any more large instances in the near future
<d1b2>
<johnsel> I'll work on minimizing that lol
<d1b2>
<azonenberg> just be advised that you are close to maxing out the host 🙂
<d1b2>
<johnsel> that's not sustainable if we want to run more distros
<d1b2>
<azonenberg> Yeah
<d1b2>
<johnsel> I'll keep it in mind
<d1b2>
<johnsel> I thought you had about half reserved for CI
<d1b2>
<azonenberg> (this is also why i wanted to double the ram on the host in the indefinite nearish future)
<d1b2>
<azonenberg> I did lol
<d1b2>
<johnsel> right
<d1b2>
<azonenberg> i have 256G physical ram
<d1b2>
<azonenberg> you're using 60G for each CI instance 🙂
<d1b2>
<johnsel> thanks, sometimes I can't count
<d1b2>
<azonenberg> experimentally, building with -j32 i only need about 20GB RAM for the actual build
<d1b2>
<azonenberg> so 32 per node is likely very reasonable
<d1b2>
<azonenberg> per instance*
<d1b2>
<azonenberg> That would give you capacity to run 3-4 distros simultaneously on the current hardware
<d1b2>
<johnsel> well if you use 20 then let's use 20
<d1b2>
<azonenberg> and if i added another 256G, another eight
<d1b2>
<johnsel> buttt
<d1b2>
<azonenberg> no
<d1b2>
<johnsel> I have a problem
<d1b2>
<azonenberg> thats 20G used by the build
<d1b2>
<azonenberg> above OS baseline
<d1b2>
<johnsel> oh right
<d1b2>
<johnsel> sorry it's 7:23am here
<d1b2>
<johnsel> details are starting to get vague
<d1b2>
<azonenberg> and you probably want a bit for disk cache and whatever in the OS
<d1b2>
<johnsel> sure yeah
<d1b2>
<azonenberg> 32 per builder i think is reasonable
<d1b2>
<johnsel> we can do 24 or 32 gb
<d1b2>
<johnsel> depending on how many other OS we want to run
<d1b2>
<azonenberg> if that exceeds the capacity of our hardware, i'll throw hardware at it :p
<d1b2>
<johnsel> can you figure out why it won't start?
<d1b2>
<azonenberg> let's see
<d1b2>
<johnsel> oh huh
<d1b2>
<johnsel> it started now
<d1b2>
<johnsel> but still the CI wasn't able to kick it off so please still look
<d1b2>
<azonenberg> console showed no bootable device
<d1b2>
<azonenberg> it was stuck on a bootloader/bios screen
<d1b2>
<johnsel> yes but the vm itself did not start
<d1b2>
<johnsel> that's another problem
<d1b2>
<johnsel> but the vm went from starting back to off
<d1b2>
<azonenberg> ok so do you want me to try manual starting and see what happens?
<d1b2>
<johnsel> no need
<d1b2>
<johnsel> I do wonder why it won't boot
<d1b2>
<azonenberg> ok well i'll let you troubleshoot, i have to start the dishwasher and make a bit more progress on this PCB layout then get some sleep
<d1b2>
<azonenberg> we can sync up tomorrow and see
<d1b2>
<johnsel> sure thing
<d1b2>
<johnsel> what pcb are you working on?
<d1b2>
<johnsel> my next project is a clock generator for the ADC/JESD
<d1b2>
<azonenberg> 100M/gig baseT (depending on how you stuff it) to baseT1 media converter
<d1b2>
<johnsel> with an off-the-top-of-my-head 16GHz pll
<d1b2>
<johnsel> should be fun
<d1b2>
<johnsel> 15GHz
<d1b2>
<azonenberg> For playing with automotive ethernet decodes in scopehal, as well as because we wanted some media converters for work to use on car gigs and don't seem to have any
<d1b2>
<azonenberg> so we could buy some or i could make one for fun :p
<d1b2>
<azonenberg> (and if i make it, i get low level phy register access and all kinds of things i probably wont get with a commercial one)
<d1b2>
<azonenberg> then i have a quick and dirty oshpark ultrascale+ test board in the pipeline to see if my sketchy reballed aliexpress FPGA is still alive
<d1b2>
<azonenberg> and the ultrascale+ based BERT
<d1b2>
<johnsel> car gigs? that sounds interesting
<d1b2>
<azonenberg> all in the pipeline at various stages of schematic capture (no layout for any of those, the media converter is ~2/3 done with layout)
<d1b2>
<azonenberg> i mean we do embedded security at work in general
<d1b2>
<azonenberg> we've done a fair number of automotive projects involving canbus and a few involving ethernet
<d1b2>
<johnsel> aah for $dayjob
<d1b2>
<johnsel> I thought you were doing some off the clock
<d1b2>
<azonenberg> but we dont have any single pair ethernet dongles in the lab now
<d1b2>
<azonenberg> every time we encountered it we bummed one off the client :p
<d1b2>
<johnsel> lol
<d1b2>
<azonenberg> So i figure i'll make the converter boards for fun, write a scopehal decode using them
<d1b2>
<azonenberg> then leave them on my desk at the office for whoever needs one to use
<d1b2>
<johnsel> do you work on request of automotive companies then I assume on these gigs?
<d1b2>
<johnsel> if you can't answer that that's fine too
<d1b2>
<johnsel> I don't want to pry
<d1b2>
<azonenberg> I cant speak to specific projects (NDAs etc), but as a general rule we do projects on behalf of either OEMs, vendors for them, or prospective buyers
<d1b2>
<johnsel> just find that they really need to with all the issues they have
<d1b2>
<azonenberg> e.g. "I make X and want it tested before Y integrates it into a product"
<d1b2>
<azonenberg> or "I make X, a product in itself, and want it tested"
<d1b2>
<johnsel> that explains a lot actually
<d1b2>
<azonenberg> or "I plan to buy a thousand X's across my company and want to do due diligence"
<d1b2>
<johnsel> lots of integration issues in automotive
<d1b2>
<azonenberg> Or even "I want to decide whether to buy X or Y and a security analysis of the two factors into our decision"
<d1b2>
<johnsel> individual components are fairly well secured but then they tie them together and put an API key somewhere they shouldn't
<d1b2>
<johnsel> Yeah sure that makes sense. I was just wondering if it was on request or self-directed research. It makes sense that it isn't the latter
<d1b2>
<azonenberg> We do research on our own as well, either if it's something one of the guys finds interesting or if we as a company want to push on what we think is an unexplored market
<d1b2>
<azonenberg> find a bunch of high profile bugs, publish them, suddenly all the players in that space want testing done :p
<d1b2>
<johnsel> yep that's how the cookie crumbles
<d1b2>
<azonenberg> But the actual billable work is all prearranged contracts
<d1b2>
<azonenberg> we're not like bug bounty mercenaries
<d1b2>
<johnsel> shame
<d1b2>
<johnsel> you'd do well as one I think
<d1b2>
<johnsel> from what I've seen though it makes little sense financially
<d1b2>
<johnsel> maybe if you live in a third world country or something
<d1b2>
<azonenberg> I like the stability of a salary, sales guys to find customers, project managers to yell at customers who don't give us the info we need to do our work, etc
<d1b2>
<azonenberg> this is why i work for a consultancy and not solo
<d1b2>
<azonenberg> and why i build T&M gear as a maybe-eventually-profitable sideline and not to earn a living
<d1b2>
<johnsel> yep
<d1b2>
<johnsel> weird, the windows vm works fine if I create it via the web ui
<d1b2>
<azonenberg> But yeah, given the amount of reverse engineering involved in the job you can probably guess why i'm all in on scopehal lol
<d1b2>
<azonenberg> lots of "i'm staring at a bus trying to figure out how to abuse something"
<d1b2>
<azonenberg> one thing i plan to spend more time on in the indefinite future is visualizations and analysis / classification tools for getting the lay of the land
<d1b2>
<johnsel> Yeah I definitely think RE-ing is one of the most interesting markets for scopehal and hw in general
<d1b2>
<azonenberg> basically, i'm looking at an embedded device i know nothing about
<d1b2>
<azonenberg> what's interesting?
<d1b2>
<johnsel> yeah I have some ideas about that as well
<d1b2>
<azonenberg> e.g. i get on an i2c bus, who's talking? how many devices are there? how often are they being accessed? what can you identify?
<d1b2>
<johnsel> which reminds me that you still need to finish the file storage
<d1b2>
<azonenberg> the s3? yes
<d1b2>
<johnsel> and the scopesessions
<d1b2>
<azonenberg> i had trouble getting the service to run and need to investigate that
<d1b2>
<johnsel> so I can train my model
<d1b2>
<azonenberg> oh. the file server is up and running
<d1b2>
<azonenberg> I just dont have anything on it :p
<d1b2>
<azonenberg> i need to clean up and upload some data to it
<d1b2>
<johnsel> yes please
<d1b2>
<azonenberg> i.e. re-save as latest scopesession format, make sure it doesnt segfault when loaded, make sure there's nothing sensitive or work related in it
<d1b2>
<azonenberg> also use this as an opportunity to find any issues with data quality and things i might want to re-capture
<d1b2>
<johnsel> it's a shame the Rigol needs to go back
<d1b2>
<azonenberg> i think at least one of the sample ethernet waveforms has some truncated packets
<d1b2>
<johnsel> I haven't had much time to capture sessions
<d1b2>
<johnsel> I think with a few 100k points per type of signal it should be possible to train a model
<d1b2>
<azonenberg> in general, i think i am probably going to do a lot of new data collection
<d1b2>
<azonenberg> vs just cleaning up existing stuff
<d1b2>
<johnsel> but I'm still thinking about how to deal with different clock speeds
<d1b2>
<johnsel> well anything is welcome
<d1b2>
<azonenberg> One of the first thing i would like to see is a lowest level PHY classifier for an unknown signal
<d1b2>
<johnsel> right now I don't have enough to do anything with
<d1b2>
<johnsel> right, that's the plan
<d1b2>
<azonenberg> what is the modulation (NRZ? MLT3? PAM4? PAM16?)
<d1b2>
<azonenberg> and what is the symbol rate
<d1b2>
<azonenberg> not even anything past that looking at line coding
<d1b2>
<johnsel> yep start bottom up
<d1b2>
<azonenberg> also, is it a baseband digital modulation at all
<d1b2>
<johnsel> I actually think I can train a model to do much faster clock recovery as well
<d1b2>
<azonenberg> or something analog/RF
<d1b2>
<johnsel> yeah I do have a sdr, though it's not compatible with ngscopeclient
<d1b2>
<azonenberg> well, it likely wont be useful for compliance testing or eye patterns but if it's good enough for protocol decoding and can run in parallel on a whole signal on a GPU, that would still be valuable
<d1b2>
<azonenberg> yet
<d1b2>
<azonenberg> we have a (WIP) UHD driver already and more will be coming
<d1b2>
<johnsel> I might write a driver for it, the adi pluto
<d1b2>
<azonenberg> i want to do IIO soon
<d1b2>
<johnsel> that'd do it
<d1b2>
<azonenberg> the antsdr supports both UHD and IIO depending on which firmware it's running
<d1b2>
<azonenberg> UHD is being very unstable for me
<d1b2>
<azonenberg> i am going to test with a legit ettus i'm borrowing from a friend and see if it's my code or the antsdr
<d1b2>
<johnsel> but yeah if we collect this data I definitely can design and train a model
<d1b2>
<azonenberg> if it's the SDR i'm gonna see if it's any more stable with IIO
<d1b2>
<johnsel> or probably several models
<d1b2>
<johnsel> that would be a very interesting expansion in ngscopeclient I think
<d1b2>
<azonenberg> Yeah. I think the first step is "RF or baseband modulation?"
<d1b2>
<johnsel> I know aartos are doing ai based signal recognition
<d1b2>
<azonenberg> second is "what modulation and symbol rate"
<d1b2>
<johnsel> and decoding as well I suspect
<d1b2>
<johnsel> of wifi for one
<d1b2>
<johnsel> bt
<d1b2>
<johnsel> drone 5.8ghz
<d1b2>
<azonenberg> I (and @lainpants ) would likely be very interested in anything that can do real time or close to real time RF classification as well
<d1b2>
<johnsel> for sure real time
<d1b2>
<azonenberg> i.e. given an RTSA sweep or something, what's talking?
<d1b2>
<azonenberg> the more accurate you can get the better, if you can say this is specifically a huawei 5G base station with model number 12345 that's ideal :p
<d1b2>
<johnsel> it all starts with having enough data
<d1b2>
<azonenberg> but at least "this is LTE", "this is an unknown 433 MHz ISM band device transmitting 100 Kbps BFSK"
<d1b2>
<johnsel> I mean I'd love for us to start working on a collection of signals and their decoded counterpart
<d1b2>
<azonenberg> Yes. and eventually integrate that into unit tests
<d1b2>
<azonenberg> i want to be able to provide a pcap and a scopesession of an ethernet or usb session, some metadata like "ignore the first 8 packets of the pcap the scope didn't trigger yet"
<d1b2>
<johnsel> I am much more familiar with writing the dnn than the collection side of things
<d1b2>
<azonenberg> and have it validate that our decode is correct
<d1b2>
<azonenberg> so we can do automated regression testing of all kinds of filters and protocol decodes
<d1b2>
<azonenberg> That's a long way out but that is the dream
<d1b2>
<johnsel> Yeah it's definitely on my roadmap
<d1b2>
<johnsel> Just need to get some base data to start with
<d1b2>
<azonenberg> Yeah I'll work on it. Soon
<d1b2>
<azonenberg> Too much on my plate lol
<d1b2>
<johnsel> Yeah I know, it was a statement of fact not to stress you out or anything
<d1b2>
<246tnt> I think I understand what triggers the hang on intel. I'm not 100% sure if this should work and the driver is at fault or if that pattern of access is UB but at least I can work around it.
bvernoux has joined #scopehal
<d1b2>
<azonenberg> So just to be clear (also + @david.rysk since we talked about this earlier)
<azonenberg>
Ok so, tickets filed for the 64k issue as it seems we didn't have one
<azonenberg>
843/675 are the same issue just frontend vs backend
<azonenberg>
and 325 is an unrelated problem
<d1b2>
<azonenberg> @246tnt anyway go on, what's the problem?
<d1b2>
<johnsel> it's that race condition right
<d1b2>
<johnsel> from a few months ago
<azonenberg>
Also if this access pattern is UB i would expect the vulkan validation layer to complain about it, even on nvidia?
<azonenberg>
(if it fails that might be something we can get khronos to add checks for)
<d1b2>
<david.rysk> @johnsel CI update: macOS working, Linux Ubuntu WIP, I'm splitting out docs since installing all the LaTeX dependencies seriously slows down the runners
<azonenberg>
yeah if we are doing GH runners as a "quick check" we dont need to build docs or upload binary artifacts etc
<d1b2>
<david.rysk> uploading binary artifacts is quick, docs aren't :p
<azonenberg>
(we also don't need to build docs on every platform)
<d1b2>
<david.rysk> Yeah I was thinking we just do it on latest Ubuntu
<azonenberg>
and our linux selfhosted image should have / might already have texlive preinstalled on the base image
<azonenberg>
that we clone to make each test vm
<d1b2>
<johnsel> hmm
<d1b2>
<johnsel> I think I've argued this before but it's stupid that we're building them for every app change anyway
<azonenberg>
Yeah
<d1b2>
<johnsel> let the owner repo build it on change instead
<azonenberg>
there is definitely room to optimize how much we build for CI testing
<d1b2>
<johnsel> we can link 'latest' artifacts
<d1b2>
<246tnt> From my reading on memory access to shared variables, you're not supposed to do "incoherent" accesses to them. So you basically need barriers to make sure there aren't any pending writes before either reading or rewriting the same shared location. So what I think is happening is that, on the GPU, if you have something like: if (cond) { do_A; } else { do_B; } it will execute both do_A and do_B and just mask off any writes on the inactive block. But if both the do_A and do_B blocks write to a shared variable, like say g_done, and some work items in a work group take the do_A path and some others take the do_B path, you can end up with a write to a shared variable that already has a pending write to it, because some work items will have issued a write to g_done while executing do_A and some others will issue a write to it when executing do_B. And it seems not to matter that both writes actually write true; it ends up screwing up the access.
<d1b2>
<johnsel> so refs need not go out of date
<azonenberg>
Interesting. So what's the solution, adding barriers?
<d1b2>
<246tnt> So doing something like : if (cond) { do_A; }; barrier; if (~cond) { do_B; } Solves the issue.
<azonenberg>
And only the rendering shader is impacted as far as you know right?
<azonenberg>
(most of our other filter shaders are simple number crunching with few if any conditionals)
<d1b2>
<246tnt> What I ended up doing is a bit different and just setting a local flag in do_A / do_B and performing the global write at the end. Which avoids adding a new barrier.
<d1b2>
<246tnt> (This code still adds a barrier but it's unrelated ... that one was always needed and always missing and doesn't solve the issue by itself)
<azonenberg>
yeah makes sense
<azonenberg>
and yeah the vulkan memory model definitely has footguns in it lol
<d1b2>
<246tnt> I'll open a PR.
<azonenberg>
Is this tested to the point you're ready to send a PR yet?
<azonenberg>
ah
<d1b2>
<246tnt> Well ... Usually I can trigger it in like ... 2 sec without the patch. And with the patch I played with it for a minute without it crashing.
<azonenberg>
well congrats on finding what looks like a truly nasty and subtle bug lol
<d1b2>
<246tnt> I was trying to reproduce the issue in a synthetic test program but unable to do so ATM.
<azonenberg>
yeah it's probably dependent on some specific bits of intel microarchitecture and access patterns
<azonenberg>
having multiple in flight writes to the same shared memory location with some particular uarch state confuses something
<azonenberg>
might even only fail on one particular gpu uarch or something lol
<d1b2>
<johnsel> not sure how useful this is to anyone but I have this bookmarked
<_whitenotifier-e>
[scopehal-apps] smunaut 8bb6522 - Waveform rendering shader: Add missing barrier for termination flag init Once the init of the termination flag is done, we need to make sure all threads see it before continuing. Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
<_whitenotifier-3>
[scopehal-apps] smunaut 0f265a9 - Waveform rendering shader: Group write to termination flag Instead of writing to the termination flag from different point in the code, which might lead to some writes being issued while there are already pending writes from other threads coming from other code path. This seems to be an issue on intel at least. So instead each thread keeps a local flag and then in the common execution path, that flag might trigger a write to the shared variable and immediately after a barrier will be issued before any work item uses the flag or try to issue another write to it. Fixes #325 Signed-off-by: Sylvain Munaut <tnt@246tNt.com>
<cyborg_ar>
azonenberg: on PowerSupplyChannel.cpp, line 85. couldnt there be some way of refraining from sending a setpowervoltage command if the exact same value was sent in the previous frame? Im assuming refresh is what gets called every frame
<azonenberg>
refresh gets called every filter graph update
<azonenberg>
it's async to rendering
<azonenberg>
but anyway, i think the proper solution is actually to do it in the driver
<azonenberg>
make setpowervoltage be a no-op if the value being sent is the same as what was sent previously
<azonenberg>
(since you can also set voltage via the API or GUI etc)
<azonenberg>
just make it start as "unknown" then once you call it, save the last value you set
<azonenberg>
and only actually send commands to the device if different
<cyborg_ar>
well i would like the gui one to work no matter what. there are also some wrinkles related to instruments not being able to accept every set point. this HP will round your SP to the nearest possible set unit on readback so it will always not match if you want to set in between two DAC units. if im caching the setpoint there is an argument for storing both the last sent and the last read back setpoints...
<azonenberg>
Yeah. A lot of stuff we do cache readback results, but if it's something that can change every time we don't
<azonenberg>
e.g. we cache the v/div on a scope in most (should be all?) drivers
<azonenberg>
but we don't cache live readings from a meter for obvious reasons
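(For illustration only: a minimal sketch of the set-point caching idea above. The class, method names, and SCPI strings are hypothetical, not the actual scopehal PSU driver API; it just shows "remember the last value sent, skip the write if unchanged, and optionally cache the readback since the instrument may round to its nearest DAC step".)

    #include <cstdio>
    #include <map>
    #include <string>

    // Hypothetical driver sketch (not the real scopehal API): cache the last set point
    // sent per channel so repeated calls with the same value become no-ops.
    class CachedPSUDriver
    {
    public:
        void SetPowerVoltage(size_t chan, double volts)
        {
            // Cache starts "unknown"; skip the write only if we know we already sent this exact value
            auto it = m_lastSentVolts.find(chan);
            if(it != m_lastSentVolts.end() && it->second == volts)
                return;

            SendCommand("VOLT " + std::to_string(volts));    // placeholder SCPI syntax
            m_lastSentVolts[chan] = volts;

            // Optionally also cache the readback, since some supplies round the request
            // to the nearest DAC step or re-limit it against their SOA
            m_lastReadbackVolts[chan] = std::stod(SendQuery("VOLT?"));
        }

    protected:
        // Stand-ins for the real SCPI transport
        void SendCommand(const std::string& cmd) { printf("-> %s\n", cmd.c_str()); }
        std::string SendQuery(const std::string& q) { printf("-> %s\n", q.c_str()); return "1.000"; }

        std::map<size_t, double> m_lastSentVolts;      // last value we pushed to the instrument
        std::map<size_t, double> m_lastReadbackVolts;  // last value the instrument reported back
    };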
<cyborg_ar>
also im seeing there is no infrastructure for limits on settings, iirc the hp will raise a fault and ignore the command if you set an invalid value
<azonenberg>
Yes. That's something we've been thinking about for a while
<azonenberg>
but havent defined APIs for yet
<azonenberg>
By all means open a ticket to track that
<azonenberg>
for scopes we track limits on some settings like sample rate and memory depth
<azonenberg>
but we dont have anywhere near all APIs expose limits right now
<cyborg_ar>
also, due to the way the power supply works, changing the vset or iset can affect the other, with the last set taking priority, since the SOA of the power supply is L shaped
<azonenberg>
oh fun. So in that case what you might need to do is do a readback after sending a set command
<cyborg_ar>
yep
<azonenberg>
to see what the actual set points are
<cyborg_ar>
but that busts the queueing
<azonenberg>
Yeah. It's a nontrivial problem
<cyborg_ar>
tbh queueing of psu setpoints doesnt make much sense
<azonenberg>
one option there would be to implement clientside limiting in the driver
<azonenberg>
making it aware of the SOA
<azonenberg>
and the point of queueing is mostly so that when you apply a change in the GUI
<azonenberg>
you dont want to block the rendering thread
<azonenberg>
you want to push the command into a buffer and have it execute asynchronously
<azonenberg>
imagine the instrument is over a VPN 500ms away
<azonenberg>
you dont want the gui to lock up for a second getting a reply back
<azonenberg>
so every chance we have to avoid round trips, we take
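(A self-contained sketch of that queueing idea, not the actual scopehal transport code: the GUI thread pushes a command into a buffer and returns immediately, while a worker thread drains the queue and absorbs the round-trip latency. Class and method names are made up for illustration.)

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Toy asynchronous command queue: Enqueue() never blocks on instrument I/O,
    // a background worker executes the queued commands (e.g. SCPI writes) in order.
    class CommandQueue
    {
    public:
        CommandQueue()
            : m_done(false)
            , m_worker([this]{ Run(); })
        {}

        ~CommandQueue()
        {
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                m_done = true;
            }
            m_cv.notify_one();
            m_worker.join();
        }

        // Called from the GUI thread, e.g. queue.Enqueue([&]{ transport.Send("VOLT 3.3"); });
        void Enqueue(std::function<void()> cmd)
        {
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                m_queue.push(std::move(cmd));
            }
            m_cv.notify_one();
        }

    private:
        void Run()
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            while(true)
            {
                m_cv.wait(lock, [this]{ return m_done || !m_queue.empty(); });
                if(m_queue.empty())
                    return;                 // only reached once drained and m_done is set
                auto cmd = std::move(m_queue.front());
                m_queue.pop();
                lock.unlock();
                cmd();                      // the slow part (maybe a 500ms VPN round trip)
                lock.lock();
            }
        }

        std::mutex m_mutex;
        std::condition_variable m_cv;
        std::queue<std::function<void()>> m_queue;
        bool m_done;
        std::thread m_worker;
    };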
<cyborg_ar>
yeah its hard on GPIB because it is not packet based
<cyborg_ar>
on packet based protocols you can keep track of which packet is response to which
<azonenberg>
i mean most scpi transports arent packet based. VICP kinda-sorta is
<azonenberg>
VXI-11 sorta is but doesnt support pipelining
<d1b2>
<johnsel> well, the underlying transport is
<cyborg_ar>
if youre talking to 5 different instruments over ethernet you dont have to wait for one to respond before talking to the next one, on gpib i dont know if you can interleave that way, you might be able to
<azonenberg>
Oh *that*. Yeah, transports support mutexing and if we are working with multiple instruments we may need to lock things when we are waiting for a reply
<azonenberg>
there is some mutexing already to prevent the gui and backend threads from stepping on each other
<azonenberg>
unsure if that is sufficient for multiple concurrent gpib connections
<azonenberg>
in particular i dont know what happens with linux-gpib if you have two devices being accessed at once
<azonenberg>
can you only have one handle open?
<azonenberg>
do we need to implement that arbitration ourself?
<cyborg_ar>
i need to check that, i'll write a simple second driver for another instrument and check...
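(Again just a sketch under assumptions, not how scopehal or linux-gpib actually arbitrates: one mutex per physical controller, held for a full command + reply round trip so two drivers sharing the bus can't interleave mid-transaction. The class and method names are hypothetical; the ibwrt()/ibrd() calls mentioned in comments are the standard linux-gpib entry points.)

    #include <mutex>
    #include <string>

    // Hypothetical shared-bus transport: all drivers on one GPIB controller share a
    // mutex, held for the whole command/reply round trip so transactions don't interleave.
    class SharedBusTransport
    {
    public:
        std::string Query(int address, const std::string& cmd)
        {
            std::lock_guard<std::mutex> lock(m_busMutex);
            WriteTo(address, cmd);
            return ReadFrom(address);
        }

        void Send(int address, const std::string& cmd)
        {
            std::lock_guard<std::mutex> lock(m_busMutex);
            WriteTo(address, cmd);
        }

    protected:
        // Stand-ins for the real bus I/O (e.g. wrapping linux-gpib ibwrt()/ibrd() on the addressed device)
        void WriteTo(int address, const std::string& cmd) { (void)address; (void)cmd; }
        std::string ReadFrom(int address) { (void)address; return ""; }

        std::mutex m_busMutex;  // one per physical controller, shared by every driver using it
    };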
<_whitenotifier-3>
[scopehal-docs] Johnsel 38b64e9 - Updated GPIB interface limitations (#78) * Update section-transports.tex Added some information about currently supported GPIB usage. * Update section-transports.tex Clarified a detail about gpib and multiple ngscopeclient sessions