azonenberg changed the topic of #scopehal to: ngscopeclient, libscopehal, and libscopeprotocols development and testing | https://github.com/ngscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
coralreef has quit [Quit: Do not go gentle into that goodnight.]
Degi_ has joined #scopehal
Degi has quit [Ping timeout: 255 seconds]
Degi_ is now known as Degi
<_whitenotifier-3> [scopehal-apps] dizzystem opened pull request #670: more icons + tweaks to step and area under curve - https://github.com/ngscopeclient/scopehal-apps/pull/670
<_whitenotifier-3> [scopehal-apps] azonenberg closed pull request #670: more icons + tweaks to step and area under curve - https://github.com/ngscopeclient/scopehal-apps/pull/670
<_whitenotifier> [scopehal-apps] dizzystem a36b13d - more icons + tweaks to step and area under curve
<_whitenotifier-3> [scopehal-apps] azonenberg pushed 2 commits to master [+48/-0/±10] https://github.com/ngscopeclient/scopehal-apps/compare/70a56ae75aea...6c94d67e2621
<_whitenotifier-3> [scopehal-apps] azonenberg 6c94d67 - Merge pull request #670 from dizzystem/icons more icons + tweaks to step and area under curve
josuah has joined #scopehal
josuah has quit [Client Quit]
josuah has joined #scopehal
<d1b2> <johnsel> @azonenberg do you have a template for a ngscopeclient driver plugin?
<d1b2> <azonenberg> No
<d1b2> <azonenberg> You should be able to figure it out: basically there's an extern "C" void PluginInit() that gets called when the plugin is loaded, where you register your driver just as if it were in scopehal proper
<d1b2> <azonenberg> then you write the driver class just like normal
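A minimal sketch of what such a plugin might look like (hypothetical names throughout; assumes the AddDriverClass macro that in-tree drivers use in scopehal.cpp is visible to plugin code, and that MyDriver is an ordinary driver class with the usual boilerplate):

    // myplugin.cpp -- hypothetical out-of-tree driver plugin skeleton, not an official template
    #include "scopehal.h"
    #include "MyDriver.h"

    // Called once when the plugin is loaded; register the driver here,
    // just as in-tree drivers are registered in scopehal proper
    extern "C" void PluginInit()
    {
        AddDriverClass(MyDriver);
    }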
<d1b2> <johnsel> alright
<d1b2> <johnsel> it looks like the CI is still failing
<d1b2> <johnsel> weird
<d1b2> <johnsel> ╷
<d1b2> │ Error: timeout while waiting for state to become 'Running' (last state: 'Halted', timeout: 5m0s)
<d1b2> │
<d1b2> │   with xenorchestra_vm.builder_gpu_39,
<d1b2> │   on linux.tf line 1, in resource "xenorchestra_vm" "builder_gpu_39":
<d1b2> │    1: resource "xenorchestra_vm" "builder_gpu_39" {
<d1b2> │
<d1b2> ╵
<d1b2> Error: Process completed with exit code 1.
<d1b2> <johnsel> it created the VM but for some reason it did not start
<d1b2> <johnsel> could it be asking for too much resources?
<d1b2> <azonenberg> Don't know. it's been doing that for a week or two, i've been too busy to remember to bug you
<d1b2> <johnsel> Yeah, weird
<d1b2> <azonenberg> scopehal-ci-set right now is using 90/128 vCPUs, 68/128 GB RAM, 694 GB / 2TB storage allocated
<d1b2> <johnsel> I've reduced the requested resources but it may be a bug somewhere
<d1b2> <azonenberg> ok well please investigate and let me know if any action is needed on my side
<d1b2> <johnsel> yep
<d1b2> <johnsel> do you have something to push?
<d1b2> <johnsel> ah no need we still have some tasks pending
<d1b2> <johnsel> or I could restart one of those I guess
<d1b2> <johnsel> it's weird because the script works fine if I trigger it from an ssh session
<d1b2> <johnsel> but somehow when the CI triggers it, it doesn't work properly
<d1b2> <johnsel> hmm it may be as simple as the resources not being available because it does both delete + create actions at the same time
<d1b2> <johnsel> xenorchestra_vm.builder_gpu_38: Destroying... [id=c4b61988-51a3-89ed-0710-47f0e746a216]
<d1b2> xenorchestra_vm.builder_gpu_39: Creating...
<d1b2> xenorchestra_vm.builder_gpu_38: Still destroying... [id=c4b61988-51a3-89ed-0710-47f0e746a216, 10s elapsed]
<d1b2> xenorchestra_vm.builder_gpu_39: Still creating... [10s elapsed]
<d1b2> xenorchestra_vm.builder_gpu_38: Still destroying... [id=c4b61988-51a3-89ed-0710-47f0e746a216, 20s elapsed]
<d1b2> xenorchestra_vm.builder_gpu_39: Still creating... [20s elapsed]
<d1b2> xenorchestra_vm.builder_gpu_38: Destruction complete after 24s
<d1b2> xenorchestra_vm.builder_gpu_39: Still creating... [30s elapsed]
<d1b2> <johnsel> so the creation might start before the destruction has completed, which means there aren't enough resources available
<d1b2> <azonenberg> oh interesting
<d1b2> <azonenberg> yeah so you just need to wait for the destruction to complete before creating?
<d1b2> <johnsel> it would seem so
<d1b2> <johnsel> I will have to see how I can do this
<d1b2> <johnsel> I hope it doesn't require individual scripts because that would suck
<d1b2> <johnsel> well, it wouldn't be horrible but would be extra work
<d1b2> <johnsel> or you could increase the resources of course
<d1b2> <azonenberg> How far?
<d1b2> <azonenberg> more vCPUs is easy, we're oversubscribed plenty
<d1b2> <azonenberg> RAM is less excessive
<d1b2> <johnsel> I think it's the RAM that is causing the issue
<d1b2> <johnsel> I've lowered the RAM requested to 30GB now so it should have headroom for the 2 vms to run at the same time
<d1b2> <johnsel> hmm I see another bug
<d1b2> <johnsel> we run other people's code
<d1b2> <azonenberg> ok yes you're oversubscribed. gimme a minute i can give you a bit more ram
<d1b2> <johnsel> gotta disable that so it doesn't do that
<d1b2> <azonenberg> what do you mean?
<d1b2> <azonenberg> for PRs?
<d1b2> <johnsel> yep
<d1b2> <azonenberg> Just bumped your RAM cap to 140G and you should no longer be oversubscribed
<d1b2> <azonenberg> And we want to be running it for PRs, but only once i've approved that user to do so
<d1b2> <johnsel> We can't
<d1b2> <azonenberg> what do you mean? there's a button to approve workflows
<d1b2> <azonenberg> it won't run any of our CI scripts, even the ones on the github runners, until i approve
<d1b2> <johnsel> It doesn't allow the PR job to get secrets so it doesn't have credentials
<d1b2> <johnsel> so it can run the job itself but cycling the VM will fail
<d1b2> <johnsel> unless we put the credentials inside the vm instead of the job
<d1b2> <johnsel> but this might cause problems with them showing up in the log output
<d1b2> <azonenberg> Hmm
<d1b2> <johnsel> I'd have to look into that later
<d1b2> <azonenberg> well i really would like the ability to have trusted PRs run on the CI environment
<d1b2> <azonenberg> but for now let's fix it so at least stuff pushed to the repo directly CI's right
<d1b2> <johnsel> I mean if you trust the PR not to cat the .sh file it should be possible to put the credentials there
<d1b2> <johnsel> your pick @azonenberg
<d1b2> <azonenberg> So I thought the idea was that there would be an unauthenticated cycle-vm endpoint on a long-lived instance that has creds to the xen backend
<d1b2> <azonenberg> and the runner vm could request that cycle request whenever it wanted without having other xen access
<d1b2> <azonenberg> is that not how you ended up building it?
<d1b2> <johnsel> It ended up becoming a long-lived runner instance that runs a script and gets the credentials through GitHub secrets
<d1b2> <johnsel> This was easier, and until this problem it seemed to suffice for our needs
<d1b2> <azonenberg> (also how does this relate to the s3 stuff? when i was sick i got halfway through provisioning that, and never finished debugging it)
<d1b2> <azonenberg> iirc rgw is installed on the nodes but it's not actually running and serving anything and i haven't figured out why
<d1b2> <johnsel> s3 will hold the terraform state; this means the long-lived runner can be ephemeral too, as it will no longer keep the state database in a file on the local filesystem
<d1b2> <azonenberg> ok so that's an unrelated issue
<d1b2> <johnsel> it's not directly related to the authentication
<d1b2> <johnsel> yep
<d1b2> <johnsel> it's possible to build a web server and trigger that, though it would be some work to set it up
<d1b2> <azonenberg> i think that's the best option because I want to be able to CI approved PRs
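A rough sketch of that cycle-vm endpoint idea, as a plain POSIX socket listener on the long-lived orchestrator (the port, the path, and the terraform invocation are all illustrative assumptions): the Xen Orchestra credentials stay in the orchestrator's environment, and the runner VM only ever sends an unauthenticated HTTP request.

    // cycle_endpoint.cpp -- illustrative sketch of the design, not a deployed implementation
    #include <cstdlib>
    #include <cstring>
    #include <string>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main()
    {
        // Listen on an assumed port reachable only from the CI network
        int listenfd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(listenfd, (sockaddr*)&addr, sizeof(addr));
        listen(listenfd, 4);

        while(true)
        {
            int fd = accept(listenfd, nullptr, nullptr);
            if(fd < 0)
                continue;

            char buf[1024] = {};
            read(fd, buf, sizeof(buf) - 1);

            std::string reply;
            if(std::strstr(buf, "GET /cycle"))
            {
                // Credentials live only in this process's environment;
                // the runner never holds them
                int rc = std::system("terraform apply -auto-approve");
                reply = (rc == 0)
                    ? "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
                    : "HTTP/1.1 500 Internal Server Error\r\nContent-Length: 0\r\n\r\n";
            }
            else
                reply = "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n";

            write(fd, reply.data(), reply.size());
            close(fd);
        }
    }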
<d1b2> <johnsel> we can also put the credential in the shell script and make sure nobody changes the ci script to cat that file
<d1b2> <johnsel> that's a 1 minute solution
<d1b2> <azonenberg> if you think we can do that securely, go for it. I expect the design will evolve over time
<d1b2> <azonenberg> i mean, I review PRs before approving them to run
<d1b2> <azonenberg> and if somebody is messing with the yaml that's gonna catch my eye
<d1b2> <johnsel> yeah that would be the vector
<d1b2> <johnsel> and worst case it would mean someone gets access to the xoa instance
<d1b2> <johnsel> it's not like there is much to get there
<d1b2> <azonenberg> well more to the point it would mean they'd get access to the xoa with credentials that have CI-level access
<d1b2> <azonenberg> i.e. they can't touch anything outside that resource set
<d1b2> <azonenberg> right?
<d1b2> <johnsel> correct, right now it's my user
<d1b2> <johnsel> we should introduce a separate one
<d1b2> <azonenberg> Yeah
<d1b2> <johnsel> but yes it's just the ci resource set
<d1b2> <azonenberg> But the point is, it's a limited user that can't touch other instances
<d1b2> <johnsel> can't delete templates
<d1b2> <azonenberg> and the worst they can do is things like spawn CI builder VMs or run arbitrary code in the VM they are already running arbitrary code in
<d1b2> <johnsel> can't even access the vpn network
<d1b2> <johnsel> right
<d1b2> <johnsel> they could mine bitcoins
<d1b2> <azonenberg> Lol. That would get noticed fast when other instances start slowing down
<d1b2> <johnsel> and we both monitor the xoa/ci anyway so anything weird would likely be noticed quickly
<d1b2> <azonenberg> So yeah i'm not super concerned about that. What i mostly am concerned about is the potential for malicious unit test binaries running in the VM finding a way to break out and escape the CI network. So we'll definitely want to carefully verify the segmentation
<d1b2> <azonenberg> which I think is solid but i'd like to get a few more eyes on soon
<d1b2> <azonenberg> the main line of defense there is just not enabling workflows until i've vetted the PR and the contributor is at least somewhat trusted
<d1b2> <johnsel> we can test that if you want
<d1b2> <johnsel> but the basic ping tests we did showed it just had access to the internet
<d1b2> <azonenberg> Yeah once we get things debugged and operational i'd like you to spend some time trying to break out of the CI
<d1b2> <johnsel> and the xoa instance
<d1b2> <azonenberg> i'll probably get lain to hack on it as well if she has time
<d1b2> <azonenberg> We're both security professionals so if neither of us can find a way to get out it's probably good enough to guard against the unlikely scenario of someone who's already submitted good patches turning rogue
<d1b2> <johnsel> yep sure we can plan a round of red-team testing once we have things working properly
<d1b2> <azonenberg> exactly
<d1b2> <azonenberg> anyway, first things first you now have more RAM so the resource exhaustion problem should no longer be an issue
<d1b2> <johnsel> yes I'll set the requested RAM back to 64GB in a bit
<d1b2> <azonenberg> (that said, we probably don't want to run both builders at once if we can avoid it)
<d1b2> <azonenberg> at least not with 64G each
<d1b2> <johnsel> if things work properly it should only overlap seconds to minutes
<d1b2> <johnsel> it cycled correctly so it looks like that was the issue
<d1b2> <johnsel> I'll still investigate whether we can ensure the creation happens after the destruction
<d1b2> <azonenberg> my concern is ooming during those seconds
<d1b2> <azonenberg> i don't want to have 64GB of ram permanently unusable because i need it for those few seconds to avoid errors
<d1b2> <azonenberg> (i.e. i can never schedule other vms to use that memory)
<d1b2> <azonenberg> here's what i have host wide right now
<d1b2> <azonenberg> the three right hand blobs on the memory bar are your instances
<d1b2> <johnsel> can you not overprovision the RAM?
<d1b2> <azonenberg> Not to my knowledge, with the current setup
<d1b2> <azonenberg> i can oversubscribe CPU capacity but not memory
<d1b2> <azonenberg> anyway, if i have to throw another 128 or 256 gigs in there it's not the end of the world but it'll take time and money
<d1b2> <johnsel> I think we can use dynamic memory management
<d1b2> <azonenberg> so for now let's make it work with what we have
<d1b2> <johnsel> yeah no extra hardware shouldn't be necessary
<d1b2> <johnsel> I'm sure we're not the first to run into this issue
<d1b2> <azonenberg> anyway for the short term i don't have any other memory-hungry instances planned to run there
<d1b2> <azonenberg> so let's just make it functional asap
<d1b2> <johnsel> weird
<d1b2> <johnsel> it says I'm using 140GB
<d1b2> <johnsel> well, 133GB of 140
<d1b2> <johnsel> but I have 28GB + 4x 8GB when I look at the VM level
<d1b2> <azonenberg> so that usage count seems to count vms that you aren't running
<d1b2> <azonenberg> maybe you have some vm created but not running that is counting against your set cap
<d1b2> <johnsel> I don't
<d1b2> <azonenberg> huuuh
<d1b2> <johnsel> I counted including those that aren't running
<d1b2> <johnsel> that's why I'm confused
<d1b2> <johnsel> or maybe it is confused
<d1b2> <johnsel> either way something is off lol
<d1b2> <azonenberg> *digs*
<d1b2> <azonenberg> ok so i see ci_windows11 with 8GB, not running
<d1b2> <azonenberg> ci_windows11-cloudconfigtest_clone, 8gb, not running
<d1b2> <azonenberg> windows 11 cloudinit gpu, 8gb, running
<d1b2> <azonenberg> ci_linux_builder_gpu_58, 28gb, running
<d1b2> <azonenberg> ci_orchestrator, 8gb, running
<d1b2> <johnsel> yep
<d1b2> <azonenberg> so yeah i don't know why it's saying you're using 125 gigs lol
<d1b2> <johnsel> it's a problem though
<d1b2> <johnsel> I can't create anything new now
<d1b2> <johnsel> (this may even have been the issue previously)
<d1b2> <azonenberg> Can you shut down all of your instances for a min so i can debug?
<d1b2> <azonenberg> everything even the orchestrator
<d1b2> <johnsel> done
<d1b2> <azonenberg> So it's still saying 125G RAM used in this resource set
<d1b2> <johnsel> yeah weird right
<d1b2> <johnsel> can you change how much is assigned?
<d1b2> <johnsel> maybe that'll trigger a refresh or something
<d1b2> <johnsel> hmmm
<d1b2> <johnsel> that also shows 140GB used
<d1b2> <johnsel> so it is somehow counting the host usage?
<d1b2> <azonenberg> No
<d1b2> <azonenberg> It's also showing 94 vCPUs used
<d1b2> <azonenberg> it seems like it counts all of your templates and snapshots against the cap
<d1b2> <johnsel> that's lame
<d1b2> <johnsel> I can't see them on my end properly either
<d1b2> <azonenberg> currently scheduled for xen orchestra v6
<d1b2> <azonenberg> anyway, near-term solution seems to be i allocate you way more ram than you can actually use
<d1b2> <johnsel> or cleaning up
<d1b2> <azonenberg> if it counts all of your templates and snapshots
<d1b2> <azonenberg> do you have unused templates we can get rid of?
<d1b2> <johnsel> can you list all my snapshots?
<d1b2> <azonenberg> i'm not sure about snapshots let's start with templates
<d1b2> <johnsel> and templates for that matter
<d1b2> <azonenberg> i see windows 10 64 bit, ci_windows11, centos 7, debian 11 cloudinit hub, debian 11 cloudinit gpu, ci_builder_gpu, windows 11 cloudinit gpu
<d1b2> <johnsel> sec, checking which ones we use
<d1b2> <johnsel> can you pull a download of those?
<d1b2> <azonenberg> i wouldn't delete them
<d1b2> <azonenberg> just remove them from the self-service set so they don't count against your usage
<d1b2> <johnsel> ooh right
<d1b2> <johnsel> that's a good idea
<d1b2> <azonenberg> can always put them back
<d1b2> <azonenberg> i have plenty of disk space, ram is in shorter supply
<d1b2> <johnsel> in that case I think we actually use Debian 11 Cloud-Init (GPU) and Debian 11 Cloud-Init (Hub) right now
<d1b2> <azonenberg> and what on the windows side?
<d1b2> <johnsel> and then for Windows 11 let me see
<d1b2> <azonenberg> so ci_builder_gpu, windows10, and ci_windows11 can be removed?
<d1b2> <johnsel> let's remove them all
<d1b2> <azonenberg> and centos7?
<d1b2> <johnsel> I have a VM that we will convert to a template at some point
<d1b2> <johnsel> for win11
<d1b2> <johnsel> yes centos7 too
<d1b2> <azonenberg> aaand that didn't change usage at all
<d1b2> <azonenberg> good to do the cleanup but that wasn't a contributing factor
<d1b2> <johnsel> what the f
<d1b2> <johnsel> I deleted 2 vms
<d1b2> <johnsel> still using 109GB
<d1b2> <azonenberg> Now it says 109 GB
<d1b2> <azonenberg> thats down from 125
<d1b2> <azonenberg> so that did something
<d1b2> <johnsel> I also removed the runner
<d1b2> <azonenberg> now down to 81.19
<d1b2> <johnsel> Looks like we have 64 GB ghost usage
<d1b2> <azonenberg> Welp
<d1b2> <johnsel> or 48
<d1b2> <johnsel> no, 64
<d1b2> <johnsel> it must be something
<d1b2> <johnsel> perhaps a snapshot
<d1b2> <azonenberg> I bumped your cap up to 192GB which is enough for 128G of actual usage plus the ghost
<d1b2> <azonenberg> so that should fix it for now
<d1b2> <azonenberg> we can dig more into this later
<d1b2> <azonenberg> i have to get back to work
<d1b2> <azonenberg> in the meantime, restart all the nodes and test that it's at least working now?
<d1b2> <johnsel> yup yup
<d1b2> <johnsel> alright done, testing now
<d1b2> <azonenberg> All good?
<d1b2> <azonenberg> also are we only running debian in the test CI setup or do we have windows running on our hardware yet too?
<d1b2> <johnsel> 1. yes, 2. not yet
<d1b2> <azonenberg> Ok
<d1b2> <azonenberg> So how far are we from being ready to remove the github hosted ubuntu test runners?
<d1b2> <azonenberg> What if anything has to be done?
<d1b2> <johnsel> I think we're very near
<d1b2> <johnsel> Assuming the cycling keeps working now it should be done
<d1b2> <azonenberg> yeah
<d1b2> <azonenberg> And what are the blockers, if any, to getting windows running as well?
<d1b2> <johnsel> We need to do some cleanup on the template but that's not a lot of work
<d1b2> <azonenberg> Also, we still need to actually make the CI run the tests
<d1b2> <johnsel> Ah yes vulkan support may or may not work at this point
<d1b2> <azonenberg> It does not, i can tell you that now
<d1b2> <johnsel> We'll have to double check that
<d1b2> <azonenberg> because i see "glfw init failed" in the log when it tries to initialize vulkan while it's enumerating the list of available tests
<d1b2> <azonenberg> (this is itself a bug, we shouldn't init vulkan until we are ready to run a test)
<d1b2> <johnsel> yeah then we'll have to look into that
<d1b2> <azonenberg> the vulkan issue may be, and hopefully is, as simple as not having an x server running or $DISPLAY set right
<d1b2> <johnsel> we may need to set up a display
<d1b2> <johnsel> right
<d1b2> <johnsel> or set it to offscreen rendering
<d1b2> <johnsel> this should be possible
<d1b2> <azonenberg> setting up x is the simplest route imo, and will be needed down the road if we start adding unit tests for the ngscopeclient gui
<d1b2> <johnsel> I think that requires some code changes though
<d1b2> <azonenberg> e.g. simulated mouse clicks, testing rendering against golden outputs, etc
<d1b2> <johnsel> true
<d1b2> <azonenberg> That's where i want to end up long term
<d1b2> <azonenberg> so we should just do it
<d1b2> <azonenberg> wrt the glfw init failed bug, i intend to fix this soonish because it wastes time during the build
<d1b2> <azonenberg> But i'm holding off for now as it's a useful indicator the CI doesn't have working vulkan :p
<d1b2> <azonenberg> once we get to the point that we have working vulkan and are ready to actually run the tests in the CI environment, then i'll fix it
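The deferred-init shape being described might look something like this (function names are illustrative assumptions, not the actual scopehal API): test enumeration never touches Vulkan, and the first test that needs it pays the startup cost exactly once.

    #include <mutex>
    #include <stdexcept>

    bool VulkanInit();  // assumed: the existing one-shot initializer

    // Call from test setup rather than from main(), so that merely
    // enumerating the test list never brings up GLFW/Vulkan
    void EnsureVulkanInitialized()
    {
        static std::once_flag s_once;
        static bool s_ok = false;
        std::call_once(s_once, [] { s_ok = VulkanInit(); });
        if(!s_ok)
            throw std::runtime_error("Vulkan initialization failed");
    }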
<d1b2> <johnsel> Yeah so setting Vulkan/X up on Debian and setting msys up on Windows remain as major tasks
<d1b2> <azonenberg> Yep
<d1b2> <azonenberg> That's not needed to be able to remove the existing ubuntu builder though
<d1b2> <azonenberg> as that doesn't have working vulkan either
<d1b2> <johnsel> True
<d1b2> <johnsel> It should just run now in theory
<d1b2> <azonenberg> so once we're at parity with that we can transition that to be the official linux CI (which i think we're at, we'll give it a little time to make sure the cycling is stable)
<d1b2> <azonenberg> then delete the ubuntu runner
<d1b2> <johnsel> Yep
<d1b2> <azonenberg> then work on getting vulkan working in the debian runner and getting the windows runner moved in house as well
<d1b2> <azonenberg> then we still have the x86 macos runner on github infrastructure as a stand-in for the eventual macos arm64 runner we need to figure out a plan for
<d1b2> <johnsel> yeah I think those are starting to be available on github too
<d1b2> <johnsel> I'd like to work on a ARM build too at some point
<d1b2> <johnsel> and I'd like to change the ci script so it doesn't use the template function to fill in the version number of vulkan
<d1b2> <david.rysk> ARM macOS? I can do some testing here if we're closer to working
<d1b2> <johnsel> now I can't copy paste the ci script to run it manually and get the same 'golden' result on my machine
<d1b2> <david.rysk> do you have a runner? need a mac mini or something
<d1b2> <johnsel> for me ARM linux, but
<d1b2> <johnsel> we also want mac, sure
<d1b2> <azonenberg> we do not currently have any arm ci
<d1b2> <azonenberg> this actually allowed a bug to escape not long ago
<d1b2> <azonenberg> where the build was broken on apple silicon because i accidentally un-ifdef'd an x86-ism
<d1b2> <azonenberg> and none of our CI caught it so i didn't know anything was broken until a mac user complained
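For context, the usual guard shape for that kind of x86-ism (purely illustrative, not the code that actually broke):

    #ifdef __x86_64__
        #include <immintrin.h>
        // x86-specific fast path using AVX intrinsics goes here
    #else
        // portable fallback so Apple Silicon and other arches still build
    #endif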
<d1b2> <azonenberg> long term plan is mac mini or somebody's m1 macbook when they upgrade
<d1b2> <azonenberg> hosted at my place so it will be on the same sandbox network as the other CI stuff
<d1b2> <azonenberg> because one of the reasons we're moving CI in house is so that we can eventually do hardware-in-loop testing
<d1b2> <johnsel> yep we have a nice little setup now
<d1b2> <johnsel> I'm quite happy with how we integrated things
<d1b2> <johnsel> alright since we're looking at it anyway I'm taking a stab at fixing Vulkan as well
<d1b2> <johnsel> that did not go easy
<d1b2> <johnsel> @azonenberg let's pull a template from this vm
<d1b2> <johnsel> it looks like snapshots do indeed count against the resources
<d1b2> <johnsel> I think we have an orphaned snapshot somewhere