<Cheyenne>
Question on failover workers. Is it possible to have a list of workers so that, if one of them "goes missing" or isn't available, builds switch to an alternative worker?
<sknebel>
so you don't want the failover workers to be used unless the first one "goes missing"?
<sknebel>
that checks if the primary is working and only allows the secondary if it isn't
<sknebel>
but I'm not sure when exactly this check happens, maybe this could end up with builds already having "decided" that they can't use the secondary when the primary fails - would need investigation
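A rough, untested sketch of that kind of primary-first selection, using the nextWorker hook on BuilderConfig; the worker names, builder name and factory are placeholders:

    from buildbot.plugins import util

    def prefer_primary(builder, workers, buildrequest):
        # `workers` is the list of WorkerForBuilder objects Buildbot
        # currently considers usable for this build request
        for wfb in workers:
            if wfb.worker.workername == 'primary-worker':
                return wfb                      # primary is available: use it
        # primary not usable: fall back to whatever is left
        return workers[0] if workers else None

    c['builders'].append(util.BuilderConfig(
        name='example-builder',
        workernames=['primary-worker', 'backup-worker'],
        nextWorker=prefer_primary,
        factory=factory,   # assumed to be defined elsewhere in master.cfg
    ))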
<Cheyenne>
At the moment I've created a "watchedworker" class based on latent workers; I "watch" for traffic coming back from the worker and have a timeout on it.
<sknebel>
or if you are not trying to run other builds on the secondaries, you could have them paused by default and only unpause them if the primary fails (potentially through an external monitoring system)
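On the monitoring side, something along these lines could flip a paused backup worker on; this assumes the pause/unpause control actions on the workers REST endpoint and a made-up master URL, with authentication left out:

    import requests

    MASTER = 'http://buildbot.example.com:8010'

    def set_worker_paused(workername, paused, session):
        # the workers endpoint accepts JSON-RPC control actions
        # ("pause"/"unpause"); the worker is addressed by name here,
        # a numeric workerid should work as well
        resp = session.post(
            f'{MASTER}/api/v2/workers/{workername}',
            json={'jsonrpc': '2.0',
                  'method': 'pause' if paused else 'unpause',
                  'id': 1,
                  'params': {}},
        )
        resp.raise_for_status()

    # e.g. when the monitor decides the primary is gone:
    # set_worker_paused('backup-worker', False, authenticated_session)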
<Cheyenne>
but a coworker isn't "happy" with that solution and says that we should just use multiple workers for a builder and have a "phony" worker that can be used...
<sknebel>
not sure what that would mean
<Cheyenne>
the problem we are trying to resolve is with Gerrit. If a worker gets hung up for whatever reason, it can hold up Gerrit reporting. So the idea is: if a worker has "gone away", we can mark it as a build failure to Gerrit.
<sknebel>
wouldn't the normal timeouts already handle that? worker goes away - build times out?
<sknebel>
and that's a failure
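That is, the per-step timeouts, roughly like this (parameter names from ShellCommand; the command and values are only examples):

    from buildbot.plugins import steps

    # fail the step if it produces no output for 20 minutes,
    # and unconditionally after 2 hours
    step = steps.ShellCommand(
        command=['make', 'check'],
        timeout=1200,
        maxTime=7200,
    )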
<Cheyenne>
No... we've had workers that are "hung" or stalled and have gone on for many hours. We've had to manually cancel or stop the build on that worker
<sknebel>
ah, ok
<sknebel>
trying to use worker scheduling for that feels weird though, especially since it would probably have the same limitations (i.e. if Buildbot thinks the worker is still fine, it wouldn't schedule to the fallback)
<sknebel>
should probably rather live in some external monitoring
<sknebel>
I'd guess
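One possible shape for that external monitoring, sketched: poll the REST API for builds that have been running too long and stop them. The URL, threshold and query filter are assumptions, and authentication is omitted:

    import time
    import requests

    MASTER = 'http://buildbot.example.com:8010'
    MAX_BUILD_SECONDS = 4 * 3600   # arbitrary threshold

    def stop_stuck_builds(session):
        # list builds that are still running
        builds = session.get(
            f'{MASTER}/api/v2/builds', params={'complete': 'false'}
        ).json()['builds']
        now = time.time()
        for build in builds:
            # started_at is assumed to be an epoch timestamp
            if now - build['started_at'] > MAX_BUILD_SECONDS:
                # "stop" is the control action a build accepts
                session.post(
                    f'{MASTER}/api/v2/builds/{build["buildid"]}',
                    json={'jsonrpc': '2.0', 'method': 'stop', 'id': 1,
                          'params': {'reason': 'watchdog: build appears stuck'}},
                ).raise_for_status()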
<Cheyenne>
the "watched" worker seems to be doing okay. If it detects that a worker hasn't responded in a certain period of time, it cancels the job
<Cheyenne>
but we've still had some odd problems. Today we had a couple of workers that were stuck on "worker_preparation" for about 23 hours
<sknebel>
we had kinda similar issues when we had problems with the master machine - apparently the master getting I/O-starved can badly confuse things