#buildbot on 2023-08-03 — irc logs at libera.irclog.whitequark.org

2022-03-07 03:24 verm__ changed the topic of #buildbot to: A Software Freedom Conservancy Project | Buildbot-3.5.0 | docs: http://docs.buildbot.net/current/ | tutorial: http://docs.buildbot.net/current/tutorial | irclogs: https://libera.irclog.whitequark.org/buildbot

07:03 ivanhoe has quit [Ping timeout: 260 seconds]

07:16 ivanhoe has joined #buildbot

07:17 ivanhoe has left #buildbot [#buildbot]

10:01 ivanhoe has joined #buildbot

10:02 ivanhoe is now known as iv4nhoe

10:03 iv4nhoe is now known as ivanhoe

17:26 cmouse has joined #buildbot

18:18 thm has left #buildbot [#buildbot]

18:40 ivanhoe has left #buildbot [#buildbot]

21:01 <Cheyenne> Question on failover workers: is it possible to have a list of workers so that if one of them "goes missing" or isn't available, builds switch to an alternative worker?

21:33 <sknebel> so you don't want the failover workers to be used unless the first one "goes missing"?

21:33 <Cheyenne> correct

21:35 <sknebel> hm... maybe a custom https://docs.buildbot.net/latest/manual/customization.html#canstartbuild-functions could work

21:36 <sknebel> that checks if the primary is working and only allows the secondary if it isn't

21:37 <sknebel> but I'm not sure when exactly this check happens, maybe this could end up with builds already having "decided" that they can't use the secondary by the time the primary fails - would need investigation
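
The canStartBuild idea above could look roughly like the sketch below. This is only an illustration under assumptions: `make_can_start_build` and `is_connected` are hypothetical names, not Buildbot APIs, and how to actually find out whether the primary is connected is left open. The callback takes the `(builder, wfb, request)` arguments described in the linked customization docs; it does not settle the timing question raised above.

```python
from types import SimpleNamespace


def make_can_start_build(primary_name, is_connected):
    """Build a canStartBuild-style callback (hypothetical helper).

    is_connected is an assumed callable: worker name -> bool.
    """

    def can_start_build(builder, wfb, request):
        # Assumption: wfb.worker.workername identifies the candidate worker.
        if wfb.worker.workername == primary_name:
            return True
        # Secondaries are allowed only while the primary is unavailable.
        return not is_connected(primary_name)

    return can_start_build


# Stand-in objects for demonstration only; these are not Buildbot classes.
connected = {"primary": True}
can_start = make_can_start_build(
    "primary", lambda name: connected.get(name, False)
)
primary_wfb = SimpleNamespace(worker=SimpleNamespace(workername="primary"))
backup_wfb = SimpleNamespace(worker=SimpleNamespace(workername="backup"))
```

In a real master config the returned function would presumably be passed as the builder's `canStartBuild` argument, with `is_connected` backed by whatever health signal is trusted.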

21:38 <Cheyenne> at the moment, I created a "watchedworker" class that is based on latent workers, I "watch" for traffic coming back from the worker and have a timeout for it.

21:38 <sknebel> or if you are not trying to run other builds on the secondaries, you could have them paused by default and only unpause them if the primary fails (potentially through an external monitoring system)

21:39 <Cheyenne> but a coworker isn't "happy" with that solution and says that we should just use multiple workers for a builder and have a "phony" worker that can be used.

21:41 <sknebel> not sure what that would mean

21:45 <Cheyenne> the problem we are trying to resolve is with gerrit. If a worker gets hung up for whatever reason, it can hold up gerrit reporting. So, the idea is that if a worker has "gone away", we can report it to gerrit as a build failure.

21:46 <sknebel> wouldn't the normal timeouts already handle that? worker goes away - build times out?

21:46 <sknebel> and that's a failure

21:47 <Cheyenne> No.. we've had workers that were "hung" or stalled for many hours. We have had to manually cancel or stop the build for that worker

21:52 <sknebel> ah, ok

21:53 <sknebel> trying to use worker scheduling for that feels weird though, especially since it would probably have the same limitations (i.e. if buildbot thinks the worker is still fine, it wouldn't schedule to the fallback)

21:53 <sknebel> should probably rather live in some external monitoring

21:53 <sknebel> I'd guess

22:04 <Cheyenne> the "watched" worker seems to be doing okay. If it detects that a worker hasn't responded in a certain period of time, it cancels the job
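
The "watched" worker idea (a per-worker timeout on traffic coming back) can be sketched independently of Buildbot as below. This is not Cheyenne's actual class, which isn't shown in the log; `WorkerWatchdog`, `heard_from`, and `stalled` are hypothetical names, and cancelling the job for a stalled worker is left to the caller.

```python
import time


class WorkerWatchdog:
    """Track when each worker was last heard from (illustrative sketch)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = {}

    def heard_from(self, worker_name):
        # Call this whenever any traffic arrives from the worker.
        self._last_seen[worker_name] = self._clock()

    def stalled(self):
        # Workers silent for longer than timeout_s, in sorted order.
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

A periodic task could call `stalled()` and cancel the builds on whichever workers it returns. Note this only catches workers that go silent; a worker that keeps sending traffic but makes no progress would need a different signal.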

22:06 <Cheyenne> but we've still had some odd problems. today we had a couple of workers that were stuck on "worker_preparation" for about 23 hours

22:08 <sknebel> we had kinda similar issues when we had problems with the master machine - apparently the master getting I/O starved can badly confuse things