Hello there!

After reading about leader election and its timestamp updates, my initial thought was that the rescuer would also monitor all workers that have running jobs attached to them.
I assumed that all workers would update some kind of a healthcheck timestamp every 5 seconds or so. And if some worker hasn't updated their timestamp in a reasonable time period, all its currently running jobs should be rescued.
However, I see that jobs are only rescued based on a global rescue timeout.
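To make the proposal a bit more concrete, here's roughly the kind of thing I have in mind. This is only a sketch: the `river_client` table and the `heartbeat_at` / `attempted_by_client_id` names are invented for illustration and aren't River's actual schema.

```go
package sketch

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// heartbeatLoop is what every client would run: bump a heartbeat row every
// few seconds. (Table and column names here are made up, not River's schema.)
func heartbeatLoop(ctx context.Context, db *pgxpool.Pool, clientID string) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = db.Exec(ctx,
				`UPDATE river_client SET heartbeat_at = now() WHERE id = $1`, clientID)
		}
	}
}

// The elected leader would then periodically rescue running jobs whose
// client heartbeat has gone stale (again, hypothetical schema).
const rescueStaleClientJobsSQL = `
UPDATE river_job
SET state = 'retryable', scheduled_at = now()
WHERE state = 'running'
  AND attempted_by_client_id IN (
      SELECT id FROM river_client
      WHERE heartbeat_at < now() - interval '30 seconds'
  )`
```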
What do you think about this?
It's doable, but yep, also not how things currently work.
There's easy terminology confusion possible here, so just to make sure we're not being ambiguous: because you said "if some worker hasn't updated their timestamp in a reasonable time period, all its currently running jobs should be rescued", I'm going to assume that by "worker" you mean "client", because it's a client that has multiple concurrent jobs running. In River lingo, a "worker" is an implementation of a thing that runs a job (i.e. generally a struct with a Work implementation).
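In code, the distinction looks roughly like this (a simplified sketch based on the documented API, details trimmed):

```go
package sketch

import (
	"context"
	"sort"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// A "worker" in River lingo: the implementation that runs one kind of job,
// i.e. a struct with a Work method.
type SortArgs struct {
	Strings []string `json:"strings"`
}

func (SortArgs) Kind() string { return "sort" }

type SortWorker struct {
	river.WorkerDefaults[SortArgs]
}

func (w *SortWorker) Work(ctx context.Context, job *river.Job[SortArgs]) error {
	sort.Strings(job.Args.Strings)
	return nil
}

// A "client" is the process-level object that runs many such jobs
// concurrently across its queues; it's the thing that would have to die all
// at once for a heartbeat-style rescue to matter.
func newClient(dbPool *pgxpool.Pool) (*river.Client[pgx.Tx], error) {
	workers := river.NewWorkers()
	river.AddWorker(workers, &SortWorker{})

	return river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 100}},
		Workers: workers,
	})
}
```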
A side effect of Go's error-returning design is that if one job in one goroutine fails, it generally shouldn't take down the entire process, and by extension we should be more concerned with the health of individual jobs/workers than with an entire client at once. Because even a panic during a single job run is recovered, it should be quite rare for an entire client to go down all at once. If the process is given any chance to die gracefully, jobs are still expected to land in a usable place as long as all workers implement context cancellation properly (i.e. return promptly after a hard stop is invoked).
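To illustrate "implement context cancellation properly": a hard stop cancels the work context, and a well-behaved Work checks it between units of work so it returns promptly and the job lands somewhere recoverable. A minimal sketch (BatchWorker/BatchArgs/processItem are made-up names):

```go
package sketch

import (
	"context"

	"github.com/riverqueue/river"
)

type BatchArgs struct {
	Items []string `json:"items"`
}

func (BatchArgs) Kind() string { return "batch" }

type BatchWorker struct {
	river.WorkerDefaults[BatchArgs]
}

// processItem stands in for whatever real work a job does.
func processItem(ctx context.Context, item string) error { return nil }

// Work checks the context between units of work so that a hard stop (which
// cancels the job's context) makes it return promptly instead of leaving the
// job dangling.
func (w *BatchWorker) Work(ctx context.Context, job *river.Job[BatchArgs]) error {
	for _, item := range job.Args.Items {
		select {
		case <-ctx.Done():
			// Returning the context's error lets the job be recorded as
			// errored and retried later, rather than sitting in "running".
			return ctx.Err()
		default:
		}
		if err := processItem(ctx, item); err != nil {
			return err
		}
	}
	return nil
}
```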
That leaves you with kill -9 or the equivalent, but even there, most uses should hopefully be okay as long as reasonable timeouts are set, because the job rescuer will eventually pick up all the jobs that belonged to a single dead client.
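In other words, the main knob is the rescue threshold on the client config. A minimal sketch, assuming the RescueStuckJobsAfter field on river.Config (check your version for the exact name and default):

```go
package sketch

import (
	"time"

	"github.com/riverqueue/river"
)

// rescueTunedConfig sketches the relevant knob: jobs still marked running
// after RescueStuckJobsAfter are picked up by the periodic rescuer, which is
// what eventually recovers jobs from a client killed with kill -9.
func rescueTunedConfig(workers *river.Workers) *river.Config {
	return &river.Config{
		Queues:               map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 100}},
		Workers:              workers,
		RescueStuckJobsAfter: 15 * time.Minute, // tune to comfortably exceed your longest job
	}
}
```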
So although it's possible, I think we'd need some pretty strong rationale to go down this road, especially given it'd likely require another migration, which we'd like to avoid if possible.