
Monitoring crashed workers and rescuing stuck jobs #773

Open
BeLuckyDaf opened this issue Feb 18, 2025 · 1 comment

Comments

BeLuckyDaf commented Feb 18, 2025

Hello there!

After reading about leader election and timestamp updates, my initial thought was that the rescuer would also monitor all workers that have running jobs attached to them.

I assumed that all workers would update some kind of healthcheck timestamp every 5 seconds or so, and if a worker hasn't updated its timestamp within a reasonable time period, all of its currently running jobs would be rescued.
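Roughly what I had in mind, purely as a sketch (the `river_client` table and `heartbeat_at` column below are made up for illustration, not anything that exists in River today):

```go
import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Hypothetical: each client proves it's alive every few seconds.
func heartbeatLoop(ctx context.Context, db *pgxpool.Pool, clientID string) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = db.Exec(ctx,
				`UPDATE river_client SET heartbeat_at = now() WHERE id = $1`, clientID)
		}
	}
}

// The rescuer (on the elected leader) could then requeue running jobs whose
// client hasn't checked in recently, e.g.:
//
//   UPDATE river_job SET state = 'available'
//   WHERE state = 'running'
//     AND client_id IN (SELECT id FROM river_client
//                       WHERE heartbeat_at < now() - interval '30 seconds');
```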

However, I see that jobs are only rescued based on a global rescue timeout.

What do you think about this?

brandur (Contributor) commented Feb 18, 2025

It's doable, but yep, also not how things currently work.

There's easy terminology confusion possible here, so just to make sure we're not being ambiguous: because you said "if some worker hasn't updated their timestamp in a reasonable time period, all its currently running jobs should be rescued", I'm going to assume that by "worker" you mean "client", because it's a client that has multiple concurrent jobs running. In River lingo, a "worker" is an implementation of a thing that runs a job (i.e. generally a struct with a Work implementation).
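To make that distinction concrete, here's roughly the shape of both, along the lines of River's getting-started examples (names like SortWorker are just illustrative):

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

type SortArgs struct {
	Strings []string `json:"strings"`
}

func (SortArgs) Kind() string { return "sort" }

// SortWorker is a "worker" in River lingo: a struct with a Work
// implementation that runs a single job of one kind.
type SortWorker struct {
	river.WorkerDefaults[SortArgs]
}

func (w *SortWorker) Work(ctx context.Context, job *river.Job[SortArgs]) error {
	// ... do the work for this one job ...
	return nil
}

func main() {
	ctx := context.Background()

	dbPool, err := pgxpool.New(ctx, "postgres://localhost/river_dev")
	if err != nil {
		log.Fatal(err)
	}

	workers := river.NewWorkers()
	river.AddWorker(workers, &SortWorker{})

	// The "client" is the process-level object that fetches jobs and runs
	// many of them concurrently across its configured queues.
	riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 100}},
		Workers: workers,
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := riverClient.Start(ctx); err != nil {
		log.Fatal(err)
	}
}
```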

A side effect of Go's error-returning design is that if one job in one goroutine fails, it generally shouldn't take down the entire process, so by extension we're more concerned with the health of individual jobs/workers than with an entire client at once. Even panics during a single job run are recovered, so it should really not be a thing for an entire client to go down all at once. If the process is given any chance to die gracefully, jobs are still expected to land in a usable place as long as all workers implement context cancellation properly (i.e. return once their context is cancelled after a hard stop is invoked).
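As a sketch of what handling context cancellation properly looks like inside a worker (illustrative names, not anything prescribed by River):

```go
import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

type ReportArgs struct {
	Month string `json:"month"`
}

func (ReportArgs) Kind() string { return "report" }

type ReportWorker struct {
	river.WorkerDefaults[ReportArgs]
}

// Work checks the context between chunks of work, so a hard stop (which
// cancels the context) makes the job return promptly instead of sitting in
// a running state until the rescue threshold.
func (w *ReportWorker) Work(ctx context.Context, job *river.Job[ReportArgs]) error {
	for i := 0; i < 10; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err() // the error is recorded and the job can be retried later
		case <-time.After(500 * time.Millisecond): // stand-in for a chunk of real work
		}
	}
	return nil
}
```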

That leaves you with a kill -9 or equivalent, but even there, most uses should be okay as long as reasonable timeouts are set, because the job rescuer will do its job even when the stuck jobs all belong to a single client.
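For anyone finding this later, the timeout in question is the client-level RescueStuckJobsAfter setting; a sketch (dbPool and workers set up as usual, and the value here is only an example):

```go
riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
	// Global rescue threshold: keep it comfortably longer than the longest
	// time a healthy job can legitimately run, or running jobs could be
	// rescued out from under a live client.
	RescueStuckJobsAfter: 15 * time.Minute,
	Queues:               map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 100}},
	Workers:              workers,
})
if err != nil {
	log.Fatal(err)
}
```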

So although it's possible, I think we'd need some pretty strong rationale to go down this road, especially given it'd likely need another migration, which we'd like to avoid if possible.
