
Monitoring crashed workers and rescuing stuck jobs #773

Open
BeLuckyDaf opened this issue Feb 18, 2025 · 1 comment

Comments

BeLuckyDaf commented Feb 18, 2025

Hello there!

After reading about leader election and timestamp updates, my initial thought was that the rescuer would also monitor all workers that have running jobs attached to them.

I assumed that all workers would update some kind of healthcheck timestamp every 5 seconds or so, and if a worker hasn't updated its timestamp within a reasonable time period, all of its currently running jobs would be rescued.
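Roughly what I had in mind, purely as a sketch (the `river_client` table and `heartbeat_at` column below are made up for illustration, not anything that exists in River today):

```go
import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Hypothetical: each client proves it's alive every few seconds.
func heartbeatLoop(ctx context.Context, db *pgxpool.Pool, clientID string) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_, _ = db.Exec(ctx,
				`UPDATE river_client SET heartbeat_at = now() WHERE id = $1`, clientID)
		}
	}
}

// The rescuer (on the elected leader) could then requeue running jobs whose
// client hasn't checked in recently, e.g.:
//
//   UPDATE river_job SET state = 'available'
//   WHERE state = 'running'
//     AND client_id IN (SELECT id FROM river_client
//                       WHERE heartbeat_at < now() - interval '30 seconds');
```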

However, I see that jobs are only rescued based on a global rescue timeout.

What do you think about this?

brandur (Contributor) commented Feb 18, 2025

It's doable, but yep, also not how things currently work.

There's easy terminology confusion possible here, so just to make sure we're not being ambiguous: because you said "if some worker hasn't updated their timestamp in a reasonable time period, all its currently running jobs should be rescued", I'm going to assume that by "worker" you mean "client", because it's a client that has multiple concurrent jobs running. In River lingo, a "worker" is an implementation of a thing that runs a job (i.e. generally a struct with a Work implementation).
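To make that distinction concrete, here's roughly the shape of both, along the lines of River's getting-started examples (names like SortWorker are just illustrative):

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

type SortArgs struct {
	Strings []string `json:"strings"`
}

func (SortArgs) Kind() string { return "sort" }

// SortWorker is a "worker" in River lingo: a struct with a Work
// implementation that runs a single job of one kind.
type SortWorker struct {
	river.WorkerDefaults[SortArgs]
}

func (w *SortWorker) Work(ctx context.Context, job *river.Job[SortArgs]) error {
	// ... do the work for this one job ...
	return nil
}

func main() {
	ctx := context.Background()

	dbPool, err := pgxpool.New(ctx, "postgres://localhost/river_dev")
	if err != nil {
		log.Fatal(err)
	}

	workers := river.NewWorkers()
	river.AddWorker(workers, &SortWorker{})

	// The "client" is the process-level object that fetches jobs and runs
	// many of them concurrently across its configured queues.
	riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues:  map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 100}},
		Workers: workers,
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := riverClient.Start(ctx); err != nil {
		log.Fatal(err)
	}
}
```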

A side effect of Go's error-returning design is that if one job in one goroutine fails, it generally shouldn't take down the entire process, so by extension we're more concerned with the health of individual jobs/workers than with an entire client at once. Even panics during a single job run are recovered, so it should really not be a thing for an entire client to go down all at once. If the process is given any chance to die gracefully, jobs are still expected to land in a usable place as long as all workers implement context cancellation properly (i.e. return once their context is cancelled after a hard stop is invoked).
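As a sketch of what handling context cancellation properly looks like inside a worker (illustrative names, not anything prescribed by River):

```go
import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

type ReportArgs struct {
	Month string `json:"month"`
}

func (ReportArgs) Kind() string { return "report" }

type ReportWorker struct {
	river.WorkerDefaults[ReportArgs]
}

// Work checks the context between chunks of work, so a hard stop (which
// cancels the context) makes the job return promptly instead of sitting in
// a running state until the rescue threshold.
func (w *ReportWorker) Work(ctx context.Context, job *river.Job[ReportArgs]) error {
	for i := 0; i < 10; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err() // the error is recorded and the job can be retried later
		case <-time.After(500 * time.Millisecond): // stand-in for a chunk of real work
		}
	}
	return nil
}
```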

That leaves you with a kill -9 or equivalent, but even there, most uses should be okay as long as reasonable timeouts are set, because the job rescuer will do its job even when the stuck jobs all belong to a single client.
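For anyone finding this later, the timeout in question is the client-level RescueStuckJobsAfter setting; a sketch (dbPool and workers set up as usual, and the value here is only an example):

```go
riverClient, err := river.NewClient(riverpgxv5.New(dbPool), &river.Config{
	// Global rescue threshold: keep it comfortably longer than the longest
	// time a healthy job can legitimately run, or running jobs could be
	// rescued out from under a live client.
	RescueStuckJobsAfter: 15 * time.Minute,
	Queues:               map[string]river.QueueConfig{river.QueueDefault: {MaxWorkers: 100}},
	Workers:              workers,
})
if err != nil {
	log.Fatal(err)
}
```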

So although it's possible, I think we'd need some pretty strong rationale to go down this road, especially given it'd likely need another migration, which we'd like to avoid if possible.
