Job timeout applied after middleware #752

bgentry · 2025-02-10T22:07:30Z

While reading through the executor for another change, I realized that as of #632 / #584 we're applying the job timeout after stepping through all the middleware:

river/job_executor.go

Lines 209 to 227 in fc667eb

    doInner := func(ctx context.Context) error {
    	jobTimeout := e.WorkUnit.Timeout()
    	if jobTimeout == 0 {
    		jobTimeout = e.ClientJobTimeout
    	}

    	// No timeout if a -1 was specified.
    	if jobTimeout > 0 {
    		var cancel context.CancelFunc
    		ctx, cancel = context.WithTimeout(ctx, jobTimeout)
    		defer cancel()
    	}

    	if err := e.WorkUnit.Work(ctx); err != nil {
    		return err
    	}

    	return nil
    }

I'm not sure if this was an intentional decision and I couldn't find any conversations where it was mentioned, so I thought we should at least talk about it. Do you think this is the right place for it, or should the timeout be applied before the middleware stack gets called? I was thinking of a worst-case scenario where a middleware gets stuck or takes a while on a blocking operation, and the job ends up running far longer than either the job's timeout or the client's default timeout.

brandur · 2025-02-12T02:56:53Z

Hmm, so I didn't consider it one way or the other super deeply admittedly, but including the middleware in the timeout doesn't seem super obviously more or less correct to me.

An argument could be made that when setting work timeouts, users are trying to set a timeout specifically on their code as opposed to River's internal code, which to their eye would be an implementation detail. So similarly to how the timeout doesn't include the time it takes to complete a job, it doesn't include middleware, which is largely made up of other internal stuff.

bgentry · 2025-02-12T03:23:20Z

The risk I see is that as it currently is, the middleware have no timeout whatsoever applied to them. Beyond that I don't think it matters a lot either way, but I do worry about the risk of a middleware getting stuck and the job just running forever despite whatever timeouts are configured.

brandur · 2025-02-14T21:36:09Z

Yeah I suppose that could be a danger depending on what you're trying to do in the middleware. Probably a separate timeout like you'd find in http.Transport (each phase of a request can be configured separately) would be as defensible as mixing in the same timeout being used for the job's work phase.

bgentry mentioned this issue Feb 21, 2025

Job executor: Unmarshal job args late, after middlewares have run #783

Merged
