Bad JobSchedule query performance with lots of scheduled jobs #785

Open · rose-m opened this issue Feb 26, 2025 · 16 comments

rose-m commented Feb 26, 2025

Hey there, thanks for the project in general! Really nice and lightweight to use 👍

We're currently using it in production and running ~45k jobs per day, so not too big. However, essentially all of our jobs are scheduled: we receive an external request describing what to execute, and the work gets distributed over the day.

Looking at our DB monitoring, we can see that the JobSchedule query performs very badly and causes significant load on the DB. Given that the scheduler runs every 5s, an average latency of 2.5s per call is concerning. Is that something you could look into?
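
(As a cross-check of the monitoring numbers below, and assuming the pg_stat_statements extension is enabled on the instance, the per-query load can be pulled roughly like this; column names are for Postgres 13+.)

-- Rough sketch: top statements by total execution time (pg_stat_statements, PG 13+ column names).
SELECT calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric)    AS total_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;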

Here's a screenshot from AWS RDS performance insights:

Image

Let me know if I can provide any additional details.

bgentry (Contributor) commented Feb 26, 2025

That’s odd; it should be utilizing the prioritized fetch index for an efficient index scan. Any chance you could provide the output of that query run with EXPLAIN ANALYZE to help diagnose?

rose-m commented Feb 26, 2025

Sure, I hope this link works for you. I had to manually substitute the timestamps with now() + interval '5 seconds', since from looking at the riverqueue code that seems to be how far ahead the scheduler looks.

https://explain.dalibo.com/plan/ec514gccf444cg22
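
For context, here's a simplified sketch of what the locking part of that query boils down to, with the lookahead substituted. This is not the exact JobSchedule query (that one wraps this in several CTEs), and the selected columns are my assumption:

-- Sketch only: the shape of the row-locking SELECT inside the JobSchedule query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, unique_key, priority, scheduled_at
FROM river_job
WHERE state IN ('retryable', 'scheduled')
  AND priority >= 0
  AND scheduled_at <= now() + interval '5 seconds'
ORDER BY priority, scheduled_at, id
LIMIT 100
FOR UPDATE;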

bgentry (Contributor) commented Feb 26, 2025

I'm a lot more used to looking at raw EXPLAIN ANALYZE output so I might be missing something, but from your link my impression is that ~all of the time is going to two slow parts in your case:

  1. Locking the 5 rows chosen to be scheduled (~1.5s)
  2. Doing a CTE scan of those 5 rows (also ~1.5s)

I can't really make sense of that, unless your DB instance is hyper constrained or you're somehow locking those rows elsewhere and blocking them from being scheduled. Both of those operations should be very fast.

rose-m commented Feb 26, 2025

Yeah, we're not doing anything with this table ourselves... We're also not scheduling constantly, essentially just once upfront. The DB is mostly idle, and this is the only query producing any significant load (and CPU usage) on it.
The instance is an AWS RDS Postgres db.t4g.xlarge.

bgentry (Contributor) commented Feb 26, 2025

Can you also post the raw (non-JSON) EXPLAIN ANALYZE output, just so I’m sure I’m reading it right?

rose-m commented Feb 26, 2025

Sure, here it is:

Nested Loop  (cost=87408.68..136046.56 rows=7377 width=1034) (actual time=4222.227..4222.324 rows=2 loops=1)
  CTE jobs_to_schedule
    ->  Limit  (cost=86331.54..86332.79 rows=100 width=31) (actual time=4221.817..4221.866 rows=2 loops=1)
          ->  LockRows  (cost=86331.54..86823.59 rows=39364 width=31) (actual time=4221.816..4221.864 rows=2 loops=1)
                ->  Sort  (cost=86331.54..86429.95 rows=39364 width=31) (actual time=7.222..7.226 rows=19 loops=1)
                      Sort Key: river_job_1.priority, river_job_1.scheduled_at, river_job_1.id
                      Sort Method: quicksort  Memory: 25kB
                      ->  Bitmap Heap Scan on river_job river_job_1  (cost=2437.75..84827.08 rows=39364 width=31) (actual time=4.956..7.203 rows=19 loops=1)
                            Recheck Cond: ((state = ANY ('{retryable,scheduled}'::river_job_state[])) AND (priority >= 0) AND (scheduled_at <= (now() + '00:00:05'::interval)))
                            Heap Blocks: exact=3652
                            ->  Bitmap Index Scan on river_job_prioritized_fetching_index  (cost=0.00..2427.91 rows=39364 width=0) (actual time=3.818..3.819 rows=4723 loops=1)
                                  Index Cond: ((state = ANY ('{retryable,scheduled}'::river_job_state[])) AND (priority >= 0) AND (scheduled_at <= (now() + '00:00:05'::interval)))
  CTE jobs_with_rownum
    ->  WindowAgg  (cost=5.35..7.82 rows=100 width=67) (actual time=4221.899..4221.903 rows=2 loops=1)
          ->  Sort  (cost=5.32..5.57 rows=100 width=59) (actual time=4221.894..4221.894 rows=2 loops=1)
                Sort Key: jobs_to_schedule.unique_key, jobs_to_schedule.priority, jobs_to_schedule.scheduled_at, jobs_to_schedule.id
                Sort Method: quicksort  Memory: 25kB
                ->  CTE Scan on jobs_to_schedule  (cost=0.00..2.00 rows=100 width=59) (actual time=4221.820..4221.869 rows=2 loops=1)
  CTE updated_jobs
    ->  Update on river_job river_job_2  (cost=16.89..1067.64 rows=7377 width=201) (actual time=4222.217..4222.306 rows=2 loops=1)
          ->  Hash Left Join  (cost=16.89..1067.64 rows=7377 width=201) (actual time=4221.955..4221.969 rows=2 loops=1)
                Hash Cond: (job.unique_key = river_job_3.unique_key)
                ->  Nested Loop  (cost=0.43..842.50 rows=100 width=230) (actual time=4221.926..4221.937 rows=2 loops=1)
                      ->  CTE Scan on jobs_with_rownum job  (cost=0.00..2.00 rows=100 width=120) (actual time=4221.908..4221.913 rows=2 loops=1)
                      ->  Index Scan using river_job_pkey on river_job river_job_2  (cost=0.43..8.40 rows=1 width=126) (actual time=0.010..0.010 rows=1 loops=2)
                            Index Cond: (id = job.id)
                ->  Hash  (cost=14.07..14.07 rows=192 width=71) (actual time=0.019..0.020 rows=0 loops=1)
                      Buckets: 1024  Batches: 1  Memory Usage: 8kB
                      ->  Merge Join  (cost=0.28..14.07 rows=192 width=71) (actual time=0.019..0.020 rows=0 loops=1)
                            Merge Cond: (river_job_3.unique_key = jobs_with_rownum.unique_key)
                            Join Filter: (river_job_3.id <> jobs_with_rownum.id)
                            ->  Index Scan using river_job_unique_idx on river_job river_job_3  (cost=0.28..8.69 rows=5 width=15) (actual time=0.012..0.012 rows=1 loops=1)
                                  Index Cond: (unique_key IS NOT NULL)
                                  Filter: ((unique_states IS NOT NULL) AND (CASE state WHEN 'available'::river_job_state THEN get_bit(unique_states, 7) WHEN 'cancelled'::river_job_state THEN get_bit(unique_states, 6) WHEN 'completed'::river_job_state THEN get_bit(unique_states, 5) WHEN 'discarded'::river_job_state THEN get_bit(unique_states, 4) WHEN 'pending'::river_job_state THEN get_bit(unique_states, 3) WHEN 'retryable'::river_job_state THEN get_bit(unique_states, 2) WHEN 'running'::river_job_state THEN get_bit(unique_states, 1) WHEN 'scheduled'::river_job_state THEN get_bit(unique_states, 0) ELSE 0 END = 1))
                            ->  Materialize  (cost=0.00..2.25 rows=100 width=104) (actual time=0.005..0.005 rows=1 loops=1)
                                  ->  CTE Scan on jobs_with_rownum  (cost=0.00..2.00 rows=100 width=104) (actual time=0.003..0.003 rows=1 loops=1)
  ->  CTE Scan on updated_jobs  (cost=0.00..147.54 rows=7377 width=9) (actual time=4222.219..4222.306 rows=2 loops=1)
  ->  Index Scan using river_job_pkey on river_job  (cost=0.43..6.57 rows=1 width=1033) (actual time=0.003..0.003 rows=1 loops=2)
        Index Cond: (id = updated_jobs.id)
Planning Time: 1.707 ms
Execution Time: 4222.523 ms

rose-m commented Feb 27, 2025

Would it maybe be possible to at least get a SKIP LOCKED into the query when it's running against Postgres? Looking at the plan, the big issue seems to be acquiring the row locks on those job rows... but I still don't understand why it would need that much time to lock them, as nothing else should be operating on those rows.
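
In other words, roughly this change to the row-locking clause of the inner SELECT (a sketch against the same simplified shape as above, not the exact query text):

-- Sketch: same simplified locking SELECT as above, but skipping rows that
-- another transaction currently holds locked instead of waiting on them.
SELECT id
FROM river_job
WHERE state IN ('retryable', 'scheduled')
  AND priority >= 0
  AND scheduled_at <= now() + interval '5 seconds'
ORDER BY priority, scheduled_at, id
LIMIT 100
FOR UPDATE SKIP LOCKED;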

bgentry (Contributor) commented Feb 28, 2025

@brandur I wonder if you have any thoughts on what could be happening here. We could certainly add SKIP LOCKED to this query; I think that would make sense (why waste time waiting on locked jobs if you can just schedule them on the next iteration or the next run). But I don't understand why that would be necessary, or how this could possibly be so slow if the jobs aren't contended elsewhere and the database is overprovisioned as stated 🤔

brandur (Contributor) commented Mar 1, 2025

Alright, I'm not an expert on this, but here's a reasonable SO answer that I think is directionally related to what we're dealing with here:

https://stackoverflow.com/questions/67445749/lockrows-plan-node-taking-long-time

In particular:

But it has to acquire a transient exclusive lock on a section of memory for each attempt. If it is fighting with many other processes for those locks, you can get a cascading collapse of performance.

The specifics may not be obvious, but if LockRows is eating that much time, I think it confirms our intuition: this is lock contention, even if we don't yet understand exactly what's contending with it.

The likely candidate is one of the other built-in River queries, although I couldn't find an obvious one reading the code.

@rose-m Can you check in pg_stat_activity whether there are any obvious outlier queries that seem to be sticking around longer than expected?
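
Something along these lines (a rough sketch) should surface anything that's been running, or stuck waiting, for more than a second while the scheduler is active:

-- Sketch: non-idle sessions older than one second, longest-running first.
SELECT pid,
       now() - query_start AS runtime,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '1 second'
ORDER BY runtime DESC;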

rose-m commented Mar 2, 2025

Thanks for looking into that, too, @brandur. I'll have a look tomorrow during peak times 👍

rose-m commented Mar 3, 2025

We're now also seeing that the JobCleaner query gets slow, at ~1s execution time for a batch of 1000 jobs to clean up.

See this plan: https://explain.dalibo.com/plan/59707gde1d48ee14

It seems to do a full index scan on the primary key, which is a little confusing, as I'd have expected it to use the river_job_state_and_finalized_at_index index instead.

Could this sometimes conflict with the other query and explain the lock contention? 🤔 (I'll check pg_stat_activity in the afternoon.)

Aggregate  (cost=31476.97..31476.98 rows=1 width=8) (actual time=1072.702..1072.704 rows=1 loops=1)
  CTE deleted_jobs
    ->  Delete on river_job  (cost=23347.57..31454.47 rows=1000 width=38) (actual time=1072.696..1072.698 rows=0 loops=1)
          ->  Nested Loop  (cost=23347.57..31454.47 rows=1000 width=38) (actual time=1072.694..1072.696 rows=0 loops=1)
                ->  HashAggregate  (cost=23347.14..23357.14 rows=1000 width=40) (actual time=1072.693..1072.694 rows=0 loops=1)
                      Group Key: "ANY_subquery".id
                      Batches: 1  Memory Usage: 73kB
                      ->  Subquery Scan on "ANY_subquery"  (cost=0.43..23344.64 rows=1000 width=40) (actual time=1072.682..1072.683 rows=0 loops=1)
                            ->  Limit  (cost=0.43..23334.64 rows=1000 width=8) (actual time=1072.681..1072.682 rows=0 loops=1)
                                  ->  Index Scan using river_job_pkey on river_job river_job_1  (cost=0.43..395375.33 rows=16944 width=8) (actual time=1072.680..1072.681 rows=0 loops=1)
                                        Filter: ((finalized_at < (now() - '30 days'::interval)) AND ((state = 'cancelled'::river_job_state) OR (state = 'completed'::river_job_state) OR (state = 'discarded'::river_job_state)))
                                        Rows Removed by Filter: 1276471
                ->  Index Scan using river_job_pkey on river_job  (cost=0.43..8.11 rows=1 width=14) (never executed)
                      Index Cond: (id = "ANY_subquery".id)
  ->  CTE Scan on deleted_jobs  (cost=0.00..20.00 rows=1000 width=0) (actual time=1072.698..1072.698 rows=0 loops=1)
Planning Time: 0.500 ms
Execution Time: 1072.772 ms
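
To double-check which indexes actually exist on river_job and how big they are (relevant to the river_job_state_and_finalized_at_index question above), something like this should work; just a sketch:

-- Sketch: list the river_job indexes with their definitions and sizes.
SELECT indexname,
       pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size,
       indexdef
FROM pg_indexes
WHERE tablename = 'river_job'
ORDER BY indexname;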

Edited after realizing: we currently have a 30-day retention for jobs. Maybe that's what's hurting the queries, since the table is much larger than it would be with the 24-hour default retention. I'll check whether we can just go down to a couple of days of retention.
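
A quick way to see how many rows are actually eligible for cleanup under the 30-day retention (a sketch):

-- Sketch: count finalized jobs older than the 30-day retention window, by state.
SELECT state, count(*) AS old_finalized_jobs
FROM river_job
WHERE state IN ('cancelled', 'completed', 'discarded')
  AND finalized_at < now() - interval '30 days'
GROUP BY state
ORDER BY state;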

These are our current numbers by state:

Image

rose-m commented Mar 3, 2025

I did a bit of investigating to see if I could find anything; I'm by no means a DB expert, so what I looked at might be irrelevant.

Lock contention

I used this page to run some queries: https://wiki.postgresql.org/wiki/Lock_Monitoring

pg_locks: No results for that query.
pg_stat_activity: Also did not find any results (I ran it a couple of times over a 5s interval).
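
For reference, the standard blocking-locks check from that wiki page is roughly the following (a simplified sketch; the wiki version matches a few more lock fields):

-- Sketch: sessions waiting on a lock, paired with the session holding it.
SELECT blocked.pid         AS blocked_pid,
       blocked_act.query   AS blocked_query,
       blocking.pid        AS blocking_pid,
       blocking_act.query  AS blocking_query
FROM pg_locks blocked
JOIN pg_stat_activity blocked_act  ON blocked_act.pid = blocked.pid
JOIN pg_locks blocking
  ON  blocking.locktype = blocked.locktype
  AND blocking.database      IS NOT DISTINCT FROM blocked.database
  AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
  AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
  AND blocking.pid <> blocked.pid
JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
WHERE NOT blocked.granted;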

Table statistics

SELECT relname, n_tup_ins, n_tup_upd, n_tup_del, n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'river_job';
Image
SELECT pg_size_pretty(pg_total_relation_size('river_job')) as total_size, pg_size_pretty(pg_relation_size('river_job')) as table_size, pg_size_pretty(pg_indexes_size('river_job')) as index_size;
Image

Experiments

Using SKIP LOCKED

Manually running the same query but with FOR UPDATE SKIP LOCKED produces this plan and just takes ~25ms:

Nested Loop  (cost=98244.26..127501.23 rows=3960 width=1048) (actual time=24.210..24.317 rows=2 loops=1)
  CTE jobs_to_schedule
    ->  Limit  (cost=97254.67..97255.92 rows=100 width=31) (actual time=23.818..23.854 rows=2 loops=1)
          ->  LockRows  (cost=97254.67..97790.41 rows=42859 width=31) (actual time=23.817..23.852 rows=2 loops=1)
                ->  Sort  (cost=97254.67..97361.82 rows=42859 width=31) (actual time=23.680..23.682 rows=10 loops=1)
                      Sort Key: river_job_1.priority, river_job_1.scheduled_at, river_job_1.id
                      Sort Method: quicksort  Memory: 25kB
                      ->  Bitmap Heap Scan on river_job river_job_1  (cost=2540.05..95616.63 rows=42859 width=31) (actual time=21.255..23.665 rows=10 loops=1)
                            Recheck Cond: ((state = ANY ('{retryable,scheduled}'::river_job_state[])) AND (priority >= 0) AND (scheduled_at <= (now() + '00:00:05'::interval)))
                            Heap Blocks: exact=2036
                            ->  Bitmap Index Scan on river_job_prioritized_fetching_index  (cost=0.00..2529.34 rows=42859 width=0) (actual time=20.908..20.908 rows=2338 loops=1)
                                  Index Cond: ((state = ANY ('{retryable,scheduled}'::river_job_state[])) AND (priority >= 0) AND (scheduled_at <= (now() + '00:00:05'::interval)))
  CTE jobs_with_rownum
    ->  WindowAgg  (cost=5.32..7.82 rows=100 width=67) (actual time=23.876..23.881 rows=2 loops=1)
          ->  Sort  (cost=5.32..5.57 rows=100 width=59) (actual time=23.868..23.869 rows=2 loops=1)
                Sort Key: jobs_to_schedule.unique_key, jobs_to_schedule.priority, jobs_to_schedule.scheduled_at, jobs_to_schedule.id
                Sort Method: quicksort  Memory: 25kB
                ->  CTE Scan on jobs_to_schedule  (cost=0.00..2.00 rows=100 width=59) (actual time=23.822..23.857 rows=2 loops=1)
  CTE updated_jobs
    ->  Update on river_job river_job_2  (cost=0.71..980.09 rows=3960 width=201) (actual time=24.199..24.296 rows=2 loops=1)
          ->  Merge Left Join  (cost=0.71..980.09 rows=3960 width=201) (actual time=23.912..23.929 rows=2 loops=1)
                Merge Cond: (job.unique_key = river_job_3.unique_key)
                ->  Nested Loop  (cost=0.43..846.50 rows=100 width=239) (actual time=23.906..23.921 rows=2 loops=1)
                      ->  CTE Scan on jobs_with_rownum job  (cost=0.00..2.00 rows=100 width=120) (actual time=23.883..23.891 rows=2 loops=1)
                      ->  Index Scan using river_job_pkey on river_job river_job_2  (cost=0.43..8.45 rows=1 width=135) (actual time=0.012..0.012 rows=1 loops=2)
                            Index Cond: (id = job.id)
                ->  Materialize  (cost=0.28..14.98 rows=178 width=71) (never executed)
                      ->  Merge Join  (cost=0.28..14.54 rows=178 width=71) (never executed)
                            Merge Cond: (river_job_3.unique_key = jobs_with_rownum.unique_key)
                            Join Filter: (river_job_3.id <> jobs_with_rownum.id)
                            ->  Index Scan using river_job_unique_idx on river_job river_job_3  (cost=0.28..9.40 rows=8 width=15) (never executed)
                                  Index Cond: (unique_key IS NOT NULL)
                                  Filter: ((unique_states IS NOT NULL) AND (CASE state WHEN 'available'::river_job_state THEN get_bit(unique_states, 7) WHEN 'cancelled'::river_job_state THEN get_bit(unique_states, 6) WHEN 'completed'::river_job_state THEN get_bit(unique_states, 5) WHEN 'discarded'::river_job_state THEN get_bit(unique_states, 4) WHEN 'pending'::river_job_state THEN get_bit(unique_states, 3) WHEN 'retryable'::river_job_state THEN get_bit(unique_states, 2) WHEN 'running'::river_job_state THEN get_bit(unique_states, 1) WHEN 'scheduled'::river_job_state THEN get_bit(unique_states, 0) ELSE 0 END = 1))
                            ->  Materialize  (cost=0.00..2.25 rows=100 width=104) (never executed)
                                  ->  CTE Scan on jobs_with_rownum  (cost=0.00..2.00 rows=100 width=104) (never executed)
  ->  CTE Scan on updated_jobs  (cost=0.00..79.20 rows=3960 width=9) (actual time=24.201..24.297 rows=2 loops=1)
  ->  Index Scan using river_job_pkey on river_job  (cost=0.43..7.37 rows=1 width=1047) (actual time=0.004..0.004 rows=1 loops=2)
        Index Cond: (id = updated_jobs.id)
Planning Time: 1.268 ms
Execution Time: 24.455 ms

Different Index

I manually added the following index, tailored exactly to the query's predicate and sort order:

CREATE INDEX river_job_scheduler_index ON river_job (state, priority, scheduled_at, id) WHERE state IN ('retryable', 'scheduled') AND priority >= 0;

The problem remains the same: acquiring the locks still takes a very long time.

TL;DR

The only thing that really seems to "fix" the issue is using SKIP LOCKED. I'm not sure whether it would have other side effects, but I'm not sure what else to try. Since we plan to 3-4x the number of scheduled jobs over the coming weeks, this is a big concern for us right now.

rose-m commented Mar 3, 2025

We have now reduced the retention periods to 7 days for cancelled/discarded and 4 days for completed jobs, and manually cleaned up all obsolete jobs. Performance of the scheduler and deletion queries is now back down to where we'd expect it.

brandur (Contributor) commented Mar 5, 2025

@rose-m Thanks for the detailed write-up. I'll have to check with Blake, but I think the downside of SKIP LOCKED is that your jobs might not end up getting scheduled as expected. If the lock contention is transitory, then maybe they'd still end up getting scheduled on a subsequent pass, but if not, it could result in starved-out scheduled jobs that never get a chance to run.

Regarding uniqueness and cleaning up completed/cancelled/discarded jobs: would many of the jobs being scheduled match existing completed/cancelled/discarded jobs (i.e. same kind, args, etc., whatever uniqueness criteria you're using)?

I ask because the default set of unique states includes completed, and I wonder if the problem is basically that the scheduler has to go through and lock a bunch of already-completed jobs to make a scheduling decision. I still don't understand where the lock contention would be coming from, though. Never mind: there's a state IN ('retryable', 'scheduled') predicate in the query, so it shouldn't be looking at completed jobs.

pg_stat_activity: Also did not find any results (I ran it a couple of times over a 5s interval).

Ideally you would've at least seen the JobSchedule query in there (it runs every five seconds). If this happens again, it might not be a bad idea to poll that view with something like SELECT * FROM pg_stat_activity; \watch 2 for a few minutes until you catch it running, then see whether it's conflicting with anything else at the same time.

brandur (Contributor) commented Mar 5, 2025

@rose-m Actually, one extra thing to check on that note: are you using the default set of unique states, i.e. are you modifying ByState at all?

brandur (Contributor) commented Mar 5, 2025

@rose-m And one more thing to make sure we have all our bases covered: when you're scheduling jobs, are they generally scheduled for the far future, or close to the present?

If the latter, it's possible that the AND scheduled_at <= coalesce(sqlc.narg('now')::timestamptz, now()) predicate in JobGetAvailable (the query producers use to lock jobs) is somehow coming into play. It shouldn't interfere, because it only locks available jobs, but it's not beyond the realm of possibility that there's some bad interaction there we don't quite understand.
