Bad JobSchedule query performance with lots of scheduled jobs #785
That’s odd, it should be utilizing the prioritized fetch index for an efficient index scan. Any chance you could provide the output of that query run with an EXPLAIN ANALYZE?
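Something along these lines would do it; the query body below is only a placeholder for whatever the actual slow scheduler statement is, and the JSON format is what plan visualizers such as explain.dalibo.com accept:

```sql
-- Placeholder query: substitute the actual scheduler statement reported as slow.
-- ANALYZE executes the query, BUFFERS adds I/O counters, FORMAT JSON produces
-- output that plan visualizers like explain.dalibo.com can ingest.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT id
FROM river_job
WHERE state = 'scheduled'
  AND scheduled_at <= now()
LIMIT 100;
```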
Sure, I hope this link works for you. I've had to manually put in the timestamps in place of the query's parameters.
I'm a lot more used to looking at raw EXPLAIN ANALYZE output.
I can't really make sense of that, unless your DB instance is hyper-constrained or you're somehow locking those rows elsewhere and blocking them from being scheduled. Both of those operations should be very fast.
Yeah, we're not doing anything with this table on our own. We're also not scheduling constantly but essentially just once upfront. The DB is mostly idling, and this is the only query producing any significant load or CPU usage on it.
Can you also post the raw (non-JSON) EXPLAIN ANALYZE output, just so I'm sure I'm reading it right?
Sure, here it is:
(raw EXPLAIN ANALYZE output not reproduced here)
Would it maybe somehow be possible to at least get a […]?
@brandur I wonder if you have any thoughts on what could be happening here. We could certainly add […].
Alright, I'm not an expert on this, but here's a reasonable SO answer that I think is directionally related to what we're dealing with here: https://stackoverflow.com/questions/67445749/lockrows-plan-node-taking-long-time

In particular, the relevant point is that time spent in a LockRows node generally means the query is waiting on row locks held by other transactions. It may not be obvious as to the specifics, but if we see the scheduler's time going into LockRows, something else is likely holding locks on those rows and blocking them from being scheduled. Likely candidates are one of the other built-in River queries, although I couldn't find an obvious candidate reading the code. @rose-m Can you try and see in […] what else is running and holding locks during peak times?
Thanks for looking into that, too, @brandur. I'll have a look tomorrow during peak times 👍
We're also now seeing that the JobCleaner query gets slow, at ~1s execution time for a batch of 1000 jobs to clean up. See this plan: https://explain.dalibo.com/plan/59707gde1d48ee14 It seems to use a full index scan on the primary key, which is a little confusing as I'd have expected it to use the […] index instead. Is this maybe conflicting with the other query sometimes and causing the lock contention? 🤔 (will check in the afternoon on the […])
Edited this after realizing: we currently have a 30-day retention of the jobs. Maybe this is what's messing up the queries, as the table is way larger than it would be with the 24-hour default retention. I'll check and see if we can just go down to a couple of days of retention. These are our current numbers by state (screenshot not reproduced here):
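For reference, a per-state count along these lines reproduces that breakdown (`river_job` is River's standard job table):

```sql
-- Per-state job counts in River's job table.
SELECT state, count(*) AS jobs
FROM river_job
GROUP BY state
ORDER BY jobs DESC;
```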
I did a bit of investigation to see if I can find anything; I'm by no means a DB expert, so what I was looking at might also be irrelevant.

**Lock contention**

I was looking at https://wiki.postgresql.org/wiki/Lock_Monitoring to run some queries. pg_locks: no results for the blocking-locks query (sketched below).
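The wiki's pg_locks join is fairly long; a compact stand-in using pg_blocking_pids() (an assumption on my part, requires Postgres 9.6+) looks roughly like this and shows the same thing, i.e. which sessions are waiting on a lock and who blocks them:

```sql
-- Compact stand-in for the wiki's blocking-locks query: list sessions that are
-- currently waiting on a lock together with the sessions blocking them.
-- Requires Postgres 9.6+ for pg_blocking_pids().
SELECT
    waiting.pid    AS waiting_pid,
    waiting.query  AS waiting_query,
    blocking.pid   AS blocking_pid,
    blocking.query AS blocking_query
FROM pg_stat_activity AS waiting
JOIN pg_stat_activity AS blocking
    ON blocking.pid = ANY (pg_blocking_pids(waiting.pid))
WHERE waiting.wait_event_type = 'Lock';
```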
**Table statistics**

(screenshots of the table statistics not reproduced here)
**Experiments**

**Using SKIP LOCKED**

Manually running the same query, but with SKIP LOCKED added to the row-locking clause, returns quickly instead of spending its time waiting on locks (sketch below).
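I don't have River's exact statement in front of me, so this is only a sketch of the shape of the experiment (column and state names follow the river_job schema; the rest is guessed). The only change is the SKIP LOCKED on the locking clause:

```sql
-- Sketch only, not River's actual scheduler query: lock the candidate rows,
-- but skip any already locked by another transaction instead of waiting.
WITH to_schedule AS (
    SELECT id
    FROM river_job
    WHERE state IN ('retryable', 'scheduled')
      AND scheduled_at <= now()
    ORDER BY scheduled_at, id
    LIMIT 10000
    FOR UPDATE SKIP LOCKED   -- plain FOR UPDATE waits here; SKIP LOCKED does not
)
UPDATE river_job
SET state = 'available'
FROM to_schedule
WHERE river_job.id = to_schedule.id;
```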
**Different index**

I've manually added an index fully tailored to the query (a guess at its shape is sketched below).
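The exact DDL didn't make it into the thread; as an assumption about what "fully tailored" means here, it would be a partial index matching the query's filter and sort order, roughly:

```sql
-- Hypothetical reconstruction of the tailored index (the real DDL wasn't
-- captured): a partial index restricted to schedulable states, ordered the
-- way the scheduler reads rows. The index name is made up.
CREATE INDEX CONCURRENTLY river_job_scheduled_tailored_idx
    ON river_job (scheduled_at, id)
    WHERE state IN ('retryable', 'scheduled');
```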
The problem remains the same: the lock still takes a very long time.

**TL;DR**

The only thing that really seems to "fix" the issue is to use SKIP LOCKED.
We have now reduced the retention periods to 7 days for cancelled/discarded jobs and 4 days for completed jobs, and manually cleaned up all obsolete jobs (roughly along the lines of the sketch below). Performance of the scheduler and deletion queries is now back down to where we would expect it.
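For the manual cleanup, the shape of the statement was something like this (a sketch with our retention values; batching keeps a single delete on the large table from running too long):

```sql
-- Sketch of the one-off cleanup: delete finished jobs past the retention
-- horizon in bounded batches (re-run until zero rows are deleted).
WITH doomed AS (
    SELECT id
    FROM river_job
    WHERE (state = 'completed' AND finalized_at < now() - interval '4 days')
       OR (state IN ('cancelled', 'discarded') AND finalized_at < now() - interval '7 days')
    LIMIT 5000
)
DELETE FROM river_job
USING doomed
WHERE river_job.id = doomed.id;
```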
@rose-m Thanks for the detailed write-up. I'll have to check with Blake, but I think the downside of […].

Regarding uniqueness and cleaning up completed/cancelled/discarded jobs: would a lot of the jobs that were being scheduled match jobs in completed/cancelled/discarded states (i.e. same kind, args, etc., whatever uniqueness criteria you're using)?

Ideally you would've at least seen the […].
@rose-m Actually, one extra thing to check on that note: are you using the default set of unique states (i.e. […])? Are you modifying […]?
@rose-m And actually one more thing to make sure we have all our bases covered: when you're scheduling jobs, are they scheduled for the far future in general, or are they scheduled close to the present? If the latter, it's possible that somehow […].
Hey there, thanks for the project in general! Really nice and lightweight to use 👍
We're currently using it in production and running ~45k jobs per day, so not too big. However, all of our jobs are basically scheduled: we get an external action telling us what to execute, and the work gets distributed over the day.
Looking into DB monitoring, we actually see that the JobSchedule query has very bad performance and causes significant load on our DB. Given that the scheduler runs every 5s, an average latency of 2.5s per call is concerning us. Is that something you could look into?

Here's a screenshot from AWS RDS Performance Insights (not reproduced here):
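If raw numbers are more useful than the screenshot, a pg_stat_statements query along these lines (assuming the extension is enabled; column names are per Postgres 13+) shows the same picture:

```sql
-- Top statements touching river_job by mean execution time.
-- Requires the pg_stat_statements extension; columns per Postgres 13+.
SELECT calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 0) AS total_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
WHERE query ILIKE '%river_job%'
ORDER BY mean_exec_time DESC
LIMIT 10;
```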
Let me know if I can provide any additional details.