Fix memory leak bulk indexer #701

Open: wants to merge 2 commits into main

@mhmtszr mhmtszr commented Jul 19, 2023

The bulk indexer makes a lot of heap allocations, which affects our applications' performance. I tried to reduce the allocations by using "sync.Pool".

The bulk indexers that we regularly open and close are what cause these allocations.
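
For readers unfamiliar with the technique, here is a minimal sketch of buffer reuse with sync.Pool. It is not the diff from this PR: the `bufPool` variable and the `encodeItem` helper are hypothetical names used only for illustration, and the real indexer's buffers may be managed differently.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values so that short-lived
// callers do not have to allocate a fresh buffer every time.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// encodeItem is a hypothetical helper (not part of esutil): it borrows a
// buffer, writes one payload into it, copies the result out, and then
// returns the buffer to the pool.
func encodeItem(payload string) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // drop the contents but keep the capacity for reuse
		bufPool.Put(buf)
	}()

	buf.WriteString(payload)
	buf.WriteByte('\n')

	// Copy the bytes, because buf's backing array is reused after Put.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}

func main() {
	fmt.Printf("%q\n", encodeItem(`{"title":"a"}`))
	fmt.Printf("%q\n", encodeItem(`{"title":"b"}`))
}
```

The pool only helps when buffers are reset and returned after use; anything that keeps a reference to the buffer's bytes after Put would see them overwritten, hence the copy.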

@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@cla-checker-service

cla-checker-service bot commented Jul 19, 2023

💚 CLA has been signed

@T-J-L

T-J-L commented Oct 12, 2023

What's the reason for opening and closing indexers? An indexer can last the lifetime of the application; when it is used that way with a large buffer, there are zero allocations here.

@mhmtszr
Author

mhmtszr commented Dec 12, 2023

@T-J-L hello, is there any way to use the same indexer for different operations? We needed to close it after adding to the batch for each operation.

@T-J-L

T-J-L commented Dec 14, 2023

I create a single indexer at startup, with a low flush interval (100ms). Then, for each application request, I create a couple of channels for success/errors, call BulkIndexer.Add, and write back to the channels in the BulkIndexerItem.OnSuccess and BulkIndexerItem.OnFailure callbacks. So effectively each request is synchronous, while all requests to ES are batched.

You can set the Index and Action per item, so this works for all types of operation.

@mhmtszr
Author

mhmtszr commented Dec 15, 2023

Great solution, but how can you be sure your documents will be written to Elasticsearch? We need to be sure that our documents are written to Elasticsearch, which is why we close the bulk indexer.

@JAndritsch

JAndritsch commented Feb 15, 2025

It would seem odd to me that the intended use of a BulkIndexer would be a singleton instance living for the lifetime of an application. Especially since the examples in this repo show calling Close() on the indexer after your operations have been added.

As @mhmtszr suggested, Close() will force a flush and wait for the documents to index. That should also theoretically free up any resources used by the BulkIndexer, but it seems that's not the case.

I will try to test these changes and see if they resolve my issue. I'm not a maintainer of the repo, but hopefully that will help with the acceptance of this PR.

@JAndritsch

I can confirm that this change resolves the memory issues I've been seeing: #956.

@JAndritsch JAndritsch left a comment

I tested these changes in my own custom fork of the repo, since the original clone here is a bit old.

Prior to these changes, memory usage would spike when calling Close() on multiple instances of BulkIndexer. My application would go from ~8 MB to ~300 MB of usage after just a few invocations.

After applying these changes to my fork, memory usage stayed much closer to what I expected: within ~1-2 MB of the original boot-up usage.

Resolves #956

Edit: Although I'm seeing greatly improved memory usage after this change, I'm starting to wonder if this fix masks an underlying problem with how resource cleanup is handled in BulkIndexer. The flushBuffer method also allocates new buffers, much as NewBulkIndexer did prior to this PR's change.

I think there's probably more to investigate here.

@T-J-L

T-J-L commented Feb 15, 2025

IMO a long-running process is the only reason for using the bulk indexer. If you are closing the indexer immediately to force a flush, it seems like using a normal bulk request would be a better choice.
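
To illustrate what a one-off bulk request involves, here is a minimal sketch that builds the NDJSON body the Elasticsearch Bulk API expects (an action line followed by a source line for each document). The `doc` type and `buildBulkBody` helper are hypothetical; sending the body through the client's Bulk call and checking the per-item results in the response are left out.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// doc is a hypothetical document type used only for this example.
type doc struct {
	ID    string `json:"-"` // sent in the action line, not in the source
	Title string `json:"title"`
}

// buildBulkBody renders docs as an NDJSON Bulk API body: one action line
// followed by one source line per document, each newline-terminated.
func buildBulkBody(index string, docs []doc) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends '\n' after each value
	for _, d := range docs {
		action := map[string]any{
			"index": map[string]string{"_index": index, "_id": d.ID},
		}
		if err := enc.Encode(action); err != nil {
			return nil, err
		}
		if err := enc.Encode(d); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

func main() {
	body, err := buildBulkBody("articles", []doc{
		{ID: "1", Title: "a"},
		{ID: "2", Title: "b"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```

Because the whole batch is built and sent in one call, the caller knows the outcome as soon as the response returns, without opening and closing an indexer.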

arp242 pushed a commit to arp242/go-elasticsearch that referenced this pull request Mar 5, 2025
4 participants