[Feature Request] Multi-threaded writes in pull-based ingestion #17875

Closed
varunbharadwaj opened this issue Apr 10, 2025 · 0 comments · Fixed by #17912
Labels: enhancement, Indexing, untriaged

varunbharadwaj commented Apr 10, 2025

Is your feature request related to a problem? Please describe

The current pull-based ingestion design decouples the poller from the writer for better performance, but it uses a single writer thread. Performance can be improved further by supporting multi-threaded writes.

Describe the solution you'd like

Multi-threaded writes will be supported as follows:

  1. The internal blocking queue will be divided into partitions, one per writer thread.
  2. The poller will use the ID field to map each incoming message to one of these blocking queue partitions.
  3. Each writer thread will consume from its own in-memory queue partition and write to the index.
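
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual OpenSearch implementation: the class name `PartitionedQueueSketch`, the `Message` type, and the use of `String.hashCode()` to pick a partition are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: one bounded blocking queue per writer thread. */
public class PartitionedQueueSketch {

    /** Hypothetical message type: a document ID plus an opaque payload. */
    public static final class Message {
        public final String id;
        public final String payload;

        public Message(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    private final List<BlockingQueue<Message>> partitions = new ArrayList<>();

    public PartitionedQueueSketch(int writerThreads, int capacityPerQueue) {
        for (int i = 0; i < writerThreads; i++) {
            partitions.add(new ArrayBlockingQueue<>(capacityPerQueue));
        }
    }

    /**
     * Maps a document ID to a partition index. The hash function is an
     * assumption; floorMod keeps the result non-negative for negative hashes.
     */
    public static int partitionFor(String id, int numPartitions) {
        return Math.floorMod(id.hashCode(), numPartitions);
    }

    /**
     * Poller side: blocks while the target partition is full.
     * Returns false if the calling thread was interrupted while waiting.
     */
    public boolean enqueue(Message message) {
        try {
            partitions.get(partitionFor(message.id, partitions.size())).put(message);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Writer side: writer thread i consumes only from partition i. */
    public BlockingQueue<Message> queueFor(int writerIndex) {
        return partitions.get(writerIndex);
    }
}
```

Because every message with a given ID lands in the same queue, and each queue has a single consumer, updates to one document are applied in the order the poller enqueued them.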

This solution guarantees that updates to the same document are processed sequentially, in the order in which the consumer sees them. Versioning will be supported for cases where the underlying streaming source does not provide ordering guarantees.
Note that a message without an ID field is assigned an auto-generated ID at runtime, so it can be mapped to a different partition on each retry. This should not affect the correctness of the data, however, because subsequent updates to the same document must provide the ID field.

Shard recovery will be handled in a multi-threaded write scenario as follows:

  1. Each processor/writer thread will track the shard pointer it is currently processing.
  2. Each commit will include the minimum shard pointer across all writer threads, which serves as the batch start pointer.
  3. On shard recovery, the poller will resume polling from the batch start pointer persisted with the commit.
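
The pointer tracking described above could look something like the sketch below. It assumes, for illustration only, that a shard pointer can be modeled as a monotonically increasing `long` (similar to a stream offset); the class `BatchStartPointerSketch` and its method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical sketch: per-writer pointer tracking for commits. */
public class BatchStartPointerSketch {

    // One slot per writer thread holding the pointer it is currently processing.
    private final AtomicLongArray currentPointers;

    public BatchStartPointerSketch(int writerThreads, long initialPointer) {
        currentPointers = new AtomicLongArray(writerThreads);
        for (int i = 0; i < writerThreads; i++) {
            currentPointers.set(i, initialPointer);
        }
    }

    /** Called by writer thread writerIndex when it starts processing a message. */
    public void markProcessing(int writerIndex, long pointer) {
        currentPointers.set(writerIndex, pointer);
    }

    /**
     * Called at commit time: the minimum pointer across all writers.
     * Everything before this pointer has been handled by every writer,
     * so it is a safe place for the poller to resume from.
     */
    public long batchStartPointer() {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < currentPointers.length(); i++) {
            min = Math.min(min, currentPointers.get(i));
        }
        return min;
    }
}
```

Resuming from the minimum means that writers which were ahead of it may see some messages again after recovery, so this sketch assumes that re-applying a message is safe.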

Related component

Indexing

Describe alternatives you've considered

No response

Additional context

No response

@varunbharadwaj added the enhancement and untriaged labels Apr 10, 2025
@github-actions bot added the Indexing label Apr 10, 2025