[META] Streams using Apache Arrow and Flight  #16679

@rishabhmaurya

Description

Please describe the end goal of this project

  • In-memory columnar representation of any intermediate results from search
    • Data adjacency for sequential access (scans)
    • O(1) (constant-time) random access
    • SIMD and vectorization-friendly
    • Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory.
  • Interoperable representation of columnar data that can be shared across engines, for example between OpenSearch and DataFusion, a Rust-based engine.
  • RPC using bidirectional streams: gRPC bidirectional streams handle backpressure from the client in real time and produce batches of records on demand. They are used both for internode communication (between data nodes and the coordinator) and for communication with the end client.
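
To make the columnar properties listed above concrete, here is a minimal sketch using only the Python standard library. It is not the Apache Arrow API: the `array` buffers stand in for Arrow's fixed-width column buffers, and the field names (`doc_id`, `score`), the sample rows, and the helpers `value_at` and `zero_copy_slice` are hypothetical, chosen only for illustration.

```python
from array import array

# Hypothetical search hits in row form (illustrative data, not from this issue).
rows = [
    {"doc_id": 1, "score": 0.9},
    {"doc_id": 2, "score": 0.7},
    {"doc_id": 3, "score": 0.4},
    {"doc_id": 4, "score": 0.1},
]

# Columnar form: one contiguous, fixed-width buffer per field,
# so a scan over a single field touches adjacent memory.
doc_ids = array("q", (r["doc_id"] for r in rows))  # signed 64-bit integers
scores = array("d", (r["score"] for r in rows))    # 64-bit floats


def value_at(column, i):
    """O(1) random access: element i sits at byte offset i * column.itemsize."""
    return column[i]


def zero_copy_slice(column, start, stop):
    """Return a view over the column's buffer; no elements are copied."""
    return memoryview(column)[start:stop]


# Sequential scan over one column only.
total_score = sum(scores)

# The view shares memory with the column, so a later write is visible through it.
window = zero_copy_slice(scores, 1, 3)
scores[1] = 0.75
```

After the last line, `window[0]` reads `0.75`, because the slice is a view over the same buffer rather than a copy. Arrow extends the same idea with validity bitmaps, variable-width types, and a layout that is relocatable across processes and languages, which this sketch does not model.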

Use cases

  • Reduce memory overhead and CPU utilization, and improve performance, for:
    • Search pagination API
    • Aggregation (more details to follow)

Apache Arrow will serve as the library for the in-memory columnar representation of any transient results used for retrieval in these use cases. Arrow Flight will be used for streaming RPC.
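
The on-demand batching and backpressure behaviour described above can be modelled with a plain Python generator. This is a sketch under stated assumptions, not the Arrow Flight or gRPC API: `batch_producer` and its parameters are hypothetical names, the "hits" are plain integers, and the bidirectional channel is approximated by the generator's `send()` call, which lets the consumer state how many records it wants next.

```python
def batch_producer(hits, default_batch_size=2):
    """Lazily yield batches of hits; the consumer may send() the next batch size.

    Nothing is produced until the consumer asks for it, which is the essence
    of backpressure: a slow consumer simply stops pulling.
    """
    pos = 0
    size = default_batch_size
    while pos < len(hits):
        batch = hits[pos:pos + size]
        pos += len(batch)
        # Suspend here until the consumer requests more data.
        requested = yield batch
        if requested is not None:
            # Guard against a zero or negative request stalling the stream.
            size = max(1, requested)


# Consumer side: pull one batch, then ask for a larger one, then keep that size.
stream = batch_producer(list(range(7)))
first = next(stream)      # uses the default batch size
second = stream.send(3)   # consumer asks for 3 records in the next batch
third = next(stream)      # batch size stays at 3; only 2 records remain
```

In a real deployment the demand signal would travel over the gRPC stream between the client and the node, and each batch would be an Arrow record batch rather than a Python list.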

Supporting References

JOINs RFC that makes use of this integration: #15185

Issues

Basic Framework Changes:

Client support

Migrate Search API

Aggregations

Operational excellence

  • Stream cancellation, error handling and renewal.
  • Metrics & Troubleshooting

Documentation

Related component

Search

Metadata

Assignees

Labels

  • Meta: Meta issue, not directly linked to a PR
  • Roadmap:Search: Project-wide roadmap label
  • v3.0.0: Issues and PRs related to version 3.0.0
  • v3.1.0
  • v3.2.0
  • v3.3.0

Projects

Status

In Progress
