Improvements to the Faiss codec

### Description

Lucene added a Faiss-based vector search format in #14178. This issue tracks possible enhancements:

1. Allow Faiss to use SIMD instructions (more context [here](https://github.com/apache/lucene/blob/602bfbd9af0ee9027de45c1572527eee6b073841/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissLibrary.java#L36-L40))
2. Byte vector support (more context [here](https://github.com/apache/lucene/blob/602bfbd9af0ee9027de45c1572527eee6b073841/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java#L103-L104))
3. Implement batched indexing (more context [here](https://github.com/apache/lucene/blob/602bfbd9af0ee9027de45c1572527eee6b073841/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissLibraryNativeImpl.java#L199-L201))
4. Mmap-ed IO with disk-based indexes
    - Today, the Faiss codec does not make any assumptions about the type of the Lucene index, and consequently uses [abstract methods](https://github.com/apache/lucene/blob/602bfbd9af0ee9027de45c1572527eee6b073841/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissLibraryNativeImpl.java#L120-L132) to read / write bytes from Faiss indexes
    - Faiss recently added [mmap support](https://github.com/facebookresearch/faiss/blame/4fab13c9c67b5402343ca722c83ff7a65a9a48ba/faiss/impl/index_read.cpp#L66-L73) for some indexes, making it possible to read an index without loading it entirely into RAM -- we should make use of this wherever possible / desired
5. GPU support
    - Faiss has support for GPU-based indexes using [CUDA or ROCm](https://github.com/facebookresearch/faiss/blob/4fab13c9c67b5402343ca722c83ff7a65a9a48ba/README.md#installing) -- but we do not use it from Lucene today. This may be a shorter path to #14243
    - One reason we do not use it yet is that GPU-based searches are most beneficial when queries run in parallel, whereas the Faiss codec today runs queries one at a time -- i.e. we may need a "batching" layer in the codec
6. Expose a safe schema for Faiss indexes
    - Today, the Faiss codec can be [configured](https://github.com/apache/lucene/blob/602bfbd9af0ee9027de45c1572527eee6b073841/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsFormat.java#L84-L91) using an [index factory](https://github.com/facebookresearch/faiss/wiki/The-index-factory) string
    - While this gives the user flexibility, it is also error-prone (syntax errors, incompatible feature combinations, etc). It may be more prudent to expose a safe schema from Lucene and build Faiss indexes programmatically instead of relying on the index factory
    - This has a subtler benefit too: Lucene would then "know" the type of Faiss index in use. For example, it could skip the [separate copy of raw vectors](https://github.com/apache/lucene/blob/602bfbd9af0ee9027de45c1572527eee6b073841/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsFormat.java#L71) when the Faiss index is "lossless", or support query-level configuration (setting parameters like `efSearch`, `nprobe`, etc per-query)
    - One caveat is that mapping Faiss' many features into Lucene / Java becomes an ongoing effort
7. Support for parent-join queries or timeouts
8. Re-using information from previous indexes at merge-time
    - Today, we simply [create a new Faiss index from scratch](https://github.com/apache/lucene/blob/602bfbd9af0ee9027de45c1572527eee6b073841/lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/faiss/FaissKnnVectorsWriter.java#L106-L109) during merge -- but it may be possible to re-use previous indexes to speed up merges
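
To make item 6 more concrete, here is a minimal sketch of what a "safe schema" could look like. Everything in it is hypothetical: `FaissIndexSchema`, its record types, and the `isLossless` / `validateFor` methods are illustrative names, not existing Lucene or Faiss APIs. The sketch still renders an index factory string, so it could sit in front of the current configuration path, and it only covers a few well-known index types (`Flat`, `HNSW<M>`, `IVF<nlist>,PQ<m>`).

```java
/** Hypothetical type-safe description of a Faiss index; not part of Lucene or Faiss. */
sealed interface FaissIndexSchema {

  /** Renders this schema as a Faiss index factory string. */
  String toFactoryString();

  /** Whether the index stores full-precision vectors (no quantization). */
  boolean isLossless();

  /** Rejects invalid schema / dimension combinations before any native call. */
  default void validateFor(int dimension) {
    FaissIndexSchema.requirePositive(dimension, "dimension");
  }

  /** Shared argument check used by the record constructors below. */
  static void requirePositive(int value, String name) {
    if (value <= 0) {
      throw new IllegalArgumentException(name + " must be positive, got " + value);
    }
  }

  /** Brute-force index over raw vectors. */
  record Flat() implements FaissIndexSchema {
    public String toFactoryString() { return "Flat"; }
    public boolean isLossless() { return true; }
  }

  /** HNSW graph over raw vectors; maxConn is the "M" in "HNSW<M>". */
  record Hnsw(int maxConn) implements FaissIndexSchema {
    public Hnsw {
      FaissIndexSchema.requirePositive(maxConn, "maxConn");
    }
    public String toFactoryString() { return "HNSW" + maxConn; }
    public boolean isLossless() { return true; }
  }

  /** Inverted file with product quantization, which is lossy. */
  record IvfPq(int nlist, int subQuantizers) implements FaissIndexSchema {
    public IvfPq {
      FaissIndexSchema.requirePositive(nlist, "nlist");
      FaissIndexSchema.requirePositive(subQuantizers, "subQuantizers");
    }
    public String toFactoryString() { return "IVF" + nlist + ",PQ" + subQuantizers; }
    public boolean isLossless() { return false; }

    @Override
    public void validateFor(int dimension) {
      FaissIndexSchema.super.validateFor(dimension);
      // Product quantization splits each vector into equally sized sub-vectors.
      if (dimension % subQuantizers != 0) {
        throw new IllegalArgumentException(
            "dimension " + dimension + " is not divisible by subQuantizers " + subQuantizers);
      }
    }
  }
}
```

With this shape, `new FaissIndexSchema.IvfPq(256, 16).toFactoryString()` yields `IVF256,PQ16`, and calling `validateFor(100)` on that schema fails fast in Java instead of surfacing as a native error. The `isLossless()` flag is the kind of information the codec could consult when deciding whether a separate copy of raw vectors is needed.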

Please feel free to add enhancements I may have missed, or link issues / PRs

Improvements to the Faiss codec #15287
