docs/fast_merge.md
# Fast Merge

* v2.6.0 adds support for a feature called "fast merge", where we first train the index on a vector dataset and then build the vector index using this trained information of the centroid layout.
* This is an improvement on the existing behavior, where merges were performed in a naive fashion: reconstructing the participating vector indexes, re-training, and then adding the vectors back into the index. Fast merge instead merges the corresponding centroid cells' data vectors block-wise, skipping those expensive operations.
* Under the hood, this feature uses the existing [`merge_from` API](https://github.com/blevesearch/faiss/blob/ffd910a91f1acf49b9898a7e514e462db89ee7b3/faiss/Index.h#L396) in our fork of the faiss codebase.
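The block-wise idea can be illustrated with a simplified, self-contained sketch (plain Go; the types and names here are invented for illustration and are not the actual bleve/faiss implementation): when two segments share the same trained centroid layout, merging reduces to concatenating each centroid cell's contents, with no re-training.

```go
package main

import "fmt"

// segment maps a centroid ID to the vector IDs assigned to that cell.
// In a real IVF index each entry would also carry the (quantized) vector data.
type segment map[int][]int

// fastMerge merges two segments built against the SAME trained centroid
// layout: each centroid cell's contents are simply concatenated, block by
// block, with no reconstruction or re-training.
func fastMerge(a, b segment) segment {
	out := segment{}
	for c, ids := range a {
		out[c] = append(out[c], ids...)
	}
	for c, ids := range b {
		out[c] = append(out[c], ids...)
	}
	return out
}

func main() {
	a := segment{0: {1, 2}, 1: {3}}
	b := segment{0: {4}, 2: {5, 6}}
	merged := fastMerge(a, b)
	fmt.Println(len(merged[0]), len(merged[1]), len(merged[2])) // 3 1 2
}
```

The naive path would instead rebuild the centroid assignment from scratch for every merge; sharing the trained layout is what makes the cell-by-cell concatenation valid.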

## Support

* This feature is supported primarily for the IVF family of indexes. It applies when:
  * the field mapping has the optimization type `ivf,rabitq`, or uses `bivf-sq8`/`bivf-flat` for binary quantization.
  * the above optimizations aren't used but the scale of data exceeds 10K vectors.

## Usage

The feature can be enabled by passing a key-value pair in the config map while creating a new index. If the flag is false, the behavior falls back to the more expensive naive merge.

```go
kvConfig := map[string]interface{}{
	"vector_index_fast_merge": true,
}

index, err := bleve.NewUsing("example.bleve", bleve.NewIndexMapping(), bleve.Config.DefaultIndexType, bleve.Config.DefaultMemKVStore, kvConfig)
if err != nil {
	log.Fatal(err)
}
```

The user should now "train" the index on a random sample of the vector dataset they plan to index and search.

* It's entirely up to the user how much data to use for training, what batch size to use while training, and when to mark the training as complete.
* NOTE: The user must index their data only after marking the training as complete; otherwise the batch won't be indexed.

```go
if err := index.Train(batch); err != nil {
	log.Fatal(err)
}
```
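How the random sample is drawn is left to the user; one minimal, self-contained sketch (plain Go; `sampleForTraining` is an invented helper for illustration, not a bleve API) picks k vectors uniformly at random from the dataset:

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleForTraining returns k vectors drawn uniformly at random from data,
// suitable for feeding into training batches. It permutes indices via
// rand.Perm, so the original slice is left untouched. If k exceeds the
// dataset size, it is clamped to len(data).
func sampleForTraining(data [][]float32, k int) [][]float32 {
	if k > len(data) {
		k = len(data)
	}
	sample := make([][]float32, 0, k)
	for _, i := range rand.Perm(len(data))[:k] {
		sample = append(sample, data[i])
	}
	return sample
}

func main() {
	data := [][]float32{{0, 1}, {1, 0}, {1, 1}, {0, 0}}
	fmt.Println(len(sampleForTraining(data, 2))) // 2
}
```

A larger, more representative sample generally gives a better centroid layout, at the cost of a longer training phase.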

## Disclaimer

* This feature is primarily meant for the use case where the user knows how much data they want to index, and for read-heavy workloads with little to no updates on the index itself.
* The intention of the feature is to be able to quickly index data at massive scale in an inexpensive manner and perform searches on it.
* Without this feature, i.e. when the index build happens without a prior training phase:
  * The user wouldn't have to worry about use cases where the dataset is continuously updated with new "types" of vectors, because each merge cycle would do the training afresh.
  * The user doesn't have a lag in indexing the data either; they can start ingesting data immediately.
* Given the above, for update- and delete-heavy workloads on the dataset it's extremely difficult to detect when data drift will occur, so we fall back to the naive way of reconstructing + re-training.
docs/vectors.md
* Supported dimensionality is between 1 and 2048 (v2.4.0), and up to **4096** (v2.4.1+).
* Supported vector index optimizations:
  * `recall`, `latency` (v2.4.0+)
    * Combination of Flat and IVF indexes with SQ8 quantization.
  * `memory_efficient` (v2.4.1+)
    * Combination of Flat and IVF indexes with SQ4 quantization.
  * `bivf-flat`, `bivf-sq8` (v2.6.0+)
    * Combination of BFlat and BIVF indexes with Binary quantization.
    * Uses an additional index, either a Flat index (`bivf-flat`) or an SQ8 index (`bivf-sq8`), for re-ranking.
  * `ivf,rabitq` (v2.6.0+)
    * Combination of Flat and IVF indexes with RaBitQ quantization.
    * Works with [Fast Merge](https://github.com/blevesearch/bleve/blob/master/docs/fast_merge.md): a centroid index is first built/trained on a sample dataset, and segments that deploy IVF indexes then reuse that trained layout.
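As a hedged sketch of how one of these optimizations might be selected in a vector field mapping (the JSON below is illustrative only: `dims` and `similarity` values are placeholders, and the exact key names should be checked against the mapping reference):

```json
{
  "type": "vector",
  "dims": 128,
  "similarity": "l2_norm",
  "vector_index_optimized_for": "ivf,rabitq"
}
```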

---

* Vectors from documents that do not conform to the index mapping dimensionality are simply discarded at index time.
* The dimensionality of the query vector must match the dimensionality of the indexed vectors to obtain any results.
* Pure kNN searches can be performed, but the `query` attribute within the search request must still be set - to `{"match_none": {}}` in this case. With v2.4.1+, the `query` attribute is optional when `knn` is present.
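A pure kNN search request then pairs `match_none` with a `knn` clause; a hedged sketch (the field name `vec`, the vector values, and `k` are placeholders, not values from this document):

```json
{
  "query": {"match_none": {}},
  "knn": [
    {"field": "vec", "vector": [0.1, 0.2, 0.3], "k": 5}
  ]
}
```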
```
aggregate_score = (query_boost * query_hit_score) + (knn_boost * knn_hit_distance)
```
* an array of objects each containing a vector (nested-vector field)
* For single-kNN queries, each document is scored using its single best-matching vector.
* For multi-kNN queries, the system selects the best-matching vector for each query vector within the document.

---

* GPU-Accelerated vector search (v2.6.0+):
* Requires FAISS built with `-DFAISS_ENABLE_GPU=ON` CMake option (needs NVIDIA CUDA toolkit).
* Requires the `gpu` go tag in addition to the `vectors` tag when building bleve.