From cfeaa66d88c56cb35ffd85427b441189d3a29aee Mon Sep 17 00:00:00 2001
From: Metin Dumandag <29387993+mdumandag@users.noreply.github.com>
Date: Thu, 12 Dec 2024 20:04:50 +0300
Subject: [PATCH 1/4] VEC-223: Documentation for sparse and hybrid indexes
Added them under features, also updated the REST API specification.
---
mint.json | 4 +-
vector/api/endpoints/fetch-random.mdx | 7 +-
vector/api/endpoints/fetch.mdx | 5 +-
vector/api/endpoints/query-data.mdx | 27 +-
vector/api/endpoints/query.mdx | 27 +-
vector/api/endpoints/range.mdx | 7 +-
.../api/endpoints/resumable-query/resume.mdx | 5 +-
.../resumable-query/start-with-data.mdx | 19 +
.../resumable-query/start-with-vector.mdx | 24 +-
vector/api/endpoints/update.mdx | 10 +-
vector/api/endpoints/upsert.mdx | 15 +-
vector/features/embeddingmodels.mdx | 11 +-
vector/features/hybridindexes.mdx | 400 ++++++++++++++++++
vector/features/sparseindexes.mdx | 370 ++++++++++++++++
14 files changed, 915 insertions(+), 16 deletions(-)
create mode 100644 vector/features/hybridindexes.mdx
create mode 100644 vector/features/sparseindexes.mdx
diff --git a/mint.json b/mint.json
index 5c4099b5..de987514 100644
--- a/mint.json
+++ b/mint.json
@@ -766,7 +766,9 @@
"vector/features/filtering",
"vector/features/embeddingmodels",
"vector/features/namespaces",
- "vector/features/resumablequery"
+ "vector/features/resumablequery",
+ "vector/features/sparseindexes",
+ "vector/features/hybridindexes"
]
},
{
diff --git a/vector/api/endpoints/fetch-random.mdx b/vector/api/endpoints/fetch-random.mdx
index bd81604a..dc2af116 100644
--- a/vector/api/endpoints/fetch-random.mdx
+++ b/vector/api/endpoints/fetch-random.mdx
@@ -23,8 +23,11 @@ The response will be `null` if the namespace is empty.
The id of the vector.
-
- The vector value.
+
+ The dense vector value for dense and hybrid indexes.
+
+
+ The sparse vector value for sparse and hybrid indexes.
diff --git a/vector/api/endpoints/fetch.mdx b/vector/api/endpoints/fetch.mdx
index de591198..7c099a86 100644
--- a/vector/api/endpoints/fetch.mdx
+++ b/vector/api/endpoints/fetch.mdx
@@ -49,7 +49,10 @@ their vector ids.
The id of the vector.
- The vector value.
+ The dense vector value for dense and hybrid indexes.
+
+
+ The sparse vector value for sparse and hybrid indexes.
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/query-data.mdx b/vector/api/endpoints/query-data.mdx
index 569d63ad..ed77b1b4 100644
--- a/vector/api/endpoints/query-data.mdx
+++ b/vector/api/endpoints/query-data.mdx
@@ -44,6 +44,23 @@ of fields below.
[Metadata filter](/vector/features/filtering) to apply.
+
+ For sparse vectors of sparse and hybrid indexes, specifies what kind of
+ weighting strategy should be used when matching the non-zero dimension
+ values of the query vector against the documents.
+
+ If not provided, no weighting will be used.
+
+ Only possible value is `IDF` (inverse document frequency).
+
+
+ Fusion algorithm to use while fusing scores
+ from dense and sparse components of a hybrid index.
+
+ If not provided, defaults to `RRF` (Reciprocal Rank Fusion).
+
+ Other possible value is `DBSF` (Distribution-Based Score Fusion).
+
## Path
@@ -61,9 +78,12 @@ If the request was an array of more than one items, an array of
objects below is returned, one for each query item.
- The score is normalized to always be between 0 and 1.
+ For dense indexes, the score is normalized to always be between 0 and 1.
The closer the score is to 1, the more similar the vector is to the query vector.
This does not depend on the distance metric you use.
+
+ For sparse and hybrid indexes, scores can be arbitrary values, but the score
+ will be higher for more similar vectors.
@@ -75,7 +95,10 @@ objects below is returned, one for each query item.
The similarity score of the vector, calculated based on the distance metric of your index.
- The vector value.
+ The dense vector value for dense and hybrid indexes.
+
+
+ The sparse vector value for sparse and hybrid indexes.
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/query.mdx b/vector/api/endpoints/query.mdx
index 7cc8b4dd..208ef8a7 100644
--- a/vector/api/endpoints/query.mdx
+++ b/vector/api/endpoints/query.mdx
@@ -40,6 +40,23 @@ of fields below.
[Metadata filter](/vector/features/filtering) to apply.
+
+ For sparse vectors of sparse and hybrid indexes, specifies what kind of
+ weighting strategy should be used when matching the non-zero dimension
+ values of the query vector against the documents.
+
+ If not provided, no weighting will be used.
+
+ Only possible value is `IDF` (inverse document frequency).
+
+
+ Fusion algorithm to use while fusing scores
+ from dense and sparse components of a hybrid index.
+
+ If not provided, defaults to `RRF` (Reciprocal Rank Fusion).
+
+ Other possible value is `DBSF` (Distribution-Based Score Fusion).
+
## Path
@@ -57,9 +74,12 @@ If the request was an array of more than one items, an array of
objects below is returned, one for each query item.
- The score is normalized to always be between 0 and 1.
+ For dense indexes, the score is normalized to always be between 0 and 1.
The closer the score is to 1, the more similar the vector is to the query vector.
This does not depend on the distance metric you use.
+
+ For sparse and hybrid indexes, scores can be arbitrary values, but the score
+ will be higher for more similar vectors.
@@ -71,7 +91,10 @@ objects below is returned, one for each query item.
The similarity score of the vector, calculated based on the distance metric of your index.
- The vector value.
+ The dense vector value for dense and hybrid indexes.
+
+
+ The sparse vector value for sparse and hybrid indexes.
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/range.mdx b/vector/api/endpoints/range.mdx
index 28fa9089..b2f9f888 100644
--- a/vector/api/endpoints/range.mdx
+++ b/vector/api/endpoints/range.mdx
@@ -52,8 +52,11 @@ authMethod: "GET"
The id of the vector.
-
- The vector value.
+
+ The dense vector value for dense and hybrid indexes.
+
+
+ The sparse vector value for sparse and hybrid indexes.
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/resumable-query/resume.mdx b/vector/api/endpoints/resumable-query/resume.mdx
index f18853ad..007409b3 100644
--- a/vector/api/endpoints/resumable-query/resume.mdx
+++ b/vector/api/endpoints/resumable-query/resume.mdx
@@ -27,7 +27,10 @@ authMethod: "bearer"
metric of your index.
- The vector value.
+ The dense vector value for dense and hybrid indexes.
+
+
+ The sparse vector value for sparse and hybrid indexes.
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/resumable-query/start-with-data.mdx b/vector/api/endpoints/resumable-query/start-with-data.mdx
index d04bc001..707f9a36 100644
--- a/vector/api/endpoints/resumable-query/start-with-data.mdx
+++ b/vector/api/endpoints/resumable-query/start-with-data.mdx
@@ -40,6 +40,25 @@ authMethod: "bearer"
Maximum idle time for the resumable query in seconds.
+
+ For sparse vectors of sparse and hybrid indexes, specifies what kind of
+ weighting strategy should be used when matching the non-zero dimension
+ values of the query vector against the documents.
+
+ If not provided, no weighting will be used.
+
+ Only possible value is `IDF` (inverse document frequency).
+
+
+
+ Fusion algorithm to use while fusing scores
+ from dense and sparse components of a hybrid index.
+
+ If not provided, defaults to `RRF` (Reciprocal Rank Fusion).
+
+ Other possible value is `DBSF` (Distribution-Based Score Fusion).
+
+
## Path
diff --git a/vector/api/endpoints/resumable-query/start-with-vector.mdx b/vector/api/endpoints/resumable-query/start-with-vector.mdx
index b70643b6..04e60f29 100644
--- a/vector/api/endpoints/resumable-query/start-with-vector.mdx
+++ b/vector/api/endpoints/resumable-query/start-with-vector.mdx
@@ -46,6 +46,25 @@ authMethod: "bearer"
Maximum idle time for the resumable query in seconds.
+
+ For sparse vectors of sparse and hybrid indexes, specifies what kind of
+ weighting strategy should be used when matching the non-zero dimension
+ values of the query vector against the documents.
+
+ If not provided, no weighting will be used.
+
+ Only possible value is `IDF` (inverse document frequency).
+
+
+
+ Fusion algorithm to use while fusing scores
+ from dense and sparse components of a hybrid index.
+
+ If not provided, defaults to `RRF` (Reciprocal Rank Fusion).
+
+ Other possible value is `DBSF` (Distribution-Based Score Fusion).
+
+
## Path
@@ -69,7 +88,10 @@ authMethod: "bearer"
metric of your index.
- The vector value.
+ The dense vector value for dense and hybrid indexes.
+
+
+ The sparse vector value for sparse and hybrid indexes.
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/update.mdx b/vector/api/endpoints/update.mdx
index ac363e15..972164c9 100644
--- a/vector/api/endpoints/update.mdx
+++ b/vector/api/endpoints/update.mdx
@@ -19,9 +19,12 @@ of those.
The id of the vector.
- The vector value to update to.
+ The dense vector value to update to for dense and hybrid indexes.
The vector should have the same dimensions as your index.
+
+ The sparse vector value to update to for sparse and hybrid indexes.
+
The raw text data to update to.
If the index is created with an [embedding model](/vector/features/embeddingmodels)
@@ -38,6 +41,11 @@ of those.
`OVERWRITE` for overwrite, `PATCH` for patch.
+
+For hybrid indexes, either none or both of the `vector` and `sparseVector` fields
+must be present. It is not allowed to update only one of them.
+
+
## Path
diff --git a/vector/api/endpoints/upsert.mdx b/vector/api/endpoints/upsert.mdx
index 91e40138..bd833618 100644
--- a/vector/api/endpoints/upsert.mdx
+++ b/vector/api/endpoints/upsert.mdx
@@ -17,10 +17,13 @@ You can either upsert a single vector, or multiple vectors in an array.
The id of the vector.
-
- The vector value.
+
+ The dense vector value for dense and hybrid indexes.
The vector should have the same dimensions as your index.
+
+ The sparse vector value for sparse and hybrid indexes.
+
The metadata of the vector. This makes identifying vectors
on retrieval easier and can be used with filters on queries.
@@ -30,6 +33,14 @@ You can either upsert a single vector, or multiple vectors in an array.
data, which can be anything associated with this vector.
+
+For dense indexes, only `vector` should be provided, and `sparseVector` should not be set.
+
+For sparse indexes, only `sparseVector` should be provided, and `vector` should not be set.
+
+For hybrid indexes, both `vector` and `sparseVector` must be present.
+
+
## Path
diff --git a/vector/features/embeddingmodels.mdx b/vector/features/embeddingmodels.mdx
index 0c372c1c..f2126f89 100644
--- a/vector/features/embeddingmodels.mdx
+++ b/vector/features/embeddingmodels.mdx
@@ -32,7 +32,7 @@ Upstash Vector comes with a variety of embedding models that score well in the
for measuring the performance of embedding models. They support use cases such
as classification, clustering, or retrieval.
-You can choose the following general purpose models:
+You can choose the following general purpose models for dense and hybrid indexes:
| Name | Dimension | Sequence Length | MTEB |
| ------------------------------------------------------------------------------------------------------- | --------- | --------------- | ----- |
@@ -56,6 +56,15 @@ You can choose the following general purpose models:
MTEB score for the `BAAI/bge-m3` is not fully measured.
+For sparse and hybrid indexes, one of the following models can be selected:
+
+| Name |
+| ------------------------------------------------- |
+| [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) |
+| [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) |
+
+See [Creating Sparse Vectors](/vector/features/sparseindexes#creating-sparse-vectors) for the details of the above models.
+
## Using a Model
To start using embedding models, create the index with a model of your choice.
diff --git a/vector/features/hybridindexes.mdx b/vector/features/hybridindexes.mdx
new file mode 100644
index 00000000..2cfa05fe
--- /dev/null
+++ b/vector/features/hybridindexes.mdx
@@ -0,0 +1,400 @@
+---
+title: Hybrid Indexes
+---
+
+Dense indexes are useful for performing semantic searches over
+a dataset to find the most similar items quickly. They rely on
+embedding models to generate dense vectors that are similar to each
+other for similar concepts, and they do this well for data within
+the domain that the embedding model is trained on.
+But they sometimes fail, especially when the data
+is outside the training domain of the model. For such cases, a more traditional
+exact search with sparse vectors performs better.
+
+Hybrid indexes allow you to combine the best of these two worlds so that
+you can get semantically similar results, and enhance them with exact
+token/word matching to make the query results more relevant.
+
+Upstash supports hybrid indexes that manage a dense and a sparse index
+component for you. When you perform a query, it queries both the dense
+and the sparse index and fuses the results.
+
+## Creating Dense And Sparse Vectors
+
+Since a hybrid index is a combination of a dense and a sparse index,
+you can use the same methods you have used for dense and sparse indexes,
+and combine them.
+
+Upstash allows you to upsert and query raw dense and sparse vectors to
+give you full control over the models you would use.
+
+Also, to make embedding easier for you, Upstash provides some hosted
+models and allows you to upsert and query raw text data. Behind the scenes,
+the text data is converted to dense and sparse vectors.
+
+You can create your index with a dense and sparse embedding model
+to use this feature.
+
+## Using Hybrid Indexes
+
+### Upserting Dense and Sparse Vectors
+
+You can upsert dense and sparse vectors into Upstash Vector indexes in two different ways.
+
+#### Upserting Raw Dense and Sparse Vectors
+
+You can upsert raw dense and sparse vectors into the index as follows:
+
+
+
+
+
+```python
+from upstash_vector import Index, Vector
+from upstash_vector.types import SparseVector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.upsert(
+ vectors=[
+ Vector(id="id-0", vector=[0.1, 0.5], sparse_vector=SparseVector([1, 2], [0.1, 0.2])),
+ Vector(id="id-1", vector=[0.3, 0.7], sparse_vector=SparseVector([123, 44232], [0.5, 0.4])),
+ ]
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/upsert \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '[
+ {"id": "id-0", "vector": [0.1, 0.5], "sparseVector": {"indices": [1, 2], "values": [0.1, 0.2]}},
+ {"id": "id-1", "vector": [0.3, 0.7], "sparseVector": {"indices": [123, 44232], "values": [0.5, 0.4]}}
+ ]'
+```
+
+
+
+
+
+Note that for hybrid indexes, you have to provide both dense and sparse
+vectors. You cannot omit either of them.
+
+#### Upserting Text as Dense and Sparse Vectors
+
+If you created the hybrid index with Upstash-hosted dense and sparse embedding models,
+you can upsert raw text data, and Upstash can embed it behind the scenes.
+
+
+
+
+
+```python
+from upstash_vector import Index, Vector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.upsert(
+ vectors=[
+ Vector(id="id-0", data="Upstash Vector provides dense and sparse embedding models.")),
+ Vector(id="id-1", data="You can upsert raw text data with these embedding models.")),
+ ]
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/upsert-data \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '[
+ {"id": "id-0", "data": "Upstash Vector provides dense and sparse embedding models."},
+ {"id": "id-1", "data": "You can upsert raw text data with these embedding models."}
+ ]'
+```
+
+
+
+
+
+### Querying Dense and Sparse Vectors
+
+Similar to upserts, you can query dense and sparse vectors in two different ways.
+
+#### Querying with Raw Dense and Sparse Vectors
+
+Hybrid indexes can be queried by providing raw dense and sparse vectors.
+
+
+
+
+
+```python
+from upstash_vector import Index
+from upstash_vector.types import SparseVector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.query(
+ vector=[0.5, 0.4],
+ sparse_vector=SparseVector([3, 5], [0.3, 0.5]),
+ top_k=5,
+ include_metadata=True,
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"vector": [0.5, 0.4], "sparseVector": {"indices": [3, 5], "values": [0.3, 0.5]}, "topK": 5, "includeMetadata": true}'
+```
+
+
+
+
+
+The query results will be fused scores from the dense and sparse indexes.
+
+#### Querying with Text as Dense and Sparse Vectors
+
+If you created the hybrid index with Upstash-hosted dense and sparse embedding models,
+you can query with raw text data, and Upstash can embed it behind the scenes
+before performing the actual query.
+
+
+
+
+
+```python
+from upstash_vector import Index
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.query(
+ data="Upstash Vector",
+ top_k=5,
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query-data \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"data": "Upstash Vector", "topK": 5}'
+```
+
+
+
+
+
+### Fusing Dense And Sparse Query Scores
+
+One of the most crucial parts of the hybrid search pipeline is the step
+where we fuse or rerank dense and sparse search results.
+
+By default, Upstash returns the hybrid query results by fusing/reranking
+the dense and the sparse search results. It provides two fusion algorithms
+to choose from.
+
+#### Reciprocal Rank Fusion
+
+RRF is a method for combining results from dense and sparse indexes.
+It focuses on the order of results, not their raw scores. Each result's score
+is mapped using the formula:
+
+```
+Mapped Score = 1 / (rank + K)
+```
+
+Here, `rank` is the position of the result in the dense or sparse result list, and `K`
+is a constant set to `60`.
+
+If a result appears in both the dense and sparse indexes, its mapped scores are
+added together. If it appears in only one of the indexes, its score remains unchanged.
+After all scores are processed, the results are sorted by their combined scores,
+and the top-K results are returned.
+
+RRF effectively combines rankings from different sources, making use of their strengths,
+while keeping the process simple and focusing on the order of results.
+
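+As an illustration of the formula above (a minimal sketch, not the exact
+server-side implementation), fusing two ranked lists of vector ids with RRF
+could look like this:
+
+```python
+K = 60  # constant used in the mapped score formula
+
+def rrf_fuse(dense_ids, sparse_ids, top_k):
+    # dense_ids and sparse_ids are lists of vector ids, best match first.
+    fused = {}
+    for ids in (dense_ids, sparse_ids):
+        for rank, vector_id in enumerate(ids, start=1):
+            # Mapped score = 1 / (rank + K); scores from both lists are summed.
+            fused[vector_id] = fused.get(vector_id, 0.0) + 1.0 / (rank + K)
+    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
+
+print(rrf_fuse(["id-0", "id-1", "id-2"], ["id-2", "id-0"], top_k=2))
+```
+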
+By default, hybrid indexes use RRF to fuse dense and sparse scores. It can be explicitly
+set for queries as follows:
+
+
+
+
+
+```python
+from upstash_vector import Index
+from upstash_vector.types import FusionAlgorithm, SparseVector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.query(
+ vector=[0.5, 0.4],
+ sparse_vector=SparseVector([3, 5], [0.3, 0.5]),
+ fusion_algorithm=FusionAlgorithm.RRF,
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"vector": [0.5, 0.4], "sparseVector": {"indices": [3, 5], "values": [0.3, 0.5]}, "fusionAlgorithm": "RRF"}'
+```
+
+
+
+
+
+#### Distribution-Based Score Fusion
+
+DBSF is a method for combining results from dense and sparse indexes by considering
+the distribution of scores. Each score is normalized using the formula:
+
+```
+                        s − (μ − 3 * σ)
+Normalized Score = -------------------------
+                   (μ + 3 * σ) − (μ − 3 * σ)
+```
+
+Where:
+
+- `s` is the score.
+- `μ` is the mean of the scores.
+- `σ` is the standard deviation.
+- `(μ − 3 * σ)` represents the minimum value (lower tail of the distribution).
+- `(μ + 3 * σ)` represents the maximum value (upper tail of the distribution).
+
+This formula scales each score to fit roughly between 0 and 1, based on the
+range defined by the distribution's tails.
+
+If a result appears in both the dense and sparse indexes, the normalized scores
+are added together. For results that appear in only one index, the individual
+normalized score is used. After all scores are processed, the results are
+sorted by their combined scores, and the top-K results are returned.
+
+Unlike RRF, this approach takes the distribution of scores into account,
+making it more sensitive to variations in score ranges from the dense and sparse indexes.
+
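+As a rough sketch of the normalization step above (not the exact server-side
+implementation), scaling a list of scores with the 3-sigma range could look
+like this:
+
+```python
+import statistics
+
+def dbsf_normalize(scores):
+    # Scale each score into the range defined by the mean ± 3 standard deviations.
+    mean = statistics.mean(scores)
+    stdev = statistics.pstdev(scores)
+    minimum, maximum = mean - 3 * stdev, mean + 3 * stdev
+    return [(s - minimum) / (maximum - minimum) for s in scores]
+
+# Dense and sparse scores are normalized separately; results appearing in
+# both lists have their normalized scores summed before the final sort.
+print(dbsf_normalize([0.91, 0.84, 0.67]))
+```
+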
+It can be used in hybrid index queries as follows:
+
+
+
+
+
+```python
+from upstash_vector import Index
+from upstash_vector.types import FusionAlgorithm, SparseVector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.query(
+ vector=[0.5, 0.4],
+ sparse_vector=SparseVector([3, 5], [0.3, 0.5]),
+ fusion_algorithm=FusionAlgorithm.DBSF,
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"vector": [0.5, 0.4], "sparseVector": {"indices": [3, 5], "values": [0.3, 0.5]}, "fusionAlgorithm": "DBSF"}'
+```
+
+
+
+
+
+#### Using a Custom Reranker
+
+For some use cases, you might need something other than RRF or DBSF.
+Maybe you want to use the [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3),
+or any reranker model or algorithm of your choice on the dense and sparse
+components of the hybrid index.
+
+For such scenarios, hybrid indexes allow you to perform queries over
+only dense and only sparse components. This way, the hybrid index
+would return semantically similar vectors from the dense index, and
+exact query matches from the sparse index. Then, you can rerank them
+as you like.
+
+
+
+
+
+```python
+from upstash_vector import Index
+from upstash_vector.types import SparseVector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+dense_results = index.query(
+ vector=[0.5, 0.4],
+)
+
+sparse_results = index.query(
+ sparse_vector=SparseVector([3, 5], [0.3, 0.5]),
+)
+
+# Rerank dense and sparse results as you like here
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"vector": [0.5, 0.4]}'
+
+curl $UPSTASH_VECTOR_REST_URL/query \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"sparseVector": {"indices": [3, 5], "values": [0.3, 0.5]}}'
+```
+
+
+
+
diff --git a/vector/features/sparseindexes.mdx b/vector/features/sparseindexes.mdx
new file mode 100644
index 00000000..9cf9138b
--- /dev/null
+++ b/vector/features/sparseindexes.mdx
@@ -0,0 +1,370 @@
+---
+title: Sparse Indexes
+---
+
+Sparse vectors are representations in a high-dimensional space,
+where only a small number of dimensions have non-zero values.
+
+For example, for the same text, a dense vector representation with
+the BGE-M3 model would have 1024 non-zero valued dimensions.
+However, the sparse vector representation of the same text would
+have fewer than a hundred non-zero valued dimensions, even though the vector
+space potentially has more than 250 thousand dimensions. Also, unlike
+dense vector representations, sparse vectors might have varying
+non-zero valued dimensions depending on the text.
+
+Generally, sparse vectors can be represented with two arrays of equal
+sizes:
+
+- The first array for the indices contains the indices of the non-zero
+ dimensions.
+- The second array for values contains the floating point values for
+ the non-zero dimensions.
+
+```python
+dense = [0.1, 0.3, ...thousands of non-zero values..., 0.5, 0.2]
+
+sparse = (
+ [23, 42, 5523, 123987, 240001], # some low number of dimension indices
+ [0.1, 0.3, 0.1, 0.2, 0.5], # non-zero values corresponding to dimensions
+)
+```
+
+Unlike dense vectors, which excel at approximate semantic matching,
+sparse vectors are particularly useful for tasks that require exact or
+near-exact matching of tokens/words/features. That makes them useful
+for various tasks, such as:
+
+- **Information Retrieval and Text Analysis**: By representing documents
+ as sparse vectors where each token/word corresponds to a dimension
+ in a high-dimensional vocabulary, and varying values by the frequencies
+ of the tokens/words in the document or by weighting them with inverse
+ document frequencies to favor rare terms, you can build complex
+ search pipelines.
+- **Recommender Systems**: By representing user interactions, preferences,
+ ratings, or purchases as sparse vectors, you can identify relevant
+ recommendations, and personalize content delivery.
+
+## Creating Sparse Vectors
+
+There are various ways to create sparse vectors. You can use
+[BM25](https://en.wikipedia.org/wiki/Okapi_BM25) for information
+retrieval tasks, or use models like [SPLADE](https://github.com/naver/splade)
+that enhance documents and queries with term weighting and expansion.
+
+Upstash gives you full control by allowing you to upsert and query
+raw sparse vectors.
+
+Similar to dense vectors, it also provides Upstash-hosted models
+that can encode raw text data into sparse vectors for you.
+
+To use them, you have to create your index with the appropriate models.
+
+### BGE-M3 Sparse Vectors
+
+BGE-M3 is a multi-functional, multi-lingual, and multi-granular model
+widely used for dense indexes.
+
+We also provide BGE-M3 as a sparse vector embedder, which outputs
+sparse vectors from a `250_002`-dimensional space.
+
+These sparse vectors have values where each token is weighted
+according to the input text, which enhances traditional sparse vectors
+with contextuality.
+
+### BM25 Sparse Vectors
+
+BM25 is a popular algorithm used in full-text search systems to rank
+documents based on their relevance to a query.
+
+This algorithm relies on key principles of term frequency,
+inverse document frequency, and document length normalization,
+making it well-suited for text retrieval tasks.
+
+- **Rare terms are important**: BM25 gives more weight to words that are
+ less common in the collection of documents. For example, in a search
+ for “Upstash Vector”, the word “Upstash” might be considered more
+ important than “Vector” if it appears less frequently across all documents.
+- **Repeating a word helps, but only up to a point**: BM25 considers how
+ often a word appears in a document, but it limits the benefit of repeating
+ the word too many times. This means mentioning “Upstash” a hundred times
+ won’t make a document overly important compared to one that mentions
+ it just a few times.
+- **Shorter documents often rank higher**: Shorter documents that match
+ the query are usually more relevant. BM25 adjusts for document length
+ so longer documents don’t get unfairly ranked just because they contain
+ more words.
+
+Upstash provides a general purpose BM25 algorithm that applies to documents
+and queries in English. It tokenizes the text into words, removes stop words,
+stems the remaining words, and assigns a weighted value to them based on the
+BM25 formula:
+
+```
+                       IDF(qᵢ) * f(qᵢ, D) * (k₁ + 1)
+BM25(D, Q) = Σ ----------------------------------------------
+               f(qᵢ, D) + k₁ * (1 - b + b * (|D| / avg(|D|)))
+```
+
+Where:
+
+- `f(qᵢ, D)` is the frequency of term `qᵢ` in document `D`.
+- `|D|` is the length of document `D`.
+- `avg(|D|)` is the average document length in the collection.
+- `k₁` is the term frequency saturation parameter.
+- `b` is the length normalization parameter.
+- `IDF(qᵢ)` is the inverse document frequency of term `qᵢ`.
+
+To make it a general purpose model, we had to decide on some of the
+constants mentioned above, which would differ from implementation
+to implementation. We decided to use the following values:
+
+- `k₁` = `1.2`, a widely used value in the absence of advanced optimizations
+- `b` = `0.75`, a widely used value in the absence of advanced optimizations
+- `avg(|D|)` = `32`, which was chosen by tokenizing the
+ [MSMARCO](https://microsoft.github.io/msmarco/) dataset and taking the
+ average document length, rounded to the nearest power of two.
+
+In the future, we might provide support for more languages and the ability to
+provide different values for the above constants.
+
+As for the inverse document frequency `IDF(qᵢ)`, we maintain that information
+per token in the vector database itself. You can use it by specifying `IDF`
+as the weighting strategy for your queries so that you don't have to weight
+it yourself.
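+
+To illustrate how the constants above come together, here is a minimal sketch
+of the weight a single query term contributes to a document's score. It only
+illustrates the formula; it is not the exact implementation used by Upstash:
+
+```python
+import math
+
+K1 = 1.2           # term frequency saturation parameter
+B = 0.75           # length normalization parameter
+AVG_DOC_LEN = 32   # average document length used by the model
+
+def bm25_term_weight(term_freq, doc_len, total_docs, docs_with_term):
+    # IDF(qᵢ) = log((N - n(qᵢ) + 0.5) / (n(qᵢ) + 0.5))
+    idf = math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
+    norm = term_freq + K1 * (1 - B + B * (doc_len / AVG_DOC_LEN))
+    return idf * term_freq * (K1 + 1) / norm
+
+# A rare term appearing twice in a 24-token document, in a 10,000-document collection:
+print(bm25_term_weight(term_freq=2, doc_len=24, total_docs=10_000, docs_with_term=15))
+```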
+
+## Using Sparse Indexes
+
+### Upserting Sparse Vectors
+
+You can upsert sparse vectors into Upstash Vector indexes in two different ways.
+
+#### Upserting Raw Sparse Vectors
+
+You can upsert raw sparse vectors by representing them as two arrays of equal
+sizes. One signed 32-bit integer array for non-zero dimension indices,
+and one 32-bit float array for the values.
+
+
+
+
+
+```python
+from upstash_vector import Index, Vector
+from upstash_vector.types import SparseVector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.upsert(
+ vectors=[
+ Vector(id="id-0", sparse_vector=SparseVector([1, 2], [0.1, 0.2])),
+ Vector(id="id-1", sparse_vector=SparseVector([123, 44232], [0.5, 0.4])),
+ ]
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/upsert \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '[
+ {"id": "id-0", "sparseVector": {"indices": [1, 2], "values": [0.1, 0.2]}},
+ {"id": "id-1", "sparseVector": {"indices": [123, 44232], "values": [0.5, 0.4]}}
+ ]'
+```
+
+
+
+
+
+Note that we do not allow sparse vectors to have more than `1_000` non-zero valued dimensions.
+
+#### Upserting Text as Sparse Vectors
+
+If you created the sparse index with an Upstash-hosted sparse embedding model,
+you can upsert raw text data, and Upstash can embed it behind the scenes.
+
+
+
+
+
+```python
+from upstash_vector import Index, Vector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.upsert(
+ vectors=[
+ Vector(id="id-0", data="Upstash Vector provides sparse embedding models.")),
+ Vector(id="id-1", data="You can upsert raw text data with these embedding models.")),
+ ]
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/upsert-data \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '[
+ {"id": "id-0", "data": "Upstash Vector provides sparse embedding models."},
+ {"id": "id-1", "data": "You can upsert raw text data with these embedding models."}
+ ]'
+```
+
+
+
+
+
+### Querying Sparse Vectors
+
+Similar to upserts, you can query sparse vectors in two different ways.
+
+#### Querying with Raw Sparse Vectors
+
+You can query raw sparse vectors by representing the sparse query vector
+as two arrays of equal sizes. One signed 32-bit integer array for
+non-zero dimension indices, and one 32-bit float array for the values.
+
+We use the inner product similarity metric while calculating the
+similarity scores, only considering the matching non-zero valued
+dimension indices between the query vector and the indexed vectors.
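+
+For illustration, the score for a single document is just the dot product over
+the dimension indices shared by the query and the document. A minimal sketch,
+assuming the two-array representation described above:
+
+```python
+def sparse_inner_product(query, document):
+    # query and document are (indices, values) pairs.
+    query_indices, query_values = query
+    doc_indices, doc_values = document
+    doc_map = dict(zip(doc_indices, doc_values))
+    # Only dimensions present in both vectors contribute to the score.
+    return sum(v * doc_map[i] for i, v in zip(query_indices, query_values) if i in doc_map)
+
+print(sparse_inner_product(([3, 5], [0.3, 0.5]), ([3, 5, 7], [0.1, 0.2, 0.4])))  # 0.13
+```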
+
+
+
+
+
+```python
+from upstash_vector import Index
+from upstash_vector.types import SparseVector
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.query(
+ sparse_vector=SparseVector([3, 5], [0.3, 0.5]),
+ top_k=5,
+ include_metadata=True,
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"sparseVector": {"indices": [3, 5], "values": [0.3, 0.5]}, "topK": 5, "includeMetadata": true}'
+```
+
+
+
+
+
+Note that the similarity scores are exact, not approximate. So, if there
+are no vectors with one or more non-zero valued dimension indices matching
+the query vector, the result might contain fewer vectors than the provided top-K value.
+
+#### Querying with Text as Sparse Vectors
+
+If you created the sparse index with an Upstash-hosted sparse embedding model,
+you can query with raw text data, and Upstash can embed it behind the scenes
+before performing the actual query.
+
+
+
+
+
+```python
+from upstash_vector import Index
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.query(
+ data="Upstash Vector",
+ top_k=5,
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query-data \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"data": "Upstash Vector", "topK": 5}'
+```
+
+
+
+
+
+#### Weighting Query Values
+
+For algorithms like BM25, it is important to take into account the inverse
+document frequencies that make matching rare terms more important.
+It might be tricky to maintain that information
+yourself, so Upstash Vector provides it out of the box. To make use
+of IDF in your queries, you can pass it as the weighting strategy.
+
+Since this is mainly meant to be used with BM25 models, the IDF
+is defined as:
+
+```
+IDF(qᵢ) = log((N - n(qᵢ) + 0.5) / (n(qᵢ) + 0.5))
+```
+
+- `N` is the total number of documents in the collection.
+- `n(qᵢ)` is the number of documents containing term `qᵢ`.
+
+
+
+
+
+```python
+from upstash_vector import Index
+from upstash_vector.types import WeightingStrategy
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+index.query(
+ data="Upstash Vector",
+ top_k=5,
+ weighting_strategy=WeightingStrategy.IDF,
+)
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query-data \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"data": "Upstash Vector", "topK": 5, "weightingStrategy": "IDF"}'
+```
+
+
+
+
From 31b8ef474bd20e82cde7ffa53d9d65b209e38d18 Mon Sep 17 00:00:00 2001
From: Metin Dumandag <29387993+mdumandag@users.noreply.github.com>
Date: Mon, 16 Dec 2024 15:04:33 +0300
Subject: [PATCH 2/4] add field names for the sparse vector request/response
fields
---
vector/api/endpoints/fetch-random.mdx | 8 ++++++++
vector/api/endpoints/fetch.mdx | 8 ++++++++
vector/api/endpoints/query-data.mdx | 8 ++++++++
vector/api/endpoints/query.mdx | 8 ++++++++
vector/api/endpoints/range.mdx | 8 ++++++++
vector/api/endpoints/resumable-query/resume.mdx | 8 ++++++++
.../api/endpoints/resumable-query/start-with-vector.mdx | 8 ++++++++
vector/api/endpoints/update.mdx | 8 ++++++++
vector/api/endpoints/upsert.mdx | 8 ++++++++
9 files changed, 72 insertions(+)
diff --git a/vector/api/endpoints/fetch-random.mdx b/vector/api/endpoints/fetch-random.mdx
index dc2af116..bc5e272c 100644
--- a/vector/api/endpoints/fetch-random.mdx
+++ b/vector/api/endpoints/fetch-random.mdx
@@ -28,6 +28,14 @@ The response will be `null` if the namespace is empty.
The sparse vector value for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
diff --git a/vector/api/endpoints/fetch.mdx b/vector/api/endpoints/fetch.mdx
index 7c099a86..e9a33889 100644
--- a/vector/api/endpoints/fetch.mdx
+++ b/vector/api/endpoints/fetch.mdx
@@ -53,6 +53,14 @@ their vector ids.
The sparse vector value for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/query-data.mdx b/vector/api/endpoints/query-data.mdx
index ed77b1b4..63c4364a 100644
--- a/vector/api/endpoints/query-data.mdx
+++ b/vector/api/endpoints/query-data.mdx
@@ -99,6 +99,14 @@ objects below is returned, one for each query item.
The sparse vector value for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/query.mdx b/vector/api/endpoints/query.mdx
index 208ef8a7..ee3d3017 100644
--- a/vector/api/endpoints/query.mdx
+++ b/vector/api/endpoints/query.mdx
@@ -95,6 +95,14 @@ objects below is returned, one for each query item.
The sparse vector value for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/range.mdx b/vector/api/endpoints/range.mdx
index b2f9f888..5c205b45 100644
--- a/vector/api/endpoints/range.mdx
+++ b/vector/api/endpoints/range.mdx
@@ -57,6 +57,14 @@ authMethod: "GET"
The sparse vector value for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/resumable-query/resume.mdx b/vector/api/endpoints/resumable-query/resume.mdx
index 007409b3..999215b1 100644
--- a/vector/api/endpoints/resumable-query/resume.mdx
+++ b/vector/api/endpoints/resumable-query/resume.mdx
@@ -31,6 +31,14 @@ authMethod: "bearer"
The sparse vector value for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/resumable-query/start-with-vector.mdx b/vector/api/endpoints/resumable-query/start-with-vector.mdx
index 04e60f29..a9f3593e 100644
--- a/vector/api/endpoints/resumable-query/start-with-vector.mdx
+++ b/vector/api/endpoints/resumable-query/start-with-vector.mdx
@@ -92,6 +92,14 @@ authMethod: "bearer"
The sparse vector value for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
The metadata of the vector, if any.
diff --git a/vector/api/endpoints/update.mdx b/vector/api/endpoints/update.mdx
index 972164c9..ee6a5f16 100644
--- a/vector/api/endpoints/update.mdx
+++ b/vector/api/endpoints/update.mdx
@@ -24,6 +24,14 @@ of those.
The sparse vector value to update to for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
The raw text data to update to.
diff --git a/vector/api/endpoints/upsert.mdx b/vector/api/endpoints/upsert.mdx
index bd833618..07b47989 100644
--- a/vector/api/endpoints/upsert.mdx
+++ b/vector/api/endpoints/upsert.mdx
@@ -23,6 +23,14 @@ You can either upsert a single vector, or multiple vectors in an array.
The sparse vector value for sparse and hybrid indexes.
+
+
+ Indices of the non-zero valued dimensions.
+
+
+ Values of the non-zero valued dimensions.
+
+
The metadata of the vector. This makes identifying vectors
From ebe40d6c1a336ef0070fd2b1b6b3d99872b567a5 Mon Sep 17 00:00:00 2001
From: Metin Dumandag <29387993+mdumandag@users.noreply.github.com>
Date: Tue, 17 Dec 2024 10:37:54 +0300
Subject: [PATCH 3/4] fix python code snippets
---
vector/features/hybridindexes.mdx | 4 ++--
vector/features/sparseindexes.mdx | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/vector/features/hybridindexes.mdx b/vector/features/hybridindexes.mdx
index 2cfa05fe..482deff4 100644
--- a/vector/features/hybridindexes.mdx
+++ b/vector/features/hybridindexes.mdx
@@ -105,8 +105,8 @@ index = Index(
index.upsert(
vectors=[
- Vector(id="id-0", data="Upstash Vector provides dense and sparse embedding models.")),
- Vector(id="id-1", data="You can upsert raw text data with these embedding models.")),
+ Vector(id="id-0", data="Upstash Vector provides dense and sparse embedding models."),
+ Vector(id="id-1", data="You can upsert raw text data with these embedding models."),
]
)
```
diff --git a/vector/features/sparseindexes.mdx b/vector/features/sparseindexes.mdx
index 9cf9138b..2cc16222 100644
--- a/vector/features/sparseindexes.mdx
+++ b/vector/features/sparseindexes.mdx
@@ -204,8 +204,8 @@ index = Index(
index.upsert(
vectors=[
- Vector(id="id-0", data="Upstash Vector provides sparse embedding models.")),
- Vector(id="id-1", data="You can upsert raw text data with these embedding models.")),
+ Vector(id="id-0", data="Upstash Vector provides sparse embedding models."),
+ Vector(id="id-1", data="You can upsert raw text data with these embedding models."),
]
)
```
From 89d215145a9e06fbafa34e0daa208248d62bff31 Mon Sep 17 00:00:00 2001
From: Metin Dumandag <29387993+mdumandag@users.noreply.github.com>
Date: Fri, 20 Dec 2024 16:19:26 +0300
Subject: [PATCH 4/4] add query mode support documentation for hybrid indexes
---
vector/api/endpoints/query-data.mdx | 11 +++
.../resumable-query/start-with-data.mdx | 12 +++
vector/features/hybridindexes.mdx | 82 ++++++++++++++++---
vector/features/sparseindexes.mdx | 29 +++----
4 files changed, 107 insertions(+), 27 deletions(-)
diff --git a/vector/api/endpoints/query-data.mdx b/vector/api/endpoints/query-data.mdx
index 63c4364a..3b3fe805 100644
--- a/vector/api/endpoints/query-data.mdx
+++ b/vector/api/endpoints/query-data.mdx
@@ -61,6 +61,17 @@ of fields below.
Other possible value is `DBSF` (Distribution-Based Score Fusion).
+
+ Query mode for hybrid indexes with Upstash-hosted
+ embedding models.
+
+ Specifies whether to run the query in only the
+ dense index, only the sparse index, or in both.
+
+ If not provided, defaults to `HYBRID`.
+
+ Possible values are `HYBRID`, `DENSE`, and `SPARSE`.
+
## Path
diff --git a/vector/api/endpoints/resumable-query/start-with-data.mdx b/vector/api/endpoints/resumable-query/start-with-data.mdx
index 707f9a36..32e8db9d 100644
--- a/vector/api/endpoints/resumable-query/start-with-data.mdx
+++ b/vector/api/endpoints/resumable-query/start-with-data.mdx
@@ -59,6 +59,18 @@ authMethod: "bearer"
Other possible value is `DBSF` (Distribution-Based Score Fusion).
+
+ Query mode for hybrid indexes with Upstash-hosted
+ embedding models.
+
+ Specifies whether to run the query in only the
+ dense index, only the sparse index, or in both.
+
+ If not provided, defaults to `HYBRID`.
+
+ Possible values are `HYBRID`, `DENSE`, and `SPARSE`.
+
+
## Path
diff --git a/vector/features/hybridindexes.mdx b/vector/features/hybridindexes.mdx
index 482deff4..3645b41c 100644
--- a/vector/features/hybridindexes.mdx
+++ b/vector/features/hybridindexes.mdx
@@ -25,11 +25,11 @@ Since a hybrid index is a combination of a dense and a sparse index,
you can use the same methods you have used for dense and sparse indexes,
and combine them.
-Upstash allows you to upsert and query raw dense and sparse vectors to
+Upstash allows you to upsert and query dense and sparse vectors to
give you full control over the models you would use.
Also, to make embedding easier for you, Upstash provides some hosted
-models and allows you to upsert and query raw text data. Behind the scenes,
+models and allows you to upsert and query text data. Behind the scenes,
the text data is converted to dense and sparse vectors.
You can create your index with a dense and sparse embedding model
@@ -41,9 +41,9 @@ to use this feature.
You can upsert dense and sparse vectors into Upstash Vector indexes in two different ways.
-#### Upserting Raw Dense and Sparse Vectors
+#### Upserting Dense and Sparse Vectors
-You can upsert raw dense and sparse vectors into the index as follows:
+You can upsert dense and sparse vectors into the index as follows:
@@ -86,10 +86,10 @@ curl $UPSTASH_VECTOR_REST_URL/upsert \
Note that for hybrid indexes, you have to provide both dense and sparse
vectors. You cannot omit either of them.
-#### Upserting Text as Dense and Sparse Vectors
+#### Upserting Text Data
If you created the hybrid index with Upstash-hosted dense and sparse embedding models,
-you can upsert raw text data, and Upstash can embed it behind the scenes.
+you can upsert text data, and Upstash can embed it behind the scenes.
@@ -106,7 +106,7 @@ index = Index(
index.upsert(
vectors=[
Vector(id="id-0", data="Upstash Vector provides dense and sparse embedding models."),
- Vector(id="id-1", data="You can upsert raw text data with these embedding models."),
+ Vector(id="id-1", data="You can upsert text data with these embedding models."),
]
)
```
@@ -120,7 +120,7 @@ curl $UPSTASH_VECTOR_REST_URL/upsert-data \
-H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
-d '[
{"id": "id-0", "data": "Upstash Vector provides dense and sparse embedding models."},
- {"id": "id-1", "data": "You can upsert raw text data with these embedding models."}
+ {"id": "id-1", "data": "You can upsert text data with these embedding models."}
]'
```
@@ -132,9 +132,9 @@ curl $UPSTASH_VECTOR_REST_URL/upsert-data \
Similar to upserts, you can query dense and sparse vectors in two different ways.
-#### Querying with Raw Dense and Sparse Vectors
+#### Querying with Dense and Sparse Vectors
-Hybrid indexes can be queried by providing raw dense and sparse vectors.
+Hybrid indexes can be queried by providing dense and sparse vectors.
@@ -173,10 +173,10 @@ curl $UPSTASH_VECTOR_REST_URL/query \
The query results will be fused scores from the dense and sparse indexes.
-#### Querying with Text as Dense and Sparse Vectors
+#### Querying with Text Data
If you created the hybrid index with Upstash-hosted dense and sparse embedding models,
-you can query with raw text data, and Upstash can embed it behind the scenes
+you can query with text data, and Upstash can embed it behind the scenes
before performing the actual query.
@@ -223,7 +223,7 @@ to choose from to do so.
#### Reciprocal Rank Fusion
RRF is a method for combining results from dense and sparse indexes.
-It focuses on the order of results, not their raw scores. Each result's score
+It focuses on the order of results, not their scores. Each result's score
is mapped using the formula:
```
@@ -398,3 +398,59 @@ curl $UPSTASH_VECTOR_REST_URL/query \
+
+#### Using a Custom Reranker with Text Data
+
+Similar to the section above, you might want to use a custom reranker
+for the hybrid indexes created with Upstash-hosted embedding models.
+
+For such scenarios, hybrid indexes with Upstash-hosted embedding models
+allow you to perform queries over only dense and only sparse components.
+This way, the hybrid index would return semantically similar vectors
+from the dense index by embedding the text data into a dense vector,
+and exact query matches from the sparse index by embedding the text data
+into a sparse vector. Then, you can rerank them as you like.
+
+
+
+
+
+```python
+from upstash_vector import Index
+from upstash_vector.types import SparseVector, QueryMode
+
+index = Index(
+ url="UPSTASH_VECTOR_REST_URL",
+ token="UPSTASH_VECTOR_REST_TOKEN",
+)
+
+dense_results = index.query(
+ data="Upstash Vector",
+ query_mode=QueryMode.DENSE,
+)
+
+sparse_results = index.query(
+ data="Upstash Vector",
+ query_mode=QueryMode.SPARSE,
+)
+
+# Rerank dense and sparse results as you like here
+```
+
+
+
+
+
+```shell
+curl $UPSTASH_VECTOR_REST_URL/query-data \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"data": "Upstash Vector", "queryMode": "DENSE"}'
+
+curl $UPSTASH_VECTOR_REST_URL/query-data \
+ -H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
+ -d '{"data": "Upstash Vector", "queryMode": "SPARSE"}'
+```
+
+
+
+
diff --git a/vector/features/sparseindexes.mdx b/vector/features/sparseindexes.mdx
index 2cc16222..5e302d55 100644
--- a/vector/features/sparseindexes.mdx
+++ b/vector/features/sparseindexes.mdx
@@ -53,12 +53,13 @@ retrieval tasks, or use models like [SPLADE](https://github.com/naver/splade)
that enhance documents and queries with term weighting and expansion.
Upstash gives you full control by allowing you to upsert and query
-raw sparse vectors.
+sparse vectors.
-Similar to dense vectors, it also provides Upstash-hosted models
-that can encode raw text data into sparse vectors for you.
+Also, to make embedding easier for you, Upstash provides some hosted
+models and allows you to upsert and query text data. Behind the scenes,
+the text data is converted to sparse vectors.
-To use them, you have to create your index with the appropriate models.
+You can create your index with a sparse embedding model to use this feature.
### BGE-M3 Sparse Vectors
@@ -139,9 +140,9 @@ it yourself.
You can upsert sparse vectors into Upstash Vector indexes in two different ways.
-#### Upserting Raw Sparse Vectors
+#### Upserting Sparse Vectors
-You can upsert raw sparse vectors by representing them as two arrays of equal
+You can upsert sparse vectors by representing them as two arrays of equal
sizes. One signed 32-bit integer array for non-zero dimension indices,
and one 32-bit float array for the values.
@@ -185,10 +186,10 @@ curl $UPSTASH_VECTOR_REST_URL/upsert \
Note that we do not allow sparse vectors to have more than `1_000` non-zero valued dimensions.
-#### Upserting Text as Sparse Vectors
+#### Upserting Text Data
If you created the sparse index with an Upstash-hosted sparse embedding model,
-you can upsert raw text data, and Upstash can embed it behind the scenes.
+you can upsert text data, and Upstash can embed it behind the scenes.
@@ -205,7 +206,7 @@ index = Index(
index.upsert(
vectors=[
Vector(id="id-0", data="Upstash Vector provides sparse embedding models."),
- Vector(id="id-1", data="You can upsert raw text data with these embedding models."),
+ Vector(id="id-1", data="You can upsert text data with these embedding models."),
]
)
```
@@ -219,7 +220,7 @@ curl $UPSTASH_VECTOR_REST_URL/upsert-data \
-H "Authorization: Bearer $UPSTASH_VECTOR_REST_TOKEN" \
-d '[
{"id": "id-0", "data": "Upstash Vector provides sparse embedding models."},
- {"id": "id-1", "data": "You can upsert raw text data with these embedding models."}
+ {"id": "id-1", "data": "You can upsert text data with these embedding models."}
]'
```
@@ -231,9 +232,9 @@ curl $UPSTASH_VECTOR_REST_URL/upsert-data \
Similar to upserts, you can query sparse vectors in two different ways.
-#### Querying with Raw Sparse Vectors
+#### Querying with Sparse Vectors
-You can query raw sparse vectors by representing the sparse query vector
+You can query sparse vectors by representing the sparse query vector
as two arrays of equal sizes. One signed 32-bit integer array for
non-zero dimension indices, and one 32-bit float array for the values.
@@ -279,10 +280,10 @@ Note that, the similarity scores are exact, not approximate. So, if there
are no vectors with one or more non-zero valued dimension indices matching
the query vector, the result might contain fewer vectors than the provided top-K value.
-#### Querying with Text as Sparse Vectors
+#### Querying with Text Data
If you created the sparse index with an Upstash-hosted sparse embedding model,
-you can query with raw text data, and Upstash can embed it behind the scenes
+you can query with text data, and Upstash can embed it behind the scenes
before performing the actual query.