[RELEASE] cuvs v25.08#1205

Merged

AyodeAwe merged 115 commits into main from branch-25.08 on Aug 6, 2025

@AyodeAwe


❄️ Code freeze for branch-25.08 and v25.08 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-25.08 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-25.08 into main for the release

raydouglass and others added 30 commits April 30, 2025 15:12
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Contributes to rapidsai/build-planning#181

* removes all uploads of conda packages and wheels to `downloads.rapids.ai`

## Notes for Reviewers

### How I identified changes

Looked for uses of the relevant `gha-tools` tools, as well as documentation about `downloads.rapids.ai`, being on the NVIDIA VPN, using S3, etc. like this:

```shell
git grep -i -E 's3|upload|downloads\.rapids|vpn'
```

### How I tested this

See "How I tested this" on rapidsai/shared-workflows#364

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)
  - Bradley Dice (https://github.com/bdice)

URL: #940
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
Forward-merge branch-25.06 into branch-25.08
This PR removes CUDA 11 devcontainers and updates CI scripts.

xref: rapidsai/build-planning#184

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #960
cjnolet and others added 8 commits July 25, 2025 19:18
In #902 and #1034 we introduced a `Dataset` interface to support on-heap and off-heap ("native") memory seamlessly as inputs for cagra and bruteforce index building.

As we expand the functionality of cuvs-java, we realized we have similar needs for outputs (see e.g. #1105 / #1102 or #1104).

This PR extends `Dataset` to support being used as an output, wrapping native (off-heap) memory in a convenient and efficient way, and providing common utilities to transform to and from on-heap memory.
This work is inspired by the existing raft `mdspan` and `DLTensor` data structures, but tailored to our needs (2d only, just 3 data types, etc.). The PR keeps the current implementation simple and minimal on purpose, but structured in a way that is simple to extend.

By itself, the PR is just a refactoring to extend the `Dataset` implementation and reorganize the implementation classes; its real usefulness will be in using it in the PRs mentioned above (in fact, this PR has been extracted from #1105).
The implementation class hierarchy is designed with future extensions in mind: at the moment we have one `HostMemoryDatasetImpl`, but we are already considering a corresponding `DeviceMemoryDatasetImpl` that will wrap and manage (views of) GPU memory, avoiding (in some cases) extra copies of data from GPU memory to CPU memory only to process them or forward them to another algorithm (e.g. quantization followed by indexing).

Future work will also include adding support for (and refactoring) the allocation and management of GPU memory and DLTensors (e.g. working better with, or refactoring, `prepareTensor`).

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: #1111
…dIndexParamsCreate (#1147)

Now uses `cuvsTieredIndexParamsCreate` and `cuvsTieredIndexParamsDestroy` instead of allocating an arena in Java.
  - Uses `CloseableHandle` and `CuvsParamsHelper`, as in #1109 and #1110, for consistency
  
  This fixes #1138

Authors:
  - Puneet Ahuja (https://github.com/punAhuja)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: #1147
Add helper functions to construct reasonable default CAGRA build parameters based on HNSW parameters.
The goal is to build a CAGRA graph that can be converted to an HNSW graph whose search performance is very similar to that of the corresponding HNSW-built graph.

Additionally, this PR refactors the CAGRA benchmark wrapper to store build parameters in a single place and flexibly set defaults based on the dataset shape.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #1125
The refine functions that work with GPU data use IVF-Flat under the hood to perform the refinement operation. This PR adds extern template declarations for `ivfflat_interleaved_scan` and uses these in the refine functions. This way we avoid recompiling the IVF-Flat search kernels, and save binary size.

Before this PR, `ivfflat_interleaved_scan` was compiled through the `ivf_flat::search()` function instantiations, but the function symbols were not available due to inlining. This PR also adds explicit instantiations for `ivfflat_interleaved_scan`, so that both `ivf_flat::search` and `refine` can now use the same interleaved scan function.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Divye Gala (https://github.com/divyegala)

URL: #1095
This PR gives a proof-of-concept implementation of GPU-based index build for the ScaNN index. The ScaNN index defined here is similar in structure to the IVF-PQ index (a tree structure coming from kmeans, plus product quantization of vectors assigned to leaf nodes), together with an “AVQ update” of the kmeans centroids and a spilled cluster assignment from the “SOAR” loss.

Other features, optimizations, and customizability options will appear in subsequent PRs.

* scann_build.cuh 

This file contains the implementation for build(..). The general pipeline looks like:
  - Train kmeans centers on sampled data
  - Assign all dataset vectors to kmeans clusters by minimizing L2 loss
  - Update kmeans centers with AVQ
  - Train PQ codebook on sampled residual vectors (here we use VPQ, slightly modified to perform product quantization on individual subspaces, e.g. each subspace has its own codebook)
  - Quantization loop (batched):
      - Compute spilled SOAR labels (performed here to minimize HtoD copies)
      - Compute and quantize residuals/soar residuals using trained pq codebook
      - If enabled, compute bf16 quantization of dataset vectors (performed here to minimize HtoD copies).
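
To make the optional bf16 step in the pipeline above concrete, here is a minimal host-side sketch of float32 to bfloat16 conversion. It is not the code from this PR: the function names are hypothetical, round-to-nearest-even is an assumption about the rounding mode, and NaN/Inf inputs are not handled.

```python
import struct


def float_to_bf16_bits(value):
    """Return the 16-bit bfloat16 pattern for a Python float.

    Assumption: round-to-nearest-even on the dropped 16 mantissa bits.
    NaN and Inf inputs are not handled in this sketch.
    """
    # Reinterpret the value as its IEEE-754 float32 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Bias that rounds to nearest, with ties going to the even result.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(bf16_bits):
    """Expand a bfloat16 bit pattern back to a Python float."""
    # bfloat16 is the upper half of a float32, so pad with 16 zero bits.
    return struct.unpack("<f", struct.pack("<I", bf16_bits << 16))[0]


def quantize_row_bf16(row):
    """Quantize every element of one dataset vector to bf16 bit patterns."""
    return [float_to_bf16_bits(v) for v in row]
```

bf16 keeps the float32 exponent range and drops mantissa bits, so the quantized copy of the dataset is half the size of the float32 original.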

* scann_avq.cuh

This file contains apply_avq(..), which recomputes cluster centers using AVQ. The main technique is a single application of Theorem 4.2 in https://arxiv.org/pdf/1908.10396 to each cluster, using parameters:
h_i_parallel = eta * ||x_i|| ^ (eta - 1)
h_i_orthogonal = ||x_i|| ^ (eta - 1)
The implementation of Theorem 4.2 is in compute_avq_centroid(..)
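
As an illustration of how these two weights are used, the sketch below evaluates the anisotropic loss of a single vector against a candidate center. It assumes the loss form from the paper linked above (the parallel and orthogonal parts of the residual are weighted separately) and a nonzero `x`. The function names are hypothetical, and this is not the code in compute_avq_centroid(..).

```python
import math


def avq_weights(x, eta):
    """Weights from the formulas above: (h_parallel, h_orthogonal)."""
    norm = math.sqrt(sum(v * v for v in x))
    h_parallel = eta * norm ** (eta - 1)
    h_orthogonal = norm ** (eta - 1)
    return h_parallel, h_orthogonal


def anisotropic_loss(x, center, eta):
    """Weighted squared error of one vector against one candidate center.

    Assumption: loss = h_parallel * ||r_par||^2 + h_orthogonal * ||r_orth||^2,
    where r = x - center is split into its components along x and orthogonal to x.
    """
    h_parallel, h_orthogonal = avq_weights(x, eta)
    norm_sq = sum(v * v for v in x)
    residual = [xi - ci for xi, ci in zip(x, center)]
    # Projection of the residual onto the direction of x.
    scale = sum(r * xi for r, xi in zip(residual, x)) / norm_sq
    r_parallel = [scale * xi for xi in x]
    r_orthogonal = [r - p for r, p in zip(residual, r_parallel)]
    parallel_sq = sum(p * p for p in r_parallel)
    orthogonal_sq = sum(o * o for o in r_orthogonal)
    return h_parallel * parallel_sq + h_orthogonal * orthogonal_sq
```

With eta > 1, an error along `x` costs more than an error of the same length orthogonal to `x`, which is the sense in which the loss is anisotropic.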

The overall pipeline for apply_avq(..) is:
  - Build clusters from kmeans cluster assignments
  - For each cluster:
      - Gather cluster vectors into single matrix
      - Update kmeans center via compute_avq_centroid
  - Rescale updated centroids (I need to add more details about this step).

* scann_quantize.cuh

This file contains helpers for PQ. Codebooks are created from residual vectors using train_pq from vpq_dataset.cuh (using a single vq center which is set to zero). Unlike in VPQ, codebooks are generated separately for each subspace, rather than collapsing all subspaces into a single space and computing a global codebook. 
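
To illustrate the per-subspace layout, here is a small encoding sketch: the residual is split into equal-width chunks and each chunk is matched against its own codebook. The function names and the nested-list codebook layout are assumptions for illustration, and codebook training (train_pq) is not shown.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def encode_per_subspace(residual, codebooks):
    """Return one code per subspace for a single residual vector.

    codebooks[s] is the list of codewords for subspace s; every codeword has
    length len(residual) // len(codebooks). Each subspace is searched only in
    its own codebook, never in a codebook shared across subspaces.
    """
    num_subspaces = len(codebooks)
    sub_dim = len(residual) // num_subspaces
    codes = []
    for s, codebook in enumerate(codebooks):
        chunk = residual[s * sub_dim:(s + 1) * sub_dim]
        distances = [squared_l2(chunk, codeword) for codeword in codebook]
        codes.append(distances.index(min(distances)))
    return codes
```

For example, with two 2-dimensional subspaces, `encode_per_subspace([0.9, 0.1, -1.0, 2.0], [[[0.0, 0.0], [1.0, 0.0]], [[-1.0, 2.0], [1.0, 1.0]]])` picks codeword 1 in the first subspace and codeword 0 in the second.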



* scann_soar.cuh

The main function is compute_soar_labels(..), which computes a second, spilled cluster assignment by minimizing the SOAR loss function (Theorem 3.1 in https://arxiv.org/pdf/2404.00774) 
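
A rough single-vector sketch of the spilled assignment follows. It assumes the SOAR loss takes the form ||r'||^2 + lambda * (<r, r'> / ||r||)^2, where r is the residual to the primary center and r' is the residual to a candidate center. The names `soar_loss`, `compute_spilled_label` and `soar_lambda` are hypothetical, and this is not the batched GPU code in compute_soar_labels(..).

```python
def soar_loss(primary_residual, candidate_residual, soar_lambda):
    """Assumed SOAR loss: squared length of the candidate residual plus a
    penalty on its component parallel to the primary residual."""
    candidate_sq = sum(v * v for v in candidate_residual)
    primary_sq = sum(v * v for v in primary_residual)
    if primary_sq == 0.0:
        # The vector sits exactly on its primary center: no direction to penalize.
        return candidate_sq
    dot = sum(a * b for a, b in zip(primary_residual, candidate_residual))
    return candidate_sq + soar_lambda * dot * dot / primary_sq


def compute_spilled_label(x, centers, primary_label, soar_lambda):
    """Pick the secondary (spilled) cluster for x, skipping the primary one."""
    primary_residual = [xi - ci for xi, ci in zip(x, centers[primary_label])]
    best_label, best_loss = None, float("inf")
    for label, center in enumerate(centers):
        if label == primary_label:
            continue
        candidate_residual = [xi - ci for xi, ci in zip(x, center)]
        loss = soar_loss(primary_residual, candidate_residual, soar_lambda)
        if loss < best_loss:
            best_label, best_loss = label, loss
    return best_label
```

With `soar_lambda = 0` this reduces to picking the second-nearest center by L2 distance. A positive value favors centers whose residual points away from the primary residual, so the two assignments are less likely to make the same error.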

* scann_serialize.cuh

Contains the implementation of serialize(..). The goal is to serialize ScaNN artifacts in a way that is usable with open-source ScaNN search with minimal additional post-processing. The cluster assignments, quantized vectors (for both the primary and spilled SOAR assignments), and bf16 dataset are all stored in separate .npy files for direct consumption by open-source ScaNN. The codebook and cluster centers are also serialized separately, but require additional post-processing into the correct Protobuf structs (not included in this PR). 

Test Plan:
This code is mostly tested via CPU search with open-source ScaNN. Additional protobuf artifacts are created from the cuVS serialized index via an external tool. A pareto for OpenAI 5M is shown here:

Authors:
  - https://github.com/rmaschal

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Artem M. Chirkin (https://github.com/achirkin)

URL: #1120
@AyodeAwe AyodeAwe requested review from a team as code owners July 31, 2025 15:25
@AyodeAwe AyodeAwe requested review from msarahan and removed request for a team July 31, 2025 15:25
@copy-pr-bot

copy-pr-bot Bot commented Jul 31, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@AyodeAwe AyodeAwe merged commit 8af9b84 into main Aug 6, 2025
5 of 6 checks passed