[WIP] IP Vector Normalization to avoid all vectors clumped into single cluster in IVF-PQ #1892
Closed
Changes from 18 commits
Commits
21 commits
ee94971 HowardHuang1: Fixed vector normalization for Inner Product -- still failing 15 test…
78e47f1 aamijar: Merge branch 'main' into HH-Vector-Normalization
e822460 HowardHuang1: add logging to compare recall and search speed for raw v.s. normalize…
da32073 HowardHuang1: revert change comparing normalize and raw in single run -- that was p…
5375bb5 HowardHuang1: previously normalization was applied to entire IVF-PQ pipeline --> ch…
c320640 HowardHuang1: revert to raw vectors. No need to normalize here because normalized v…
0486bf5 HowardHuang1: clean up code
107e2b3 HowardHuang1: clean up code
5442b89 HowardHuang1: upload code that resolves linker issue + live csv updates
dc9b6df HowardHuang1: remove live_csv
cf3666c HowardHuang1: Merge branch 'main' into HH-Vector-Normalization
bdda881 HowardHuang1: hardcode file path instead of searching multiple directories + fix in…
c237db3 HowardHuang1: clean up unnecessary checks in data_export.py
3c20377 HowardHuang1: bring back comma parsing instead of underscore parsing
728c964 HowardHuang1: bring back parts of plot/__main__.py for clarity
6c9bc36 HowardHuang1: get rid of incremental JSON->CSV write for clarity
eb6bb88 HowardHuang1: bring back original plot/__main__.py for clarity
4546a85 HowardHuang1: fix cuvs-bench generate groundtruth which was sorted incorrectly for …
ea8a6b4 HowardHuang1: revert normalization outside kernel that requires copy of dataset res…
5800e53 HowardHuang1: didn't modify kernels themselves but routed IP predict through existi…
945d3ee HowardHuang1: add inner_product_cosine_assignment flag which is set when ivf_pq_bui…
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1324,30 +1324,61 @@ auto build(raft::resources const& handle, | |
| auto cluster_centers = cluster_centers_buf.data(); | ||
|
|
||
| // Train balanced hierarchical kmeans clustering | ||
| auto trainset_const_view = raft::make_const_mdspan(trainset.view()); | ||
| auto centers_view = raft::make_device_matrix_view<float, internal_extents_t>( | ||
| auto centers_view = raft::make_device_matrix_view<float, internal_extents_t>( | ||
| cluster_centers, impl->n_lists(), impl->dim()); | ||
| cuvs::cluster::kmeans::balanced_params kmeans_params; | ||
| kmeans_params.n_iters = params.kmeans_n_iters; | ||
| kmeans_params.metric = static_cast<cuvs::distance::DistanceType>((int)impl->metric()); | ||
|
|
||
| if (impl->metric() == distance::DistanceType::CosineExpanded) { | ||
| raft::linalg::row_normalize<raft::linalg::L2Norm>( | ||
| handle, trainset_const_view, trainset.view()); | ||
| } | ||
| cuvs::cluster::kmeans::fit(handle, kmeans_params, trainset_const_view, centers_view); | ||
|
|
||
| // Trainset labels are needed for training PQ codebooks | ||
| rmm::device_uvector<uint32_t> labels(n_rows_train, stream, big_memory_resource); | ||
| auto labels_view = | ||
| raft::make_device_vector_view<uint32_t, internal_extents_t>(labels.data(), n_rows_train); | ||
| auto centers_const_view = raft::make_device_matrix_view<const float, internal_extents_t>( | ||
| cluster_centers, impl->n_lists(), impl->dim()); | ||
| if (impl->metric() == distance::DistanceType::CosineExpanded) { | ||
|
|
||
| if (impl->metric() == distance::DistanceType::InnerProduct) { | ||
| // Normalization only for k-means: use a copy so trainset stays in original space; metric | ||
| // remains inner product for the rest of the pipeline. | ||
| auto trainset_kmeans = raft::make_device_mdarray<float>( | ||
| handle, device_memory, raft::make_extents<int64_t>(n_rows_train, dim)); | ||
| raft::copy(handle, trainset_kmeans.view(), trainset.view()); | ||
| auto trainset_kmeans_view = raft::make_device_matrix_view<float, internal_extents_t>( | ||
| trainset_kmeans.data_handle(), n_rows_train, dim); | ||
| raft::linalg::row_normalize<raft::linalg::L2Norm>( | ||
| handle, raft::make_const_mdspan(trainset_kmeans_view), trainset_kmeans_view); | ||
| auto trainset_kmeans_const_view = raft::make_const_mdspan(trainset_kmeans.view()); | ||
| cuvs::cluster::kmeans::fit(handle, kmeans_params, trainset_kmeans_const_view, centers_view); | ||
| raft::linalg::row_normalize<raft::linalg::L2Norm>(handle, centers_const_view, centers_view); | ||
| cuvs::cluster::kmeans::predict( | ||
| handle, kmeans_params, trainset_kmeans_const_view, centers_const_view, labels_view); | ||
| // Recompute centers in original space (mean of unnormalized trainset per cluster), overwrites centers_view | ||
| rmm::device_uvector<uint32_t> cluster_sizes(impl->n_lists(), stream, device_memory); | ||
| cuvs::cluster::kmeans::detail::calc_centers_and_sizes<float, float, internal_extents_t, uint32_t, uint32_t>( | ||
|
Contributor: can we use the
Contributor (Author): Hey @jinsolp !
|
||
| handle, | ||
| cluster_centers, | ||
| cluster_sizes.data(), | ||
| static_cast<internal_extents_t>(impl->n_lists()), | ||
| static_cast<internal_extents_t>(dim), | ||
| trainset.data_handle(), | ||
| static_cast<internal_extents_t>(n_rows_train), | ||
| labels.data(), | ||
| true, | ||
| raft::identity_op{}, | ||
| device_memory); | ||
| } else if (impl->metric() == distance::DistanceType::CosineExpanded) { | ||
| auto trainset_const_view = raft::make_const_mdspan(trainset.view()); | ||
| raft::linalg::row_normalize<raft::linalg::L2Norm>( | ||
| handle, trainset_const_view, trainset.view()); | ||
| cuvs::cluster::kmeans::fit(handle, kmeans_params, trainset_const_view, centers_view); | ||
| raft::linalg::row_normalize<raft::linalg::L2Norm>(handle, centers_const_view, centers_view); | ||
| cuvs::cluster::kmeans::predict( | ||
| handle, kmeans_params, trainset_const_view, centers_const_view, labels_view); | ||
| } else { | ||
| auto trainset_const_view = raft::make_const_mdspan(trainset.view()); | ||
| cuvs::cluster::kmeans::fit(handle, kmeans_params, trainset_const_view, centers_view); | ||
| cuvs::cluster::kmeans::predict( | ||
| handle, kmeans_params, trainset_const_view, centers_const_view, labels_view); | ||
| } | ||
| auto labels_view = | ||
| raft::make_device_vector_view<uint32_t, internal_extents_t>(labels.data(), n_rows_train); | ||
| cuvs::cluster::kmeans::predict( | ||
| handle, kmeans_params, trainset_const_view, centers_const_view, labels_view); | ||
|
|
||
| // Make rotation matrix | ||
| helpers::make_rotation_matrix(handle, impl->rotation_matrix(), params.force_random_rotation); | ||
|
|
||
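The `InnerProduct` branch in the diff above appears to do three things: L2-normalize a copy of the trainset, assign labels in that normalized space, and then recompute the centers as per-cluster means of the unnormalized vectors. The following is a minimal pure-Python sketch of that idea on toy data, not cuVS code. The helper names (`assign_by_inner_product`, `mean_per_cluster`) and the starting centers are hypothetical, and a fixed set of centers stands in for the full balanced k-means fit.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Return v scaled to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def assign_by_inner_product(points, centers):
    """Label each point with the index of the center giving the largest dot product."""
    return [
        max(range(len(centers)), key=lambda j: dot(p, centers[j]))
        for p in points
    ]


def mean_per_cluster(points, labels, n_clusters):
    """Per-cluster mean and size; empty clusters keep an all-zero center."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(n_clusters)]
    sizes = [0] * n_clusters
    for p, label in zip(points, labels):
        sizes[label] += 1
        for d in range(dim):
            sums[label][d] += p[d]
    centers = [
        [s / size for s in row] if size else row
        for row, size in zip(sums, sizes)
    ]
    return centers, sizes


# Toy trainset: two directions (mostly-x and mostly-y) with very different norms.
trainset = [[1.0, 0.1], [2.0, 0.0], [10.0, 1.0], [0.1, 1.0], [0.0, 3.0], [0.5, 8.0]]
# Hypothetical starting centers; the first one has a much larger norm.
centers = [[5.0, 4.0], [0.2, 1.0]]

# Raw inner product: the large-norm center wins for every point (the clumping problem).
raw_labels = assign_by_inner_product(trainset, centers)

# Normalize copies of the points and centers, then assign in that space.
unit_points = [l2_normalize(p) for p in trainset]
unit_centers = [l2_normalize(c) for c in centers]
cosine_labels = assign_by_inner_product(unit_points, unit_centers)

# Recompute centers from the unnormalized trainset so later stages see the original space.
new_centers, cluster_sizes = mean_per_cluster(trainset, cosine_labels, len(centers))

print(raw_labels)      # [0, 0, 0, 0, 0, 0]
print(cosine_labels)   # [0, 0, 0, 1, 1, 1]
print(cluster_sizes)   # [3, 3]
```

In this toy case the raw inner product sends every vector to the large-norm center, while assignment by direction splits the data evenly. The copy made in `unit_points` is the memory cost that the review comments further down object to.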
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| #!/usr/bin/env python | ||
| # | ||
| # SPDX-FileCopyrightText: Copyright (c) 2024-2026, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Launcher to run cuvs_bench from THIS repository's Python code (ignoring | ||
| # any cuvs_bench installed in the environment). Use this when developing | ||
| # cuvs_bench in a fork or branch and you want to ensure the local code | ||
| # is used while still using conda-installed C++ binaries. | ||
| # | ||
| # Usage (from the repo root that contains python/cuvs_bench, e.g. cuvs_vector_norm): | ||
| # python python/cuvs_bench/run_benchmark_local.py --dataset ... --algorithms ... | ||
| # To confirm which package is used: CUVS_BENCH_DEBUG_LOAD=1 python python/cuvs_bench/run_benchmark_local.py ... | ||
| # To use local libcuvs.so: set CUVS_HOME to repo root (runner sets LD_PRELOAD and LD_LIBRARY_PATH for the benchmark subprocess). | ||
| # | ||
| # One-liner (must fix sys.argv so Click sees your flags; run from repo root): | ||
| # python -c " | ||
| # import sys, runpy, os | ||
| # sys.path.insert(0, os.path.join(os.getcwd(), 'python')) | ||
| # if '--' in sys.argv: | ||
| # sys.argv = ['cuvs_bench.run'] + sys.argv[sys.argv.index('--')+1:] | ||
| # runpy.run_module('cuvs_bench.run', run_name='__main__') | ||
| # " -- --dataset deep-image-96-inner -k 10 --batch-size 10 --algorithms cuvs_ivf_pq ... | ||
|
|
||
| from pathlib import Path | ||
| import os | ||
| import runpy | ||
| import sys | ||
|
|
||
| # Repo root: directory that contains python/cuvs_bench (one level up from this file's parent) | ||
| _SCRIPT_DIR = Path(__file__).resolve().parent | ||
| _REPO_PYTHON = _SCRIPT_DIR.parent # python/ inside the repo | ||
| _REPO_ROOT = _REPO_PYTHON.parent # repo root | ||
|
|
||
| # Prepend this repo's python directory so "import cuvs_bench" uses local code. | ||
| # Clear PYTHONPATH so the env cannot override (e.g. conda or shell set to cuvs). | ||
| if "PYTHONPATH" in os.environ: | ||
| os.environ.pop("PYTHONPATH") | ||
| _REPO_PYTHON_STR = str(_REPO_PYTHON) | ||
| if _REPO_PYTHON_STR not in sys.path: | ||
| sys.path.insert(0, _REPO_PYTHON_STR) | ||
| elif sys.path[0] != _REPO_PYTHON_STR: | ||
| sys.path.remove(_REPO_PYTHON_STR) | ||
| sys.path.insert(0, _REPO_PYTHON_STR) | ||
| if os.environ.get("CUVS_BENCH_DEBUG_LOAD"): | ||
| print(f"[cuvs_bench launcher] using python path: {_REPO_PYTHON_STR}", file=sys.stderr) | ||
|
|
||
| # Run the run module as __main__ | ||
| runpy.run_module("cuvs_bench.run", run_name="__main__") |
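The launcher above relies on two mechanisms: an entry at the front of `sys.path` takes precedence when a package is imported, and `runpy.run_module(..., run_name="__main__")` executes a module as if it had been started with `python -m`. A small self-contained sketch of both follows, using a throwaway package in a temporary directory. The package name `demo_local_pkg` and its contents are hypothetical stand-ins for `cuvs_bench`.

```python
import os
import runpy
import sys
import tempfile

with tempfile.TemporaryDirectory() as repo_python:
    # Build a tiny package that records what it sees when run as __main__.
    pkg_dir = os.path.join(repo_python, "demo_local_pkg")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w"):
        pass
    with open(os.path.join(pkg_dir, "run.py"), "w") as f:
        f.write("import sys\nSEEN_ARGV = list(sys.argv)\nLOADED_FROM = __file__\n")

    saved_path, saved_argv = list(sys.path), list(sys.argv)
    try:
        # Front of sys.path wins over any installed copy with the same name.
        sys.path.insert(0, repo_python)
        # run_module does not rewrite sys.argv by default, so set it explicitly.
        sys.argv = ["demo_local_pkg.run", "--dataset", "toy"]
        module_globals = runpy.run_module("demo_local_pkg.run", run_name="__main__")
    finally:
        # Restore interpreter state so the demo has no lasting side effects.
        sys.path[:] = saved_path
        sys.argv[:] = saved_argv

print(module_globals["SEEN_ARGV"])
print(os.path.basename(module_globals["LOADED_FROM"]))
```

Note that removing `PYTHONPATH` from `os.environ` at runtime does not change `sys.path` in the running interpreter, which is populated at startup; it only affects child processes. In the launcher it is the explicit `sys.path.insert(0, ...)` that makes the local code win.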
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| #!/usr/bin/env python | ||
| # | ||
| # SPDX-FileCopyrightText: Copyright (c) 2024-2026, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Launcher to run cuvs_bench.plot from THIS repository's Python code (ignoring | ||
| # any cuvs_bench installed in the environment). Use this when developing | ||
| # cuvs_bench in a fork or branch so the local plot code is used. | ||
| # | ||
| # Usage (from the repo root that contains python/cuvs_bench, e.g. cuvs_vector_norm): | ||
| # python python/cuvs_bench/run_plot_local.py --search --dataset deep-image-96-inner --dataset-path ./datasets -k 10 -bs 10000 --output-filepath . | ||
| # To confirm which package is used: CUVS_BENCH_DEBUG_LOAD=1 python python/cuvs_bench/run_plot_local.py ... | ||
|
|
||
| from pathlib import Path | ||
| import os | ||
| import runpy | ||
| import sys | ||
|
|
||
| # Repo root: directory that contains python/cuvs_bench (one level up from this file's parent) | ||
| _SCRIPT_DIR = Path(__file__).resolve().parent | ||
| _REPO_PYTHON = _SCRIPT_DIR.parent # python/ inside the repo | ||
| _REPO_ROOT = _REPO_PYTHON.parent # repo root | ||
|
|
||
| # Prepend this repo's python directory so "import cuvs_bench" uses local code. | ||
| # Clear PYTHONPATH so the env cannot override (e.g. conda or another repo). | ||
| if "PYTHONPATH" in os.environ: | ||
| os.environ.pop("PYTHONPATH") | ||
| _REPO_PYTHON_STR = str(_REPO_PYTHON) | ||
| if _REPO_PYTHON_STR not in sys.path: | ||
| sys.path.insert(0, _REPO_PYTHON_STR) | ||
| elif sys.path[0] != _REPO_PYTHON_STR: | ||
| sys.path.remove(_REPO_PYTHON_STR) | ||
| sys.path.insert(0, _REPO_PYTHON_STR) | ||
| if os.environ.get("CUVS_BENCH_DEBUG_LOAD"): | ||
| print(f"[cuvs_bench launcher] using python path: {_REPO_PYTHON_STR}", file=sys.stderr) | ||
|
|
||
| # Run the plot module as __main__ | ||
| runpy.run_module("cuvs_bench.plot", run_name="__main__") |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,154 @@ | ||
| #!/usr/bin/env python3 | ||
| # | ||
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Standalone check: does groundtruth.neighbors.ibin match exact inner-product | ||
| # top-k on the full base? (No dependency on installed cuvs_bench version.) | ||
| # | ||
| # Usage: | ||
| # python verify_ip_groundtruth.py base.fbin queries.fbin groundtruth.ibin [k] [query_row] | ||
| # Prints overlap, ids, and a side-by-side table of (idx, dot) for brute vs file (first 10 of k). | ||
| # | ||
| import struct | ||
| import sys | ||
|
|
||
| import numpy as np | ||
|
|
||
|
|
||
| def _read_shape(path): | ||
| with open(path, "rb") as f: | ||
| return struct.unpack("<II", f.read(8)) | ||
|
|
||
|
|
||
| def mmap_fbin(path): | ||
| nr, nc = _read_shape(path) | ||
| return np.memmap(path, dtype=np.float32, mode="r", offset=8, shape=(nr, nc)) | ||
|
|
||
|
|
||
| def mmap_ibin(path): | ||
| nr, nc = _read_shape(path) | ||
| return np.memmap(path, dtype=np.int32, mode="r", offset=8, shape=(nr, nc)) | ||
|
|
||
|
|
||
| def ip_scores_for_indices(dataset_mm, query_vec, indices): | ||
| """Inner product base[idx]·q for each index (for diagnostics).""" | ||
| q = np.asarray(query_vec, dtype=np.float32).ravel() | ||
| n = dataset_mm.shape[0] | ||
| out = np.empty(len(indices), dtype=np.float64) | ||
| for i, idx in enumerate(indices): | ||
| idx = int(idx) | ||
| if idx < 0 or idx >= n: | ||
| out[i] = np.nan | ||
| else: | ||
| row = np.asarray(dataset_mm[idx], dtype=np.float32) | ||
| out[i] = float(np.dot(row, q)) | ||
| return out | ||
|
|
||
|
|
||
| def brute_ip_topk_chunked(query_vec, dataset_mm, k, chunk_rows=65536): | ||
| q = np.asarray(query_vec, dtype=np.float32).ravel() | ||
| n = dataset_mm.shape[0] | ||
| top_sim = np.full(k, -np.inf, dtype=np.float64) | ||
| top_idx = np.zeros(k, dtype=np.int64) | ||
| for start in range(0, n, chunk_rows): | ||
| end = min(start + chunk_rows, n) | ||
| block = np.asarray(dataset_mm[start:end], dtype=np.float32) | ||
| sim = block @ q | ||
| merged_sim = np.concatenate([top_sim, sim.astype(np.float64)]) | ||
| merged_idx = np.concatenate( | ||
| [top_idx, np.arange(start, end, dtype=np.int64)] | ||
| ) | ||
| pick = np.argsort(-merged_sim)[:k] | ||
| top_sim = merged_sim[pick] | ||
| top_idx = merged_idx[pick] | ||
| return top_idx | ||
|
|
||
|
|
||
| def main(): | ||
| if len(sys.argv) < 4: | ||
| print( | ||
| "usage: python verify_ip_groundtruth.py " | ||
| "base.fbin queries.fbin groundtruth.neighbors.ibin [k] [query_row]", | ||
| file=sys.stderr, | ||
| ) | ||
| sys.exit(2) | ||
| base_p, q_p, gt_p = sys.argv[1:4] | ||
| k = int(sys.argv[4]) if len(sys.argv) > 4 else 10 | ||
| qi = int(sys.argv[5]) if len(sys.argv) > 5 else 0 | ||
|
|
||
| base = mmap_fbin(base_p) | ||
| queries = mmap_fbin(q_p) | ||
| gt = mmap_ibin(gt_p) | ||
| if base.shape[1] != queries.shape[1]: | ||
| print( | ||
| f"dim mismatch base {base.shape[1]} vs queries {queries.shape[1]}", | ||
| file=sys.stderr, | ||
| ) | ||
| sys.exit(1) | ||
| kk = min(k, gt.shape[1]) | ||
| if qi >= queries.shape[0] or qi >= gt.shape[0]: | ||
| print("query_row out of range", file=sys.stderr) | ||
| sys.exit(1) | ||
|
|
||
| print( | ||
| f"shapes: base={base.shape} queries={queries.shape} gt={gt.shape} " | ||
| f"(gt rows should match queries rows; gt cols >= k)" | ||
| ) | ||
|
|
||
| truth = brute_ip_topk_chunked(queries[qi], base, kk) | ||
| got = np.asarray(gt[qi, :kk], dtype=np.int64) | ||
| n_base = base.shape[0] | ||
| bad_got = np.logical_or(got < 0, got >= n_base) | ||
| if bad_got.any(): | ||
| print( | ||
| f"warning: {bad_got.sum()} neighbor id(s) out of range [0, {n_base}) " | ||
| f"in file row {qi} — possible wrong dtype/endian or corrupt header" | ||
| ) | ||
|
|
||
| inter = len(set(truth.tolist()) & set(got.tolist())) | ||
| print(f"query_row={qi} k={kk} overlap true∩file = {inter}/{kk}") | ||
| print(f" brute IP top-{kk} ids: {truth.tolist()}") | ||
| print(f" file row ids: {got.tolist()}") | ||
|
|
||
| qv = np.asarray(queries[qi], dtype=np.float32).ravel() | ||
| truth_dots = ip_scores_for_indices(base, qv, truth) | ||
| got_dots = ip_scores_for_indices(base, qv, got) | ||
| # Sort by dot descending so you see "best first" (same order as true IP ranking) | ||
| t_order = np.argsort(-truth_dots) | ||
| g_order = np.argsort(-got_dots) | ||
| show = min(10, kk) | ||
| print() | ||
| print( | ||
| f"Inner product (dot) scores for query_row={qi} " | ||
| f"(showing first {show} of {kk}; sorted by dot desc within each list):" | ||
| ) | ||
| print(f" {'rank':>4} {'brute idx':>12} {'brute dot':>14} | {'file idx':>12} {'file dot':>14}") | ||
| for r in range(show): | ||
| ti = t_order[r] | ||
| gi = g_order[r] | ||
| print( | ||
| f" {r + 1:4d} {int(truth[ti]):12d} {truth_dots[ti]:14.6g} | " | ||
| f"{int(got[gi]):12d} {got_dots[gi]:14.6g}" | ||
| ) | ||
| if kk > show: | ||
| print(f" ... ({kk - show} more rows per column not shown)") | ||
| print() | ||
| print( | ||
| f" brute: min_dot={np.nanmin(truth_dots):.6g} max_dot={np.nanmax(truth_dots):.6g} " | ||
| f"(true IP top-{kk} should have the k largest dots in the dataset)" | ||
| ) | ||
| print( | ||
| f" file: min_dot={np.nanmin(got_dots):.6g} max_dot={np.nanmax(got_dots):.6g} " | ||
| f"(if file is IP GT, these should match brute up to ties)" | ||
| ) | ||
|
|
||
| if inter < kk: | ||
| print( | ||
| "If not k/k, GT file is not raw IP top-k for this base/queries " | ||
| "(or rows misaligned)." | ||
| ) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |
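The script above assumes a simple binary layout: an 8-byte header holding the row and column counts as little-endian `uint32`, followed by row-major `float32` (`.fbin`) or `int32` (`.ibin`) values. A minimal stdlib-only sketch of that layout and of the overlap check is shown below. The helpers `write_bin`, `read_bin`, and `ip_topk` are hypothetical, the data is made up, and `heapq.nlargest` stands in for the chunked NumPy top-k.

```python
import heapq
import os
import struct
import tempfile


def write_bin(path, rows, fmt):
    """Write rows behind an 8-byte header: n_rows, n_cols as little-endian uint32.

    fmt is "f" for float32 payloads (.fbin) or "i" for int32 payloads (.ibin).
    """
    n_cols = len(rows[0])
    with open(path, "wb") as f:
        f.write(struct.pack("<II", len(rows), n_cols))
        for row in rows:
            f.write(struct.pack(f"<{n_cols}{fmt}", *row))


def read_bin(path, fmt):
    """Read the header, then one fixed-size row at a time."""
    with open(path, "rb") as f:
        n_rows, n_cols = struct.unpack("<II", f.read(8))
        return [
            list(struct.unpack(f"<{n_cols}{fmt}", f.read(4 * n_cols)))
            for _ in range(n_rows)
        ]


def ip_topk(query, base, k):
    """Indices of the k base rows with the largest inner product, best first."""
    def score(i):
        return sum(a * b for a, b in zip(base[i], query))
    return heapq.nlargest(k, range(len(base)), key=score)


base_rows = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
query_rows = [[1.0, 2.0]]

with tempfile.TemporaryDirectory() as tmp:
    base_path = os.path.join(tmp, "base.fbin")
    gt_path = os.path.join(tmp, "groundtruth.neighbors.ibin")
    write_bin(base_path, base_rows, "f")
    # Deliberately not the inner-product ranking: this row is the squared-L2
    # ranking for the same toy data.
    write_bin(gt_path, [[2, 1, 4]], "i")
    base = read_bin(base_path, "f")
    gt = read_bin(gt_path, "i")

truth = ip_topk(query_rows[0], base, 3)
overlap = len(set(truth) & set(gt[0]))
print(truth, gt[0], f"overlap={overlap}/3")  # [2, 3, 1] [2, 1, 4] overlap=2/3
```

An overlap below `k/k`, as here, is the signal the script reports when the ground-truth file was not produced by an exact inner-product search over the same base and queries.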
Wait, we seem to be making a copy of the trainset? Why? Just do the normalization in the distance kernels.
This is not the way we do things: we don't make unnecessary copies just to avoid changing kernels. This becomes frustrating for users, because they are almost always memory-limited and every GB counts.
Fixed.
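One way to read the review comments above is that the cluster assignment can be computed without a normalized copy of the trainset. For a fixed point, dividing every score by that point's norm cannot change which center wins, so only the center norms need to be applied, and those can be folded into the score computation. The sketch below illustrates this in pure Python; it is an assumption about the intent, not the actual cuVS kernel change, and both function names are hypothetical. It covers the assignment step only, not the centroid updates inside the k-means fit.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_assign_with_copy(points, centers):
    """Baseline: materialize unit-norm copies of both inputs, then take the best dot product."""
    def unit(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v] if n > 0 else list(v)

    unit_points = [unit(p) for p in points]  # extra memory proportional to the dataset
    unit_centers = [unit(c) for c in centers]
    return [
        max(range(len(unit_centers)), key=lambda j: dot(p, unit_centers[j]))
        for p in unit_points
    ]


def cosine_assign_no_copy(points, centers):
    """Same labels with only one extra float per center: scale scores inside the loop."""
    inv_center_norms = []
    for c in centers:
        n = math.sqrt(dot(c, c))
        inv_center_norms.append(1.0 / n if n > 0 else 0.0)
    labels = []
    for p in points:
        # Dividing by ||p|| would scale every score for this point equally,
        # so it cannot change the winning center and is skipped.
        best = max(
            range(len(centers)),
            key=lambda j: dot(p, centers[j]) * inv_center_norms[j],
        )
        labels.append(best)
    return labels


trainset = [[1.0, 0.1], [2.0, 0.0], [10.0, 1.0], [0.1, 1.0], [0.0, 3.0], [0.5, 8.0]]
centers = [[5.0, 4.0], [0.2, 1.0]]

labels_copy = cosine_assign_with_copy(trainset, centers)
labels_no_copy = cosine_assign_no_copy(trainset, centers)
print(labels_copy, labels_no_copy)  # [0, 0, 0, 1, 1, 1] [0, 0, 0, 1, 1, 1]
```

The second variant reads the original vectors in place, which matches the reviewers' concern that users are usually memory-limited.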