Commit dbd29a6
[REVIEW] cuVS bench: Fix cudaFuncSetAttribute not being called when CAGRA search switches kernel variants (#1851)
Fix a bug in `safely_launch_kernel_with_smem_size` where `cudaFuncSetAttribute` was skipped for kernels that needed it. The function tracked the maximum shared memory size in a single static variable per `KernelT` type, but `cudaFuncSetAttribute` applies to a specific function pointer value, and the single-CTA CAGRA [search](https://github.com/rapidsai/cuvs/blob/d7a28aa1cb7648fa61037ed0459df0ec0e9db841/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh#L1373C4-L1375C78) dispatches multiple kernel instantiations that share the same pointer type. Once one kernel raised the tracked maximum, a different kernel whose requested smem fell between its own previously set limit and that shared maximum would skip `cudaFuncSetAttribute`, and its launch failed with `cudaErrorInvalidValue`. The fix tracks the kernel pointer identity alongside a monotonically growing smem high-water mark: when the pointer changes, the new kernel is brought up to the high-water mark; when the requested smem exceeds the mark, the mark is grown.
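The difference between the two tracking schemes can be illustrated with a small host-only model. This is a sketch under stated assumptions, not the code from this change: `MockRuntime`, `OldTracker`, `FixedTracker`, `replay`, `kernel_a`, and `kernel_b` are hypothetical names, the runtime is reduced to a per-pointer limit table with a 48 KiB default, and thread safety is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins: two kernel instantiations that share one pointer type.
using KernelT = void (*)();
void kernel_a() {}
void kernel_b() {}

// Assumed default dynamic shared memory limit in this model.
constexpr std::size_t kDefaultLimit = 48 * 1024;

// Simplified model of the runtime: every function pointer has its own limit.
struct MockRuntime {
  std::map<KernelT, std::size_t> limits;

  // Models cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize.
  void set_max_dynamic_smem(KernelT k, std::size_t bytes) { limits[k] = bytes; }

  // Models a launch; false stands for cudaErrorInvalidValue.
  bool launch(KernelT k, std::size_t bytes) const {
    auto it = limits.find(k);
    const std::size_t limit = (it == limits.end()) ? kDefaultLimit : it->second;
    return bytes <= limit;
  }
};

// Pre-fix behaviour as described above: one maximum for all kernels of type KernelT.
struct OldTracker {
  std::size_t tracked_max = kDefaultLimit;

  bool launch(MockRuntime& rt, KernelT k, std::size_t bytes) {
    if (bytes > tracked_max) {  // skipped whenever another kernel already raised the max
      rt.set_max_dynamic_smem(k, bytes);
      tracked_max = bytes;
    }
    return rt.launch(k, bytes);
  }
};

// Post-fix behaviour as described above: pointer identity plus a high-water mark.
struct FixedTracker {
  KernelT last_kernel = nullptr;
  std::size_t high_water = kDefaultLimit;

  bool launch(MockRuntime& rt, KernelT k, std::size_t bytes) {
    const bool grew = bytes > high_water;
    if (grew) { high_water = bytes; }  // the mark only ever grows
    if (grew || k != last_kernel) {    // a new pointer is brought up to the mark
      rt.set_max_dynamic_smem(k, high_water);
      last_kernel = k;
    }
    return rt.launch(k, bytes);
  }
};

// Replays (kernel, smem) requests against a fresh runtime and tracker,
// returning one success flag per launch.
template <typename Tracker>
std::vector<bool> replay(const std::vector<std::pair<KernelT, std::size_t>>& requests) {
  MockRuntime rt;
  Tracker tracker;
  std::vector<bool> ok;
  for (const auto& r : requests) { ok.push_back(tracker.launch(rt, r.first, r.second)); }
  return ok;
}
```

Replaying `kernel_a` at 90 KiB, then `kernel_b` at 60 KiB, then `kernel_a` at 70 KiB through `replay<OldTracker>` fails the second launch in this model, because 60 KiB is below the shared maximum yet above `kernel_b`'s own untouched limit. With `replay<FixedTracker>` all three launches succeed, at the cost of one extra attribute call each time the kernel pointer changes.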
## Error in question
```text
$ CUVS_CAGRA_ANN_BENCH --search --data_prefix='<DATA_DIR>/' --benchmark_out_format=csv --benchmark_out=res_search_iter_cagra.csv --benchmark_counters_tabular=true --override_kv=dataset_memory_type:\"device\" <CONFIG_DIR>/laion_1M_cagra_iterative.json
[I] [12:28:52.095261] Using the query file '<DATA_DIR>/laion_1M/queries.fbin'
[I] [12:28:52.096141] Using the ground truth file '<DATA_DIR>/laion_1M/groundtruth.1M.neighbors.ibin'
2026-02-25T12:28:52+00:00
Running CUVS_CAGRA_ANN_BENCH
Run on (224 X 800 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x112)
L1 Instruction 32 KiB (x112)
L2 Unified 2048 KiB (x112)
L3 Unified 307200 KiB (x2)
Load Average: 0.70, 0.44, 0.28
dataset: laion_1M
dim: 768
distance: euclidean
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k max_iterations n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/0/process_time/real_time 5.70 ms 5.70 ms 121 5.68808m 5.69994m 0.96424 0.689692 1.75441M/s 64 10 8 10k 1 2 1.21M dataset_memory_type="device"
cuvs_cagra_iterative/1/process_time/real_time 5.70 ms 5.70 ms 121 5.6863m 5.69879m 0.96424 0.689553 1.75477M/s 64 10 8 10k 1 2 1.21M dataset_memory_type="device"
cuvs_cagra_iterative/2/process_time/real_time 4.92 ms 4.92 ms 140 4.90351m 4.91567m 0.96046 0.688193 2.03432M/s 128 10 12 10k 1 1 1.4M dataset_memory_type="device"
cuvs_cagra_iterative/3/process_time/real_time 5.99 ms 5.99 ms 115 5.97476m 5.98617m 0.97519 0.688409 1.67052M/s 128 10 16 10k 1 1 1.15M dataset_memory_type="device"
cuvs_cagra_iterative/4/process_time/real_time 6.97 ms 6.97 ms 99 6.95873m 6.9703m 0.98129 0.690059 1.43466M/s 256 10 16 10k 1 1 990k dataset_memory_type="device"
cuvs_cagra_iterative/5/process_time/real_time 10.5 ms 10.5 ms 66 0.010479 0.0104908 0.98548 0.692391 953.222k/s 512 10 10 10k 1 2 660k dataset_memory_type="device"
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------
cuvs_cagra_iterative/6/process_time/real_time ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
Obtained 19 stack frames
#1 in CUVS_CAGRA_ANN_BENCH: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#2 in libcuvs.so: void cuvs::neighbors::cagra::detail::single_cta_search::select_and_run<float, unsigned int, float, unsigned int, cuvs::neighbors::filtering::none_sample_filter>(...)
#3 in libcuvs.so: cuvs::neighbors::cagra::detail::single_cta_search::search<float, unsigned int, float, cuvs::neighbors::filtering::none_sample_filter, unsigned int, long>::operator()(...)
#4 in libcuvs.so(+0x18fd0f1)
#5 in libcuvs.so: void cuvs::neighbors::cagra::search<float, unsigned int, long>(...)
#6-#19 in CUVS_CAGRA_ANN_BENCH / libc.so.6
'
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k max_iterations n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/7/process_time/real_time 10.5 ms 10.5 ms 66 0.0105088 0.0105202 0.98663 0.694332 950.555k/s 32 10 32 10k 1 1 660k dataset_memory_type="device"
cuvs_cagra_iterative/8/process_time/real_time 12.8 ms 12.8 ms 54 0.012796 0.0128079 0.98807 0.691628 780.768k/s 32 10 64 10k 1 1 540k dataset_memory_type="device"
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------
cuvs_cagra_iterative/9/process_time/real_time ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
[same stack trace as above]
'
cuvs_cagra_iterative/10/process_time/real_time ERROR OCCURRED: 'Benchmark loop: CUDA error encountered at: file=cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh line=2348: call='cudaPeekAtLastError()', Reason=cudaErrorInvalidValue:invalid argument
[same stack trace as above]
'
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations GPU Latency Recall end_to_end items_per_second itopk k max_iterations n_queries refine_ratio search_width total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cuvs_cagra_iterative/11/process_time/real_time 46.1 ms 46.2 ms 15 0.0461323 0.0461439 0.99131 0.692158 216.714k/s 256 10 10 10k 1 16 150k dataset_memory_type="device"
cuvs_cagra_iterative/12/process_time/real_time 142 ms 142 ms 5 0.141713 0.141725 0.99198 0.708627 70.5591k/s 512 10 32 10k 1 16 50k dataset_memory_type="device"
```
## Config
```
{
"dataset": {
"name": "laion_1M",
"base_file": "laion_1M/base.1M.fbin",
"subset_size": 1000000,
"query_file": "laion_1M/queries.fbin",
"groundtruth_neighbors_file": "laion_1M/groundtruth.1M.neighbors.ibin",
"distance": "euclidean"
},
"search_basic_param": {
"batch_size": 10000,
"k": 10
},
"index": [
{
"name": "cuvs_cagra_iterative",
"algo": "cuvs_cagra",
"build_param": {
"graph_degree": 64,
"intermediate_graph_degree": 128,
"search_width": 1
},
"file": "laion_1M/cagra/q_coarse_iterative.ibin",
"search_params": [
{"itopk": 64, "search_width": 2, "max_iterations": 8, "refine_ratio": 1},
{"itopk": 64, "search_width": 2, "max_iterations": 8, "refine_ratio": 1},
{"itopk": 128, "search_width": 1, "max_iterations": 12, "refine_ratio": 1},
{"itopk": 128, "search_width": 1, "max_iterations": 16, "refine_ratio": 1},
{"itopk": 256, "search_width": 1, "max_iterations": 16, "refine_ratio": 1},
{"itopk": 512, "search_width": 2, "max_iterations": 10, "refine_ratio": 1},
{"itopk": 256, "search_width": 2, "max_iterations": 12, "refine_ratio": 1},
{"itopk": 32, "search_width": 1, "max_iterations": 32, "refine_ratio": 1},
{"itopk": 32, "search_width": 1, "max_iterations": 64, "refine_ratio": 1},
{"itopk": 192, "search_width": 4, "max_iterations": 12, "refine_ratio": 1},
{"itopk": 256, "search_width": 4, "max_iterations": 12, "refine_ratio": 1},
{"itopk": 256, "search_width": 16, "max_iterations": 10, "refine_ratio": 1},
{"itopk": 512, "search_width": 16, "max_iterations": 32, "refine_ratio": 1}
]
}
]
}
```
Authors:
- https://github.com/irina-resh-nvda
Approvers:
- Artem M. Chirkin (https://github.com/achirkin)
URL: #1851
1 parent aad241e commit dbd29a6
2 files changed
Lines changed: 39 additions & 28 deletions