Add native find_in_set utility#4636
Conversation
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
|
pre-commit.ci autofix |
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Greptile SummaryThis PR adds a native GPU
Confidence Score: 3/5The new CUDA kernel is logically correct for all documented cases, but one assumption about stream-ordered memory deallocation is load-bearing and undocumented. The scalar ( src/main/cpp/src/find_in_set.cu — specifically the Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Java: findInSet / findInSetRepeated] --> B[JNI: StringUtilsJni.cpp]
B --> C{word contains comma?}
C -- Yes --> D[fill all rows with 0, preserve null mask]
C -- No --> E[find_in_set path]
E --> F[Create word_scalar on device]
F --> G[thrust::transform per row]
G --> H{row is null?}
H -- Yes --> I[write 0 to data buffer, null mask marks as null]
H -- No --> J[find_token_position byte-by-byte scan]
J --> K[return 1-based position or 0]
B2[findInSetRepeated] --> C2{word contains comma?}
C2 -- Yes --> D2[make_zero_or_null_result]
C2 -- No --> E2[dictionary::encode sets column]
E2 --> F2{keys_size > maxDistinctSets?}
F2 -- Yes --> G2[return nullptr → Java null]
F2 -- No --> H2{keys_size == 0?}
H2 -- Yes --> D2
H2 -- No --> I2[find_in_set on dictionary keys]
I2 --> J2[build gather_map: null→0, non-null→dict index]
J2 --> K2[gather key_positions via gather_map]
K2 --> L2[overwrite null mask from original sets]
L2 --> M2[return result column]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[Java: findInSet / findInSetRepeated] --> B[JNI: StringUtilsJni.cpp]
B --> C{word contains comma?}
C -- Yes --> D[fill all rows with 0, preserve null mask]
C -- No --> E[find_in_set path]
E --> F[Create word_scalar on device]
F --> G[thrust::transform per row]
G --> H{row is null?}
H -- Yes --> I[write 0 to data buffer, null mask marks as null]
H -- No --> J[find_token_position byte-by-byte scan]
J --> K[return 1-based position or 0]
B2[findInSetRepeated] --> C2{word contains comma?}
C2 -- Yes --> D2[make_zero_or_null_result]
C2 -- No --> E2[dictionary::encode sets column]
E2 --> F2{keys_size > maxDistinctSets?}
F2 -- Yes --> G2[return nullptr → Java null]
F2 -- No --> H2{keys_size == 0?}
H2 -- Yes --> D2
H2 -- No --> I2[find_in_set on dictionary keys]
I2 --> J2[build gather_map: null→0, non-null→dict index]
J2 --> K2[gather key_positions via gather_map]
K2 --> L2[overwrite null mask from original sets]
L2 --> M2[return result column]
Reviews (1): Last reviewed commit: "Add repeated find_in_set native path" | Re-trigger Greptile |
| auto word_scalar = cudf::make_string_scalar(word, stream); | ||
| auto const& word_string_scalar = static_cast<cudf::string_scalar const&>(*word_scalar); | ||
| auto const d_word = cudf::string_view(word_string_scalar.data(), word_string_scalar.size()); | ||
|
|
||
| auto const sets_column = cudf::column_device_view::create(sets.parent(), stream); | ||
| auto const d_sets = *sets_column; | ||
|
|
||
| thrust::transform(rmm::exec_policy(stream), | ||
| thrust::make_counting_iterator<cudf::size_type>(0), | ||
| thrust::make_counting_iterator<cudf::size_type>(row_count), | ||
| d_results, | ||
| [d_sets, d_word] __device__(cudf::size_type idx) { | ||
| if (d_sets.is_null(idx)) { return cudf::size_type{0}; } | ||
| return find_token_position(d_sets.element<cudf::string_view>(idx), d_word); | ||
| }); |
There was a problem hiding this comment.
d_word device pointer outlives its owning scalar in async execution
word_scalar is a unique_ptr destroyed when find_in_set returns. The lambda captures d_word — a cudf::string_view that holds a raw pointer into the scalar's device buffer — and that buffer will be freed when word_scalar is destroyed. Since thrust::transform is asynchronous on the stream, the kernel may still be accessing d_word.data() after word_scalar's destructor runs.
This is safe only if the device memory resource uses stream-ordered deallocation (the cuDF convention with cuda_async_memory_resource). If a caller supplies a synchronous mr (or the RMM pool is configured otherwise), this is a use-after-free. Consider keeping word_scalar alive by either extending its scope past the function or storing its device buffer in the results and syncing explicitly before the scalar is freed.
| thrust::transform(rmm::exec_policy(stream), | ||
| thrust::make_counting_iterator<cudf::size_type>(0), | ||
| thrust::make_counting_iterator<cudf::size_type>(row_count), | ||
| d_gather_map, | ||
| [d_dictionary = *d_dictionary, d_indices = *d_indices] __device__( | ||
| cudf::size_type idx) { | ||
| return d_dictionary.is_null(idx) ? cudf::size_type{0} | ||
| : d_indices.element<cudf::size_type>(idx); | ||
| }); |
There was a problem hiding this comment.
Null-row gather index silently relies on
keys_size >= 1
For null rows, the lambda returns cudf::size_type{0}, which becomes the gather index into key_positions. This is safe because if (keys_size == 0) { return make_zero_or_null_result(); } on line 151 guarantees at least one key when we reach the gather. However, the safety of DONT_CHECK at the gather call depends entirely on this ordering — there is no bounds assertion or comment to document the invariant. If the early-return guard is ever moved or removed, null rows will silently gather out of bounds with DONT_CHECK. A brief comment coupling the two guards would make the invariant explicit.
| if (word.find(',') != std::string::npos) { | ||
| thrust::fill_n(rmm::exec_policy(stream), d_results, row_count, cudf::size_type{0}); | ||
| results->set_null_count(sets.null_count()); | ||
| return results; | ||
| } |
There was a problem hiding this comment.
Redundant
set_null_count after make_numeric_column
cudf::make_numeric_column already accepts and stores the null count from its constructor argument (sets.null_count()). The subsequent results->set_null_count(sets.null_count()) call (repeated in both the comma fast path here and inside make_zero_or_null_result in find_in_set_repeated) is a no-op. The same redundancy appears on line 114 in the normal kernel path. These extra calls are harmless but create noise; removing them improves clarity.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| @Test | ||
| void testFindInSetRepeated() { | ||
| try ( | ||
| ColumnVector sets = ColumnVector.fromStrings( | ||
| "a,b,c", | ||
| "x,b", | ||
| "a,b,c", | ||
| null, | ||
| "b", | ||
| "", | ||
| ","); | ||
| ColumnVector expectedB = ColumnVector.fromBoxedInts( | ||
| 2, 2, 2, null, 1, 0, 0); | ||
| ColumnVector expectedEmpty = ColumnVector.fromBoxedInts( | ||
| 0, 0, 0, null, 0, 1, 1); | ||
| ColumnVector expectedComma = ColumnVector.fromBoxedInts( | ||
| 0, 0, 0, null, 0, 0, 0); | ||
| ColumnVector actualB = StringUtils.findInSetRepeated(sets, "b", 5); | ||
| ColumnVector actualEmpty = StringUtils.findInSetRepeated(sets, "", 5); | ||
| ColumnVector actualComma = StringUtils.findInSetRepeated(sets, "a,b", 5)) { | ||
| assertColumnsAreEqual(expectedB, actualB); | ||
| assertColumnsAreEqual(expectedEmpty, actualEmpty); | ||
| assertColumnsAreEqual(expectedComma, actualComma); | ||
| try (ColumnVector tooManyDistinct = StringUtils.findInSetRepeated(sets, "b", 4)) { | ||
| assertNull(tooManyDistinct); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Missing edge-case coverage for
findInSetRepeated
Two code paths exercised by find_in_set_repeated have no test coverage:
- Empty column (
row_count == 0) — bothfind_in_setandfind_in_set_repeatedtake an early return to produce an empty INT32 column; there is no test asserting the output type or size. maxDistinctSets = 0— the Java-side validation allows 0 (only negative values are rejected), but there is no test confirming that a non-trivial column returnsnulland a fully-null column returns the expected all-null INT32 result.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Adds a native Spark-compatible
find_in_setutility for the scalar-word plus string-column set case.The new implementation scans each comma-delimited RHS row once on the GPU and returns the 1-based token position, 0 for missing values, and null for null RHS rows. This avoids materializing split list columns and the follow-up list search for the dynamic RHS path in Spark RAPIDS. It also includes a fast path for literal words containing commas, which Spark treats as no match.
Tests:
git diff --cached --checkthirdparty/cudfinitialized, so the lifecycle stops at the cuDF submodule check and direct Java test compilation cannot resolveai.rapids.cudfclasses. This PR is opened as draft for CI/native validation.