Skip to content

Add native find_in_set utility#4636

Draft
viadea wants to merge 4 commits into
NVIDIA:mainfrom
viadea:codex/native-find-in-set
Draft

Add native find_in_set utility#4636
viadea wants to merge 4 commits into
NVIDIA:mainfrom
viadea:codex/native-find-in-set

Conversation

@viadea

@viadea viadea commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Adds a native Spark-compatible find_in_set utility for the scalar-word plus string-column set case.

The new implementation scans each comma-delimited RHS row once on the GPU and returns the 1-based token position, 0 for missing values, and null for null RHS rows. This avoids materializing split list columns and the follow-up list search for the dynamic RHS path in Spark RAPIDS. It also includes a fast path for literal words containing commas, which Spark treats as no match.

Tests:

  • Added Java coverage for first-match semantics, missing words, empty tokens, null RHS rows, UTF-8 tokens, and words containing commas.
  • git diff --cached --check
  • Local Maven compile was attempted, but this checkout does not have thirdparty/cudf initialized, so the lifecycle stops at the cuDF submodule check and direct Java test compilation cannot resolve ai.rapids.cudf classes. This PR is opened as draft for CI/native validation.

Signed-off-by: Hao Zhu <hazhu@nvidia.com>
@viadea

viadea commented May 27, 2026

Copy link
Copy Markdown
Collaborator Author

pre-commit.ci autofix

viadea added 2 commits May 26, 2026 23:01
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a native GPU find_in_set utility with two entry points: find_in_set (simple per-row scan) and find_in_set_repeated (dictionary-encoded path for repeated set strings), both returning a 1-based INT32 position column with Spark-compatible null semantics.

  • find_in_set scans each row byte-by-byte on the GPU, handles UTF-8 correctly via size_bytes(), and short-circuits for words containing commas. The implementation is correct for all tested cases.
  • find_in_set_repeated dictionary-encodes the column, scans only unique keys, then gathers results — a sound approach; however, the word_scalar device buffer used inside the kernel lambda may be freed before the async CUDA kernel finishes if the memory resource is not stream-ordered, and the DONT_CHECK gather for null rows silently depends on an undocumented invariant that keys_size >= 1 whenever the gather is reached.
  • Test coverage is solid for first-match, UTF-8, null rows, and comma-word fast paths, but edge cases for empty columns and maxDistinctSets=0 are not exercised.

Confidence Score: 3/5

The new CUDA kernel is logically correct for all documented cases, but one assumption about stream-ordered memory deallocation is load-bearing and undocumented.

The scalar (word_scalar) that backs the device pointer captured by the thrust lambda is destroyed when find_in_set returns, while the kernel runs asynchronously. This is safe with cuDF's default stream-ordered memory resource but silently breaks with a synchronous allocator. Additionally, the DONT_CHECK gather in find_in_set_repeated relies on an implicit coupling between the keys_size == 0 early-return guard and the sentinel gather index of 0 used for null rows — removing or reordering those guards would produce silent out-of-bounds reads. Neither issue manifests in typical cuDF usage, but both should be hardened or documented before the draft is promoted.

src/main/cpp/src/find_in_set.cu — specifically the word_scalar lifetime around line 99–113 and the gather sentinel logic around line 167–183.

Important Files Changed

Filename Overview
src/main/cpp/src/find_in_set.cu Core CUDA implementation; the word_scalar device memory lifetime depends on stream-ordered deallocation, and the gather sentinel relies on an undocumented coupling with the keys_size == 0 early-return guard.
src/main/cpp/src/find_in_set.hpp Header declares find_in_set and find_in_set_repeated with correct default stream/mr parameters; no issues.
src/main/cpp/src/StringUtilsJni.cpp JNI glue correctly handles null checks, auto-set-device, and uses the nullptr→0 sentinel convention for findInSetRepeated; no issues.
src/main/java/com/nvidia/spark/rapids/jni/StringUtils.java Java API validates arguments correctly and maps the 0-sentinel to null; adds NativeDepsLoader.loadNativeDeps() static block that was previously missing.
src/test/java/com/nvidia/spark/rapids/jni/StringUtilsTest.java Good coverage of first-match, missing words, empty tokens, null rows, UTF-8, and comma-word fast path; missing tests for empty column and maxDistinctSets=0.
src/main/cpp/CMakeLists.txt Adds find_in_set.cu to the library source list in alphabetical order; no issues.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Java: findInSet / findInSetRepeated] --> B[JNI: StringUtilsJni.cpp]
    B --> C{word contains comma?}
    C -- Yes --> D[fill all rows with 0, preserve null mask]
    C -- No --> E[find_in_set path]
    E --> F[Create word_scalar on device]
    F --> G[thrust::transform per row]
    G --> H{row is null?}
    H -- Yes --> I[write 0 to data buffer, null mask marks as null]
    H -- No --> J[find_token_position byte-by-byte scan]
    J --> K[return 1-based position or 0]
    B2[findInSetRepeated] --> C2{word contains comma?}
    C2 -- Yes --> D2[make_zero_or_null_result]
    C2 -- No --> E2[dictionary::encode sets column]
    E2 --> F2{keys_size > maxDistinctSets?}
    F2 -- Yes --> G2[return nullptr → Java null]
    F2 -- No --> H2{keys_size == 0?}
    H2 -- Yes --> D2
    H2 -- No --> I2[find_in_set on dictionary keys]
    I2 --> J2[build gather_map: null→0, non-null→dict index]
    J2 --> K2[gather key_positions via gather_map]
    K2 --> L2[overwrite null mask from original sets]
    L2 --> M2[return result column]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[Java: findInSet / findInSetRepeated] --> B[JNI: StringUtilsJni.cpp]
    B --> C{word contains comma?}
    C -- Yes --> D[fill all rows with 0, preserve null mask]
    C -- No --> E[find_in_set path]
    E --> F[Create word_scalar on device]
    F --> G[thrust::transform per row]
    G --> H{row is null?}
    H -- Yes --> I[write 0 to data buffer, null mask marks as null]
    H -- No --> J[find_token_position byte-by-byte scan]
    J --> K[return 1-based position or 0]
    B2[findInSetRepeated] --> C2{word contains comma?}
    C2 -- Yes --> D2[make_zero_or_null_result]
    C2 -- No --> E2[dictionary::encode sets column]
    E2 --> F2{keys_size > maxDistinctSets?}
    F2 -- Yes --> G2[return nullptr → Java null]
    F2 -- No --> H2{keys_size == 0?}
    H2 -- Yes --> D2
    H2 -- No --> I2[find_in_set on dictionary keys]
    I2 --> J2[build gather_map: null→0, non-null→dict index]
    J2 --> K2[gather key_positions via gather_map]
    K2 --> L2[overwrite null mask from original sets]
    L2 --> M2[return result column]
Loading

Reviews (1): Last reviewed commit: "Add repeated find_in_set native path" | Re-trigger Greptile

Comment on lines +99 to +113
auto word_scalar = cudf::make_string_scalar(word, stream);
auto const& word_string_scalar = static_cast<cudf::string_scalar const&>(*word_scalar);
auto const d_word = cudf::string_view(word_string_scalar.data(), word_string_scalar.size());

auto const sets_column = cudf::column_device_view::create(sets.parent(), stream);
auto const d_sets = *sets_column;

thrust::transform(rmm::exec_policy(stream),
thrust::make_counting_iterator<cudf::size_type>(0),
thrust::make_counting_iterator<cudf::size_type>(row_count),
d_results,
[d_sets, d_word] __device__(cudf::size_type idx) {
if (d_sets.is_null(idx)) { return cudf::size_type{0}; }
return find_token_position(d_sets.element<cudf::string_view>(idx), d_word);
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 d_word device pointer outlives its owning scalar in async execution

word_scalar is a unique_ptr destroyed when find_in_set returns. The lambda captures d_word — a cudf::string_view that holds a raw pointer into the scalar's device buffer — and that buffer will be freed when word_scalar is destroyed. Since thrust::transform is asynchronous on the stream, the kernel may still be accessing d_word.data() after word_scalar's destructor runs.

This is safe only if the device memory resource uses stream-ordered deallocation (the cuDF convention with cuda_async_memory_resource). If a caller supplies a synchronous mr (or the RMM pool is configured otherwise), this is a use-after-free. Consider keeping word_scalar alive by either extending its scope past the function or storing its device buffer in the results and syncing explicitly before the scalar is freed.

Comment on lines +167 to +175
thrust::transform(rmm::exec_policy(stream),
thrust::make_counting_iterator<cudf::size_type>(0),
thrust::make_counting_iterator<cudf::size_type>(row_count),
d_gather_map,
[d_dictionary = *d_dictionary, d_indices = *d_indices] __device__(
cudf::size_type idx) {
return d_dictionary.is_null(idx) ? cudf::size_type{0}
: d_indices.element<cudf::size_type>(idx);
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Null-row gather index silently relies on keys_size >= 1

For null rows, the lambda returns cudf::size_type{0}, which becomes the gather index into key_positions. This is safe because if (keys_size == 0) { return make_zero_or_null_result(); } on line 151 guarantees at least one key when we reach the gather. However, the safety of DONT_CHECK at the gather call depends entirely on this ordering — there is no bounds assertion or comment to document the invariant. If the early-return guard is ever moved or removed, null rows will silently gather out of bounds with DONT_CHECK. A brief comment coupling the two guards would make the invariant explicit.

Comment on lines +93 to +97
if (word.find(',') != std::string::npos) {
thrust::fill_n(rmm::exec_policy(stream), d_results, row_count, cudf::size_type{0});
results->set_null_count(sets.null_count());
return results;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant set_null_count after make_numeric_column

cudf::make_numeric_column already accepts and stores the null count from its constructor argument (sets.null_count()). The subsequent results->set_null_count(sets.null_count()) call (repeated in both the comma fast path here and inside make_zero_or_null_result in find_in_set_repeated) is a no-op. The same redundancy appears on line 114 in the normal kernel path. These extra calls are harmless but create noise; removing them improves clarity.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +131 to +158
@Test
void testFindInSetRepeated() {
try (
ColumnVector sets = ColumnVector.fromStrings(
"a,b,c",
"x,b",
"a,b,c",
null,
"b",
"",
",");
ColumnVector expectedB = ColumnVector.fromBoxedInts(
2, 2, 2, null, 1, 0, 0);
ColumnVector expectedEmpty = ColumnVector.fromBoxedInts(
0, 0, 0, null, 0, 1, 1);
ColumnVector expectedComma = ColumnVector.fromBoxedInts(
0, 0, 0, null, 0, 0, 0);
ColumnVector actualB = StringUtils.findInSetRepeated(sets, "b", 5);
ColumnVector actualEmpty = StringUtils.findInSetRepeated(sets, "", 5);
ColumnVector actualComma = StringUtils.findInSetRepeated(sets, "a,b", 5)) {
assertColumnsAreEqual(expectedB, actualB);
assertColumnsAreEqual(expectedEmpty, actualEmpty);
assertColumnsAreEqual(expectedComma, actualComma);
try (ColumnVector tooManyDistinct = StringUtils.findInSetRepeated(sets, "b", 4)) {
assertNull(tooManyDistinct);
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing edge-case coverage for findInSetRepeated

Two code paths exercised by find_in_set_repeated have no test coverage:

  1. Empty column (row_count == 0) — both find_in_set and find_in_set_repeated take an early return to produce an empty INT32 column; there is no test asserting the output type or size.
  2. maxDistinctSets = 0 — the Java-side validation allows 0 (only negative values are rejected), but there is no test confirming that a non-trivial column returns null and a fully-null column returns the expected all-null INT32 result.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants