Add native find_in_set utility by viadea · Pull Request #4636 · NVIDIA/cudf-spark-jni

viadea · 2026-05-27T03:11:08Z

Adds a native Spark-compatible find_in_set utility for the scalar-word plus string-column set case.

The new implementation scans each comma-delimited RHS row once on the GPU and returns the 1-based token position, 0 for missing values, and null for null RHS rows. This avoids materializing split list columns and the follow-up list search for the dynamic RHS path in Spark RAPIDS. It also includes a fast path for literal words containing commas, which Spark treats as no match.
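The scan semantics described above can be sketched on the host as follows. This is a minimal CPU reference written for illustration, not the PR's CUDA kernel: the name `find_token_position_host` is hypothetical, and null RHS rows are assumed to be handled by the caller before this function is reached.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical host-side reference for the per-row scan (not the PR's kernel).
// Returns the 1-based index of the first comma-delimited token in `set` that
// equals `word`, or 0 when no token matches.
int find_token_position_host(std::string_view set, std::string_view word)
{
  // A word containing a comma can never equal a single token, so it is
  // treated as "no match" without scanning the row.
  if (word.find(',') != std::string_view::npos) { return 0; }

  int position            = 1;  // 1-based token counter
  std::size_t token_begin = 0;  // byte offset where the current token starts

  // Walk one position past the end so the final token is also closed.
  for (std::size_t i = 0; i <= set.size(); ++i) {
    if (i == set.size() || set[i] == ',') {
      if (set.substr(token_begin, i - token_begin) == word) { return position; }
      ++position;
      token_begin = i + 1;
    }
  }
  return 0;
}
```

Comparing raw bytes is sufficient for UTF-8 input in this sketch because the comma byte (0x2C) never occurs inside a multi-byte UTF-8 sequence. An empty row holds exactly one empty token, so an empty word matches it at position 1.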

Tests:

Added Java coverage for first-match semantics, missing words, empty tokens, null RHS rows, UTF-8 tokens, and words containing commas.
git diff --cached --check
Local Maven compile was attempted, but this checkout does not have thirdparty/cudf initialized, so the lifecycle stops at the cuDF submodule check and direct Java test compilation cannot resolve ai.rapids.cudf classes. This PR is opened as draft for CI/native validation.

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

viadea · 2026-05-27T03:12:47Z

pre-commit.ci autofix

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

greptile-apps · 2026-06-24T02:01:25Z

Greptile Summary

This PR adds a native GPU find_in_set utility with two entry points: find_in_set (simple per-row scan) and find_in_set_repeated (dictionary-encoded path for repeated set strings), both returning a 1-based INT32 position column with Spark-compatible null semantics.

find_in_set scans each row byte-by-byte on the GPU, handles UTF-8 correctly via size_bytes(), and short-circuits for words containing commas. The implementation is correct for all tested cases.
find_in_set_repeated dictionary-encodes the column, scans only the unique keys, then gathers the results, which is a sound approach. However, the word_scalar device buffer used inside the kernel lambda may be freed before the asynchronous CUDA kernel finishes if the memory resource is not stream-ordered. In addition, the DONT_CHECK gather for null rows silently depends on an undocumented invariant: keys_size >= 1 whenever the gather is reached.
Test coverage is solid for first-match, UTF-8, null rows, and comma-word fast paths, but edge cases for empty columns and maxDistinctSets=0 are not exercised.

Confidence Score: 3/5

The new CUDA kernel is logically correct for all documented cases, but one assumption about stream-ordered memory deallocation is load-bearing and undocumented.

The scalar (word_scalar) that backs the device pointer captured by the thrust lambda is destroyed when find_in_set returns, while the kernel runs asynchronously. This is safe with cuDF's default stream-ordered memory resource but silently breaks with a synchronous allocator. Additionally, the DONT_CHECK gather in find_in_set_repeated relies on an implicit coupling between the keys_size == 0 early-return guard and the sentinel gather index of 0 used for null rows — removing or reordering those guards would produce silent out-of-bounds reads. Neither issue manifests in typical cuDF usage, but both should be hardened or documented before the draft is promoted.

src/main/cpp/src/find_in_set.cu — specifically the word_scalar lifetime around line 99–113 and the gather sentinel logic around line 167–183.

Important Files Changed

Filename	Overview
src/main/cpp/src/find_in_set.cu	Core CUDA implementation; the `word_scalar` device memory lifetime depends on stream-ordered deallocation, and the gather sentinel relies on an undocumented coupling with the `keys_size == 0` early-return guard.
src/main/cpp/src/find_in_set.hpp	Header declares `find_in_set` and `find_in_set_repeated` with correct default stream/mr parameters; no issues.
src/main/cpp/src/StringUtilsJni.cpp	JNI glue correctly handles null checks, auto-set-device, and uses the nullptr→0 sentinel convention for `findInSetRepeated`; no issues.
src/main/java/com/nvidia/spark/rapids/jni/StringUtils.java	Java API validates arguments correctly and maps the 0-sentinel to null; adds `NativeDepsLoader.loadNativeDeps()` static block that was previously missing.
src/test/java/com/nvidia/spark/rapids/jni/StringUtilsTest.java	Good coverage of first-match, missing words, empty tokens, null rows, UTF-8, and comma-word fast path; missing tests for empty column and maxDistinctSets=0.
src/main/cpp/CMakeLists.txt	Adds `find_in_set.cu` to the library source list in alphabetical order; no issues.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Java: findInSet / findInSetRepeated] --> B[JNI: StringUtilsJni.cpp]
    B --> C{word contains comma?}
    C -- Yes --> D[fill all rows with 0, preserve null mask]
    C -- No --> E[find_in_set path]
    E --> F[Create word_scalar on device]
    F --> G[thrust::transform per row]
    G --> H{row is null?}
    H -- Yes --> I[write 0 to data buffer, null mask marks as null]
    H -- No --> J[find_token_position byte-by-byte scan]
    J --> K[return 1-based position or 0]
    B2[findInSetRepeated] --> C2{word contains comma?}
    C2 -- Yes --> D2[make_zero_or_null_result]
    C2 -- No --> E2[dictionary::encode sets column]
    E2 --> F2{keys_size > maxDistinctSets?}
    F2 -- Yes --> G2[return nullptr → Java null]
    F2 -- No --> H2{keys_size == 0?}
    H2 -- Yes --> D2
    H2 -- No --> I2[find_in_set on dictionary keys]
    I2 --> J2[build gather_map: null→0, non-null→dict index]
    J2 --> K2[gather key_positions via gather_map]
    K2 --> L2[overwrite null mask from original sets]
    L2 --> M2[return result column]


Reviews (1): Last reviewed commit: "Add repeated find_in_set native path"

greptile-apps · 2026-06-24T02:01:28Z

+  auto word_scalar               = cudf::make_string_scalar(word, stream);
+  auto const& word_string_scalar = static_cast<cudf::string_scalar const&>(*word_scalar);
+  auto const d_word = cudf::string_view(word_string_scalar.data(), word_string_scalar.size());
+
+  auto const sets_column = cudf::column_device_view::create(sets.parent(), stream);
+  auto const d_sets      = *sets_column;
+
+  thrust::transform(rmm::exec_policy(stream),
+                    thrust::make_counting_iterator<cudf::size_type>(0),
+                    thrust::make_counting_iterator<cudf::size_type>(row_count),
+                    d_results,
+                    [d_sets, d_word] __device__(cudf::size_type idx) {
+                      if (d_sets.is_null(idx)) { return cudf::size_type{0}; }
+                      return find_token_position(d_sets.element<cudf::string_view>(idx), d_word);
+                    });


d_word device pointer outlives its owning scalar in async execution

word_scalar is a unique_ptr destroyed when find_in_set returns. The lambda captures d_word — a cudf::string_view that holds a raw pointer into the scalar's device buffer — and that buffer will be freed when word_scalar is destroyed. Since thrust::transform is asynchronous on the stream, the kernel may still be accessing d_word.data() after word_scalar's destructor runs.

This is safe only if the device memory resource uses stream-ordered deallocation (the cuDF convention with cuda_async_memory_resource). If a caller supplies a synchronous mr (or the RMM pool is configured otherwise), this is a use-after-free. Consider keeping word_scalar alive by either extending its scope past the function or storing its device buffer in the results and syncing explicitly before the scalar is freed.

greptile-apps · 2026-06-24T02:01:29Z

+  thrust::transform(rmm::exec_policy(stream),
+                    thrust::make_counting_iterator<cudf::size_type>(0),
+                    thrust::make_counting_iterator<cudf::size_type>(row_count),
+                    d_gather_map,
+                    [d_dictionary = *d_dictionary, d_indices = *d_indices] __device__(
+                      cudf::size_type idx) {
+                      return d_dictionary.is_null(idx) ? cudf::size_type{0}
+                                                       : d_indices.element<cudf::size_type>(idx);
+                    });


Null-row gather index silently relies on keys_size >= 1

For null rows, the lambda returns cudf::size_type{0}, which becomes the gather index into key_positions. This is safe because if (keys_size == 0) { return make_zero_or_null_result(); } on line 151 guarantees at least one key when we reach the gather. However, the safety of DONT_CHECK at the gather call depends entirely on this ordering — there is no bounds assertion or comment to document the invariant. If the early-return guard is ever moved or removed, null rows will silently gather out of bounds with DONT_CHECK. A brief comment coupling the two guards would make the invariant explicit.
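To make the invariant concrete, here is a host-only model of the repeated path that follows the flowchart in the summary. It is a sketch under stated assumptions: all names (`Row`, `Result`, `token_position_host`, `find_in_set_repeated_host`) are hypothetical, `std::optional` models nulls, and a `std::map` stands in for dictionary encoding, so the key order may differ from what cuDF produces. The `assert` marks the place where the `keys_size == 0` guard and the sentinel index 0 are coupled.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

using Row    = std::optional<std::string>;       // std::nullopt models a null row
using Result = std::vector<std::optional<int>>;  // std::nullopt models a null result

// Hypothetical host helper: 1-based position of the first matching token, else 0.
int token_position_host(std::string_view set, std::string_view word)
{
  int position      = 1;
  std::size_t begin = 0;
  for (std::size_t i = 0; i <= set.size(); ++i) {
    if (i == set.size() || set[i] == ',') {
      if (set.substr(begin, i - begin) == word) { return position; }
      ++position;
      begin = i + 1;
    }
  }
  return 0;
}

// Host model of the repeated path. Returns std::nullopt when the number of
// distinct sets exceeds max_distinct_sets (the "return nullptr" branch).
std::optional<Result> find_in_set_repeated_host(std::vector<Row> const& sets,
                                                std::string const& word,
                                                std::size_t max_distinct_sets)
{
  Result result(sets.size(), std::nullopt);

  // Comma fast path: every non-null row is 0, null rows stay null.
  if (word.find(',') != std::string::npos) {
    for (std::size_t i = 0; i < sets.size(); ++i) {
      if (sets[i]) { result[i] = 0; }
    }
    return result;
  }

  // Dictionary-encode the non-null rows: one key per distinct string.
  std::map<std::string, std::size_t> key_index;
  std::vector<std::string> keys;
  std::vector<std::size_t> indices(sets.size(), 0);
  for (std::size_t i = 0; i < sets.size(); ++i) {
    if (!sets[i]) { continue; }
    auto const [it, inserted] = key_index.try_emplace(*sets[i], keys.size());
    if (inserted) { keys.push_back(*sets[i]); }
    indices[i] = it->second;
  }

  if (keys.size() > max_distinct_sets) { return std::nullopt; }
  if (keys.empty()) { return result; }  // no keys: every row is null (or no rows)

  // Scan each distinct key once.
  std::vector<int> key_positions;
  for (auto const& key : keys) { key_positions.push_back(token_position_host(key, word)); }

  for (std::size_t i = 0; i < sets.size(); ++i) {
    // Null rows use sentinel index 0. This is only in bounds because the
    // keys.empty() early return above guarantees at least one key.
    std::size_t const gather_index = sets[i] ? indices[i] : 0;
    assert(gather_index < key_positions.size());
    int const gathered = key_positions[gather_index];
    if (sets[i]) { result[i] = gathered; }  // null rows keep their null
  }
  return result;
}
```

In this model, a fully null column with `max_distinct_sets` set to 0 yields an all-null result, while any column with at least one non-null row yields `std::nullopt`. Removing the `keys.empty()` guard would make the assertion fire for null rows, which is the out-of-bounds read that an unchecked gather would hide.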

greptile-apps · 2026-06-24T02:01:30Z

+  if (word.find(',') != std::string::npos) {
+    thrust::fill_n(rmm::exec_policy(stream), d_results, row_count, cudf::size_type{0});
+    results->set_null_count(sets.null_count());
+    return results;
+  }


Redundant set_null_count after make_numeric_column

cudf::make_numeric_column already accepts and stores the null count from its constructor argument (sets.null_count()). The subsequent results->set_null_count(sets.null_count()) call (repeated in both the comma fast path here and inside make_zero_or_null_result in find_in_set_repeated) is a no-op. The same redundancy appears on line 114 in the normal kernel path. These extra calls are harmless but create noise; removing them improves clarity.


greptile-apps · 2026-06-24T02:01:31Z

+  @Test
+  void testFindInSetRepeated() {
+    try (
+        ColumnVector sets = ColumnVector.fromStrings(
+            "a,b,c",
+            "x,b",
+            "a,b,c",
+            null,
+            "b",
+            "",
+            ",");
+        ColumnVector expectedB = ColumnVector.fromBoxedInts(
+            2, 2, 2, null, 1, 0, 0);
+        ColumnVector expectedEmpty = ColumnVector.fromBoxedInts(
+            0, 0, 0, null, 0, 1, 1);
+        ColumnVector expectedComma = ColumnVector.fromBoxedInts(
+            0, 0, 0, null, 0, 0, 0);
+        ColumnVector actualB = StringUtils.findInSetRepeated(sets, "b", 5);
+        ColumnVector actualEmpty = StringUtils.findInSetRepeated(sets, "", 5);
+        ColumnVector actualComma = StringUtils.findInSetRepeated(sets, "a,b", 5)) {
+      assertColumnsAreEqual(expectedB, actualB);
+      assertColumnsAreEqual(expectedEmpty, actualEmpty);
+      assertColumnsAreEqual(expectedComma, actualComma);
+      try (ColumnVector tooManyDistinct = StringUtils.findInSetRepeated(sets, "b", 4)) {
+        assertNull(tooManyDistinct);
+      }
+    }
+  }


Missing edge-case coverage for findInSetRepeated

Two code paths in find_in_set_repeated have no test coverage:

Empty column (row_count == 0) — both find_in_set and find_in_set_repeated take an early return to produce an empty INT32 column; there is no test asserting the output type or size.

maxDistinctSets = 0 — the Java-side validation allows 0 (only negative values are rejected), but there is no test confirming that a non-trivial column returns null and a fully-null column returns the expected all-null INT32 result.


Add native find_in_set utility

adfb72c

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

[pre-commit.ci] auto code formatting

6f6a8cf

viadea mentioned this pull request May 27, 2026

Add GPU support for find_in_set NVIDIA/cudf-spark#14889

Draft

8 tasks

viadea added 2 commits May 26, 2026 23:01

Import native dependency loader for string utils

8e82ca0

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

Add repeated find_in_set native path

42847fd

Signed-off-by: Hao Zhu <hazhu@nvidia.com>

greptile-apps Bot reviewed Jun 24, 2026

Add native find_in_set utility #4636
viadea wants to merge 4 commits into NVIDIA:main from viadea:codex/native-find-in-set
