from_json: null only schema-mismatched rows by wjxiz1992 · Pull Request #4728 · NVIDIA/cudf-spark-jni

wjxiz1992 · 2026-06-17T04:14:45Z

Summary

- Replace the reverted whole-column schema-mismatch nulling path with row-level nulling of only the affected depth-1 parent rows.
- Use the new cuDF row-level JSON schema mismatch diagnostics from rapidsai/cudf#22915 ("Add row-level JSON schema mismatch diagnostics").
- Preserve sibling top-level fields for the same input row, matching Spark from_json behavior.
- Add a JNI regression test and a focused from_json_to_structs nvbench target.

Contributes to #4645. This is the follow-up to the reverted #4536 / #4706 path.

Dependency

This PR is intentionally draft until rapidsai/cudf#22915 is merged. The current submodule pointer references c9cb6c288faff96684439d540e0e0e64b841b0ea, which is available on my fork branch but not yet on rapidsai/cudf main; CI should only be treated as authoritative after the cuDF PR lands and this submodule pointer is refreshed to the upstream SHA.

Validation

- PARALLEL_LEVEL=12 ./build/run-in-docker cmake --build target/jni/cmake-build --target spark_rapids_jni -j12
- LOCAL_MAVEN_REPO=$PWD/.mvn-repo DOCKER_RUN_EXTRA_ARGS="-e URM_URL -e ART_URL -e URM_CREDS_USR -e URM_CREDS_PSW -e ART_CREDS_USR -e ART_CREDS_PSW" PARALLEL_LEVEL=12 ./build/run-in-docker mvn test -s ci/settings.xml -Dmaven.repo.local=./.mvn-repo -Dtest=FromJsonToStructsTest -Dsubmodule.patch.skip=true -Dlibcudf.clean.skip=true -Dsurefire.useFile=false
  - FromJsonToStructsTest: Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
  - ColumnViewNonEmptyNullsTest: Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
  - CudaFatalTest: Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
- cuDF: ./build/run-in-docker cmake --build target/libcudf/cmake-build --target JSON_TEST -j12
- cuDF: ./build/run-in-docker env HOME=/tmp target/libcudf/cmake-build/gtests/JSON_TEST --gtest_filter=JsonReaderTest.SchemaMismatchDiag*
  - Passed: 6 tests

Performance

- Built FROM_JSON_TO_STRUCTS_BENCH: PARALLEL_LEVEL=12 ./build/run-in-docker cmake --build target/jni/cmake-build --target FROM_JSON_TO_STRUCTS_BENCH -j12
- Ran a 1M-row sanity benchmark on a local RTX 5880 Ada:
  - num_rows=1000000, mismatch_percent=0: GPU 197.252 ms, noise 2.17%
  - num_rows=1000000, mismatch_percent=1: GPU 182.406 ms, noise 1.43%

The benchmark is a local sanity check that the path stays on the GPU and that runtime stays in the same order of magnitude. It does not claim a strict mismatch-overhead comparison, because the mismatch payload shape differs from the all-valid payload.

Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 · 2026-06-23T09:10:40Z

Requesting early JNI-side review while this remains draft.

Context: depends on rapidsai/cudf#22915.

Could you please review the JNI integration and row-level nulling approach before the cuDF dependency lands? The main things I would like feedback on are:

- consuming read_json_with_row_diagnostics from the updated cuDF submodule
- nullifying only schema-mismatched parent rows while preserving sibling top-level fields
- whether the list-child sanitization path is the right place to handle non-empty null children after row-level nulling

Thanks @ttnghia @jihoonson @thirtiseven.

greptile-apps · 2026-06-24T02:01:35Z

Greptile Summary

This PR replaces the reverted whole-column schema-mismatch nulling with a row-level approach: it calls the new read_json_with_row_diagnostics cuDF API, then applies nullify_rows only to the specific top-level columns and row indices that had type mismatches, preserving sibling fields (e.g., id) for the same input row — matching Spark from_json semantics.

nullify_rows (new helper) asynchronously clears specific bitmask bits on the GPU, recomputes the null count, and updates the column's null mask in-place before convert_data_type runs.
make_structs_column_with_null_consistency (new helper) now calls cudf::make_structs_column (which superimposes parent nulls onto children) when null_count > 0, replacing the previous path that deliberately skipped superimpose_nulls; this is required so that mismatch-nulled STRUCT rows properly propagate to nested children including LIST columns.
make_lists_column_with_null_sanitization (new helper) calls purge_nonempty_nulls for LIST columns when the null count is non-zero, guarding against non-empty null rows left by schema-mismatch recovery.
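
To make the nullify_rows step concrete, here is a minimal host-side sketch of clearing validity bits for a set of row indices and then recounting nulls. It is only a model of the idea: the PR does this on the GPU with thrust::for_each and cudf::null_count, and the names `make_all_valid_mask` and `nullify_rows_host` are hypothetical, not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of a validity bitmask: bit i == 1 means row i is valid.
using bitmask_word = std::uint32_t;
constexpr std::size_t bits_per_word = 32;

// Build an all-valid mask for num_rows rows (padding bits in the last word stay 0).
std::vector<bitmask_word> make_all_valid_mask(std::size_t num_rows)
{
  std::vector<bitmask_word> mask((num_rows + bits_per_word - 1) / bits_per_word, 0u);
  for (std::size_t row = 0; row < num_rows; ++row) {
    mask[row / bits_per_word] |= (bitmask_word{1} << (row % bits_per_word));
  }
  return mask;
}

// Clear the validity bit of every listed row, then recount nulls over [0, num_rows).
// Duplicate indices are harmless because clearing a bit is idempotent.
std::size_t nullify_rows_host(std::vector<bitmask_word>& mask,
                              std::size_t num_rows,
                              std::vector<std::size_t> const& row_indices)
{
  for (auto const row : row_indices) {
    assert(row < num_rows);
    mask[row / bits_per_word] &= ~(bitmask_word{1} << (row % bits_per_word));
  }
  std::size_t null_count = 0;
  for (std::size_t row = 0; row < num_rows; ++row) {
    bool const valid = ((mask[row / bits_per_word] >> (row % bits_per_word)) & 1u) != 0u;
    if (!valid) { ++null_count; }
  }
  return null_count;
}
```

Only the listed rows of the one affected column lose their validity bit; masks of sibling columns are never touched, which is what keeps fields such as id intact for the same input row.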

Confidence Score: 3/5

The row-level nulling logic is conceptually correct but rests on an unvalidated invariant: that cudf::make_structs_column internally calls purge_nonempty_nulls on LIST children after propagating parent nulls. If it does not, the output column has non-empty nulls in nested LIST children — invalid cuDF state that downstream operations could mishandle.

The core logic (nullify_rows + convert_data_type + helper column builders) is well-reasoned and the test validates the user-visible behavior. However, the test does not check whether nested LIST column offsets are properly compacted for mismatch-nulled rows, so a non-empty null bug in the nested LIST path would be invisible to the test suite. Additionally, the submodule points to an unreleased fork commit, meaning CI is not authoritative until rapidsai/cudf#22915 lands.

src/main/cpp/src/from_json_to_structs.cu — specifically the interaction between make_lists_column_with_null_sanitization and make_structs_column_with_null_consistency for nested LIST children of mismatch-nulled STRUCT columns.

Important Files Changed

Filename	Overview
src/main/cpp/src/from_json_to_structs.cu	Core change: replaces whole-column schema-mismatch nulling with row-level nulling via new cuDF diagnostics API. Introduces nullify_rows (GPU bit-clearing) and two new helper functions for LIST and STRUCT column construction. The LIST null-sanitization interaction with parent STRUCT null propagation via cudf::make_structs_column warrants verification. Also contains an O(N×M) mismatch lookup that should be a map lookup.
src/test/java/com/nvidia/spark/rapids/jni/FromJsonToStructsTest.java	New JNI regression test covering the core mismatch scenario (nested LIST element type mismatch). Validates that only the mismatched depth-1 column is nulled while siblings are preserved. Does not validate internal LIST offset state for null rows or test top-level LIST mismatches.
src/main/cpp/benchmarks/from_json_to_structs.cu	New nvbench target for from_json_to_structs with 0% and 1% mismatch rates. Straightforward setup; add_buffer_size passes row count rather than byte count, making the auto-computed GB/s throughput metric cosmetically inaccurate.
thirdparty/cudf	Submodule pointer advanced to c9cb6c2 (author's fork branch, not yet on rapidsai/cudf main). PR is intentionally draft until the upstream cuDF PR #22915 lands; CI results should not be treated as authoritative yet.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant from_json_to_structs
    participant cuDF_read_json as cuDF read_json_with_row_diagnostics
    participant nullify_rows
    participant convert_data_type
    participant make_structs_col as make_structs_column_with_null_consistency
    participant make_lists_col as make_lists_column_with_null_sanitization

    Caller->>from_json_to_structs: input strings + schema
    from_json_to_structs->>cuDF_read_json: concat_json input + opts
    cuDF_read_json-->>from_json_to_structs: parsed_table + diagnostics

    loop for each top-level schema column i
        from_json_to_structs->>from_json_to_structs: find col_name in diagnostics O(M)
        alt column has mismatch rows
            from_json_to_structs->>nullify_rows: parsed_columns[i] + row_indices
            nullify_rows->>nullify_rows: copy_bitmask / create_null_mask
            nullify_rows->>nullify_rows: thrust::for_each clear_bit async
            nullify_rows->>nullify_rows: cudf::null_count stream sync
            nullify_rows->>nullify_rows: set_null_mask on column
        end
        from_json_to_structs->>convert_data_type: parsed_columns[i] after nullify
        convert_data_type->>make_lists_col: LIST child + null_count
        make_lists_col->>make_lists_col: "purge_nonempty_nulls if null_count > 0"
        convert_data_type->>make_structs_col: STRUCT children + null_count
        make_structs_col->>make_structs_col: "cudf::make_structs_column if null_count > 0"
    end

    from_json_to_structs->>make_structs_col: top-level STRUCT
    make_structs_col-->>Caller: output STRUCT column


Reviews (1): Last reviewed commit: "Fix from_json schema mismatch row nullin..."

greptile-apps · 2026-06-24T02:01:39Z

+                                               num_rows,
+                                               rmm::device_buffer{},
+                                               std::move(null_mask),
+                                               null_count,
+                                               std::move(children));
+  // Row-level schema mismatch nulls can leave child data under null parents; sanitize it here.
+  if (null_count > 0) { output = cudf::purge_nonempty_nulls(output->view(), stream, mr); }
+  return output;


Non-empty null risk in nested LIST children after parent STRUCT propagation

make_lists_column_with_null_sanitization conditionally calls purge_nonempty_nulls based on the LIST column's null count at construction time (i.e., the JSON reader's null count for that child). When nullify_rows marks a parent STRUCT's rows as null and make_structs_column_with_null_consistency later propagates those parent nulls into LIST children via cudf::make_structs_column, those LIST children acquire new null rows that were NOT present when purge_nonempty_nulls was (or was not) called.

If cudf::make_structs_column → superimpose_and_sanitize_nulls internally calls purge_nonempty_nulls on LIST children after ORing in the parent mask, the column is valid. If it only ORs the null bits without compacting child offsets, the LIST child ends up with non-empty null rows (null bit set, non-zero offset span), which is invalid cuDF state for columns that may be accessed independently downstream.

The existing test only checks null-mask equality at the top level — it does not verify that data.c2 has zero-size offsets for the mismatched row, so a non-empty null bug here would not be caught by assertColumnsAreEqual. Please add a validator (or check offset equality on host) to confirm that the nested LIST's offsets are compacted for null rows.
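
The invariant in question can be illustrated with a small host-side model. This is a sketch under assumptions, not the cuDF implementation: `host_list_column`, `superimpose_parent_nulls`, and `has_nonempty_nulls_host` are hypothetical names, and the propagation step here deliberately only combines validity without compacting offsets, which is the risky case described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of a LIST column: offsets has num_rows + 1 entries,
// and valid[i] == false means row i is null.
struct host_list_column {
  std::vector<std::int32_t> offsets;
  std::vector<bool> valid;
};

// Model of parent-null propagation that ONLY combines validity:
// a child row stays valid only if its parent row is valid too.
void superimpose_parent_nulls(host_list_column& child, std::vector<bool> const& parent_valid)
{
  assert(child.valid.size() == parent_valid.size());
  for (std::size_t row = 0; row < child.valid.size(); ++row) {
    child.valid[row] = child.valid[row] && parent_valid[row];
  }
}

// Returns true if any null row still spans child elements,
// i.e. offsets[row + 1] != offsets[row] for a row whose validity is false.
bool has_nonempty_nulls_host(host_list_column const& col)
{
  assert(col.offsets.size() == col.valid.size() + 1);
  for (std::size_t row = 0; row < col.valid.size(); ++row) {
    if (!col.valid[row] && col.offsets[row + 1] != col.offsets[row]) { return true; }
  }
  return false;
}
```

A test along these lines would copy the nested LIST's offsets and validity to the host and assert that the check returns false for the mismatched row.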

greptile-apps · 2026-06-24T02:01:40Z

+    auto const mismatch_rows =
+      std::find_if(parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.begin(),
+                   parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end(),
+                   [&col_name](auto const& row_info) { return row_info.column_name == col_name; });
+    if (mismatch_rows !=
+        parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end()) {
+      nullify_rows(*parsed_columns[i], mismatch_rows->row_indices, stream, mr);
+    }


O(N×M) mismatch lookup inside the column loop

std::find_if over top_level_columns_with_schema_mismatch_rows is called once per schema column, making the overall complexity O(schema_columns × mismatched_columns). In the worst case — where every column reports mismatches — this is O(N²). A single pre-pass that builds a std::unordered_map<std::string, std::vector<cudf::size_type> const*> from parsed_result.diagnostics before the loop would make each lookup O(1), which is important for schemas with many top-level fields.
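
A minimal sketch of the suggested pre-pass, assuming a simplified stand-in for the diagnostics entries. `column_mismatch_rows`, `build_mismatch_index`, and `find_mismatch_rows` are hypothetical names, and `int` stands in for cudf::size_type so the snippet compiles without cuDF.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one diagnostics entry; field names mirror the diff above.
struct column_mismatch_rows {
  std::string column_name;
  std::vector<int> row_indices;
};

using mismatch_index = std::unordered_map<std::string, std::vector<int> const*>;

// One pass over the diagnostics. The map stores pointers into `entries`,
// so `entries` must outlive the returned index.
mismatch_index build_mismatch_index(std::vector<column_mismatch_rows> const& entries)
{
  mismatch_index index;
  index.reserve(entries.size());
  for (auto const& entry : entries) {
    // emplace keeps the first entry for a duplicate name, like find_if would.
    index.emplace(entry.column_name, &entry.row_indices);
  }
  return index;
}

// Average O(1) lookup; nullptr means the column reported no mismatched rows.
std::vector<int> const* find_mismatch_rows(mismatch_index const& index,
                                           std::string const& col_name)
{
  auto const it = index.find(col_name);
  return it == index.end() ? nullptr : it->second;
}
```

Inside the column loop, the find_if call would then become a single `find_mismatch_rows(index, col_name)` followed by a null check.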


greptile-apps · 2026-06-24T02:01:41Z

+  auto const input_view = input.view();
+  for (auto const row : row_indices) {
+    CUDF_EXPECTS(row >= 0 && row < input_view.size(), "Schema mismatch row index out of bounds.");
+  }


Sequential CPU bounds check before GPU operations

This loop iterates every mismatch row index on the CPU before any GPU work starts. For pathological inputs (e.g., a table where 50% of rows mismatch), this serializes a potentially large host-side validation pass. The GPU kernel below (thrust::for_each) is inherently bounded by the device memory it touches, so out-of-range indices would produce undefined GPU behavior regardless — this check should at minimum only run in debug builds, or be replaced with an assertion on the max element rather than a per-element loop.
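
One way to read the "assert on the max element" suggestion is sketched below. It is still a linear host pass, but it evaluates the bounds condition once instead of once per element. `expect_rows_in_bounds` is a hypothetical helper, `int` stands in for cudf::size_type, and a std::out_of_range exception stands in for CUDF_EXPECTS.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <vector>

// Throws if any index falls outside [0, num_rows); an empty list is trivially valid.
void expect_rows_in_bounds(std::vector<int> const& row_indices, int num_rows)
{
  if (row_indices.empty()) { return; }
  // Single scan that yields iterators to the smallest and largest index.
  auto const range = std::minmax_element(row_indices.begin(), row_indices.end());
  if (*range.first < 0 || *range.second >= num_rows) {
    throw std::out_of_range("Schema mismatch row index out of bounds.");
  }
}
```

If the indices were already resident on the device, the same min/max reduction could run there instead; whether that is worthwhile depends on where the diagnostics live, which this sketch does not assume.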


Commit 267ce8b accidentally swept a thirdparty/cudf gitlink bump (18e8ccd8d7 -> c9cb6c288f) into the from_json row-mask fix. The row-mask approach is JNI-side and needs no cudf change, so reset the gitlink to the merge-base. This drops thirdparty/cudf from the PR diff; since only main's side then differs from the merge-base, the submodule no longer conflicts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>

ttnghia · 2026-06-27T16:40:00Z

+                                               null_count,
+                                               std::move(children));
+  // Row-level schema mismatch nulls can leave child data under null parents; sanitize it here.
+  if (null_count > 0) { output = cudf::purge_nonempty_nulls(output->view(), stream, mr); }


Use cudf::has_nonempty_nulls() to reduce overhead.
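
The point of the suggestion is to pay for the purge (which rebuilds the column) only when some null row actually holds data, rather than whenever null_count > 0. A host-side sketch of that check-then-purge pattern follows; `host_list_of_ints` and the `*_host` functions are hypothetical models and do not call cuDF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of a LIST<INT32> column.
struct host_list_of_ints {
  std::vector<std::int32_t> offsets;  // num_rows + 1 entries
  std::vector<bool> valid;            // per-row validity
  std::vector<int> values;            // flattened child elements
};

// Cheap read-only scan: does any null row still span child elements?
bool has_nonempty_nulls_host(host_list_of_ints const& col)
{
  for (std::size_t row = 0; row < col.valid.size(); ++row) {
    if (!col.valid[row] && col.offsets[row + 1] != col.offsets[row]) { return true; }
  }
  return false;
}

// Expensive path: rebuild offsets and values so every null row has an empty span.
host_list_of_ints purge_nonempty_nulls_host(host_list_of_ints const& col)
{
  host_list_of_ints out;
  out.valid = col.valid;
  out.offsets.push_back(0);
  for (std::size_t row = 0; row < col.valid.size(); ++row) {
    if (col.valid[row]) {
      out.values.insert(out.values.end(),
                        col.values.begin() + col.offsets[row],
                        col.values.begin() + col.offsets[row + 1]);
    }
    out.offsets.push_back(static_cast<std::int32_t>(out.values.size()));
  }
  return out;
}

// Guarded sanitization: skip the rebuild when the column is already clean.
host_list_of_ints sanitize_host(host_list_of_ints col)
{
  if (!has_nonempty_nulls_host(col)) { return col; }
  return purge_nonempty_nulls_host(col);
}
```

A column with nulls whose spans are already empty passes through unchanged, which is the common case the guard is meant to make cheap.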

ttnghia · 2026-06-27T16:42:06Z

+    auto const mismatch_rows =
+      std::find_if(parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.begin(),
+                   parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end(),
+                   [&col_name](auto const& row_info) { return row_info.column_name == col_name; });
+    if (mismatch_rows !=
+        parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end()) {
+      nullify_rows(*parsed_columns[i], mismatch_rows->row_indices, stream, mr);


Why is this executed on host?

wjxiz1992 mentioned this pull request Jun 17, 2026

[FEA] Per-row schema-mismatch diagnostic for read_json_with_diagnostics #4645

Open

wjxiz1992 force-pushed the fix/4645-from-json-row-mask branch 3 times, most recently from 5b64481 to 1f9b3c9, on June 23, 2026 08:01

Fix from_json schema mismatch row nulling

267ce8b

Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 force-pushed the fix/4645-from-json-row-mask branch from 1f9b3c9 to 267ce8b on June 23, 2026 08:07

wjxiz1992 requested review from jihoonson, thirtiseven and ttnghia June 23, 2026 09:10

greptile-apps Bot reviewed Jun 24, 2026

ttnghia reviewed Jun 27, 2026

wjxiz1992 wants to merge 2 commits into NVIDIA:main from wjxiz1992:fix/4645-from-json-row-mask
