from_json: null only schema-mismatched rows #4728

Draft
wjxiz1992 wants to merge 2 commits into NVIDIA:main from wjxiz1992:fix/4645-from-json-row-mask

wjxiz1992 commented Jun 17, 2026

Collaborator

Summary

  • replace the reverted whole-column schema-mismatch nulling path with row-level nulling for only the affected depth-1 parent rows
  • use the new cuDF row-level JSON schema mismatch diagnostics from rapidsai/cudf#22915 ("Add row-level JSON schema mismatch diagnostics")
  • preserve sibling top-level fields for the same input row, matching Spark from_json behavior
  • add a JNI regression test and a focused from_json_to_structs nvbench target

Contributes to #4645. This is the follow-up to the reverted #4536 / #4706 path.

Dependency

This PR is intentionally draft until rapidsai/cudf#22915 is merged. The current submodule pointer references c9cb6c288faff96684439d540e0e0e64b841b0ea, which is available on my fork branch but not yet on rapidsai/cudf main; CI should only be treated as authoritative after the cuDF PR lands and this submodule pointer is refreshed to the upstream SHA.

Validation

  • PARALLEL_LEVEL=12 ./build/run-in-docker cmake --build target/jni/cmake-build --target spark_rapids_jni -j12
  • LOCAL_MAVEN_REPO=$PWD/.mvn-repo DOCKER_RUN_EXTRA_ARGS="-e URM_URL -e ART_URL -e URM_CREDS_USR -e URM_CREDS_PSW -e ART_CREDS_USR -e ART_CREDS_PSW" PARALLEL_LEVEL=12 ./build/run-in-docker mvn test -s ci/settings.xml -Dmaven.repo.local=./.mvn-repo -Dtest=FromJsonToStructsTest -Dsubmodule.patch.skip=true -Dlibcudf.clean.skip=true -Dsurefire.useFile=false
    • FromJsonToStructsTest: Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
    • ColumnViewNonEmptyNullsTest: Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
    • CudaFatalTest: Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
  • cuDF: ./build/run-in-docker cmake --build target/libcudf/cmake-build --target JSON_TEST -j12
  • cuDF: ./build/run-in-docker env HOME=/tmp target/libcudf/cmake-build/gtests/JSON_TEST --gtest_filter=JsonReaderTest.SchemaMismatchDiag*
    • Passed: 6 tests

Performance

  • Built FROM_JSON_TO_STRUCTS_BENCH: PARALLEL_LEVEL=12 ./build/run-in-docker cmake --build target/jni/cmake-build --target FROM_JSON_TO_STRUCTS_BENCH -j12
  • Ran 1M-row sanity benchmark on local RTX 5880 Ada:
    • num_rows=1000000, mismatch_percent=0: GPU 197.252 ms, noise 2.17%
    • num_rows=1000000, mismatch_percent=1: GPU 182.406 ms, noise 1.43%

The benchmark is a local sanity check that the path remains GPU-side and that runtimes stay in the same order of magnitude. It does not claim a strict mismatch-overhead comparison, because the mismatch payload shape differs from the all-valid payload.

Signed-off-by: Allen Xu <allxu@nvidia.com>
wjxiz1992 force-pushed the fix/4645-from-json-row-mask branch from 1f9b3c9 to 267ce8b on June 23, 2026 08:07

wjxiz1992 commented Jun 23, 2026

Collaborator Author

Requesting early JNI-side review while this remains draft.

Context: depends on rapidsai/cudf#22915.

Could you please review the JNI integration and row-level nulling approach before the cuDF dependency lands? The main things I would like feedback on are:

  • consuming read_json_with_row_diagnostics from the updated cuDF submodule
  • nullifying only schema-mismatched parent rows while preserving sibling top-level fields
  • whether the list-child sanitization path is the right place to handle non-empty null children after row-level nulling

Thanks @ttnghia @jihoonson @thirtiseven.

greptile-apps Bot commented Jun 24, 2026

Contributor

Greptile Summary

This PR replaces the reverted whole-column schema-mismatch nulling with a row-level approach: it calls the new read_json_with_row_diagnostics cuDF API, then applies nullify_rows only to the specific top-level columns and row indices that had type mismatches, preserving sibling fields (e.g., id) for the same input row — matching Spark from_json semantics.

  • nullify_rows (new helper) asynchronously clears specific bitmask bits on the GPU, recomputes the null count, and updates the column's null mask in-place before convert_data_type runs.
  • make_structs_column_with_null_consistency (new helper) now calls cudf::make_structs_column (which superimposes parent nulls onto children) when null_count > 0, replacing the previous path that deliberately skipped superimpose_nulls; this is required so that mismatch-nulled STRUCT rows properly propagate to nested children including LIST columns.
  • make_lists_column_with_null_sanitization (new helper) calls purge_nonempty_nulls for LIST columns when the null count is non-zero, guarding against non-empty null rows left by schema-mismatch recovery.
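To make the row-level nulling step concrete, here is a minimal host-only sketch of what clearing validity bits and recounting nulls amounts to. It is not the PR's GPU code: `mask_word`, `make_all_valid_mask`, and `nullify_rows_host` are hypothetical names, and the only assumption carried over from cuDF is the bitmask convention (one bit per row, 1 = valid, 0 = null, packed into 32-bit words).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a column validity mask:
// one bit per row, 1 = valid, 0 = null, packed into 32-bit words.
using mask_word = std::uint32_t;
constexpr int bits_per_word = 32;

// Create an all-valid mask large enough for num_rows rows.
std::vector<mask_word> make_all_valid_mask(int num_rows)
{
  return std::vector<mask_word>((num_rows + bits_per_word - 1) / bits_per_word, ~mask_word{0});
}

// Clear the validity bit of each listed row and return the resulting null count.
// Duplicate row indices are harmless: clearing a bit twice is idempotent.
int nullify_rows_host(std::vector<mask_word>& mask, int num_rows, std::vector<int> const& rows)
{
  for (int row : rows) {
    assert(row >= 0 && row < num_rows);
    mask[row / bits_per_word] &= ~(mask_word{1} << (row % bits_per_word));
  }
  // Recount nulls over the rows that actually exist (padding bits are ignored).
  int null_count = 0;
  for (int row = 0; row < num_rows; ++row) {
    bool const valid = (mask[row / bits_per_word] >> (row % bits_per_word)) & 1u;
    if (!valid) { ++null_count; }
  }
  return null_count;
}
```

Only the listed rows of the one affected column lose their validity bit, so sibling columns for the same input row keep their own masks untouched.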

Confidence Score: 3/5

The row-level nulling logic is conceptually correct but rests on an unvalidated invariant: that cudf::make_structs_column internally calls purge_nonempty_nulls on LIST children after propagating parent nulls. If it does not, the output column has non-empty nulls in nested LIST children — invalid cuDF state that downstream operations could mishandle.

The core logic (nullify_rows + convert_data_type + helper column builders) is well-reasoned and the test validates the user-visible behavior. However, the test does not check whether nested LIST column offsets are properly compacted for mismatch-nulled rows, so a non-empty null bug in the nested LIST path would be invisible to the test suite. Additionally, the submodule points to an unreleased fork commit, meaning CI is not authoritative until rapidsai/cudf#22915 lands.

src/main/cpp/src/from_json_to_structs.cu — specifically the interaction between make_lists_column_with_null_sanitization and make_structs_column_with_null_consistency for nested LIST children of mismatch-nulled STRUCT columns.

Important Files Changed

| Filename | Overview |
|---|---|
| src/main/cpp/src/from_json_to_structs.cu | Core change: replaces whole-column schema-mismatch nulling with row-level nulling via new cuDF diagnostics API. Introduces nullify_rows (GPU bit-clearing) and two new helper functions for LIST and STRUCT column construction. The LIST null-sanitization interaction with parent STRUCT null propagation via cudf::make_structs_column warrants verification. Also contains an O(N×M) mismatch lookup that should be a map lookup. |
| src/test/java/com/nvidia/spark/rapids/jni/FromJsonToStructsTest.java | New JNI regression test covering the core mismatch scenario (nested LIST element type mismatch). Validates that only the mismatched depth-1 column is nulled while siblings are preserved. Does not validate internal LIST offset state for null rows or test top-level LIST mismatches. |
| src/main/cpp/benchmarks/from_json_to_structs.cu | New nvbench target for from_json_to_structs with 0% and 1% mismatch rates. Straightforward setup; add_buffer_size passes row count rather than byte count, making the auto-computed GB/s throughput metric cosmetically inaccurate. |
| thirdparty/cudf | Submodule pointer advanced to c9cb6c2 (author's fork branch, not yet on rapidsai/cudf main). PR is intentionally draft until the upstream cuDF PR #22915 lands; CI results should not be treated as authoritative yet. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant from_json_to_structs
    participant cuDF_read_json as cuDF read_json_with_row_diagnostics
    participant nullify_rows
    participant convert_data_type
    participant make_structs_col as make_structs_column_with_null_consistency
    participant make_lists_col as make_lists_column_with_null_sanitization

    Caller->>from_json_to_structs: input strings + schema
    from_json_to_structs->>cuDF_read_json: concat_json input + opts
    cuDF_read_json-->>from_json_to_structs: parsed_table + diagnostics

    loop for each top-level schema column i
        from_json_to_structs->>from_json_to_structs: find col_name in diagnostics O(M)
        alt column has mismatch rows
            from_json_to_structs->>nullify_rows: parsed_columns[i] + row_indices
            nullify_rows->>nullify_rows: copy_bitmask / create_null_mask
            nullify_rows->>nullify_rows: thrust::for_each clear_bit async
            nullify_rows->>nullify_rows: cudf::null_count stream sync
            nullify_rows->>nullify_rows: set_null_mask on column
        end
        from_json_to_structs->>convert_data_type: parsed_columns[i] after nullify
        convert_data_type->>make_lists_col: LIST child + null_count
        make_lists_col->>make_lists_col: "purge_nonempty_nulls if null_count > 0"
        convert_data_type->>make_structs_col: STRUCT children + null_count
        make_structs_col->>make_structs_col: "cudf::make_structs_column if null_count > 0"
    end

    from_json_to_structs->>make_structs_col: top-level STRUCT
    make_structs_col-->>Caller: output STRUCT column

Reviews (1): Last reviewed commit: "Fix from_json schema mismatch row nullin..."

Comment on lines +193 to +200
    num_rows,
    rmm::device_buffer{},
    std::move(null_mask),
    null_count,
    std::move(children));
  // Row-level schema mismatch nulls can leave child data under null parents; sanitize it here.
  if (null_count > 0) { output = cudf::purge_nonempty_nulls(output->view(), stream, mr); }
  return output;

Contributor

P1 Non-empty null risk in nested LIST children after parent STRUCT propagation

make_lists_column_with_null_sanitization conditionally calls purge_nonempty_nulls based on the LIST column's null count at construction time (i.e., the JSON reader's null count for that child). When nullify_rows marks a parent STRUCT's rows as null and make_structs_column_with_null_consistency later propagates those parent nulls into LIST children via cudf::make_structs_column, those LIST children acquire new null rows that were NOT present when purge_nonempty_nulls was (or was not) called.

If cudf::make_structs_column → superimpose_and_sanitize_nulls internally calls purge_nonempty_nulls on LIST children after ORing in the parent mask, the column is valid. If it only ORs the null bits without compacting child offsets, the LIST child ends up with non-empty null rows (null bit set, non-zero offset span), which is invalid cuDF state for columns that may be accessed independently downstream.

The existing test only checks null-mask equality at the top level — it does not verify that data.c2 has zero-size offsets for the mismatched row, so a non-empty null bug here would not be caught by assertColumnsAreEqual. Please add a validator (or check offset equality on host) to confirm that the nested LIST's offsets are compacted for null rows.
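One way to picture the requested host-side check is the sketch below. It assumes the nested LIST's offsets and per-row validity have already been copied to the host; `has_nonempty_null_rows` and its vector-based inputs are hypothetical and only illustrate the invariant (a null row must span zero child elements), not the test's actual helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if any null row still spans child elements, i.e.
// offsets[i + 1] != offsets[i] for a row whose validity flag is false.
// `offsets` must have one more entry than `valid` (num_rows + 1 offsets).
bool has_nonempty_null_rows(std::vector<int> const& offsets, std::vector<bool> const& valid)
{
  assert(offsets.size() == valid.size() + 1);
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (!valid[i] && offsets[i + 1] != offsets[i]) { return true; }
  }
  return false;
}
```

A regression test could assert that this returns false for data.c2 after the mismatch rows have been nulled, which would catch the case where only the null bits were ORed in without compacting offsets.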

Comment on lines +934 to +941
auto const mismatch_rows =
  std::find_if(parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.begin(),
               parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end(),
               [&col_name](auto const& row_info) { return row_info.column_name == col_name; });
if (mismatch_rows !=
    parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end()) {
  nullify_rows(*parsed_columns[i], mismatch_rows->row_indices, stream, mr);
}

Contributor

P2 O(N×M) mismatch lookup inside the column loop

std::find_if over top_level_columns_with_schema_mismatch_rows is called once per schema column, making the overall complexity O(schema_columns × mismatched_columns). In the worst case — where every column reports mismatches — this is O(N²). A single pre-pass that builds a std::unordered_map<std::string, std::vector<cudf::size_type> const*> from parsed_result.diagnostics before the loop would make each lookup O(1), which is important for schemas with many top-level fields.
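A minimal sketch of the suggested pre-pass follows. The `column_mismatch_rows` struct is a hypothetical stand-in for one diagnostics entry (the real cuDF type may differ), `int` stands in for cudf::size_type, and the sketch assumes each column name appears at most once in the diagnostics.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one diagnostics entry; the real cuDF type may differ.
struct column_mismatch_rows {
  std::string column_name;
  std::vector<int> row_indices;
};

using mismatch_index = std::unordered_map<std::string, std::vector<int> const*>;

// Single pre-pass: index the diagnostics entries by column name.
// The map stores pointers, so `diagnostics` must outlive the returned index.
mismatch_index build_mismatch_index(std::vector<column_mismatch_rows> const& diagnostics)
{
  mismatch_index index;
  index.reserve(diagnostics.size());
  for (auto const& entry : diagnostics) {
    bool const inserted = index.emplace(entry.column_name, &entry.row_indices).second;
    assert(inserted && "duplicate column name in diagnostics");
    (void)inserted;  // silence unused-variable warnings when asserts are compiled out
  }
  return index;
}

// Average O(1) lookup; returns nullptr when the column reported no mismatches.
std::vector<int> const* find_mismatch_rows(mismatch_index const& index,
                                           std::string const& col_name)
{
  auto const it = index.find(col_name);
  return it == index.end() ? nullptr : it->second;
}
```

Inside the column loop, the std::find_if call would then become a single find_mismatch_rows lookup followed by a null check. For schemas with only a handful of top-level fields the linear scan may well be just as fast, so this mainly matters for wide schemas.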

Comment on lines +158 to +161
auto const input_view = input.view();
for (auto const row : row_indices) {
  CUDF_EXPECTS(row >= 0 && row < input_view.size(), "Schema mismatch row index out of bounds.");
}

Contributor

P2 Sequential CPU bounds check before GPU operations

This loop iterates every mismatch row index on the CPU before any GPU work starts. For pathological inputs (e.g., a table where 50% of rows mismatch), this serializes a potentially large host-side validation pass. The GPU kernel below (thrust::for_each) is inherently bounded by the device memory it touches, so out-of-range indices would produce undefined GPU behavior regardless — this check should at minimum only run in debug builds, or be replaced with an assertion on the max element rather than a per-element loop.
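As a rough sketch of the "check the extremes" alternative: only the smallest and largest index need comparing against the bounds. The helper names below are hypothetical and `int` stands in for cudf::size_type. Note that this is still one linear host pass over the indices; it replaces the per-element check and message formatting with a single comparison, and the assert wrapper compiles out when NDEBUG is defined.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns true when every index lies in [0, num_rows).
// Only the minimum and maximum index are compared against the bounds.
bool row_indices_in_bounds(std::vector<int> const& row_indices, int num_rows)
{
  if (row_indices.empty()) { return true; }
  auto const range = std::minmax_element(row_indices.begin(), row_indices.end());
  return *range.first >= 0 && *range.second < num_rows;
}

// Debug-only variant: the whole check disappears in release builds.
void expect_row_indices_in_bounds(std::vector<int> const& row_indices, int num_rows)
{
  assert(row_indices_in_bounds(row_indices, num_rows) &&
         "Schema mismatch row index out of bounds.");
  (void)row_indices;
  (void)num_rows;
}
```

Whether a debug-only check is acceptable depends on how much the diagnostics coming back from the reader are trusted; an always-on single check keeps the safety net at a lower cost than the loop.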

Commit 267ce8b accidentally swept a thirdparty/cudf gitlink bump
(18e8ccd8d7 -> c9cb6c288f) into the from_json row-mask fix. The row-mask
approach is JNI-side and needs no cudf change, so reset the gitlink to the
merge-base. This drops thirdparty/cudf from the PR diff; since only main's
side then differs from the merge-base, the submodule no longer conflicts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
    null_count,
    std::move(children));
  // Row-level schema mismatch nulls can leave child data under null parents; sanitize it here.
  if (null_count > 0) { output = cudf::purge_nonempty_nulls(output->view(), stream, mr); }

Collaborator

Use cudf::has_nonempty_nulls() to reduce overhead.

Comment on lines +934 to +940
auto const mismatch_rows =
  std::find_if(parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.begin(),
               parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end(),
               [&col_name](auto const& row_info) { return row_info.column_name == col_name; });
if (mismatch_rows !=
    parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end()) {
  nullify_rows(*parsed_columns[i], mismatch_rows->row_indices, stream, mr);

Collaborator

Why is this executed on host?
