from_json: null only schema-mismatched rows#4728
Conversation
5b64481 to
1f9b3c9
Compare
Signed-off-by: Allen Xu <allxu@nvidia.com>
1f9b3c9 to
267ce8b
Compare
|
Requesting early JNI-side review while this remains draft. Context: depends on rapidsai/cudf#22915. Could you please review the JNI integration and row-level nulling approach before the cuDF dependency lands? The main things I would like feedback on are:
Thanks @ttnghia @jihoonson @thirtiseven. |
| num_rows, | ||
| rmm::device_buffer{}, | ||
| std::move(null_mask), | ||
| null_count, | ||
| std::move(children)); | ||
| // Row-level schema mismatch nulls can leave child data under null parents; sanitize it here. | ||
| if (null_count > 0) { output = cudf::purge_nonempty_nulls(output->view(), stream, mr); } | ||
| return output; |
There was a problem hiding this comment.
Non-empty null risk in nested LIST children after parent STRUCT propagation
make_lists_column_with_null_sanitization conditionally calls purge_nonempty_nulls based on the LIST column's null count at construction time (i.e., the JSON reader's null count for that child). When nullify_rows marks a parent STRUCT's rows as null and make_structs_column_with_null_consistency later propagates those parent nulls into LIST children via cudf::make_structs_column, those LIST children acquire new null rows that were NOT present when purge_nonempty_nulls was (or was not) called.
If cudf::make_structs_column → superimpose_and_sanitize_nulls internally calls purge_nonempty_nulls on LIST children after ORing in the parent mask, the column is valid. If it only ORs the null bits without compacting child offsets, the LIST child ends up with non-empty null rows (null bit set, non-zero offset span), which is invalid cuDF state for columns that may be accessed independently downstream.
The existing test only checks null-mask equality at the top level — it does not verify that data.c2 has zero-size offsets for the mismatched row, so a non-empty null bug here would not be caught by assertColumnsAreEqual. Please add a validator (or check offset equality on host) to confirm that the nested LIST's offsets are compacted for null rows.
| auto const mismatch_rows = | ||
| std::find_if(parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.begin(), | ||
| parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end(), | ||
| [&col_name](auto const& row_info) { return row_info.column_name == col_name; }); | ||
| if (mismatch_rows != | ||
| parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end()) { | ||
| nullify_rows(*parsed_columns[i], mismatch_rows->row_indices, stream, mr); | ||
| } |
There was a problem hiding this comment.
O(N×M) mismatch lookup inside the column loop
std::find_if over top_level_columns_with_schema_mismatch_rows is called once per schema column, making the overall complexity O(schema_columns × mismatched_columns). In the worst case — where every column reports mismatches — this is O(N²). A single pre-pass that builds a std::unordered_map<std::string, std::vector<cudf::size_type> const*> from parsed_result.diagnostics before the loop would make each lookup O(1), which is important for schemas with many top-level fields.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| auto const input_view = input.view(); | ||
| for (auto const row : row_indices) { | ||
| CUDF_EXPECTS(row >= 0 && row < input_view.size(), "Schema mismatch row index out of bounds."); | ||
| } |
There was a problem hiding this comment.
Sequential CPU bounds check before GPU operations
This loop iterates every mismatch row index on the CPU before any GPU work starts. For pathological inputs (e.g., a table where 50% of rows mismatch), this serializes a potentially large host-side validation pass. The GPU kernel below (thrust::for_each) is inherently bounded by the device memory it touches, so out-of-range indices would produce undefined GPU behavior regardless — this check should at minimum only run in debug builds, or be replaced with an assertion on the max element rather than a per-element loop.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Commit 267ce8b accidentally swept a thirdparty/cudf gitlink bump (18e8ccd8d7 -> c9cb6c288f) into the from_json row-mask fix. The row-mask approach is JNI-side and needs no cudf change, so reset the gitlink to the merge-base. This drops thirdparty/cudf from the PR diff; since only main's side then differs from the merge-base, the submodule no longer conflicts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Allen Xu <allxu@nvidia.com>
| null_count, | ||
| std::move(children)); | ||
| // Row-level schema mismatch nulls can leave child data under null parents; sanitize it here. | ||
| if (null_count > 0) { output = cudf::purge_nonempty_nulls(output->view(), stream, mr); } |
There was a problem hiding this comment.
Use cudf::has_nonempty_nulls() to reduce overhead.
| auto const mismatch_rows = | ||
| std::find_if(parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.begin(), | ||
| parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end(), | ||
| [&col_name](auto const& row_info) { return row_info.column_name == col_name; }); | ||
| if (mismatch_rows != | ||
| parsed_result.diagnostics.top_level_columns_with_schema_mismatch_rows.end()) { | ||
| nullify_rows(*parsed_columns[i], mismatch_rows->row_indices, stream, mr); |
There was a problem hiding this comment.
Why is this executed on host?
Summary
from_jsonbehaviorfrom_json_to_structsnvbench targetContributes to #4645. This is the follow-up to the reverted #4536 / #4706 path.
Dependency
This PR is intentionally draft until rapidsai/cudf#22915 is merged. The current submodule pointer references
c9cb6c288faff96684439d540e0e0e64b841b0ea, which is available on my fork branch but not yet onrapidsai/cudf main; CI should only be treated as authoritative after the cuDF PR lands and this submodule pointer is refreshed to the upstream SHA.Validation
PARALLEL_LEVEL=12 ./build/run-in-docker cmake --build target/jni/cmake-build --target spark_rapids_jni -j12LOCAL_MAVEN_REPO=$PWD/.mvn-repo DOCKER_RUN_EXTRA_ARGS="-e URM_URL -e ART_URL -e URM_CREDS_USR -e URM_CREDS_PSW -e ART_CREDS_USR -e ART_CREDS_PSW" PARALLEL_LEVEL=12 ./build/run-in-docker mvn test -s ci/settings.xml -Dmaven.repo.local=./.mvn-repo -Dtest=FromJsonToStructsTest -Dsubmodule.patch.skip=true -Dlibcudf.clean.skip=true -Dsurefire.useFile=falseFromJsonToStructsTest: Tests run: 1, Failures: 0, Errors: 0, Skipped: 0ColumnViewNonEmptyNullsTest: Tests run: 4, Failures: 0, Errors: 0, Skipped: 0CudaFatalTest: Tests run: 1, Failures: 0, Errors: 0, Skipped: 0./build/run-in-docker cmake --build target/libcudf/cmake-build --target JSON_TEST -j12./build/run-in-docker env HOME=/tmp target/libcudf/cmake-build/gtests/JSON_TEST --gtest_filter=JsonReaderTest.SchemaMismatchDiag*Performance
FROM_JSON_TO_STRUCTS_BENCH:PARALLEL_LEVEL=12 ./build/run-in-docker cmake --build target/jni/cmake-build --target FROM_JSON_TO_STRUCTS_BENCH -j12num_rows=1000000, mismatch_percent=0: GPU 197.252 ms, noise 2.17%num_rows=1000000, mismatch_percent=1: GPU 182.406 ms, noise 1.43%The benchmark is a local sanity check that the path remains GPU-side and same-order; it is not claiming a strict mismatch-overhead comparison because the mismatch payload shape differs from the all-valid payload.