Feat add json validation checks #616
Conversation
Pull Request Overview
Adds row-level JSON validation checks and integrates them into examples and tests.
- Introduces is_valid_json and has_json_keys row checks.
- Updates YAML examples, reference docs, and integration/unit tests to cover the new checks.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Adds JSON validation/check functions; core logic for new checks. |
| tests/unit/test_build_rules.py | Extends metadata conversion tests to include new JSON checks. |
| tests/integration/test_apply_checks.py | Adds col_json_str to test schemas and values; exercises new checks in streaming and class-based tests. |
| tests/resources/all_row_checks.yaml | Includes is_valid_json check in the “all row checks” YAML. |
| src/databricks/labs/dqx/llm/resources/yaml_checks_examples.yml | Adds examples for is_valid_json and has_json_keys. |
| docs/dqx/docs/reference/quality_checks.mdx | Documents new checks and shows usage examples. |
Comments suppressed due to low confidence (1)
docs/dqx/docs/reference/quality_checks.mdx:1
- Both examples use the same name 'col_json_str_has_json_keys', which is confusing and may collide in practice. Use distinct, descriptive names (e.g., 'col_json_str_has_no_json_key1' and 'col_json_str_has_no_json_key1_key2').
---
```python
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
return make_condition(
    ~F.when(F.col(col_expr_str).isNotNull(), F.try_parse_json(col_expr_str).isNotNull()),
    F.concat_ws(
        "", F.lit("Value '"), col_expr.cast("string"), F.lit(f"' in Column '{col_expr_str}' is not a valid JSON")
    ),
    f"{col_str_norm}_is_not_valid_json",
)
```
Copilot AI · Oct 16, 2025
is_valid_json dereferences the column by name (F.col(col_expr_str)) rather than using the resolved expression (col_expr). This breaks when the caller supplies a column expression (e.g., F.trim('c')), which may not exist as a named column. Use col_expr consistently in both the null and try_parse_json checks.
```python
unique_keys_lit = F.lit(list(set(keys)))
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
json_keys_array = F.json_object_keys(col_expr)
```
Copilot AI · Oct 16, 2025
pyspark.sql.functions does not expose json_object_keys; this will raise AttributeError at runtime. Use F.expr to call the SQL function or derive keys via from_json + map_keys. For example: json_keys_array = F.expr(f"json_object_keys({col_expr_str})").
```diff
- json_keys_array = F.json_object_keys(col_expr)
+ json_keys_array = F.expr(f"json_object_keys({col_expr_str})")
```
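For reference, the top-level-keys semantics described here can be sketched in plain Python (a hypothetical helper, not DQX code, assuming the SQL function yields NULL for invalid JSON and non-object values):

```python
import json

def top_level_json_keys(s):
    """Return the top-level keys of a JSON object string, or None when the
    value is not valid JSON or not an object (mirroring SQL json_object_keys)."""
    try:
        parsed = json.loads(s)
    except (TypeError, ValueError):
        return None
    if not isinstance(parsed, dict):
        return None
    return list(parsed.keys())
```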
```python
if require_all:
    condition = F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0
else:
    condition = F.when(is_valid_json(col_str_norm).isNull(), F.arrays_overlap(json_keys_array, unique_keys_lit))
```
Copilot AI · Oct 16, 2025
has_json_keys returns a null condition (and thus passes) for invalid JSON or null values, which means it won't flag missing keys unless a separate is_valid_json rule is also configured. To make has_json_keys self-contained, include a JSON-validity guard in the condition (so invalid JSON fails this check as well). For example: compute json_valid = F.try_parse_json(col_expr).isNotNull() and combine it with the key presence logic.
```diff
- if require_all:
-     condition = F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0
- else:
-     condition = F.when(is_valid_json(col_str_norm).isNull(), F.arrays_overlap(json_keys_array, unique_keys_lit))
+ json_valid = F.try_parse_json(col_expr).isNotNull()
+ if require_all:
+     condition = json_valid & (F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0)
+ else:
+     condition = json_valid & F.arrays_overlap(json_keys_array, unique_keys_lit)
```
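The self-contained semantics this comment asks for can be illustrated outside Spark with a small hypothetical sketch (standard-library json only; not the DQX implementation):

```python
import json

def has_json_keys_sketch(value, keys, require_all=True):
    """Row-level check sketch: passes only when value parses as a JSON object
    AND the requested keys are present (all of them, or at least one)."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return False  # invalid JSON fails the check outright
    if not isinstance(parsed, dict):
        return False
    present = set(parsed) & set(keys)
    return present == set(keys) if require_all else bool(present)
```

With this guard, a row holding `"oops"` fails the check directly instead of silently passing because the condition evaluated to null.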
```python
if not isinstance(key, (str)):
    raise InvalidParameterError("All keys must be of type string.")
# ...
unique_keys_lit = F.lit(list(set(keys)))
```
Copilot AI · Oct 16, 2025
Using set(keys) loses ordering and can lead to non-deterministic error messages (and test flakiness). Prefer a stable order: unique_keys_lit = F.lit(sorted(set(keys))).
```diff
- unique_keys_lit = F.lit(list(set(keys)))
+ unique_keys_lit = F.lit(sorted(set(keys)))
```
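The determinism point is easy to demonstrate in plain Python: a set's iteration order for strings can vary between interpreter runs (string hashing is seeded), while sorted() always yields the same order:

```python
keys = ["key2", "key1", "key2"]

# stable across runs: deduplicated and alphabetically ordered
stable = sorted(set(keys))
assert stable == ["key1", "key2"]

# list(set(keys)) may come out in either order depending on hash seeding,
# so error messages built from it can differ from run to run
```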
```python
        F.concat_ws(", ", F.lit(keys)),
        F.lit("]"),
    ),
    f"{col_str_norm}_has_no_json_keys",
```
Copilot AI · Oct 16, 2025
The alias for has_json_keys does not include the target keys. If this rule is applied multiple times to the same column with different key sets, the resulting alias collides (e.g., duplicate 'col_json_str_has_no_json_keys'). Include the keys (and optionally require_all) in the alias to ensure uniqueness, e.g., alias_name = f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}".
```diff
- f"{col_str_norm}_has_no_json_keys",
+ f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}{'_all' if require_all else '_any'}",
```
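A hypothetical helper showing how such an alias could be built (build_alias is illustrative, not part of the codebase):

```python
def build_alias(col_str_norm, keys, require_all):
    """Build a per-rule alias that encodes the key set and the match mode,
    so two rules on the same column never collide."""
    suffix = "_all" if require_all else "_any"
    return f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}{suffix}"

a = build_alias("col_json_str", ["key2", "key1"], True)
b = build_alias("col_json_str", ["key1"], True)
assert a != b  # distinct key sets now yield distinct aliases
```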
```
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
```
Copilot AI · Oct 16, 2025
Documentation gaps: (1) is_valid_json validates any JSON (object/array/etc.), not only 'JSON objects'—please reword. (2) has_json_keys supports a require_all flag (default True) but it's not documented—add it to the parameter list.
```diff
- | `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
- | `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
+ | `is_valid_json` | Checks whether the values in the input column are valid JSON (objects, arrays, strings, numbers, etc.), not just JSON objects. | `column`: column to check (can be a string column name or a column expression) |
+ | `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for; `require_all`: whether all keys must be present (default: True) |
```
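The "any JSON value" distinction can be sketched with the standard library (is_valid_json_py is a stand-in for the check's semantics; Spark's try_parse_json may differ in edge cases):

```python
import json

def is_valid_json_py(s):
    """Any parseable JSON value counts as valid, not only objects."""
    try:
        json.loads(s)
        return True
    except (TypeError, ValueError):
        return False

# objects, arrays, strings, and bare numbers are all valid JSON
assert all(is_valid_json_py(s) for s in ['{"a": 1}', '[1, 2]', '"text"', '42'])
assert not is_valid_json_py("{'a': 1}")  # single quotes are not valid JSON
```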
```yaml
    keys:
      - key1
  - criticality: error
    name: col_json_str_has_no_json_keys2
```
Copilot AI · Oct 16, 2025
The example name 'col_json_str_has_no_json_keys2' is ambiguous. Consider aligning with the keys being checked (e.g., 'col_json_str_has_no_json_key1_key2') for clarity and consistency with other examples.
```diff
- name: col_json_str_has_no_json_keys2
+ name: col_json_str_has_no_json_key1_key2
```
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
```python
.withColumnSpec("col10")
.withColumnSpec("col_ipv4", template=r"\n.\n.\n.\n")
.withColumnSpec("col_ipv6", template="XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX")
.withColumnSpec("col_json_str", template=r"{'\key1': '\w', '\key2': 'd\w'}")
```
Copilot AI · Nov 11, 2025
The escape sequences \k, \w, and \w in the JSON template string are invalid. The backslashes before single quotes should be removed since they're inside a raw string, and the escapes before 'k' and 'd' are unnecessary. The template should be r"{'key1': '\w', 'key2': 'd\w'}" to generate valid JSON-like strings.
```diff
- .withColumnSpec("col_json_str", template=r"{'\key1': '\w', '\key2': 'd\w'}")
+ .withColumnSpec("col_json_str", template=r"{'key1': '\w', 'key2': 'd\w'}")
```
```python
"",
F.lit("Value '"),
F.when(col_expr.isNull(), F.lit("null")).otherwise(col_expr.cast("string")),
F.lit(f"' in column '{col_expr_str}' does not conform to expected JSON schema: "),
```
Copilot AI · Nov 11, 2025
Corrected capitalization of 'column' to 'Column' for consistency with other error messages in the codebase.
```diff
- F.lit(f"' in column '{col_expr_str}' does not conform to expected JSON schema: "),
+ F.lit(f"' in Column '{col_expr_str}' does not conform to expected JSON schema: "),
```
```diff
  "col1: string, col2: int, col3: int, col4 array<int>, col5: date, col6: timestamp, "
  "col7: map<string, int>, col8: struct<field1: int>, col10: int, col11: string, "
- "col_ipv4: string, col_ipv6: string"
+ "col_ipv4: string, col_ipv6: string, col_json_str: string, col_json_str2"
```
Copilot AI · Nov 11, 2025
Missing type declaration for col_json_str2 in schema string. Should be col_json_str2: string to match the pattern used for other columns.
```diff
- "col_ipv4: string, col_ipv6: string, col_json_str: string, col_json_str2"
+ "col_ipv4: string, col_ipv6: string, col_json_str: string, col_json_str2: string"
```
Pull Request Overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
```python
.withColumnSpec("col10")
.withColumnSpec("col_ipv4", template=r"\n.\n.\n.\n")
.withColumnSpec("col_ipv6", template="XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX")
.withColumnSpec("col_json_str", template=r"{'\key1': '\w', '\key2': 'd\w'}")
```
Copilot AI · Nov 24, 2025
The JSON template string uses single quotes instead of double quotes and has unnecessary backslashes before the keys, which will not produce valid JSON. The template should be: r'{"key1": "\w", "key2": "d\w"}'
```diff
- .withColumnSpec("col_json_str", template=r"{'\key1': '\w', '\key2': 'd\w'}")
+ .withColumnSpec("col_json_str", template=r'{"key1": "\w", "key2": "d\w"}')
```
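The quoting point can be confirmed with the standard json module (that dbldatagen expands `\w` to word characters is an assumption about its template syntax; the literals below just use a sample expansion):

```python
import json

# single-quoted pseudo-JSON, the shape the original template produces
single_quote_ok = True
try:
    json.loads("{'key1': 'abc', 'key2': 'dabc'}")
except json.JSONDecodeError:
    single_quote_ok = False

# the double-quoted template shape parses as a real JSON object
double_quote_value = json.loads('{"key1": "abc", "key2": "dabc"}')
```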
```python
"",
F.lit("Value '"),
F.when(col_expr.isNull(), F.lit("null")).otherwise(col_expr.cast("string")),
F.lit(f"' in column '{col_expr_str}' does not conform to expected JSON schema: "),
```
Copilot AI · Nov 24, 2025
Corrected capitalization of 'column' to 'Column' for consistency with other error messages in the file.
```diff
- F.lit(f"' in column '{col_expr_str}' does not conform to expected JSON schema: "),
+ F.lit(f"' in Column '{col_expr_str}' does not conform to expected JSON schema: "),
```

Changes
Linked issues
Resolves #595
Tests