Conversation

@cornzyblack (Contributor) commented Oct 15, 2025

Changes

Linked issues

Resolves #595

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

@mwojtyczka mwojtyczka requested a review from Copilot October 16, 2025 09:14
Copilot AI left a comment

Pull Request Overview

Adds row-level JSON validation checks and integrates them into examples and tests.

  • Introduces is_valid_json and has_json_keys row checks.
  • Updates YAML examples, reference docs, and integration/unit tests to cover the new checks.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| ---- | ----------- |
| src/databricks/labs/dqx/check_funcs.py | Adds JSON validation/check functions; core logic for new checks. |
| tests/unit/test_build_rules.py | Extends metadata conversion tests to include new JSON checks. |
| tests/integration/test_apply_checks.py | Adds col_json_str to test schemas and values; exercises new checks in streaming and class-based tests. |
| tests/resources/all_row_checks.yaml | Includes the is_valid_json check in the “all row checks” YAML. |
| src/databricks/labs/dqx/llm/resources/yaml_checks_examples.yml | Adds examples for is_valid_json and has_json_keys. |
| docs/dqx/docs/reference/quality_checks.mdx | Documents the new checks and shows usage examples. |
Comments suppressed due to low confidence (1)

docs/dqx/docs/reference/quality_checks.mdx:1

  • Both examples use the same name 'col_json_str_has_json_keys', which is confusing and may collide in practice. Use distinct, descriptive names (e.g., 'col_json_str_has_no_json_key1' and 'col_json_str_has_no_json_key1_key2').
---


Comment on lines 1761 to 1768
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
return make_condition(
    ~F.when(F.col(col_expr_str).isNotNull(), F.try_parse_json(col_expr_str).isNotNull()),
    F.concat_ws(
        "", F.lit("Value '"), col_expr.cast("string"), F.lit(f"' in Column '{col_expr_str}' is not a valid JSON")
    ),
    f"{col_str_norm}_is_not_valid_json",
)
Copilot AI Oct 16, 2025

is_valid_json dereferences the column by name (F.col(col_expr_str)) rather than using the resolved expression (col_expr). This breaks when the caller supplies a column expression (e.g., F.trim('c')), which may not exist as a named column. Use col_expr consistently in both the null and try_parse_json checks.
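
A minimal sketch of that fix, assuming the module's make_condition and _get_normalized_column_and_expr helpers and Spark's try_parse_json as used in the snippet above:

import pyspark.sql.functions as F

def is_valid_json(column):
    # Sketch only, not the definitive implementation: resolve the column once
    # and reuse `col_expr`, so an expression input like F.trim("c") does not
    # fail name lookup via F.col("trim(c)").
    col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
    return make_condition(
        ~F.when(col_expr.isNotNull(), F.try_parse_json(col_expr).isNotNull()),
        F.concat_ws(
            "",
            F.lit("Value '"),
            col_expr.cast("string"),
            F.lit(f"' in Column '{col_expr_str}' is not a valid JSON"),
        ),
        f"{col_str_norm}_is_not_valid_json",
    )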


unique_keys_lit = F.lit(list(set(keys)))
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
json_keys_array = F.json_object_keys(col_expr)
Copilot AI Oct 16, 2025

pyspark.sql.functions does not expose json_object_keys; this will raise AttributeError at runtime. Use F.expr to call the SQL function or derive keys via from_json + map_keys. For example: json_keys_array = F.expr(f"json_object_keys({col_expr_str})").

Suggested change
json_keys_array = F.json_object_keys(col_expr)
json_keys_array = F.expr(f"json_object_keys({col_expr_str})")
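
Alternatively, per the from_json + map_keys route mentioned above, a sketch (assuming flat, top-level JSON objects, since the value is parsed as a string map) could be:

# `col_expr` is the resolved column expression from the surrounding function.
# Invalid JSON parses to null here, so the caller must still guard for nulls.
json_keys_array = F.map_keys(F.from_json(col_expr.cast("string"), "map<string,string>"))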

Comment on lines 1793 to 1797

if require_all:
    condition = F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0
else:
    condition = F.when(is_valid_json(col_str_norm).isNull(), F.arrays_overlap(json_keys_array, unique_keys_lit))
Copilot AI Oct 16, 2025

has_json_keys returns a null condition (and thus passes) for invalid JSON or null values, which means it won't flag missing keys unless a separate is_valid_json rule is also configured. To make has_json_keys self-contained, include a JSON-validity guard in the condition (so invalid JSON fails this check as well). For example: compute json_valid = F.try_parse_json(col_expr).isNotNull() and combine it with the key presence logic.

Suggested change
if require_all:
    condition = F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0
else:
    condition = F.when(is_valid_json(col_str_norm).isNull(), F.arrays_overlap(json_keys_array, unique_keys_lit))
json_valid = F.try_parse_json(col_expr).isNotNull()
if require_all:
    condition = json_valid & (F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0)
else:
    condition = json_valid & F.arrays_overlap(json_keys_array, unique_keys_lit)

if not isinstance(key, (str)):
    raise InvalidParameterError("All keys must be of type string.")

unique_keys_lit = F.lit(list(set(keys)))
Copilot AI Oct 16, 2025

Using set(keys) loses ordering and can lead to non-deterministic error messages (and test flakiness). Prefer a stable order: unique_keys_lit = F.lit(sorted(set(keys))).

Suggested change
unique_keys_lit = F.lit(list(set(keys)))
unique_keys_lit = F.lit(sorted(set(keys)))
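
Besides deduplicating, sorted() gives the literal array a stable order, so the generated error messages and aliases are deterministic across runs and test assertions stay stable.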

F.concat_ws(", ", F.lit(keys)),
F.lit("]"),
),
f"{col_str_norm}_has_no_json_keys",
Copilot AI Oct 16, 2025

The alias for has_json_keys does not include the target keys. If this rule is applied multiple times to the same column with different key sets, the resulting alias collides (e.g., duplicate 'col_json_str_has_no_json_keys'). Include the keys (and optionally require_all) in the alias to ensure uniqueness, e.g., alias_name = f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}".

Suggested change
f"{col_str_norm}_has_no_json_keys",
f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}{'_all' if require_all else '_any'}",

Comment on lines 41 to 42
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
Copilot AI Oct 16, 2025

Documentation gaps: (1) is_valid_json validates any JSON (object/array/etc.), not only 'JSON objects'—please reword. (2) has_json_keys supports a require_all flag (default True) but it's not documented—add it to the parameter list.

Suggested change
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
| `is_valid_json` | Checks whether the values in the input column are valid JSON (objects, arrays, strings, numbers, etc.), not just JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for; `require_all`: whether all keys must be present (default: True) |
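
For illustration, a class-based rule exercising the documented require_all flag might look like the following (mirroring the DQRowRule pattern used elsewhere in this PR; the column name is a placeholder):

from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.rule import DQRowRule

# Fail only if neither key1 nor key2 is present in the JSON value.
rule = DQRowRule(
    criticality="error",
    check_func=check_funcs.has_json_keys,
    column="col_json_str",
    check_func_kwargs={"keys": ["key1", "key2"], "require_all": False},
)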

      keys:
        - key1
- criticality: error
  name: col_json_str_has_no_json_keys2
Copilot AI Oct 16, 2025

The example name 'col_json_str_has_no_json_keys2' is ambiguous. Consider aligning with the keys being checked (e.g., 'col_json_str_has_no_json_key1_key2') for clarity and consistency with other examples.

Suggested change
name: col_json_str_has_no_json_keys2
name: col_json_str_has_no_json_key1_key2

@mwojtyczka mwojtyczka requested a review from Copilot November 11, 2025 22:17
Copilot AI left a comment

Pull Request Overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.



.withColumnSpec("col10")
.withColumnSpec("col_ipv4", template=r"\n.\n.\n.\n")
.withColumnSpec("col_ipv6", template="XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX")
.withColumnSpec("col_json_str", template=r"{'\key1': '\w', '\key2': 'd\w'}")
Copilot AI Nov 11, 2025

The escape sequences \k before 'key1' and 'key2' in the JSON template string are invalid; the backslashes are unnecessary inside a raw string. The template should be r"{'key1': '\w', 'key2': 'd\w'}" to generate valid JSON-like strings.

Suggested change
.withColumnSpec("col_json_str", template=r"{'\key1': '\w', '\key2': 'd\w'}")
.withColumnSpec("col_json_str", template=r"{'key1': '\w', 'key2': 'd\w'}")

"",
F.lit("Value '"),
F.when(col_expr.isNull(), F.lit("null")).otherwise(col_expr.cast("string")),
F.lit(f"' in column '{col_expr_str}' does not conform to expected JSON schema: "),
Copilot AI Nov 11, 2025

Corrected capitalization of 'column' to 'Column' for consistency with other error messages in the codebase.

Suggested change
F.lit(f"' in column '{col_expr_str}' does not conform to expected JSON schema: "),
F.lit(f"' in Column '{col_expr_str}' does not conform to expected JSON schema: "),

"col1: string, col2: int, col3: int, col4 array<int>, col5: date, col6: timestamp, "
"col7: map<string, int>, col8: struct<field1: int>, col10: int, col11: string, "
"col_ipv4: string, col_ipv6: string"
"col_ipv4: string, col_ipv6: string, col_json_str: string, col_json_str2"
Copilot AI Nov 11, 2025

Missing type declaration for col_json_str2 in schema string. Should be col_json_str2: string to match the pattern used for other columns.

Suggested change
"col_ipv4: string, col_ipv6: string, col_json_str: string, col_json_str2"
"col_ipv4: string, col_ipv6: string, col_json_str: string, col_json_str2: string"

@cornzyblack (Contributor, Author)

Updates
The current implementation of has_valid_json_schema is permissive rather than strict: for example, row 4 below has an additional field c but still passes.

Making it strict would require UDFs, especially for nested JSON.

from databricks.sdk import WorkspaceClient
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule

ws = WorkspaceClient()
dq_engine = DQEngine(ws)

json_schema = "STRUCT<a: BIGINT, b: BIGINT>"

checks = [
    DQRowRule(
        criticality="error",
        check_func=check_funcs.has_valid_json_schema,
        column="json",
        check_func_kwargs={"schema": json_schema},
    ),
]

# `spark` and `display` are provided by the Databricks notebook environment
df = spark.createDataFrame(
    [
        {"json": """{ "a" : 1 }"""},
        {"json": """{ "a" : 1, "b": 2}"""},
        {"json": """{ "a" : 1, "b": null}"""},
        {"json": """{ "a" : 1, "b": 1023455,  "c": null}"""},
    ]
)

checked_df = dq_engine.apply_checks(df, checks)
display(checked_df)

will result in

[screenshot of the resulting checked_df output]
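
A strict variant along those lines could be sketched with a Python UDF; all names and the allowed-key set below are illustrative, not part of DQX:

import json

from pyspark.sql import functions as F
from pyspark.sql import types as T

ALLOWED_KEYS = {"a", "b"}  # top-level fields of STRUCT<a: BIGINT, b: BIGINT>

@F.udf(returnType=T.BooleanType())
def strict_json_matches_schema(value):
    # Reject nulls, unparsable JSON, non-objects, and extra top-level fields
    # such as "c"; nested schemas would need a recursive walk of the fields.
    if value is None:
        return False
    try:
        parsed = json.loads(value)
    except ValueError:
        return False
    return isinstance(parsed, dict) and set(parsed) <= ALLOWED_KEYS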

@cornzyblack cornzyblack marked this pull request as ready for review November 13, 2025 12:03
@cornzyblack cornzyblack requested a review from a team as a code owner November 13, 2025 12:03
@cornzyblack cornzyblack requested review from pratikk-databricks and removed request for a team November 13, 2025 12:03
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.



.withColumnSpec("col10")
.withColumnSpec("col_ipv4", template=r"\n.\n.\n.\n")
.withColumnSpec("col_ipv6", template="XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX")
.withColumnSpec("col_json_str", template=r"{'\key1': '\w', '\key2': 'd\w'}")
Copilot AI Nov 24, 2025

The JSON template string uses single quotes instead of double quotes and has unnecessary backslashes before the keys, which will not produce valid JSON. The template should be: r'{"key1": "\w", "key2": "d\w"}'

Suggested change
.withColumnSpec("col_json_str", template=r"{'\key1': '\w', '\key2': 'd\w'}")
.withColumnSpec("col_json_str", template=r'{"key1": "\w", "key2": "d\w"}')

"",
F.lit("Value '"),
F.when(col_expr.isNull(), F.lit("null")).otherwise(col_expr.cast("string")),
F.lit(f"' in column '{col_expr_str}' does not conform to expected JSON schema: "),
Copilot AI Nov 24, 2025

Corrected capitalization of 'column' to 'Column' for consistency with other error messages in the file.

Suggested change
F.lit(f"' in column '{col_expr_str}' does not conform to expected JSON schema: "),
F.lit(f"' in Column '{col_expr_str}' does not conform to expected JSON schema: "),

Development

Successfully merging this pull request may close these issues.

[FEATURE]: JSON Validation Checks
