Feat add json validation checks #616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Status: Open. cornzyblack wants to merge 61 commits into databrickslabs:main from cornzyblack:feat-add-json-validation-checks.
Changes from all commits:
- 6cd9f02 (cornzyblack): move criticiality of rule inro _validate_attributes
- 98371b7 (cornzyblack): since criticality is validated after creation, filter by criticality …
- 1a58e16, 98803bc, acf3767, 0766f2b, 3d0fd34, e5712fc, 9bf6d98, 1e4d783, fcdb1ce, 2393404, eddc874, c378b6d, cb6f9ef (cornzyblack, 13 commits): Merge branch 'main' of github.com:cornzyblack/dqx
- 82c7a22 (cornzyblack): feat: add check for valid json
- f1ec4af (cornzyblack): feat: add checks for is_valid_json
- dfa9649 (cornzyblack): feat: add is_valid_json
- 89f2811 (cornzyblack): feat: add has_json_keys
- 02466c1 (cornzyblack): refactor: change logic
- ccb6e05 (cornzyblack): refactor: invert
- 156a9c2 (cornzyblack): refactor: negate
- 8d30ff6 (cornzyblack): refactor: update
- 1873d72 (cornzyblack): refactor: update
- 5109c27 (cornzyblack): refactor: update
- ceecf7d (cornzyblack): refactor: updates
- 0c94089 (cornzyblack): refactor: change and update
- 246833b (mwojtyczka): Merge branch 'main' into feat-add-json-validation-checks
- 05365e0 (cornzyblack): refactor: fix docs
- 7be64e6 (cornzyblack): refactor: updates
- c7d8406 (cornzyblack): refactor: update logic
- 66cbb13 (cornzyblack): refactor: explcit True
- 70e19bd (cornzyblack): refactor: remove repetition
- a168d64 (cornzyblack): refactor: remove as it depends on spark
- c3c23e7 (cornzyblack): feat: add perf test for 2 tests (remaining 1)
- 984bbb8 (mwojtyczka): Merge branch 'main' into feat-add-json-validation-checks
- 3e63312 (cornzyblack): refactor: switch back to has_json_schema
- 9ed893a (cornzyblack): Merge branch 'feat-add-json-validation-checks' of github.com:cornzybl…
- a72bdb1 (cornzyblack): docs: document properly that function only checks outside keys
- b8505e4 (cornzyblack): refactor: comment out to test
- a177c01 (cornzyblack): refactor: try using transform for strict comparison
- e0c3438 (cornzyblack): feat: implement changes
- 853c8c0 (cornzyblack): Merge branch 'main' into feat-add-json-validation-checks
- 3b0fd52 (cornzyblack): format and add tests
- 44881fe (cornzyblack): Merge branch 'feat-add-json-validation-checks' of github.com:cornzybl…
- 7b19d00 (cornzyblack): refactor: add to markdown
- 96cbc8e (cornzyblack): updates
- 0ff6ccb (mwojtyczka): Merge branch 'main' into feat-add-json-validation-checks
- ebc9527 (cornzyblack): Merge branch 'main' of github.com:cornzyblack/dqx into feat-add-json-…
- 0be72ad (mwojtyczka): Merge branch 'main' into feat-add-json-validation-checks
- fc7bd7a (cornzyblack): refactor: add missing type in schema
- f2dbbcb (cornzyblack): Merge branch 'main' of github.com:cornzyblack/dqx into feat-add-json-…
- a10ab22 (cornzyblack): refactor: update tests
- b753071 (cornzyblack): feat: add has_valid_json_schema to perf
- 53a8f51 (cornzyblack): refactor: modify schema and dataframe
- 560ffd0 (cornzyblack): refactor: add note that this is not strict validation
- ca62317 (cornzyblack): docs: update docs
- 37db749 (mwojtyczka): Merge branch 'main' into feat-add-json-validation-checks
- 37d7803 (cornzyblack): chore: change to uppercase
- 465b7c1 (ghanse): Merge branch 'main' into feat-add-json-validation-checks
- 7ebbe2f (ghanse): Merge branch 'main' into feat-add-json-validation-checks
@@ -41,6 +41,9 @@ You can also define your own custom checks in Python (see [Creating custom check

| `is_not_greater_than` | Checks whether the values in the input column are not greater than the provided limit. | `column`: column to check (can be a string column name or a column expression); `limit`: limit as number, date, timestamp, column name or sql expression |
| `is_valid_date` | Checks whether the values in the input column have valid date formats. | `column`: column to check (can be a string column name or a column expression); `date_format`: optional date format (e.g. 'yyyy-mm-dd') |
| `is_valid_timestamp` | Checks whether the values in the input column have valid timestamp formats. | `column`: column to check (can be a string column name or a column expression); `timestamp_format`: optional timestamp format (e.g. 'yyyy-mm-dd HH:mm:ss') |
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific keys in the outermost JSON object. | `column`: column to check (can be a string column name or a column expression); `keys`: a list of JSON keys to verify within the outermost JSON object; `require_all`: optional boolean flag to require all keys to be present |
| `has_valid_json_schema` | Checks whether the values in the specified column, which contain JSON strings, conform to the expected schema. This check is **not strict**. Extra fields in the JSON that are not defined in the schema are ignored. | `column`: column to check (can be a string column name or a column expression); `schema`: the schema as a DDL string (e.g., "id INT, name STRING") or StructType object |
| `is_not_in_future` | Checks whether the values in the input column contain a timestamp that is not in the future, where 'future' is defined as current_timestamp + offset (in seconds). | `column`: column to check (can be a string column name or a column expression); `offset`: offset to use; `curr_timestamp`: current timestamp, if not provided current_timestamp() function is used |
| `is_not_in_near_future` | Checks whether the values in the input column contain a timestamp that is not in the near future, where 'near future' is defined as greater than the current timestamp but less than the current_timestamp + offset (in seconds). | `column`: column to check (can be a string column name or a column expression); `offset`: offset to use; `curr_timestamp`: current timestamp, if not provided current_timestamp() function is used |
| `is_older_than_n_days` | Checks whether the values in one input column are at least N days older than the values in another column. | `column`: column to check (can be a string column name or a column expression); `days`: number of days; `curr_date`: current date, if not provided current_date() function is used; `negate`: if the condition should be negated |
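For intuition, the pass/fail semantics of `is_valid_json` can be sketched with Python's standard `json` module. This is illustrative only: the actual dqx check operates on Spark columns, and `is_valid_json` below is a hypothetical stand-in, not part of the dqx API.

```python
import json

def is_valid_json(value):
    # Hypothetical helper mirroring the per-value semantics of the check:
    # a value passes if it parses as JSON, fails otherwise.
    try:
        json.loads(value)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

print(is_valid_json('{"a": 1}'))  # True
print(is_valid_json('{a: 1}'))    # False: unquoted keys are not valid JSON
```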
@@ -325,6 +328,41 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen

    column: col5
    date_format: yyyy-MM-dd

# is_valid_json check
- criticality: error
  check:
    function: is_valid_json
    arguments:
      column: col_json_str

# has_json_keys check
- criticality: error
  check:
    function: has_json_keys
    arguments:
      column: col_json_str
      keys:
        - key1

- criticality: error
  name: col_json_str_does_not_have_json_keys2
  check:
    function: has_json_keys
    arguments:
      column: col_json_str
      keys:
        - key1
        - key2
      require_all: False

- criticality: error
  name: col_json_str2_has_invalid_json_schema
  check:
    function: has_valid_json_schema
    arguments:
      column: col_json_str2
      schema: "STRUCT<a: BIGINT, b: BIGINT>"
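The difference between the default all-keys behaviour and `require_all: False` can be sketched in plain Python. `has_json_keys` below is a hypothetical stand-in for the Spark-based check; note that, as documented above, only keys of the outermost JSON object count.

```python
import json

def has_json_keys(value, keys, require_all=True):
    # Illustrative only: check for `keys` in the outermost JSON object.
    obj = json.loads(value)
    if not isinstance(obj, dict):
        return False  # not a JSON object, so it has no keys
    present = [k in obj for k in keys]
    return all(present) if require_all else any(present)

doc = '{"key1": 1, "nested": {"key2": 2}}'
print(has_json_keys(doc, ["key1", "key2"]))                     # False: key2 is nested, not outermost
print(has_json_keys(doc, ["key1", "key2"], require_all=False))  # True: key1 is present
```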
# is_valid_timestamp check
- criticality: error
  check:
@@ -534,42 +572,42 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen

    function: is_linestring
    arguments:
      column: linestring_geom

# is_polygon check
- criticality: error
  check:
    function: is_polygon
    arguments:
      column: polygon_geom

# is_multipoint check
- criticality: error
  check:
    function: is_multipoint
    arguments:
      column: multipoint_geom

# is_multilinestring check
- criticality: error
  check:
    function: is_multilinestring
    arguments:
      column: multilinestring_geom

# is_multipolygon check
- criticality: error
  check:
    function: is_multipolygon
    arguments:
      column: multipolygon_geom

# is_geometrycollection check
- criticality: error
  check:
    function: is_geometrycollection
    arguments:
      column: geometrycollection_geom

# is_ogc_valid check
- criticality: error
  check:
@@ -583,15 +621,15 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen

    function: is_non_empty_geometry
    arguments:
      column: point_geom

# has_dimension check
- criticality: error
  check:
    function: has_dimension
    arguments:
      column: polygon_geom
      dimension: 2

# has_x_coordinate_between check
- criticality: error
  check:
@@ -600,7 +638,7 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen

      column: polygon_geom
      min_value: 0.0
      max_value: 10.0

# has_y_coordinate_between check
- criticality: error
  check:
@@ -609,6 +647,7 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen

      column: polygon_geom
      min_value: 0.0
      max_value: 10.0

```
</details>
@@ -881,6 +920,38 @@ checks = [

        name="col6_is_not_valid_timestamp2"
    ),

    # is_valid_json check

> Review comment (Contributor), on the line above: pls group examples for …

    DQRowRule(
        criticality="error",
        check_func=check_funcs.is_valid_json,
        column="col_json_str"
    ),

    # has_json_keys check
    DQRowRule(
        criticality="error",
        check_func=check_funcs.has_json_keys,
        column="col_json_str",  # or as expr: F.col("col_json_str")
        check_func_kwargs={"keys": ["key1"]},
        name="col_json_str_has_json_keys"
    ),

    DQRowRule(
        criticality="error",
        check_func=check_funcs.has_json_keys,
        column="col_json_str",  # or as expr: F.col("col_json_str")
        check_func_kwargs={"keys": ["key1", "key2"], "require_all": False},
        name="col_json_str_has_json_keys"
    ),

    DQRowRule(
        criticality="error",
        check_func=check_funcs.has_valid_json_schema,
        column="col_json_str2",  # or as expr: F.col("col_json_str")
        check_func_kwargs={"schema": "STRUCT<a: BIGINT, b: BIGINT>"},
        name="col_json_str2_has_valid_json_schema"
    ),

    # is_not_in_future check
    DQRowRule(
        criticality="error",
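The non-strict semantics of `has_valid_json_schema` (extra fields ignored) can be sketched in plain Python. `conforms` and `DDL_TO_PY` below are hypothetical illustrations, not part of the dqx API, and the real Spark-based check may treat missing or mistyped fields differently than this sketch assumes.

```python
import json

# Maps the DDL types used in the example above to the Python types
# that json.loads produces (an assumption for illustration only).
DDL_TO_PY = {"BIGINT": int, "STRING": str, "BOOLEAN": bool}

def conforms(value, fields):
    # fields: {field_name: DDL type}. Non-strict: JSON fields that are
    # not named in `fields` are simply ignored.
    obj = json.loads(value)
    if not isinstance(obj, dict):
        return False
    return all(
        name in obj and isinstance(obj[name], DDL_TO_PY[ddl])
        for name, ddl in fields.items()
    )

schema = {"a": "BIGINT", "b": "BIGINT"}
print(conforms('{"a": 1, "b": 2, "extra": "ok"}', schema))  # True: extra field ignored
print(conforms('{"a": "x", "b": 2}', schema))               # False: "a" is not an integer
```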
@@ -1018,7 +1089,7 @@ checks = [

        check_func=geo_check_funcs.is_multilinestring,
        column="multilinestring_geom"
    ),

    # is_multipolygon check
    DQRowRule(
        criticality="error",
@@ -3024,7 +3095,7 @@ The PII detection extras include a built-in `does_not_contain_pii` check that us

    function: does_not_contain_pii
    arguments:
      column: description

# PII detection check with custom threshold and named entities
- criticality: error
  check:
@@ -3041,7 +3112,7 @@ The PII detection extras include a built-in `does_not_contain_pii` check that us

```python
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.pii.pii_detection_funcs import does_not_contain_pii

checks = [
    # Basic PII detection check
    DQRowRule(
@@ -3059,7 +3130,7 @@ The PII detection extras include a built-in `does_not_contain_pii` check that us

        check_func_kwargs={"threshold": 0.8, "entities": ["PERSON", "EMAIL_ADDRESS"]}
    ),
]
```
</TabItem>
</Tabs>
@@ -3096,7 +3167,7 @@ These can be loaded using `NLPEngineConfig`:

from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.pii.pii_detection_funcs import does_not_contain_pii
from databricks.labs.dqx.pii.nlp_engine_config import NLPEngineConfig

checks = [
    # PII detection check using spacy as a named entity recognizer
    DQRowRule(
@@ -3105,7 +3176,7 @@ These can be loaded using `NLPEngineConfig`:

        column="description",
        check_func=does_not_contain_pii,
        check_func_kwargs={"nlp_engine_config": NLPEngineConfig.SPACY_MEDIUM}
    ),
]
```
</TabItem>
@@ -3125,7 +3196,7 @@ Using custom models for named-entity recognition may require you to install thes

from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

nlp_engine_config = {
    'nlp_engine_name': 'transformers_stanford_deidentifier_base',
    'models': [
@@ -3168,9 +3239,9 @@ Using custom models for named-entity recognition may require you to install thes

        column="description",
        check_func=does_not_contain_pii,
        check_func_kwargs={"nlp_engine_config": nlp_engine_config},
    ),
]

dq_engine = DQEngine(WorkspaceClient())
df = spark.read.table("main.default.table")
valid_df, quarantine_df = dq_engine.apply_checks_and_split(df, checks)
> Review comment: The description states 'valid JSON objects' but the function accepts any valid JSON value (objects, arrays, primitives like numbers, strings, booleans, null). The description should be 'Checks whether the values in the input column are valid JSON strings.' to accurately reflect the implementation.
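The reviewer's point is easy to confirm with Python's standard `json` module: parsing succeeds for any valid JSON value, not only for JSON objects.

```python
import json

# Each of these is a valid JSON *value*, though none of them is a JSON object:
for text in ["42", '"hello"', "[1, 2, 3]", "true", "null"]:
    json.loads(text)  # parses without raising json.JSONDecodeError
print("all values parsed")
```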