Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* added contract roadmap ideas * Added column level checks and switched to standard yaml checks * Added contract json schema and started on docs * Contract wip * Contract wip * Contract wip * Contract wip * Contract wip * Contract wip * Contract wip * Contract wip * Contract wip * Contract wip * Contract wip * Contract wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * Update soda/contracts/docs/README.md Co-authored-by: Milan Lukac <[email protected]> * Update soda/contracts/docs/README.md Co-authored-by: Milan Lukac <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci * wip * wip * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * wip * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix import path in tests * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * wip * wip * Fix attribute handler timezone test * Fix fixtures imports * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix fixtures imports --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Milan Lukac <[email protected]> Co-authored-by: Milan Lukac <[email protected]>
1 parent c25a872, commit 98c52ce
Showing 64 changed files with 5,804 additions and 800 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,126 +1 @@ | ||
# Goal and purpose of data contracts in Soda | ||
|
||
Soda's goal is to be the best enforcement engine for data contract implementations. For | ||
this purpose we have created a new YAML format that aligns with data contracts from a producer | ||
perspective. This new format enables a very easy translation from any data contract | ||
YAML language to Soda's enforceable contract language. | ||
|
||
# Example | ||
|
||
```yaml | ||
dataset: DIM_CUSTOMER | ||
|
||
columns: | ||
- name: id | ||
data_type: character varying | ||
unique: true | ||
- name: cst_size | ||
data_type: decimal | ||
- name: cst_size_txt | ||
valid_values: [1, 2, 3] | ||
- name: distance | ||
data_type: integer | ||
- name: country | ||
data_type: varchar | ||
not_null: true | ||
reference: | ||
dataset: COUNTRIES | ||
column: id | ||
- name: ts | ||
|
||
checks: | ||
- avg(distance) between 400 and 500 | ||
``` | ||
# Top level keys | ||
| Key | Description | Required | YAML data type | | ||
| --- | ----------- | -------- | -------------- | | ||
| `dataset` | Name of the dataset as in the SQL engine. | Required | string | | ||
| `columns` | Schema specified as a list of columns. See below for the column format | Required| list of objects | | ||
| `checks` | SodaCL checks. See [SodaCL docs](https://docs.soda.io/soda-cl/metrics-and-checks.html) | Optional | list of SodaCL checks | | ||
|
||
# Column keys | ||
|
||
| Key | Description | Required | YAML data type | | ||
| --- | ----------- | -------- | -------------- | | ||
| `name` | Name of the column as in the SQL engine. | Required | string | | ||
| `data_type` | Ensures verification of this physical data type as part of the schema check. Must be the name of the data type as in the SQL engine | Optional | string | | ||
| `not_null` | `not_null: true` Ensures a missing values check | Optional | boolean | | ||
| `missing_*` | Ensures a missing values check and uses all the `missing_*` configurations | Optional | list of strings or numbers | | ||
| `valid_*` * `invalid_*` | Ensures a validity check and uses all the `valid_*` and `invalid_*` configurations | Optional | list of strings or numbers | | ||
| `unique` | `unique: true` ensures a uniqueness check | Optional | boolean | | ||
| `reference` | Ensures a reference check that verifies values in the column exist in the referenced column. See the reference keys section below. | Optional | object | | ||
|
||
# Reference keys | ||
|
||
| Key | Description | Required | YAML data type | | ||
| --- | ----------- | -------- | -------------- | | ||
| `dataset` | Name of the reference dataset as in the SQL engine. | Required | string | | ||
| `column` | Name of the column in the reference dataset. | Required | string | | ||
| `samples_limit` | Limit of the failed rows samples taken. | Optional | number | | ||
|
||
# Contract to SodaCL translation | ||
|
||
## The schema check | ||
|
||
A contract verification will always check the schema. The contract schema check will verify that the list of columns | ||
in the database dataset matches with the columns in the contract file. All columns listed in the contract | ||
are required and no other columns are allowed. | ||
|
||
Optionally, if the `data_type` property is specified in the column, the data type will be checked as well as part of | ||
the schema check. | ||
|
||
The ordering and index of columns is ignored. | ||
|
||
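The schema check rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the Soda implementation: the function name `check_schema` and the `actual_columns` mapping (column name to data type as reported by the SQL engine) are assumptions made for this example.

```python
from typing import Dict, List, Optional


def check_schema(
    contract_columns: List[dict],
    actual_columns: Dict[str, str],
) -> List[str]:
    """Return a list of schema problems; an empty list means the schema matches.

    contract_columns: the `columns` list from the contract file.
    actual_columns: hypothetical mapping of column name -> data type
    as reported by the SQL engine.
    """
    problems: List[str] = []
    expected: Dict[str, Optional[str]] = {
        column["name"]: column.get("data_type") for column in contract_columns
    }

    # Every column listed in the contract is required.
    for name in expected:
        if name not in actual_columns:
            problems.append(f"missing column: {name}")

    # No columns other than the ones in the contract are allowed.
    for name in actual_columns:
        if name not in expected:
            problems.append(f"unexpected column: {name}")

    # The data type is only compared when the contract specifies one.
    for name, data_type in expected.items():
        if data_type is not None and name in actual_columns:
            if actual_columns[name] != data_type:
                problems.append(
                    f"data type mismatch for {name}: "
                    f"expected {data_type}, found {actual_columns[name]}"
                )
    return problems
```

Because both sides are compared by column name, the ordering and index of the columns play no role in the result.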
## Other column checks | ||
|
||
`not_null`, `missing_values`, `missing_format` or `missing_regex` will ensure a single check | ||
`missing_count({COLUMN_NAME}) = 0` on that column with all the `missing_*` keys as configuration. | ||
|
||
Presence of `valid_*`, `invalid_*` column keys will ensure a single validity check `invalid_count({COLUMN_NAME}) = 0` | ||
on that column with all `valid_*`, `invalid_*` keys as configuration. | ||
|
||
`unique: true` will ensure a SodaCL check `duplicate_count({COLUMN_NAME}) = 0` | ||
|
||
## SodaCL checks section | ||
|
||
The YAML list structure under `checks:` will just be copied to the SodaCL checks file as-is. | ||
|
||
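As a rough illustration of the column translation rules above, the sketch below derives check lines from a single column definition. The function name `column_checks` and the output shape (a plain string, or a one-entry dict carrying the configuration) are assumptions for this example; the actual translator may rename configuration keys or structure its output differently.

```python
from typing import Any, Dict, List, Union

# A check is either a bare check line or a check line with configuration.
Check = Union[str, Dict[str, Dict[str, Any]]]


def column_checks(column: Dict[str, Any]) -> List[Check]:
    """Derive the checks implied by the keys of one contract column."""
    name = column["name"]
    checks: List[Check] = []

    # not_null and all missing_* keys lead to a single missing values check.
    missing_config = {k: v for k, v in column.items() if k.startswith("missing_")}
    if column.get("not_null") or missing_config:
        line = f"missing_count({name}) = 0"
        checks.append({line: missing_config} if missing_config else line)

    # All valid_* and invalid_* keys lead to a single validity check.
    validity_config = {
        k: v for k, v in column.items() if k.startswith(("valid_", "invalid_"))
    }
    if validity_config:
        checks.append({f"invalid_count({name}) = 0": validity_config})

    # unique: true leads to a duplicate count check.
    if column.get("unique") is True:
        checks.append(f"duplicate_count({name}) = 0")

    return checks
```

For the `country` column of the example contract, this sketch yields only the missing values check; the `reference` key is not covered here.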
# Contract enforcement | ||
|
||
"Enforcement" of a contract comes down to verifying that a certain dataset (for example, a table) complies with the specification in | ||
the contract file. When the dataset does not comply, the data owner and potentially the consumers should be notified. | ||
|
||
> Known limitation: At the moment there is no possibility to verify contracts using the CLI. Only a | ||
> Python programmatic API is available. | ||
|
||
In your Python (virtual) environment, ensure that the libraries `soda-core` and `soda-core-contracts` are available, | ||
as well as the `soda-core-xxxx` library for the SQL engine of your choice. | ||
|
||
The following code snippet verifies whether a dataset complies with the contract. | ||
|
||
```python | ||
from soda.contracts.data_contract_translator import DataContractTranslator | ||
from soda.scan import Scan | ||
import logging | ||
# Read your data contract file as a Python str | ||
with open("dim_customer_data_contract.yml") as f: | ||
data_contract_yaml_str: str = f.read() | ||
# Translate the data contract into SodaCL | ||
data_contract_parser = DataContractTranslator() | ||
sodacl_yaml_str = data_contract_parser.translate_data_contract_yaml_str(data_contract_yaml_str) | ||
# Logging or saving the SodaCL YAML file will help with debugging potential scan execution issues | ||
logging.debug(sodacl_yaml_str) | ||
# Execute the contract SodaCL in a scan | ||
scan = Scan() | ||
scan.set_data_source_name("SALESDB") | ||
scan.add_configuration_yaml_file(file_path="~/.soda/my_local_soda_environment.yml") | ||
scan.add_sodacl_yaml_str(sodacl_yaml_str) | ||
scan.execute() | ||
scan.assert_all_checks_pass() | ||
``` | ||
See [Soda contract docs](docs/README.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# YAML string to YAML string conversion | ||
|
||
We first translate the Soda data contract YAML format to a SodaCL YAML string and then feed the SodaCL YAML | ||
string into a Soda scan. This way we can quickly build relatively complete coverage of checks | ||
in a contract with a tested implementation. | ||
|
||
Pros: | ||
* Easier & faster to build. | ||
* More coverage and less chance of bugs | ||
* Users can review the intermediate SodaCL and debug that based on the SodaCL docs. | ||
|
||
Cons: | ||
* No native error messages on the contract YAML lines. | ||
* Extra 'compilation' step | ||
|
||
Later we may consider building native implementations for contracts to enable further improvements. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# Connection and contract API | ||
|
||
The contract API was designed to provide a way to execute the verification of a | ||
contract in a minimal way so that it can be used and combined in as many scenarios and use cases | ||
as possible. | ||
|
||
Guiding principles for this API are: | ||
* Easy to read and understand | ||
* Simple way to stop the pipeline in case of problems (problems are both contract verification | ||
execution exceptions as well as check failures) | ||
* Simple way to introspect the contract verification results | ||
* Problems with SodaCloud or database connections should fail fast as these are not recoverable | ||
* For contract verification, as many problems as possible should be collected and reported in one go. | ||
* Simple way to leverage the existing Soda Core engine and optionally provide new implementations for | ||
contract verification later on. | ||
|
||
From a concepts point of view, we switch from using the notion of a data source to using a connection. | ||
If the schema has to be used, it has to be referenced separately: either in the contract file, as a | ||
contract verification parameter or some other way. | ||
|
||
A wrapper around the DBAPI connection is needed to handle the SQL differences. | ||
It's anticipated that initially the implementation will be based on the existing Soda Core | ||
DataSource and Scan, and that later there will be direct connection implementations | ||
for each database. | ||
|
||
The returned connection is immediately open. | ||
|
||
```python | ||
import logging | ||
from soda.contracts.connection import Connection, SodaException | ||
from soda.contracts.contract import Contract, ContractResult | ||
from soda.contracts.soda_cloud import SodaCloud | ||
|
||
connection_file_path = 'postgres_localhost.scn.yml' | ||
contract_file_path = 'customers.sdc.yml' | ||
try: | ||
soda_cloud: SodaCloud = SodaCloud.from_environment_variables() | ||
with Connection.from_yaml_file(file_path=connection_file_path) as connection: | ||
contract: Contract = Contract.from_yaml_file(file_path=contract_file_path) | ||
contract_result: ContractResult = contract.verify(connection=connection, soda_cloud=soda_cloud) | ||
except SodaException as e: | ||
logging.exception("Problems verifying contract") | ||
# TODO ensure you stop your orchestration job or pipeline & the right people are notified | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# Exceptions vs error logs | ||
|
||
In general, the principle is that contract verification aims to be resilient: it records | ||
any logs and continues, so that as many problems as possible are reported in a single execution. | ||
|
||
This is realized by suppressing exceptions and collecting all the logs until the | ||
end of the `contract.verify` method. There, any error logs or check failures will | ||
cause an exception to be raised. The single `SodaException` raised at the end of the | ||
`contract.verify` method lists all the errors and check failures. | ||
|
||
So for any of the following problems, an exception will be | ||
raised at the end of the `contract.verify` method: | ||
* Connection | ||
* Connection YAML or configuration issues (includes variable resolving problems) | ||
* Connection usage issues (can't reach db or no proper permissions) | ||
* SodaCloud issues (only if used) | ||
* SodaCloud YAML or configuration issues (includes variable resolving problems) | ||
* SodaCloud usage issues (can't reach Soda online or no proper credentials) | ||
* Contract | ||
* Contract YAML or configuration issues (includes variable resolving problems) | ||
* Contract verification issues | ||
* Check failures | ||
|
||
In the recommended API usage below, note that exceptions suppressed in | ||
Connection, SodaCloud and contract parsing are passed as logs (Connection.logs, | ||
SodaCloud.logs, Contract.logs) into the `contract.verify` method. | ||
|
||
```python | ||
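import logging | ||
from soda.contracts.connection import Connection, SodaException | ||
from soda.contracts.contract import Contract, ContractResult | ||
from soda.contracts.soda_cloud import SodaCloud | ||
|
||
connection_file_path = 'postgres_localhost.scn.yml' | ||
contract_file_path = 'customers.sdc.yml' | ||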
try: | ||
soda_cloud: SodaCloud = SodaCloud.from_environment_variables() | ||
with Connection.from_yaml_file(file_path=connection_file_path) as connection: | ||
contract: Contract = Contract.from_yaml_file(file_path=contract_file_path) | ||
contract_result: ContractResult = contract.verify(connection=connection, soda_cloud=soda_cloud) | ||
# contract verification passed | ||
except SodaException as e: | ||
# contract verification failed | ||
logging.exception(f"Contract verification failed: {e}", exc_info=e) | ||
``` |
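The collect-then-raise principle described above can be sketched independently of Soda. In this minimal example, `SketchVerificationError` and `verify_all` are hypothetical names used for illustration only; they stand in for the `SodaException` and `contract.verify` behavior described in the text and are not part of the Soda API.

```python
from typing import Callable, List


class SketchVerificationError(Exception):
    """Hypothetical stand-in for the single exception raised at the end."""

    def __init__(self, problems: List[str]):
        self.problems = problems
        super().__init__("; ".join(problems))


def verify_all(steps: List[Callable[[], List[str]]]) -> None:
    """Run every step, collect all problems, and raise once at the end."""
    problems: List[str] = []
    for step in steps:
        try:
            # A step returns the error logs or check failures it found.
            problems.extend(step())
        except Exception as e:
            # Suppress the exception for now and record it as a problem,
            # so that the remaining steps still get a chance to run.
            problems.append(f"{type(e).__name__}: {e}")
    if problems:
        raise SketchVerificationError(problems)
```

The caller handles one exception type and can still inspect every individual problem through the `problems` attribute.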
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
# Link between contract and schema | ||
|
||
With the new contracts API, we will revisit the concept of a data source. Instead of | ||
combining the connection together with the schema in a data source, contracts will just | ||
work on a connection. This will bring the abstractions more in line with what users | ||
know. | ||
|
||
Contract verification operates on a connection. This implies a selection of a database. | ||
Usually one connection can provide access to multiple schemas. | ||
|
||
In the simplest case, a schema is not needed. Contract verification can run on just the | ||
table name, as long as the connection is able to identify the table by its name without | ||
referring to the schema. | ||
|
||
```yaml | ||
dataset: CUSTOMERS | ||
columns: | ||
- name: id | ||
... | ||
``` | ||
|
||
The connection may not have the target schema in the search path, so referring to the table | ||
name alone may not be sufficient on the connection. In that case, we should consider letting users | ||
specify the schema in several ways: | ||
|
||
a) In the contract itself: | ||
```yaml | ||
dataset: CUSTOMERS | ||
schema: CSTMR_DATA_PROD | ||
columns: | ||
- name: id | ||
... | ||
``` | ||
|
||
b) In the API | ||
```python | ||
contract: Contract = Contract.from_yaml_file(file_path=contract_file_path, schema="CSTMR_DATA_PROD") | ||
contract_result: ContractResult = contract.verify(connection=connection, soda_cloud=soda_cloud) | ||
``` | ||
|
||
c) (still in idea-stage) We can expand this basic API with a file naming convention that uses relative references to | ||
the schema and connection files, such as `../schema.yml` | ||
and `../../../connection.yml`, leading to for example: | ||
|
||
``` | ||
+ postgres_localhost_db/ | ||
+ connection.sdn.yml | ||
+ soda_cloud.scl.yml | ||
+ schemas/ | ||
| + CSTMR_DATA_PROD/ | ||
| | + schema.yml | ||
| | + datasets/ | ||
| | | + CUSTOMERS.sdc.yml | ||
| | | + SUPPLIERS.sdc.yml | ||
``` | ||
Then we can add a simpler API like: | ||
|
||
```python | ||
import logging | ||
from soda.contracts.contract import Contracts | ||
from soda.contracts.connection import SodaException | ||
|
||
try: | ||
Contracts.verify(["postgres_localhost_db/schemas/CSTMR_DATA_PROD/datasets/*.sdc.yml"]) | ||
except SodaException as e: | ||
logging.exception("Problems verifying contract") | ||
# TODO ensure you stop your orchestration job or pipeline & the right people are notified | ||
``` | ||
|
||
This would also fit the CLI tooling. This file name convention also makes the connection between the contract | ||
and the database much clearer: the contract is the place where you can extend the database's metadata. |
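To make the relative references in option c) concrete, the sketch below resolves `../schema.yml` and `../../../connection.yml` from a contract file path using only path arithmetic. The function name `related_files` is hypothetical, and taking the schema name from the directory name is an assumption; the file names follow the relative references in the text rather than the names in the directory listing.

```python
from pathlib import PurePosixPath
from typing import Dict


def related_files(contract_file: str) -> Dict[str, str]:
    """Locate the schema and connection files for a contract file by convention."""
    contract_path = PurePosixPath(contract_file)
    datasets_dir = contract_path.parent        # .../CSTMR_DATA_PROD/datasets
    schema_dir = datasets_dir.parent           # .../CSTMR_DATA_PROD
    connection_dir = schema_dir.parent.parent  # postgres_localhost_db

    return {
        # Assumption: the schema directory is named after the schema.
        "schema_name": schema_dir.name,
        # Equivalent to `../schema.yml` relative to the contract file.
        "schema_file": str(schema_dir / "schema.yml"),
        # Equivalent to `../../../connection.yml` relative to the contract file.
        "connection_file": str(connection_dir / "connection.yml"),
    }
```

With such a lookup, a contract file path alone would be enough to find the connection and schema, which is what a glob-based call over `*.sdc.yml` files would rely on.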
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
### Keys without spaces nor variables | ||
|
||
No spaces in keys. No parsing of keys. No variable parts in keys except for the column names. | ||
|
||
* Pro: | ||
* More JSON compliant | ||
* More validation from JSON schema | ||
* More expected and in line with people's expectations | ||
* Con: | ||
* Not similar to SodaCL |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# New YAML framework | ||
|
||
See `../soda/contracts/impl/yaml.py` | ||
|
||
The new YAML abstraction allows for: | ||
* Capturing all errors into a central logs object instead of raising an exception on the first problem | ||
* Convenience read_* methods on the YamlObject for writing parsing code | ||
* A more convenient way to access the line and column information (location) | ||
|
||
It's intended for reading, not writing. Should we add the ability to write on this same framework? | ||
For now we write using plain dicts/lists. There is also the unpack() method. | ||
But fully mutable data structures would require overloading the mutating operators such as `__setitem__`. |
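The idea of capturing errors in a central logs object instead of raising on the first problem can be sketched as follows. `SketchLogs` and `SketchYamlObject` are hypothetical classes written for this illustration; they are not the classes in `yaml.py`, and the real `read_*` methods may have different signatures.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchLogs:
    """Central place where parsing errors are collected."""

    errors: List[str] = field(default_factory=list)

    def error(self, message: str) -> None:
        self.errors.append(message)


@dataclass
class SketchYamlObject:
    """Read-only wrapper around a parsed YAML mapping and its location."""

    values: Dict[str, Any]
    line: int
    logs: SketchLogs

    def read_string(self, key: str) -> Optional[str]:
        """Return the string value for key, or log an error and return None."""
        value = self.values.get(key)
        if not isinstance(value, str):
            found = "nothing" if value is None else type(value).__name__
            self.logs.error(
                f"'{key}' must be a string, found {found} "
                f"(object at line {self.line})"
            )
            return None
        return value
```

Parsing code can then call `read_string` for every expected key and inspect `logs.errors` once at the end, so a single run reports every problem in the file.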
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
In order to make it easier for contract authors to know when they are putting in literal SQL vs Soda Contract interpreted values, | ||
all the keys that are used literally in SQL queries should have `sql` in them. | ||
|
||
For example `sql_expression`, `invalid_sql_regex`, `valid_sql_regex` etc | ||
```yaml | ||
dataset: {table_name} | ||
checks: | ||
- type: metric_sql_expression | ||
metric: us_count | ||
sql_expression: COUNT(CASE WHEN country = 'US' THEN 1 END) | ||
must_be: 0 | ||
``` | ||
Potentially you could consider the column name and data type exceptions to this rule. | ||
Adding `sql` to the keys `name` and `data_type` would be overkill. | ||
```yaml | ||
dataset: {table_name} | ||
columns: | ||
- name: id | ||
data_type: VARCHAR | ||
``` |