Contracts 2nd iteration (#2006)
* added contract roadmap ideas

* Added column level checks and switched to standard yaml checks

* Added contract json schema and started on docs

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* Contract wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* Update soda/contracts/docs/README.md

Co-authored-by: Milan Lukac <[email protected]>

* Update soda/contracts/docs/README.md

Co-authored-by: Milan Lukac <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* wip

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* wip

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix import path in tests

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

* wip

* Fix attribute handler timezone test

* Fix fixtures imports

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix fixtures imports

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Milan Lukac <[email protected]>
Co-authored-by: Milan Lukac <[email protected]>
4 people authored Mar 16, 2024
1 parent c25a872 commit 98c52ce
Showing 64 changed files with 5,804 additions and 800 deletions.
10 changes: 8 additions & 2 deletions dev-requirements.txt
@@ -1,5 +1,5 @@
#
-# This file is autogenerated by pip-compile with Python 3.10
+# This file is autogenerated by pip-compile with Python 3.9
# by the following command:
#
# pip-compile dev-requirements.in
@@ -52,6 +52,8 @@ filelock==3.13.1
# virtualenv
idna==3.6
# via requests
+importlib-metadata==7.0.2
+# via build
iniconfig==2.0.0
# via pytest
mypy-extensions==1.0.0
@@ -136,7 +138,9 @@ tox==4.13.0
tox-docker==4.1.0
# via -r dev-requirements.in
typing-extensions==4.10.0
-# via -r dev-requirements.in
+# via
+# -r dev-requirements.in
+# black
unidecode==1.3.8
# via cli-ui
urllib3==1.26.18
@@ -154,6 +158,8 @@ wheel==0.42.0
# via
# -r dev-requirements.in
# pip-tools
+zipp==3.17.0
+# via importlib-metadata

# The following packages are considered to be unsafe in a requirements file:
# pip
2 changes: 1 addition & 1 deletion scripts/recreate_venv.sh
@@ -7,7 +7,7 @@ set -e
rm -rf .venv
rm -rf soda_sql.egg-info

-python3 -m venv .venv
+virtualenv .venv
# shellcheck disable=SC1091
source .venv/bin/activate
pip install --upgrade pip
127 changes: 1 addition & 126 deletions soda/contracts/README.md
@@ -1,126 +1 @@
# Goal and purpose of data contracts in Soda

Soda's goal is to be the best enforcement engine for data contract implementations. For
this purpose we have created a new YAML format that aligns with data contracts from a producer
perspective. This new format enables a very easy translation from any data contract
YAML language to Soda's enforceable contract language.

# Example

```yaml
dataset: DIM_CUSTOMER

columns:
  - name: id
    data_type: character varying
    unique: true
  - name: cst_size
    data_type: decimal
  - name: cst_size_txt
    valid_values: [1, 2, 3]
  - name: distance
    data_type: integer
  - name: country
    data_type: varchar
    not_null: true
    reference:
      dataset: COUNTRIES
      column: id
  - name: ts

checks:
  - avg(distance) between 400 and 500
```
# Top level keys
| Key | Description | Required | YAML data type |
| --- | ----------- | -------- | -------------- |
| `dataset` | Name of the dataset as in the SQL engine. | Required | string |
| `columns` | Schema specified as a list of columns. See below for the column format | Required| list of objects |
| `checks` | SodaCL checks. See [SodaCL docs](https://docs.soda.io/soda-cl/metrics-and-checks.html) | Optional | list of SodaCL checks |

# Column keys

| Key | Description | Required | YAML data type |
| --- | ----------- | -------- | -------------- |
| `name` | Name of the column as in the SQL engine. | Required | string |
| `data_type` | Ensures verification of this physical data type as part of the schema check. Must be the name of the data type as in the SQL engine | Optional | string |
| `not_null` | `not_null: true` ensures a missing values check | Optional | boolean |
| `missing_*` | Ensures a missing values check and uses all the `missing_*` configurations | Optional | list of strings or numbers |
| `valid_*`, `invalid_*` | Ensures a validity check and uses all the `valid_*` and `invalid_*` configurations | Optional | list of strings or numbers |
| `unique` | `unique: true` ensures a uniqueness check | Optional | boolean |
| `reference` | Ensures a reference check that verifies values in the column exist in the referenced column. See the reference keys section below. | Optional | object |

# Reference keys

| Key | Description | Required | YAML data type |
| --- | ----------- | -------- | -------------- |
| `dataset` | Name of the reference dataset as in the SQL engine. | Required | string |
| `column` | Name of the column in the reference dataset. | Required | string |
| `samples_limit` | Limit of the failed rows samples taken. | Optional | number |

# Contract to SodaCL translation

## The schema check

A contract verification will always check the schema. The contract schema check will verify that the list of columns
in the database dataset matches with the columns in the contract file. All columns listed in the contract
are required and no other columns are allowed.

Optionally, if the `data_type` property is specified in the column, the data type will be checked as well as part of
the schema check.

The ordering and index of columns is ignored.
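
The schema check rules above can be sketched in a few lines of Python. This is only an illustration of the described semantics, not Soda's implementation: the function name `check_schema` and the plain-dict inputs (column name mapped to data type, `None` when the contract omits `data_type`) are assumptions made for this example.

```python
from typing import Dict, List, Optional


def check_schema(
    contract_columns: Dict[str, Optional[str]],
    actual_columns: Dict[str, str],
) -> List[str]:
    """Return schema problems; an empty list means the schema check passes."""
    problems: List[str] = []
    # Every column listed in the contract is required.
    for name in contract_columns:
        if name not in actual_columns:
            problems.append(f"missing column: {name}")
    # No columns other than the contract's are allowed.
    for name in actual_columns:
        if name not in contract_columns:
            problems.append(f"unexpected column: {name}")
    # Data types are compared only where the contract specifies one.
    for name, expected_type in contract_columns.items():
        actual_type = actual_columns.get(name)
        if expected_type is not None and actual_type is not None and expected_type != actual_type:
            problems.append(f"column {name}: expected {expected_type}, found {actual_type}")
    return problems


# Column order differs between the two dicts on purpose: ordering is ignored.
contract = {"id": "character varying", "distance": "integer", "ts": None}
actual = {"ts": "timestamp", "id": "character varying", "distance": "decimal", "extra": "text"}
print(check_schema(contract, actual))
```

Here `ts` passes without a type comparison because the contract leaves its `data_type` unspecified, while `extra` and the `distance` type mismatch are both reported.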

## Other column checks

`not_null`, `missing_values`, `missing_format` or `missing_regex` will ensure a single check
`missing_count({COLUMN_NAME}) = 0` on that column with all the `missing_*` keys as configuration.

Presence of `valid_*`, `invalid_*` column keys will ensure a single validity check `invalid_count({COLUMN_NAME}) = 0`
on that column with all `valid_*`, `invalid_*` keys as configuration.

`unique: true` will ensure a SodaCL check `duplicate_count({COLUMN_NAME}) = 0`
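
As a rough sketch of the column-level translation described above, the following Python function builds SodaCL-style check entries from one contract column. It is not the actual `DataContractTranslator` code: the function name `translate_column_checks` is hypothetical, and turning contract keys into SodaCL configuration keys by replacing underscores with spaces is an assumption made purely for illustration.

```python
from typing import Any, Dict, List


def translate_column_checks(column: Dict[str, Any]) -> List[Any]:
    """Build SodaCL-style check entries for one contract column (illustrative only)."""
    name = column["name"]
    checks: List[Any] = []

    # All `missing_*` keys become the configuration of a single missing check.
    # Assumption: the SodaCL configuration key is the contract key with spaces.
    missing_config = {
        key.replace("_", " "): value for key, value in column.items() if key.startswith("missing_")
    }
    if column.get("not_null") or missing_config:
        check = f"missing_count({name}) = 0"
        checks.append({check: missing_config} if missing_config else check)

    # All `valid_*` and `invalid_*` keys become the configuration of a single validity check.
    validity_config = {
        key.replace("_", " "): value
        for key, value in column.items()
        if key.startswith(("valid_", "invalid_"))
    }
    if validity_config:
        checks.append({f"invalid_count({name}) = 0": validity_config})

    # `unique: true` becomes a duplicate count check.
    if column.get("unique"):
        checks.append(f"duplicate_count({name}) = 0")
    return checks


print(translate_column_checks({"name": "id", "data_type": "character varying", "unique": True}))
print(translate_column_checks({"name": "cst_size_txt", "valid_values": [1, 2, 3]}))
print(translate_column_checks({"name": "country", "data_type": "varchar", "not_null": True}))
```

Note that `data_type` produces no check here, since it is handled by the schema check.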

## SodaCL checks section

The YAML list structure under `checks:` will just be copied to the SodaCL checks file as-is.

# Contract enforcement

"Enforcement" of a contract comes down to verifying that a certain dataset (like eg a table) complies with the specification in
the contract file. When the dataset does not comply, the data owner and potentially the consumers should be notified.

> Known limitation: At the moment it is not possible to verify contracts using the CLI. Only a
> Python programmatic API is available.

In your python (virtual) environment, ensure that the libraries `soda-core` and `soda-core-contracts` are available
as well as the `soda-core-xxxx` library for the SQL engine of your choice.

To verify if a dataset complies with the contract, here's the code snippet.

```python
import logging

from soda.contracts.data_contract_translator import DataContractTranslator
from soda.scan import Scan

# Read your data contract file as a Python str
with open("dim_customer_data_contract.yml") as f:
    data_contract_yaml_str: str = f.read()

# Translate the data contract into SodaCL
data_contract_parser = DataContractTranslator()
sodacl_yaml_str = data_contract_parser.translate_data_contract_yaml_str(data_contract_yaml_str)

# Logging or saving the SodaCL YAML file will help with debugging potential scan execution issues
logging.debug(sodacl_yaml_str)

# Execute the contract SodaCL in a scan
scan = Scan()
scan.set_data_source_name("SALESDB")
scan.add_configuration_yaml_file(file_path="~/.soda/my_local_soda_environment.yml")
scan.add_sodacl_yaml_str(sodacl_yaml_str)
scan.execute()
scan.assert_all_checks_pass()
```
See [Soda contract docs](docs/README.md)
16 changes: 16 additions & 0 deletions soda/contracts/adr/01_yaml_to_yaml_conversion.md
@@ -0,0 +1,16 @@
# YAML string to YAML string conversion

We first translate the Soda data contract YAML format to a SodaCL YAML string and then feed the SodaCL YAML
string into a Soda scan. This way we can quickly build relatively complete coverage of checks
in a contract with a tested implementation.

Pros:
* Easier & faster to build.
* More coverage and less chance of bugs
* Users can review the intermediate SodaCL and debug that based on the SodaCL docs.

Cons:
* No native error messages on the contract YAML lines.
* Extra 'compilation' step

Later we may consider building native implementations for contracts to enable further improvements.
44 changes: 44 additions & 0 deletions soda/contracts/adr/02_contract_api.md
@@ -0,0 +1,44 @@
# Connection and contract API

The contract API was designed to provide a minimal way to execute the verification of a
contract, so that it can be used and combined in as many scenarios and use cases
as possible.

Guiding principles for this API are:
* Easy to read and understand
* Simple way to stop the pipeline in case of problems (problems are both contract verification
execution exceptions as well as check failures)
* Simple way to introspect the contract verification results
* Problems with SodaCloud or database connections should fail fast as these are not recoverable
* For contract verification, as many problems as possible should be collected and reported in one go.
* Simple way to leverage the existing Soda Core engine and optionally provide new implementations for
contract verification later on.

From a concepts point of view, we switch from using the notion of a data source to using a connection.
If the schema has to be used, it has to be referenced separately: either in the contract file, as a
contract verification parameter or some other way.

A wrapper around the DBAPI connection is needed to handle the SQL differences.
It's anticipated that the implementation will initially be based on the existing Soda Core
DataSource and Scan, and that later there will be direct connection implementations
for each database.

The returned connection is immediately open.

```python
import logging

from soda.contracts.connection import Connection, SodaException
from soda.contracts.contract import Contract, ContractResult
from soda.contracts.soda_cloud import SodaCloud

connection_file_path = 'postgres_localhost.scn.yml'
contract_file_path = 'customers.sdc.yml'
try:
    soda_cloud: SodaCloud = SodaCloud.from_environment_variables()
    with Connection.from_yaml_file(file_path=connection_file_path) as connection:
        contract: Contract = Contract.from_yaml_file(file_path=contract_file_path)
        contract_result: ContractResult = contract.verify(connection=connection, soda_cloud=soda_cloud)
except SodaException as e:
    logging.exception("Problems verifying contract")
    # TODO ensure you stop your orchestration job or pipeline & the right people are notified
```
41 changes: 41 additions & 0 deletions soda/contracts/adr/03_exceptions_vs_error_logs.md
@@ -0,0 +1,41 @@
# Exceptions vs error logs

In general the principle is that contract verification aims to be resilient:
it records any logs and continues, so that as many problems as possible are reported in a single execution.

This is realized by suppressing exceptions and collecting all the logs until the
end of the `contract.verify` method. At that point, any error logs or check failures
cause an exception to be raised. The SodaException raised at the end of the
`contract.verify` method lists all the errors and check failures.

So for any of the following problems, you will get an exception
raised at the end of the `contract.verify` method:
* Connection
  * Connection YAML or configuration issues (includes variable resolving problems)
  * Connection usage issues (can't reach db or no proper permissions)
* SodaCloud issues (only if used)
  * SodaCloud YAML or configuration issues (includes variable resolving problems)
  * SodaCloud usage issues (can't reach Soda online or no proper credentials)
* Contract
  * Contract YAML or configuration issues (includes variable resolving problems)
  * Contract verification issues
  * Check failures

In the recommended API usage below, note that exceptions suppressed in
Connection, SodaCloud and contract parsing are passed as logs (Connection.logs,
SodaCloud.logs, Contract.logs) into the `contract.verify` method.

```python
import logging

from soda.contracts.connection import Connection, SodaException
from soda.contracts.contract import Contract, ContractResult
from soda.contracts.soda_cloud import SodaCloud

connection_file_path = 'postgres_localhost.scn.yml'
contract_file_path = 'customers.sdc.yml'
try:
    soda_cloud: SodaCloud = SodaCloud.from_environment_variables()
    with Connection.from_yaml_file(file_path=connection_file_path) as connection:
        contract: Contract = Contract.from_yaml_file(file_path=contract_file_path)
        contract_result: ContractResult = contract.verify(connection=connection, soda_cloud=soda_cloud)
    # contract verification passed
except SodaException as e:
    # contract verification failed
    logging.exception(f"Contract verification failed: {e}", exc_info=e)
```
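
The collect-then-raise behaviour described in this ADR can be sketched independently of Soda. Everything in the block below is hypothetical and exists only for illustration: `VerificationError` stands in for `SodaException`, and the step functions stand in for connection, parsing and check execution.

```python
from typing import Callable, List


class VerificationError(Exception):
    """Hypothetical stand-in for the SodaException described above."""


def verify(steps: List[Callable[[], None]]) -> None:
    """Run every step, collect failures, and raise a single exception at the end."""
    errors: List[str] = []
    for step in steps:
        try:
            step()
        except Exception as e:  # suppressed so that later steps still run
            errors.append(f"{step.__name__}: {e}")
    if errors:
        raise VerificationError("; ".join(errors))


def parse_contract() -> None:
    raise ValueError("bad YAML")


def check_row_count() -> None:
    pass  # this step succeeds


def check_columns() -> None:
    raise AssertionError("missing column id")


try:
    verify([parse_contract, check_row_count, check_columns])
except VerificationError as e:
    print(f"Contract verification failed: {e}")
```

Both failures end up in the one exception message, even though the first step already failed before the others ran.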
71 changes: 71 additions & 0 deletions soda/contracts/adr/04_link_contract_schema.md
@@ -0,0 +1,71 @@
# Link between contract and schema

With the new contracts API, we will revisit the concept of a data source. Instead of
combining the connection together with the schema in a data source, contracts will just
work on a connection. This will bring the abstractions more in line with what users
know.

Contract verification operates on a connection. This implies a selection of a database.
Usually one connection can provide access to multiple schemas.

In the simplest case, a schema is not needed. Contract verification can run on just the
table name, as long as the connection is able to identify the table by its name without
referring to the schema.

```yaml
dataset: CUSTOMERS
columns:
  - name: id
  ...
```

The connection may not have the target schema in the search path, in which case referring to the table
name alone may not be sufficient. In that case, we should consider letting users
specify the schema in several ways:

a) In the contract itself:
```yaml
dataset: CUSTOMERS
schema: CSTMR_DATA_PROD
columns:
  - name: id
  ...
```

b) In the API
```python
contract: Contract = Contract.from_yaml_file(file_path=contract_file_path, schema="CSTMR_DATA_PROD")
contract_result: ContractResult = contract.verify(connection=connection, soda_cloud=soda_cloud)
```

c) (still in idea-stage) We can expand this basic API with a file naming convention that uses relative references to
the schema and connection files: `../schema.yml` and `../../../connection.yml`, leading to for example:

```
+ postgres_localhost_db/
  + connection.sdn.yml
  + soda_cloud.scl.yml
  + schemas/
  |  + CSTMR_DATA_PROD/
  |  |  + schema.yml
  |  |  + datasets/
  |  |  |  + CUSTOMERS.sdc.yml
  |  |  |  + SUPPLIERS.sdc.yml
```
Then we can add a simpler API like:

```python
import logging

from soda.contracts.contract import Contracts
from soda.contracts.connection import SodaException

try:
    Contracts.verify(["postgres_localhost_db/schemas/CSTMR_DATA_PROD/datasets/*.sdc.yml"])
except SodaException as e:
    logging.exception("Problems verifying contract")
    # TODO ensure you stop your orchestration job or pipeline & the right people are notified
```

This would also fit the CLI tooling. This file naming convention also makes the connection between the contract
and the database much clearer: the contract is the place where you can extend the database's metadata.
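
To make the idea-stage naming convention in option c) more concrete, here is a minimal sketch of how a contract file path could be resolved to its dataset, schema and connection file. None of this exists in the codebase: `resolve_context` is a hypothetical helper, and it assumes the directory layout and the `connection.sdn.yml` file name shown in the tree above.

```python
from pathlib import PurePosixPath
from typing import Dict

CONTRACT_SUFFIX = ".sdc.yml"


def resolve_context(contract_file: str) -> Dict[str, str]:
    """Derive dataset, schema and connection file from a contract file path.

    Assumes the layout <db>/schemas/<SCHEMA>/datasets/<DATASET>.sdc.yml.
    """
    path = PurePosixPath(contract_file)
    if not path.name.endswith(CONTRACT_SUFFIX) or path.parent.name != "datasets":
        raise ValueError(f"not a contract file in a datasets directory: {contract_file}")
    schema_dir = path.parent.parent          # .../schemas/<SCHEMA>
    database_dir = schema_dir.parent.parent  # <db>
    return {
        "dataset": path.name[: -len(CONTRACT_SUFFIX)],
        "schema": schema_dir.name,
        "connection_file": str(database_dir / "connection.sdn.yml"),
    }


print(resolve_context("postgres_localhost_db/schemas/CSTMR_DATA_PROD/datasets/CUSTOMERS.sdc.yml"))
```

With such a helper, the contract file itself would not need a `schema` key, because the schema name follows from the file's location.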
10 changes: 10 additions & 0 deletions soda/contracts/adr/05_data_contract_yaml_format.md
@@ -0,0 +1,10 @@
### Keys without spaces nor variables

No spaces in keys. No parsing of keys. No variable parts in keys except for the column names.

* Pro:
  * More JSON compliant
  * More validation from JSON schema
  * More expected and in line with people's expectations
* Con:
  * Not similar to SodaCL
12 changes: 12 additions & 0 deletions soda/contracts/adr/06_new_yaml_framework.md
@@ -0,0 +1,12 @@
# New YAML framework

See `../soda/contracts/impl/yaml.py`

The new YAML abstraction allows for:
* Capturing all errors into a central logs object instead of raising an exception on the first problem
* Convenience read_* methods on the YamlObject for writing parsing code
* A more convenient way to access the line and column information (location)

It's intended for reading, not writing. Should we add the ability to write on this same framework?
For now we write using plain dicts/lists. There is also the unpack() method.
But fully mutable data structures would require overloading the mutating operators, such as `__setitem__`.
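
The first two points of this ADR (central error collection and `read_*` convenience methods) could look roughly like the sketch below. It is not the code in `yaml.py`: the `Logs` and `YamlObject` classes here are simplified stand-ins written for illustration, the `read_string` signature is an assumption, and line/column location tracking is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Logs:
    """Central collector for parsing errors (simplified stand-in)."""

    errors: List[str] = field(default_factory=list)

    def error(self, message: str) -> None:
        self.errors.append(message)


@dataclass
class YamlObject:
    """Read-only wrapper around a parsed YAML mapping (simplified stand-in)."""

    data: Dict[str, Any]
    logs: Logs

    def read_string(self, key: str, required: bool = True) -> Optional[str]:
        """Return the string at `key`, logging an error instead of raising."""
        value = self.data.get(key)
        if value is None:
            if required:
                self.logs.error(f"'{key}' is required")
            return None
        if not isinstance(value, str):
            self.logs.error(f"'{key}' must be a string, got {type(value).__name__}")
            return None
        return value


logs = Logs()
contract_yaml = YamlObject({"dataset": 42}, logs)
contract_yaml.read_string("dataset")  # wrong type: logged, parsing continues
contract_yaml.read_string("schema")   # missing key: logged, parsing continues
print(logs.errors)
```

Because the read methods never raise, parsing code can walk the whole document and report every problem at once.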
21 changes: 21 additions & 0 deletions soda/contracts/adr/07_sql_yaml_keys.md
@@ -0,0 +1,21 @@
In order to make it easier for contract authors to know when they are putting in literal SQL vs Soda Contract interpreted values,
all the keys that are used literally in SQL queries should have `sql` in them.

For example `sql_expression`, `invalid_sql_regex`, `valid_sql_regex` etc
```yaml
dataset: {table_name}
checks:
  - type: metric_sql_expression
    metric: us_count
    sql_expression: COUNT(CASE WHEN country = 'US' THEN 1 END)
    must_be: 0
```
The column name and data type could be considered exceptions to this rule.
Adding `sql` to the keys `name` and `data_type` would be overkill.
```yaml
dataset: {table_name}
columns:
  - name: id
    data_type: VARCHAR
```