Sync main from soda-core #9

Open
wants to merge 112 commits into
base: main
Commits
5164d34
Catch exceptions while building results file (#1936)
m1n0 Sep 13, 2023
71dfe19
[pre-commit.ci] pre-commit autoupdate (#1935)
pre-commit-ci[bot] Sep 14, 2023
8fa452e
Reference check: support must NOT exist (#1937)
m1n0 Sep 18, 2023
995b4ac
Bump to 3.0.49
m1n0 Sep 19, 2023
67597f2
Add thresholds and diagnostics to scan result (#1939)
m1n0 Sep 21, 2023
8e74d93
Fix databricks numeric types profiling (#1941)
m1n0 Sep 27, 2023
67111aa
Bump to 3.0.50
m1n0 Sep 27, 2023
f743fc7
Allow to specify virtual file name for add sodacl string (#1943)
m1n0 Oct 2, 2023
3fdac3c
Feature/add more file formats for duckdb (#1942)
PaoloLeonard Oct 6, 2023
b34e271
added BigQuery Job Labels (#1947)
m1n0 Oct 10, 2023
d25316f
Bump to 3.0.51
m1n0 Oct 11, 2023
2f67adb
Distribution: compute value counts in DB rather than in python
baturayo Oct 13, 2023
fe27fc3
Fix 3.8 compatibility
m1n0 Oct 17, 2023
431a0ee
feat: Add Dask/Pandas configurable data source naming support (#1951)
dirkgroenen Oct 25, 2023
5312c43
Bump to 3.0.52
dirkgroenen Oct 25, 2023
f6505f0
Freshness: support mixed thresholds (#1957)
m1n0 Oct 31, 2023
7affe19
Add License to every package (#1958)
m1n0 Nov 1, 2023
b3c112e
Bump to 3.0.53
m1n0 Nov 1, 2023
2c9cde9
Failed rows check: support thresholds (#1960)
m1n0 Nov 3, 2023
59191bf
Updated install doc to include MotherDuck support via DuckDB (#1963)
janet-can Nov 7, 2023
c7182b1
remove % from pattern (#1956)
chuwangBA Nov 9, 2023
7505aa3
Sqlserver: support quoting tables with brackets, "quote_tables" mode …
m1n0 Nov 14, 2023
644546d
Bump to 3.0.54
m1n0 Nov 14, 2023
5f268b8
Contracts
tombaeyens Nov 15, 2023
6ffddd9
Fix check source payload (#1966)
m1n0 Nov 15, 2023
2a142e7
Bump to 3.1.0
m1n0 Nov 16, 2023
3f8fcc7
Update python api docs (#1967)
m1n0 Nov 16, 2023
88640a9
Make custom identity fixed as v4 (#1968)
m1n0 Nov 20, 2023
09c00a2
Freshness: support in-check filters (#1970)
m1n0 Dec 1, 2023
ae8d325
Bump to 3.1.1
m1n0 Dec 2, 2023
8249949
Adding support for authentication via a chained list of delegate acco…
nathadfield Dec 15, 2023
17c67cf
fix anomaly detection frequency aggregation bug (#1975)
baturayo Dec 15, 2023
46206eb
upgrade pydantic from v1 to v2 (#1974)
baturayo Dec 15, 2023
cb950c9
[pre-commit.ci] pre-commit autoupdate (#1938)
pre-commit-ci[bot] Dec 15, 2023
b7103e1
Bump to 3.1.2
m1n0 Dec 15, 2023
e80f118
feat: implement warn_only for anomaly score (#156) (#1980)
baturayo Dec 27, 2023
3c05346
Bump to 3.1.3
m1n0 Jan 3, 2024
1a44ce0
Dbt: improve parsing logs (#1981)
m1n0 Jan 4, 2024
2bde90c
Sampler: fix link href (#1983)
m1n0 Jan 5, 2024
c3c9521
Document group by example for Soda Core with failed rows check (#1984)
janet-can Jan 5, 2024
45a5a74
Schema check: support custom identity (#1988)
m1n0 Jan 16, 2024
34d65af
Add semver release with major, minor, latest (#1993)
dirkgroenen Jan 23, 2024
036204b
bug: handle null values for continuous dist (#165) (#1994)
baturayo Jan 23, 2024
55b85f5
[pre-commit.ci] pre-commit autoupdate (#1977)
pre-commit-ci[bot] Jan 23, 2024
ceab226
feat: implement new anomaly detection in soda core (#1995)
baturayo Jan 24, 2024
9445d1e
feat: support built-in prophet public holidays (#1997)
baturayo Jan 24, 2024
64bc338
Bump to 3.1.4
m1n0 Jan 24, 2024
b6f4329
Hive data source improvements (#1982)
robertomorandeira Jan 24, 2024
79b513a
feat: implement migrate from anomaly score check config (#168) (#1998)
baturayo Jan 25, 2024
311f1f2
Bump Prophet (#2000)
m1n0 Jan 25, 2024
89da879
Tests: use approx comparison for floats (#1999)
m1n0 Jan 25, 2024
8e0ae62
hive: add configuration parameters (#36)
vijaykiran Jul 3, 2023
2d00558
Bump to 3.1.5
m1n0 Jan 26, 2024
594d026
feat: implement severity level parameters (#2001)
baturayo Jan 29, 2024
339309f
Always use datasource specific COUNT expression (#2003)
m1n0 Jan 29, 2024
51a30fb
fix: anomaly detection feedbacks (#2005)
baturayo Jan 31, 2024
70b8753
[pre-commit.ci] pre-commit autoupdate (#2002)
pre-commit-ci[bot] Feb 2, 2024
1d2e8ac
feat: anomaly detection simulator (#163) (#2010)
baturayo Feb 6, 2024
e172b7d
feat: added dremio token support (#2009)
JorisTruong Feb 7, 2024
fc8e191
Bump to 3.2.0
m1n0 Feb 8, 2024
68d44b3
feat: correctly identified anomalies are excluded from training data …
baturayo Feb 9, 2024
1a211f5
fix: show more clearly the detected frequency using warning message f…
baturayo Feb 9, 2024
16ea0b9
Fix simulator import and streamlit path (#2017)
m1n0 Feb 12, 2024
a02f463
[pre-commit.ci] pre-commit autoupdate (#2016)
pre-commit-ci[bot] Feb 13, 2024
2c3ce9d
Update oracle_data_source.py (#2012)
vinod901 Feb 13, 2024
eb2abf9
Oracle: cast config to str/int to prevent oracledb errors (#2018)
m1n0 Feb 13, 2024
dd63d9e
Bump to 3.2.1
m1n0 Feb 13, 2024
ea5831e
Fix assets folder (#2020)
m1n0 Feb 14, 2024
f47801c
fix timezone issue and log messages (#188) (#2023)
baturayo Feb 21, 2024
fe70d82
feat: in anomaly detection simulator use soda core historic check res…
baturayo Feb 28, 2024
7d2ed7b
Update dask-sql (#2026)
m1n0 Feb 29, 2024
f07eba9
Add dask-sql version comment
m1n0 Feb 29, 2024
97c3545
Bump to 3.2.2
m1n0 Feb 29, 2024
6245a4c
feat: implement daily and monthly seasonality to external regressor ……
baturayo Feb 29, 2024
b62550e
Dremio: fix token support (#2028)
m1n0 Mar 6, 2024
8179c50
Bump to 3.2.3
m1n0 Mar 6, 2024
8e41a2c
[pre-commit.ci] pre-commit autoupdate (#2022)
pre-commit-ci[bot] Mar 11, 2024
91dd60f
bugfix: support attributes on multiple checks (#2032)
milanaleksic Mar 12, 2024
e3787d1
Use dbt's new access_url pattern to access cloud API (#2035)
bastienboutonnet Mar 14, 2024
c25a872
Bump to 3.2.4
m1n0 Mar 16, 2024
98c52ce
Contracts 2nd iteration (#2006)
tombaeyens Mar 16, 2024
bd04e84
Bump to 3.3.0
m1n0 Mar 16, 2024
a1a2008
feat: improved wording and tooltip formatting in simulator (#2038)
bastienboutonnet Mar 19, 2024
c20eb59
Failed rows: fix warn/fail thresholds (#2042)
m1n0 Mar 22, 2024
de1d4b4
Bump opentelemetry to 1.22 (#2043)
m1n0 Mar 22, 2024
d4b8183
Bump dev requirements (#2045)
m1n0 Mar 23, 2024
ae33e9f
Bump to 3.3.1
m1n0 Mar 24, 2024
aee8045
Rename argument in set_scan_results_file method (#2047)
ozgenbaris1 Apr 9, 2024
2e40e45
Dremio: support disableCertificateVerification option (#2049)
m1n0 Apr 9, 2024
9e95906
[pre-commit.ci] pre-commit autoupdate (#2037)
pre-commit-ci[bot] Apr 16, 2024
1d21a34
Denodo: fix connection timeout attribute (#2065)
m1n0 Apr 23, 2024
34ace6a
Update db2_data_source.py (#2063)
4rahulae Apr 23, 2024
c046af0
Bump to 3.3.2
m1n0 Apr 24, 2024
76159ca
Update autoflake precommit (#2070)
m1n0 Apr 30, 2024
062b1e2
Contracts v3 (#2067)
tombaeyens Apr 30, 2024
5e51e69
Bump to 3.3.3
tombaeyens Apr 30, 2024
31b1ab3
Fix automated monitoring, prevent duplicate queries (#2075)
m1n0 May 3, 2024
cc02c01
Hive: support scheme (#2077)
m1n0 May 7, 2024
63c73f8
Bump dev requirements (#2078)
m1n0 May 7, 2024
7866d27
Bump deps (#2079)
m1n0 May 7, 2024
8a1ce04
Bump to 3.3.4
m1n0 May 7, 2024
1819347
Failed rows: fix warn/fail thresholds for fail condition (#2084)
m1n0 May 16, 2024
09262b0
upgrade to latest version of ibm-db python client (#2076)
Antoninj May 17, 2024
5d1163c
User defined metric fail query (#2089)
m1n0 May 23, 2024
b014718
Bump to 3.3.5
m1n0 May 23, 2024
4e09b27
CLOUD-7708 - Add Snowflake CI account to pipeline for soda-core (#2088)
dakue-soda May 27, 2024
5776b5e
[CLOUD-7400] Improve memory usage (#2081)
dirkgroenen May 29, 2024
c3dc141
lower pre-commit version to support py38
dirkgroenen May 30, 2024
7e631d5
Duplicate check: fail gracefully in case of error in query (#2093)
m1n0 Jun 5, 2024
552a716
Bump requests and tox/docker (#2094)
m1n0 Jun 5, 2024
af649b9
Duplicate check: support sample exclude columns fully (#2096)
m1n0 Jun 7, 2024
a94bd47
Merge remote-tracking branch 'upstream/main'
bichitra95 Jun 16, 2024
User defined metric fail query (sodadata#2089)
* User defined metric check: support failed rows query

* Test file version as well

* Fix CI

* Make metric check cfg constructor flexible

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
m1n0 and pre-commit-ci[bot] authored May 23, 2024
commit 5d1163c17a508463ab6e0f884278ce4252f6430f
4 changes: 4 additions & 0 deletions .github/workflows/main.workflow.yml
@@ -98,6 +98,7 @@ jobs:
sudo apt-get update
ACCEPT_EULA=Y sudo apt-get install -y libsasl2-dev msodbcsql18
python -m pip install --upgrade pip
pip install requests==2.31.0
cat dev-requirements.in | grep tox | xargs pip install

- name: Test with tox
@@ -132,6 +133,7 @@ jobs:
sudo apt-get update
sudo apt-get install -y libsasl2-dev
python -m pip install --upgrade pip
pip install requests==2.31.0
cat dev-requirements.in | grep tox | xargs pip install

- name: Test with tox
@@ -166,6 +168,7 @@ jobs:
sudo apt-get update
sudo apt-get install -y libsasl2-dev
python -m pip install --upgrade pip
pip install requests==2.31.0
cat dev-requirements.in | grep tox | xargs pip install

- name: Test with tox
@@ -194,6 +197,7 @@ jobs:
sudo apt-get update
sudo apt-get install -y libsasl2-dev
python -m pip install --upgrade pip
pip install requests==2.31.0
cat dev-requirements.in | grep tox | xargs pip install

- name: Test with tox
4 changes: 4 additions & 0 deletions .github/workflows/pr.workflow.yml
@@ -78,6 +78,7 @@ jobs:
sudo apt-get update
ACCEPT_EULA=Y sudo apt-get install -y libsasl2-dev msodbcsql18
python -m pip install --upgrade pip
pip install requests==2.31.0
cat dev-requirements.in | grep tox | xargs pip install

- name: Test with tox
@@ -110,6 +111,7 @@ jobs:
sudo apt-get update
sudo apt-get install -y libsasl2-dev
python -m pip install --upgrade pip
pip install requests==2.31.0
cat dev-requirements.in | grep tox | xargs pip install

- name: Test with tox
@@ -142,6 +144,7 @@ jobs:
sudo apt-get update
sudo apt-get install -y libsasl2-dev
python -m pip install --upgrade pip
pip install requests==2.31.0
cat dev-requirements.in | grep tox | xargs pip install

- name: Test with tox
@@ -171,6 +174,7 @@ jobs:
sudo apt-get update
sudo apt-get install -y libsasl2-dev
python -m pip install --upgrade pip
pip install requests==2.31.0
cat dev-requirements.in | grep tox | xargs pip install

- name: Test with tox
2 changes: 2 additions & 0 deletions dev-requirements.in
@@ -15,3 +15,5 @@ readme-renderer~=32.0
certifi>=2022.12.07
wheel>=0.38.1
docutils<0.21 # 0.21 dropped py38 support, remove this after py38 support is gone
requests==2.31.0 # 2.32.0 is broken, does not support docker. Remove this after new version is out

8 changes: 5 additions & 3 deletions dev-requirements.txt
@@ -61,7 +61,7 @@ pathspec==0.12.1
# via black
pip-tools==7.4.1
# via -r dev-requirements.in
-platformdirs==4.2.1
+platformdirs==4.2.2
# via
# black
# virtualenv
@@ -100,7 +100,9 @@ python-dotenv==1.0.1
readme-renderer==32.0
# via -r dev-requirements.in
requests==2.31.0
-# via docker
+# via
+# -r dev-requirements.in
+# docker
schema==0.7.7
# via tbump
six==1.16.0
@@ -137,7 +139,7 @@ urllib3==1.26.18
# -r dev-requirements.in
# docker
# requests
-virtualenv==20.26.1
+virtualenv==20.26.2
# via tox
webencodings==0.5.1
# via bleach
20 changes: 18 additions & 2 deletions soda/core/soda/execution/metric/user_defined_numeric_metric.py
@@ -1,14 +1,19 @@
+from __future__ import annotations
+
+from numbers import Number
+
from soda.execution.metric.query_metric import QueryMetric
+from soda.execution.query.sample_query import SampleQuery
from soda.execution.query.user_defined_numeric_query import UserDefinedNumericQuery


class UserDefinedNumericMetric(QueryMetric):
def __init__(
self,
-data_source_scan: "DataSourceScan",
+data_source_scan: DataSourceScan,
check_name: str,
sql: str,
-check: "Check" = None,
+check: Check = None,
):
super().__init__(
data_source_scan=data_source_scan,
@@ -19,6 +24,7 @@ def __init__(
identity_parts=[sql],
)
self.sql = sql
self.check = check

def __str__(self):
return f'"{self.name}"'
@@ -38,3 +44,13 @@ def ensure_query(self):
)
self.queries.append(query)
self.data_source_scan.queries.append(query)

def create_failed_rows_sample_query(self) -> SampleQuery | None:
sampler = self.data_source_scan.scan._configuration.sampler
if sampler and isinstance(self.value, Number) and self.check.check_cfg.failed_rows_query:
if self.samples_limit > 0:
jinja_resolve = self.data_source_scan.scan.jinja_resolve
sql = jinja_resolve(self.check.check_cfg.failed_rows_query)
sample_query = SampleQuery(self.data_source_scan, self, "failed_rows", sql)

return sample_query
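Because the indentation of this hunk did not survive on this page, it is not obvious which paths through `create_failed_rows_sample_query` return a query and which return `None`. The following is a minimal sketch of the guard logic as the added lines read, with every non-sampling path returning `None` explicitly. `SampleQuerySketch` and the flat function parameters are hypothetical stand-ins for `SampleQuery` and the metric's attributes; they are not part of Soda Core.

```python
from __future__ import annotations

from dataclasses import dataclass
from numbers import Number


@dataclass
class SampleQuerySketch:
    """Hypothetical stand-in for SampleQuery; it only carries a name and the SQL."""

    name: str
    sql: str


def create_failed_rows_sample_query(
    sampler_configured: bool,
    metric_value: object,
    failed_rows_query: str | None,
    samples_limit: int,
) -> SampleQuerySketch | None:
    # Guard 1: a sampler must be configured, the metric must have produced
    # a numeric value, and the check must define a failed rows query.
    if not (sampler_configured and isinstance(metric_value, Number) and failed_rows_query):
        return None
    # Guard 2: a non-positive samples limit disables sampling.
    if samples_limit <= 0:
        return None
    return SampleQuerySketch(name="failed_rows", sql=failed_rows_query)
```

Writing the early returns out keeps the `SampleQuery | None` return annotation honest regardless of how the nested `if` blocks are laid out in the real file.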
9 changes: 8 additions & 1 deletion soda/core/soda/execution/query/user_defined_numeric_query.py
@@ -1,11 +1,13 @@
+from __future__ import annotations
+
from soda.execution.metric.metric import Metric
from soda.execution.query.query import Query


class UserDefinedNumericQuery(Query):
def __init__(
self,
-data_source_scan: "DataSourceScan",
+data_source_scan: DataSourceScan,
check_name: str,
sql: str,
metric: Metric,
@@ -22,3 +24,8 @@ def execute(self):
if self.row[index] is not None:
metric_value = float(self.row[index])
self.metric.set_value(metric_value)

sample_query = self.metric.create_failed_rows_sample_query()
if sample_query:
self.metric.queries.append(sample_query)
sample_query.execute()
2 changes: 2 additions & 0 deletions soda/core/soda/sodacl/check_cfg.py
@@ -17,6 +17,7 @@ def __init__(
name: str | None,
samples_limit: int | None = None,
samples_columns: list | None = None,
failed_rows_query: str | None = None,
):
self.source_header: str = source_header
self.source_line: str = source_line
@@ -25,6 +26,7 @@ def __init__(
self.name: str | None = name
self.samples_limit: int | None = samples_limit
self.samples_columns: list | None = samples_columns
self.failed_rows_query: str | None = failed_rows_query

def get_column_name(self) -> str | None:
pass
11 changes: 10 additions & 1 deletion soda/core/soda/sodacl/metric_check_cfg.py
@@ -32,8 +32,17 @@ def __init__(
fail_threshold_cfg: ThresholdCfg | None,
warn_threshold_cfg: ThresholdCfg | None,
samples_limit: int | None = None,
+failed_rows_query: str | None = None,
):
-super().__init__(source_header, source_line, source_configurations, location, name, samples_limit)
+super().__init__(
+source_header,
+source_line,
+source_configurations,
+location,
+name,
+samples_limit,
+failed_rows_query=failed_rows_query,
+)
self.metric_name: str = metric_name
self.metric_args: list[object] | None = metric_args
self.missing_and_valid_cfg: MissingAndValidCfg = missing_and_valid_cfg
60 changes: 42 additions & 18 deletions soda/core/soda/sodacl/sodacl_parser.py
@@ -1,6 +1,7 @@
from __future__ import annotations

import functools
import inspect
import logging
import os
import re
@@ -619,6 +620,7 @@ def __parse_metric_check(
condition = None
metric_expression = None
metric_query = None
failed_rows_query = None
samples_limit = None
samples_columns = None
training_dataset_params: TrainingDatasetParameters = TrainingDatasetParameters()
@@ -657,6 +659,13 @@ def __parse_metric_check(
f'In configuration "{configuration_key}" the metric name must match exactly the metric name in the check "{metric_name}"',
location=self.location,
)
elif configuration_key == "failed rows query" or configuration_key == "failed rows sql_file":
if configuration_key.endswith("sql_file"):
fs = file_system()
sql_file_path = fs.join(fs.dirname(self.path_stack.file_path), configuration_value.strip())
failed_rows_query = dedent(fs.file_read_as_str(sql_file_path)).strip()
else:
failed_rows_query = dedent(configuration_value).strip()
elif configuration_key.endswith("query") or configuration_key.endswith("sql_file"):
if configuration_key.endswith("sql_file"):
fs = file_system()
@@ -918,24 +927,39 @@ def __parse_metric_check(
f"Invalid syntax used in '{check_str}'. More than one check attribute is not supported. A check like this will be skipped in future versions of Soda Core"
)

-return metric_check_cfg_class(
-source_header=header_str,
-source_line=check_str,
-source_configurations=check_configurations,
-location=self.location,
-name=name,
-metric_name=metric_name,
-metric_args=metric_args,
-missing_and_valid_cfg=missing_and_valid_cfg,
-filter=filter,
-condition=condition,
-metric_expression=metric_expression,
-metric_query=metric_query,
-change_over_time_cfg=change_over_time_cfg,
-fail_threshold_cfg=fail_threshold_cfg,
-warn_threshold_cfg=warn_threshold_cfg,
-samples_limit=samples_limit,
-)
+def takes_keyword_argument(cls, keyword):
+signature = inspect.signature(cls.__init__)
+return keyword in signature.parameters
+
+# Some arguments make no sense for certain metric checks, so we only pass the ones that are supported by the given class constructor.
+# Do this instead of accepting kwargs and passing all arguments to the constructor, because it's easier to see what arguments are supported and they do not disappear in the constructor.
+all_args = {
+"source_header": header_str,
+"source_line": check_str,
+"source_configurations": check_configurations,
+"location": self.location,
+"name": name,
+"metric_name": metric_name,
+"metric_args": metric_args,
+"missing_and_valid_cfg": missing_and_valid_cfg,
+"filter": filter,
+"condition": condition,
+"metric_expression": metric_expression,
+"metric_query": metric_query,
+"change_over_time_cfg": change_over_time_cfg,
+"fail_threshold_cfg": fail_threshold_cfg,
+"warn_threshold_cfg": warn_threshold_cfg,
+"samples_limit": samples_limit,
+"failed_rows_query": failed_rows_query,
+}
+
+use_args = {}
+
+for arg in all_args.keys():
+if takes_keyword_argument(metric_check_cfg_class, arg):
+use_args[arg] = all_args[arg]
+
+return metric_check_cfg_class(**use_args)

def __parse_configuration_threshold_condition(self, value) -> ThresholdCfg | None:
if isinstance(value, str):
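The parser change above passes each config class only the keyword arguments that its constructor declares. A minimal standalone sketch of that technique follows; `BaseCfg`, `QueryCfg` and `build_with_supported_args` are hypothetical names used for illustration and do not exist in Soda Core.

```python
import inspect


class BaseCfg:
    # Hypothetical config class that does not know about failed_rows_query.
    def __init__(self, name, samples_limit=None):
        self.name = name
        self.samples_limit = samples_limit


class QueryCfg(BaseCfg):
    # Hypothetical config class that additionally accepts failed_rows_query.
    def __init__(self, name, samples_limit=None, failed_rows_query=None):
        super().__init__(name, samples_limit)
        self.failed_rows_query = failed_rows_query


def build_with_supported_args(cls, all_args):
    # Names of the parameters declared by the class constructor.
    accepted = inspect.signature(cls.__init__).parameters
    # Keep only the arguments the constructor explicitly declares.
    use_args = {key: value for key, value in all_args.items() if key in accepted}
    return cls(**use_args)


all_args = {"name": "belgium_customers", "samples_limit": 5, "failed_rows_query": "SELECT 1"}
base = build_with_supported_args(BaseCfg, all_args)    # failed_rows_query is dropped
query = build_with_supported_args(QueryCfg, all_args)  # failed_rows_query is passed
```

One caveat of this approach: a constructor declared as `__init__(self, **kwargs)` exposes only the parameter name `kwargs` in its signature, so this filter would drop every named argument for such a class.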
59 changes: 59 additions & 0 deletions soda/core/tests/data_source/test_user_defined_metric_checks.py
@@ -135,3 +135,62 @@ def test_user_defined_data_source_query_metric_with_sql_file(data_source_fixture

finally:
os.remove(path)


def test_user_defined_data_source_query_metric_check_with_fail_query(data_source_fixture: DataSourceFixture):
table_name = data_source_fixture.ensure_test_table(customers_test_table)

qualified_table_name = data_source_fixture.data_source.qualified_table_name(table_name)

scan = data_source_fixture.create_test_scan()
mock_soda_cloud = scan.enable_mock_soda_cloud()
scan.enable_mock_sampler()
scan.add_sodacl_yaml_str(
f"""
checks:
- belgium_customers = 6:
belgium_customers query: |
SELECT count(*) as belgium_customers
FROM {qualified_table_name}
WHERE country = 'BE'
failed rows query: |
SELECT *
FROM {qualified_table_name}
WHERE country != 'BE'
"""
)
scan.execute()
scan.assert_all_checks_pass()

assert mock_soda_cloud.find_failed_rows_line_count(0) == 4


def test_user_defined_data_source_query_metric_check_with_fail_query_file(data_source_fixture: DataSourceFixture):
fd, path = tempfile.mkstemp()
table_name = data_source_fixture.ensure_test_table(customers_test_table)
qualified_table_name = data_source_fixture.data_source.qualified_table_name(table_name)

scan = data_source_fixture.create_test_scan()
mock_soda_cloud = scan.enable_mock_soda_cloud()
scan.enable_mock_sampler()
try:
with os.fdopen(fd, "w") as tmp:
tmp.write(f"SELECT * FROM {qualified_table_name} WHERE country != 'BE'")

scan.add_sodacl_yaml_str(
f"""
checks:
- belgium_customers = 6:
belgium_customers query: |
SELECT count(*) as belgium_customers
FROM {qualified_table_name}
WHERE country = 'BE'
failed rows sql_file: "{path}"
"""
)
scan.execute()
scan.assert_all_checks_pass()
assert mock_soda_cloud.find_failed_rows_line_count(0) == 4

finally:
os.remove(path)
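The second test passes an absolute temporary path to `failed rows sql_file`, while the parser joins the value onto the directory of the SodaCL file. The sketch below shows why both relative and absolute values can work, under the assumption that `fs.join` and `fs.dirname` behave like `os.path.join` and `os.path.dirname`; that assumption is not confirmed by this diff. `resolve_failed_rows_sql` is a hypothetical helper, not a Soda Core function.

```python
import os
from textwrap import dedent


def resolve_failed_rows_sql(sodacl_file_path: str, sql_file: str) -> str:
    # Resolve the SQL file relative to the directory containing the SodaCL file.
    # os.path.join discards the directory part when sql_file is absolute,
    # so an absolute path is used as-is.
    sql_file_path = os.path.join(os.path.dirname(sodacl_file_path), sql_file.strip())
    with open(sql_file_path) as f:
        # Mirror the parser: dedent the file content and trim surrounding whitespace.
        return dedent(f.read()).strip()
```

With this behavior, a relative `sql_file` value is looked up next to the checks file, and an absolute value such as the one produced by `tempfile.mkstemp()` bypasses that directory entirely.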