Data source contributing docs (#1546)
* Data source contributing docs

* Minor edits for style and grammar, formatting.

Co-authored-by: Vijay Kiran <[email protected]>
Co-authored-by: Janet Revell <[email protected]>
3 people authored Aug 19, 2022
1 parent 989fa17 commit 234598a
Showing 2 changed files with 75 additions and 7 deletions.
66 changes: 66 additions & 0 deletions CONTRIBUTING-DATA-SOURCE.md
@@ -0,0 +1,66 @@
# Contribute support for a data source

Thanks for considering contributing to Soda Core's library of supported data sources!

To make a data source available to our user community, we require that you provide the following:
- a **working data source**
  - locally: provide a command that launches a Docker container with the data source, using either a Dockerfile or a docker-compose file.
  - in the cloud: provide a service account to connect to.
- a **Python library** for the data source connection. Usually, you need to install an existing official connector library.
- a **data source package** that handles the following:
- get connection properties
- connect to the data source
- access any data source-specific code to ensure full support

## Implementation basics

**Data source file and folder structure**
- The package goes in `soda/xy`, following the same structure as the other data source packages.
- The main file is `soda/xy/soda/data_sources/xy_data_source.py`, with an `XyDataSource(DataSource)` class.
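
One possible layout (the `setup.py` and exact file names here are illustrative; mirror an existing package such as `soda/postgres` for the authoritative structure):

```
soda/xy/
├── setup.py
├── soda/data_sources/xy_data_source.py
└── tests/test_xy.py
```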

**Basic code in the data source class**
- Implement the `__init__` method to retrieve and save connection properties.
- Implement the `connect` method that returns a PEP 249-compatible connection object.
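
The two methods above can be sketched roughly as follows. This is a hedged illustration, not Soda Core's exact API: the class shape and property names are assumptions, and `sqlite3` stands in for the data source's official PEP 249 connector library.

```python
# Minimal sketch of a data source class; names are illustrative.
# sqlite3 stands in for the data source's official PEP 249 connector.
import sqlite3


class XyDataSource:
    def __init__(self, data_source_name: str, connection_properties: dict):
        # Retrieve and save connection properties for later use in connect().
        self.data_source_name = data_source_name
        self.host = connection_properties.get("host", "localhost")
        self.database = connection_properties.get("database", ":memory:")
        self.connection = None

    def connect(self):
        # Return a PEP 249-compatible connection object.
        # A real implementation would call the official connector instead,
        # e.g. xy.connect(host=self.host, database=self.database).
        self.connection = sqlite3.connect(self.database)
        return self.connection
```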

**Required overrides**
- Type mappings; refer to the base DataSource class comments for more detail.
- `SCHEMA_CHECK_TYPES_MAPPING`
- `SQL_TYPE_FOR_CREATE_TABLE_MAP`
- `SQL_TYPE_FOR_SCHEMA_CHECK_MAP`
- `NUMERIC_TYPES_FOR_PROFILING`
- `TEXT_TYPES_FOR_PROFILING`
- `safe_connection_data()` method - returns non-sensitive connection details, for debugging and telemetry
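
A hedged sketch of what these overrides can look like for a hypothetical `xy` data source. The stand-in `DataType` class and the concrete type names are assumptions for illustration; the real keys come from Soda Core's `DataType` and the data source's own type system.

```python
# Illustrative override values for a hypothetical "xy" data source.
class DataType:
    # Stand-in for Soda Core's DataType constants.
    TEXT = "text"
    INTEGER = "integer"


class XyDataSource:
    # Synonym types accepted in schema checks.
    SCHEMA_CHECK_TYPES_MAPPING = {"character varying": ["varchar"]}
    # Types used when creating test tables.
    SQL_TYPE_FOR_CREATE_TABLE_MAP = {
        DataType.TEXT: "VARCHAR(255)",
        DataType.INTEGER: "INT",
    }
    # Types as the data source reports them when reading a schema.
    SQL_TYPE_FOR_SCHEMA_CHECK_MAP = {
        DataType.TEXT: "character varying",
        DataType.INTEGER: "integer",
    }
    NUMERIC_TYPES_FOR_PROFILING = ["integer"]
    TEXT_TYPES_FOR_PROFILING = ["character varying", "varchar"]

    def safe_connection_data(self):
        # Return only non-sensitive connection details (no passwords or keys).
        return ["xy", "localhost", 5432, "my_database"]
```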

**Optional overrides, frequent**
- `sql_get_table_names_with_count()` - SQL query to retrieve all tables and their respective counts. This is usually data source-specific.
- `default_casify_*()` - indicates how the data source changes identifier case by default when retrieving the respective identifiers.
- Table/column metadata methods
- `column_metadata_columns()`
- `column_metadata_catalog_column()`
- Regex support
- `escape_regex()` or `escape_string()` to ensure correct regex formatting.
- `regex_replace_flags()` - for data sources that support regex replace flags; for example, `g` for `global`.
- Identifier quoting - `quote_*()` methods handle identifier quoting; `qualified_table_name()` creates a fully-qualified table name.
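
To illustrate the casify and quoting overrides, here is a hedged sketch for a hypothetical data source that upper-cases unquoted identifiers (as some warehouses do); the method names follow the list above, but the exact signatures are assumptions:

```python
# Hypothetical casify/quoting overrides; signatures are illustrative.
class XyDataSource:
    def __init__(self, schema: str):
        self.schema = schema

    def default_casify_table_name(self, identifier: str) -> str:
        # This hypothetical data source upper-cases unquoted identifiers.
        return identifier.upper()

    def quote_table(self, table_name: str) -> str:
        # Double quotes are a common identifier-quoting convention.
        return f'"{table_name}"'

    def qualified_table_name(self, table_name: str) -> str:
        # Prefix with the schema when it cannot be set globally on the connection.
        return f"{self.schema}.{table_name}" if self.schema else table_name
```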

**Optional overrides, infrequent**
- Any of the `sql_*` methods when a particular data source needs a specific query to get a desired result.

**Further considerations**
- How are schemas (or the equivalent) handled? Can they be set globally for the connection, or do they need to be prefixed in all the queries?


## Test the data source support

**Required tests**
- Create a `soda/xy/tests/test_xy.py` file with a `test_xy()` method. Use this file for any data source-specific tests.
- Implement `XyDataSourceFixture` for everything related to tests:
- `_build_configuration_dict()` - connection configuration the tests use
  - `_create_schema_if_not_exists_sql()` / `_drop_schema_if_exists_sql()` - DDL to create or drop a new schema or database
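
A fixture sketch for a hypothetical `xy` data source follows; the method names come from the list above, but the configuration keys and SQL dialect details are assumptions:

```python
# Illustrative test-fixture sketch; configuration keys and SQL are assumptions.
class XyDataSourceFixture:
    def __init__(self, schema_name: str):
        self.schema_name = schema_name

    def _build_configuration_dict(self) -> dict:
        # Connection configuration the tests use.
        return {
            "data_source xy": {
                "type": "xy",
                "host": "localhost",
                "database": "sodacore",
            }
        }

    def _create_schema_if_not_exists_sql(self) -> str:
        return f"CREATE SCHEMA IF NOT EXISTS {self.schema_name}"

    def _drop_schema_if_exists_sql(self) -> str:
        return f"DROP SCHEMA IF EXISTS {self.schema_name} CASCADE"
```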

**To test the data source**

1. Create a `.env` file based on `.env.example` and add the appropriate variables for the data source.
2. Change the `test_data_source` variable to the data source you are testing.
3. Run the tests using `pytest`.
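
For illustration, a hypothetical `.env` fragment for a data source named `xy` (every variable name besides `test_data_source` is an assumption; `.env.example` lists the real ones):

```
test_data_source=xy
XY_HOST=localhost
XY_DATABASE=sodacore
XY_USERNAME=soda
XY_PASSWORD=secret
```
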
16 changes: 9 additions & 7 deletions soda/core/soda/execution/data_source.py
@@ -103,13 +103,9 @@ def build_default_formats():
 
 class DataSource:
     """
-    Implementing a DataSource:
-    @m1n0, can you add a checklist here of places where DataSource implementors need to make updates to add
-    a new DataSource?
-    Validation of the connection configuration properties:
-    The DataSource impl is only responsible to raise an exception with an appropriate message in te #connect()
-    See that abstract method below for more details.
+    Data source implementation.
+    For documentation on implementing a new DataSource see the CONTRIBUTING-DATA-SOURCE.md document.
     """
 
     # Maps synonym types for the convenience of use in checks.
@@ -120,6 +116,8 @@ class DataSource:
         "timestamp without time zone": ["timestamp"],
         "timestamp with time zone": ["timestamptz"],
     }
+
+    # Supported data types used in create statements. These are used in tests for creating test data and do not affect the actual library functionality.
     SQL_TYPE_FOR_CREATE_TABLE_MAP: dict = {
         DataType.TEXT: "VARCHAR(255)",
         DataType.INTEGER: "INT",
@@ -131,6 +129,7 @@ class DataSource:
         DataType.BOOLEAN: "BOOLEAN",
     }
 
+    # Supported data types as returned by the given data source when retrieving dataset schema. Used in schema checks.
     SQL_TYPE_FOR_SCHEMA_CHECK_MAP = {
         DataType.TEXT: "character varying",
         DataType.INTEGER: "integer",
@@ -142,6 +141,7 @@ class DataSource:
         DataType.BOOLEAN: "boolean",
     }
 
+    # Indicate which numeric/text data types can be used for profiling checks.
     NUMERIC_TYPES_FOR_PROFILING = ["integer", "double precision", "double"]
     TEXT_TYPES_FOR_PROFILING = ["character varying", "varchar", "text"]
 
@@ -260,10 +260,12 @@ def is_same_type_in_schema_check(self, expected_type: str, actual_type: str):
 
     @staticmethod
     def column_metadata_columns() -> list:
+        """Columns to be used for retrieving column metadata."""
         return ["column_name", "data_type", "is_nullable"]
 
     @staticmethod
     def column_metadata_catalog_column() -> str:
+        """Column to be used as a 'database' equivalent."""
         return "table_catalog"
 
     ######################
@@ -930,7 +932,7 @@ def default_casify_type_name(self, identifier: str) -> str:
     def safe_connection_data(self):
         """Return non-critically sensitive connection details.
-        Useful for debugging.
+        Useful for debugging and telemetry.
         """
         # to be overridden by subclass
