Data source contributing docs (#1546)
* Data source contributing docs

* Minor edits for style and grammar, formatting.

Co-authored-by: Vijay Kiran <[email protected]>
Co-authored-by: Janet Revell <[email protected]>
3 people authored Aug 19, 2022
1 parent 989fa17 commit 234598a
Showing 2 changed files with 75 additions and 7 deletions.
66 changes: 66 additions & 0 deletions CONTRIBUTING-DATA-SOURCE.md
@@ -0,0 +1,66 @@
# Contribute support for a data source

Thanks for considering contributing to Soda Core's library of supported data sources!

To make a data source available to our user community, we require that you provide the following:
- a **working data source**
  - locally: provide a command that launches a Docker container with the data source, using either a Dockerfile or a docker-compose file.
  - in the cloud: provide a service account to connect to.
- a **Python library** for the data source connection. Usually, you need to install an existing official connector library.
- a **data source package** that handles the following:
- get connection properties
- connect to the data source
- access any data source-specific code to ensure full support

## Implementation basics

**Data source file and folder structure**
- The package goes in `soda/xy`, following the same structure as the other data source packages.
- The main file is `soda/xy/soda/data_sources/xy_data_source.py`, with an `XyDataSource(DataSource)` class.
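
One possible layout (the `setup.py` and exact file names here are illustrative; mirror an existing package such as `soda/postgres` for the authoritative structure):

```
soda/xy/
├── setup.py
├── soda/data_sources/xy_data_source.py
└── tests/test_xy.py
```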

**Basic code in the data source class**
- Implement the `__init__` method to retrieve and save connection properties.
- Implement the `connect` method that returns a PEP 249-compatible connection object.
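
The two methods above can be sketched roughly as follows. This is a hedged illustration, not Soda Core's exact API: the class shape and property names are assumptions, and `sqlite3` stands in for the data source's official PEP 249 connector library.

```python
# Minimal sketch of a data source class; names are illustrative.
# sqlite3 stands in for the data source's official PEP 249 connector.
import sqlite3


class XyDataSource:
    def __init__(self, data_source_name: str, connection_properties: dict):
        # Retrieve and save connection properties for later use in connect().
        self.data_source_name = data_source_name
        self.host = connection_properties.get("host", "localhost")
        self.database = connection_properties.get("database", ":memory:")
        self.connection = None

    def connect(self):
        # Return a PEP 249-compatible connection object.
        # A real implementation would call the official connector instead,
        # e.g. xy.connect(host=self.host, database=self.database).
        self.connection = sqlite3.connect(self.database)
        return self.connection
```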

**Required overrides**
- Type mappings; refer to the base DataSource class comments for more detail.
- `SCHEMA_CHECK_TYPES_MAPPING`
- `SQL_TYPE_FOR_CREATE_TABLE_MAP`
- `SQL_TYPE_FOR_SCHEMA_CHECK_MAP`
- `NUMERIC_TYPES_FOR_PROFILING`
- `TEXT_TYPES_FOR_PROFILING`
- `safe_connection_data()` method - returns non-sensitive connection details, for debugging and telemetry
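
A hedged sketch of what these overrides can look like for a hypothetical `xy` data source. The stand-in `DataType` class and the concrete type names are assumptions for illustration; the real keys come from Soda Core's `DataType` and the data source's own type system.

```python
# Illustrative override values for a hypothetical "xy" data source.
class DataType:
    # Stand-in for Soda Core's DataType constants.
    TEXT = "text"
    INTEGER = "integer"


class XyDataSource:
    # Synonym types accepted in schema checks.
    SCHEMA_CHECK_TYPES_MAPPING = {"character varying": ["varchar"]}
    # Types used when creating test tables.
    SQL_TYPE_FOR_CREATE_TABLE_MAP = {
        DataType.TEXT: "VARCHAR(255)",
        DataType.INTEGER: "INT",
    }
    # Types as the data source reports them when reading a schema.
    SQL_TYPE_FOR_SCHEMA_CHECK_MAP = {
        DataType.TEXT: "character varying",
        DataType.INTEGER: "integer",
    }
    NUMERIC_TYPES_FOR_PROFILING = ["integer"]
    TEXT_TYPES_FOR_PROFILING = ["character varying", "varchar"]

    def safe_connection_data(self):
        # Return only non-sensitive connection details (no passwords or keys).
        return ["xy", "localhost", 5432, "my_database"]
```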

**Optional overrides, frequent**
- `sql_get_table_names_with_count()` - SQL query to retrieve all tables and their respective counts. This is usually data source-specific.
- `default_casify_*()` - indicates how the data source changes identifier case by default when retrieving the respective identifiers.
- Table/column metadata methods
- `column_metadata_columns()`
- `column_metadata_catalog_column()`
- Regex support
- `escape_regex()` or `escape_string()` to ensure correct regex formatting.
- `regex_replace_flags()` - for data sources that support regex replace flags; for example, `g` for `global`.
- Identifier quoting - `quote_*()` methods handle identifier quoting; `qualified_table_name()` creates a fully-qualified table name.
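
To illustrate the casify and quoting overrides, here is a hedged sketch for a hypothetical data source that upper-cases unquoted identifiers (as some warehouses do); the method names follow the list above, but the exact signatures are assumptions:

```python
# Hypothetical casify/quoting overrides; signatures are illustrative.
class XyDataSource:
    def __init__(self, schema: str):
        self.schema = schema

    def default_casify_table_name(self, identifier: str) -> str:
        # This hypothetical data source upper-cases unquoted identifiers.
        return identifier.upper()

    def quote_table(self, table_name: str) -> str:
        # Double quotes are a common identifier-quoting convention.
        return f'"{table_name}"'

    def qualified_table_name(self, table_name: str) -> str:
        # Prefix with the schema when it cannot be set globally on the connection.
        return f"{self.schema}.{table_name}" if self.schema else table_name
```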

**Optional overrides, infrequent**
- Any of the `sql_*` methods when a particular data source needs a specific query to get a desired result.

**Further considerations**
- How are schemas (or the equivalent) handled? Can they be set globally for the connection, or do they need to be prefixed in all the queries?


## Test the data source support

**Required tests**
- Create a `soda/xy/tests/test_xy.py` file with a `test_xy()` method. Use this file for any data source-specific tests.
- Implement `XyDataSourceFixture` for everything related to tests:
- `_build_configuration_dict()` - connection configuration the tests use
  - `_create_schema_if_not_exists_sql()` / `_drop_schema_if_exists_sql()` - DDL to create or drop a new schema or database
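
A fixture sketch for a hypothetical `xy` data source follows; the method names come from the list above, but the configuration keys and SQL dialect details are assumptions:

```python
# Illustrative test-fixture sketch; configuration keys and SQL are assumptions.
class XyDataSourceFixture:
    def __init__(self, schema_name: str):
        self.schema_name = schema_name

    def _build_configuration_dict(self) -> dict:
        # Connection configuration the tests use.
        return {
            "data_source xy": {
                "type": "xy",
                "host": "localhost",
                "database": "sodacore",
            }
        }

    def _create_schema_if_not_exists_sql(self) -> str:
        return f"CREATE SCHEMA IF NOT EXISTS {self.schema_name}"

    def _drop_schema_if_exists_sql(self) -> str:
        return f"DROP SCHEMA IF EXISTS {self.schema_name} CASCADE"
```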

**To test the data source**

1. Create a `.env` file based on `.env.example` and add the appropriate variables for the data source.
2. Change the `test_data_source` variable to the data source you are testing.
3. Run the tests using `pytest`.
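
For illustration, a hypothetical `.env` fragment for a data source named `xy` (every variable name besides `test_data_source` is an assumption; `.env.example` lists the real ones):

```
test_data_source=xy
XY_HOST=localhost
XY_DATABASE=sodacore
XY_USERNAME=soda
XY_PASSWORD=secret
```
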
16 changes: 9 additions & 7 deletions soda/core/soda/execution/data_source.py
@@ -103,13 +103,9 @@ def build_default_formats():
 
 class DataSource:
     """
-    Implementing a DataSource:
-    @m1n0, can you add a checklist here of places where DataSource implementors need to make updates to add
-    a new DataSource?
-    Validation of the connection configuration properties:
-    The DataSource impl is only responsible to raise an exception with an appropriate message in te #connect()
-    See that abstract method below for more details.
+    Data source implementation.
+    For documentation on implementing a new DataSource see the CONTRIBUTING-DATA-SOURCE.md document.
     """
 
     # Maps synonym types for the convenience of use in checks.
@@ -120,6 +116,8 @@ class DataSource:
         "timestamp without time zone": ["timestamp"],
         "timestamp with time zone": ["timestamptz"],
     }
+
+    # Supported data types used in create statements. These are used in tests for creating test data and do not affect the actual library functionality.
     SQL_TYPE_FOR_CREATE_TABLE_MAP: dict = {
         DataType.TEXT: "VARCHAR(255)",
         DataType.INTEGER: "INT",
@@ -131,6 +129,7 @@ class DataSource:
         DataType.BOOLEAN: "BOOLEAN",
     }
 
+    # Supported data types as returned by the given data source when retrieving dataset schema. Used in schema checks.
     SQL_TYPE_FOR_SCHEMA_CHECK_MAP = {
         DataType.TEXT: "character varying",
         DataType.INTEGER: "integer",
@@ -142,6 +141,7 @@ class DataSource:
         DataType.BOOLEAN: "boolean",
     }
 
+    # Indicate which numeric/text data types can be used for profiling checks.
     NUMERIC_TYPES_FOR_PROFILING = ["integer", "double precision", "double"]
     TEXT_TYPES_FOR_PROFILING = ["character varying", "varchar", "text"]
 
@@ -260,10 +260,12 @@ def is_same_type_in_schema_check(self, expected_type: str, actual_type: str):
 
     @staticmethod
     def column_metadata_columns() -> list:
+        """Columns to be used for retrieving column metadata."""
         return ["column_name", "data_type", "is_nullable"]
 
     @staticmethod
     def column_metadata_catalog_column() -> str:
+        """Column to be used as a 'database' equivalent."""
         return "table_catalog"
 
     ######################
@@ -930,7 +932,7 @@ def default_casify_type_name(self, identifier: str) -> str:
     def safe_connection_data(self):
         """Return non-critically sensitive connection details.
-        Useful for debugging.
+        Useful for debugging and telemetry.
         """
         # to be overridden by subclass
