Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add column description support for BigQuery adapter #2179

Open
K-Oxon opened this issue Dec 25, 2024 · 1 comment
Open

feat: Add column description support for BigQuery adapter #2179

K-Oxon opened this issue Dec 25, 2024 · 1 comment
Assignees

Comments

@K-Oxon
Copy link

K-Oxon commented Dec 25, 2024

Feature description

I'd like to suggest adding support for column descriptions in the BigQuery adapter. This would allow users to define and manage column-level metadata when loading data into BigQuery tables. This feature would enhance the data documentation and governance capabilities within the dlt BigQuery integration.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

When loading data into BigQuery, having clear and accurate column descriptions is critical for data governance and documentation purposes. Currently, DLT doesn't provide a way to define column descriptions during data loading, making it difficult to maintain metadata and properly document data structures.

Problem solved:

  • Data teams must manually add column descriptions after data is loaded.
  • Risk of inconsistent or missing column documentation
  • Additional overhead of maintaining separate data dictionaries
  • Difficulty in tracking and versioning changes to column metadata

Proposed solution

Implement column description support in the BigQuery adapter by

Extending the BigQuery adapter interface to accept column descriptions during schema definition
Modify dlt/destinations/impl/bigquery/bigquery_adapter.py and dlt/destinations/impl/bigquery/bigquery.py to add column description parameters:

class BigQueryClient(SqlJobClientWithStagingDataset, SupportsStagingDestination):
    ...
    def _get_column_def_sql(self, column: TColumnSchema, table: PreparedTableSchema = None) -> str:
        column_def_sql = super()._get_column_def_sql(column, table)
        if column.get(ROUND_HALF_EVEN_HINT, False):
            column_def_sql += " OPTIONS (rounding_mode='ROUND_HALF_EVEN')"
        if column.get(ROUND_HALF_AWAY_FROM_ZERO_HINT, False):
            column_def_sql += " OPTIONS (rounding_mode='ROUND_HALF_AWAY_FROM_ZERO')"
        
        # adding this
        if "description" in column:
            column_def_sql += f" OPTIONS (description='{column['description']}')"
        
        return column_def_sql
    ...
def bigquery_adapter(
    data: Any,
    partition: TColumnNames = None,
    cluster: TColumnNames = None,
    round_half_away_from_zero: TColumnNames = None,
    round_half_even: TColumnNames = None,
    table_description: Optional[str] = None,
    table_expiration_datetime: Optional[str] = None,
    insert_api: Optional[Literal["streaming", "default"]] = None,
    autodetect_schema: Optional[bool] = None,
    partition_expiration_days: Optional[int] = None,
    column_descriptions: Optional[Dict[str, str]] = None,  # Added parameter
) -> DltResource:
    ...
    # adding column description
    if column_descriptions:
        for column_name, description in column_descriptions.items():
            if column_name in column_hints:
                column_hints[column_name]["description"] = description
            else:
                column_hints[column_name] = {"name": column_name, "description": description}
    
    if partition:
        ...

The implementation would match BigQuery's native support for column descriptions, while maintaining dlt's current functionality and performance.
An example usage could look like this

data = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
column_descriptions = {"id": "User ID", "name": "User Name"}

pipeline = dlt.pipeline(
    pipeline_name="description_test",
    destination="bigquery",
    dataset_name="description_test",
    export_schema_path="dlt_schemas/export",
)
load_info = pipeline.run(
    bigquery_adapter(data, column_descriptions=column_descriptions),
    table_name="users",
    write_disposition="replace",
    refresh="drop_data",
)

print(load_info)

Related issues

No related issues found.

@sh-rp
Copy link
Collaborator

sh-rp commented Jan 2, 2025

Note for implementation in case we do it: TColumnSchema already has a description value, so we would not need to extend the adapter but rather use that and translate it to the correct commands when creating or altering tables in bigquey.

@sh-rp sh-rp self-assigned this Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

2 participants