Skip to content

Latest commit

 

History

History
177 lines (117 loc) · 8.11 KB

README.md

File metadata and controls

177 lines (117 loc) · 8.11 KB

dbt-audit-helper-ext

Extended Audit Helper solution 💪

dbt-hub support-snowflake support-bigquery support-dbt

This repository provides a collection of powerful macros designed to enhance data validation workflows that support:

  • Historical Logging: Automatically saving detailed validation results into a designated DWH table for comprehensive audit tracking
  • Latest Summary Reporting: Maintaining a concise, up-to-date summary table for quick insights into the current state of validations
  • Codegen and Scripts: Simplifying workflows, particularly valuable for migration projects by automating repetitive tasks

Data Warehouses:

  • ❄️ Snowflake (default)
  • ☁️ BigQuery

Installation

  • Add to packages.yml file:

    packages:
      - package: infinitelambda/audit_helper_ext
        version: [">=0.1.0", "<1.0.0"]
        # keep an eye on the latest version, and change it accordingly

    Or use the latest version from git:

    packages:
      - git: "https://github.com/infinitelambda/dbt-audit-helper-ext"
        revision: <release version or tag> # 0.1.0

    And run dbt deps to install the package!

  • Initialize the resources:

    dbt deps
    dbt run -s audit_helper_ext

    This step will create log table (validation_log) and the summary view on top (validation_log_report)

  • Generate the validation macros:

    Check /scripts directory for all the codegen utilities

    Firstly, we need to determine the location (database and schema) of the source tables:

    ** If all source tables are in the same location, we can use the environment variable to set these values:

    export SOURCE_SCHEMA=MY_SOURCE_SCHEMA
    export SOURCE_DATABASE=MY_SOURCE_DATABASE

    ** If having multiple locations, we can start to configure the location inside each dbt models' config block:

    {{
      config(
        ...
        audit_helper__source_database = 'MY_SOURCE_SCHEMA',
        audit_helper__source_schema = 'MY_SOURCE_DATABASE'
      )
    }}
    ...

    Then, we can start generating the validation macro files now. Let's say we need to validate all models in 03_mart directory:

    python dbt_packages/audit_helper_ext/scripts/create_validation_macros.py models/03_mart

    Or just aim to validation a specific model which is 03_mart/dim_sales:

    python dbt_packages/audit_helper_ext/scripts/create_validation_macros.py \
      models/03_mart \
      dim_sales

    Finally, check out your dbt project at the directory named macros/validation!

Validation Strategy

This repo contains the useful macros to support for saving the historical validation results into the DWH table (validation_log), together with the latest summary table (validation_log_report).

There are 3 main types of validation:

Additionally, we have the 4th type - upstream_row_count (source) which will be very useful to understand better the validtion context, for example, the result might be up to 100% matched rate but there is 0 updates in the upstream models, hence there no updates in the final table, that means we can't not say surely it was a perfect match.

Depending on projects, it might be vary in the strategy of validation. Therefore, in this package, we're suggesting 1 first approach that we've used successfully in the real-life migration project (Informatica to dbt).

Context: Our dbt project has 3 layers (staging, intermediate, and mart). Each mart model will have the independant set of upstream models, or it is the isolated pipeline for each mart model. We want to validate mart models only.

Goal: 100% matched rate ✅, >=99% is still good 🟡, and below 99% is unacceptable ❌

Pre-requisites: 2 consecutive snapshots (e.g. Day1, Day2) of both source data and mart tables

Flow:

  • Freeze the source data, so we have source__YYYYMMD1 and source__YYYYMMD2, mart__YYYYMMD1 and mart__YYYYMMD2
  • Scenario 1: Validate the fresh run against D1
    • Configure source yml to use source__YYYYMMD1
    • Run dbt to build mart tables, callled mart_dbt
    • Run validation macros to compare between mart_dbt vs mart__YYYYMMD1 👍
  • Scenario 2: Validate the incremental run against D2 based on D1
    • Configure source yml to use source__YYYYMMD2
    • Clone mart__YYYYMMD1 to mart_dbt to mimic that dbt should have the D1 data already (e.g. clone_relation)
    • Run incrementally dbt to build mart tables
    • Run validation macros to compare between mart_dbt vs mart__YYYYMMD2 👍👍

Finnally, check the validation log report, and decide what to do next steps:

🛩️ Sample report table on Snowflake:

alt text

💡 Optionally, let's build the Sheet to communicate the outcome with clent, here is the BigQuery+GGSheet sample:

alt text

Demo

How to Contribute

dbt-audit-helper-ext is an open-source dbt package. Whether you are a seasoned open-source contributor or a first-time committer, we welcome and encourage you to contribute code, documentation, ideas, or problem statements to this project.

👉 See CONTRIBUTING guideline

🌟 And finally, kudos to our beloved OG Contributors who orginally developed the macros and scripts in this package: @William, @Duc, @Csabi, @Adrien & @Dat

About Infinite Lambda

Infinite Lambda is a cloud and data consultancy. We build strategies, help organizations implement them, and pass on the expertise to look after the infrastructure.

We are an Elite Snowflake Partner, a Platinum dbt Partner, and a two-time Fivetran Innovation Partner of the Year for EMEA.

Naturally, we love exploring innovative solutions and sharing knowledge, so go ahead and:

🔧 Take a look around our Git

✏️ Browse our tech blog

We are also chatty, so:

👀 Follow us on LinkedIn

👋🏼 Or just get in touch

About IL