Review dbt data validation failures on transform_warehouse #3684

Closed
erikamov opened this issue Feb 10, 2025 · 22 comments

erikamov commented Feb 10, 2025

User story / feature request

Review dbt data validation failures on transform_warehouse:

  • accepted_values_int_payments__authorisations_deduped_request_type__AUTHORISATION__DEBT_RECOVERY_AUTHCHECK__DEBT_RECOVERY_REVERSAL__CARD_CHECK

    • models/intermediate/payments/_int_payments.yml
  • dbt_utils_accepted_range_fct_daily_schedule_feed_validation_notices_date__DATE_2024_03_26___DATE_2024_01_20_

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • dbt_utils_accepted_range_int_payments__settlements_to_aggregations_net_settled_amount_dollars__True__0

    • models/intermediate/payments/_int_payments.yml
  • dbt_utils_expression_is_true_int_gtfs_schedule__stop_times_grouped_num_approximate_timepoint_stop_times_num_exact_timepoint_stop_times_num_stop_times

    • models/intermediate/gtfs/_int_gtfs.yaml
  • dbt_utils_expression_is_true_int_gtfs_schedule__stop_times_grouped__num_gtfs_flex_stop_times_0_is_gtfs_flex_trip

    • models/intermediate/gtfs/_int_gtfs.yaml
  • dbt_utils_mutually_exclusive_ranges_dim_contract_attachments_required___valid_from__source_record_id___valid_to

    • models/mart/transit_database/_mart_transit_database.yml
  • dbt_utils_expression_is_true_int_payments__micropayments_adjustments_refunds_joined_micropayment_refund_amount_charge_amount

    • models/intermediate/payments/_int_payments.yml
  • dbt_utils_unique_combination_of_columns_base_tts_services_ct_services_map_ct_key__ct_date

    • models/staging/transit_database/base/_base_transit_database.yml
  • dbt_utils_unique_combination_of_columns_base_tts_services_ct_services_map_tts_key__tts_date

    • models/staging/transit_database/base/_base_transit_database.yml
  • not_null_dim_calendar_service_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_schedule_feeds_feed_timezone

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_transfers_from_stop_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_transfers_to_stop_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_trips_route_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_trips_trip_id

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_fct_daily_organization_combined_guideline_checks_status

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_daily_schedule_feed_validation_notices_code

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_daily_schedule_feed_validation_notices_validation_validator_version

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_daily_schedule_feed_validation_notices_severity

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_daily_service_combined_guideline_checks_status

    • models/mart/gtfs_quality/_mart_gtfs_quality.yml
  • not_null_fct_vehicle_locations_grouped_trip_instance_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • not_null_fct_vehicle_locations_trip_instance_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • not_null_int_gtfs_quality__guideline_checks_long_status

    • models/intermediate/gtfs_quality/_int_gtfs_quality.yml
  • relationships_fct_daily_rt_feed_files_schedule_to_use_for_rt_validation_gtfs_dataset_key__key__ref_dim_gtfs_datasets_

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • unique_dim_calendar__gtfs_key

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • unique_dim_contract_attachments_key

    • models/mart/transit_database/_mart_transit_database.yml
  • unique_dim_fare_media__gtfs_key

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • unique_fct_elavon__transactions_trn_ref_num

    • models/mart/payments/_payments.yml
  • unique_fct_payments_deposit_transactions_trn_ref_num

    • models/mart/payments/_payments.yml
  • unique_fct_payments_aggregations_aggregation_id

    • models/mart/payments/_payments.yml
  • unique_int_elavon__deposit_transactions_trn_ref_num

    • models/staging/payments/elavon/_elavon.yml
  • unique_int_gtfs_quality__guideline_checks_long_key

    • models/intermediate/gtfs_quality/_int_gtfs_quality.yml
  • unique_int_gtfs_quality__rt_validation_outcomes_key

    • models/intermediate/gtfs_quality/_int_gtfs_quality.yml
  • unique_int_payments__settlements_to_aggregations_aggregation_id

    • models/intermediate/payments/_int_payments.yml
  • unique_int_payments__settlements_to_aggregations_retrieval_reference_number

    • models/intermediate/payments/_int_payments.yml
  • unique_miles_traveled_location_name

    • seeds/_seeds.yml
  • unique_miles_traveled_off_location_name

    • seeds/_seeds.yml
  • unique_int_payments__micropayments_adjustments_refunds_joined_micropayment_id

    • models/intermediate/payments/_int_payments.yml
  • unique_proportion_fct_vehicle_locations_grouped_0_9999__trip_instance_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • unique_proportion_fct_vehicle_locations_0_9999__trip_instance_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • unique_fct_service_alerts_messages_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • validate_guidelines_checks_rows_per_check_matches_index

    • tests/validate_guidelines_checks_rows_per_check_matches_index.sql
  • dbt_utils_unique_combination_of_columns_int_transit_database__service_components_unnested_service_key__product_key__component_key__dt

    • models/intermediate/transit_database/_int_transit_database.yml
  • not_null_dim_calendar_end_date

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_friday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_saturday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_start_date

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_sunday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_wednesday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_calendar_thursday

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_fare_attributes_transfers

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_feed_info_feed_publisher_name

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_shapes_shape_pt_sequence

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_shapes_shape_pt_lon

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_stop_times_stop_sequence

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_dim_transfers_transfer_type

    • models/mart/gtfs/_mart_gtfs_dims.yml
  • not_null_fct_daily_schedule_feeds_gtfs_dataset_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml
  • not_null_fct_schedule_feed_downloads_gtfs_dataset_key

    • models/mart/gtfs/_mart_gtfs_fcts.yml

Acceptance Criteria

New tickets created or referenced.

erikamov self-assigned this Feb 10, 2025
erikamov added the bug label Feb 10, 2025
erikamov changed the title from "Fix DBT test failures on transform_warehouse" to "Fix dbt data validation failures on transform_warehouse" Feb 10, 2025
erikamov changed the title from "Fix dbt data validation failures on transform_warehouse" to "Review dbt data validation failures on transform_warehouse" Feb 10, 2025

erikamov commented Feb 10, 2025

I will be adding comments with more details about each failure.

  • accepted_values_int_payments__authorisations_deduped_request_type__AUTHORISATION__DEBT_RECOVERY_AUTHCHECK__DEBT_RECOVERY_REVERSAL__CARD_CHECK

    Table: staging.int_payments__authorisations_deduped
    Column: request_type
    Accepted values: 'AUTHORISATION', 'DEBT_RECOVERY_AUTHCHECK', 'DEBT_RECOVERY_REVERSAL', 'CARD_CHECK'

    Failing because of 2 records with request_type = 'Reversal' from MST on 2024-12-05 and ATN on 2024-12-16

 select *
   from staging.int_payments__authorisations_deduped
 where request_type not in ('AUTHORISATION', 'DEBT_RECOVERY_AUTHCHECK', 'DEBT_RECOVERY_REVERSAL', 'CARD_CHECK')

Created issue
#3716

erikamov removed the bug label Feb 11, 2025

erikamov commented Feb 11, 2025

  • dbt_utils_accepted_range_int_payments__settlements_to_aggregations_net_settled_amount_dollars__True__0
    Table: staging.int_payments__settlements_to_aggregations
    Column: net_settled_amount_dollars
    Accepted value >= 0

    Failing because 2 records have negative values: CCJPA -4 on 2024-08-21 and -6 on 2024-12-17

select *
from staging.int_payments__settlements_to_aggregations
where net_settled_amount_dollars < 0
  • unique_int_payments__settlements_to_aggregations_aggregation_id
    Table: staging.int_payments__settlements_to_aggregations
    Column: aggregation_id
    Unique values

    Failing because of 2 duplicated aggregation_id from MST between 2024-11-30 and 2024-12-03

select *
from staging.int_payments__settlements_to_aggregations
where aggregation_id in (select aggregation_id
                         from staging.int_payments__settlements_to_aggregations
                         group by aggregation_id
                         having count(*) > 1)
  • unique_int_payments__settlements_to_aggregations_retrieval_reference_number
    Table: staging.int_payments__settlements_to_aggregations
    Column: retrieval_reference_number
    Unique values

    Failing because of 8 duplicated retrieval_reference_number from MST and ATN

select *
from staging.int_payments__settlements_to_aggregations
where retrieval_reference_number in (select retrieval_reference_number
                         from staging.int_payments__settlements_to_aggregations
                         group by retrieval_reference_number
                         having count(*) > 1)

Added to ticket #3590

erikamov commented Feb 11, 2025

  • unique_int_payments__micropayments_adjustments_refunds_joined_micropayment_id
    Table: staging.int_payments__micropayments_adjustments_refunds_joined
    Column: micropayment_id
    Unique values

    Failing because of 2 duplicated micropayment_id from CCJPA 2024-11-19 and MST 2024-08-27

select *
from staging.int_payments__micropayments_adjustments_refunds_joined
where micropayment_id in (select micropayment_id
                            from staging.int_payments__micropayments_adjustments_refunds_joined
                           group by micropayment_id
                          having count(*) > 1)

#3589

  • dbt_utils_expression_is_true_int_payments__micropayments_adjustments_refunds_joined_micropayment_refund_amount_charge_amount
    Table: staging.int_payments__micropayments_adjustments_refunds_joined
    Columns: micropayment_refund_amount and charge_amount
    micropayment_refund_amount <= charge_amount

    Failing because of 4 records where micropayment_refund_amount > charge_amount
    CCJPA on 2024-08-09, 24 > 20
    ATN on 2024-08-27, 22.5 > 7.5
    SBMTD on 2024-02-23, 1.7 > 0.85
    MST on 2025-01-14, 4 > 2

select *
from staging.int_payments__micropayments_adjustments_refunds_joined
where micropayment_refund_amount > charge_amount

#3616

erikamov commented Feb 11, 2025

  • dbt_utils_accepted_range_fct_daily_schedule_feed_validation_notices_date__DATE_2024_03_26___DATE_2024_01_20_

    Table: mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
    Column: date

    • accepted_range:
      min_value: "DATE'2021-04-16'"
      max_value: "DATE'2022-09-14'"
      where: validation_validator_version = 'v2.0.0'
    • accepted_range:
      min_value: "DATE'2022-09-15'"
      max_value: "DATE'2022-11-15'"
      where: validation_validator_version = 'v3.1.1'
    • accepted_range:
      min_value: "DATE'2022-11-16'"
      max_value: "DATE'2023-08-31'"
      where: validation_validator_version = 'v4.0.0'
    • accepted_range:
      min_value: "DATE'2023-09-01'"
      max_value: "DATE'2024-01-19'"
      where: validation_validator_version = 'v4.1.0'
    • accepted_range:
      min_value: "DATE'2024-01-20'"
      max_value: "DATE'2024-03-26'"
      where: validation_validator_version = 'v4.2.0'
    • accepted_range:
      min_value: "DATE'2024-03-27'"
      where: validation_validator_version = 'v5.0.0'

    Failing because there are 29,140 records with validation_validator_version = 'v4.2.0' and date between 2024-03-27 and 2024-07-29 (outcome_extract_dt between 2024-01-20 and 2024-02-23)

select date, outcome_extract_dt, validation_validator_version, code, count(*) as records_count
  from mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
 where date between '2024-01-20' and '2024-03-27'
   and validation_validator_version != 'v4.2.0'
 group by date, outcome_extract_dt, validation_validator_version, code

#3633
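
As a rough illustration of what the per-version accepted_range checks above assert, here is a minimal Python sketch. The date windows are copied from the config above; the function name `out_of_window` and the sample rows are hypothetical, and the bounds are treated as inclusive on the assumption that this matches the dbt_utils default.

```python
from datetime import date

# Date window per validator version, copied from the accepted_range config above.
# None means the bound is open (v5.0.0 has no max_value).
VERSION_WINDOWS = {
    "v2.0.0": (date(2021, 4, 16), date(2022, 9, 14)),
    "v3.1.1": (date(2022, 9, 15), date(2022, 11, 15)),
    "v4.0.0": (date(2022, 11, 16), date(2023, 8, 31)),
    "v4.1.0": (date(2023, 9, 1), date(2024, 1, 19)),
    "v4.2.0": (date(2024, 1, 20), date(2024, 3, 26)),
    "v5.0.0": (date(2024, 3, 27), None),
}


def out_of_window(row_date, version):
    """Return True when a row's date falls outside the window for its version."""
    low, high = VERSION_WINDOWS[version]
    if row_date < low:
        return True
    return high is not None and row_date > high


# Hypothetical sample rows: (date, validation_validator_version).
rows = [
    (date(2024, 2, 1), "v4.2.0"),   # inside the v4.2.0 window
    (date(2024, 3, 27), "v4.2.0"),  # one day past max_value, like the failing records
    (date(2024, 7, 29), "v5.0.0"),  # inside the open-ended v5.0.0 window
]
failing = [row for row in rows if out_of_window(*row)]
```

Only the second sample row ends up in `failing`, which is the same shape of violation as the v4.2.0 records dated after 2024-03-26.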

  • not_null_fct_daily_schedule_feed_validation_notices_code / not_null_fct_daily_schedule_feed_validation_notices_validation_validator_version / not_null_fct_daily_schedule_feed_validation_notices_severity

    Table: mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
    Columns: code, validation_validator_version, and severity

    Failing because of 23 records without code, validation_validator_version, and severity, dates 2025-01-26 and 2025-02-04

select *
from mart_gtfs_quality.fct_daily_schedule_feed_validation_notices
where code is null

#3637

erikamov commented Feb 11, 2025

  • dbt_utils_expression_is_true_int_gtfs_schedule__stop_times_grouped_num_approximate_timepoint_stop_times_num_exact_timepoint_stop_times_num_stop_times

    Table: staging.int_gtfs_schedule__stop_times_grouped
    expression: num_approximate_timepoint_stop_times + num_exact_timepoint_stop_times = num_stop_times
    Failing because of 1 record where num_approximate_timepoint_stop_times + num_exact_timepoint_stop_times differs from num_stop_times

select *
from staging.int_gtfs_schedule__stop_times_grouped
where num_approximate_timepoint_stop_times + num_exact_timepoint_stop_times != num_stop_times

#3538

  • dbt_utils_expression_is_true_int_gtfs_schedule__stop_times_grouped__num_gtfs_flex_stop_times_0_is_gtfs_flex_trip

    Table: staging.int_gtfs_schedule__stop_times_grouped
    expression: (num_gtfs_flex_stop_times > 0) = is_gtfs_flex_trip
    Failing because of 82 records where num_gtfs_flex_stop_times > 0 does not match is_gtfs_flex_trip

select *
from staging.int_gtfs_schedule__stop_times_grouped
where (num_gtfs_flex_stop_times > 0) != is_gtfs_flex_trip

#3626 and #2774

erikamov commented Feb 11, 2025

  • dbt_utils_mutually_exclusive_ranges_dim_contract_attachments_required___valid_from__source_record_id___valid_to

    Table: mart_transit_database.dim_contract_attachments
    mutually_exclusive_ranges:
    lower_bound_column: _valid_from
    upper_bound_column: _valid_to
    partition_by: source_record_id
    gaps: required

    Failing because 2 records have overlapping ranges

with window_functions as (
    select
        
        source_record_id as partition_by_col,
        
        _valid_from as lower_bound,
        _valid_to as upper_bound,

        lead(_valid_from) over (
            partition by source_record_id
            order by _valid_from, _valid_to
        ) as next_lower_bound,

        row_number() over (
            partition by source_record_id
            order by _valid_from desc, _valid_to desc
        ) = 1 as is_last_record

    from  mart_transit_database.dim_contract_attachments
),

calc as (
    -- We want to return records where one of our assumptions fails, so we'll use
    -- the `not` function with `and` statements so we can write our assumptions more cleanly
    select
        *,

        -- For each record: lower_bound should be < upper_bound.
        -- Coalesce it to return an error on the null case (implicit assumption
        -- these columns are not_null)
        coalesce(
            lower_bound < upper_bound,
            false
        ) as lower_bound_less_than_upper_bound,

        -- For each record: upper_bound < the next lower_bound.
        -- Coalesce it to handle null cases for the last record.
        coalesce(
            upper_bound < next_lower_bound,
            is_last_record,
            false
        ) as upper_bound_less_than_next_lower_bound

    from window_functions

),

validation_errors as (

    select
        *
    from calc

    where not(
        -- THE FOLLOWING SHOULD BE TRUE --
        lower_bound_less_than_upper_bound
        and upper_bound_less_than_next_lower_bound
    )
)

select * from validation_errors

#3628 and #2909
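
For readers less familiar with the compiled test SQL above, the same two assumptions can be sketched in plain Python. This is only an illustration: the function name `range_violations` and the sample records are hypothetical, and null bounds (which the SQL coalesces to a failure) are not handled here.

```python
from collections import defaultdict


def range_violations(records):
    """Return records that break the 'gaps: required' assumptions.

    records: iterable of (partition_key, lower_bound, upper_bound) tuples.
    A record fails when lower_bound is not < upper_bound, or when its
    upper_bound is not < the next lower_bound in the same partition.
    """
    by_partition = defaultdict(list)
    for key, lower, upper in records:
        by_partition[key].append((lower, upper))

    failures = []
    for key, ranges in by_partition.items():
        ranges.sort()  # order by lower bound, then upper bound
        for i, (lower, upper) in enumerate(ranges):
            is_last = i == len(ranges) - 1
            ok_self = lower < upper
            # The last record in a partition has no next range to compare with.
            ok_next = is_last or upper < ranges[i + 1][0]
            if not (ok_self and ok_next):
                failures.append((key, lower, upper))
    return failures


# Hypothetical records: (source_record_id, _valid_from, _valid_to) as ISO dates.
sample = [
    ("rec_a", "2024-01-01", "2024-02-01"),
    ("rec_a", "2024-01-15", "2024-03-01"),  # starts before the previous range ends
    ("rec_b", "2024-01-01", "2024-06-01"),
]
overlapping = range_violations(sample)
```

In this sample only the first `rec_a` range is reported, because its `_valid_to` is not earlier than the next `_valid_from` in the same partition.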

  • unique_dim_contract_attachments_key

    Table: mart_transit_database.dim_contract_attachments
    Column: key ('unnested_attachments.id', '_valid_from')

#3630

erikamov commented Feb 11, 2025

  • dbt_utils_unique_combination_of_columns_base_tts_services_ct_services_map_ct_key__ct_date / dbt_utils_unique_combination_of_columns_base_tts_services_ct_services_map_tts_key__tts_date

    Table: staging.base_tts_services_ct_services_map

    unique_combination_of_columns: tts_key and tts_date
    Failing because of 78 records with duplicated tts_key and tts_date (2022-08-17 to 2023-03-29)

select tts_key, tts_date, count(*)
from staging.base_tts_services_ct_services_map
group by tts_key, tts_date
having count(*) > 1

#3632 and #1806

unique_combination_of_columns: ct_key and ct_date
Failing because of 78 records with duplicated ct_key and ct_date (2022-08-17 to 2023-03-29)

select ct_key, ct_date, count(*)
from staging.base_tts_services_ct_services_map
group by ct_key, ct_date
having count(*) > 1

For a unique combination of all four columns tts_key, tts_date, ct_key, and ct_date, it would return 8 duplicated records

#3631 and #1806

erikamov commented Feb 12, 2025

  • unique_dim_calendar__gtfs_key

    Table: mart_gtfs.dim_calendar
    Column: _gtfs_key
    unique 'feed_key' and 'service_id'
    where: "feed_key != '6368fe701bdd68c4f521751a9a222a10'" (filtered out because of some issue with Auburn data)

    Failing because of 3 duplicated _gtfs_key

select _gtfs_key, feed_key, service_id, count(1)
from mart_gtfs.dim_calendar
where feed_key != '6368fe701bdd68c4f521751a9a222a10'
group by _gtfs_key, feed_key, service_id
having count(*) > 1

#3595

  • not_null_dim_calendar_service_id / not_null_dim_calendar_end_date / not_null_dim_calendar_friday / not_null_dim_calendar_saturday / not_null_dim_calendar_start_date / not_null_dim_calendar_sunday / not_null_dim_calendar_wednesday / not_null_dim_calendar_thursday

    Table: mart_gtfs.dim_calendar
    Columns cannot have null values: service_id, start_date, end_date, wednesday, thursday, friday, saturday, sunday
    where: "feed_key != '6368fe701bdd68c4f521751a9a222a10'" (filtered out because of some issue with Auburn data)

    Failing because of 1 record without values on service_id, start_date, end_date, wednesday, thursday, friday, saturday, and sunday columns

select *
from mart_gtfs.dim_calendar
where feed_key != '6368fe701bdd68c4f521751a9a222a10'
and (service_id = '' 
     or start_date is null
     or end_date is null
     or wednesday is null
     or thursday is null
     or friday is null
     or saturday is null
     or sunday is null)

#3617

erikamov commented Feb 12, 2025

  • not_null_dim_schedule_feeds_feed_timezone

    Table: mart_gtfs.dim_schedule_feeds
    Column: feed_timezone

    Failing because of 1 record with null feed_timezone

select * from mart_gtfs.dim_schedule_feeds where feed_timezone is null

#3594

  • not_null_dim_transfers_from_stop_id / not_null_dim_transfers_to_stop_id / not_null_dim_transfers_transfer_type

    Table: mart_gtfs.dim_transfers
    Columns: from_stop_id, to_stop_id, transfer_type

    Failing because of 1,115 records with null from_stop_id and to_stop_id, and 8,278 records with null transfer_type

select * from mart_gtfs.dim_transfers
where from_stop_id is null -- 1115
or to_stop_id is null -- 1115
or transfer_type is null --8278

#3597

  • not_null_dim_trips_route_id

    Table: mart_gtfs.dim_trips
    route_id not_null:
    where: "feed_key != '6368fe701bdd68c4f521751a9a222a10'"

    Failing because of 513 records with null route_id

select *
  from mart_gtfs.dim_trips
 where route_id is null
   and feed_key != '6368fe701bdd68c4f521751a9a222a10'

#3591

  • not_null_dim_trips_trip_id

    Table: mart_gtfs.dim_trips
    trip_id not_null:
    error_if: ">1"

    Failing because of 664 records with null trip_id

select *
  from mart_gtfs.dim_trips
 where trip_id is null

#3623

erikamov commented Feb 13, 2025

  • not_null_int_gtfs_quality__guideline_checks_long_status

    Table: staging.int_gtfs_quality__guideline_checks_long
    Column: status

    Failing because of 678 records with null status

select *
from staging.int_gtfs_quality__guideline_checks_long
where status is null

#3627

  • unique_int_gtfs_quality__guideline_checks_long_key

    Table: staging.int_gtfs_quality__guideline_checks_long
    Column: key ('unioned.key', 'date', 'check')

    Failing because of 876 records where 'unioned.key', 'date', 'check' is not unique; another column probably needs to be added to make the key unique

select key, date, check, count(*)
from staging.int_gtfs_quality__guideline_checks_long
group by key, date, check
having count(*) > 1

#3537

  • not_null_fct_daily_organization_combined_guideline_checks_status

    Table: mart_gtfs_quality.fct_daily_organization_combined_guideline_checks
    Column: status (from int_gtfs_quality__guideline_checks_long)

    Failing because of 240 records with null status

select *
from mart_gtfs_quality.fct_daily_organization_combined_guideline_checks
where status is null

#3593

  • not_null_fct_daily_service_combined_guideline_checks_status

    Table: mart_gtfs_quality.fct_daily_service_combined_guideline_checks
    Column: status (from int_gtfs_quality__guideline_checks_long)

    Failing because of 240 records with null status

select *
from mart_gtfs_quality.fct_daily_service_combined_guideline_checks
where status is null

#3592

  • validate_guidelines_checks_rows_per_check_matches_index
    Test: validate_guidelines_checks_rows_per_check_matches_index.sql
    Returned 8 failing results
WITH check_cts AS (
    SELECT check, COUNT(*) AS actual_checks
    FROM staging.int_gtfs_quality__guideline_checks_long
    GROUP BY 1
),

idx_check_cts AS (
    SELECT check, COUNT(*) AS idx_checks
    FROM staging.int_gtfs_quality__guideline_checks_index
    GROUP BY 1
)

SELECT
    check,
    actual_checks,
    idx_checks
FROM check_cts
FULL OUTER JOIN idx_check_cts
USING (check)
WHERE actual_checks != idx_checks

#3539

erikamov commented Feb 13, 2025

  • not_null_fct_daily_schedule_feeds_gtfs_dataset_key

    Table: mart_gtfs.fct_daily_schedule_feeds
    Column: gtfs_dataset_key (from int_transit_database__urls_to_gtfs_datasets)
    error_if: ">10000"

    Failing because of 7,056 records with null gtfs_dataset_key

select *
from mart_gtfs.fct_daily_schedule_feeds
where gtfs_dataset_key is null

#3718

  • not_null_fct_schedule_feed_downloads_gtfs_dataset_key

    Table: mart_gtfs.fct_schedule_feed_downloads
    Column: gtfs_dataset_key (from int_transit_database__urls_to_gtfs_datasets)
    error_if: ">10000"

    Failing because of 7,709 records with null gtfs_dataset_key

select *
from mart_gtfs.fct_schedule_feed_downloads
where gtfs_dataset_key is null

#3719
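
The `error_if` thresholds quoted above interact with the failure counts in a way that is easy to misread, so here is a small Python sketch of how such thresholds are usually resolved. It assumes the behaviour described in the dbt documentation for `severity: error` (check `error_if` first, then `warn_if`, both defaulting to `"!=0"`); the helper names `condition_met` and `resolve_status` are hypothetical and are not part of dbt.

```python
import operator

# Comparison operators that may prefix a threshold string such as ">10000".
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}


def condition_met(condition, failures):
    """Evaluate a threshold string such as '>10000' against a failure count."""
    cond = condition.replace(" ", "")
    # Two-character operators are checked first so ">=" is not read as ">".
    for symbol in (">=", "<=", "!=", ">", "<", "="):
        if cond.startswith(symbol):
            return _OPS[symbol](failures, int(cond[len(symbol):]))
    raise ValueError(f"unsupported condition: {condition!r}")


def resolve_status(failures, error_if="!=0", warn_if="!=0"):
    """Assumed resolution order: error_if first, then warn_if, otherwise pass."""
    if condition_met(error_if, failures):
        return "error"
    if condition_met(warn_if, failures):
        return "warn"
    return "pass"


status_daily_feeds = resolve_status(7056, error_if=">10000")
status_trip_id = resolve_status(664, error_if=">1")
```

Under these assumed semantics, 7,056 null keys with `error_if: ">10000"` would surface as a warning rather than an error, while 664 null trip_id values with `error_if: ">1"` would be an error.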

erikamov commented Feb 13, 2025

  • relationships_fct_daily_rt_feed_files_schedule_to_use_for_rt_validation_gtfs_dataset_key__key__ref_dim_gtfs_datasets_

    Table: mart_gtfs.fct_daily_rt_feed_files
    Column: schedule_to_use_for_rt_validation_gtfs_dataset_key
    Relationship to: mart_transit_database.dim_gtfs_datasets.key

    Failing because of 72 records where schedule_to_use_for_rt_validation_gtfs_dataset_key does not match mart_transit_database.dim_gtfs_datasets.key

select *
from mart_gtfs.fct_daily_rt_feed_files
where schedule_to_use_for_rt_validation_gtfs_dataset_key not in (select distinct key
                                                                   from mart_transit_database.dim_gtfs_datasets)

#3598

erikamov commented Feb 15, 2025

  • not_null_fct_vehicle_locations_trip_instance_key

    Table: mart_gtfs.fct_vehicle_locations
    Column: trip_instance_key

    Failing because of 15,866,369 records with null trip_instance_key

select *
from mart_gtfs.fct_vehicle_locations
where trip_instance_key is null

#3600

  • unique_proportion_fct_vehicle_locations_0_9999__trip_instance_key

    Table: mart_gtfs.fct_vehicle_locations
    Column: trip_instance_key (macro test_unique_proportion at_least: 0.9999)

    Failing because the test returns 0.006116776; it needs to be at least 0.9999

with validation as (
  select
    sum(case when row_number > 1 then 0 else 1 end) / cast(count(*) as numeric) as unique_proportion
  from (select
    trip_instance_key
    , row_number() over (partition by trip_instance_key) as row_number
  from mart_gtfs.fct_vehicle_locations) row_counts
),
validation_errors as (
  select
    unique_proportion
  from validation
  where unique_proportion < 0.9999 or unique_proportion > 1
)
select
  *
from validation_errors

#3599
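
To make the number returned by the query above easier to interpret, here is a minimal Python sketch of the same calculation. The function name `unique_proportion` is reused for illustration only and the sample keys are hypothetical; it treats null as one more value, on the assumption that `partition by` groups all nulls into a single partition.

```python
def unique_proportion(values):
    """Share of rows that are the first occurrence of their value.

    Mirrors the SQL above: row_number() restarts per distinct value, so only
    the first row of each partition counts as unique. None (NULL) is treated
    as one more value, matching how partition by groups NULLs together.
    """
    if not values:
        raise ValueError("unique_proportion is undefined for an empty input")
    return len(set(values)) / len(values)


# Hypothetical sample: two distinct trip_instance_key values and eight nulls.
keys = ["t1", "t2"] + [None] * 8
proportion = unique_proportion(keys)
```

With this sample the proportion is 3 out of 10 rows, which suggests that a large block of null trip_instance_key values may be a big part of why the real result sits so far below 0.9999.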

  • not_null_fct_vehicle_locations_grouped_trip_instance_key / unique_proportion_fct_vehicle_locations_grouped_0_9999__trip_instance_key

    Table: mart_gtfs.fct_vehicle_locations_grouped
    Column: trip_instance_key

    Failing because of 8,718,512 records with null trip_instance_key, need to fix fct_vehicle_locations first

select *
from mart_gtfs.fct_vehicle_locations_grouped
where trip_instance_key is null

Column: trip_instance_key (macro test_unique_proportion at_least: 0.9999)

Failing because the test returns 0.008506957; it needs to be at least 0.9999

with validation as (
  select
    sum(case when row_number > 1 then 0 else 1 end) / cast(count(*) as numeric) as unique_proportion
  from (select
    trip_instance_key
    , row_number() over (partition by trip_instance_key) as row_number
  from mart_gtfs.fct_vehicle_locations_grouped) row_counts
),
validation_errors as (
  select
    unique_proportion
  from validation
  where unique_proportion < 0.9999 or unique_proportion > 1
)
select
  *
from validation_errors

#3720

@erikamov

  • unique_int_gtfs_quality__rt_validation_outcomes_key
    There is already a ticket open.

erikamov commented Feb 15, 2025

  • unique_dim_fare_media__gtfs_key

    Table: mart_gtfs.dim_fare_media
    Column: _gtfs_key ('feed_key', 'fare_media_id')
    unique where feed_key != '6368fe701bdd68c4f521751a9a222a10' (filtered out because of some issue with Auburn data)

    Failing because one _gtfs_key (2e3954f135226903eb57a07e00c8bd4d) is repeated 7 times

select _gtfs_key, count(*)
from mart_gtfs.dim_fare_media
where feed_key != '6368fe701bdd68c4f521751a9a222a10'
group by _gtfs_key
having count(*) > 1

#3624

erikamov commented Feb 15, 2025

  • not_null_dim_fare_attributes_transfers

    Table: mart_gtfs.dim_fare_attributes
    Column: transfers
    where: "feed_key != '6368fe701bdd68c4f521751a9a222a10'"

    Failing because of 468,799 records with null transfers

select *
from mart_gtfs.dim_fare_attributes
where transfers is null
and feed_key != '6368fe701bdd68c4f521751a9a222a10'

#3721

erikamov commented Feb 15, 2025

  • unique_int_elavon__deposit_transactions_trn_ref_num
    Table: staging.int_elavon__deposit_transactions
    Column: trn_ref_num

    Failing because of 24 duplicated trn_ref_num

select trn_ref_num, count(*)
from staging.int_elavon__deposit_transactions
group by trn_ref_num
having count(*) > 1

#3722

  • unique_fct_elavon__transactions_trn_ref_num

    Table: mart_payments.fct_elavon__transactions
    Column: trn_ref_num

    Failing because of 24 duplicated trn_ref_num, related to unique_int_elavon__deposit_transactions_trn_ref_num error

select trn_ref_num, count(*)
from mart_payments.fct_elavon__transactions
group by trn_ref_num
having count(*) > 1

#3723

  • unique_fct_payments_deposit_transactions_trn_ref_num

    Table: mart_payments.fct_payments_deposit_transactions
    Column: trn_ref_num

    Failing because of 24 duplicated trn_ref_num, related to unique_int_elavon__deposit_transactions_trn_ref_num error

select trn_ref_num, count(*)
from mart_payments.fct_payments_deposit_transactions
group by trn_ref_num
having count(*) > 1

#3724

erikamov commented Feb 15, 2025

  • unique_fct_payments_aggregations_aggregation_id

    Table: mart_payments.fct_payments_aggregations
    Column: aggregation_id

    Failing because of 2 duplicated aggregation_id

select aggregation_id, count(*)
from mart_payments.fct_payments_aggregations
group by aggregation_id
having count(*) > 1

#3725

erikamov commented Feb 15, 2025

  • unique_miles_traveled_location_name / unique_miles_traveled_off_location_name

The file contains duplicated names in each column; for the test to pass, it needs to check for a unique pair [location_name, off_location_name] instead of each column on its own.

#3618 and #3619
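
The difference between per-column uniqueness and pair uniqueness can be sketched in a few lines of Python. The helper `duplicated` and the sample rows are hypothetical and only illustrate the idea; the real values live in the seed file.

```python
from collections import Counter


def duplicated(rows, columns):
    """Return the column-value combinations that appear more than once."""
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    return [combo for combo, n in counts.items() if n > 1]


# Hypothetical seed rows with repeated names but distinct pairs.
rows = [
    {"location_name": "Station A", "off_location_name": "Station B"},
    {"location_name": "Station A", "off_location_name": "Station C"},
    {"location_name": "Station B", "off_location_name": "Station C"},
]

dup_location = duplicated(rows, ["location_name"])
dup_off_location = duplicated(rows, ["off_location_name"])
dup_pair = duplicated(rows, ["location_name", "off_location_name"])
```

Each single-column check reports a duplicate in this sample, while the pair check reports none, which is the behaviour the suggested change to the test is after.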

erikamov commented Feb 15, 2025

  • unique_fct_service_alerts_messages_key

    Table: mart_gtfs.fct_service_alerts_messages
    Column: key ('_extract_ts', 'base64_url', 'id')

    Failing because of 360 records where key ('_extract_ts', 'base64_url', 'id') is not unique

with dbt_test__target as (

  select key as unique_field
  from (select * from `cal-itp-data-infra`.`mart_gtfs`.`fct_service_alerts_messages` where
            dt in (
                '2025-02-14',
                '2025-02-13'
                )

            AND hour in (
                '2025-02-14 00:00:00',
                '2025-02-13 02:00:00'
            )
            ) dbt_subquery
  where key is not null

)

select
    unique_field,
    count(*) as n_records

from dbt_test__target
group by unique_field
having count(*) > 1

#3726

@erikamov

#3731

@erikamov

All tickets created or referenced for each error.
Also linked tickets on Sentry.
