@erikamov erikamov commented Dec 10, 2025

Description

This PR creates materialized tables to power a GTFS Downloader Dashboard, which will be used to validate the results of the new GTFS Download DAG #4571.
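
For reference, each of these mart models is meant to be a thin materialized layer that the dashboard can query directly. Below is a minimal sketch of roughly what one of them (dim_gtfs_download_configs) could look like; the ref() target and column names are hypothetical, not the actual model in this PR.

-- Sketch only: the staging model name and columns are assumptions.
{{ config(materialized='table') }}

select
    dt,                    -- extract date of the config (assumed column)
    ts,                    -- timestamp of the Airflow run that wrote it (assumed column)
    base64_url,            -- encoded feed URL used as the config key (assumed column)
    name,                  -- dataset name (assumed column)
    data_quality_pipeline  -- flag discussed in this PR
from {{ ref('stg_gtfs__download_configs') }}  -- hypothetical staging model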

In conversation with @vevetron and @lauriemerrell we noticed that the data_quality_pipeline filter was missing, although its absence did not change the resulting datasets. The filter is added in this PR.

As requested, we are splitting the download process from unzip, parse, and validate: a new download_gtfs DAG will be responsible for downloading the GTFS data and will then trigger parse_and_validate_gtfs to do the rest.

To address the concern about working with current data, we moved the logic from the materialized table staging.int_transit_database__gtfs_datasets_dim into a new view, staging.int_gtfs_datasets, which reads the current datasets from airtable.california_transit__gtfs_datasets.

  • staging.int_transit_database__gtfs_datasets_dim now points at the new view, so there is a single source of truth (a rough sketch of the view is shown below).
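
A minimal sketch of what the new view could look like, assuming the Airtable data is registered as a dbt source for airtable.california_transit__gtfs_datasets and that the latest record per dataset is picked by hypothetical key/ts columns; the actual dedup logic may differ:

-- models/intermediate/gtfs/int_gtfs_datasets.sql (sketch only; column names are assumptions)
{{ config(materialized='view') }}

with latest_extract as (
    select *
    from {{ source('airtable', 'california_transit__gtfs_datasets') }}
    -- keep only the most recent Airtable record per dataset
    qualify row_number() over (partition by key order by ts desc) = 1
)

select *
from latest_extract
where data_quality_pipeline  -- the filter that was previously missing

Because it is a view over the source rather than a dbt-built table, downstream consumers would see new Airtable data as soon as the underlying table does, without waiting for a dbt run.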

Also, to make sure the external tables are created/updated in the correct order, I moved the create_external_tables DAG to run after the sync DAGs as part of Review DAGs’ schedule #3714.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

How has this been tested?

Tested by running the dbt models locally:

❯ poetry run dbt run -s models/mart/gtfs_audit/ --target staging
08:04:19  Running with dbt=1.10.1
08:04:20  Registered adapter: bigquery=1.10.0
08:04:26  Found 609 models, 1226 data tests, 14 seeds, 219 sources, 4 exposures, 1078 macros
08:04:26
08:04:26  Concurrency: 8 threads (target='staging')
08:04:26
08:04:31  1 of 5 START sql table model mart_gtfs_audit.dim_gtfs_download_configs ......... [RUN]
08:04:31  2 of 5 START sql table model mart_gtfs_audit.dim_gtfs_schedule_download_outcomes  [RUN]
08:04:31  3 of 5 START sql table model mart_gtfs_audit.dim_gtfs_schedule_unzip_outcomes .. [RUN]
08:04:31  4 of 5 START sql table model mart_gtfs_audit.dim_gtfs_schedule_validation_notices  [RUN]
08:04:31  5 of 5 START sql table model mart_gtfs_audit.dim_gtfs_schedule_validation_outcomes  [RUN]
08:04:34  1 of 5 OK created sql table model mart_gtfs_audit.dim_gtfs_download_configs .... [CREATE TABLE (77.6k rows, 23.9 MiB processed) in 2.72s]
08:04:35  2 of 5 OK created sql table model mart_gtfs_audit.dim_gtfs_schedule_download_outcomes  [CREATE TABLE (18.5k rows, 24.2 MiB processed) in 3.27s]
08:04:35  5 of 5 OK created sql table model mart_gtfs_audit.dim_gtfs_schedule_validation_outcomes  [CREATE TABLE (35.6k rows, 53.1 MiB processed) in 3.51s]
08:04:40  3 of 5 OK created sql table model mart_gtfs_audit.dim_gtfs_schedule_unzip_outcomes  [CREATE TABLE (547.4k rows, 274.3 MiB processed) in 9.11s]
08:04:44  4 of 5 OK created sql table model mart_gtfs_audit.dim_gtfs_schedule_validation_notices  [CREATE TABLE (188.7k rows, 1.5 GiB processed) in 12.85s]
08:04:44
08:04:44  Finished running 5 table models in 0 hours 0 minutes and 17.76 seconds (17.76s).
08:04:45
08:04:45  Completed successfully
08:04:45
08:04:45  Done. PASS=5 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=5

Post-merge follow-ups

  • No action required
  • Actions required (specified below)

@erikamov erikamov self-assigned this Dec 10, 2025
@erikamov erikamov force-pushed the mov/4571-gtfs-dashboard branch 2 times, most recently from 97b410d to e6537a2 on December 10, 2025 08:06

github-actions bot commented Dec 10, 2025

Terraform plan in iac/cal-itp-data-infra/airflow/us

Plan: 9 to add, 8 to change, 0 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create
!~  update in-place

Terraform will perform the following actions:

  # google_storage_bucket_object.calitp-composer["dags/README.md"] will be created
+   resource "google_storage_bucket_object" "calitp-composer" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "dags/README.md"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../airflow/dags/README.md"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer["dags/create_external_tables/METADATA.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer" {
!~      crc32c              = "TAOSvA==" -> (known after apply)
!~      detect_md5hash      = "BWlENDF70NFJXo55EWKoiQ==" -> "different hash"
!~      generation          = 1764100054786170 -> (known after apply)
        id                  = "calitp-composer-dags/create_external_tables/METADATA.yml"
!~      md5hash             = "BWlENDF70NFJXo55EWKoiQ==" -> (known after apply)
        name                = "dags/create_external_tables/METADATA.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer["dags/download_parse_and_validate_gtfs.py"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer" {
!~      crc32c              = "G0BNVQ==" -> (known after apply)
!~      detect_md5hash      = "MXFYtPprsFpWte9yivjiHQ==" -> "different hash"
!~      generation          = 1765308834202492 -> (known after apply)
        id                  = "calitp-composer-dags/download_parse_and_validate_gtfs.py"
!~      md5hash             = "MXFYtPprsFpWte9yivjiHQ==" -> (known after apply)
        name                = "dags/download_parse_and_validate_gtfs.py"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer["dags/sync_ntd_data_api/METADATA.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer" {
!~      crc32c              = "jC3FRw==" -> (known after apply)
!~      detect_md5hash      = "FVme+riRchXahturQIFHlg==" -> "different hash"
!~      generation          = 1765312247501642 -> (known after apply)
        id                  = "calitp-composer-dags/sync_ntd_data_api/METADATA.yml"
!~      md5hash             = "FVme+riRchXahturQIFHlg==" -> (known after apply)
        name                = "dags/sync_ntd_data_api/METADATA.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer["plugins/operators/bigquery_to_download_config_operator.py"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer" {
!~      crc32c              = "KvENpw==" -> (known after apply)
!~      detect_md5hash      = "T4VqE/DM0g8dGIROGUZ4Ww==" -> "different hash"
!~      generation          = 1765308834176164 -> (known after apply)
        id                  = "calitp-composer-plugins/operators/bigquery_to_download_config_operator.py"
!~      md5hash             = "T4VqE/DM0g8dGIROGUZ4Ww==" -> (known after apply)
        name                = "plugins/operators/bigquery_to_download_config_operator.py"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer["plugins/operators/gcs_to_gtfs_download_operator.py"] will be created
+   resource "google_storage_bucket_object" "calitp-composer" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "plugins/operators/gcs_to_gtfs_download_operator.py"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../airflow/plugins/operators/gcs_to_gtfs_download_operator.py"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["dbt_project.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "xy4uIA==" -> (known after apply)
!~      detect_md5hash      = "jaSUWSXE+sudfy0c0AgiJA==" -> "different hash"
!~      generation          = 1763589717446130 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/dbt_project.yml"
!~      md5hash             = "jaSUWSXE+sudfy0c0AgiJA==" -> (known after apply)
        name                = "data/warehouse/dbt_project.yml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/intermediate/gtfs/_int_gtfs.yaml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "JuVq+A==" -> (known after apply)
!~      detect_md5hash      = "XuNJoijQihZNoiN67C5rlg==" -> "different hash"
!~      generation          = 1761707840881407 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/models/intermediate/gtfs/_int_gtfs.yaml"
!~      md5hash             = "XuNJoijQihZNoiN67C5rlg==" -> (known after apply)
        name                = "data/warehouse/models/intermediate/gtfs/_int_gtfs.yaml"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/intermediate/gtfs/int_gtfs_datasets.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/intermediate/gtfs/int_gtfs_datasets.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/intermediate/gtfs/int_gtfs_datasets.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/intermediate/transit_database/dimensions/int_transit_database__gtfs_datasets_dim.sql"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "cy68sQ==" -> (known after apply)
!~      detect_md5hash      = "2FqX0bk2xM2PZaS2Esldww==" -> "different hash"
!~      generation          = 1757531130514617 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/models/intermediate/transit_database/dimensions/int_transit_database__gtfs_datasets_dim.sql"
!~      md5hash             = "2FqX0bk2xM2PZaS2Esldww==" -> (known after apply)
        name                = "data/warehouse/models/intermediate/transit_database/dimensions/int_transit_database__gtfs_datasets_dim.sql"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/gtfs_audit/_mart_gtfs_audit.yml"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/gtfs_audit/_mart_gtfs_audit.yml"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/gtfs_audit/_mart_gtfs_audit.yml"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/gtfs_audit/dim_gtfs_download_configs.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/gtfs_audit/dim_gtfs_download_configs.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/gtfs_audit/dim_gtfs_download_configs.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/gtfs_audit/dim_gtfs_schedule_download_outcomes.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/gtfs_audit/dim_gtfs_schedule_download_outcomes.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/gtfs_audit/dim_gtfs_schedule_download_outcomes.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/gtfs_audit/dim_gtfs_schedule_unzip_outcomes.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/gtfs_audit/dim_gtfs_schedule_unzip_outcomes.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/gtfs_audit/dim_gtfs_schedule_unzip_outcomes.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/gtfs_audit/dim_gtfs_schedule_validation_notices.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/gtfs_audit/dim_gtfs_schedule_validation_notices.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/gtfs_audit/dim_gtfs_schedule_validation_notices.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/mart/gtfs_audit/dim_gtfs_schedule_validation_outcomes.sql"] will be created
+   resource "google_storage_bucket_object" "calitp-composer-dags" {
+       bucket         = "calitp-composer"
+       content        = (sensitive value)
+       content_type   = (known after apply)
+       crc32c         = (known after apply)
+       detect_md5hash = "different hash"
+       generation     = (known after apply)
+       id             = (known after apply)
+       kms_key_name   = (known after apply)
+       md5hash        = (known after apply)
+       md5hexhash     = (known after apply)
+       media_link     = (known after apply)
+       name           = "data/warehouse/models/mart/gtfs_audit/dim_gtfs_schedule_validation_outcomes.sql"
+       output_name    = (known after apply)
+       self_link      = (known after apply)
+       source         = "../../../../warehouse/models/mart/gtfs_audit/dim_gtfs_schedule_validation_outcomes.sql"
+       storage_class  = (known after apply)
    }

  # google_storage_bucket_object.calitp-composer-dags["models/staging/gtfs/_src_gtfs_schedule_external_tables.yml"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-composer-dags" {
!~      crc32c              = "JRuXXA==" -> (known after apply)
!~      detect_md5hash      = "Caqsk8kIhrzLrYxkwYo53g==" -> "different hash"
!~      generation          = 1751416666931203 -> (known after apply)
        id                  = "calitp-composer-data/warehouse/models/staging/gtfs/_src_gtfs_schedule_external_tables.yml"
!~      md5hash             = "Caqsk8kIhrzLrYxkwYo53g==" -> (known after apply)
        name                = "data/warehouse/models/staging/gtfs/_src_gtfs_schedule_external_tables.yml"
#        (17 unchanged attributes hidden)
    }

Plan: 9 to add, 8 to change, 0 to destroy.

📝 Plan generated in Plan Terraform for Warehouse and DAG changes #1199


github-actions bot commented Dec 10, 2025

Terraform plan in iac/cal-itp-data-infra-staging/airflow/us

Plan: 0 to add, 4 to change, 0 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
!~  update in-place

Terraform will perform the following actions:

  # google_storage_bucket_object.calitp-staging-composer["dags/download_parse_and_validate_gtfs.py"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer" {
!~      crc32c              = "GvruJA==" -> (known after apply)
!~      detect_md5hash      = "EVCNHN6Hq0uodzYuK8Zg4A==" -> "different hash"
!~      generation          = 1765494172783784 -> (known after apply)
        id                  = "calitp-staging-composer-dags/download_parse_and_validate_gtfs.py"
!~      md5hash             = "EVCNHN6Hq0uodzYuK8Zg4A==" -> (known after apply)
        name                = "dags/download_parse_and_validate_gtfs.py"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer["plugins/operators/bigquery_to_download_config_operator.py"] will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer" {
!~      crc32c              = "6oRS3A==" -> (known after apply)
!~      detect_md5hash      = "xxkAT0jhkh3LKcuWUQ40AA==" -> "different hash"
!~      generation          = 1765492313306019 -> (known after apply)
        id                  = "calitp-staging-composer-plugins/operators/bigquery_to_download_config_operator.py"
!~      md5hash             = "xxkAT0jhkh3LKcuWUQ40AA==" -> (known after apply)
        name                = "plugins/operators/bigquery_to_download_config_operator.py"
#        (17 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-catalog will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-catalog" {
!~      content             = (sensitive value)
!~      crc32c              = "C6pTUQ==" -> (known after apply)
!~      detect_md5hash      = "G1/zRN8BpLbORlo3Fq9mFQ==" -> "different hash"
!~      generation          = 1765493898541075 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/target/catalog.json"
!~      md5hash             = "G1/zRN8BpLbORlo3Fq9mFQ==" -> (known after apply)
        name                = "data/warehouse/target/catalog.json"
#        (16 unchanged attributes hidden)
    }

  # google_storage_bucket_object.calitp-staging-composer-manifest will be updated in-place
!~  resource "google_storage_bucket_object" "calitp-staging-composer-manifest" {
!~      content             = (sensitive value)
!~      crc32c              = "9rnV6w==" -> (known after apply)
!~      detect_md5hash      = "ByFXB40BeDe5NkvaWS4/Qg==" -> "different hash"
!~      generation          = 1765493899808953 -> (known after apply)
        id                  = "calitp-staging-composer-data/warehouse/target/manifest.json"
!~      md5hash             = "ByFXB40BeDe5NkvaWS4/Qg==" -> (known after apply)
        name                = "data/warehouse/target/manifest.json"
#        (16 unchanged attributes hidden)
    }

Plan: 0 to add, 4 to change, 0 to destroy.

📝 Plan generated in Plan Terraform for Warehouse and DAG changes #1199

@erikamov erikamov force-pushed the mov/4571-gtfs-dashboard branch 2 times, most recently from 08b0863 to 438cdff on December 10, 2025 08:11

github-actions bot commented Dec 10, 2025

Warehouse report: Failed to add ci-report to a comment. Review the ci-report in the Summary.


lauriemerrell commented Dec 10, 2025

Thank you @erikamov! A couple of questions:

  1. (logistical) Is there a goal to actually make a Metabase dashboard that we can use to compare results directly, or should we do independent testing/comparisons? (Is there a dependency on the staging Metabase task?)

  2. (logistical) Where is the best place to see the current state of the code under this refactor? Like, is there a way I can review all these changes together, or are they already on main?

  3. My substantive comments are (let me know if these should be moved to an implementation PR):
    a. I would still advocate quite strongly that the download task be separate from the unzip/parse/validate, for the reasons that we have talked about but especially so that they can be more safely and easily re-run independently of one another (see for example this week where I needed to re-run several historical validation jobs). I still feel like the distinction between "now" and "interval processing" type jobs outlined here is useful and relevant and I'd like to try to maintain that distinction. (And this also relates to our parallel discussion about things that are downloaded outside the core pipeline.) This is a blocking concern from my perspective.
    b. I am also still concerned about the use of a table that is created by dbt to generate the configs because that introduces a new source of lag and in general we have tried not to have ingest jobs depend on dbt jobs, as a question of separation of concerns but also timing. (int_transit_database__gtfs_datasets_dim is a table, not a view, so it won't get data updates until dbt runs.) The in-between option would be to just use the Airtable external table which will get the latest data as soon as it lands in the bucket. This is also a blocking concern from my perspective.
    c. I am not sure where best to look in the code to see the updates to cal-itp-data-infra.storage but I would be a little bit concerned about moving code in and out of there because part of the reason to have that be a package is to allow it to be used by the archiver and the schedule download code. I think you are also planning a refactor of the archiver, so maybe that is where this is going and it's just temporary. I'm actually not sure the best way to assess the impact here because I don't know the current state of the archiver test/staging instance. I don't think this is a blocking concern necessarily as long as code is not being deleted from the storage package, but I do find it confusing to have two copies of the same code used by different things.
    d. It sort of feels like a lot of the observability outcomes would be achieved by just making an external table on the existing GTFS Download Configs bucket, if the goal is just to see "what is the configuration that caused this download to occur as it did". Like, ultimately, this is not really simplifying the logic IMO (you still have to take an Airtable artifact, convert it to a config, and then ingest that config -- this is using a dbt/warehouse artifact instead of the Airtable extract file, which moves the logic to a more observable place but IMO introduces an untenable dependency on dbt). This is just an observation, there might be other goals here. (A rough sketch of the external-table option follows below.)
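
To make 3.d concrete, here is a rough sketch of that external-table option, assuming the configs land as newline-delimited JSON files in a GCS bucket; the project, dataset, bucket name, and file layout are placeholders, not the real infrastructure:

-- Sketch only: project, dataset, bucket, and file layout are assumptions.
create or replace external table `cal-itp-data-infra.external_gtfs.download_configs`
options (
  format = 'NEWLINE_DELIMITED_JSON',
  uris   = ['gs://calitp-gtfs-download-config/*.jsonl']
);

Something like this would expose each run's generated configs to SQL without adding a dbt dependency to the ingest path.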

@erikamov erikamov force-pushed the mov/4571-gtfs-dashboard branch 2 times, most recently from 84c3eef to 4294d13 on December 11, 2025 02:17
@erikamov (Contributor, Author) commented:

Answering @lauriemerrell's questions:

  1. The dashboard was requested by @vevetron so his team can monitor GTFS and see differences in downloads and the like. I am just using this new dashboard as a way to compare the results between the new DAG and the v2 process.

  2. The code is on the main branch. The data_quality_pipeline filter that we talked about yesterday is in this PR.

3.a. We heard you, and we are splitting the download from the unzip, parse, and validate steps; that change is included in this PR.
3.b. We will create a ticket to build a view so the data is always up to date with airtable.california_transit__gtfs_datasets.
3.c. We did not change the code in Storage.py, nor are we using it in this new DAG. It was moved to a different folder because it was blocking the pydantic upgrade, and it will be deleted once we no longer need it.
3.d. Same as 1.

@lauriemerrell (Contributor) commented:

Thank you @erikamov! That all sounds great. For #1, I just wanted to clarify whether a Metabase dashboard already exists right now that we should be looking at as part of this review, or whether the plan is to merge this PR and then create the dashboard.

@lauriemerrell (Contributor) commented:

For 3b: I would be inclined to define this as an exposure on the dbt side so that the dependency is very explicit and visible there, because this will be a unique pattern (a dependency between an ingest job and something managed by dbt).


lauriemerrell commented Dec 11, 2025

We also might want to call this (again 3b) out on this page: https://docs.calitp.org/data-infra/architecture/data.html#architecture-data since this introduces a new dependency pattern

@ohrite ohrite force-pushed the mov/4571-gtfs-dashboard branch from 4294d13 to ad3c623 on December 11, 2025 17:03
@erikamov erikamov requested a review from jparr as a code owner December 11, 2025 20:25

github-actions bot commented Dec 11, 2025

Terraform plan in iac/cal-itp-data-infra-staging/composer/us

No changes. Your infrastructure matches the configuration.
No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

📝 Plan generated in Plan Terraform for Warehouse and DAG changes #1199

@erikamov erikamov force-pushed the mov/4571-gtfs-dashboard branch 3 times, most recently from 3badf3f to 78e5e4f on December 11, 2025 22:30

github-actions bot commented Dec 11, 2025

Terraform plan in iac/cal-itp-data-infra-staging/dashboards/us

No changes. Your infrastructure matches the configuration.
No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

📝 Plan generated in Terraform Plan #726

@erikamov erikamov force-pushed the mov/4571-gtfs-dashboard branch from 78e5e4f to a90e6d6 on December 11, 2025 22:33
@lauriemerrell (Contributor) commented:

Notes from live walkthrough convo 12/11:

  • Consider turning on bucket protections for where the GTFS download configs are written, so that if a historical task is cleared by accident an older file cannot be overwritten (and maybe the same for the download task itself, so that previously downloaded zipfiles can't be overwritten?), since things are now timestamped by the Airflow task time instead of the execution time.
    • Want to maybe try to protect partitions if that is an option? To prevent a case where a specific file fails to download on date A and then someone clears that on later date B and the date B data is written to the date A partition
    • Or find another way within the DAG task logic to prevent data being written to a wrong historic partition, like check that the run date matches the partition date
  • Might want to add something to Metabase to compare dim_schedule_feeds where _valid_from >= today and make sure they're identical between prod & staging with these changes (a rough query sketch is below).
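
A rough sketch of the check described in that last bullet, assuming the feeds mart lives in a mart_gtfs dataset in both projects (the dataset names are guesses; a real Metabase card would point at the actual prod and staging tables):

-- Sketch only: dataset names are assumptions.
-- Returns feed versions present in prod but missing from staging;
-- swap the two sides of the EXCEPT to check the other direction.
with prod as (
    select * from `cal-itp-data-infra.mart_gtfs.dim_schedule_feeds`
    where date(_valid_from) >= current_date()
),

staging as (
    select * from `cal-itp-data-infra-staging.mart_gtfs.dim_schedule_feeds`
    where date(_valid_from) >= current_date()
)

select * from prod
except distinct
select * from staging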

@ohrite ohrite force-pushed the mov/4571-gtfs-dashboard branch from c8cda86 to 4e3b576 on December 12, 2025 01:04

ohrite commented Dec 12, 2025

As decided by @vevetron, we'll want to open two issues after this branch is merged:

  1. Adjust the timestamps for the DAG to the execution timestamp
  2. If a zipfile already exists at the destination path, do not download it again

@lauriemerrell (Contributor) commented:

@ohrite lmk if you want me to write up those issues! And @erikamov lmk if it's ok for me to add a dim_schedule_feeds check to the dashboards as a downstream integration check that the results are all working out ok.

@ohrite ohrite force-pushed the mov/4571-gtfs-dashboard branch from 4e3b576 to 432e9bf on December 12, 2025 18:57