Skip to content

[#334] Add Terraform variable validation; trim Python pre-flight#335

Draft
gavinelder wants to merge 12 commits intochore/v26-only-strip-legacyfrom
chore/v26-tf-variable-validation
Draft

[#334] Add Terraform variable validation; trim Python pre-flight#335
gavinelder wants to merge 12 commits intochore/v26-only-strip-legacyfrom
chore/v26-tf-variable-validation

Conversation

@gavinelder
Copy link
Copy Markdown

@gavinelder gavinelder commented May 6, 2026

Refs #334. Stacked on PR #333 (`chore/v26-only-strip-legacy`) — review/merge that first.

Migrates sanity-check logic from `scripts/installer/validation/check_configuration.py` to native Terraform variable validation, adds the corresponding `terraform test` suite, and wires it into CI.

Commits (in order)

SHA Slice
`ff1f55f` First wave of `validation {}` blocks (shape rules) + Python residual trim
`9975267` MySQL 8.x floor on master (instead of just rejecting 5.6)
`c3adb3d` `tests/terraform/` suite: 21 `run` blocks across 3 .tftest.hcl files
`e347600` Cross-variable validation: only-one-of-N flag groups + dependency-population rules; +11 test cases
`bc39139` CI workflow `.github/workflows/terraform-test.yml`
`c730935` CI: skip pre-existing pytest-suite `requirements.txt` (malformed)
`ad6a9a4` CI: trigger fix for stacked PR visibility
`69f400f` `mock_provider "aws"` + SSM `override_data` so tests run without AWS creds

Coverage

Native validations now in variables.tf

  • Shape: `tower_container_version` (tag regex + v26.1.0–v26.x window), `tower_server_url` (no scheme prefix), `tower_root_users` (non-empty), `alb_certificate_arn` (ACM ARN), `private_cacert_bucket_prefix` (s3://), `db_engine_version` & `db_container_engine_version` (MySQL 8.x), three workspace-ID lists (comma-int).
  • Cross-variable (only-one-of-N): VPC source, DB engine choice, Redis source, ALB/cert/HTTPS, SES vs SMTP.
  • Cross-variable (dependency): `flag_use_existing_vpc` ⇒ `vpc_existing_id` set; `flag_create_load_balancer` ⇒ `alb_certificate_arn` set; three Route53 zone-name rules.

`terraform test` suite (`tests/terraform/`)

  • 32 `run` blocks across 4 .tftest.hcl files — all green locally and in CI.
  • Files: `version_validation.tftest.hcl`, `string_shape_validation.tftest.hcl`, `workspace_id_validation.tftest.hcl`, `cross_variable_validation.tftest.hcl`.
  • Each file mocks the AWS provider and overrides the four SSM data sources + the connection-strings external Python data source — so tests run without any AWS credentials, AWS_PROFILE, or Docker daemon.

CI

  • New workflow `.github/workflows/terraform-test.yml`. Runs on PRs (path-filtered), pushes to `master` / `release/v*` / temporarily `chore/v26-**`, and `workflow_dispatch`.
  • Steps: checkout → terraform 1.14.8 → python 3.12 → terraform init -backend=false → `make generate_test_data` → `make terraform_test`.

Python residual (`check_configuration.py`)

What stays:

  • `verify_subnet_privacy` (AWS data source).
  • `verify_docker_version` (filesystem read of the docker-compose template; now also enforces MySQL 8.x).
  • `verify_sensitive_keys` (operates on the parsed tfvars dict).
  • Cross-variable rules where a `condition` would be unreadable (e.g. ssh-access logic, db-deletion-protection warnings).
  • `verify_tower_self_signed_certs`: narrowed to the cross-variable rule (required when `flag_use_private_cacert = true`).

What's deleted:

  • `verify_only_one_true_set`, `verify_tfvars_config_dependencies`, `verify_tower_root_users`, the floor/tag check in `main`, the `only_one_true_set` and `ensure_dependency_populated` helpers, and the version-gate clauses inside `verify_email_login_disablement`, `verify_ses_integration`, `verify_data_studio`, `verify_data_studio_ssh`, `verify_pipeline_versioning`, `verify_redis_version`.

Out of scope (deferred under #334)

  • Positive plan-shape tests (e.g. asserting "this combo of flags produces N security groups") — separate slice.
  • Audit + retirement of pytest snapshot tests now redundant with native validation.
  • `terraform fmt -check` in CI — pre-existing fmt drift across ~10 files outside this PR's scope; needs a housekeeping pass first.
  • Replacing the duplicated mock blocks across the four .tftest.hcl files — `terraform test` doesn't have a shared-fixture mechanism today.

Verification

  • Local: `make generate_test_data && make terraform_test` → 32 passed, 0 failed.
  • CI run 25443812421 — green.

Automated triage by Claude Code, stacked on #333.

gavinelder added 12 commits May 6, 2026 15:32
…hecks

First pass at moving sanity checks from scripts/installer/validation/check_configuration.py
up to native `validation {}` blocks on the corresponding variables. Errors fire at
`terraform plan` time instead of via a separate Python pre-flight, and the message
text travels with the variable definition.

Variable validations added (variables.tf):
- tower_container_version: tag-shape regex + v26.1.0 floor.
- tower_server_url: rejects scheme prefixes.
- tower_root_users: rejects empty/REPLACE_ME.
- alb_certificate_arn: ACM ARN shape (REPLACE_ME passthrough when no ALB).
- private_cacert_bucket_prefix: s3:// shape (REPLACE_ME passthrough when not in use).
- db_engine_version, db_container_engine_version: reject 5.6.
- data_studio_eligible_workspaces, data_studio_ssh_eligible_workspaces,
  pipeline_versioning_eligible_workspaces: comma-separated integer format.

Python residual (check_configuration.py):
- Drops the duplicate floor/tag check from __main__ (now in variable validation).
- verify_tower_server_url: keeps the port-reminder warning only.
- verify_tower_self_signed_certs: narrowed to the cross-variable rule
  (private_cacert_bucket_prefix required when flag_use_private_cacert = true).
- verify_data_studio / verify_data_studio_ssh / verify_pipeline_versioning: drop
  the now-duplicate workspace-id format checks.
- verify_docker_version: keeps the docker-compose template scan; drops the
  db_engine_version 5.6 check (now a variable validation).
- verify_tower_root_users: deleted (replaced by variable validation).

Cross-variable rules and AWS-data-source rules deliberately stay in Python.
Black-formatted.

Refs #334
- variables.tf: db_engine_version and db_container_engine_version validations
  now require an "8." prefix instead of merely rejecting "5.6". Error message
  states the master-only-8.x policy explicitly.
- check_configuration.py: verify_docker_version now flags any mysql:N.x pin
  below 8 in the docker-compose template (was: only flagged 5.6 specifically).

Refs #334
First slice of native Terraform testing, stacked on top of the variable-validation
work in this branch. 21 `run` blocks across three .tftest.hcl files exercise the
new validation {} rules with positive and negative cases.

tests/terraform/version_validation.tftest.hcl
  - Floor + tag-shape rules on tower_container_version (v26.1.0+, v26.x major lock).

tests/terraform/string_shape_validation.tftest.hcl
  - tower_server_url scheme-prefix rejection.
  - tower_root_users empty/REPLACE_ME rejection.
  - alb_certificate_arn ACM-ARN shape.
  - private_cacert_bucket_prefix s3:// shape.
  - MySQL 8.x floor on db_engine_version and db_container_engine_version.

tests/terraform/workspace_id_validation.tftest.hcl
  - Comma-separated-integer rule on three workspace-ID list variables.

Each file uses a top-level `override_data` block to stub the connection-strings
external Python data source, so variable-validation tests run without needing a
Docker daemon (the script reads the tfvars from disk via Docker).

Other changes
  - variables.tf: tower_container_version floor rule rewritten to parse the tag
    into numeric components (Terraform doesn't support `>=` between strings).
    Now requires major == 26 and minor >= 1, with a guard that lets the tag-shape
    validation report cleanly first.
  - Makefile: new `terraform_test` target. Materialises the generated tfvars at
    project root (the external Python script reads it by hard-coded path) and
    cleans up on exit.
  - tests/datafiles/generate_core_data.sh: bumps the test-fixture
    tower_container_version from v25.3.0 to v26.1.0 so the baseline tfvars
    satisfies the new floor.
  - tests/terraform/README.md: usage notes; what's in vs out of scope.

Run:    make generate_test_data && make terraform_test
Result: 21 passed, 0 failed.

Refs #334
…oss-variable validation

Replaces the verify_only_one_true_set and verify_tfvars_config_dependencies
groups in check_configuration.py with native cross-variable `validation {}`
blocks. Errors fire at terraform plan time and the rule travels with the
variable rather than living in a Python pre-flight.

Cross-variable rules added (variables.tf):
  Only-one-of-N flag groups (each anchored on the first flag in the group):
    - flag_create_new_vpc / flag_use_existing_vpc
    - flag_create_external_db / flag_use_existing_external_db / flag_use_container_db
    - flag_create_external_redis / flag_use_container_redis
    - flag_create_load_balancer / flag_use_private_cacert / flag_do_not_use_https
    - flag_use_aws_ses_iam_integration / flag_use_existing_smtp

  Conditional-dependency rules (anchored on the dependent variable):
    - flag_use_existing_vpc => vpc_existing_id != REPLACE_ME
    - flag_create_load_balancer => alb_certificate_arn != REPLACE_ME (second
      validation block alongside the existing ARN-shape check)
    - flag_create_route53_private_zone => new_route53_private_zone_name set
    - flag_use_existing_route53_public_zone => existing_route53_public_zone_name set
    - flag_use_existing_route53_private_zone => existing_route53_private_zone_name set

Residual Python:
  - verify_only_one_true_set, verify_tfvars_config_dependencies, the
    only_one_true_set helper, and the ensure_dependency_populated helper
    are deleted entirely (not stubbed) — no callers remain.
  - verify_sensitive_keys (parsed-tfvars-dict shape) and verify_subnet_privacy
    (AWS data source) stay; those can't be expressed as variable validations.

tests/terraform/cross_variable_validation.tftest.hcl
  - 11 run blocks covering negative cases for all 11 new rules.
  - All 32 .tftest.hcl run blocks now pass: 0 failed.

Refs #334
…ushes

Adds .github/workflows/terraform-test.yml. Triggers on PRs targeting master or
release/v* branches when relevant files change, and on pushes to those branches.
Uses Terraform 1.14.x and Python 3.12, runs `make generate_test_data` followed
by `make terraform_test`.

The variable-validation suite uses `override_data` blocks to stub the connection-strings
external Python data source, so the run does not need real cloud credentials.
`make generate_test_data` does need Docker (extractors.py uses an hcl2json container
to parse tfvars); GHA ubuntu-latest runners have Docker available by default.

CX_SKIP_SSM=true is set for fixture generation to avoid touching real SSM parameters.

Format-check is intentionally NOT enabled in this workflow yet — the repo currently
has pre-existing fmt drift across ~10 files outside this PR's scope. A separate
housekeeping pass will run `terraform fmt -recursive` and add the check.

Drive-by: applied `terraform fmt` to two files this PR already touched
(modules/connection_strings/v1.0.0/variables.tf, templates/TEMPLATE_terraform.tfvars)
so they're at least clean within scope.

Refs #334
The original triggers only fired on PRs targeting master or release/v*. Since
this work is stacked (PR #335 -> PR #333 -> master), GHA reads the workflow
config from the PR base branch, which doesn't yet have this file. Result:
the workflow never ran on the stacked PR.

Changes:
  - pull_request: drop the branch restriction so any PR with relevant path
    changes runs the suite. Cheap; the suite is variable-validation-only.
  - push: add chore/v26-** as a temporary trigger so commits to the in-flight
    stacked branches fire the workflow. Remove once master has the workflow.
  - workflow_dispatch: added so the suite can be triggered manually for ad-hoc
    runs (e.g. when iterating on a .tftest.hcl file).

Refs #334
The previous CI run failed at `pip install -r tests/requirements.txt` because
that file (pre-existing) contains a malformed line ('pip install requests=2.31.0').
That file is for the pytest suite — terraform-test doesn't need it.
generate_testing_secrets.py uses only stdlib, and the data.external script is
mocked via override_data, so no third-party Python deps are required.

Also create tests/logs/ before running generate_test_data; the bash script
appends to a file in there and emits a warning when the directory is missing.

Refs #334
The first CI run failed because the test fixtures pin var.aws_profile = "development"
and CI doesn't have that AWS profile configured. Variable-validation tests don't
need a real provider — they should run without any cloud credentials.

Adds to each .tftest.hcl file:
  - `mock_provider "aws"` so the AWS provider stops trying to read shared config.
  - Four `override_data` blocks providing realistic JSON for the four SSM secret
    bundles (tower, seqerakit, groundswell, wave_lite). The locals jsondecode each
    payload and look up specific keys (TOWER_DB_USER, SWELL_DB_USER, ssm_key, etc.),
    so the stub JSON has to carry both the value and ssm_key fields per entry.

The override blocks are duplicated across all four .tftest.hcl files because
terraform test doesn't have a shared-fixture mechanism. If a new SSM key reference
lands in the .tf files, all four .tftest.hcl files need the matching update.

Result: 32 passed, 0 failed locally without any AWS credentials or AWS_PROFILE set.

Refs #334
Wires two new gates into .github/workflows/terraform-test.yml:
  - `terraform fmt -check -recursive -diff` — fails on any unformatted file.
  - `tflint --recursive` (with .tflint.hcl at repo root) — fails on errors;
    warnings are still informational and gated behind disabled rules in
    .tflint.hcl pending follow-up housekeeping.

Real bugs surfaced + fixed by tflint (terraform_map_duplicate_keys):
  - 004_iam.tf: `app_name` defined twice in instance_role_policy_main
    templatefile. Removed the duplicate.
  - 009_define_file_templates.tf: `db_database_name` defined twice in
    groundswell_sql templatefile. Removed the duplicate.

Auto-fixes applied via `tflint --fix`:
  - 23 deprecated interpolation expressions ("${var.x}" -> var.x) across
    000_main.tf, 002_security_groups.tf, 004_iam.tf, 005_parameter_store.tf,
    008_route53.tf, 009_define_file_templates.tf, 010_prepare_config_files.tf,
    011_configure_vm.tf, modules/connection_strings/v1.0.0/main.tf, and
    modules/elasticache/output.tf.

Manual residual:
  - 5 deprecated interpolations in modules/connection_strings/v1.0.0/main.tf
    and modules/elasticache/output.tf that --fix bailed on (path resolution
    quirk in the fix tool with submodules).

Repo-wide fmt:
  - `terraform fmt -recursive` applied to the whole tree. tests/datafiles/
    fixtures (which are regenerated each test run) are now also fmt'd at the
    end of generate_core_data.sh so the regenerated output stays clean.

.tflint.hcl (new, repo root):
  - Enables the `terraform` recommended preset and the AWS ruleset.
  - Disables a handful of rules with documented follow-up:
    terraform_module_version (the repo pins via folder name, not version arg),
    terraform_documented_variables / _outputs (large pre-existing gap;
    docs pass scheduled separately),
    terraform_required_version / _required_providers (need a minimum-version
    decision; terraform test/override_data set the floor at 1.7+),
    terraform_unused_declarations (per-item audit needed; some are TODO'd).

Local: terraform fmt -check -recursive (clean), tflint --recursive (exit 0),
make terraform_test (32 passed, 0 failed).

Refs #334
…bmodules

The previous CI run failed at the tflint step because:
  1. tflint exits 2 on *any* issue, including warnings — so even though
     errors were absent, warnings still failed the job.
  2. The root .tflint.hcl ruleset (which disables a handful of pre-existing
     housekeeping warnings) is not auto-discovered when --recursive walks
     into modules/connection_strings/v1.0.0/, modules/subnet_collector/v1.0.0/,
     and tests/datafiles/.

Fixes:
  - Pass --minimum-failure-severity=error so warnings stay informational.
  - Set TFLINT_CONFIG_FILE so submodules use the root config.

Tighten to --minimum-failure-severity=warning as part of the housekeeping
follow-up once the disabled rules in .tflint.hcl are addressed.

Refs #334
Addresses every remaining tflint finding instead of suppressing them.
After this commit `tflint --recursive` is fully clean (exit 0, no warnings)
and the CI step no longer needs --minimum-failure-severity=error.

terraform_required_version
  - Root 000_main.tf: bumped from ">= 1.1.0" to ">= 1.7.0" — terraform test
    override_data blocks (used by tests/terraform/) require >= 1.7.
  - modules/connection_strings/v1.0.0/main.tf: added terraform { required_version }.
  - modules/subnet_collector/v1.0.0/main.tf: added terraform { required_version }.
  - tests/datafiles/generate_core_data.sh: emit a terraform { required_version }
    block in the generated 012_testing_outputs.tf.

terraform_required_providers
  - Root 000_main.tf: declared all five providers used in the tree —
    aws (~> 5.12), tls (~> 4.0), null (~> 3.2), random (~> 3.6), external (~> 2.3).
  - modules/connection_strings/v1.0.0/main.tf: declared external (~> 2.3).
  - modules/subnet_collector/v1.0.0/main.tf: declared aws (~> 5.12).

terraform_unused_declarations
  - Removed `flag_create_load_balancer` from the connection_strings module's
    variables.tf and from the parent's module call in 000_main.tf — it was
    declared but never referenced inside the module.
  - Removed `mock_sp_vpc` and `mock_aws_subnets_all` locals from
    modules/subnet_collector/v1.0.0/main.tf — leftover stubs from an earlier
    mock pattern; never referenced.

.tflint.hcl
  - Removed the temporary disables for terraform_required_version,
    terraform_required_providers, and terraform_unused_declarations now that
    the underlying findings are fixed.

CI workflow
  - tflint step no longer passes --minimum-failure-severity=error; default
    behaviour fails on any warning, which is what we want now.

Local: `terraform fmt -check -recursive` exit 0, `tflint --recursive` exit 0,
`make terraform_test` 32 passed / 0 failed.

Refs #334
Sweeps comments added during the variable-validation migration that document
"this used to be in check_configuration.py / now lives in validation {} blocks"
or otherwise narrate the migration story. Those belong in the PR description,
not in long-lived code where they will rot.

Cuts (≈120 lines removed, no behaviour change):

  scripts/installer/validation/check_configuration.py
    - Trimmed "Note: ... is now enforced by `validation {}` block" notes
      from verify_tower_server_url, verify_tower_self_signed_certs,
      verify_docker_version, verify_data_studio, verify_data_studio_ssh.
    - Removed verify_pipeline_versioning entirely (no-op stub) and its
      call site in __main__.
    - Dropped two migration-narrative blocks above the residual __main__ calls.

  variables.tf
    - Dropped "replaces verify_only_one_true_set" preamble before the
      cross-variable rules; kept a one-liner stating the anchor convention.
    - Removed "REPLACE_ME passthrough" / "stays in Python because spans
      variables" notes — the conditions and error_messages already say that.
    - Trimmed the floor-rule comment to just the genuinely non-obvious part
      (`!can(...)` short-circuit guard).
    - Removed inline "master supports MySQL 8.x only" comments — duplicates
      the error_message text.

  tests/terraform/*.tftest.hcl
    - File headers reduced to a single descriptive line. The repeated
      4-line "000_main.tf jsondecodes..." block is now a 2-line note that
      states the only non-obvious constraint (each entry needs both `value`
      and `ssm_key`).
    - Dropped "See version_validation.tftest.hcl for the baseline-tfvars
      convention" cross-references; tests/terraform/README.md owns that.

  .github/workflows/terraform-test.yml
    - Header no longer references issue #334.
    - Trimmed the long Python-deps explanation to one line.
    - Dropped the inline default-behaviour explanation on the tflint step.
    - Dropped the redundant CX_SKIP_SSM = "skip writing real SSM" comment
      (the variable name is self-documenting).

  .tflint.hcl
    - Removed the "Initial baseline ... tighten over time" preamble and
      restated the rule disables as compact one-liners.

  tests/datafiles/generate_core_data.sh
    - Tightened the terraform-fmt block comment.

Local: terraform fmt -check -recursive (clean), tflint --recursive (exit 0),
make terraform_test (32 passed, 0 failed).

Refs #334
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant