Add support for external LBs for fulcio, dex, TSA #195
cmurphy wants to merge 2 commits into sigstore:main
Conversation
cmurphy force-pushed from 3f1f640 to cac824f
Can you say more about the leader instances? It makes sense that we want to only create global resources once, though I'm wondering if we should have these resources declared separately. Do you think there's any value in that?
The non-shared resources are linked to the shared resources, and it's a little easier to pass those references directly from one resource to another in the same module than to pass them through from one module to another. And, I want to avoid as much as possible adding to the bloat of modules that the root module has to call beyond the sigstore composite module. I like the simplicity of having one sigstore module per region, and I regret that the tiles module breaks that pattern.
cmurphy force-pushed from 9e880e4 to b977494
Since I wasn't able to replicate the leader-type model for tiles_tlog, I did end up finding it made more sense to create a separate global module for the shared resources, and this current version does that.
Add a sigstore_global module that deploys common resources to support external load balancers for Fulcio, Dex, and TSA. If deployed in a single region, individual modules call their respective "global" modules to deploy universal resources like Cloud Armor policies or IP addresses. If deployed in multi-region, the individual modules defer to the sigstore_global module to deploy these resources and additional load balancer resources so they are only managed once. The modules have the option of not managing the DNS record to enable a smooth transition from having the individual module manage it to managing it with the global module.

We use DNS Authorization using CNAME records instead of raw Certificate Manager certificates because the DNS record might, during the transition period, be managed by another module, and the IP address may not resolve to the new global IP address. Using DNS Authorization means we can still provision all the SSL resources even before the DNS A record is fully transitioned to the new global IP address.

Also move project_roles to the global module, since IAM bindings are also global resources and can't be duplicated. Calling modules can just set their IAM bindings in sigstore_global instead of sigstore.

Signed-off-by: Colleen Murphy <[email protected]>
Dex has a /keys endpoint that Fulcio queries to verify identity tokens. Without this aggregator service, if Fulcio queried the global URL, the keys it fetches may come from either instance and may then cause the verification to fail. This change creates a bucket to store the keys, load balancer routing to route requests to the bucket, and a Cloud Function that merges keys from each instance into one list. Separately, the helm chart for Dex will include a cron job that publishes its keys to this bucket for the Cloud Function to merge.

Signed-off-by: Colleen Murphy <[email protected]>
```hcl
resource "google_storage_bucket_iam_member" "function_bucket_access" {
  bucket = google_storage_bucket.auth_bucket.name
  role   = "roles/storage.objectUser"
  member = "serviceAccount:${var.project_number}[email protected]"
}
```
| module "dex" { | ||
| source = "../dex/global" | ||
|
|
||
| project_id = var.project_id | ||
| project_number = var.project_number | ||
|
|
||
| lb_address_name = "dex-global-ext-lb" | ||
|
|
||
| dns_zone_name = var.dns_zone_name | ||
| dns_domain_name = var.dns_domain_name | ||
| manage_dns_a_record = var.dex_manage_dns_a_record | ||
|
|
||
| enable_cloud_armor = true | ||
| cloud_armor_policy_name = "dex-service-security-policy-global" | ||
| cloud_armor_rules = var.dex_cloud_armor_rules | ||
| enable_adaptive_protection = true | ||
| enable_ssl_policy = true | ||
| ssl_policy_name = "dex-ingress-ssl-policy-global" | ||
|
|
||
| network_endpoint_group_zones = var.network_endpoint_group_zones | ||
| network_endpoint_group_name = var.dex_network_endpoint_group_name | ||
| backend_service_max_rps = var.dex_backend_service_max_rps | ||
|
|
||
| enable_healthcheck_logging = var.enable_loadbalancer_logging | ||
| enable_backend_service_logging = var.enable_loadbalancer_logging | ||
| } |
| resource "google_storage_bucket" "auth_bucket" { | ||
| project = var.project_id | ||
|
|
||
| name = "${var.project_id}-dex-jwks-storage" | ||
| location = "US" | ||
|
|
||
| uniform_bucket_level_access = true | ||
| } |
| module "dex" { | ||
| source = "../dex/global" | ||
|
|
||
| project_id = var.project_id | ||
| project_number = var.project_number | ||
|
|
||
| lb_address_name = "dex-global-ext-lb" | ||
|
|
||
| dns_zone_name = var.dns_zone_name | ||
| dns_domain_name = var.dns_domain_name | ||
| manage_dns_a_record = var.dex_manage_dns_a_record | ||
|
|
||
| enable_cloud_armor = true | ||
| cloud_armor_policy_name = "dex-service-security-policy-global" | ||
| cloud_armor_rules = var.dex_cloud_armor_rules | ||
| enable_adaptive_protection = true | ||
| enable_ssl_policy = true | ||
| ssl_policy_name = "dex-ingress-ssl-policy-global" | ||
|
|
||
| network_endpoint_group_zones = var.network_endpoint_group_zones | ||
| network_endpoint_group_name = var.dex_network_endpoint_group_name | ||
| backend_service_max_rps = var.dex_backend_service_max_rps | ||
|
|
||
| enable_healthcheck_logging = var.enable_loadbalancer_logging | ||
| enable_backend_service_logging = var.enable_loadbalancer_logging | ||
| } |
loosebazooka left a comment
I think I'm finally wrapping my head around this. I have a few questions.
Also Gemini said this (maybe this is a concern?):
The sigstore_global module uses data sources to find the Kubernetes NEGs.
- The Bug: Terraform evaluates data sources during the Plan phase. If a user is deploying a brand-new region, the GKE cluster doesn't exist yet, so the NEGs don't exist yet.
- Impact: terraform plan will fail and exit because the data source returns a "404 Not Found" error. This makes it impossible to do a clean, one-command deployment of a new global stack.
- Fix: The module should use a pattern that allows the NEG names to be passed as strings and only validated during the Apply phase, or include a "skip" toggle for the data source (see the sketch below).
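One way the suggested "skip" toggle could look: gate the data source on a boolean so terraform plan succeeds before the cluster exists. This is a minimal sketch with a hypothetical variable name (network_endpoint_groups_exist), not the module's actual code:

```hcl
variable "network_endpoint_groups_exist" {
  description = "Set to true once the GKE-created NEGs exist, so the lookup can run."
  type        = bool
  default     = false
}

# With for_each over an empty set, the lookup is skipped entirely during
# plan/apply until the flag is flipped on a later run.
data "google_compute_network_endpoint_group" "neg" {
  for_each = var.network_endpoint_groups_exist ? toset(var.network_endpoint_group_zones) : toset([])

  project = var.project_id
  name    = var.network_endpoint_group_name
  zone    = each.key
}
```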
```hcl
}

event_trigger {
  trigger_region = "us"
```
does this retry automatically?
Can we end up in a stale state on dex key rotation if the Cloud Function-triggered run fails?
I imagine we don't change those keys that often, but I'm not sure I understand the full failure case here.
Also could we have a race condition if both regions updated keys at the same time?
> does this retry automatically?

As-is this doesn't retry automatically, but it looks like that's easy to add, so I'll do that (sketch below).

> Can we end up in a stale state on dex key rotation if the Cloud Function-triggered run fails?

Yes, but there is an alert set up to prevent that from becoming a problem.

> I imagine we don't change those keys that often

We don't change the keys ourselves; Dex does that itself every 24 hours. So the failure case would be something like:
- Dex pod pushes a bad update. The Cloud Function triggers, and fails.
- The Dex cronjob triggers every 1 minute, so depending on the failure it may self-resolve if the Cloud Function then triggers again and succeeds.
- If not, the last known keys are untouched and will remain valid for a while; even after Dex rotates its key, the old one will still be valid for 24 hours.
- The stale keys alert will trigger an hour after the bad push, so someone will check the problem and fix it before the keys expire.
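A minimal sketch of the retry change, assuming the merger is a 2nd-gen Cloud Function triggered by object finalization in the JWKS bucket; resource names, runtime, and entry point here are illustrative, not the PR's actual code:

```hcl
resource "google_cloudfunctions2_function" "jwks_merger" {
  project  = var.project_id
  name     = "jwks-merger"
  location = "us-central1"

  build_config {
    runtime     = "python312"
    entry_point = "merge_keys" # hypothetical entry point
    source {
      storage_source {
        # Assumed separate bucket holding the function source, so uploading
        # the zip doesn't itself trigger the function.
        bucket = google_storage_bucket.function_source.name
        object = "jwks-merger.zip"
      }
    }
  }

  event_trigger {
    trigger_region = "us"
    event_type     = "google.cloud.storage.object.v1.finalized"
    # Default is RETRY_POLICY_DO_NOT_RETRY; this makes Eventarc redeliver
    # the event with backoff if the function invocation fails.
    retry_policy = "RETRY_POLICY_RETRY"
    event_filters {
      attribute = "bucket"
      value     = google_storage_bucket.auth_bucket.name
    }
  }
}
```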
> Also could we have a race condition if both regions updated keys at the same time?
Each region will push updates to keys/[region].json, so there is no race condition on push. Both Cloud Functions will trigger the merge update no matter which region updated their keys, so it is probable that both will run at about the same time, but this should be safe because of GCS's strong consistency and because the output is idempotent, so it doesn't matter which region finishes first or last.
Cloud Functions have to be regional because of the compute resources they use; there's no global version. We could just have one in one region to avoid the duplication and unnecessary dual writes, but that kind of defeats the purpose of the redundant infrastructure we're building.
> If not, the last known keys are untouched and will remain valid for a while; even after Dex rotates its key, the old one will still be valid for 24 hours.
What's the delay between key rotation and when the instance starts using the new key to sign?
I'm pretty sure it's instantaneous. So what I think you're pointing out is that there could be a gap between when the key starts being used to sign and when it's actually available in the bucket. So I'll need to figure something out to avoid having identity tokens fail to verify for one minute every 24 hours.
Hah yeah I guess I'm trying to understand what could go wrong?
```hcl
  uniform_bucket_level_access = true
}

# Grant the global internet permission to read the merged keys file via the CDN
```
Does the global internet need to read these? Dex has its own .well-known to actually access those values. And I think somewhat importantly, how are clients supposed to access this info? Clients are designed to use the .well-known to validate dex tokens before they send them off to fulcio.
> Does the global internet need to read these? Dex has its own .well-known to actually access those values.

The .well-known configuration provides the jwks_uri to point to the keys endpoint, but it doesn't provide the keys themselves. The public keys are public information, and the current Dex instance already exposes them publicly (https://oauth2.sigstage.dev/auth/keys).

> And I think somewhat importantly, how are clients supposed to access this info?

They'll access the .well-known the same way, and the jwks_uri points to the same address as it always has. The URL map configuration makes this completely transparent.

> before they send them off to fulcio.

My understanding is that Fulcio has to validate the tokens too.
So I'm trying to understand this:
- the client requests a token from oauth2.sigstage.dev
- the client gets a token, then validates the token before sending it to fulcio (this is what sigstore-java does at least).

The issue I'm concerned about is client behavior: can it correctly find the /.well-known/openid-configuration endpoint if it's hidden behind the LB proxy? Should it be asking the specific dex instance for this info?
Is our config that both/all dex instances expose the exact same /.well-known/openid-configuration, with /.well-known/keys just pointing to the consolidated keys file?
The .well-known/openid-configuration isn't hidden; it is passed through to the Dex pod by https://github.com/sigstore/terraform-modules/pull/195/changes/BASE..c0f76e2c7dc6926b7b43c069c38d459d8380d6de#diff-8285fd7d43c899c705485c388abca03ab5bd1342b6c42cf2c43ce252f27c1429R241, which still serves that well-known endpoint. It's only if you make a request to /auth/keys that you'll be routed to this bucket. All the Dex instances will be serving their own .well-known configs, and they will happen to be identical because all of the URLs in them will be the same.
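For illustration, the routing described above could look roughly like this in Terraform; the resource names are made up, and google_compute_backend_service.dex stands in for the existing NEG-backed Dex backend service:

```hcl
# Backend bucket serving the merged JWKS file through the CDN.
resource "google_compute_backend_bucket" "jwks" {
  project     = var.project_id
  name        = "dex-jwks-backend"
  bucket_name = google_storage_bucket.auth_bucket.name
  enable_cdn  = true
}

resource "google_compute_url_map" "dex" {
  project = var.project_id
  name    = "dex-global-url-map"

  # Everything, including /.well-known/openid-configuration, goes to the
  # Dex pods by default...
  default_service = google_compute_backend_service.dex.id

  host_rule {
    hosts        = [var.dns_domain_name]
    path_matcher = "dex"
  }

  path_matcher {
    name            = "dex"
    default_service = google_compute_backend_service.dex.id

    # ...except /auth/keys, which is routed to the merged-keys bucket.
    path_rule {
      paths   = ["/auth/keys"]
      service = google_compute_backend_bucket.jwks.id
    }
  }
}
```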
yeah "hidden" was probably the wrong word. I wasn't fully grasping how a client wanting to verify the token would interact with the system. What interesting to me (now that I've looked at it more) is the token and auth endpoints obtained to use for a web flow:
In the flow (at least how sigstore-java does it).
- go ask for
/.well-known/openid-configurationand parse outauthorization_endpointandtoken_endpointswhich we need for the web flow. - Use authEndpoint to initiate login and get an auth code
- Use tokenEndpoint to trade the auth token for an id token
If these endpoints are mixed up or not routed to the same oauth2 server, I think things will get wonky. What I'm trying to figure out here is if all contact with the global dex endpoint will hit the same dex instance over the course of the auth flow.
If I am understanding this correctly, authEndpoint and tokenEndpoint should be instance specific.
Hmm this is a valid concern and a flaw in my design. I was only thinking about Fulcio verifying the token and not the token acquisition itself. I'll have to think through this and come up with an alternative.
Will come back to each in-line review soon. In the meantime, here is a draft of the migration strategy that should hopefully give context on why this is structured the way it is and how all the pieces come together. Yes, the NEGs will not be discoverable at first; there is a multi-step process to set up the base infrastructure in Terraform, create the NEGs from Kubernetes, then attach to them in Terraform.
Add support for external LBs for fulcio, dex, TSA
Add a sigstore_global module that deploys common resources to support
external load balancers for Fulcio, Dex, and TSA. If deployed in a
single region, individual modules call their respective "global" modules
to deploy universal resources like Cloud Armor policies or IP addresses.
If deployed in multi-region, the individual modules defer to the
sigstore_global module to deploy these resources and additional load
balancer resources so they are only managed once. The modules have the
option of not managing the DNS record to enable a smooth transition from
having the individual module manage it to managing it with the global
module.
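As a rough sketch of that DNS opt-out (variable and resource names here are assumptions, not necessarily the module's actual code), the A record is created only when the caller asks for it, so ownership can move between modules without both trying to manage the same record:

```hcl
resource "google_compute_global_address" "lb" {
  project = var.project_id
  name    = var.lb_address_name
}

resource "google_dns_record_set" "service_a_record" {
  # During the transition, the caller leaves manage_dns_a_record = false
  # until the old module releases the record.
  count = var.manage_dns_a_record ? 1 : 0

  project      = var.project_id
  managed_zone = var.dns_zone_name
  name         = "${var.dns_domain_name}."
  type         = "A"
  ttl          = 300
  rrdatas      = [google_compute_global_address.lb.address]
}
```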
We use DNS Authorization using CNAME records instead of raw Certificate
Manager certificates because the DNS record might, during the transition
period, be managed by another module, and the IP address may not resolve
to the new global IP address. Using DNS Authorization means we can still
provision all the SSL resources even before the DNS A record is fully
transitioned to the new global IP address.
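A minimal sketch of that DNS Authorization flow with Certificate Manager, assuming illustrative names: the CNAME challenge record proves domain ownership independently of where the A record currently points, so the certificate can provision before the cutover.

```hcl
resource "google_certificate_manager_dns_authorization" "this" {
  project = var.project_id
  name    = "dex-dns-auth"
  domain  = var.dns_domain_name
}

# Publish the CNAME challenge record that Certificate Manager hands back.
resource "google_dns_record_set" "dns_auth_cname" {
  project      = var.project_id
  managed_zone = var.dns_zone_name
  name         = google_certificate_manager_dns_authorization.this.dns_resource_record[0].name
  type         = google_certificate_manager_dns_authorization.this.dns_resource_record[0].type
  ttl          = 300
  rrdatas      = [google_certificate_manager_dns_authorization.this.dns_resource_record[0].data]
}

resource "google_certificate_manager_certificate" "this" {
  project = var.project_id
  name    = "dex-cert"
  managed {
    domains            = [var.dns_domain_name]
    dns_authorizations = [google_certificate_manager_dns_authorization.this.id]
  }
}
```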
Add aggregator service for Dex
Dex has a /keys endpoint that Fulcio queries to verify identity tokens.
Without this aggregator service, if Fulcio queried the global URL, the
keys it fetches may come from either instance and may then cause the
verification to fail. This change creates a bucket to store the keys,
load balancer routing to route requests to the bucket, and a Cloud
Function that merges keys from each instance into one list. Separately,
the helm chart for Dex will include a cron job that publishes its keys
to this bucket for the Cloud Function to merge.
note: jwks-merger code written by Gemini and reviewed by me. Supporting Terraform code for Cloud Functions written with help from Gemini.
See https://github.com/sigstore/public-good-instance/issues/3603#issuecomment-4362535721 for more details on this change.
Relates to https://github.com/sigstore/public-good-instance/issues/3603
Depends on #199
Summary
Release Note
Documentation