Skip to content

Add support for external LBs for fulcio, dex, TSA#195

Open
cmurphy wants to merge 2 commits intosigstore:mainfrom
cmurphy:external-lbs
Open

Add support for external LBs for fulcio, dex, TSA#195
cmurphy wants to merge 2 commits intosigstore:mainfrom
cmurphy:external-lbs

Conversation

@cmurphy
Copy link
Copy Markdown
Contributor

@cmurphy cmurphy commented Apr 3, 2026

Add support for external LBs for fulcio, dex, TSA

Add a sigstore_global module that deploys common resources to support
external load balancers for Fulcio, Dex, and TSA. If deployed in a
single region, individual modules call their respective "global" modules
to deploy universal resources like Cloud Armor policies or IP addresses.
If deployed in multi-region, the individual modules defer to the
sigstore_global module to deploy these resources and additional load
balancer resources so they are only managed once. The modules have the
option of not managing the DNS record to enable a smooth transition from
having the individual module manage it to managing it with the global
module.

We use DNS Authorization using CNAME records instead of raw Certificate
Manager certificates because the DNS record might, during the transition
period, be managed by another module, and the IP address may not resolve
to the new global IP address. Using DNS Authorization means we can still
provision all the SSL resources even before the DNS A record is fully
transitioned to the new global IP address.

Add aggregator service for Dex

Dex has a /keys endpoint that Fulcio queries to verify identity tokens.
Without this aggregator service, if Fulcio queried the global URL, the
keys it fetches may come from either instance and may then cause the
verification to fail. This change creates a bucket to store the keys,
load balancer routing to route requests to the bucket, and a Cloud
Function that merges keys from each instance into one list. Separately,
the helm chart for Dex will include a cron job that publishes its keys
to this bucket for the Cloud Function to merge.

note: jwks-merger code written by Gemini and reviewed by myself. Supporting terraform code for Cloud Functions written with help from Gemini.

See https://github.com/sigstore/public-good-instance/issues/3603#issuecomment-4362535721 for more details on this change.

Relates to https://github.com/sigstore/public-good-instance/issues/3603
Depends on #199

Summary

Release Note

Documentation

@cmurphy cmurphy force-pushed the external-lbs branch 3 times, most recently from 3f1f640 to cac824f Compare April 8, 2026 22:00
@cmurphy cmurphy marked this pull request as ready for review April 14, 2026 15:27
@cmurphy cmurphy requested a review from a team as a code owner April 14, 2026 15:27
@Hayden-IO
Copy link
Copy Markdown
Contributor

Can you say more about the leader instances? It makes sense that we want to only create global resources once, though I'm wondering if we should have these resources declared separately. Do you think there's any value in that?

@cmurphy
Copy link
Copy Markdown
Contributor Author

cmurphy commented Apr 15, 2026

The non-shared-resources are linked to the shared resources and it's a little easier to pass those references directly from one resource to another in the same module than to pass them through from one module to another. And, I want to avoid as much as possible adding to the bloat of modules that the root module has to call beyond the sigstore composite module. I like the simplicity of having one sigstore module per region and I regret that the tiles module breaks that pattern.

@cmurphy cmurphy marked this pull request as draft April 17, 2026 17:19
@cmurphy cmurphy force-pushed the external-lbs branch 4 times, most recently from 9e880e4 to b977494 Compare April 28, 2026 18:01
@cmurphy
Copy link
Copy Markdown
Contributor Author

cmurphy commented Apr 28, 2026

Since I wasn't able to replicate the leader-type model to tiles_tlog, I did end up finding it made more sense to create a separate global module for the shared resources and this current version does that.

@cmurphy cmurphy marked this pull request as ready for review April 28, 2026 18:04
cmurphy added 2 commits April 30, 2026 15:45
Add a sigstore_global module that deploys common resources to support
external load balancers for Fulcio, Dex, and TSA. If deployed in a
single region, individual modules call their respective "global" modules
to deploy universal resources like Cloud Armor policies or IP addresses.
If deployed in multi-region, the individual modules defer to the
sigstore_global module to deploy these resources and additional load
balancer resources so they are only managed once. The modules have the
option of not managing the DNS record to enable a smooth transition from
having the individual module manage it to managing it with the global
module.

We use DNS Authorization using CNAME records instead of raw Certificate
Manager certificates because the DNS record might, during the transition
period, be managed by another module, and the IP address may not resolve
to the new global IP address. Using DNS Authorization means we can still
provision all the SSL resources even before the DNS A record is fully
transitioned to the new global IP address.

Also move project_roles to the global module, since IAM bindings are
also global resources and can't be duplicated. Calling modules can just
set their IAM bindings in sigstore_global instead of sigstore.

Signed-off-by: Colleen Murphy <[email protected]>
Dex has a /keys endpoint that Fulcio queries to verify identity tokens.
Without this aggregator service, if Fulcio queried the global URL, the
keys it fetches may come from either instance and may then cause the
verification to fail. This change creates a bucket to store the keys,
load balancer routing to route requests to the bucket, and a Cloud
Function that merges keys from each instance into one list. Separately,
the helm chart for Dex will include a cron job that publishes its keys
to this bucket for the Cloud Function to merge.

Signed-off-by: Colleen Murphy <[email protected]>
Comment thread gcp/modules/dex/global/gcs.tf Dismissed
Comment thread gcp/modules/sigstore_global/global.tf Dismissed
Comment on lines +34 to +38
resource "google_storage_bucket_iam_member" "function_bucket_access" {
bucket = google_storage_bucket.auth_bucket.name
role = "roles/storage.objectUser"
member = "serviceAccount:${var.project_number}[email protected]"
}
Comment on lines +23 to +48
module "dex" {
source = "../dex/global"

project_id = var.project_id
project_number = var.project_number

lb_address_name = "dex-global-ext-lb"

dns_zone_name = var.dns_zone_name
dns_domain_name = var.dns_domain_name
manage_dns_a_record = var.dex_manage_dns_a_record

enable_cloud_armor = true
cloud_armor_policy_name = "dex-service-security-policy-global"
cloud_armor_rules = var.dex_cloud_armor_rules
enable_adaptive_protection = true
enable_ssl_policy = true
ssl_policy_name = "dex-ingress-ssl-policy-global"

network_endpoint_group_zones = var.network_endpoint_group_zones
network_endpoint_group_name = var.dex_network_endpoint_group_name
backend_service_max_rps = var.dex_backend_service_max_rps

enable_healthcheck_logging = var.enable_loadbalancer_logging
enable_backend_service_logging = var.enable_loadbalancer_logging
}
Comment on lines +17 to +24
resource "google_storage_bucket" "auth_bucket" {
project = var.project_id

name = "${var.project_id}-dex-jwks-storage"
location = "US"

uniform_bucket_level_access = true
}
Comment on lines +23 to +48
module "dex" {
source = "../dex/global"

project_id = var.project_id
project_number = var.project_number

lb_address_name = "dex-global-ext-lb"

dns_zone_name = var.dns_zone_name
dns_domain_name = var.dns_domain_name
manage_dns_a_record = var.dex_manage_dns_a_record

enable_cloud_armor = true
cloud_armor_policy_name = "dex-service-security-policy-global"
cloud_armor_rules = var.dex_cloud_armor_rules
enable_adaptive_protection = true
enable_ssl_policy = true
ssl_policy_name = "dex-ingress-ssl-policy-global"

network_endpoint_group_zones = var.network_endpoint_group_zones
network_endpoint_group_name = var.dex_network_endpoint_group_name
backend_service_max_rps = var.dex_backend_service_max_rps

enable_healthcheck_logging = var.enable_loadbalancer_logging
enable_backend_service_logging = var.enable_loadbalancer_logging
}
Copy link
Copy Markdown
Member

@loosebazooka loosebazooka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm finally wrapping my head around this. I have a few questions.

Also gemini said this (maybe this is a concern?)

  The sigstore_global module uses data sources to find the Kubernetes NEGs.
   * The Bug: Terraform evaluates data sources during the Plan phase. If a user is deploying a brand-new region, the GKE cluster doesn't exist yet, so the NEGs don't exist yet.
   * Impact: terraform plan will fail and exit because the data source returns a "404 Not Found" error. This makes it impossible to do a clean, one-command deployment of a new global stack.
   * Fix: The module should use a pattern that allows the NEG names to be passed as strings and only validated during the Apply phase, or include a "skip" toggle for the data source.

Comment thread gcp/modules/sigstore_global/variables.tf
Comment thread gcp/modules/sigstore_global/outputs.tf
Comment thread gcp/modules/fulcio/global/network.tf
Comment thread gcp/modules/dex/global/network.tf
Comment thread gcp/modules/dex/gcs.tf
}

event_trigger {
trigger_region = "us"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this retry automatically?

Can we end up in a stale state on dex key rotation if the cloud function triggered run fails?

I imagine we don't change those keys that often, but I'm not sure I understand the full failure case here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could we have a race condition if both regions updated keys at the same time?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this retry automatically?

As-is this doesn't retry automatically but it looks like that's easy to add so I'll do that.

Can we end up in a stale state on dex key rotation if the cloud function triggered run fails?

Yes, but there is an alert set up to prevent that from becoming a problem.

I imagine we don't change those keys that often

We don't change the keys ourselves, Dex does that itself every 24 hours. So the failure case would be something like:

  1. Dex pod pushes a bad update. Cloud Function triggers, and fails.
  2. The Dex cronjob triggers every 1 minute, so depending on the failure it may self-resolve if the Cloud Function then triggers again and succeeds
  3. If not, the last known keys are untouched and will remain valid for a while, even after Dex rotates its key the old one will still be valid for 24 hours.
  4. The stale keys alert will trigger an hour after the bad push, so someone will check the problem and fix it before the keys expire.

Copy link
Copy Markdown
Contributor Author

@cmurphy cmurphy May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could we have a race condition if both regions updated keys at the same time?

Each region will push updates to keys/[region].json, so there is no race condition on push. Both Cloud Functions will trigger the merge update no matter which region updated their keys, so it is probable that both will run at about the same time, but this should be safe because of GCS's strong consistency and because the output is idempotent, so it doesn't matter which region finishes first or last.

Cloud Functions have to be regional because of the compute resources they use, there's no global version. We could just have one in one region to avoid the duplication and unnecessary dual writes, but that kind of defeats the purpose of the redundant infrastructure we're building.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, the last known keys are untouched and will remain valid for a while, even after Dex rotates its key the old one will still be valid for 24 hours.

What's the delay between key rotation and when the instance starts using the new key to sign?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure it's instantaneous. So what I think you're pointing out is that there could be a gap between when the key starts being used to sign and when it's actually available in the bucket. So I'll need to figure something out to avoid having identity tokens fail to verify for one minute every 24 hours.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah yeah I guess I'm trying to understand what could go wrong?

uniform_bucket_level_access = true
}

# Grant the global internet permission to read the merged keys file via the CDN
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the global internet need to read these? Dex has its own .well-known to really access those values. And I think somewhat importantly, how are client supposed to access this info? Clients are designed to use the .well-known to validate dex tokens before they send them off to fulcio.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the global internet need to read these? Dex has its own .well-known to really access those values.

The .well-known configuration provides the jwks_uri to point to the keys endpoint but it doesn't provide the keys themselves. The public keys are public information and the current Dex instance already exposes this publicly (https://oauth2.sigstage.dev/auth/keys).

And I think somewhat importantly, how are client supposed to access this info?

They'll access the .well-known the same way, and the jwks_uri points to the same address as it always has. The URL map configuration makes this completely transparent.

before they send them off to fulcio.

My understanding is that Fulcio also has to validate the tokens too.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm trying to understand this:

  1. the client requests a token from oauth2.sigstage.dev
  2. the client gets a token, it then validates the token before sending it to fulcio (this is what sigstore java does atleast).

The issue I'm concerned about is client behavior: can it correctly find the /.well-known/openid-configuration endpoint if it's hidden behind the LB proxy? it should be asking the specific dex instance for this info?

Is our config that both/all dex instances expose the exact same /.well-known/openid-configuration with /.well-known/keys just pointing to the consolidated keys file?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .well-known/openid-configuration isn't hidden, it is passed through to the Dex pod by https://github.com/sigstore/terraform-modules/pull/195/changes/BASE..c0f76e2c7dc6926b7b43c069c38d459d8380d6de#diff-8285fd7d43c899c705485c388abca03ab5bd1342b6c42cf2c43ce252f27c1429R241 which still serves that well-known endpoint. It's only if you make a request to /auth/keys that you'll be routed to this bucket. All the Dex instances will be serving their own .well-known configs, and they will happen to be identical because all of the URLs in it will be the same.

Copy link
Copy Markdown
Member

@loosebazooka loosebazooka May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah "hidden" was probably the wrong word. I wasn't fully grasping how a client wanting to verify the token would interact with the system. What interesting to me (now that I've looked at it more) is the token and auth endpoints obtained to use for a web flow:

In the flow (at least how sigstore-java does it).

  1. go ask for /.well-known/openid-configuration and parse out authorization_endpoint and token_endpoints which we need for the web flow.
  2. Use authEndpoint to initiate login and get an auth code
  3. Use tokenEndpoint to trade the auth token for an id token

If these endpoints are mixed up or not routed to the same oauth2 server, I think things will get wonky. What I'm trying to figure out here is if all contact with the global dex endpoint will hit the same dex instance over the course of the auth flow.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am understanding this correctly, authEndpoint and tokenEndpoint should be instance specific.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this is a valid concern and a flaw in my design. I was only thinking about Fulcio verifying the token and not the token acquisition itself. I'll have to think through this and come up with an alternative.

@cmurphy
Copy link
Copy Markdown
Contributor Author

cmurphy commented May 4, 2026

Will come back to each in-line review soon. In the meantime, here is a draft of the migration strategy that should hopefully give context on why this is structured the way it is and how all the pieces come together. Yes, the NEGs will not be discoverable at first, there is a multi-step process to set up the base infrastructure in Terraform, create the NEGs from Kubernetes, then attach to them in Terraform.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants