Add support for external LBs for fulcio, dex, TSA #195
cmurphy wants to merge 2 commits into sigstore:main
Conversation
cmurphy force-pushed from 3f1f640 to cac824f
Can you say more about the leader instances? It makes sense that we want to only create global resources once, though I'm wondering if we should have these resources declared separately. Do you think there's any value in that?
The non-shared resources are linked to the shared resources, and it's a little easier to pass those references directly from one resource to another in the same module than to pass them through from one module to another. And, I want to avoid as much as possible adding to the bloat of modules that the root module has to call beyond the sigstore composite module. I like the simplicity of having one sigstore module per region, and I regret that the tiles module breaks that pattern.
cmurphy force-pushed from 9e880e4 to b977494
Since I wasn't able to replicate the leader-type model for tiles_tlog, I did end up finding it made more sense to create a separate global module for the shared resources, and this current version does that.
Add a sigstore_global module that deploys common resources to support external load balancers for Fulcio, Dex, and TSA. If deployed in a single region, individual modules call their respective "global" modules to deploy universal resources like Cloud Armor policies or IP addresses. If deployed in multi-region, the individual modules defer to the sigstore_global module to deploy these resources and additional load balancer resources so they are only managed once. The modules have the option of not managing the DNS record to enable a smooth transition from having the individual module manage it to managing it with the global module.

We use DNS Authorization using CNAME records instead of raw Certificate Manager certificates because the DNS record might, during the transition period, be managed by another module, and the IP address may not resolve to the new global IP address. Using DNS Authorization means we can still provision all the SSL resources even before the DNS A record is fully transitioned to the new global IP address.

Also move project_roles to the global module, since IAM bindings are also global resources and can't be duplicated. Calling modules can just set their IAM bindings in sigstore_global instead of sigstore.

Signed-off-by: Colleen Murphy <[email protected]>
Dex has a /keys endpoint that Fulcio queries to verify identity tokens. Without this aggregator service, if Fulcio queried the global URL, the keys it fetches may come from either instance and may then cause the verification to fail. This change creates a bucket to store the keys, load balancer routing to route requests to the bucket, and a Cloud Function that merges keys from each instance into one list. Separately, the helm chart for Dex will include a cron job that publishes its keys to this bucket for the Cloud Function to merge.

Signed-off-by: Colleen Murphy <[email protected]>
```hcl
resource "google_storage_bucket_iam_member" "function_bucket_access" {
  bucket = google_storage_bucket.auth_bucket.name
  role   = "roles/storage.objectUser"
  member = "serviceAccount:${var.project_number}[email protected]"
}
```
| module "dex" { | ||
| source = "../dex/global" | ||
|
|
||
| project_id = var.project_id | ||
| project_number = var.project_number | ||
|
|
||
| lb_address_name = "dex-global-ext-lb" | ||
|
|
||
| dns_zone_name = var.dns_zone_name | ||
| dns_domain_name = var.dns_domain_name | ||
| manage_dns_a_record = var.dex_manage_dns_a_record | ||
|
|
||
| enable_cloud_armor = true | ||
| cloud_armor_policy_name = "dex-service-security-policy-global" | ||
| cloud_armor_rules = var.dex_cloud_armor_rules | ||
| enable_adaptive_protection = true | ||
| enable_ssl_policy = true | ||
| ssl_policy_name = "dex-ingress-ssl-policy-global" | ||
|
|
||
| network_endpoint_group_zones = var.network_endpoint_group_zones | ||
| network_endpoint_group_name = var.dex_network_endpoint_group_name | ||
| backend_service_max_rps = var.dex_backend_service_max_rps | ||
|
|
||
| enable_healthcheck_logging = var.enable_loadbalancer_logging | ||
| enable_backend_service_logging = var.enable_loadbalancer_logging | ||
| } |
| resource "google_storage_bucket" "auth_bucket" { | ||
| project = var.project_id | ||
|
|
||
| name = "${var.project_id}-dex-jwks-storage" | ||
| location = "US" | ||
|
|
||
| uniform_bucket_level_access = true | ||
| } |
| module "dex" { | ||
| source = "../dex/global" | ||
|
|
||
| project_id = var.project_id | ||
| project_number = var.project_number | ||
|
|
||
| lb_address_name = "dex-global-ext-lb" | ||
|
|
||
| dns_zone_name = var.dns_zone_name | ||
| dns_domain_name = var.dns_domain_name | ||
| manage_dns_a_record = var.dex_manage_dns_a_record | ||
|
|
||
| enable_cloud_armor = true | ||
| cloud_armor_policy_name = "dex-service-security-policy-global" | ||
| cloud_armor_rules = var.dex_cloud_armor_rules | ||
| enable_adaptive_protection = true | ||
| enable_ssl_policy = true | ||
| ssl_policy_name = "dex-ingress-ssl-policy-global" | ||
|
|
||
| network_endpoint_group_zones = var.network_endpoint_group_zones | ||
| network_endpoint_group_name = var.dex_network_endpoint_group_name | ||
| backend_service_max_rps = var.dex_backend_service_max_rps | ||
|
|
||
| enable_healthcheck_logging = var.enable_loadbalancer_logging | ||
| enable_backend_service_logging = var.enable_loadbalancer_logging | ||
| } |
loosebazooka left a comment
I think I'm finally wrapping my head around this. I have a few questions.
Also Gemini said this (maybe this is a concern?):
The sigstore_global module uses data sources to find the Kubernetes NEGs.
- The Bug: Terraform evaluates data sources during the Plan phase. If a user is deploying a brand-new region, the GKE cluster doesn't exist yet, so the NEGs don't exist yet.
- Impact: terraform plan will fail and exit because the data source returns a "404 Not Found" error. This makes it impossible to do a clean, one-command deployment of a new global stack.
- Fix: The module should use a pattern that allows the NEG names to be passed as strings and only validated during the Apply phase, or include a "skip" toggle for the data source (see the sketch below).
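One way the suggested "skip" toggle could look: gate the data source on a boolean so terraform plan succeeds before the cluster exists. This is a minimal sketch with a hypothetical variable name (network_endpoint_groups_exist), not the module's actual code:

```hcl
variable "network_endpoint_groups_exist" {
  description = "Set to true once the GKE-created NEGs exist, so the lookup can run."
  type        = bool
  default     = false
}

# With for_each over an empty set, the lookup is skipped entirely during
# plan/apply until the flag is flipped on a later run.
data "google_compute_network_endpoint_group" "neg" {
  for_each = var.network_endpoint_groups_exist ? toset(var.network_endpoint_group_zones) : toset([])

  project = var.project_id
  name    = var.network_endpoint_group_name
  zone    = each.key
}
```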
```hcl
}

event_trigger {
  trigger_region = "us"
```
does this retry automatically?
Can we end up in a stale state on dex key rotation if the Cloud Function-triggered run fails?
I imagine we don't change those keys that often, but I'm not sure I understand the full failure case here.
Also could we have a race condition if both regions updated keys at the same time?
> does this retry automatically?

As-is this doesn't retry automatically, but it looks like that's easy to add, so I'll do that (sketch below).

> Can we end up in a stale state on dex key rotation if the Cloud Function-triggered run fails?

Yes, but there is an alert set up to prevent that from becoming a problem.

> I imagine we don't change those keys that often

We don't change the keys ourselves; Dex does that itself every 24 hours. So the failure case would be something like:
- Dex pod pushes a bad update. The Cloud Function triggers, and fails.
- The Dex cronjob triggers every 1 minute, so depending on the failure it may self-resolve if the Cloud Function then triggers again and succeeds.
- If not, the last known keys are untouched and will remain valid for a while; even after Dex rotates its key, the old one will still be valid for 24 hours.
- The stale keys alert will trigger an hour after the bad push, so someone will check the problem and fix it before the keys expire.
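A minimal sketch of the retry change, assuming the merger is a 2nd-gen Cloud Function triggered by object finalization in the JWKS bucket; resource names, runtime, and entry point here are illustrative, not the PR's actual code:

```hcl
resource "google_cloudfunctions2_function" "jwks_merger" {
  project  = var.project_id
  name     = "jwks-merger"
  location = "us-central1"

  build_config {
    runtime     = "python312"
    entry_point = "merge_keys" # hypothetical entry point
    source {
      storage_source {
        # Assumed separate bucket holding the function source, so uploading
        # the zip doesn't itself trigger the function.
        bucket = google_storage_bucket.function_source.name
        object = "jwks-merger.zip"
      }
    }
  }

  event_trigger {
    trigger_region = "us"
    event_type     = "google.cloud.storage.object.v1.finalized"
    # Default is RETRY_POLICY_DO_NOT_RETRY; this makes Eventarc redeliver
    # the event with backoff if the function invocation fails.
    retry_policy = "RETRY_POLICY_RETRY"
    event_filters {
      attribute = "bucket"
      value     = google_storage_bucket.auth_bucket.name
    }
  }
}
```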
> Also could we have a race condition if both regions updated keys at the same time?
Each region will push updates to keys/[region].json, so there is no race condition on push. Both Cloud Functions will trigger the merge update no matter which region updated their keys, so it is probable that both will run at about the same time, but this should be safe because of GCS's strong consistency and because the output is idempotent, so it doesn't matter which region finishes first or last.
Cloud Functions have to be regional because of the compute resources they use; there's no global version. We could just have one in one region to avoid the duplication and unnecessary dual writes, but that kind of defeats the purpose of the redundant infrastructure we're building.
> If not, the last known keys are untouched and will remain valid for a while; even after Dex rotates its key, the old one will still be valid for 24 hours.
What's the delay between key rotation and when the instance starts using the new key to sign?
I'm pretty sure it's instantaneous. So what I think you're pointing out is that there could be a gap between when the key starts being used to sign and when it's actually available in the bucket. So I'll need to figure something out to avoid having identity tokens fail to verify for one minute every 24 hours.
Hah yeah I guess I'm trying to understand what could go wrong?
```hcl
  uniform_bucket_level_access = true
}

# Grant the global internet permission to read the merged keys file via the CDN
```
Does the global internet need to read these? Dex has its own .well-known to actually access those values. And I think somewhat importantly, how are clients supposed to access this info? Clients are designed to use the .well-known to validate dex tokens before they send them off to fulcio.
> Does the global internet need to read these? Dex has its own .well-known to actually access those values.

The .well-known configuration provides the jwks_uri to point to the keys endpoint, but it doesn't provide the keys themselves. The public keys are public information, and the current Dex instance already exposes them publicly (https://oauth2.sigstage.dev/auth/keys).

> And I think somewhat importantly, how are clients supposed to access this info?

They'll access the .well-known the same way, and the jwks_uri points to the same address as it always has. The URL map configuration makes this completely transparent.

> before they send them off to fulcio.

My understanding is that Fulcio has to validate the tokens too.
So I'm trying to understand this:
- the client requests a token from oauth2.sigstage.dev
- the client gets a token, then validates the token before sending it to fulcio (this is what sigstore-java does at least).

The issue I'm concerned about is client behavior: can it correctly find the /.well-known/openid-configuration endpoint if it's hidden behind the LB proxy? Should it be asking the specific dex instance for this info?
Is our config that both/all dex instances expose the exact same /.well-known/openid-configuration, with /.well-known/keys just pointing to the consolidated keys file?
The .well-known/openid-configuration isn't hidden; it is passed through to the Dex pod by https://github.com/sigstore/terraform-modules/pull/195/changes/BASE..c0f76e2c7dc6926b7b43c069c38d459d8380d6de#diff-8285fd7d43c899c705485c388abca03ab5bd1342b6c42cf2c43ce252f27c1429R241, which still serves that well-known endpoint. It's only if you make a request to /auth/keys that you'll be routed to this bucket. All the Dex instances will be serving their own .well-known configs, and they will happen to be identical because all of the URLs in them will be the same.
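For illustration, the routing described above could look roughly like this in Terraform; the resource names are made up, and google_compute_backend_service.dex stands in for the existing NEG-backed Dex backend service:

```hcl
# Backend bucket serving the merged JWKS file through the CDN.
resource "google_compute_backend_bucket" "jwks" {
  project     = var.project_id
  name        = "dex-jwks-backend"
  bucket_name = google_storage_bucket.auth_bucket.name
  enable_cdn  = true
}

resource "google_compute_url_map" "dex" {
  project = var.project_id
  name    = "dex-global-url-map"

  # Everything, including /.well-known/openid-configuration, goes to the
  # Dex pods by default...
  default_service = google_compute_backend_service.dex.id

  host_rule {
    hosts        = [var.dns_domain_name]
    path_matcher = "dex"
  }

  path_matcher {
    name            = "dex"
    default_service = google_compute_backend_service.dex.id

    # ...except /auth/keys, which is routed to the merged-keys bucket.
    path_rule {
      paths   = ["/auth/keys"]
      service = google_compute_backend_bucket.jwks.id
    }
  }
}
```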
yeah "hidden" was probably the wrong word. I wasn't fully grasping how a client wanting to verify the token would interact with the system. What interesting to me (now that I've looked at it more) is the token and auth endpoints obtained to use for a web flow:
In the flow (at least how sigstore-java does it).
- go ask for
/.well-known/openid-configurationand parse outauthorization_endpointandtoken_endpointswhich we need for the web flow. - Use authEndpoint to initiate login and get an auth code
- Use tokenEndpoint to trade the auth token for an id token
If these endpoints are mixed up or not routed to the same oauth2 server, I think things will get wonky. What I'm trying to figure out here is if all contact with the global dex endpoint will hit the same dex instance over the course of the auth flow.
If I am understanding this correctly, authEndpoint and tokenEndpoint should be instance specific.
Hmm this is a valid concern and a flaw in my design. I was only thinking about Fulcio verifying the token and not the token acquisition itself. I'll have to think through this and come up with an alternative.
Will come back to each in-line review soon. In the meantime, here is a draft of the migration strategy that should hopefully give context on why this is structured the way it is and how all the pieces come together. Yes, the NEGs will not be discoverable at first; there is a multi-step process to set up the base infrastructure in Terraform, create the NEGs from Kubernetes, then attach to them in Terraform.
Add support for external LBs for fulcio, dex, TSA
Add a sigstore_global module that deploys common resources to support
external load balancers for Fulcio, Dex, and TSA. If deployed in a
single region, individual modules call their respective "global" modules
to deploy universal resources like Cloud Armor policies or IP addresses.
If deployed in multi-region, the individual modules defer to the
sigstore_global module to deploy these resources and additional load
balancer resources so they are only managed once. The modules have the
option of not managing the DNS record to enable a smooth transition from
having the individual module manage it to managing it with the global
module.
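As a rough sketch of that DNS opt-out (variable and resource names here are assumptions, not necessarily the module's actual code), the A record is created only when the caller asks for it, so ownership can move between modules without both trying to manage the same record:

```hcl
resource "google_compute_global_address" "lb" {
  project = var.project_id
  name    = var.lb_address_name
}

resource "google_dns_record_set" "service_a_record" {
  # During the transition, the caller leaves manage_dns_a_record = false
  # until the old module releases the record.
  count = var.manage_dns_a_record ? 1 : 0

  project      = var.project_id
  managed_zone = var.dns_zone_name
  name         = "${var.dns_domain_name}."
  type         = "A"
  ttl          = 300
  rrdatas      = [google_compute_global_address.lb.address]
}
```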
We use DNS Authorization using CNAME records instead of raw Certificate
Manager certificates because the DNS record might, during the transition
period, be managed by another module, and the IP address may not resolve
to the new global IP address. Using DNS Authorization means we can still
provision all the SSL resources even before the DNS A record is fully
transitioned to the new global IP address.
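A minimal sketch of that DNS Authorization flow with Certificate Manager, assuming illustrative names: the CNAME challenge record proves domain ownership independently of where the A record currently points, so the certificate can provision before the cutover.

```hcl
resource "google_certificate_manager_dns_authorization" "this" {
  project = var.project_id
  name    = "dex-dns-auth"
  domain  = var.dns_domain_name
}

# Publish the CNAME challenge record that Certificate Manager hands back.
resource "google_dns_record_set" "dns_auth_cname" {
  project      = var.project_id
  managed_zone = var.dns_zone_name
  name         = google_certificate_manager_dns_authorization.this.dns_resource_record[0].name
  type         = google_certificate_manager_dns_authorization.this.dns_resource_record[0].type
  ttl          = 300
  rrdatas      = [google_certificate_manager_dns_authorization.this.dns_resource_record[0].data]
}

resource "google_certificate_manager_certificate" "this" {
  project = var.project_id
  name    = "dex-cert"
  managed {
    domains            = [var.dns_domain_name]
    dns_authorizations = [google_certificate_manager_dns_authorization.this.id]
  }
}
```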
Add aggregator service for Dex
Dex has a /keys endpoint that Fulcio queries to verify identity tokens.
Without this aggregator service, if Fulcio queried the global URL, the
keys it fetches may come from either instance and may then cause the
verification to fail. This change creates a bucket to store the keys,
load balancer routing to route requests to the bucket, and a Cloud
Function that merges keys from each instance into one list. Separately,
the helm chart for Dex will include a cron job that publishes its keys
to this bucket for the Cloud Function to merge.
note: jwks-merger code written by Gemini and reviewed by me. Supporting Terraform code for Cloud Functions written with help from Gemini.
See https://github.com/sigstore/public-good-instance/issues/3603#issuecomment-4362535721 for more details on this change.
Relates to https://github.com/sigstore/public-good-instance/issues/3603
Depends on #199
Summary
Release Note
Documentation