[DevOps] ADVENTIST production issue: configure healthy probe path and mitigate Akamai SureRoute 404s #1018
Description
Per Claude: https://notch8.slack.com/archives/C0877SLCMN3/p1758470487290609
Incident: ADL prod returning 520/503 (“Web server returning an unknown error”)
When: approximately 09:05–09:18 PT
Impact: Public traffic to ADL intermittently failed via the CDN. Internal pods stayed up.
What we saw
- Rails web pods were healthy (no app 5xx). Normal DB/Solr activity during the window.
- Traffic to the pods suddenly dropped (Ingress/CDN stopped sending requests).
- Logs show repeated requests from 23.x.x.x IPs to `GET /akamai/sureroute-test-object.html`, returning 404.
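The probe traffic described above can be tallied from the web logs with a short script. This is a minimal sketch that assumes the default Rails request log line (`Started GET "..." for <ip> at ...`); if the app uses a different log formatter, the regular expression will need adjusting. The method and constant names are illustrative, not part of the app.

```ruby
# Path Akamai SureRoute was requesting, taken from the incident logs.
PROBE_PATH = "/akamai/sureroute-test-object.html"

# Assumption: default Rails request log line, e.g.
#   Started GET "/some/path" for 192.0.2.10 at 2025-01-01 09:05:00 -0700
STARTED_LINE = /Started GET "(?<path>[^"]+)" for (?<ip>\S+) at /

# Count probe requests per client IP across an enumerable of log lines.
def probe_hits_by_ip(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    match = STARTED_LINE.match(line)
    next unless match

    # Ignore any query string when comparing paths.
    path = match[:path].split("?").first
    counts[match[:ip]] += 1 if path == PROBE_PATH
  end
end

# Usage (hypothetical file name): ruby probe_hits.rb log/production.log
if ARGV.any?
  probe_hits_by_ip(File.foreach(ARGV.first)).each do |ip, count|
    puts "#{ip}\t#{count}"
  end
end
```

This only counts requests; the 404 status appears on the separate `Completed ...` line in the default format, so confirming the response code still needs a look at the matching lines.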
Root cause (external)
- One tenant hostname, `libraryarchive.ahu.edu` (AHU), is fronted by Akamai with SureRoute enabled.
- SureRoute performs origin health checks to a default path: `/akamai/sureroute-test-object.html`.
- Our app doesn’t serve that path → 404 → Akamai marks origin unhealthy → upstream 520/503 to users.
- Adding a web pod restored service (edge re-established healthy backends), but that was a capacity/endpoint refresh, not a fix for the underlying config issue.
Actions taken (today)
- Scaled the web deployment up by +1 pod to stabilize traffic.
Permanent fix (proposal)
- Option A (fast + we control): Serve the expected probe as a static file so any Akamai checks succeed:
  - Add file: `public/akamai/sureroute-test-object.html` with contents like `OK`.
  - Ensure `RAILS_SERVE_STATIC_FILES=1` in prod so Rails serves it.
- Option B (preferred long-term): Standardize on a health endpoint and align edge checks:
  - Expose a guaranteed-200 health endpoint from the app:
      # config/routes.rb
      get "/healthz", to: proc { [200, {"Content-Type" => "text/plain"}, ["ok"]] }
  - Ask AHU to update their Akamai property so health checks target `/healthz` (and/or disable SureRoute if not needed).
  - If Cloudflare health checks are also enabled, point them to `/healthz` too (avoid double checks to routes that can 404).
- Monitoring: add alerts for Ingress 5xx spikes and CDN “origin down” events.
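The `/healthz` route proposed in Option B is a plain Rack endpoint: anything that responds to `call` and returns a `[status, headers, body]` triple. A minimal sketch of that contract, exercised without booting Rails, might look like the following. The constant name `HEALTH_ENDPOINT` and the hand-built `env` hash are illustrative assumptions; in the app the proc would live inline in `config/routes.rb` as proposed.

```ruby
# Rack-style endpoint equivalent to the proc proposed for /healthz.
# It ignores the request entirely, so it cannot fail on DB or Solr outages.
HEALTH_ENDPOINT = proc do |_env|
  [200, { "Content-Type" => "text/plain" }, ["ok"]]
end

# Minimal stand-in for a Rack env; only illustrative keys are filled in.
env = { "REQUEST_METHOD" => "GET", "PATH_INFO" => "/healthz" }

status, headers, body = HEALTH_ENDPOINT.call(env)
puts "#{status} #{headers['Content-Type']} #{body.join}"
```

Because the endpoint touches no dependencies, it answers "is the web process reachable", which is what an edge health check needs; it says nothing about whether the database or Solr is up.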
Status/Owners
- Add `public/akamai/sureroute-test-object.html` (ours) - owner: - ETA:
- Add `/healthz` route (ours) - owner: - ETA:
- Ask AHU to update Akamai health-check path to `/healthz` (or disable SureRoute) - owner: - ETA:
- Verify: curl `/healthz` and `/akamai/sureroute-test-object.html` via origin and through CDN; watch origin-health in edge UI - owner:
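The verification step above can be scripted so it is repeatable after each deploy. The sketch below requests both probe paths against every base URL given on the command line and reports anything that is not a 200. The method names and the injectable `fetch:` parameter are assumptions made for illustration; the origin URL has to be supplied by whoever runs it, since the issue does not name it.

```ruby
require "net/http"
require "uri"

# Both paths the edge may probe: the proposed endpoint and the SureRoute default.
PROBE_PATHS = ["/healthz", "/akamai/sureroute-test-object.html"].freeze

# Default fetcher: issues a real GET and returns the integer status code.
# Redirects are not followed, so a 301/302 is reported as a failure.
HTTP_STATUS = ->(url) { Net::HTTP.get_response(URI(url)).code.to_i }

# Returns { url => status } for every base URL combined with every probe path.
def probe_statuses(base_urls, fetch: HTTP_STATUS)
  base_urls.each_with_object({}) do |base, statuses|
    PROBE_PATHS.each do |path|
      url = base.chomp("/") + path
      statuses[url] = fetch.call(url)
    end
  end
end

# Keeps only the probes that did not return 200.
def failing_probes(statuses)
  statuses.reject { |_url, status| status == 200 }
end

# Usage: ruby verify_probes.rb https://<origin-host> https://libraryarchive.ahu.edu
if ARGV.any?
  failures = failing_probes(probe_statuses(ARGV))
  failures.each { |url, status| warn "FAIL #{status} #{url}" }
  exit(failures.empty? ? 0 : 1)
end
```

The fetcher is passed in as a lambda so the logic can be checked with a stub instead of live traffic. This does not replace watching origin health in the edge UI, which reflects what Akamai itself observes.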
Notes
- No evidence of application errors or DB/Solr failures. The fault was at the edge↔origin health-check layer.
- Scaling web helped because it refreshed endpoints and capacity, but we still need the health-check alignment to prevent recurrences.