[DevOps] ADVENTIST production issue: configure a health-probe path and mitigate Akamai SureRoute 404s (#1018)

Per Claude: https://notch8.slack.com/archives/C0877SLCMN3/p1758470487290609


**Incident:** ADL prod returning 520/503 (“Web server returning an unknown error”)
**When:** approximately 09:05–09:18 PT
**Impact:** Public traffic to ADL intermittently failed via the CDN. Internal pods stayed up.

**What we saw**

* Rails web pods were healthy (no app 5xx). Normal DB/Solr activity during the window.
* Traffic to the pods suddenly dropped (Ingress/CDN stopped sending requests).
* Logs show repeated requests from 23.x.x.x IPs to `GET /akamai/sureroute-test-object.html`, returning 404.

**Root cause (external)**

* One tenant hostname, **`libraryarchive.ahu.edu` (AHU)**, is fronted by **Akamai** with **SureRoute** enabled.
* SureRoute performs origin health checks to a default path: `/akamai/sureroute-test-object.html`.
* Our app doesn’t serve that path → 404 → Akamai marks origin **unhealthy** → upstream 520/503 to users.
* Adding a web pod restored service (the edge re-established healthy backends), but that only refreshed endpoints and added capacity; it did not fix the underlying config issue.

**Actions taken (today)**

* Scaled the web deployment up by one pod to stabilize traffic.

**Permanent fix (proposal)**

* **Option A (fast + we control):** Serve the expected probe as a static file so any Akamai checks succeed:

  * Add file: `public/akamai/sureroute-test-object.html` with contents like `OK`.
  * Ensure `RAILS_SERVE_STATIC_FILES=1` in prod so Rails serves it.
* **Option B (preferred long-term):** Standardize on a health endpoint and align edge checks:

  1. Expose a guaranteed-200 health endpoint from the app:

     ```ruby
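     # config/routes.rb (inside the Rails.application.routes.draw block)
     # Inline Rack endpoint: the proc itself does no DB or Solr work and always returns 200.
     get "/healthz", to: proc { [200, { "Content-Type" => "text/plain" }, ["ok"]] }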
     ```
  2. Ask **AHU** to update their **Akamai** property so health checks target `/healthz` (and/or disable SureRoute if not needed).
  3. If Cloudflare health checks are also enabled, point them to `/healthz` too (avoid double checks to routes that can 404).
* **Monitoring:** add alerts for Ingress 5xx spikes and CDN “origin down” events.
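
If shipping the static file or relying on `RAILS_SERVE_STATIC_FILES` turns out to be awkward, one alternative to Option A is a small Rack middleware that answers both probe paths before the request reaches the router. This is only a sketch under assumptions: the class name `ProbeResponder` and the `PROBE_PATHS` constant are hypothetical and do not exist in the app today, and it swaps the static-file approach for middleware rather than implementing Option A as written.

```ruby
# Hypothetical middleware: ProbeResponder and PROBE_PATHS are illustrative names.
# It short-circuits GET/HEAD requests for the probe paths with a plain 200.
class ProbeResponder
  PROBE_PATHS = [
    "/healthz",
    "/akamai/sureroute-test-object.html"
  ].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    method = env["REQUEST_METHOD"]
    if PROBE_PATHS.include?(env["PATH_INFO"]) && %w[GET HEAD].include?(method)
      # HEAD responses carry no body; GET returns a tiny plain-text payload.
      body = method == "HEAD" ? [] : ["ok"]
      return [200, { "content-type" => "text/plain" }, body]
    end

    # Anything else falls through to the rest of the stack unchanged.
    @app.call(env)
  end
end

# Assumed wiring in a Rails app (e.g. config/application.rb):
#   config.middleware.insert_before 0, ProbeResponder
```

Placing it at the front of the stack would keep probe responses independent of routing and static-file settings, at the cost of one more piece of custom code to maintain.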

**Status/Owners**

* [ ] Add `public/akamai/sureroute-test-object.html` (**ours**) — *owner:* <name> — *ETA:* <time>
* [ ] Add `/healthz` route (**ours**) — *owner:* <name> — *ETA:* <time>
* [ ] Ask AHU to update Akamai health-check path to `/healthz` (or disable SureRoute) — *owner:* <name> — *ETA:* <time>
* [ ] Verify: curl `/healthz` and `/akamai/sureroute-test-object.html` via origin and through CDN; watch origin-health in edge UI — *owner:* <name>
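
The verification step could also be scripted instead of curling by hand. The sketch below assumes nothing beyond Ruby's standard library; the helper names `probe_statuses` and `all_healthy?` are hypothetical, and the origin URL is a placeholder we would need to fill in ourselves.

```ruby
require "net/http"
require "uri"

# Probe paths taken from the checklist above.
PROBE_PATHS = ["/healthz", "/akamai/sureroute-test-object.html"].freeze

# Returns { path => integer HTTP status } for each probe path under base_url.
# `http` is injectable so the helper can be exercised without network access.
# Net::HTTP.get_response does not follow redirects, so a 301/302 is reported as-is.
def probe_statuses(base_url, http: Net::HTTP)
  PROBE_PATHS.to_h do |path|
    response = http.get_response(URI.join(base_url, path))
    [path, response.code.to_i]
  end
end

# True only when every probed path answered 200.
def all_healthy?(statuses)
  statuses.values.all? { |code| code == 200 }
end

# Example usage (ORIGIN_URL is an assumed environment variable, not an existing one):
#   probe_statuses("https://libraryarchive.ahu.edu")  # through the CDN
#   probe_statuses(ENV.fetch("ORIGIN_URL"))           # straight to the origin/ingress
```

Running it once against the origin and once through the CDN hostname would show whether a failure sits in the app or at the edge.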

**Notes**

* No evidence of application errors or DB/Solr failures. The fault was at the edge↔origin health-check layer.
* Scaling the web deployment helped because it refreshed endpoints and added capacity, but we still need to align the health checks to prevent a recurrence.



