Skip to content

[DevOps] ADVENTIST production issue: configure healthy probe path and mitigate Akamai SureRoute 404s #1018

@ShanaLMoore

Description

@ShanaLMoore

Per Claude: https://notch8.slack.com/archives/C0877SLCMN3/p1758470487290609

Incident: ADL prod returning 520/503 (“Web server returning an unknown error”)
When: ~09:05–09:18 PT (approx)
Impact: Public traffic to ADL intermittently failed via the CDN. Internal pods stayed up.

What we saw

  • Rails web pods were healthy (no app 5xx). Normal DB/Solr activity during the window.
  • Traffic to the pods suddenly dropped (Ingress/CDN stopped sending requests).
  • Logs show repeated requests from 23.x.x.x IPs to GET /akamai/sureroute-test-object.html, returning 404.

Root cause (external)

  • One tenant hostname, libraryarchive.ahu.edu (AHU), is fronted by Akamai with SureRoute enabled.
  • SureRoute performs origin health checks to a default path: /akamai/sureroute-test-object.html.
  • Our app doesn’t serve that path → 404 → Akamai marks origin unhealthy → upstream 520/503 to users.
  • Adding a web pod restored service (edge re-established healthy backends), but that was capacity/refresh—not the underlying config issue.

Actions taken (today)

  • Scaled the web deployment up by +1 pod to stabilize traffic.

Permanent fix (proposal)

  • Option A (fast + we control): Serve the expected probe as a static file so any Akamai checks succeed:

    • Add file: public/akamai/sureroute-test-object.html with contents like OK.
    • Ensure RAILS_SERVE_STATIC_FILES=1 in prod so Rails serves it.
  • Option B (preferred long-term): Standardize on a health endpoint and align edge checks:

    1. Expose a guaranteed-200 health endpoint from the app:

      # config/routes.rb
      get "/healthz", to: proc { [200, {"Content-Type" => "text/plain"}, ["ok"]] }
    2. Ask AHU to update their Akamai property so health checks target /healthz (and/or disable SureRoute if not needed).

    3. If Cloudflare health checks are also enabled, point them to /healthz too (avoid double checks to routes that can 404).

  • Monitoring: add alerts for Ingress 5xx spikes and CDN “origin down” events.

Status/Owners

  • Add public/akamai/sureroute-test-object.html (ours) — owner:ETA:
  • Add /healthz route (ours) — owner:ETA:
  • Ask AHU to update Akamai health-check path to /healthz (or disable SureRoute) — owner:ETA:
  • Verify: curl /healthz and /akamai/sureroute-test-object.html via origin and through CDN; watch origin-health in edge UI — owner:

Notes

  • No evidence of application errors or DB/Solr failures. The fault was at the edge↔origin health-check layer.
  • Scaling web helped because it refreshed endpoints and capacity, but we still need the health-check alignment to prevent recurrences.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions