Conversation
Adding a logging & alert system when data latency is greater than 16 hours (12 hours expected).
action.yaml, cribbed from alerting .yaml
Quickstart on job
query the API and check if the most recent data is more than a certain number of hours stale compared to now. If it is, send an alert with a statement about the time difference.
12 hour latency expectation
getting rid of MAAP specific stuff
piping results of latency_logging.py to $GITHUB_OUTPUT for display in workflow
trying to capture output from stdout
adding action id and output
Adding alert parameter
add alert boolean
first terrible draft of workflow
Runs hourly and will alert if the data latency is longer than the # of hours as defined in latency_logging.py
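The hourly trigger presumably looks something like the following workflow fragment (a sketch under assumed job and step names, not the actual file):

```yaml
on:
  schedule:
    - cron: "0 * * * *"   # top of every hour, UTC

jobs:
  latency-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - id: latency
        run: python latency_logging.py
```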
This is partially in response to #244 -- but notably this will only "log" by issuing lots of alerts when we have data older than 12 hours.
@zebbecker could you glance at this and tell me if you think it will break our existing alerts system?
Two ways to work on debugging this:
On a more general design note- I actually would really like to be able to access a log containing regular data about our latency, either in addition to or instead of alerts only when latency is above the threshold we set as "concerning." Probably in addition to, as I think it would be good to have the alert as you've already written it as an additional health check. Anyways, with that data, we could quantify both our average + worst case latency (which would be very helpful for communicating both with users and with overhead) as well as our uptime/downtime- both of which are key stats as we move into a phase of the project where reliability starts to matter much more than it does for experimental science code. @mccabete would you be interested in adding the logging feature to this workflow? I recognize that I'm asking for additional functionality here, and can also add it to my dev backlog if you prefer not to get this dumped on you!
My preference would be to get what we have up first and then do a new PR for the added logging functionality, just to get something live. But I feel the need: I want regular logs too. Any thoughts on the "where should the logs live" question?
That question is definitely what makes this a little trickier given our constraints. I'll do a little digging- I kinda suspect that creating a lightweight DPS job, so that we have easy write access to a logfile on MAAP, might be the move. I'm on board with getting this alert working and then figuring out the logging next.
testing swapping this to branch name.
combining id and shell to avert "Unexpected type '' encountered while reading 'steps item uses'" error
It's been really hard to stay on top of our data latency issues. This is in part because the data users see can be delayed for a variety of reasons, but we only have alerts for run failures. This is a tack-on to the existing alerting system that will let us know if the data users are seeing is going stale faster than we want it to.