
Latency logging #244

Open
mccabete wants to merge 24 commits into conus-dps from latency_logging

Conversation

@mccabete
Contributor

@mccabete mccabete commented Mar 9, 2026

It's been really hard to stay on top of our data latency issues. This is in part because the data users see can be delayed for a variety of reasons, but we only have alerts for run failures. This is a tack-on to the existing alerting system that will notify us when the data users are seeing goes stale faster than we want it to.

mccabete added 15 commits March 6, 2026 14:35
Adding a logging & alert system when data latency is greater than 16 hours (12 hours expected).
action.yaml, cribbed from alerting .yaml
Quickstart on job
query the API and check whether the most recent data is more than a certain number of hours stale compared to now. If it is, send an alert stating the time difference.
12 hour latency expectation
getting rid of MAAP-specific stuff
piping results of latency_logging.py to $GITHUB_OUTPUT for display in workflow
trying to capture output from stdout
adding action id and output
Adding alert parameter
add alert boolean
first terrible draft of workflow
Runs hourly and will alert if the data latency is longer than the # of hours as defined in latency_logging.py
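The staleness check the commits above describe can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `latency_logging.py`: `check_latency` and `write_github_output` are hypothetical names, and only the 12-hour expected / 16-hour alert thresholds come from the PR itself.

```python
# Sketch of the latency check: compare the newest data timestamp against
# now, and expose the result to later workflow steps via $GITHUB_OUTPUT.
import os
from datetime import datetime, timezone

EXPECTED_LATENCY_HOURS = 12  # normal pipeline delay (from PR description)
ALERT_THRESHOLD_HOURS = 16   # alert when data is staler than this

def hours_stale(latest: datetime, now: datetime) -> float:
    """Hours elapsed since the most recent data point."""
    return (now - latest).total_seconds() / 3600

def check_latency(latest: datetime, now: datetime) -> tuple[bool, str]:
    """Return (should_alert, message) for the given timestamp."""
    stale = hours_stale(latest, now)
    if stale > ALERT_THRESHOLD_HOURS:
        return True, f"Data is {stale:.1f} h old (expected <= {EXPECTED_LATENCY_HOURS} h)"
    return False, f"Data latency OK ({stale:.1f} h)"

def write_github_output(alert: bool, message: str) -> None:
    """Append step outputs to the file GitHub Actions provides, if any."""
    out = os.environ.get("GITHUB_OUTPUT")
    if out:
        with open(out, "a") as fh:
            fh.write(f"alert={str(alert).lower()}\n")
            fh.write(f"message={message}\n")
```

Writing `key=value` lines to the file named by `$GITHUB_OUTPUT` is the supported way to pass step outputs in GitHub Actions, which matches the "piping results ... to $GITHUB_OUTPUT" commit above.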
@mccabete
Contributor Author

mccabete commented Mar 9, 2026

This is partially in response to #244 -- but notably this will only "log" by issuing lots of alerts when we have data older than 12 hours.

@mccabete
Contributor Author

mccabete commented Mar 9, 2026

@zebbecker could you glance at this and tell me if you think it will break our existing alerts system?

@zebbecker
Collaborator

Two ways to work on debugging this:

  • First, debug the Python script locally. Make sure to set up the development environment the same way we do in production so that it matches (docs here), and then a plain `python latency_logging.py` should do the trick.
  • The trickier part is testing the Action. In my last commit a few minutes ago, I added a temporary trigger to the workflow file that runs this action whenever a PR is opened or changed. This is kinda hacky, but the alternative is probably using the act tool to test locally, which seems like more trouble than it's worth at this juncture. For now, I'd suggest we just use this trigger and take it out once everything is running smoothly.
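The temporary trigger described above might look something like this in the workflow file. This is a hedged sketch of the shape, not the actual commit: the cron schedule is assumed from the "runs hourly" commit message, and the `pull_request` block is the temporary testing hook to remove later.

```yaml
# Temporary: also run on PR open/update so the Action can be tested.
# Remove the pull_request block once everything is running smoothly.
on:
  schedule:
    - cron: "0 * * * *"   # hourly latency check
  pull_request:            # TEMPORARY, for testing only
    types: [opened, synchronize]
```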

@zebbecker
Collaborator

On a more general design note: I'd actually really like to be able to access a log of regular data about our latency, either in addition to or instead of alerts that fire only when latency is above the threshold we set as "concerning." Probably in addition to, as I think the alert as you've already written it is a good extra health check.

Anyway, with that data we could quantify both our average and worst-case latency (which would be very helpful for communicating with users and with overhead) as well as our uptime/downtime, both of which are key stats as we move into a phase of the project where reliability starts to matter much more than it does for experimental science code.

@mccabete would you be interested in adding the logging feature to this workflow? I recognize that I'm asking for additional functionality here, and can also add it to my dev backlog if you prefer not to get this dumped on you!

@mccabete
Contributor Author

My preference would be to get what we have live first, just to get something shipped, and then do a new PR for the added logging functionality. But I share the feeling: I want regular logs too.

Any thoughts on the "where should the logs live" question?

@zebbecker
Collaborator

That question is definitely what makes this a little trickier given our constraints. I'll do a little digging. I kinda suspect that creating a lightweight DPS job for this, so that we have easy write access to a logfile on MAAP, might be the move.

I'm on board with getting this alert working and then figuring out the logging next.

testing swapping this to branch name.
combining id and shell to avert the "Unexpected type '' encountered while reading 'steps item uses'" error
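For context on that last commit, a workflow step that captures the script's output typically needs an explicit `id` (so later steps can reference its outputs) and a `run`/`shell` pair rather than `uses`. The step below is an illustrative sketch, not the actual workflow contents; the step id and filenames are assumptions.

```yaml
# Hypothetical step: run the latency check and expose its outputs
# via the step id, using run/shell instead of uses.
- id: latency_check
  shell: bash
  run: python latency_logging.py >> "$GITHUB_OUTPUT"
```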