Conversation
Adding a logging & alert system when data latency is greater than 16 hours (12 hours expected).
action.yaml, cribbed from alerting .yaml
Quickstart on job
query the API and check if the most recent data is more than a certain number of hours stale compared to now. If it is, send an alert with a statement about the time difference.
12 hour latency expectation
getting rid of MAAP specific stuff
piping results of latency_logging.py to $GITHUB_OUTPUT for display in workflow
trying to capture output from stdout
adding action id and output
Adding alert parameter
add alert boolean
first terrible draft of workflow
Runs hourly and will alert if the data latency is longer than the # of hours as defined in latency_logging.py
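The hourly trigger presumably looks something like the following workflow fragment (a sketch under assumed job and step names, not the actual file):

```yaml
on:
  schedule:
    - cron: "0 * * * *"   # top of every hour, UTC

jobs:
  latency-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - id: latency
        run: python latency_logging.py
```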
This is partially in response to #244 -- but notably this will only "log" by issuing lots of alerts when we have data older than 12 hours.
@zebbecker could you glance at this and tell me if you think it will break our existing alerts system?
Two ways to work on debugging this:
On a more general design note- I actually would really like to be able to access a log containing regular data about our latency, either in addition to or instead of alerts only when latency is above the threshold we set as "concerning." Probably in addition to, as I think it would be good to have the alert as you've already written it as an additional health check. Anyways, with that data, we could quantify both our average + worst case latency (which would be very helpful for communicating both with users and with overhead) as well as our uptime/downtime- both of which are key stats as we move into a phase of the project where reliability starts to matter much more than it does for experimental science code. @mccabete would you be interested in adding the logging feature to this workflow? I recognize that I'm asking for additional functionality here, and can also add it to my dev backlog if you prefer not to get this dumped on you!
My preference would be to get what we have up first and then do a new PR for the added logging functionality, just to get something live. But I feel the need: I want regular logs too. Any thoughts on the "where should the logs live" question?
That question is definitely what makes this a little trickier given our constraints. I'll do a little digging- I kinda suspect that creating a lightweight DPS job, so that we have easy write access to a logfile on MAAP, might be the move. I'm on board with getting this alert working and then figuring out the logging next.
testing swapping this to branch name.
combining id and shell to avert "Unexpected type '' encountered while reading 'steps item uses'" error
It's been really hard to stay on top of our data latency issues. This is in part because the data users see can be delayed for a variety of reasons, but we only have alerts for run failures. This is a tack-on to the existing alerting system that will let us know if the data users are seeing is going stale faster than we want it to.