feat(metering): cron to monitor billing-events S3 DLQ bucket#6668
feat(metering): cron to monitor billing-events S3 DLQ bucket#6668pfreixes wants to merge 3 commits into
Conversation
Hourly cron in metering that lists the DLQ bucket and emits a Datadog gauge `nango.billing.events.s3.dlq.files`. Steady state is 0; the alert fires on >0. Past 100 files we stop paginating — we're already broken. Disabled unless both BILLING_EVENTS_S3_DLQ_BUCKET and the cron-minute env are set, so no behaviour change in any environment until the helm/infra side rolls it out. NAN-5734. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Superagent didn't find any vulnerabilities or security issues in this PR. |
The gauge for file count goes silent if the cron itself fails (bad bucket name, missing IAM, region mismatch). Add a counter `nango.billing.events.s3.dlq.monitor.run.result` tagged success:true|false emitted exactly once per firing so a monitor on success:false catches "the monitor itself is broken." Mirrors BILLING_USAGE_CLICKHOUSE_S3_EXPORT_RUN_RESULT. NAN-5734. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
… span Without re-throwing, the swallowed error keeps the tracer.trace span looking healthy in Datadog APM even when the run failed. The `finally` still emits the success:false run-result counter before the rejection propagates; the cron scheduler's top-level .catch handles the throw. NAN-5734. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
||
| const LOCK_KEY = 'lock:cron:billingEventsS3DLQMonitor'; | ||
| // Cron fires hourly; lock should expire well before the next tick. | ||
| const lockTtlMs = 30 * 60 * 1000; |
There was a problem hiding this comment.
Why is it hourly? If this is detecting things being broken, wouldn't it be worth knowing faster? I might be wrong, but it doesn't look like it's an expensive operation.
There was a problem hiding this comment.
We upload files once a day, so knowing it - worst case - after one hour is ok. From there we have "plenty" of time to fix the issue and reexport the data. Until we do not fix the issue we would be just alerted that there are files in the DLQ that require our attention, once fixed the issue we must remove manually the files from the DLQ
Im planning to provide a full demo of all of this
Summary
nango.billing.events.s3.dlq.files. Steady state is 0; the alert fires on >0.nango.billing.events.s3.dlq.monitor.run.result(counter, taggedsuccess:true|false) so a separate monitor catches the case where the cron itself is broken (bad bucket name, missing IAM, region mismatch).CRON_BILLING_EVENTS_S3_DLQ_MONITOR_MINUTEandBILLING_EVENTS_S3_DLQ_BUCKETare set.Operability
When the alert fires: investigate the failed events, fix the underlying ingest issue, then delete the file(s) from the DLQ bucket. The monitor can be muted while the investigation is in flight.
Rollout / merge order
Companion PRs land first; this PR ships last after a manual smoke test in dev:
s3:ListBucketon the DLQ bucket to the metering pod IAM role. Deploy.nango-billing-events-dlq-development, wait for the next cron firing, confirm the Datadog gauge rises above 0 and the monitor fires. Then delete the file.The order works because the cron in this PR is dead code until those envs are set — landing #168 and #140 first doesn't change behaviour in any environment beyond exposing the bucket to be listed. Once the manual test confirms the wire-up works end-to-end, this PR is purely an additive code change.
NAN-5734.
Test plan
npm run ts-buildcleannpx oxlintcleansuccess:truetick every hour