feat(metering): cron to monitor billing-events S3 DLQ bucket by pfreixes · Pull Request #6668 · NangoHQ/nango

pfreixes · 2026-06-26T07:57:17Z

Summary

Adds an hourly metering cron that lists the billing-events S3 DLQ bucket and emits a Datadog gauge nango.billing.events.s3.dlq.files. Steady state is 0; the alert fires on >0.
Also emits nango.billing.events.s3.dlq.monitor.run.result (counter, tagged success:true|false) so a separate monitor catches the case where the cron itself is broken (bad bucket name, missing IAM, region mismatch).
Disabled by default: the cron is a no-op unless both CRON_BILLING_EVENTS_S3_DLQ_MONITOR_MINUTE and BILLING_EVENTS_S3_DLQ_BUCKET are set.
Paginated count caps at 100 — we're already broken past that threshold, no value in burning more list calls.

Operability

When the alert fires: investigate the failed events, fix the underlying ingest issue, then delete the file(s) from the DLQ bucket. The monitor can be muted while the investigation is in flight.

Rollout / merge order

Companion PRs land first; this PR ships last after a manual smoke test in dev:

NangoHQ/nango-infra#168 — grants s3:ListBucket on the DLQ bucket to the metering pod IAM role. Deploy.
NangoHQ/nango-environments#140 — turns on the cron (sets the two envs) in dev + prod at minute 50. Deploy.
Manual smoke test in dev: drop a dummy file into nango-billing-events-dlq-development, wait for the next cron firing, confirm the Datadog gauge rises above 0 and the monitor fires. Then delete the file.
This PR: merge after approval + the dev smoke test passes.

The order works because the cron in this PR is dead code until those envs are set — landing #168 and #140 first doesn't change behaviour in any environment beyond exposing the bucket to be listed. Once the manual test confirms the wire-up works end-to-end, this PR is purely an additive code change.

NAN-5734.

Test plan

npm run ts-build clean
npx oxlint clean
Manual smoke in dev: drop a file into the DLQ bucket, observe the gauge rise + monitor fire, clean up
After this PR deploys to prod: confirm steady-state 0 on the gauge and a success:true tick every hour

Hourly cron in metering that lists the DLQ bucket and emits a Datadog gauge `nango.billing.events.s3.dlq.files`. Steady state is 0; the alert fires on >0. Past 100 files we stop paginating — we're already broken. Disabled unless both BILLING_EVENTS_S3_DLQ_BUCKET and the cron-minute env are set, so no behaviour change in any environment until the helm/infra side rolls it out. NAN-5734. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

linear-code · 2026-06-26T07:57:20Z

NAN-5734

superagent-security · 2026-06-26T07:58:17Z

Superagent didn't find any vulnerabilities or security issues in this PR.

The gauge for file count goes silent if the cron itself fails (bad bucket name, missing IAM, region mismatch). Add a counter `nango.billing.events.s3.dlq.monitor.run.result` tagged success:true|false emitted exactly once per firing so a monitor on success:false catches "the monitor itself is broken." Mirrors BILLING_USAGE_CLICKHOUSE_S3_EXPORT_RUN_RESULT. NAN-5734. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

… span Without re-throwing, the swallowed error keeps the tracer.trace span looking healthy in Datadog APM even when the run failed. The `finally` still emits the success:false run-result counter before the rejection propagates; the cron scheduler's top-level .catch handles the throw. NAN-5734. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

marcindobry · 2026-06-26T08:39:12Z

+
+const LOCK_KEY = 'lock:cron:billingEventsS3DLQMonitor';
+// Cron fires hourly; lock should expire well before the next tick.
+const lockTtlMs = 30 * 60 * 1000;


Why is it hourly? If this is detecting things being broken, wouldn't it be worth knowing faster? I might be wrong, but it doesn't look like it's an expensive operation.

We upload files once a day, so knowing it - worst case - after one hour is ok. From there we have "plenty" of time to fix the issue and reexport the data. Until we do not fix the issue we would be just alerted that there are files in the DLQ that require our attention, once fixed the issue we must remove manually the files from the DLQ

Im planning to provide a full demo of all of this

cubic-dev-ai Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread packages/metering/lib/crons/billingEventsS3DLQMonitor.ts

pfreixes requested review from a team and marcindobry June 26, 2026 08:21

marcindobry approved these changes Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(metering): cron to monitor billing-events S3 DLQ bucket#6668

feat(metering): cron to monitor billing-events S3 DLQ bucket#6668
pfreixes wants to merge 3 commits into
masterfrom
pfreixes/nan-5734-listen-dlq-events

pfreixes commented Jun 26, 2026 •

edited

Loading

Uh oh!

linear-code Bot commented Jun 26, 2026

Uh oh!

superagent-security Bot commented Jun 26, 2026

Uh oh!

cubic-dev-ai Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

marcindobry Jun 26, 2026

Uh oh!

pfreixes Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

pfreixes commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Operability

Rollout / merge order

Test plan

Uh oh!

linear-code Bot commented Jun 26, 2026

Uh oh!

superagent-security Bot commented Jun 26, 2026

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

marcindobry Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

pfreixes Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pfreixes commented Jun 26, 2026 •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading