Skip to content

feat(metering): cron to monitor billing-events S3 DLQ bucket#6668

Open
pfreixes wants to merge 3 commits into
masterfrom
pfreixes/nan-5734-listen-dlq-events
Open

feat(metering): cron to monitor billing-events S3 DLQ bucket#6668
pfreixes wants to merge 3 commits into
masterfrom
pfreixes/nan-5734-listen-dlq-events

Conversation

@pfreixes

@pfreixes pfreixes commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds an hourly metering cron that lists the billing-events S3 DLQ bucket and emits a Datadog gauge nango.billing.events.s3.dlq.files. Steady state is 0; the alert fires on >0.
  • Also emits nango.billing.events.s3.dlq.monitor.run.result (counter, tagged success:true|false) so a separate monitor catches the case where the cron itself is broken (bad bucket name, missing IAM, region mismatch).
  • Disabled by default: the cron is a no-op unless both CRON_BILLING_EVENTS_S3_DLQ_MONITOR_MINUTE and BILLING_EVENTS_S3_DLQ_BUCKET are set.
  • Paginated count caps at 100 — we're already broken past that threshold, no value in burning more list calls.

Operability

When the alert fires: investigate the failed events, fix the underlying ingest issue, then delete the file(s) from the DLQ bucket. The monitor can be muted while the investigation is in flight.

Rollout / merge order

Companion PRs land first; this PR ships last after a manual smoke test in dev:

  1. NangoHQ/nango-infra#168 — grants s3:ListBucket on the DLQ bucket to the metering pod IAM role. Deploy.
  2. NangoHQ/nango-environments#140 — turns on the cron (sets the two envs) in dev + prod at minute 50. Deploy.
  3. Manual smoke test in dev: drop a dummy file into nango-billing-events-dlq-development, wait for the next cron firing, confirm the Datadog gauge rises above 0 and the monitor fires. Then delete the file.
  4. This PR: merge after approval + the dev smoke test passes.

The order works because the cron in this PR is dead code until those envs are set — landing #168 and #140 first doesn't change behaviour in any environment beyond exposing the bucket to be listed. Once the manual test confirms the wire-up works end-to-end, this PR is purely an additive code change.

NAN-5734.

Test plan

  • npm run ts-build clean
  • npx oxlint clean
  • Manual smoke in dev: drop a file into the DLQ bucket, observe the gauge rise + monitor fire, clean up
  • After this PR deploys to prod: confirm steady-state 0 on the gauge and a success:true tick every hour

Hourly cron in metering that lists the DLQ bucket and emits a Datadog
gauge `nango.billing.events.s3.dlq.files`. Steady state is 0; the alert
fires on >0. Past 100 files we stop paginating — we're already broken.

Disabled unless both BILLING_EVENTS_S3_DLQ_BUCKET and the cron-minute
env are set, so no behaviour change in any environment until the
helm/infra side rolls it out.

NAN-5734.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@linear-code

linear-code Bot commented Jun 26, 2026

Copy link
Copy Markdown

NAN-5734

@superagent-security

Copy link
Copy Markdown

Superagent didn't find any vulnerabilities or security issues in this PR.

The gauge for file count goes silent if the cron itself fails (bad
bucket name, missing IAM, region mismatch). Add a counter
`nango.billing.events.s3.dlq.monitor.run.result` tagged success:true|false
emitted exactly once per firing so a monitor on success:false catches
"the monitor itself is broken."

Mirrors BILLING_USAGE_CLICKHOUSE_S3_EXPORT_RUN_RESULT.

NAN-5734.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread packages/metering/lib/crons/billingEventsS3DLQMonitor.ts
@pfreixes pfreixes requested review from a team and marcindobry June 26, 2026 08:21
… span

Without re-throwing, the swallowed error keeps the tracer.trace span
looking healthy in Datadog APM even when the run failed. The `finally`
still emits the success:false run-result counter before the rejection
propagates; the cron scheduler's top-level .catch handles the throw.

NAN-5734.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

const LOCK_KEY = 'lock:cron:billingEventsS3DLQMonitor';
// Cron fires hourly; lock should expire well before the next tick.
const lockTtlMs = 30 * 60 * 1000;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it hourly? If this is detecting things being broken, wouldn't it be worth knowing faster? I might be wrong, but it doesn't look like it's an expensive operation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We upload files once a day, so knowing it - worst case - after one hour is ok. From there we have "plenty" of time to fix the issue and reexport the data. Until we do not fix the issue we would be just alerted that there are files in the DLQ that require our attention, once fixed the issue we must remove manually the files from the DLQ

Im planning to provide a full demo of all of this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants