
feat(server): meter GET /records egress bytes#6648

Merged
ErickRDev merged 1 commit into master from erickr/NAN-6040/ingest-get-records-dto-into-clickhouse
Jun 25, 2026

@ErickRDev ErickRDev commented Jun 24, 2026

Contributor

Add a telemetry recorder to server powered by a Batcher singleton and meter bytes egressing from GET /records. The recorder publishes the telemetry to pubsub, which then gets funneled into the current pipeline that posts to DD as well as ClickHouse.

Also start tagging the EGRESS_BYTES DD metric with the route metered by the middleware. This will allow us to compare the DTO we're tracking on the server with the close-to-the-wire bytes metered by the middleware.
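The recorder-over-batcher arrangement described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `SketchBatcher`, `EgressEntry`, and the `AddResult` shape are hypothetical names, the real `Batcher` utility and `ServerEgressTelemetry` type are not shown in this conversation, and the interval-based flush and the pubsub publish are replaced by a synchronous in-memory sink.

```typescript
// Hypothetical entry shape; the PR's ServerEgressTelemetry type is not shown here.
interface EgressEntry {
    route: string;
    bytes: number;
}

type AddResult = { ok: true } | { ok: false; reason: 'queue_full' };

// Illustrative stand-in for a shared batcher utility (not the real Batcher class).
class SketchBatcher<T> {
    private queue: T[] = [];
    private readonly batchSize: number;
    private readonly maxQueueSize: number;
    private readonly sink: (batch: T[]) => void;

    constructor(opts: { batchSize: number; maxQueueSize: number }, sink: (batch: T[]) => void) {
        this.batchSize = opts.batchSize;
        this.maxQueueSize = opts.maxQueueSize;
        this.sink = sink;
    }

    add(item: T): AddResult {
        // With an asynchronous sink the queue can grow while a flush is in flight,
        // which is when this bound matters. The synchronous sink here never hits it.
        if (this.queue.length >= this.maxQueueSize) {
            return { ok: false, reason: 'queue_full' };
        }
        this.queue.push(item);
        if (this.queue.length >= this.batchSize) {
            this.flush();
        }
        return { ok: true };
    }

    flush(): void {
        if (this.queue.length === 0) return;
        const batch = this.queue;
        this.queue = [];
        this.sink(batch);
    }
}

// In-memory sink standing in for the pubsub publish step.
const published: EgressEntry[][] = [];
const batcher = new SketchBatcher<EgressEntry>({ batchSize: 2, maxQueueSize: 4 }, (batch) => {
    published.push(batch);
});

const recorder = {
    record(entry: EgressEntry): void {
        const res = batcher.add(entry);
        if (!res.ok) {
            console.warn(`egress telemetry dropped: ${res.reason}`);
        }
    }
};

recorder.record({ route: 'GET /records', bytes: 1024 });
recorder.record({ route: 'GET /records', bytes: 2048 });
```

The second `record` call reaches the batch size, so both entries are handed to the sink as one batch.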


@ErickRDev ErickRDev self-assigned this Jun 24, 2026
@linear-code

linear-code Bot commented Jun 24, 2026


NAN-6040

@ErickRDev ErickRDev force-pushed the erickr/NAN-6040/ingest-get-records-dto-into-clickhouse branch from fda38a5 to a865d81 Compare June 24, 2026 18:48

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found and verified against the latest diff

Confidence score: 3/5

  • In packages/server/lib/utils/egressTelemetry.ts, publishBatch treats partial failures as success, so res.value.failed items can be dropped silently and telemetry completeness/accuracy will degrade after merge; treat non-empty failed lists as retryable (or explicitly requeue/retry those items) before merging.
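One way to treat a non-empty failed list as retryable is sketched below. The `PublishOutcome` shape, `publishWithRetry`, and the fake transport are hypothetical and only mirror the `failed` list mentioned in this comment. The actual `publishBatch` is not shown here and is presumably asynchronous, whereas this sketch is synchronous to keep it short.

```typescript
// Hypothetical result shape: the items the transport could not accept.
interface PublishOutcome<T> {
    failed: T[];
}

type Publish<T> = (items: T[]) => PublishOutcome<T>;

// Re-publishes only the failed items, for at most maxAttempts attempts in total.
// Returns whatever is still failing so the caller can requeue or count it.
function publishWithRetry<T>(items: T[], publish: Publish<T>, maxAttempts: number): T[] {
    let pending = items;
    for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
        pending = publish(pending).failed;
    }
    return pending;
}

// Fake transport that rejects odd numbers on its first call only.
let publishCalls = 0;
const flakyPublish: Publish<number> = (items) => {
    publishCalls += 1;
    return { failed: publishCalls === 1 ? items.filter((n) => n % 2 === 1) : [] };
};

const leftover = publishWithRetry([1, 2, 3, 4], flakyPublish, 3);
```

Returning the leftover items, rather than swallowing them, leaves the decision to requeue or to count them as dropped with the caller.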


Comment thread packages/server/lib/utils/egressTelemetry.ts
@ErickRDev ErickRDev marked this pull request as ready for review June 24, 2026 21:33
@superagent-security


Superagent didn't find any vulnerabilities or security issues in this PR.

@ErickRDev ErickRDev force-pushed the erickr/NAN-6040/ingest-get-records-dto-into-clickhouse branch from a865d81 to 7d4cb96 Compare June 24, 2026 21:48
@ErickRDev ErickRDev requested a review from a team June 24, 2026 23:00
Comment on lines +36 to +38
SERVER_EGRESS_TELEMETRY_BATCH_SIZE: z.coerce.number().int().positive().default(1_000),
SERVER_EGRESS_TELEMETRY_FLUSH_INTERVAL_MS: z.coerce.number().int().nonnegative().default(60_000),
SERVER_EGRESS_TELEMETRY_MAX_QUEUE_SIZE: z.coerce.number().int().positive().default(100_000),

Contributor

Should they be prefixed with NANGO?

Contributor Author

Hmm, if they should, this boat has kind of sailed already as I introduced these two in a previous PR:

    RUNNER_TELEMETRY_BATCH_SIZE: z.coerce.number().int().positive().max(1000).default(500),
    RUNNER_TELEMETRY_FLUSH_INTERVAL_MS: z.coerce.number().int().nonnegative().default(10_000),

But I also see a bunch of other env vars without the NANGO_ prefix, so I'm not sure. Are we shooting to have that prefix as a standard?

@pfreixes pfreixes left a comment

Contributor

Left one comment!

    export const egressTelemetryRecorder = {
        record(entry: ServerEgressTelemetry): void {
            const res = batcher.add(entry);
            if (res.isErr()) {

Contributor

You tell me, but do we need a cohesive metric here to know when the batcher is struggling and when we are dropping metrics? For ClickHouse we implemented this metric to know when events are dropped, whether because the queue was full, we reached the max number of retries, etc.

Contributor Author

I'm currently relying on log-based filters and was planning on extracting metrics out of those logs with this. So the tl;dr is that I'm tracking it, just not with a regular metric maintained at the app level.

Alternatively, since the Batcher class is a shared utility, we could introduce a unified metric that adds a dimension based on where the Batcher is used (or something similar), so we'd be able to track specific instances of it. This feels like more than I'd like to do in this PR, though, and I'm trying to move fast with this PR to unlock the next analysis phase.

Contributor

I'm OK with doing this in a new PR!
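For the follow-up PR, the unified metric floated in this thread (one drop counter with a dimension for where the Batcher is used) might look roughly like the sketch below. Everything here is an assumption for illustration: `DropCounter`, the `DropReason` values, and the batcher names are hypothetical, and the in-memory map stands in for a real metrics client.

```typescript
// Hypothetical drop reasons, taken from the cases mentioned in this thread.
type DropReason = 'queue_full' | 'max_retries';

// In-memory stand-in for a metrics client; a real one would emit a tagged counter.
class DropCounter {
    private counts = new Map<string, number>();

    // batcherName is the dimension that identifies which Batcher instance dropped items.
    increment(batcherName: string, reason: DropReason, by = 1): void {
        const key = `${batcherName}:${reason}`;
        this.counts.set(key, (this.counts.get(key) ?? 0) + by);
    }

    get(batcherName: string, reason: DropReason): number {
        return this.counts.get(`${batcherName}:${reason}`) ?? 0;
    }
}

const drops = new DropCounter();
drops.increment('server_egress', 'queue_full');
drops.increment('server_egress', 'queue_full', 3);
drops.increment('runner_telemetry', 'max_retries');
```

Keying on both the batcher name and the reason keeps a single metric while still letting each instance and each failure mode be tracked separately.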

@ErickRDev ErickRDev added this pull request to the merge queue Jun 25, 2026
Merged via the queue into master with commit bfa17bd Jun 25, 2026
40 checks passed
@ErickRDev ErickRDev deleted the erickr/NAN-6040/ingest-get-records-dto-into-clickhouse branch June 25, 2026 18:19
3 participants