feat(server): meter GET /records egress bytes#6648
Conversation
fda38a5 to
a865d81
Compare
There was a problem hiding this comment.
1 issue found and verified against the latest diff
Confidence score: 3/5
- In
packages/server/lib/utils/egressTelemetry.ts,publishBatchtreats partial failures as success, sores.value.faileditems can be dropped silently and telemetry completeness/accuracy will degrade after merge; treat non-empty failed lists as retryable (or explicitly requeue/retry those items) before merging.
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
|
Superagent didn't find any vulnerabilities or security issues in this PR. |
Add a telemetry recorder to `server` powered by a Batcher singleton and meter bytes egressing from `GET /records`. The recorder publishes the telemetry to pubsub, which then gets funneled into the current pipeline that posts to DD as well as ClickHouse. Also start tagging the EGRESS_BYTES DD metric with the route metered by the middleware. This will allow us to compare the DTO we're tracking on the server with the close-to-the-wire bytes metered by the middleware.
a865d81 to
7d4cb96
Compare
| SERVER_EGRESS_TELEMETRY_BATCH_SIZE: z.coerce.number().int().positive().default(1_000), | ||
| SERVER_EGRESS_TELEMETRY_FLUSH_INTERVAL_MS: z.coerce.number().int().nonnegative().default(60_000), | ||
| SERVER_EGRESS_TELEMETRY_MAX_QUEUE_SIZE: z.coerce.number().int().positive().default(100_000), |
There was a problem hiding this comment.
Should they be prefixed with NANGO?
There was a problem hiding this comment.
Hmm, if they should, this boat has kind of sailed already as I introduced these two in a previous PR:
RUNNER_TELEMETRY_BATCH_SIZE: z.coerce.number().int().positive().max(1000).default(500),
RUNNER_TELEMETRY_FLUSH_INTERVAL_MS: z.coerce.number().int().nonnegative().default(10_000),
But I also see a bunch of other env vars without the NANGO_ prefix, so I'm not sure. Are we shooting to have that prefix as a standard?
| export const egressTelemetryRecorder = { | ||
| record(entry: ServerEgressTelemetry): void { | ||
| const res = batcher.add(entry); | ||
| if (res.isErr()) { |
There was a problem hiding this comment.
You will tell me, but do we need some cohesive metric here for knowing when the batcher is strugling, for knowing when we are droping metrics? For clickhouse we implemented this metric for knwoing when events are dropped, either bc the queue was full, we reached max number of retries, etc
There was a problem hiding this comment.
I'm currently relying on log-based filters and was planning on extracting metrics out of those logs with this. So the tl;dr is that I'm tracking it, just not with a regular metric maintained at the app level.
Alternatively, since the Batcher class is a shared utility, we could introduce a unified metric that adds a dimension based on where the Batcher is used (or something similar), so we'd be able to track specific instances of it. This feels like more than I'd like to do in this PR, though, and I'm trying to move fast with this PR to unlock the next analysis phase.
There was a problem hiding this comment.
Im ok on doing this in a new PR!
Add a telemetry recorder to
serverpowered by a Batcher singleton and meter bytes egressing fromGET /records. The recorder publishes the telemetry to pubsub, which then gets funneled into the current pipeline that posts to DD as well as ClickHouse.Also start tagging the EGRESS_BYTES DD metric with the route metered by the middleware. This will allow us to compare the DTO we're tracking on the server with the close-to-the-wire bytes metered by the middleware.