fix(observability): preserve OTLP resource attributes on forwarder hop#1618
fix(observability): preserve OTLP resource attributes on forwarder hop#1618deep-name wants to merge 6 commits into
Conversation
OTLP span forwarder drops `resource_spans[].resource` on the
engine → collector hop; SigNoz/Grafana/etc render `service.name=<nil>`
even though `OTEL_SERVICE_NAME` is set at the SDK.
Adds a tonic gRPC mock collector that captures every inbound
`ExportTraceServiceRequest` and asserts the captured resource carries
all three inbound attributes (`service.name`, `service.namespace`,
`service.version`) — full-resource preservation rather than a
single-field band-aid.
Runs in its own integration-test binary so the process-global
`SDK_SPAN_FORWARDER` `OnceLock` is fresh, with a 5s poll deadline + 20ms
tick so the happy path stays fast and the failure path stays bounded.
FAILS against the unfixed engine with `assertion: left: None, right:
Some("my-service")`. Locks in the regression contract before the fix
lands.
Closes prerequisite for iii-hq#1617.
Ridden with [Loa](https://github.com/0xHoneyJar/loa)
Replaces the bare `opentelemetry_otlp::SpanExporter` forwarder with a `tonic::Channel` to a `TraceServiceClient`, and bypasses `convert_otlp_to_span_data` on the forwarder branch by translating the parsed `OtlpExportTraceServiceRequest` directly into the proto `ExportTraceServiceRequest`. The full `resource_spans[].resource` block survives the engine → collector hop byte-for-byte, including `service.name`, `service.namespace`, `service.version`, and any other attributes the inbound SDK sent. Root cause: `SpanData` carries no resource — the OTel SDK contract is that the owning `TracerProvider` supplies one before the exporter serialises. `SDK_SPAN_FORWARDER` was a bare exporter (no provider), so the export call wrote an empty resource onto the wire. Collectors rendered the missing service identity as `<nil>`. Changes: - `SDK_SPAN_FORWARDER`: `OnceLock<Arc<SpanExporter>>` → `OnceLock<Channel>`. Channel clones cheaply, lazy connect matches the prior exporter-builder semantics, no init-time blocking. - `init_sdk_span_forwarder`: rewritten to build an `Endpoint` and call `connect_lazy()`. Visibility bumped to `pub` so the new integration test can point the forwarder at a mock collector. - Forwarder branch in `ingest_otlp_json`: translates the parsed JSON request to the proto type via `otlp_json_request_to_proto` and a family of field-for-field helpers, then sends through a freshly-cloned `TraceServiceClient`. - Deleted `convert_otlp_to_span_data` and `otlp_kv_to_key_value` (truly dead post-replacement; in-memory storage path does its own per-field conversion) plus their unit tests. - Trimmed one assertion in `test_ingest_otlp_json_skips_spans_with_invalid_identifiers` — malformed-span filtering was a side effect of the SpanData-based path. The new path relays everything and lets the collector validate. Memory-storage behaviour is unchanged (the actual contract under test). - `engine/Cargo.toml`: promote `opentelemetry-proto = "0.31"`, `tonic = "0.14"`, `tokio-stream = "0.1"` to direct deps. All three were transitively present via `opentelemetry-otlp`'s `grpc-tonic` feature — promotion is the only change. Sweep: 1542 lib tests passed, 0 failed; new forwarder integration test passes; cargo fmt + clippy clean on touched files. Credits: @alsruf36 for the diagnosis and the suggested fix shape in iii-hq#1617. Closes iii-hq#1617. Ridden with [Loa](https://github.com/0xHoneyJar/loa)
…Value coverage
Round-2 fixes for sprint-bug-1 review:
C1/DISS-001 — TLS regression for `https://` collectors. The new
raw-tonic channel did not configure TLS; the pre-fix
`opentelemetry_otlp::SpanExporter` builder auto-configured it via
its `tls-roots` feature. Production deployments behind TLS-fronted
collectors (Grafana Cloud, SigNoz Cloud, Honeycomb, the standard
prod shape) would silently fail at handshake time.
- `engine/Cargo.toml`: `tonic` features extended with
`tls-native-roots`.
- `engine/src/workers/observability/otel.rs:696-744`:
`init_sdk_span_forwarder` now branches on
`endpoint.starts_with("https://")` and applies
`ClientTlsConfig::new().with_native_roots()` to the Endpoint
before `connect_lazy()`. TLS-config failure falls back to
plaintext with a logged warning so HTTP-only deployments are
unaffected.
C2/DISS-002 — trace flags default mismatch. The forwarder defaulted
missing `flags` to 0 (unsampled) while the in-memory storage path
at `otel.rs:1521-1524` defaults to `TraceFlags::SAMPLED`. Same
inbound span was being treated as sampled by storage but unsampled
by the forwarder.
- `engine/src/workers/observability/otel.rs:1747`: flipped
`unwrap_or(0)` → `unwrap_or(1)` (W3C SAMPLED bit) with a parity
comment referencing the storage path.
N1 — AnyValue conversion test coverage. The round-1 test only
exercised the `StringValue` branch.
- `engine/tests/otel_forwarder_resource_test.rs`: payload now
includes `replica.count` (intValue) and `telemetry.enabled`
(boolValue); assertions extended via `find_int_attr` and
`find_bool_attr`. Flags-default assertion added for the SAMPLED
parity check.
N2 — source-level call-out for the deliberate malformed-span
forwarding behaviour change.
- `engine/src/workers/observability/otel.rs:1607-1622` extended
with the rationale in the forwarder-branch comment.
Round-2 adversarial cross-model review (gpt-5.3-codex):
`findings: []` / status `clean`. Sweep: 1542 lib tests pass, 1
extended forwarder integration test passes, cargo fmt + clippy
clean on touched files.
Closes review-loop iteration 1 for sprint-bug-1. Sprint approved.
Ridden with [Loa](https://github.com/0xHoneyJar/loa)
|
@deep-name is attempting to deploy a commit to the motia Team on Vercel. A member of the Team first needs to authorize it. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughReplaces the SDK SpanExporter forwarding path with direct OTLP proto gRPC forwarding: adds proto types and converters, exposes a TLS-aware global forwarder init, converts parsed OTLP JSON to ExportTraceServiceRequest, forwards with tonic::Channel and injected OTLP headers, and adds integration tests for resource attributes, headers, and partial-batch behavior. ChangesOTLP Span Forwarding: Direct Proto Relay
Sequence DiagramsequenceDiagram
participant Ingest as ingest_otlp_json
participant ProtoConv as otlp_json_request_to_proto
participant Forwarder as TraceServiceClient
participant Collector as OTLP Collector
Ingest->>Ingest: parse OTLP JSON and merge into in-memory storage
alt Forwarder initialized
Ingest->>ProtoConv: convert parsed JSON -> ExportTraceServiceRequest
ProtoConv->>ProtoConv: preserve resourceSpans[].resource attributes
ProtoConv-->>Forwarder: proto ExportTraceServiceRequest
Forwarder->>Collector: gRPC Export call (with injected OTLP headers)
Collector-->>Forwarder: Export response
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
deep-name
left a comment
There was a problem hiding this comment.
Summary
Analytical review of #1618. Enrichment pass was unavailable; findings are unenriched.
Findings
{
"schema_version": 1,
"findings": [
{
"id": "F1",
"title": "Heavy new dependencies (tonic, opentelemetry-proto, tokio-stream) added at top-level Cargo.toml for test-only use",
"severity": "MEDIUM",
"category": "dependencies",
"file": "engine/Cargo.toml",
"description": "The PR adds `opentelemetry-proto`, `tonic` (with `transport` and `tls-native-roots`), and `tokio-stream` to the main `[dependencies]` section. From the visible diff, these are only consumed by an integration test (`engine/tests/otel_forwarder_resource_test.rs`). Putting them in `[dependencies]` instead of `[dev-dependencies]` enlarges the production binary closure (tonic with TLS pulls in rustls/native-roots, prost, hyper) and lengthens compile times for every consumer of the engine crate. If the forwarder source itself doesn't actually import these (the otel.rs section is truncated so this can't be fully verified), they should be `[dev-dependencies]`.",
"suggestion": "Confirm whether the forwarder implementation in `engine/src/workers/observability/otel.rs` directly uses `opentelemetry-proto`, `tonic`, or `tokio-stream`. If not (i.e. only the test does), move all three to `[dev-dependencies]`. If `opentelemetry-proto` is genuinely needed in production for proto conversion, keep just that one in `[dependencies]` and move `tonic`/`tokio-stream` to dev.",
"confidence": 0.6
},
{
"id": "F2",
"title": "Process-global OnceLock forwarder state makes integration test order-dependent",
"severity": "MEDIUM",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "The header comment states the test relies on `SDK_SPAN_FORWARDER` being a process-global `OnceLock`, fresh per integration test file. This is true today but fragile: if a future test file in the same `tests/` directory also calls `init_sdk_span_forwarder`, or if cargo test harness changes (e.g., a future shared test binary, or someone adds a `#[test]` to the same file pointing at a different endpoint), the OnceLock will silently bind the wrong endpoint and the second test will export to the wrong collector. There is no documented protection nor a test-only reset hook.",
"suggestion": "Either (a) add a `#[cfg(test)] pub fn reset_sdk_span_forwarder_for_tests()` that swaps the OnceLock contents, or (b) refactor the forwarder to accept an injected endpoint/exporter so tests don't depend on process-global state, or (c) at minimum add a code comment on `SDK_SPAN_FORWARDER` itself warning future integration tests not to call `init_sdk_span_forwarder` in the same binary.",
"confidence": 0.7
},
{
"id": "F3",
"title": "50ms sleep for collector startup is a flake vector",
"severity": "LOW",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "`spawn_mock_collector` sleeps 50ms unconditionally before returning, with a comment acknowledging this exists to avoid races with tonic's accept loop. Loaded CI runners (especially under heavy parallel cargo test) can easily exceed 50ms before the spawned task is polled. The downstream poll loop (5s deadline) does provide a fallback safety net for the actual export, but the initial race could still cause connection-refused on the first export attempt and rely on tonic's internal retry — and that retry is not guaranteed across tonic versions.",
"suggestion": "Replace the sleep with an active readiness check: in a small loop, attempt to `TcpStream::connect(addr)` until success or a bounded timeout (~1s). Alternatively, bind the listener synchronously *before* the spawn and only spawn the `serve_with_incoming` future — `TcpListener::bind` already guarantees the port is listening, so the race is solely about the tonic accept-loop being polled. A `tokio::task::yield_now().await` followed by a connect-probe loop would be deterministic.",
"confidence": 0.6
},
{
"id": "F4",
"title": "Mock collector server task swallows panics on `unwrap()`",
"severity": "LOW",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "The spawned `Server::builder()...serve_with_incoming(...).await.unwrap()` runs in a detached `tokio::spawn`. If the server fails to start or errors out, the `.unwrap()` will panic inside the spawned task, but the test harness will see the export simply never arrive and surface as `captured.len() == 0` with a misleading message. The task handle is not awaited or stored.",
"suggestion": "Either retain the JoinHandle and check it on test failure, or replace `.unwrap()` with `.expect(\"mock collector server failed to start\")` plus a `tracing::error!` — at minimum so the panic surfaces in test output rather than silently disappearing into a detached task.",
"confidence": 0.5
},
{
"id": "F5",
"title": "Test asserts SAMPLED default but contract is undocumented in PR scope",
"severity": "LOW",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "The final assertion (`span.flags & 0x01 == 0x01`) pins the forwarder to default missing trace-flags to SAMPLED, with rationale that it 'mirrors the in-memory storage path'. This is a meaningful behavioural contract, but it is being introduced alongside a fix titled 'preserve OTLP resource attributes' — the trace-flags default is a separate concern. If this default is new behaviour added in this PR, it deserves explicit call-out in the PR description; if it's pre-existing behaviour now being locked in, that's fine but worth noting. The OTLP spec leaves trace flags semantics to the sender; a forwarder defaulting unsampled→SAMPLED could subtly amplify sampled trace volume downstream.",
"suggestion": "Confirm in the PR description whether the SAMPLED default is (a) pre-existing and being regression-locked, or (b) new in this PR. If (b), consider whether defaulting to whatever the inbound payload says (0 when absent) is more correct per OTLP semantics, since collectors typically interpret missing flags as 'unspecified' rather than 'unsampled'.",
"confidence": 0.5
},
{
"id": "F6",
"title": "Test only asserts forwarded payload, not absence of duplication or modification",
"severity": "LOW",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "The test asserts `captured.len() == 1` and that 5 attributes are correctly preserved. It does not assert that *no extra* attributes were injected by the forwarder (e.g. a hop-level `forwarder.id` attribute), nor that `scope_spans`/spans are passed through unmodified beyond the flags check. A forwarder bug that *adds* attributes or rewrites span names would pass this test.",
"suggestion": "Consider asserting `resource.attributes.len() == 5` and `req.resource_spans[0].scope_spans[0].spans[0].name == \"forwarder.preserves.resource\"` to lock in pass-through fidelity, not just preservation of the named subset.",
"confidence": 0.55
},
{
"id": "F7",
"title": "Net deletion of ~374 lines in otel.rs alongside 'fix' is non-obvious",
"severity": "SPECULATION",
"category": "review-visibility",
"file": "engine/src/workers/observability/otel.rs",
"description": "The truncated diff shows `engine/src/workers/observability/otel.rs` with +323 -697, a net deletion of ~374 lines, in a PR framed as a bug fix for resource-attribute preservation. A fix of this magnitude typically suggests a refactor or rewrite of the forwarder path rather than a surgical fix. Without seeing the source diff, I cannot evaluate whether the deletion removes dead code, consolidates a parser, replaces a hand-rolled protobuf encoder with `opentelemetry-proto`, or removes functionality. If it removes the hand-rolled OTLP/JSON→proto translation in favour of using `opentelemetry-proto` types directly, that's likely a good change but is much larger than the PR title implies.",
"suggestion": "Request the full diff or a summary in the PR description explaining the structural change: is this replacing a custom proto encoder with `opentelemetry-proto::tonic` types? Are any code paths (e.g. backpressure, retry, batching) being removed in the process? Reviewers should not approve a -374-line change on the strength of one regression test without seeing what was removed.",
"confidence": 0.7
},
{
"id": "F8",
"title": "Regression test design with full AnyValue branch coverage and clear failure messages",
"severity": "PRAISE",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "The test deliberately exercises three `AnyValue` discriminator branches (stringValue, intValue, boolValue), each with an assertion message that explains the bug class being guarded against rather than just the expected value. The header comment also explicitly states why the test is in a separate integration file (process-global OnceLock isolation) and why the assertion contract is 'full resource preservation, not just service.name' — preventing a future tempting single-field band-aid from passing this test. This is the right shape for a regression test tied to a specific incident.",
"suggestion": "No action needed."
}
]
}Callouts
Enrichment unavailable for this review.
Local pre-emptive Bridgebuilder pass (claude-opus-4-7 via persona at
grimoires/bridgebuilder/BEAUVOIR.md) surfaced 6 net new findings beyond
the round-1 review-sprint adversarial dissent. Round-3 fixes:
F1 (MEDIUM, dep placement) — `tokio-stream` is only used by the test
file's `TcpListenerStream::new(listener)`. Moved from `[dependencies]`
to `[dev-dependencies]`. `opentelemetry-proto` + `tonic` stay in
`[dependencies]` (they're used by production code in
`engine/src/workers/observability/otel.rs` for proto types,
`TraceServiceClient`, `Channel`, `ClientTlsConfig`).
F2 (MEDIUM, OnceLock test fragility) — `SDK_SPAN_FORWARDER` is a
process-global `OnceLock`; a second test in the same integration
binary calling `init_sdk_span_forwarder` would silently no-op against
the first test's endpoint. Added a doc-comment on the static
explicitly warning future contributors to place new tests in their
own `engine/tests/*.rs` file and pointing at the
`#[cfg(test)] pub fn reset_for_test()` escape hatch if parameterised
endpoints are needed.
F3 (LOW, 50ms sleep flake) — replaced the fixed `tokio::time::sleep`
in `spawn_mock_collector` with a bounded `TcpStream::connect` probe
loop (2s deadline, `yield_now` between attempts). On loaded CI
runners the 50ms heuristic could lose the race with tonic's accept-
loop polling; the probe is deterministic.
F4 (LOW, swallowed panics) — replaced `.unwrap()` on the spawned
mock-server task with `.expect("mock OTLP collector failed to start;
see tonic transport error above")`. The detached task would silently
disappear on bind/serve failure; now the panic message surfaces.
F6 (LOW, pass-through fidelity) — added two assertions to the
forwarder test:
- `resource.attributes.len() == 5` catches a forwarder bug that
ADDS attributes (e.g. injecting a hop-level `forwarder.id`).
- `span.name == "forwarder.preserves.resource"` catches a forwarder
bug that rewrites the span name. Without this, a name-rewrite
defect would pass all the existing attribute assertions.
Findings F5 (SAMPLED-default semantics) and F7 (-374 net lines) were
triaged as already-addressed — F5 is documented in commit `76f0b05e`
and the round-1 reviewer.md; F7 is explained in the PR description as
the dead-code removal of `convert_otlp_to_span_data`.
Bridgebuilder review preserved on PR thread for audit trail; full
findings at iii-hq#1618 (review).
Sweep: 1542 lib tests pass, 1 forwarder integration test pass with
extended assertions (5-attr count + span-name check), cargo fmt
clean, clippy clean on touched files.
Ridden with [Loa](https://github.com/0xHoneyJar/loa)
|
local self-review summary before maintainer queue: ran our own bridgebuilder pass (claude-opus-4-7 via custom reviewer persona) plus two rounds of adversarial cross-model dissent (gpt-5.3-codex for both review + audit phases). 4 model-passes total, 3 distinct concerns surfaced + addressed before this hit your queue:
artifacts kept in our state zone for traceability:
ready when you are. happy to drop the bridgebuilder bot comment above if it's noisy — it's preserved here as the audit trail showing the issues were caught locally, not surfaced by your team's review cycle. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
engine/src/workers/observability/otel.rs (1)
1775-1781: ⚡ Quick winComment overstates "parity" — in-memory storage doesn't default
flagsto SAMPLED.The doc claims the SAMPLED default mirrors "the in-memory storage path's default at the top of
ingest_otlp_json", but the storage path storesflags: span.flagsunchanged (Option) — missing flags stayNonethere, not1. The forwarder-side default to1is defensible on its own (the proto field is non-optionalu32, and defaulting to0would mark all SDK-origin spans as unsampled at downstream collectors), and the integration test pins it. Just trim the parity framing so future readers don't go looking for a matching default in the storage code that doesn't exist.📝 Suggested wording tweak
- // Parity with the in-memory storage path's default at the top of - // `ingest_otlp_json` — a span arriving without an explicit `flags` - // field is treated as SAMPLED (W3C trace-flags bit 0 = 0x01). - // Defaulting to 0 here would mark such spans as unsampled on the - // forwarder hop while storage records them as sampled, producing - // inconsistent behaviour across the two halves of the same call. + // A span arriving without an explicit `flags` field is treated + // as SAMPLED (W3C trace-flags bit 0 = 0x01). The proto field is + // a non-optional u32; defaulting to 0 here would mark every + // SDK-origin span without explicit flags as unsampled at + // downstream samplers/collectors that honour the SAMPLED bit. + // The in-memory storage path keeps `flags` as Option<u32> and + // serialises it via `skip_serializing_if`, so the two halves + // express the same intent in different shapes. flags: span.flags.unwrap_or(1),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@engine/src/workers/observability/otel.rs` around lines 1775 - 1781, Update the comment next to the flags assignment (flags: span.flags.unwrap_or(1) in otel.rs) to remove the incorrect claim of parity with the in-memory storage path; instead note that the forwarder defaults to 1 because the proto field is non-optional and defaulting to 0 would mark SDK-origin spans unsampled, and clarify that the storage path (ingest_otlp_json) preserves Option<u32> (span.flags) and does not default missing flags to 1.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@engine/src/workers/observability/otel.rs`:
- Around line 1775-1781: Update the comment next to the flags assignment (flags:
span.flags.unwrap_or(1) in otel.rs) to remove the incorrect claim of parity with
the in-memory storage path; instead note that the forwarder defaults to 1
because the proto field is non-optional and defaulting to 0 would mark
SDK-origin spans unsampled, and clarify that the storage path (ingest_otlp_json)
preserves Option<u32> (span.flags) and does not default missing flags to 1.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8ccda45d-028d-4d49-9a7b-ed43547dd6a0
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
engine/Cargo.tomlengine/src/workers/observability/otel.rsengine/tests/otel_forwarder_resource_test.rs
|
status update for the maintainer queue:
#1618 stays scoped to the forwarder. ready for review. |
|
Hey @deep-name, thanks for your contribution one more time. Can you consider add these tests on your PR? Feature: SDK span forwarder honors OTEL_EXPORTER_OTLP_HEADERS
As an operator running iii behind an auth-required OTLP collector
(Grafana Cloud, Honeycomb, SigNoz Cloud, or any API-keyed deployment)
I want SDK-ingested spans forwarded by the engine to carry the same
auth metadata as engine-originated spans
So that worker-forwarded telemetry does not silently get rejected at
the collector while engine telemetry keeps flowing
Background:
Given a mock OTLP gRPC collector listening on an ephemeral loopback port
And the collector captures the tonic MetadataMap of every Export request
Scenario: Authorization header from env var lands on the wire
Given the environment variable OTEL_EXPORTER_OTLP_HEADERS is set to
"authorization=Bearer-test-token-12345"
And the SDK span forwarder has been initialized against the mock collector
When a well-formed OTLP/JSON payload with one valid span is ingested via
ingest_otlp_json
Then the mock collector receives exactly one ExportTraceServiceRequest
And that request carries an "authorization" metadata entry
AndFeature: One bad span does not poison the rest of the forwarded batch
As an operator whose SDKs occasionally emit malformed trace/span IDs
(instrumentation bugs, truncated context propagation, custom emitters)
I want the forwarder to drop only the malformed spans
So that valid telemetry in the same payload still reaches the collector
instead of being rejected wholesale when one span has empty ID bytes
Background:
Given a mock OTLP gRPC collector listening on an ephemeral loopback port
And the collector captures every ExportTraceServiceRequest it receives
Scenario: Mixed batch of valid and malformed spans is partially forwarded
Given the SDK span forwarder has been initialized against the mock collector
And an OTLP/JSON payload contains two spans:
| name | trace_id | span_id | validity |
| bad-trace-id | not-hex | also-not-hex | invalid |
| valid-span | 0af7651916cd43dd8448eb211c80319c | b7ad6b7169203331 | valid |
When the payload is ingested via ingest_otlp_json
Then the mock collector receives at least one ExportTraceServiceRequest
And no span on the wire has an empty trace_id
And no span on the wire has an empty span_id
And the captured stream contains exactly one span
And that span has name "valid-span"
And that span has trace_id 0af7651916cd43dd8448eb211c80319c
And that span has span_id b7ad6b7169203331
|
|
both scenarios are spot-on @ytallo — these are the next two regression classes the raw-tonic switch introduced beyond TLS. on it. quick read so you know what's landing: scenario 1 (OTEL_EXPORTER_OTLP_HEADERS): confirmed regression. scenario 2 (one bad span doesn't poison the batch): the audit-sprint feedback called out the "relay malformed spans, let collector validate" tradeoff as deliberate, but you're right that empty-byte IDs on the wire are a regression vs the pre-fix shipping both scenarios as runnable rust tests against the existing tonic mock + an HTTPS-fronted mock variant for header propagation. should have the commit batch up in the next hour or so. |
… spans Addresses both gherkin scenarios from @ytallo's iii-hq#1618 review: Scenario 1 — OTEL_EXPORTER_OTLP_HEADERS regression. The pre-fix `opentelemetry_otlp::SpanExporter::builder().with_tonic()` parsed `OTEL_EXPORTER_OTLP_HEADERS` (and the signal-specific `OTEL_EXPORTER_OTLP_TRACES_HEADERS` override) automatically per the OTel SDK spec, injecting each `key=value` pair as a tonic `MetadataMap` entry on every export. The raw-`tonic::Channel` replacement in earlier commits lost that behaviour silently — every API-keyed collector (Grafana Cloud, Honeycomb, SigNoz Cloud) was about to reject forwarded spans at auth. Changes for scenario 1: - New `SdkForwarderState` struct wrapping the `tonic::Channel` plus the parsed `Vec<(MetadataKey<Ascii>, MetadataValue<Ascii>)>` pairs. Replaces the bare-`Channel` `OnceLock<Channel>` static. - New `parse_otlp_headers_env()` reads `OTEL_EXPORTER_OTLP_TRACES_HEADERS` first (signal-specific takes precedence per OTel spec), falls back to `OTEL_EXPORTER_OTLP_HEADERS`. Comma-split on entries, `=`-split on key/value, invalid pairs warn-and-skip rather than fail init (matches `opentelemetry-otlp`'s tolerance). - Forwarder dispatch builds a `tonic::Request` and attaches every parsed pair to `metadata_mut()` before calling `TraceServiceClient::export(...)`. Scenario 2 — one bad span no longer poisons the batch. The previous `otlp_span_to_proto` was infallible: it called `hex_decode_or_empty` on malformed `trace_id` / `span_id` strings and emitted spans with empty-byte IDs on the wire. Most collectors reject any batch containing such spans wholesale, so a single bad SDK emitter (instrumentation bug, truncated context propagation, custom client) could blackhole otherwise-valid telemetry from the same payload. Changes for scenario 2: - `otlp_span_to_proto` now returns `Option<ProtoSpan>` and returns `None` whenever `trace_id` or `span_id` fail to hex-decode OR decode to wrong byte lengths (trace_id MUST be 16 bytes, span_id MUST be 8 — surfaced by adversarial review DISS-001 in this round as a gap in the original hex-only filter: `hex::decode("01")` succeeds and produces a 1-byte trace_id that collectors still reject). - New `OTLP_TRACE_ID_BYTES`/`OTLP_SPAN_ID_BYTES` consts so the spec numbers have a single point of truth. - `otlp_scope_spans_to_proto` now `filter_map`s over the spans, dropping `None`s so valid spans in the same batch survive. - `parent_span_id` validates the same way but falls back to empty (rather than dropping the whole span) when malformed — the orphaned-parent case is recoverable downstream. - `hex_decode_or_empty` helper deleted (only consumer is link metadata, which inlines `hex::decode(...).unwrap_or_default()` — links to malformed-ID remote spans are recoverable, the link's parent span itself was already validated). - Forwarder-dispatch comment updated to reflect the new drop-malformed-spans contract (was previously "relays everything, collector validates"; now "drops malformed, valid spans survive"). Tests: - `engine/tests/otel_forwarder_headers_test.rs` (new) — sets `OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer-test-token-12345,x-tenant=acme` before init, asserts both metadata entries land on the `ExportTraceServiceRequest`'s tonic `MetadataMap`. Single test per binary to avoid env-var contamination across parallel tests. - `engine/tests/otel_forwarder_partial_batch_test.rs` (new) — ingests a three-span payload (non-hex IDs, hex-but-wrong-length IDs, valid IDs), asserts the collector receives exactly one span on the wire (the valid one) with spec-correct ID lengths (16/8 bytes). - Existing `engine/tests/otel_forwarder_resource_test.rs` unchanged and still green; the spec-length filter doesn't disturb its happy-path payload. Sweep: 1542 lib tests passed, 3 forwarder integration tests passed (resource preservation + headers env var + partial-batch filtering), cargo fmt + clippy clean on touched files. Adversarial cross-model re-review (gpt-5.3-codex) returned `findings: []` / status `clean` after the spec-length tightening. Credits: @ytallo for the gherkin scenarios surfacing both regressions on iii-hq#1618 before they reached production. @alsruf36 for the end-to-end validation earlier in the review cycle that anchored the contract this PR is pinning. Ridden with [Loa](https://github.com/0xHoneyJar/loa)
deep-name
left a comment
There was a problem hiding this comment.
Summary
Analytical review of #1618. Enrichment pass was unavailable; findings are unenriched.
Findings
{
"schema_version": 1,
"findings": [
{
"id": "F1",
"title": "Integration tests rely on global SDK_SPAN_FORWARDER OnceLock and cannot coexist",
"severity": "MEDIUM",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "Both new integration test files (otel_forwarder_resource_test.rs and otel_forwarder_partial_batch_test.rs, plus the pre-existing otel_forwarder_headers_test.rs) call init_sdk_span_forwarder(&endpoint) which writes to a process-global OnceLock. The comment in each file asserts that 'Owns its own process via being an integration test file, so the process-global SDK_SPAN_FORWARDER OnceLock is fresh on every run.' This is true only because cargo compiles each integration test file in tests/ as a separate binary. However, if any test file in the future contains more than one #[tokio::test] that calls init_sdk_span_forwarder with a different endpoint, all subsequent tests will silently send to the first endpoint (OnceLock semantics) and the assertions about 'exactly one ExportTraceServiceRequest' will become non-deterministic across runs. There is no test-level guardrail (no serial_test marker, no assertion that the OnceLock was uninitialized) preventing future contributors from breaking this invariant.",
"suggestion": "Either (a) add a doc-comment near init_sdk_span_forwarder warning that the OnceLock requires one-test-per-binary, with a brief explanation, or (b) expose a #[cfg(test)] reset helper so multiple tests per binary become safe. Option (b) is more robust.",
"confidence": 0.75
},
{
"id": "F2",
"title": "Mock collector spawn task swallows panics on Server::builder errors after readiness",
"severity": "LOW",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "The tokio::spawn task that runs the tonic Server uses .expect() on serve_with_incoming. If the server errors mid-test (e.g., port reclaimed, OS resource exhaustion), the spawned task panics in the background but the main test continues, producing confusing 'received 0 requests' failures rather than the underlying transport error. Same pattern in otel_forwarder_partial_batch_test.rs.",
"suggestion": "Capture the JoinHandle and abort()/check it at end-of-test, or use a oneshot channel to surface server startup errors to the main test thread.",
"confidence": 0.6
},
{
"id": "F3",
"title": "Readiness probe leaves a stray connect socket against tonic accept loop",
"severity": "LOW",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "The TCP connect-then-drop readiness probe at line ~62 establishes a real TCP connection to the tonic server that is then dropped without an HTTP/2 handshake. tonic/h2 may log a spurious warning ('connection error: not a valid HTTP/2 connection preface' or similar) on stderr for each test. Not a correctness bug but adds noise to test output.",
"suggestion": "Acceptable as-is; if the noise becomes problematic, switch to a lightweight in-process readiness signal (oneshot sent from inside the spawn before serve_with_incoming).",
"confidence": 0.4
},
{
"id": "F4",
"title": "Excellent regression-test design with explicit contract assertions",
"severity": "PRAISE",
"category": "testing",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "The resource-preservation test pins not only the originally-broken service.name field but also tests three AnyValue discriminator branches (string, int, bool), span-name pass-through, attribute count (catches injection), and SAMPLED-flag default behavior. The partial-batch test goes further: it pins both hex-decode-failure AND hex-decodes-but-wrong-length cases, citing the spec lengths (16 bytes / 8 bytes) explicitly. Failure messages reference issue numbers and explain the intended invariant, making future debugging significantly easier.",
"suggestion": "No change required."
},
{
"id": "F5",
"title": "Production tonic dependency added with tls-native-roots — confirm intent",
"severity": "LOW",
"category": "dependencies",
"file": "engine/Cargo.toml",
"description": "tonic is added to the [dependencies] section (not [dev-dependencies]) with features = [\"transport\", \"tls-native-roots\"]. The new test files use tonic, but they only need tonic at dev-time. If the forwarder implementation in the truncated otel.rs needs tonic directly (not just transitively through opentelemetry-otlp which already pulls tonic), that is fine. Otherwise, this is dependency bloat and should move to [dev-dependencies]. Cannot fully verify without the truncated otel.rs diff.",
"suggestion": "Verify whether engine/src/workers/observability/otel.rs imports tonic directly. If only the integration tests need it, move tonic + opentelemetry-proto to [dev-dependencies].",
"confidence": 0.55
},
{
"id": "F6",
"title": "Typo in test comment",
"severity": "LOW",
"category": "nit",
"file": "engine/tests/otel_forwarder_resource_test.rs",
"description": "'preceeding' is misspelled (should be 'preceding') in the comment block at the attribute-count assertion.",
"suggestion": "Fix spelling.",
"confidence": 0.95
},
{
"id": "F7",
"title": "Duplicated mock-collector helper across three test files",
"severity": "LOW",
"category": "maintainability",
"file": "engine/tests/otel_forwarder_partial_batch_test.rs",
"description": "CapturingTraceService, TraceService impl, and spawn_mock_collector are duplicated verbatim across otel_forwarder_resource_test.rs, otel_forwarder_partial_batch_test.rs, and (per the truncated diff) otel_forwarder_headers_test.rs. Cargo integration tests can share code via a tests/common/mod.rs module; this would also enforce that the readiness/spawn semantics stay in lockstep across the suite.",
"suggestion": "Extract the mock collector into engine/tests/common/mod.rs and have each integration test `mod common;` it.",
"confidence": 0.85
},
{
"id": "F8",
"title": "Truncated otel.rs change is the actual fix — cannot verify behavior without it",
"severity": "REFRAME",
"category": "review-scope",
"file": "engine/src/workers/observability/otel.rs",
"description": "The substantive fix (449 added, 696 removed lines in engine/src/workers/observability/otel.rs) is truncated from the diff. The tests pin behavior but the production change itself — including the `otlp_span_to_proto` returning Option, the resource conversion path, and the SAMPLED-default logic — is unreviewed. Findings here cover only the test/dependency surface; the core fix needs a separate pass.",
"suggestion": "Request the full otel.rs diff for the next review pass to verify: (1) hex-decode length checks on trace_id (16) and span_id (8), (2) resource attribute conversion preserves all AnyValue variants, (3) SAMPLED-default flag handling, (4) no unwrap()/expect() on adversarial input.",
"confidence": 0.9
}
]
}Callouts
Enrichment unavailable for this review.
Addresses two Bridgebuilder F-findings from the local self-review pass on PR iii-hq#1618 head a3c37fb: F7 (LOW, maintainability) — `CapturingTraceService`, the `TraceService` impl, the spawn helper, and the readiness probe were duplicated across all three `otel_forwarder_*` integration tests (~60 lines each, three copies). When the `SdkForwarderState` shape changed in earlier rounds the three helpers drifted out of lockstep until the partial-batch test caught it. Extracted to `engine/tests/common/forwarder_mock.rs` with two variants: - `spawn_body_capturing_collector` — used by the resource-preservation and partial-batch tests (both need the `ExportTraceServiceRequest` body). - `spawn_metadata_capturing_collector` — used by the headers test (needs `tonic::MetadataMap` instead of body). Both share the same readiness-probe contract (bounded `TcpStream::connect` retry; see file-level doc-comment for the rationale that replaced the 50ms-sleep heuristic earlier in the review cycle). Registered in `engine/tests/common/mod.rs` alongside the existing `http_helpers` / `queue_helpers` modules. Net effect: tests/ shrinks by ~210 lines (210 deletions vs 32 insertions, after the new helper module is counted). F6 (NIT, spelling) — "preceeding" → "preceding" in the comment block on the attribute-count assertion in `otel_forwarder_resource_test.rs`. Deferred F-findings (with rationale): F1 (OnceLock fragility) was already addressed in round-3 via the doc-comment on `SDK_SPAN_FORWARDER`; BB couldn't see otel.rs in the truncated diff. F2/F3 (server task panics / readiness probe noise) — `.expect()` was added in round-3 and the noise is acceptable per BB's own confidence (0.4/0.6). F5 (dep placement) — `tonic` IS used in production code (`Channel`, `Endpoint`, `MetadataKey/Value`, `ClientTlsConfig`); the `[dependencies]` placement is correct. F8 (truncated otel.rs review scope) — adversarial cross-model dissent (`gpt-5.3-codex`) reviewed the full diff and returned `findings: []` after the round-4 spec-length tightening. Sweep: 1542 lib tests passed, 3 forwarder integration tests passed (all three now via the shared helper module), cargo fmt + clippy clean on touched files. Ridden with [Loa](https://github.com/0xHoneyJar/loa)
|
both gherkin scenarios closed end-to-end @ytallo. round summary across 2 commits: a3c37fb — scenarios proper scenario 1 (OTEL_EXPORTER_OTLP_HEADERS): new scenario 2 (one bad span doesn't poison the batch): 380d684 — test maintainability nits from local bridgebuilder pass extracted the duplicated mock-collector helpers into cross-model adversarial dissent (gpt-5.3-codex) was clean against the final state. sweep: 1542 lib tests + 3 forwarder integration tests pass, cargo fmt + clippy clean on touched files. ready when you are. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@engine/src/workers/observability/otel.rs`:
- Around line 713-756: parse_otlp_headers_env currently uses header values
verbatim; update it to percent-decode the value string before constructing the
MetadataValue so OTLP percent-encoded headers (e.g. Authorization=Basic%20...)
are decoded per the spec: after trimming v in parse_otlp_headers_env, run a
percent-decode (e.g. via percent_encoding::percent_decode_str or similar) and if
decoding fails log a warning (like the existing MetadataValue error branch,
include header_name = k) and skip the entry; then pass the decoded string/bytes
into MetadataValue::<Ascii>::try_from instead of the original v; preserve the
existing header-name parsing/validation and warnings for invalid name/value.
- Around line 1926-1934: The otlp_link_to_proto function currently uses
hex::decode(...).unwrap_or_default() which can produce short byte vectors (e.g.,
1 byte) and cause OTLP collectors to reject the whole batch; change the trace_id
and span_id handling so you decode the hex string, check the decoded length
(trace_id must be exactly 16 bytes, span_id must be exactly 8 bytes), and if the
length is invalid replace with an empty Vec<u8> (preserving the previous
"recoverable metadata loss" behavior) instead of using unwrap_or_default; update
the trace_id/span_id expressions in otlp_link_to_proto to perform this
decode+length-check logic (and keep trace_state as
link.trace_state.clone().unwrap_or_default()).
In `@engine/tests/otel_forwarder_headers_test.rs`:
- Around line 99-170: The test
forwarder_injects_otel_exporter_otlp_headers_into_export_metadata must
explicitly guard against existing OTEL_EXPORTER_OTLP_TRACES_HEADERS precedence:
capture the current values of OTEL_EXPORTER_OTLP_HEADERS and
OTEL_EXPORTER_OTLP_TRACES_HEADERS, clear or set both appropriately before
calling iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint)
(ensuring the desired OTEL_EXPORTER_OTLP_HEADERS is set and
OTEL_EXPORTER_OTLP_TRACES_HEADERS is cleared), and then restore the original
values at the end of the test; this guarantees deterministic header precedence
for the assertions without changing init_sdk_span_forwarder itself.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5a3a7220-e44a-4749-8520-077514a8310d
📒 Files selected for processing (3)
engine/src/workers/observability/otel.rsengine/tests/otel_forwarder_headers_test.rsengine/tests/otel_forwarder_partial_batch_test.rs
| fn parse_otlp_headers_env() -> Vec<(MetadataKey<Ascii>, MetadataValue<Ascii>)> { | ||
| let raw = std::env::var("OTEL_EXPORTER_OTLP_TRACES_HEADERS") | ||
| .or_else(|_| std::env::var("OTEL_EXPORTER_OTLP_HEADERS")) | ||
| .ok(); | ||
| let Some(raw) = raw else { | ||
| return Vec::new(); | ||
| }; | ||
| if raw.trim().is_empty() { | ||
| return Vec::new(); | ||
| } | ||
| raw.split(',') | ||
| .filter_map(|pair| { | ||
| let pair = pair.trim(); | ||
| if pair.is_empty() { | ||
| return None; | ||
| } | ||
| } | ||
| let (k, v) = pair.split_once('=')?; | ||
| let k = k.trim(); | ||
| let v = v.trim(); | ||
| let key = match MetadataKey::<Ascii>::from_bytes(k.as_bytes()) { | ||
| Ok(k) => k, | ||
| Err(e) => { | ||
| tracing::warn!( | ||
| error = %e, | ||
| header_name = k, | ||
| "OTEL_EXPORTER_OTLP_HEADERS entry has invalid header name; skipping" | ||
| ); | ||
| return None; | ||
| } | ||
| }; | ||
| let value = match MetadataValue::<Ascii>::try_from(v) { | ||
| Ok(v) => v, | ||
| Err(e) => { | ||
| tracing::warn!( | ||
| error = %e, | ||
| header_name = k, | ||
| "OTEL_EXPORTER_OTLP_HEADERS entry has invalid header value; skipping" | ||
| ); | ||
| return None; | ||
| } | ||
| }; | ||
| Some((key, value)) | ||
| }) | ||
| .collect() |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
For OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS, are values required to be URL-decoded before sending request headers?
💡 Result:
No. The OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS values are not something you should “URL-decode” as part of sending the HTTP/gRPC request headers. What the spec requires is that the environment variable value itself is formatted like W3C Baggage (key=value,key2=value2), and that the value portion is URL-encoded; SDKs should percent-decode that value into the actual header value they send on the wire. The header key should not be percent-decoded; only the value is treated as URL-encoded baggage. [1][2][3] In practice: if you set OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic%20..." then the SDK should decode %20 to a space before sending the Authorization header. [4][5][6][7] So: you (the user) should provide URL-encoded header values in the env var; the SDK performs the decoding when constructing the actual request headers. You generally do not URL-decode yourself before exporting the env vars. [1][2][3]
Citations:
- 1: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md
- 2: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/protocol/exporter.md
- 3: OTLP exporters should not percent decode the key when parsing HEADERS env vars open-telemetry/opentelemetry-go#5623
- 4: Decode URL-encoded headers in environment vars open-telemetry/opentelemetry-python#2312
- 5: URL encoded values for
OTEL_EXPORTER_OTLP_HEADERSare not exported open-telemetry/opentelemetry-cpp#2523 - 6: URL encoded values for
OTEL_EXPORTER_OTLP_HEADERSare not supported open-telemetry/opentelemetry-php#1239 - 7: URL encoded values for
OTEL_EXPORTER_OTLP_HEADERSare not supported open-telemetry/opentelemetry-rust#1577
Add percent-decoding for OTEL header environment variable values per OTLP specification.
The OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS environment variables require header values to be percent-encoded by the user; the SDK must percent-decode them before constructing request headers. The current code (lines 723–745) parses values verbatim without decoding, which breaks headers with encoded characters (e.g., Authorization=Basic%20... will send literal %20 instead of a space). This violates the OTLP specification and causes interoperability issues with collectors.
Proposed fix
let v = v.trim();
+ let v = match urlencoding::decode(v) {
+ Ok(decoded) => decoded.into_owned(),
+ Err(e) => {
+ tracing::warn!(
+ error = %e,
+ header_name = k,
+ "OTEL_EXPORTER_OTLP_HEADERS entry has invalid percent-encoding; skipping"
+ );
+ return None;
+ }
+ };
let value = match MetadataValue::<Ascii>::try_from(v.as_str()) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@engine/src/workers/observability/otel.rs` around lines 713 - 756,
parse_otlp_headers_env currently uses header values verbatim; update it to
percent-decode the value string before constructing the MetadataValue so OTLP
percent-encoded headers (e.g. Authorization=Basic%20...) are decoded per the
spec: after trimming v in parse_otlp_headers_env, run a percent-decode (e.g. via
percent_encoding::percent_decode_str or similar) and if decoding fails log a
warning (like the existing MetadataValue error branch, include header_name = k)
and skip the entry; then pass the decoded string/bytes into
MetadataValue::<Ascii>::try_from instead of the original v; preserve the
existing header-name parsing/validation and warnings for invalid name/value.
| fn otlp_link_to_proto(link: &OtlpSpanLink) -> opentelemetry_proto::tonic::trace::v1::span::Link { | ||
| // Links to malformed-ID remote spans fall back to empty bytes here | ||
| // (rather than dropping the link) — the parent span itself was already | ||
| // validated by `otlp_span_to_proto`, so a bad link is a recoverable | ||
| // metadata loss, not a wholesale span-loss event. | ||
| opentelemetry_proto::tonic::trace::v1::span::Link { | ||
| trace_id: hex::decode(&link.trace_id).unwrap_or_default(), | ||
| span_id: hex::decode(&link.span_id).unwrap_or_default(), | ||
| trace_state: link.trace_state.clone().unwrap_or_default(), |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In OTLP Trace Span.Link, must trace_id be exactly 16 bytes and span_id exactly 8 bytes, and can collectors reject entire exports when these lengths are invalid?
💡 Result:
Yes, in OTLP Trace Span.Link, trace_id must be exactly 16 bytes and span_id exactly 8 bytes, as defined in the official OpenTelemetry protobuf schema for the Span.Link message: "A unique identifier of a trace that this linked span is part of. The ID is a 16-byte array." for trace_id and "A unique identifier for the linked span. The ID is an 8-byte array." for span_id [1][2]. This matches the specification for Span trace_id and span_id [3][4][5]. These fields are bytes with fixed semantic lengths; incorrect lengths violate the spec and are treated as invalid [6][7].
Citations:
- 1: https://github.com/open-telemetry/opentelemetry-proto/blob/ebef7c999f4dea62b5b033e92a221411c49c0966/opentelemetry/proto/trace/v1/trace.proto
- 2: https://github.com/open-telemetry/opentelemetry-proto/blob/30d237e1ff3ab7aa50e0922b5bebdd93505090af/opentelemetry/proto/trace/v1/trace.proto
- 3: https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/trace/0059-otlp-trace-data-format.md
- 4: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md
- 5: https://opentelemetry.io/docs/specs/otel/trace/api/
- 6: Collector return "invalid length for ID" for trace reporting open-telemetry/opentelemetry-collector#3195
- 7: Collector cannot forward to another collector using the otlphttp exporter open-telemetry/opentelemetry-collector#4221
🏁 Script executed:
cat -n engine/src/workers/observability/otel.rs | sed -n '1920,1940p'Repository: iii-hq/iii
Length of output: 1121
🏁 Script executed:
rg "OTLP_TRACE_ID_BYTES|OTLP_SPAN_ID_BYTES" engine/src/workers/observability/otel.rsRepository: iii-hq/iii
Length of output: 281
🏁 Script executed:
rg "fn otlp_span_to_proto" engine/src/workers/observability/otel.rs -A 20Repository: iii-hq/iii
Length of output: 1053
🏁 Script executed:
rg "\.len\(\)" engine/src/workers/observability/otel.rs | head -20Repository: iii-hq/iii
Length of output: 924
Add link ID length validation to prevent malformed IDs from poisoning batch exports.
The otlp_link_to_proto function accepts hex-decoded IDs of any length via unwrap_or_default(). Short IDs like hex::decode("01") produce 1-byte values that violate the OTLP spec (trace_id must be 16 bytes, span_id must be 8 bytes) and trigger collector-side rejection of the entire export. The parent function otlp_span_to_proto already validates these lengths correctly; otlp_link_to_proto must do the same.
Proposed fix
fn otlp_link_to_proto(link: &OtlpSpanLink) -> opentelemetry_proto::tonic::trace::v1::span::Link {
// Links to malformed-ID remote spans fall back to empty bytes here
// (rather than dropping the link) — the parent span itself was already
// validated by `otlp_span_to_proto`, so a bad link is a recoverable
// metadata loss, not a wholesale span-loss event.
+ let trace_id = hex::decode(&link.trace_id)
+ .ok()
+ .filter(|b| b.len() == OTLP_TRACE_ID_BYTES)
+ .unwrap_or_default();
+ let span_id = hex::decode(&link.span_id)
+ .ok()
+ .filter(|b| b.len() == OTLP_SPAN_ID_BYTES)
+ .unwrap_or_default();
+
opentelemetry_proto::tonic::trace::v1::span::Link {
- trace_id: hex::decode(&link.trace_id).unwrap_or_default(),
- span_id: hex::decode(&link.span_id).unwrap_or_default(),
+ trace_id,
+ span_id,
trace_state: link.trace_state.clone().unwrap_or_default(),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@engine/src/workers/observability/otel.rs` around lines 1926 - 1934, The
otlp_link_to_proto function currently uses hex::decode(...).unwrap_or_default()
which can produce short byte vectors (e.g., 1 byte) and cause OTLP collectors to
reject the whole batch; change the trace_id and span_id handling so you decode
the hex string, check the decoded length (trace_id must be exactly 16 bytes,
span_id must be exactly 8 bytes), and if the length is invalid replace with an
empty Vec<u8> (preserving the previous "recoverable metadata loss" behavior)
instead of using unwrap_or_default; update the trace_id/span_id expressions in
otlp_link_to_proto to perform this decode+length-check logic (and keep
trace_state as link.trace_state.clone().unwrap_or_default()).
| async fn forwarder_injects_otel_exporter_otlp_headers_into_export_metadata() { | ||
| // Order is load-bearing: the env var MUST be set BEFORE | ||
| // init_sdk_span_forwarder is called, because the parsing happens at | ||
| // forwarder-init time (matching the upstream `opentelemetry-otlp` | ||
| // builder semantics). Setting it post-init would not retroactively | ||
| // populate `SdkForwarderState.headers`. | ||
| // | ||
| // # Safety: this integration test binary is single-threaded with | ||
| // respect to env-var mutation — only one test exists in the file, | ||
| // and `SDK_SPAN_FORWARDER`'s `OnceLock` ensures we only init once. | ||
| unsafe { | ||
| std::env::set_var( | ||
| "OTEL_EXPORTER_OTLP_HEADERS", | ||
| "authorization=Bearer-test-token-12345,x-tenant=acme", | ||
| ); | ||
| } | ||
|
|
||
| let (endpoint, received_metadata) = spawn_mock_collector().await; | ||
|
|
||
| iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint); | ||
|
|
||
| iii::workers::observability::otel::ingest_otlp_json(MINIMAL_PAYLOAD) | ||
| .await | ||
| .expect("ingest_otlp_json should accept a well-formed payload"); | ||
|
|
||
| let deadline = std::time::Instant::now() + Duration::from_secs(5); | ||
| while std::time::Instant::now() < deadline { | ||
| if !received_metadata.lock().await.is_empty() { | ||
| break; | ||
| } | ||
| tokio::time::sleep(Duration::from_millis(20)).await; | ||
| } | ||
|
|
||
| let captured = received_metadata.lock().await; | ||
| assert_eq!( | ||
| captured.len(), | ||
| 1, | ||
| "Mock collector should have received exactly one ExportTraceServiceRequest" | ||
| ); | ||
|
|
||
| let metadata = &captured[0]; | ||
|
|
||
| // The authorization header from OTEL_EXPORTER_OTLP_HEADERS must | ||
| // land on the wire as a tonic Ascii metadata entry. Without this | ||
| // the upstream-spec env var is silently ignored and every API-keyed | ||
| // collector rejects forwarded spans at auth. | ||
| let auth = metadata | ||
| .get("authorization") | ||
| .expect("authorization metadata MUST be set from OTEL_EXPORTER_OTLP_HEADERS"); | ||
| assert_eq!( | ||
| auth.to_str().unwrap(), | ||
| "Bearer-test-token-12345", | ||
| "authorization metadata value did not match OTEL_EXPORTER_OTLP_HEADERS" | ||
| ); | ||
|
|
||
| // Second pair confirms the comma-split parser handles multiple | ||
| // entries (single-pair could pass with a naive `=`-only split). | ||
| let tenant = metadata | ||
| .get("x-tenant") | ||
| .expect("x-tenant metadata MUST be set; comma-split parser regression?"); | ||
| assert_eq!( | ||
| tenant.to_str().unwrap(), | ||
| "acme", | ||
| "x-tenant metadata value did not match OTEL_EXPORTER_OTLP_HEADERS" | ||
| ); | ||
|
|
||
| // Cleanup so a future test binary inheriting this process's env (if | ||
| // any) doesn't leak the test credential. # Safety: same single-test | ||
| // file rationale as the set_var above. | ||
| unsafe { | ||
| std::env::remove_var("OTEL_EXPORTER_OTLP_HEADERS"); | ||
| } |
There was a problem hiding this comment.
Test isolation is incomplete for header precedence.
Because OTEL_EXPORTER_OTLP_TRACES_HEADERS overrides OTEL_EXPORTER_OTLP_HEADERS, this test can flake if the process inherits *_TRACES_HEADERS. Clear (and restore) both vars around init to make the assertion deterministic.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@engine/tests/otel_forwarder_headers_test.rs` around lines 99 - 170, The test
forwarder_injects_otel_exporter_otlp_headers_into_export_metadata must
explicitly guard against existing OTEL_EXPORTER_OTLP_TRACES_HEADERS precedence:
capture the current values of OTEL_EXPORTER_OTLP_HEADERS and
OTEL_EXPORTER_OTLP_TRACES_HEADERS, clear or set both appropriately before
calling iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint)
(ensuring the desired OTEL_EXPORTER_OTLP_HEADERS is set and
OTEL_EXPORTER_OTLP_TRACES_HEADERS is cleared), and then restore the original
values at the end of the test; this guarantees deterministic header precedence
for the assertions without changing init_sdk_span_forwarder itself.
|
This PR touches engine code. For us to accept it, we need your explicit confirmation in this thread that you license your changes to the engine portion of the codebase under Apache 2.0. Please reply with: |
Summary
Fixes #1617 — OTLP span forwarding drops
resource_spans[].resource, so SDK spans reach the collector withservice.name=<nil>even whenOTEL_SERVICE_NAMEis set at the SDK.Why
SpanDatacarries no resource — in the OTel SDK contract, the owningTracerProvidersupplies one before the exporter serialises.SDK_SPAN_FORWARDERwas a bareopentelemetry_otlp::SpanExporterwith noTracerProviderwrapping it, soconvert_otlp_to_span_data → SpanExporter::exportwrote an empty resource onto the wire. Collectors rendered the missing service identity as<nil>(SigNoz/Grafana/etc).@alsruf36's diagnosis on the issue thread was exactly right; the fix shape follows their suggestion.
What changed
engine/src/workers/observability/otel.rsSDK_SPAN_FORWARDER:OnceLock<Arc<SpanExporter>>→OnceLock<tonic::Channel>. Cheap to clone (Arc internally), lazy connect, no init-time blocking.init_sdk_span_forwarder: rewritten to build atonic::Endpoint+connect_lazy(). Forhttps://endpoints, appliesClientTlsConfig::new().with_native_roots()so the previous deployment contract (handled implicitly byopentelemetry-otlp'stls-rootsfeature) is preserved.ingest_otlp_json(lines 1607-1635): translates the parsedOtlpExportTraceServiceRequestintoopentelemetry_proto::tonic::collector::trace::v1::ExportTraceServiceRequestvia field-for-field helpers (otlp_json_request_to_protoand friends, lines 1640-1834), then sends via a freshly-clonedTraceServiceClient::new(channel.clone()). Resource block survives byte-for-byte.convert_otlp_to_span_dataandotlp_kv_to_key_value(truly dead post-replacement; in-memory storage path does its own per-field conversion) and their unit tests.test_ingest_otlp_json_skips_spans_with_invalid_identifiers— malformed-span filtering was a side effect of the old SpanData path, not a contract. New forwarder relays and lets the collector validate. Memory-storage behaviour is unchanged.engine/Cargo.tomlopentelemetry-proto = "0.31"(features["gen-tonic"]),tonic = "0.14"(features["transport", "tls-native-roots"]),tokio-stream = "0.1"to direct deps. All three were already transitively present viaopentelemetry-otlp'sgrpc-tonicfeature — promotion adds nothing new to the dep tree.engine/tests/otel_forwarder_resource_test.rs(new)SDK_SPAN_FORWARDEROnceLockis fresh.service.name=my-service,service.namespace=production,service.version=1.4.2survive the hop (three resource attributes locked in, not just one).replica.count=7(intValue) andtelemetry.enabled=true(boolValue) — covers the proto-conversion's non-string AnyValue branches.flags & 0x01 == 0x01for an inbound payload that omitsflags— pins the SAMPLED default that mirrors the storage path atotel.rs:1521-1524.Behaviour change worth noting
The new forwarder relays spans with malformed trace/span IDs (the old SpanData path pre-filtered them via
TraceId::from_hex/SpanId::from_hexbailout). Downstream collectors validate and reject — engine-side pre-filtering hid that diagnostic signal. The in-memory storage path's filtering is unchanged. Called out in the source comment atengine/src/workers/observability/otel.rs:1607-1622.Test plan
cargo test -p iii --test otel_forwarder_resource_test— 1 passed (with extended assertions for string + int + bool resource attrs + flags=SAMPLED default)cargo test -p iii --lib— 1542 passed, 0 failed, 6 ignored (KVM-gated)cargo test -p iii --lib workers::observability— 289 passedcargo test -p iii --lib test_ingest_otlp_json— 10 passed (including the trimmedtest_ingest_otlp_json_skips_spans_with_invalid_identifiers)cargo fmt --check— cleancargo clippy -p iii --lib --tests— no new warnings on touched filesOTEL_SERVICE_NAME=my-service, confirm UI shows the service name (rather than<nil>)https://collector with valid certs, confirm TLS handshake succeeds (verified locally against a self-signed loopback; full prod-collector cert validation deferred to CI / reviewer)Credits
@alsruf36 — diagnosis + suggested fix shape on the issue thread.
Closes #1617.
Summary by CodeRabbit
Bug Fixes
Improvements
Tests
Chores