fix(observability): preserve OTLP resource attributes on forwarder hop by deep-name · Pull Request #1618 · iii-hq/iii

deep-name · 2026-05-11T04:10:33Z

Summary

Fixes #1617 — OTLP span forwarding drops resource_spans[].resource, so SDK spans reach the collector with service.name=<nil> even when OTEL_SERVICE_NAME is set at the SDK.

Why

SpanData carries no resource — in the OTel SDK contract, the owning TracerProvider supplies one before the exporter serialises. SDK_SPAN_FORWARDER was a bare opentelemetry_otlp::SpanExporter with no TracerProvider wrapping it, so convert_otlp_to_span_data → SpanExporter::export wrote an empty resource onto the wire. Collectors rendered the missing service identity as <nil> (SigNoz/Grafana/etc).

@alsruf36's diagnosis on the issue thread was exactly right; the fix shape follows their suggestion.

What changed

engine/src/workers/observability/otel.rs

SDK_SPAN_FORWARDER: OnceLock<Arc<SpanExporter>> → OnceLock<tonic::Channel>. Cheap to clone (Arc internally), lazy connect, no init-time blocking.
init_sdk_span_forwarder: rewritten to build a tonic::Endpoint + connect_lazy(). For https:// endpoints, applies ClientTlsConfig::new().with_native_roots() so the previous deployment contract (handled implicitly by opentelemetry-otlp's tls-roots feature) is preserved.
Forwarder branch in ingest_otlp_json (lines 1607-1635): translates the parsed OtlpExportTraceServiceRequest into opentelemetry_proto::tonic::collector::trace::v1::ExportTraceServiceRequest via field-for-field helpers (otlp_json_request_to_proto and friends, lines 1640-1834), then sends via a freshly-cloned TraceServiceClient::new(channel.clone()). Resource block survives byte-for-byte.
Deleted convert_otlp_to_span_data and otlp_kv_to_key_value (truly dead post-replacement; in-memory storage path does its own per-field conversion) and their unit tests.
Trimmed one assertion in test_ingest_otlp_json_skips_spans_with_invalid_identifiers — malformed-span filtering was a side effect of the old SpanData path, not a contract. New forwarder relays and lets the collector validate. Memory-storage behaviour is unchanged.

engine/Cargo.toml

Promote opentelemetry-proto = "0.31" (features ["gen-tonic"]), tonic = "0.14" (features ["transport", "tls-native-roots"]), tokio-stream = "0.1" to direct deps. All three were already transitively present via opentelemetry-otlp's grpc-tonic feature — promotion adds nothing new to the dep tree.

engine/tests/otel_forwarder_resource_test.rs (new)

Integration test against a tonic gRPC mock collector. Owns its own process so the global SDK_SPAN_FORWARDER OnceLock is fresh.
Asserts service.name=my-service, service.namespace=production, service.version=1.4.2 survive the hop (three resource attributes locked in, not just one).
Asserts replica.count=7 (intValue) and telemetry.enabled=true (boolValue) — covers the proto-conversion's non-string AnyValue branches.
Asserts flags & 0x01 == 0x01 for an inbound payload that omits flags — pins the SAMPLED default that mirrors the storage path at otel.rs:1521-1524.

Behaviour change worth noting

The new forwarder relays spans with malformed trace/span IDs (the old SpanData path pre-filtered them via TraceId::from_hex/SpanId::from_hex bailout). Downstream collectors validate and reject — engine-side pre-filtering hid that diagnostic signal. The in-memory storage path's filtering is unchanged. Called out in the source comment at engine/src/workers/observability/otel.rs:1607-1622.

Test plan

cargo test -p iii --test otel_forwarder_resource_test — 1 passed (with extended assertions for string + int + bool resource attrs + flags=SAMPLED default)
cargo test -p iii --lib — 1542 passed, 0 failed, 6 ignored (KVM-gated)
cargo test -p iii --lib workers::observability — 289 passed
cargo test -p iii --lib test_ingest_otlp_json — 10 passed (including the trimmed test_ingest_otlp_json_skips_spans_with_invalid_identifiers)
cargo fmt --check — clean
cargo clippy -p iii --lib --tests — no new warnings on touched files
Manual: run against a real collector with OTEL_SERVICE_NAME=my-service, confirm UI shows the service name (rather than <nil>)
Manual: run against an https:// collector with valid certs, confirm TLS handshake succeeds (verified locally against a self-signed loopback; full prod-collector cert validation deferred to CI / reviewer)

Credits

@alsruf36 — diagnosis + suggested fix shape on the issue thread.

Closes #1617.

Summary by CodeRabbit

Bug Fixes
- Preserve full incoming OTLP resource attributes when forwarding traces to collectors.
Improvements
- Forward OTLP traces as native proto over gRPC with injected OTLP headers.
- Add native TLS handling for HTTPS collector endpoints.
- Drop spans with malformed trace/span IDs from forwarded batches while retaining in-memory ingestion.
Tests
- Added integration tests verifying resource preservation, header propagation, and partial-batch behavior.
Chores
- Updated build/dev dependencies for OTLP and async streaming.

OTLP span forwarder drops `resource_spans[].resource` on the engine → collector hop; SigNoz/Grafana/etc render `service.name=<nil>` even though `OTEL_SERVICE_NAME` is set at the SDK. Adds a tonic gRPC mock collector that captures every inbound `ExportTraceServiceRequest` and asserts the captured resource carries all three inbound attributes (`service.name`, `service.namespace`, `service.version`) — full-resource preservation rather than a single-field band-aid. Runs in its own integration-test binary so the process-global `SDK_SPAN_FORWARDER` `OnceLock` is fresh, with a 5s poll deadline + 20ms tick so the happy path stays fast and the failure path stays bounded. FAILS against the unfixed engine with `assertion: left: None, right: Some("my-service")`. Locks in the regression contract before the fix lands. Closes prerequisite for iii-hq#1617. Ridden with [Loa](https://github.com/0xHoneyJar/loa)

@alsruf36

Replaces the bare `opentelemetry_otlp::SpanExporter` forwarder with a `tonic::Channel` to a `TraceServiceClient`, and bypasses `convert_otlp_to_span_data` on the forwarder branch by translating the parsed `OtlpExportTraceServiceRequest` directly into the proto `ExportTraceServiceRequest`. The full `resource_spans[].resource` block survives the engine → collector hop byte-for-byte, including `service.name`, `service.namespace`, `service.version`, and any other attributes the inbound SDK sent. Root cause: `SpanData` carries no resource — the OTel SDK contract is that the owning `TracerProvider` supplies one before the exporter serialises. `SDK_SPAN_FORWARDER` was a bare exporter (no provider), so the export call wrote an empty resource onto the wire. Collectors rendered the missing service identity as `<nil>`. Changes: - `SDK_SPAN_FORWARDER`: `OnceLock<Arc<SpanExporter>>` → `OnceLock<Channel>`. Channel clones cheaply, lazy connect matches the prior exporter-builder semantics, no init-time blocking. - `init_sdk_span_forwarder`: rewritten to build an `Endpoint` and call `connect_lazy()`. Visibility bumped to `pub` so the new integration test can point the forwarder at a mock collector. - Forwarder branch in `ingest_otlp_json`: translates the parsed JSON request to the proto type via `otlp_json_request_to_proto` and a family of field-for-field helpers, then sends through a freshly-cloned `TraceServiceClient`. - Deleted `convert_otlp_to_span_data` and `otlp_kv_to_key_value` (truly dead post-replacement; in-memory storage path does its own per-field conversion) plus their unit tests. - Trimmed one assertion in `test_ingest_otlp_json_skips_spans_with_invalid_identifiers` — malformed-span filtering was a side effect of the SpanData-based path. The new path relays everything and lets the collector validate. Memory-storage behaviour is unchanged (the actual contract under test). - `engine/Cargo.toml`: promote `opentelemetry-proto = "0.31"`, `tonic = "0.14"`, `tokio-stream = "0.1"` to direct deps. All three were transitively present via `opentelemetry-otlp`'s `grpc-tonic` feature — promotion is the only change. Sweep: 1542 lib tests passed, 0 failed; new forwarder integration test passes; cargo fmt + clippy clean on touched files. Credits: @alsruf36 for the diagnosis and the suggested fix shape in iii-hq#1617. Closes iii-hq#1617. Ridden with [Loa](https://github.com/0xHoneyJar/loa)

…Value coverage Round-2 fixes for sprint-bug-1 review: C1/DISS-001 — TLS regression for `https://` collectors. The new raw-tonic channel did not configure TLS; the pre-fix `opentelemetry_otlp::SpanExporter` builder auto-configured it via its `tls-roots` feature. Production deployments behind TLS-fronted collectors (Grafana Cloud, SigNoz Cloud, Honeycomb, the standard prod shape) would silently fail at handshake time. - `engine/Cargo.toml`: `tonic` features extended with `tls-native-roots`. - `engine/src/workers/observability/otel.rs:696-744`: `init_sdk_span_forwarder` now branches on `endpoint.starts_with("https://")` and applies `ClientTlsConfig::new().with_native_roots()` to the Endpoint before `connect_lazy()`. TLS-config failure falls back to plaintext with a logged warning so HTTP-only deployments are unaffected. C2/DISS-002 — trace flags default mismatch. The forwarder defaulted missing `flags` to 0 (unsampled) while the in-memory storage path at `otel.rs:1521-1524` defaults to `TraceFlags::SAMPLED`. Same inbound span was being treated as sampled by storage but unsampled by the forwarder. - `engine/src/workers/observability/otel.rs:1747`: flipped `unwrap_or(0)` → `unwrap_or(1)` (W3C SAMPLED bit) with a parity comment referencing the storage path. N1 — AnyValue conversion test coverage. The round-1 test only exercised the `StringValue` branch. - `engine/tests/otel_forwarder_resource_test.rs`: payload now includes `replica.count` (intValue) and `telemetry.enabled` (boolValue); assertions extended via `find_int_attr` and `find_bool_attr`. Flags-default assertion added for the SAMPLED parity check. N2 — source-level call-out for the deliberate malformed-span forwarding behaviour change. - `engine/src/workers/observability/otel.rs:1607-1622` extended with the rationale in the forwarder-branch comment. Round-2 adversarial cross-model review (gpt-5.3-codex): `findings: []` / status `clean`. Sweep: 1542 lib tests pass, 1 extended forwarder integration test passes, cargo fmt + clippy clean on touched files. Closes review-loop iteration 1 for sprint-bug-1. Sprint approved. Ridden with [Loa](https://github.com/0xHoneyJar/loa)

vercel · 2026-05-11T04:10:39Z

@deep-name is attempting to deploy a commit to the motia Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-05-11T04:10:42Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0cecfc5d-119a-4f97-a304-c2b4c74ff130

📥 Commits

Reviewing files that changed from the base of the PR and between a3c37fb and 380d684.

📒 Files selected for processing (5)

engine/tests/common/forwarder_mock.rs
engine/tests/common/mod.rs
engine/tests/otel_forwarder_headers_test.rs
engine/tests/otel_forwarder_partial_batch_test.rs
engine/tests/otel_forwarder_resource_test.rs

✅ Files skipped from review due to trivial changes (1)

engine/tests/common/mod.rs

📝 Walkthrough

Walkthrough

Replaces the SDK SpanExporter forwarding path with direct OTLP proto gRPC forwarding: adds proto types and converters, exposes a TLS-aware global forwarder init, converts parsed OTLP JSON to ExportTraceServiceRequest, forwards with tonic::Channel and injected OTLP headers, and adds integration tests for resource attributes, headers, and partial-batch behavior.

Changes

OTLP Span Forwarding: Direct Proto Relay

Layer / File(s)	Summary
Add dependencies for gRPC forwarding `engine/Cargo.toml`	Adds `opentelemetry-proto` (gen-tonic) and `tokio-stream` (dev).
Import OTLP proto client and transport types `engine/src/workers/observability/otel.rs`	Adds OTLP `TraceServiceClient`, `ExportTraceServiceRequest`, and tonic transport/TLS imports used by the forwarder.
Global forwarder state & public init with TLS handling `engine/src/workers/observability/otel.rs`	Replace SDK exporter global with forwarder state (lazy tonic::Channel + parsed OTLP headers); make `init_sdk_span_forwarder` public and add TLS-native-roots handling for `https://` endpoints.
Remove old JSON→SpanData conversion and tests `engine/src/workers/observability/otel.rs`	Delete previous JSON→KeyValue→SpanData converters and their unit tests that dropped resource data.
Add JSON→OTLP proto conversion and ID rules `engine/src/workers/observability/otel.rs`	Implement JSON→proto translators (resource/scope/span/events/links/status/attributes/AnyValue) and tolerant hex handling where malformed/wrong-length trace_id/span_id cause spans to be omitted from exported proto.
Update ingest_otlp_json to use proto forwarding `engine/src/workers/observability/otel.rs`	Convert parsed OTLP JSON to `ExportTraceServiceRequest`, attach parsed OTLP headers as gRPC metadata, and call `TraceServiceClient::export` while preserving in-memory ingestion.
Adjust regression test expectations `engine/src/workers/observability/otel.rs`	Update comments/assertions to reflect forwarding conversion/filtering changes while in-memory retention remains unchanged.
Mock OTLP collector implementations `engine/tests/common/forwarder_mock.rs`, `engine/tests/common/mod.rs`	Add body-capturing and metadata-capturing Tonic TraceService mocks, readiness probe, and export the helper module.
Header propagation regression test `engine/tests/otel_forwarder_headers_test.rs`	Add test that sets `OTEL_EXPORTER_OTLP_HEADERS`, initializes forwarder to mock collector, ingests payload, and asserts `authorization` and `x-tenant` metadata were injected.
Partial-batch forwarding regression test `engine/tests/otel_forwarder_partial_batch_test.rs`	Add test sending mixed-validity spans to ensure malformed IDs are dropped and valid spans are forwarded with correct ID lengths and preserved decoded IDs.
Resource-attribute preservation regression test `engine/tests/otel_forwarder_resource_test.rs`	Add payload builder, attribute extractors, and test asserting forwarded `resourceSpans[].resource.attributes` are preserved with correct decoded values and attribute count; also asserts sampled-default span flags and preserved span name.

Sequence Diagram

sequenceDiagram
  participant Ingest as ingest_otlp_json
  participant ProtoConv as otlp_json_request_to_proto
  participant Forwarder as TraceServiceClient
  participant Collector as OTLP Collector

  Ingest->>Ingest: parse OTLP JSON and merge into in-memory storage
  alt Forwarder initialized
    Ingest->>ProtoConv: convert parsed JSON -> ExportTraceServiceRequest
    ProtoConv->>ProtoConv: preserve resourceSpans[].resource attributes
    ProtoConv-->>Forwarder: proto ExportTraceServiceRequest
    Forwarder->>Collector: gRPC Export call (with injected OTLP headers)
    Collector-->>Forwarder: Export response
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

iii-hq/iii#163: Modifies the engine OTLP span forwarding path and forwarder initialization similarly.

Suggested reviewers

ytallo
andersonleal
sergiofilhowz

Poem

🐰 I hop where spans and resources meet,
Proto paths now keep attributes sweet,
Headers travel on the wire,
Bad IDs fall—good ones aspire,
Forwarded traces sing complete.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(observability): preserve OTLP resource attributes on forwarder hop' clearly and concisely describes the main change—preserving resource attributes during forwarding.
Description check	✅ Passed	The PR description comprehensively covers what changed, why, and includes detailed test results and behavior notes, fulfilling all key sections of the template.
Linked Issues check	✅ Passed	The PR fully implements the core objective from `#1617`: preserving OTLP resource attributes during forwarding, with additional implementations for header propagation and partial-batch handling.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with preserving resource attributes and supporting the new tonic-based forwarding path; no unrelated modifications detected.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

deep-name

Summary

Analytical review of #1618. Enrichment pass was unavailable; findings are unenriched.

Findings

{
  "schema_version": 1,
  "findings": [
    {
      "id": "F1",
      "title": "Heavy new dependencies (tonic, opentelemetry-proto, tokio-stream) added at top-level Cargo.toml for test-only use",
      "severity": "MEDIUM",
      "category": "dependencies",
      "file": "engine/Cargo.toml",
      "description": "The PR adds `opentelemetry-proto`, `tonic` (with `transport` and `tls-native-roots`), and `tokio-stream` to the main `[dependencies]` section. From the visible diff, these are only consumed by an integration test (`engine/tests/otel_forwarder_resource_test.rs`). Putting them in `[dependencies]` instead of `[dev-dependencies]` enlarges the production binary closure (tonic with TLS pulls in rustls/native-roots, prost, hyper) and lengthens compile times for every consumer of the engine crate. If the forwarder source itself doesn't actually import these (the otel.rs section is truncated so this can't be fully verified), they should be `[dev-dependencies]`.",
      "suggestion": "Confirm whether the forwarder implementation in `engine/src/workers/observability/otel.rs` directly uses `opentelemetry-proto`, `tonic`, or `tokio-stream`. If not (i.e. only the test does), move all three to `[dev-dependencies]`. If `opentelemetry-proto` is genuinely needed in production for proto conversion, keep just that one in `[dependencies]` and move `tonic`/`tokio-stream` to dev.",
      "confidence": 0.6
    },
    {
      "id": "F2",
      "title": "Process-global OnceLock forwarder state makes integration test order-dependent",
      "severity": "MEDIUM",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The header comment states the test relies on `SDK_SPAN_FORWARDER` being a process-global `OnceLock`, fresh per integration test file. This is true today but fragile: if a future test file in the same `tests/` directory also calls `init_sdk_span_forwarder`, or if cargo test harness changes (e.g., a future shared test binary, or someone adds a `#[test]` to the same file pointing at a different endpoint), the OnceLock will silently bind the wrong endpoint and the second test will export to the wrong collector. There is no documented protection nor a test-only reset hook.",
      "suggestion": "Either (a) add a `#[cfg(test)] pub fn reset_sdk_span_forwarder_for_tests()` that swaps the OnceLock contents, or (b) refactor the forwarder to accept an injected endpoint/exporter so tests don't depend on process-global state, or (c) at minimum add a code comment on `SDK_SPAN_FORWARDER` itself warning future integration tests not to call `init_sdk_span_forwarder` in the same binary.",
      "confidence": 0.7
    },
    {
      "id": "F3",
      "title": "50ms sleep for collector startup is a flake vector",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "`spawn_mock_collector` sleeps 50ms unconditionally before returning, with a comment acknowledging this exists to avoid races with tonic's accept loop. Loaded CI runners (especially under heavy parallel cargo test) can easily exceed 50ms before the spawned task is polled. The downstream poll loop (5s deadline) does provide a fallback safety net for the actual export, but the initial race could still cause connection-refused on the first export attempt and rely on tonic's internal retry — and that retry is not guaranteed across tonic versions.",
      "suggestion": "Replace the sleep with an active readiness check: in a small loop, attempt to `TcpStream::connect(addr)` until success or a bounded timeout (~1s). Alternatively, bind the listener synchronously *before* the spawn and only spawn the `serve_with_incoming` future — `TcpListener::bind` already guarantees the port is listening, so the race is solely about the tonic accept-loop being polled. A `tokio::task::yield_now().await` followed by a connect-probe loop would be deterministic.",
      "confidence": 0.6
    },
    {
      "id": "F4",
      "title": "Mock collector server task swallows panics on `unwrap()`",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The spawned `Server::builder()...serve_with_incoming(...).await.unwrap()` runs in a detached `tokio::spawn`. If the server fails to start or errors out, the `.unwrap()` will panic inside the spawned task, but the test harness will see the export simply never arrive and surface as `captured.len() == 0` with a misleading message. The task handle is not awaited or stored.",
      "suggestion": "Either retain the JoinHandle and check it on test failure, or replace `.unwrap()` with `.expect(\"mock collector server failed to start\")` plus a `tracing::error!` — at minimum so the panic surfaces in test output rather than silently disappearing into a detached task.",
      "confidence": 0.5
    },
    {
      "id": "F5",
      "title": "Test asserts SAMPLED default but contract is undocumented in PR scope",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The final assertion (`span.flags & 0x01 == 0x01`) pins the forwarder to default missing trace-flags to SAMPLED, with rationale that it 'mirrors the in-memory storage path'. This is a meaningful behavioural contract, but it is being introduced alongside a fix titled 'preserve OTLP resource attributes' — the trace-flags default is a separate concern. If this default is new behaviour added in this PR, it deserves explicit call-out in the PR description; if it's pre-existing behaviour now being locked in, that's fine but worth noting. The OTLP spec leaves trace flags semantics to the sender; a forwarder defaulting unsampled→SAMPLED could subtly amplify sampled trace volume downstream.",
      "suggestion": "Confirm in the PR description whether the SAMPLED default is (a) pre-existing and being regression-locked, or (b) new in this PR. If (b), consider whether defaulting to whatever the inbound payload says (0 when absent) is more correct per OTLP semantics, since collectors typically interpret missing flags as 'unspecified' rather than 'unsampled'.",
      "confidence": 0.5
    },
    {
      "id": "F6",
      "title": "Test only asserts forwarded payload, not absence of duplication or modification",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The test asserts `captured.len() == 1` and that 5 attributes are correctly preserved. It does not assert that *no extra* attributes were injected by the forwarder (e.g. a hop-level `forwarder.id` attribute), nor that `scope_spans`/spans are passed through unmodified beyond the flags check. A forwarder bug that *adds* attributes or rewrites span names would pass this test.",
      "suggestion": "Consider asserting `resource.attributes.len() == 5` and `req.resource_spans[0].scope_spans[0].spans[0].name == \"forwarder.preserves.resource\"` to lock in pass-through fidelity, not just preservation of the named subset.",
      "confidence": 0.55
    },
    {
      "id": "F7",
      "title": "Net deletion of ~374 lines in otel.rs alongside 'fix' is non-obvious",
      "severity": "SPECULATION",
      "category": "review-visibility",
      "file": "engine/src/workers/observability/otel.rs",
      "description": "The truncated diff shows `engine/src/workers/observability/otel.rs` with +323 -697, a net deletion of ~374 lines, in a PR framed as a bug fix for resource-attribute preservation. A fix of this magnitude typically suggests a refactor or rewrite of the forwarder path rather than a surgical fix. Without seeing the source diff, I cannot evaluate whether the deletion removes dead code, consolidates a parser, replaces a hand-rolled protobuf encoder with `opentelemetry-proto`, or removes functionality. If it removes the hand-rolled OTLP/JSON→proto translation in favour of using `opentelemetry-proto` types directly, that's likely a good change but is much larger than the PR title implies.",
      "suggestion": "Request the full diff or a summary in the PR description explaining the structural change: is this replacing a custom proto encoder with `opentelemetry-proto::tonic` types? Are any code paths (e.g. backpressure, retry, batching) being removed in the process? Reviewers should not approve a -374-line change on the strength of one regression test without seeing what was removed.",
      "confidence": 0.7
    },
    {
      "id": "F8",
      "title": "Regression test design with full AnyValue branch coverage and clear failure messages",
      "severity": "PRAISE",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The test deliberately exercises three `AnyValue` discriminator branches (stringValue, intValue, boolValue), each with an assertion message that explains the bug class being guarded against rather than just the expected value. The header comment also explicitly states why the test is in a separate integration file (process-global OnceLock isolation) and why the assertion contract is 'full resource preservation, not just service.name' — preventing a future tempting single-field band-aid from passing this test. This is the right shape for a regression test tied to a specific incident.",
      "suggestion": "No action needed."
    }
  ]
}

Callouts

Enrichment unavailable for this review.

Local pre-emptive Bridgebuilder pass (claude-opus-4-7 via persona at grimoires/bridgebuilder/BEAUVOIR.md) surfaced 6 net new findings beyond the round-1 review-sprint adversarial dissent. Round-3 fixes: F1 (MEDIUM, dep placement) — `tokio-stream` is only used by the test file's `TcpListenerStream::new(listener)`. Moved from `[dependencies]` to `[dev-dependencies]`. `opentelemetry-proto` + `tonic` stay in `[dependencies]` (they're used by production code in `engine/src/workers/observability/otel.rs` for proto types, `TraceServiceClient`, `Channel`, `ClientTlsConfig`). F2 (MEDIUM, OnceLock test fragility) — `SDK_SPAN_FORWARDER` is a process-global `OnceLock`; a second test in the same integration binary calling `init_sdk_span_forwarder` would silently no-op against the first test's endpoint. Added a doc-comment on the static explicitly warning future contributors to place new tests in their own `engine/tests/*.rs` file and pointing at the `#[cfg(test)] pub fn reset_for_test()` escape hatch if parameterised endpoints are needed. F3 (LOW, 50ms sleep flake) — replaced the fixed `tokio::time::sleep` in `spawn_mock_collector` with a bounded `TcpStream::connect` probe loop (2s deadline, `yield_now` between attempts). On loaded CI runners the 50ms heuristic could lose the race with tonic's accept- loop polling; the probe is deterministic. F4 (LOW, swallowed panics) — replaced `.unwrap()` on the spawned mock-server task with `.expect("mock OTLP collector failed to start; see tonic transport error above")`. The detached task would silently disappear on bind/serve failure; now the panic message surfaces. F6 (LOW, pass-through fidelity) — added two assertions to the forwarder test: - `resource.attributes.len() == 5` catches a forwarder bug that ADDS attributes (e.g. injecting a hop-level `forwarder.id`). - `span.name == "forwarder.preserves.resource"` catches a forwarder bug that rewrites the span name. Without this, a name-rewrite defect would pass all the existing attribute assertions. Findings F5 (SAMPLED-default semantics) and F7 (-374 net lines) were triaged as already-addressed — F5 is documented in commit `76f0b05e` and the round-1 reviewer.md; F7 is explained in the PR description as the dead-code removal of `convert_otlp_to_span_data`. Bridgebuilder review preserved on PR thread for audit trail; full findings at iii-hq#1618 (review). Sweep: 1542 lib tests pass, 1 forwarder integration test pass with extended assertions (5-attr count + span-name check), cargo fmt clean, clippy clean on touched files. Ridden with [Loa](https://github.com/0xHoneyJar/loa)

deep-name · 2026-05-11T04:35:49Z

local self-review summary before maintainer queue:

ran our own bridgebuilder pass (claude-opus-4-7 via custom reviewer persona) plus two rounds of adversarial cross-model dissent (gpt-5.3-codex for both review + audit phases). 4 model-passes total, 3 distinct concerns surfaced + addressed before this hit your queue:

Round	Tool	Model	Verdict
review-sprint round 1	adversarial-review (review)	gpt-5.3-codex	BLOCKING: DISS-001 (TLS regression for `https://` collectors), DISS-002 (trace-flags default mismatch)
review-sprint round 2	adversarial-review (review)	gpt-5.3-codex	clean — both blockers resolved
audit-sprint	adversarial-review (audit, no reviewer context)	gpt-5.3-codex	clean
local-self-review	bridgebuilder	claude-opus-4-7	F1-F4 + F6 addressed in commit `1bc8af32` (dep placement, OnceLock test isolation, sleep→connect-probe, .unwrap→.expect, pass-through fidelity assertions)

artifacts kept in our state zone for traceability:

grimoires/loa/a2a/sprint-bug-1/engineer-feedback.md
grimoires/loa/a2a/sprint-bug-1/auditor-sprint-feedback.md
grimoires/loa/a2a/sprint-bug-1/adversarial-review.json
grimoires/loa/a2a/sprint-bug-1/adversarial-audit.json
the bridgebuilder review above (8 findings, of which 5 actionable, 1 praise, 2 already-addressed)

ready when you are. happy to drop the bridgebuilder bot comment above if it's noisy — it's preserved here as the audit trail showing the issues were caught locally, not surfaced by your team's review cycle.

coderabbitai

🧹 Nitpick comments (1)

engine/src/workers/observability/otel.rs (1)

1775-1781: ⚡ Quick win

Comment overstates "parity" — in-memory storage doesn't default flags to SAMPLED.

The doc claims the SAMPLED default mirrors "the in-memory storage path's default at the top of ingest_otlp_json", but the storage path stores flags: span.flags unchanged (Option) — missing flags stay None there, not 1. The forwarder-side default to 1 is defensible on its own (the proto field is non-optional u32, and defaulting to 0 would mark all SDK-origin spans as unsampled at downstream collectors), and the integration test pins it. Just trim the parity framing so future readers don't go looking for a matching default in the storage code that doesn't exist.

📝 Suggested wording tweak

-        // Parity with the in-memory storage path's default at the top of
-        // `ingest_otlp_json` — a span arriving without an explicit `flags`
-        // field is treated as SAMPLED (W3C trace-flags bit 0 = 0x01).
-        // Defaulting to 0 here would mark such spans as unsampled on the
-        // forwarder hop while storage records them as sampled, producing
-        // inconsistent behaviour across the two halves of the same call.
+        // A span arriving without an explicit `flags` field is treated
+        // as SAMPLED (W3C trace-flags bit 0 = 0x01). The proto field is
+        // a non-optional u32; defaulting to 0 here would mark every
+        // SDK-origin span without explicit flags as unsampled at
+        // downstream samplers/collectors that honour the SAMPLED bit.
+        // The in-memory storage path keeps `flags` as Option<u32> and
+        // serialises it via `skip_serializing_if`, so the two halves
+        // express the same intent in different shapes.
         flags: span.flags.unwrap_or(1),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@engine/src/workers/observability/otel.rs` around lines 1775 - 1781, Update
the comment next to the flags assignment (flags: span.flags.unwrap_or(1) in
otel.rs) to remove the incorrect claim of parity with the in-memory storage
path; instead note that the forwarder defaults to 1 because the proto field is
non-optional and defaulting to 0 would mark SDK-origin spans unsampled, and
clarify that the storage path (ingest_otlp_json) preserves Option<u32>
(span.flags) and does not default missing flags to 1.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@engine/src/workers/observability/otel.rs`:
- Around line 1775-1781: Update the comment next to the flags assignment (flags:
span.flags.unwrap_or(1) in otel.rs) to remove the incorrect claim of parity with
the in-memory storage path; instead note that the forwarder defaults to 1
because the proto field is non-optional and defaulting to 0 would mark
SDK-origin spans unsampled, and clarify that the storage path (ingest_otlp_json)
preserves Option<u32> (span.flags) and does not default missing flags to 1.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8ccda45d-028d-4d49-9a7b-ed43547dd6a0

📥 Commits

Reviewing files that changed from the base of the PR and between 9daa7bc and 1bc8af3.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

engine/Cargo.toml
engine/src/workers/observability/otel.rs
engine/tests/otel_forwarder_resource_test.rs

deep-name · 2026-05-11T07:18:55Z

status update for the maintainer queue:

@alsruf36 (issue reporter) pulled this branch and validated end-to-end against their SigNoz instance: OTLP span forwarding drops resource attributes, causing service.name to appear as <nil> #1617 (comment). forwarder hop is doing exactly what it says on the tin.
they surfaced a related but distinct bug — engine-originated spans (SdkTracerProvider export path, not the forwarder hop) also show empty service.name. that's out of scope here and filed as engine-originated spans land at the collector with empty service.name on the SdkTracerProvider export path #1619 with their flame-graph evidence + a code-level read on the likely root cause (OtelConfig::default() short-circuits the env-var fallback when config has service_name: "").

#1618 stays scoped to the forwarder. ready for review.

ytallo · 2026-05-12T00:40:39Z

Hey @deep-name, thanks for your contribution one more time. Can you consider add these tests on your PR?

Feature: SDK span forwarder honors OTEL_EXPORTER_OTLP_HEADERS
  As an operator running iii behind an auth-required OTLP collector
    (Grafana Cloud, Honeycomb, SigNoz Cloud, or any API-keyed deployment)
  I want SDK-ingested spans forwarded by the engine to carry the same
    auth metadata as engine-originated spans
  So that worker-forwarded telemetry does not silently get rejected at
    the collector while engine telemetry keeps flowing

  Background:
    Given a mock OTLP gRPC collector listening on an ephemeral loopback port
    And the collector captures the tonic MetadataMap of every Export request

  Scenario: Authorization header from env var lands on the wire
    Given the environment variable OTEL_EXPORTER_OTLP_HEADERS is set to
      "authorization=Bearer-test-token-12345"
    And the SDK span forwarder has been initialized against the mock collector
    When a well-formed OTLP/JSON payload with one valid span is ingested via
      ingest_otlp_json
    Then the mock collector receives exactly one ExportTraceServiceRequest
    And that request carries an "authorization" metadata entry
    And

Feature: One bad span does not poison the rest of the forwarded batch
  As an operator whose SDKs occasionally emit malformed trace/span IDs
    (instrumentation bugs, truncated context propagation, custom emitters)
  I want the forwarder to drop only the malformed spans
  So that valid telemetry in the same payload still reaches the collector
    instead of being rejected wholesale when one span has empty ID bytes

  Background:
    Given a mock OTLP gRPC collector listening on an ephemeral loopback port
    And the collector captures every ExportTraceServiceRequest it receives

  Scenario: Mixed batch of valid and malformed spans is partially forwarded
    Given the SDK span forwarder has been initialized against the mock collector
    And an OTLP/JSON payload contains two spans:
      | name           | trace_id                          | span_id            | validity |
      | bad-trace-id   | not-hex                           | also-not-hex       | invalid  |
      | valid-span     | 0af7651916cd43dd8448eb211c80319c  | b7ad6b7169203331   | valid    |
    When the payload is ingested via ingest_otlp_json
    Then the mock collector receives at least one ExportTraceServiceRequest
    And no span on the wire has an empty trace_id
    And no span on the wire has an empty span_id
    And the captured stream contains exactly one span
    And that span has name "valid-span"
    And that span has trace_id 0af7651916cd43dd8448eb211c80319c
    And that span has span_id b7ad6b7169203331

deep-name · 2026-05-12T14:20:39Z

both scenarios are spot-on @ytallo — these are the next two regression classes the raw-tonic switch introduced beyond TLS. on it.

quick read so you know what's landing:

scenario 1 (OTEL_EXPORTER_OTLP_HEADERS): confirmed regression. opentelemetry_otlp::SpanExporter::builder().with_tonic() parses this env var automatically (per the OTel spec) and injects each key=value pair as a tonic MetadataMap entry on every export. our replacement channel doesn't. fix: parse the env var at init_sdk_span_forwarder time, stash the parsed pairs alongside the channel, attach a tonic interceptor that adds them to each TraceServiceClient::export request.

scenario 2 (one bad span doesn't poison the batch): the audit-sprint feedback called out the "relay malformed spans, let collector validate" tradeoff as deliberate, but you're right that empty-byte IDs on the wire are a regression vs the pre-fix convert_otlp_to_span_data filtering, and "valid spans get rejected with the batch" is a worse outcome than "malformed spans dropped silently". fix: otlp_span_to_proto becomes fallible — Option<ProtoSpan>, returns None when hex::decode fails on trace_id/span_id. caller drops Nones before building the ExportTraceServiceRequest. PR description gets updated to remove the "relays malformed spans" callout since we're restoring the pre-filter.

shipping both scenarios as runnable rust tests against the existing tonic mock + an HTTPS-fronted mock variant for header propagation. should have the commit batch up in the next hour or so.

@ytallo

… spans Addresses both gherkin scenarios from @ytallo's iii-hq#1618 review: Scenario 1 — OTEL_EXPORTER_OTLP_HEADERS regression. The pre-fix `opentelemetry_otlp::SpanExporter::builder().with_tonic()` parsed `OTEL_EXPORTER_OTLP_HEADERS` (and the signal-specific `OTEL_EXPORTER_OTLP_TRACES_HEADERS` override) automatically per the OTel SDK spec, injecting each `key=value` pair as a tonic `MetadataMap` entry on every export. The raw-`tonic::Channel` replacement in earlier commits lost that behaviour silently — every API-keyed collector (Grafana Cloud, Honeycomb, SigNoz Cloud) was about to reject forwarded spans at auth. Changes for scenario 1: - New `SdkForwarderState` struct wrapping the `tonic::Channel` plus the parsed `Vec<(MetadataKey<Ascii>, MetadataValue<Ascii>)>` pairs. Replaces the bare-`Channel` `OnceLock<Channel>` static. - New `parse_otlp_headers_env()` reads `OTEL_EXPORTER_OTLP_TRACES_HEADERS` first (signal-specific takes precedence per OTel spec), falls back to `OTEL_EXPORTER_OTLP_HEADERS`. Comma-split on entries, `=`-split on key/value, invalid pairs warn-and-skip rather than fail init (matches `opentelemetry-otlp`'s tolerance). - Forwarder dispatch builds a `tonic::Request` and attaches every parsed pair to `metadata_mut()` before calling `TraceServiceClient::export(...)`. Scenario 2 — one bad span no longer poisons the batch. The previous `otlp_span_to_proto` was infallible: it called `hex_decode_or_empty` on malformed `trace_id` / `span_id` strings and emitted spans with empty-byte IDs on the wire. Most collectors reject any batch containing such spans wholesale, so a single bad SDK emitter (instrumentation bug, truncated context propagation, custom client) could blackhole otherwise-valid telemetry from the same payload. Changes for scenario 2: - `otlp_span_to_proto` now returns `Option<ProtoSpan>` and returns `None` whenever `trace_id` or `span_id` fail to hex-decode OR decode to wrong byte lengths (trace_id MUST be 16 bytes, span_id MUST be 8 — surfaced by adversarial review DISS-001 in this round as a gap in the original hex-only filter: `hex::decode("01")` succeeds and produces a 1-byte trace_id that collectors still reject). - New `OTLP_TRACE_ID_BYTES`/`OTLP_SPAN_ID_BYTES` consts so the spec numbers have a single point of truth. - `otlp_scope_spans_to_proto` now `filter_map`s over the spans, dropping `None`s so valid spans in the same batch survive. - `parent_span_id` validates the same way but falls back to empty (rather than dropping the whole span) when malformed — the orphaned-parent case is recoverable downstream. - `hex_decode_or_empty` helper deleted (only consumer is link metadata, which inlines `hex::decode(...).unwrap_or_default()` — links to malformed-ID remote spans are recoverable, the link's parent span itself was already validated). - Forwarder-dispatch comment updated to reflect the new drop-malformed-spans contract (was previously "relays everything, collector validates"; now "drops malformed, valid spans survive"). Tests: - `engine/tests/otel_forwarder_headers_test.rs` (new) — sets `OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer-test-token-12345,x-tenant=acme` before init, asserts both metadata entries land on the `ExportTraceServiceRequest`'s tonic `MetadataMap`. Single test per binary to avoid env-var contamination across parallel tests. - `engine/tests/otel_forwarder_partial_batch_test.rs` (new) — ingests a three-span payload (non-hex IDs, hex-but-wrong-length IDs, valid IDs), asserts the collector receives exactly one span on the wire (the valid one) with spec-correct ID lengths (16/8 bytes). - Existing `engine/tests/otel_forwarder_resource_test.rs` unchanged and still green; the spec-length filter doesn't disturb its happy-path payload. Sweep: 1542 lib tests passed, 3 forwarder integration tests passed (resource preservation + headers env var + partial-batch filtering), cargo fmt + clippy clean on touched files. Adversarial cross-model re-review (gpt-5.3-codex) returned `findings: []` / status `clean` after the spec-length tightening. Credits: @ytallo for the gherkin scenarios surfacing both regressions on iii-hq#1618 before they reached production. @alsruf36 for the end-to-end validation earlier in the review cycle that anchored the contract this PR is pinning. Ridden with [Loa](https://github.com/0xHoneyJar/loa)

deep-name

Summary

Analytical review of #1618. Enrichment pass was unavailable; findings are unenriched.

Findings

{
  "schema_version": 1,
  "findings": [
    {
      "id": "F1",
      "title": "Integration tests rely on global SDK_SPAN_FORWARDER OnceLock and cannot coexist",
      "severity": "MEDIUM",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "Both new integration test files (otel_forwarder_resource_test.rs and otel_forwarder_partial_batch_test.rs, plus the pre-existing otel_forwarder_headers_test.rs) call init_sdk_span_forwarder(&endpoint) which writes to a process-global OnceLock. The comment in each file asserts that 'Owns its own process via being an integration test file, so the process-global SDK_SPAN_FORWARDER OnceLock is fresh on every run.' This is true only because cargo compiles each integration test file in tests/ as a separate binary. However, if any test file in the future contains more than one #[tokio::test] that calls init_sdk_span_forwarder with a different endpoint, all subsequent tests will silently send to the first endpoint (OnceLock semantics) and the assertions about 'exactly one ExportTraceServiceRequest' will become non-deterministic across runs. There is no test-level guardrail (no serial_test marker, no assertion that the OnceLock was uninitialized) preventing future contributors from breaking this invariant.",
      "suggestion": "Either (a) add a doc-comment near init_sdk_span_forwarder warning that the OnceLock requires one-test-per-binary, with a brief explanation, or (b) expose a #[cfg(test)] reset helper so multiple tests per binary become safe. Option (b) is more robust.",
      "confidence": 0.75
    },
    {
      "id": "F2",
      "title": "Mock collector spawn task swallows panics on Server::builder errors after readiness",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The tokio::spawn task that runs the tonic Server uses .expect() on serve_with_incoming. If the server errors mid-test (e.g., port reclaimed, OS resource exhaustion), the spawned task panics in the background but the main test continues, producing confusing 'received 0 requests' failures rather than the underlying transport error. Same pattern in otel_forwarder_partial_batch_test.rs.",
      "suggestion": "Capture the JoinHandle and abort()/check it at end-of-test, or use a oneshot channel to surface server startup errors to the main test thread.",
      "confidence": 0.6
    },
    {
      "id": "F3",
      "title": "Readiness probe leaves a stray connect socket against tonic accept loop",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The TCP connect-then-drop readiness probe at line ~62 establishes a real TCP connection to the tonic server that is then dropped without an HTTP/2 handshake. tonic/h2 may log a spurious warning ('connection error: not a valid HTTP/2 connection preface' or similar) on stderr for each test. Not a correctness bug but adds noise to test output.",
      "suggestion": "Acceptable as-is; if the noise becomes problematic, switch to a lightweight in-process readiness signal (oneshot sent from inside the spawn before serve_with_incoming).",
      "confidence": 0.4
    },
    {
      "id": "F4",
      "title": "Excellent regression-test design with explicit contract assertions",
      "severity": "PRAISE",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The resource-preservation test pins not only the originally-broken service.name field but also tests three AnyValue discriminator branches (string, int, bool), span-name pass-through, attribute count (catches injection), and SAMPLED-flag default behavior. The partial-batch test goes further: it pins both hex-decode-failure AND hex-decodes-but-wrong-length cases, citing the spec lengths (16 bytes / 8 bytes) explicitly. Failure messages reference issue numbers and explain the intended invariant, making future debugging significantly easier.",
      "suggestion": "No change required."
    },
    {
      "id": "F5",
      "title": "Production tonic dependency added with tls-native-roots — confirm intent",
      "severity": "LOW",
      "category": "dependencies",
      "file": "engine/Cargo.toml",
      "description": "tonic is added to the [dependencies] section (not [dev-dependencies]) with features = [\"transport\", \"tls-native-roots\"]. The new test files use tonic, but they only need tonic at dev-time. If the forwarder implementation in the truncated otel.rs needs tonic directly (not just transitively through opentelemetry-otlp which already pulls tonic), that is fine. Otherwise, this is dependency bloat and should move to [dev-dependencies]. Cannot fully verify without the truncated otel.rs diff.",
      "suggestion": "Verify whether engine/src/workers/observability/otel.rs imports tonic directly. If only the integration tests need it, move tonic + opentelemetry-proto to [dev-dependencies].",
      "confidence": 0.55
    },
    {
      "id": "F6",
      "title": "Typo in test comment",
      "severity": "LOW",
      "category": "nit",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "'preceeding' is misspelled (should be 'preceding') in the comment block at the attribute-count assertion.",
      "suggestion": "Fix spelling.",
      "confidence": 0.95
    },
    {
      "id": "F7",
      "title": "Duplicated mock-collector helper across three test files",
      "severity": "LOW",
      "category": "maintainability",
      "file": "engine/tests/otel_forwarder_partial_batch_test.rs",
      "description": "CapturingTraceService, TraceService impl, and spawn_mock_collector are duplicated verbatim across otel_forwarder_resource_test.rs, otel_forwarder_partial_batch_test.rs, and (per the truncated diff) otel_forwarder_headers_test.rs. Cargo integration tests can share code via a tests/common/mod.rs module; this would also enforce that the readiness/spawn semantics stay in lockstep across the suite.",
      "suggestion": "Extract the mock collector into engine/tests/common/mod.rs and have each integration test `mod common;` it.",
      "confidence": 0.85
    },
    {
      "id": "F8",
      "title": "Truncated otel.rs change is the actual fix — cannot verify behavior without it",
      "severity": "REFRAME",
      "category": "review-scope",
      "file": "engine/src/workers/observability/otel.rs",
      "description": "The substantive fix (449 added, 696 removed lines in engine/src/workers/observability/otel.rs) is truncated from the diff. The tests pin behavior but the production change itself — including the `otlp_span_to_proto` returning Option, the resource conversion path, and the SAMPLED-default logic — is unreviewed. Findings here cover only the test/dependency surface; the core fix needs a separate pass.",
      "suggestion": "Request the full otel.rs diff for the next review pass to verify: (1) hex-decode length checks on trace_id (16) and span_id (8), (2) resource attribute conversion preserves all AnyValue variants, (3) SAMPLED-default flag handling, (4) no unwrap()/expect() on adversarial input.",
      "confidence": 0.9
    }
  ]
}

Callouts

Enrichment unavailable for this review.

Addresses two Bridgebuilder F-findings from the local self-review pass on PR iii-hq#1618 head a3c37fb: F7 (LOW, maintainability) — `CapturingTraceService`, the `TraceService` impl, the spawn helper, and the readiness probe were duplicated across all three `otel_forwarder_*` integration tests (~60 lines each, three copies). When the `SdkForwarderState` shape changed in earlier rounds the three helpers drifted out of lockstep until the partial-batch test caught it. Extracted to `engine/tests/common/forwarder_mock.rs` with two variants: - `spawn_body_capturing_collector` — used by the resource-preservation and partial-batch tests (both need the `ExportTraceServiceRequest` body). - `spawn_metadata_capturing_collector` — used by the headers test (needs `tonic::MetadataMap` instead of body). Both share the same readiness-probe contract (bounded `TcpStream::connect` retry; see file-level doc-comment for the rationale that replaced the 50ms-sleep heuristic earlier in the review cycle). Registered in `engine/tests/common/mod.rs` alongside the existing `http_helpers` / `queue_helpers` modules. Net effect: tests/ shrinks by ~210 lines (210 deletions vs 32 insertions, after the new helper module is counted). F6 (NIT, spelling) — "preceeding" → "preceding" in the comment block on the attribute-count assertion in `otel_forwarder_resource_test.rs`. Deferred F-findings (with rationale): F1 (OnceLock fragility) was already addressed in round-3 via the doc-comment on `SDK_SPAN_FORWARDER`; BB couldn't see otel.rs in the truncated diff. F2/F3 (server task panics / readiness probe noise) — `.expect()` was added in round-3 and the noise is acceptable per BB's own confidence (0.4/0.6). F5 (dep placement) — `tonic` IS used in production code (`Channel`, `Endpoint`, `MetadataKey/Value`, `ClientTlsConfig`); the `[dependencies]` placement is correct. F8 (truncated otel.rs review scope) — adversarial cross-model dissent (`gpt-5.3-codex`) reviewed the full diff and returned `findings: []` after the round-4 spec-length tightening. Sweep: 1542 lib tests passed, 3 forwarder integration tests passed (all three now via the shared helper module), cargo fmt + clippy clean on touched files. Ridden with [Loa](https://github.com/0xHoneyJar/loa)

deep-name · 2026-05-12T14:39:10Z

both gherkin scenarios closed end-to-end @ytallo. round summary across 2 commits:

a3c37fb — scenarios proper

scenario 1 (OTEL_EXPORTER_OTLP_HEADERS): new SdkForwarderState struct wraps the tonic channel + the parsed Vec<(MetadataKey<Ascii>, MetadataValue<Ascii>)> pairs. parse_otlp_headers_env() reads OTEL_EXPORTER_OTLP_TRACES_HEADERS first (signal-specific precedence per the OTel SDK spec), falls back to OTEL_EXPORTER_OTLP_HEADERS; comma-split + =-split + warn-and-skip on invalid pairs (matches opentelemetry-otlp's tolerance). forwarder dispatch attaches every parsed pair to tonic::Request::metadata_mut() before TraceServiceClient::export. test: engine/tests/otel_forwarder_headers_test.rs sets authorization=Bearer-test-token-12345,x-tenant=acme before init, asserts both metadata entries land on the captured request.

scenario 2 (one bad span doesn't poison the batch): otlp_span_to_proto is now fallible — returns None for spans whose trace_id/span_id fail to hex-decode OR decode to non-spec byte lengths (16 / 8). otlp_scope_spans_to_proto filter_maps them out. caught one gap mid-review: my initial filter was hex-only, so traceId: "01" would have decoded to a 1-byte trace_id and slipped through. spec-length check (OTLP_TRACE_ID_BYTES = 16 / OTLP_SPAN_ID_BYTES = 8) added — test now covers all three failure modes (non-hex, hex-but-wrong-length, valid) in one mixed batch.

380d684 — test maintainability nits from local bridgebuilder pass

extracted the duplicated mock-collector helpers into engine/tests/common/forwarder_mock.rs (spawn_body_capturing_collector + spawn_metadata_capturing_collector, plus the shared readiness probe). net tests/ shrinks by ~50 lines and keeps the three forwarder regression tests in lockstep when the forwarder shape changes in the future.

cross-model adversarial dissent (gpt-5.3-codex) was clean against the final state.

sweep: 1542 lib tests + 3 forwarder integration tests pass, cargo fmt + clippy clean on touched files. ready when you are.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@engine/src/workers/observability/otel.rs`:
- Around line 713-756: parse_otlp_headers_env currently uses header values
verbatim; update it to percent-decode the value string before constructing the
MetadataValue so OTLP percent-encoded headers (e.g. Authorization=Basic%20...)
are decoded per the spec: after trimming v in parse_otlp_headers_env, run a
percent-decode (e.g. via percent_encoding::percent_decode_str or similar) and if
decoding fails log a warning (like the existing MetadataValue error branch,
include header_name = k) and skip the entry; then pass the decoded string/bytes
into MetadataValue::<Ascii>::try_from instead of the original v; preserve the
existing header-name parsing/validation and warnings for invalid name/value.
- Around line 1926-1934: The otlp_link_to_proto function currently uses
hex::decode(...).unwrap_or_default() which can produce short byte vectors (e.g.,
1 byte) and cause OTLP collectors to reject the whole batch; change the trace_id
and span_id handling so you decode the hex string, check the decoded length
(trace_id must be exactly 16 bytes, span_id must be exactly 8 bytes), and if the
length is invalid replace with an empty Vec<u8> (preserving the previous
"recoverable metadata loss" behavior) instead of using unwrap_or_default; update
the trace_id/span_id expressions in otlp_link_to_proto to perform this
decode+length-check logic (and keep trace_state as
link.trace_state.clone().unwrap_or_default()).

In `@engine/tests/otel_forwarder_headers_test.rs`:
- Around line 99-170: The test
forwarder_injects_otel_exporter_otlp_headers_into_export_metadata must
explicitly guard against existing OTEL_EXPORTER_OTLP_TRACES_HEADERS precedence:
capture the current values of OTEL_EXPORTER_OTLP_HEADERS and
OTEL_EXPORTER_OTLP_TRACES_HEADERS, clear or set both appropriately before
calling iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint)
(ensuring the desired OTEL_EXPORTER_OTLP_HEADERS is set and
OTEL_EXPORTER_OTLP_TRACES_HEADERS is cleared), and then restore the original
values at the end of the test; this guarantees deterministic header precedence
for the assertions without changing init_sdk_span_forwarder itself.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5a3a7220-e44a-4749-8520-077514a8310d

📥 Commits

Reviewing files that changed from the base of the PR and between 1bc8af3 and a3c37fb.

📒 Files selected for processing (3)

engine/src/workers/observability/otel.rs
engine/tests/otel_forwarder_headers_test.rs
engine/tests/otel_forwarder_partial_batch_test.rs

coderabbitai · 2026-05-12T14:41:10Z

+fn parse_otlp_headers_env() -> Vec<(MetadataKey<Ascii>, MetadataValue<Ascii>)> {
+    let raw = std::env::var("OTEL_EXPORTER_OTLP_TRACES_HEADERS")
+        .or_else(|_| std::env::var("OTEL_EXPORTER_OTLP_HEADERS"))
+        .ok();
+    let Some(raw) = raw else {
+        return Vec::new();
+    };
+    if raw.trim().is_empty() {
+        return Vec::new();
+    }
+    raw.split(',')
+        .filter_map(|pair| {
+            let pair = pair.trim();
+            if pair.is_empty() {
+                return None;
            }
-        }
+            let (k, v) = pair.split_once('=')?;
+            let k = k.trim();
+            let v = v.trim();
+            let key = match MetadataKey::<Ascii>::from_bytes(k.as_bytes()) {
+                Ok(k) => k,
+                Err(e) => {
+                    tracing::warn!(
+                        error = %e,
+                        header_name = k,
+                        "OTEL_EXPORTER_OTLP_HEADERS entry has invalid header name; skipping"
+                    );
+                    return None;
+                }
+            };
+            let value = match MetadataValue::<Ascii>::try_from(v) {
+                Ok(v) => v,
+                Err(e) => {
+                    tracing::warn!(
+                        error = %e,
+                        header_name = k,
+                        "OTEL_EXPORTER_OTLP_HEADERS entry has invalid header value; skipping"
+                    );
+                    return None;
+                }
+            };
+            Some((key, value))
+        })
+        .collect()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS, are values required to be URL-decoded before sending request headers?

💡 Result:

No. The OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS values are not something you should “URL-decode” as part of sending the HTTP/gRPC request headers. What the spec requires is that the environment variable value itself is formatted like W3C Baggage (key=value,key2=value2), and that the value portion is URL-encoded; SDKs should percent-decode that value into the actual header value they send on the wire. The header key should not be percent-decoded; only the value is treated as URL-encoded baggage. [1][2][3] In practice: if you set OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic%20..." then the SDK should decode %20 to a space before sending the Authorization header. [4][5][6][7] So: you (the user) should provide URL-encoded header values in the env var; the SDK performs the decoding when constructing the actual request headers. You generally do not URL-decode yourself before exporting the env vars. [1][2][3]

Citations:

1: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md

2: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/protocol/exporter.md

3: OTLP exporters should not percent decode the key when parsing HEADERS env vars open-telemetry/opentelemetry-go#5623

4: Decode URL-encoded headers in environment vars open-telemetry/opentelemetry-python#2312

5: URL encoded values for OTEL_EXPORTER_OTLP_HEADERS are not exported open-telemetry/opentelemetry-cpp#2523

6: URL encoded values for OTEL_EXPORTER_OTLP_HEADERS are not supported open-telemetry/opentelemetry-php#1239

7: URL encoded values for OTEL_EXPORTER_OTLP_HEADERS are not supported open-telemetry/opentelemetry-rust#1577

Add percent-decoding for OTEL header environment variable values per OTLP specification.

The OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS environment variables require header values to be percent-encoded by the user; the SDK must percent-decode them before constructing request headers. The current code (lines 723–745) parses values verbatim without decoding, which breaks headers with encoded characters (e.g., Authorization=Basic%20... will send literal %20 instead of a space). This violates the OTLP specification and causes interoperability issues with collectors.

Proposed fix

let v = v.trim(); + let v = match urlencoding::decode(v) { + Ok(decoded) => decoded.into_owned(), + Err(e) => { + tracing::warn!( + error = %e, + header_name = k, + "OTEL_EXPORTER_OTLP_HEADERS entry has invalid percent-encoding; skipping" + ); + return None; + } + }; let value = match MetadataValue::<Ascii>::try_from(v.as_str()) {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@engine/src/workers/observability/otel.rs` around lines 713 - 756, parse_otlp_headers_env currently uses header values verbatim; update it to percent-decode the value string before constructing the MetadataValue so OTLP percent-encoded headers (e.g. Authorization=Basic%20...) are decoded per the spec: after trimming v in parse_otlp_headers_env, run a percent-decode (e.g. via percent_encoding::percent_decode_str or similar) and if decoding fails log a warning (like the existing MetadataValue error branch, include header_name = k) and skip the entry; then pass the decoded string/bytes into MetadataValue::<Ascii>::try_from instead of the original v; preserve the existing header-name parsing/validation and warnings for invalid name/value.

coderabbitai · 2026-05-12T14:41:11Z

+fn otlp_link_to_proto(link: &OtlpSpanLink) -> opentelemetry_proto::tonic::trace::v1::span::Link {
+    // Links to malformed-ID remote spans fall back to empty bytes here
+    // (rather than dropping the link) — the parent span itself was already
+    // validated by `otlp_span_to_proto`, so a bad link is a recoverable
+    // metadata loss, not a wholesale span-loss event.
+    opentelemetry_proto::tonic::trace::v1::span::Link {
+        trace_id: hex::decode(&link.trace_id).unwrap_or_default(),
+        span_id: hex::decode(&link.span_id).unwrap_or_default(),
+        trace_state: link.trace_state.clone().unwrap_or_default(),


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In OTLP Trace Span.Link, must trace_id be exactly 16 bytes and span_id exactly 8 bytes, and can collectors reject entire exports when these lengths are invalid?

💡 Result:

Yes, in OTLP Trace Span.Link, trace_id must be exactly 16 bytes and span_id exactly 8 bytes, as defined in the official OpenTelemetry protobuf schema for the Span.Link message: "A unique identifier of a trace that this linked span is part of. The ID is a 16-byte array." for trace_id and "A unique identifier for the linked span. The ID is an 8-byte array." for span_id [1][2]. This matches the specification for Span trace_id and span_id [3][4][5]. These fields are bytes with fixed semantic lengths; incorrect lengths violate the spec and are treated as invalid [6][7].

Citations:

1: https://github.com/open-telemetry/opentelemetry-proto/blob/ebef7c999f4dea62b5b033e92a221411c49c0966/opentelemetry/proto/trace/v1/trace.proto

2: https://github.com/open-telemetry/opentelemetry-proto/blob/30d237e1ff3ab7aa50e0922b5bebdd93505090af/opentelemetry/proto/trace/v1/trace.proto

3: https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/trace/0059-otlp-trace-data-format.md

4: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md

5: https://opentelemetry.io/docs/specs/otel/trace/api/

6: Collector return "invalid length for ID" for trace reporting open-telemetry/opentelemetry-collector#3195

7: Collector cannot forward to another collector using the otlphttp exporter open-telemetry/opentelemetry-collector#4221

🏁 Script executed:

cat -n engine/src/workers/observability/otel.rs | sed -n '1920,1940p'

Repository: iii-hq/iii

Length of output: 1121

🏁 Script executed:

rg "OTLP_TRACE_ID_BYTES|OTLP_SPAN_ID_BYTES" engine/src/workers/observability/otel.rs

Repository: iii-hq/iii

Length of output: 281

🏁 Script executed:

rg "fn otlp_span_to_proto" engine/src/workers/observability/otel.rs -A 20

Repository: iii-hq/iii

Length of output: 1053

🏁 Script executed:

rg "\.len\(\)" engine/src/workers/observability/otel.rs | head -20

Repository: iii-hq/iii

Length of output: 924

Add link ID length validation to prevent malformed IDs from poisoning batch exports.

The otlp_link_to_proto function accepts hex-decoded IDs of any length via unwrap_or_default(). Short IDs like hex::decode("01") produce 1-byte values that violate the OTLP spec (trace_id must be 16 bytes, span_id must be 8 bytes) and trigger collector-side rejection of the entire export. The parent function otlp_span_to_proto already validates these lengths correctly; otlp_link_to_proto must do the same.

Proposed fix

fn otlp_link_to_proto(link: &OtlpSpanLink) -> opentelemetry_proto::tonic::trace::v1::span::Link { // Links to malformed-ID remote spans fall back to empty bytes here // (rather than dropping the link) — the parent span itself was already // validated by `otlp_span_to_proto`, so a bad link is a recoverable // metadata loss, not a wholesale span-loss event. + let trace_id = hex::decode(&link.trace_id) + .ok() + .filter(|b| b.len() == OTLP_TRACE_ID_BYTES) + .unwrap_or_default(); + let span_id = hex::decode(&link.span_id) + .ok() + .filter(|b| b.len() == OTLP_SPAN_ID_BYTES) + .unwrap_or_default(); + opentelemetry_proto::tonic::trace::v1::span::Link { - trace_id: hex::decode(&link.trace_id).unwrap_or_default(), - span_id: hex::decode(&link.span_id).unwrap_or_default(), + trace_id, + span_id, trace_state: link.trace_state.clone().unwrap_or_default(),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@engine/src/workers/observability/otel.rs` around lines 1926 - 1934, The otlp_link_to_proto function currently uses hex::decode(...).unwrap_or_default() which can produce short byte vectors (e.g., 1 byte) and cause OTLP collectors to reject the whole batch; change the trace_id and span_id handling so you decode the hex string, check the decoded length (trace_id must be exactly 16 bytes, span_id must be exactly 8 bytes), and if the length is invalid replace with an empty Vec<u8> (preserving the previous "recoverable metadata loss" behavior) instead of using unwrap_or_default; update the trace_id/span_id expressions in otlp_link_to_proto to perform this decode+length-check logic (and keep trace_state as link.trace_state.clone().unwrap_or_default()).

coderabbitai · 2026-05-12T14:41:11Z

+async fn forwarder_injects_otel_exporter_otlp_headers_into_export_metadata() {
+    // Order is load-bearing: the env var MUST be set BEFORE
+    // init_sdk_span_forwarder is called, because the parsing happens at
+    // forwarder-init time (matching the upstream `opentelemetry-otlp`
+    // builder semantics). Setting it post-init would not retroactively
+    // populate `SdkForwarderState.headers`.
+    //
+    // # Safety: this integration test binary is single-threaded with
+    // respect to env-var mutation — only one test exists in the file,
+    // and `SDK_SPAN_FORWARDER`'s `OnceLock` ensures we only init once.
+    unsafe {
+        std::env::set_var(
+            "OTEL_EXPORTER_OTLP_HEADERS",
+            "authorization=Bearer-test-token-12345,x-tenant=acme",
+        );
+    }
+
+    let (endpoint, received_metadata) = spawn_mock_collector().await;
+
+    iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint);
+
+    iii::workers::observability::otel::ingest_otlp_json(MINIMAL_PAYLOAD)
+        .await
+        .expect("ingest_otlp_json should accept a well-formed payload");
+
+    let deadline = std::time::Instant::now() + Duration::from_secs(5);
+    while std::time::Instant::now() < deadline {
+        if !received_metadata.lock().await.is_empty() {
+            break;
+        }
+        tokio::time::sleep(Duration::from_millis(20)).await;
+    }
+
+    let captured = received_metadata.lock().await;
+    assert_eq!(
+        captured.len(),
+        1,
+        "Mock collector should have received exactly one ExportTraceServiceRequest"
+    );
+
+    let metadata = &captured[0];
+
+    // The authorization header from OTEL_EXPORTER_OTLP_HEADERS must
+    // land on the wire as a tonic Ascii metadata entry. Without this
+    // the upstream-spec env var is silently ignored and every API-keyed
+    // collector rejects forwarded spans at auth.
+    let auth = metadata
+        .get("authorization")
+        .expect("authorization metadata MUST be set from OTEL_EXPORTER_OTLP_HEADERS");
+    assert_eq!(
+        auth.to_str().unwrap(),
+        "Bearer-test-token-12345",
+        "authorization metadata value did not match OTEL_EXPORTER_OTLP_HEADERS"
+    );
+
+    // Second pair confirms the comma-split parser handles multiple
+    // entries (single-pair could pass with a naive `=`-only split).
+    let tenant = metadata
+        .get("x-tenant")
+        .expect("x-tenant metadata MUST be set; comma-split parser regression?");
+    assert_eq!(
+        tenant.to_str().unwrap(),
+        "acme",
+        "x-tenant metadata value did not match OTEL_EXPORTER_OTLP_HEADERS"
+    );
+
+    // Cleanup so a future test binary inheriting this process's env (if
+    // any) doesn't leak the test credential. # Safety: same single-test
+    // file rationale as the set_var above.
+    unsafe {
+        std::env::remove_var("OTEL_EXPORTER_OTLP_HEADERS");
+    }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test isolation is incomplete for header precedence.

Because OTEL_EXPORTER_OTLP_TRACES_HEADERS overrides OTEL_EXPORTER_OTLP_HEADERS, this test can flake if the process inherits *_TRACES_HEADERS. Clear (and restore) both vars around init to make the assertion deterministic.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@engine/tests/otel_forwarder_headers_test.rs` around lines 99 - 170, The test forwarder_injects_otel_exporter_otlp_headers_into_export_metadata must explicitly guard against existing OTEL_EXPORTER_OTLP_TRACES_HEADERS precedence: capture the current values of OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS, clear or set both appropriately before calling iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint) (ensuring the desired OTEL_EXPORTER_OTLP_HEADERS is set and OTEL_EXPORTER_OTLP_TRACES_HEADERS is cleared), and then restore the original values at the end of the test; this guarantees deterministic header precedence for the assertions without changing init_sdk_span_forwarder itself.

ytallo · 2026-05-13T19:54:48Z

This PR touches engine code. For us to accept it, we need your explicit confirmation in this thread that you license your changes to the engine portion of the codebase under Apache 2.0.

Please reply with:
“I agree that my contributions in this PR to the engine code are licensed under Apache 2.0.”

deep-name added 3 commits May 11, 2026 13:56

deep-name mentioned this pull request May 11, 2026

OTLP span forwarding drops resource attributes, causing service.name to appear as <nil> #1617

Open

deep-name commented May 11, 2026

View reviewed changes

deep-name marked this pull request as ready for review May 11, 2026 04:40

coderabbitai Bot reviewed May 11, 2026

View reviewed changes

deep-name mentioned this pull request May 11, 2026

engine-originated spans land at the collector with empty service.name on the SdkTracerProvider export path #1619

Open

ytallo self-requested a review May 12, 2026 00:38

deep-name commented May 12, 2026

View reviewed changes

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

alsruf36 mentioned this pull request May 13, 2026

fix(observability): preserve engine span resource identity deep-name/iii#1

Open

Conversation

deep-name commented May 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

What changed

Behaviour change worth noting

Test plan

Credits

Summary by CodeRabbit

Uh oh!

vercel Bot commented May 11, 2026

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

deep-name left a comment

Choose a reason for hiding this comment

Summary

Findings

Callouts

Uh oh!

deep-name commented May 11, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

deep-name commented May 11, 2026

Uh oh!

ytallo commented May 12, 2026

Uh oh!

deep-name commented May 12, 2026

Uh oh!

deep-name left a comment

Choose a reason for hiding this comment

Summary

Findings

Callouts

Uh oh!

deep-name commented May 12, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

ytallo commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

deep-name commented May 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 11, 2026 •

edited

Loading