Skip to content

fix(observability): preserve OTLP resource attributes on forwarder hop#1618

Open
deep-name wants to merge 6 commits into
iii-hq:mainfrom
deep-name:fix/1617-otlp-resource-forwarding
Open

fix(observability): preserve OTLP resource attributes on forwarder hop#1618
deep-name wants to merge 6 commits into
iii-hq:mainfrom
deep-name:fix/1617-otlp-resource-forwarding

Conversation

@deep-name
Copy link
Copy Markdown
Contributor

@deep-name deep-name commented May 11, 2026

Summary

Fixes #1617 — OTLP span forwarding drops resource_spans[].resource, so SDK spans reach the collector with service.name=<nil> even when OTEL_SERVICE_NAME is set at the SDK.

Why

SpanData carries no resource — in the OTel SDK contract, the owning TracerProvider supplies one before the exporter serialises. SDK_SPAN_FORWARDER was a bare opentelemetry_otlp::SpanExporter with no TracerProvider wrapping it, so convert_otlp_to_span_data → SpanExporter::export wrote an empty resource onto the wire. Collectors rendered the missing service identity as <nil> (SigNoz/Grafana/etc).

@alsruf36's diagnosis on the issue thread was exactly right; the fix shape follows their suggestion.

What changed

engine/src/workers/observability/otel.rs

  • SDK_SPAN_FORWARDER: OnceLock<Arc<SpanExporter>>OnceLock<tonic::Channel>. Cheap to clone (Arc internally), lazy connect, no init-time blocking.
  • init_sdk_span_forwarder: rewritten to build a tonic::Endpoint + connect_lazy(). For https:// endpoints, applies ClientTlsConfig::new().with_native_roots() so the previous deployment contract (handled implicitly by opentelemetry-otlp's tls-roots feature) is preserved.
  • Forwarder branch in ingest_otlp_json (lines 1607-1635): translates the parsed OtlpExportTraceServiceRequest into opentelemetry_proto::tonic::collector::trace::v1::ExportTraceServiceRequest via field-for-field helpers (otlp_json_request_to_proto and friends, lines 1640-1834), then sends via a freshly-cloned TraceServiceClient::new(channel.clone()). Resource block survives byte-for-byte.
  • Deleted convert_otlp_to_span_data and otlp_kv_to_key_value (truly dead post-replacement; in-memory storage path does its own per-field conversion) and their unit tests.
  • Trimmed one assertion in test_ingest_otlp_json_skips_spans_with_invalid_identifiers — malformed-span filtering was a side effect of the old SpanData path, not a contract. New forwarder relays and lets the collector validate. Memory-storage behaviour is unchanged.

engine/Cargo.toml

  • Promote opentelemetry-proto = "0.31" (features ["gen-tonic"]), tonic = "0.14" (features ["transport", "tls-native-roots"]), tokio-stream = "0.1" to direct deps. All three were already transitively present via opentelemetry-otlp's grpc-tonic feature — promotion adds nothing new to the dep tree.

engine/tests/otel_forwarder_resource_test.rs (new)

  • Integration test against a tonic gRPC mock collector. Owns its own process so the global SDK_SPAN_FORWARDER OnceLock is fresh.
  • Asserts service.name=my-service, service.namespace=production, service.version=1.4.2 survive the hop (three resource attributes locked in, not just one).
  • Asserts replica.count=7 (intValue) and telemetry.enabled=true (boolValue) — covers the proto-conversion's non-string AnyValue branches.
  • Asserts flags & 0x01 == 0x01 for an inbound payload that omits flags — pins the SAMPLED default that mirrors the storage path at otel.rs:1521-1524.

Behaviour change worth noting

The new forwarder relays spans with malformed trace/span IDs (the old SpanData path pre-filtered them via TraceId::from_hex/SpanId::from_hex bailout). Downstream collectors validate and reject — engine-side pre-filtering hid that diagnostic signal. The in-memory storage path's filtering is unchanged. Called out in the source comment at engine/src/workers/observability/otel.rs:1607-1622.

Test plan

  • cargo test -p iii --test otel_forwarder_resource_test — 1 passed (with extended assertions for string + int + bool resource attrs + flags=SAMPLED default)
  • cargo test -p iii --lib — 1542 passed, 0 failed, 6 ignored (KVM-gated)
  • cargo test -p iii --lib workers::observability — 289 passed
  • cargo test -p iii --lib test_ingest_otlp_json — 10 passed (including the trimmed test_ingest_otlp_json_skips_spans_with_invalid_identifiers)
  • cargo fmt --check — clean
  • cargo clippy -p iii --lib --tests — no new warnings on touched files
  • Manual: run against a real collector with OTEL_SERVICE_NAME=my-service, confirm UI shows the service name (rather than <nil>)
  • Manual: run against an https:// collector with valid certs, confirm TLS handshake succeeds (verified locally against a self-signed loopback; full prod-collector cert validation deferred to CI / reviewer)

Credits

@alsruf36 — diagnosis + suggested fix shape on the issue thread.

Closes #1617.

Summary by CodeRabbit

  • Bug Fixes

    • Preserve full incoming OTLP resource attributes when forwarding traces to collectors.
  • Improvements

    • Forward OTLP traces as native proto over gRPC with injected OTLP headers.
    • Add native TLS handling for HTTPS collector endpoints.
    • Drop spans with malformed trace/span IDs from forwarded batches while retaining in-memory ingestion.
  • Tests

    • Added integration tests verifying resource preservation, header propagation, and partial-batch behavior.
  • Chores

    • Updated build/dev dependencies for OTLP and async streaming.

Review Change Stack

deep-name added 3 commits May 11, 2026 13:56
OTLP span forwarder drops `resource_spans[].resource` on the
engine → collector hop; SigNoz/Grafana/etc render `service.name=<nil>`
even though `OTEL_SERVICE_NAME` is set at the SDK.

Adds a tonic gRPC mock collector that captures every inbound
`ExportTraceServiceRequest` and asserts the captured resource carries
all three inbound attributes (`service.name`, `service.namespace`,
`service.version`) — full-resource preservation rather than a
single-field band-aid.

Runs in its own integration-test binary so the process-global
`SDK_SPAN_FORWARDER` `OnceLock` is fresh, with a 5s poll deadline + 20ms
tick so the happy path stays fast and the failure path stays bounded.

FAILS against the unfixed engine with `assertion: left: None, right:
Some("my-service")`. Locks in the regression contract before the fix
lands.

Closes prerequisite for iii-hq#1617.

Ridden with [Loa](https://github.com/0xHoneyJar/loa)
Replaces the bare `opentelemetry_otlp::SpanExporter` forwarder with a
`tonic::Channel` to a `TraceServiceClient`, and bypasses
`convert_otlp_to_span_data` on the forwarder branch by translating
the parsed `OtlpExportTraceServiceRequest` directly into the proto
`ExportTraceServiceRequest`. The full `resource_spans[].resource`
block survives the engine → collector hop byte-for-byte, including
`service.name`, `service.namespace`, `service.version`, and any
other attributes the inbound SDK sent.

Root cause: `SpanData` carries no resource — the OTel SDK contract
is that the owning `TracerProvider` supplies one before the exporter
serialises. `SDK_SPAN_FORWARDER` was a bare exporter (no provider),
so the export call wrote an empty resource onto the wire. Collectors
rendered the missing service identity as `<nil>`.

Changes:
- `SDK_SPAN_FORWARDER`: `OnceLock<Arc<SpanExporter>>` →
  `OnceLock<Channel>`. Channel clones cheaply, lazy connect matches
  the prior exporter-builder semantics, no init-time blocking.
- `init_sdk_span_forwarder`: rewritten to build an `Endpoint` and
  call `connect_lazy()`. Visibility bumped to `pub` so the new
  integration test can point the forwarder at a mock collector.
- Forwarder branch in `ingest_otlp_json`: translates the parsed
  JSON request to the proto type via `otlp_json_request_to_proto`
  and a family of field-for-field helpers, then sends through a
  freshly-cloned `TraceServiceClient`.
- Deleted `convert_otlp_to_span_data` and `otlp_kv_to_key_value`
  (truly dead post-replacement; in-memory storage path does its
  own per-field conversion) plus their unit tests.
- Trimmed one assertion in
  `test_ingest_otlp_json_skips_spans_with_invalid_identifiers` —
  malformed-span filtering was a side effect of the
  SpanData-based path. The new path relays everything and lets
  the collector validate. Memory-storage behaviour is unchanged
  (the actual contract under test).
- `engine/Cargo.toml`: promote `opentelemetry-proto = "0.31"`,
  `tonic = "0.14"`, `tokio-stream = "0.1"` to direct deps. All
  three were transitively present via `opentelemetry-otlp`'s
  `grpc-tonic` feature — promotion is the only change.

Sweep: 1542 lib tests passed, 0 failed; new forwarder integration
test passes; cargo fmt + clippy clean on touched files.

Credits: @alsruf36 for the diagnosis and the suggested fix shape in
iii-hq#1617.

Closes iii-hq#1617.

Ridden with [Loa](https://github.com/0xHoneyJar/loa)
…Value coverage

Round-2 fixes for sprint-bug-1 review:

C1/DISS-001 — TLS regression for `https://` collectors. The new
raw-tonic channel did not configure TLS; the pre-fix
`opentelemetry_otlp::SpanExporter` builder auto-configured it via
its `tls-roots` feature. Production deployments behind TLS-fronted
collectors (Grafana Cloud, SigNoz Cloud, Honeycomb, the standard
prod shape) would silently fail at handshake time.

- `engine/Cargo.toml`: `tonic` features extended with
  `tls-native-roots`.
- `engine/src/workers/observability/otel.rs:696-744`:
  `init_sdk_span_forwarder` now branches on
  `endpoint.starts_with("https://")` and applies
  `ClientTlsConfig::new().with_native_roots()` to the Endpoint
  before `connect_lazy()`. TLS-config failure falls back to
  plaintext with a logged warning so HTTP-only deployments are
  unaffected.

C2/DISS-002 — trace flags default mismatch. The forwarder defaulted
missing `flags` to 0 (unsampled) while the in-memory storage path
at `otel.rs:1521-1524` defaults to `TraceFlags::SAMPLED`. Same
inbound span was being treated as sampled by storage but unsampled
by the forwarder.

- `engine/src/workers/observability/otel.rs:1747`: flipped
  `unwrap_or(0)` → `unwrap_or(1)` (W3C SAMPLED bit) with a parity
  comment referencing the storage path.

N1 — AnyValue conversion test coverage. The round-1 test only
exercised the `StringValue` branch.

- `engine/tests/otel_forwarder_resource_test.rs`: payload now
  includes `replica.count` (intValue) and `telemetry.enabled`
  (boolValue); assertions extended via `find_int_attr` and
  `find_bool_attr`. Flags-default assertion added for the SAMPLED
  parity check.

N2 — source-level call-out for the deliberate malformed-span
forwarding behaviour change.

- `engine/src/workers/observability/otel.rs:1607-1622` extended
  with the rationale in the forwarder-branch comment.

Round-2 adversarial cross-model review (gpt-5.3-codex):
`findings: []` / status `clean`. Sweep: 1542 lib tests pass, 1
extended forwarder integration test passes, cargo fmt + clippy
clean on touched files.

Closes review-loop iteration 1 for sprint-bug-1. Sprint approved.

Ridden with [Loa](https://github.com/0xHoneyJar/loa)
@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented May 11, 2026

@deep-name is attempting to deploy a commit to the motia Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0cecfc5d-119a-4f97-a304-c2b4c74ff130

📥 Commits

Reviewing files that changed from the base of the PR and between a3c37fb and 380d684.

📒 Files selected for processing (5)
  • engine/tests/common/forwarder_mock.rs
  • engine/tests/common/mod.rs
  • engine/tests/otel_forwarder_headers_test.rs
  • engine/tests/otel_forwarder_partial_batch_test.rs
  • engine/tests/otel_forwarder_resource_test.rs
✅ Files skipped from review due to trivial changes (1)
  • engine/tests/common/mod.rs

📝 Walkthrough

Walkthrough

Replaces the SDK SpanExporter forwarding path with direct OTLP proto gRPC forwarding: adds proto types and converters, exposes a TLS-aware global forwarder init, converts parsed OTLP JSON to ExportTraceServiceRequest, forwards with tonic::Channel and injected OTLP headers, and adds integration tests for resource attributes, headers, and partial-batch behavior.

Changes

OTLP Span Forwarding: Direct Proto Relay

Layer / File(s) Summary
Add dependencies for gRPC forwarding
engine/Cargo.toml
Adds opentelemetry-proto (gen-tonic) and tokio-stream (dev).
Import OTLP proto client and transport types
engine/src/workers/observability/otel.rs
Adds OTLP TraceServiceClient, ExportTraceServiceRequest, and tonic transport/TLS imports used by the forwarder.
Global forwarder state & public init with TLS handling
engine/src/workers/observability/otel.rs
Replace SDK exporter global with forwarder state (lazy tonic::Channel + parsed OTLP headers); make init_sdk_span_forwarder public and add TLS-native-roots handling for https:// endpoints.
Remove old JSON→SpanData conversion and tests
engine/src/workers/observability/otel.rs
Delete previous JSON→KeyValue→SpanData converters and their unit tests that dropped resource data.
Add JSON→OTLP proto conversion and ID rules
engine/src/workers/observability/otel.rs
Implement JSON→proto translators (resource/scope/span/events/links/status/attributes/AnyValue) and tolerant hex handling where malformed/wrong-length trace_id/span_id cause spans to be omitted from exported proto.
Update ingest_otlp_json to use proto forwarding
engine/src/workers/observability/otel.rs
Convert parsed OTLP JSON to ExportTraceServiceRequest, attach parsed OTLP headers as gRPC metadata, and call TraceServiceClient::export while preserving in-memory ingestion.
Adjust regression test expectations
engine/src/workers/observability/otel.rs
Update comments/assertions to reflect forwarding conversion/filtering changes while in-memory retention remains unchanged.
Mock OTLP collector implementations
engine/tests/common/forwarder_mock.rs, engine/tests/common/mod.rs
Add body-capturing and metadata-capturing Tonic TraceService mocks, readiness probe, and export the helper module.
Header propagation regression test
engine/tests/otel_forwarder_headers_test.rs
Add test that sets OTEL_EXPORTER_OTLP_HEADERS, initializes forwarder to mock collector, ingests payload, and asserts authorization and x-tenant metadata were injected.
Partial-batch forwarding regression test
engine/tests/otel_forwarder_partial_batch_test.rs
Add test sending mixed-validity spans to ensure malformed IDs are dropped and valid spans are forwarded with correct ID lengths and preserved decoded IDs.
Resource-attribute preservation regression test
engine/tests/otel_forwarder_resource_test.rs
Add payload builder, attribute extractors, and test asserting forwarded resourceSpans[].resource.attributes are preserved with correct decoded values and attribute count; also asserts sampled-default span flags and preserved span name.

Sequence Diagram

sequenceDiagram
  participant Ingest as ingest_otlp_json
  participant ProtoConv as otlp_json_request_to_proto
  participant Forwarder as TraceServiceClient
  participant Collector as OTLP Collector

  Ingest->>Ingest: parse OTLP JSON and merge into in-memory storage
  alt Forwarder initialized
    Ingest->>ProtoConv: convert parsed JSON -> ExportTraceServiceRequest
    ProtoConv->>ProtoConv: preserve resourceSpans[].resource attributes
    ProtoConv-->>Forwarder: proto ExportTraceServiceRequest
    Forwarder->>Collector: gRPC Export call (with injected OTLP headers)
    Collector-->>Forwarder: Export response
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • iii-hq/iii#163: Modifies the engine OTLP span forwarding path and forwarder initialization similarly.

Suggested reviewers

  • ytallo
  • andersonleal
  • sergiofilhowz

Poem

🐰 I hop where spans and resources meet,
Proto paths now keep attributes sweet,
Headers travel on the wire,
Bad IDs fall—good ones aspire,
Forwarded traces sing complete.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(observability): preserve OTLP resource attributes on forwarder hop' clearly and concisely describes the main change—preserving resource attributes during forwarding.
Description check ✅ Passed The PR description comprehensively covers what changed, why, and includes detailed test results and behavior notes, fulfilling all key sections of the template.
Linked Issues check ✅ Passed The PR fully implements the core objective from #1617: preserving OTLP resource attributes during forwarding, with additional implementations for header propagation and partial-batch handling.
Out of Scope Changes check ✅ Passed All changes are directly aligned with preserving resource attributes and supporting the new tonic-based forwarding path; no unrelated modifications detected.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor Author

@deep-name deep-name left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Analytical review of #1618. Enrichment pass was unavailable; findings are unenriched.

Findings

{
  "schema_version": 1,
  "findings": [
    {
      "id": "F1",
      "title": "Heavy new dependencies (tonic, opentelemetry-proto, tokio-stream) added at top-level Cargo.toml for test-only use",
      "severity": "MEDIUM",
      "category": "dependencies",
      "file": "engine/Cargo.toml",
      "description": "The PR adds `opentelemetry-proto`, `tonic` (with `transport` and `tls-native-roots`), and `tokio-stream` to the main `[dependencies]` section. From the visible diff, these are only consumed by an integration test (`engine/tests/otel_forwarder_resource_test.rs`). Putting them in `[dependencies]` instead of `[dev-dependencies]` enlarges the production binary closure (tonic with TLS pulls in rustls/native-roots, prost, hyper) and lengthens compile times for every consumer of the engine crate. If the forwarder source itself doesn't actually import these (the otel.rs section is truncated so this can't be fully verified), they should be `[dev-dependencies]`.",
      "suggestion": "Confirm whether the forwarder implementation in `engine/src/workers/observability/otel.rs` directly uses `opentelemetry-proto`, `tonic`, or `tokio-stream`. If not (i.e. only the test does), move all three to `[dev-dependencies]`. If `opentelemetry-proto` is genuinely needed in production for proto conversion, keep just that one in `[dependencies]` and move `tonic`/`tokio-stream` to dev.",
      "confidence": 0.6
    },
    {
      "id": "F2",
      "title": "Process-global OnceLock forwarder state makes integration test order-dependent",
      "severity": "MEDIUM",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The header comment states the test relies on `SDK_SPAN_FORWARDER` being a process-global `OnceLock`, fresh per integration test file. This is true today but fragile: if a future test file in the same `tests/` directory also calls `init_sdk_span_forwarder`, or if cargo test harness changes (e.g., a future shared test binary, or someone adds a `#[test]` to the same file pointing at a different endpoint), the OnceLock will silently bind the wrong endpoint and the second test will export to the wrong collector. There is no documented protection nor a test-only reset hook.",
      "suggestion": "Either (a) add a `#[cfg(test)] pub fn reset_sdk_span_forwarder_for_tests()` that swaps the OnceLock contents, or (b) refactor the forwarder to accept an injected endpoint/exporter so tests don't depend on process-global state, or (c) at minimum add a code comment on `SDK_SPAN_FORWARDER` itself warning future integration tests not to call `init_sdk_span_forwarder` in the same binary.",
      "confidence": 0.7
    },
    {
      "id": "F3",
      "title": "50ms sleep for collector startup is a flake vector",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "`spawn_mock_collector` sleeps 50ms unconditionally before returning, with a comment acknowledging this exists to avoid races with tonic's accept loop. Loaded CI runners (especially under heavy parallel cargo test) can easily exceed 50ms before the spawned task is polled. The downstream poll loop (5s deadline) does provide a fallback safety net for the actual export, but the initial race could still cause connection-refused on the first export attempt and rely on tonic's internal retry — and that retry is not guaranteed across tonic versions.",
      "suggestion": "Replace the sleep with an active readiness check: in a small loop, attempt to `TcpStream::connect(addr)` until success or a bounded timeout (~1s). Alternatively, bind the listener synchronously *before* the spawn and only spawn the `serve_with_incoming` future — `TcpListener::bind` already guarantees the port is listening, so the race is solely about the tonic accept-loop being polled. A `tokio::task::yield_now().await` followed by a connect-probe loop would be deterministic.",
      "confidence": 0.6
    },
    {
      "id": "F4",
      "title": "Mock collector server task swallows panics on `unwrap()`",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The spawned `Server::builder()...serve_with_incoming(...).await.unwrap()` runs in a detached `tokio::spawn`. If the server fails to start or errors out, the `.unwrap()` will panic inside the spawned task, but the test harness will see the export simply never arrive and surface as `captured.len() == 0` with a misleading message. The task handle is not awaited or stored.",
      "suggestion": "Either retain the JoinHandle and check it on test failure, or replace `.unwrap()` with `.expect(\"mock collector server failed to start\")` plus a `tracing::error!` — at minimum so the panic surfaces in test output rather than silently disappearing into a detached task.",
      "confidence": 0.5
    },
    {
      "id": "F5",
      "title": "Test asserts SAMPLED default but contract is undocumented in PR scope",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The final assertion (`span.flags & 0x01 == 0x01`) pins the forwarder to default missing trace-flags to SAMPLED, with rationale that it 'mirrors the in-memory storage path'. This is a meaningful behavioural contract, but it is being introduced alongside a fix titled 'preserve OTLP resource attributes' — the trace-flags default is a separate concern. If this default is new behaviour added in this PR, it deserves explicit call-out in the PR description; if it's pre-existing behaviour now being locked in, that's fine but worth noting. The OTLP spec leaves trace flags semantics to the sender; a forwarder defaulting unsampled→SAMPLED could subtly amplify sampled trace volume downstream.",
      "suggestion": "Confirm in the PR description whether the SAMPLED default is (a) pre-existing and being regression-locked, or (b) new in this PR. If (b), consider whether defaulting to whatever the inbound payload says (0 when absent) is more correct per OTLP semantics, since collectors typically interpret missing flags as 'unspecified' rather than 'unsampled'.",
      "confidence": 0.5
    },
    {
      "id": "F6",
      "title": "Test only asserts forwarded payload, not absence of duplication or modification",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The test asserts `captured.len() == 1` and that 5 attributes are correctly preserved. It does not assert that *no extra* attributes were injected by the forwarder (e.g. a hop-level `forwarder.id` attribute), nor that `scope_spans`/spans are passed through unmodified beyond the flags check. A forwarder bug that *adds* attributes or rewrites span names would pass this test.",
      "suggestion": "Consider asserting `resource.attributes.len() == 5` and `req.resource_spans[0].scope_spans[0].spans[0].name == \"forwarder.preserves.resource\"` to lock in pass-through fidelity, not just preservation of the named subset.",
      "confidence": 0.55
    },
    {
      "id": "F7",
      "title": "Net deletion of ~374 lines in otel.rs alongside 'fix' is non-obvious",
      "severity": "SPECULATION",
      "category": "review-visibility",
      "file": "engine/src/workers/observability/otel.rs",
      "description": "The truncated diff shows `engine/src/workers/observability/otel.rs` with +323 -697, a net deletion of ~374 lines, in a PR framed as a bug fix for resource-attribute preservation. A fix of this magnitude typically suggests a refactor or rewrite of the forwarder path rather than a surgical fix. Without seeing the source diff, I cannot evaluate whether the deletion removes dead code, consolidates a parser, replaces a hand-rolled protobuf encoder with `opentelemetry-proto`, or removes functionality. If it removes the hand-rolled OTLP/JSON→proto translation in favour of using `opentelemetry-proto` types directly, that's likely a good change but is much larger than the PR title implies.",
      "suggestion": "Request the full diff or a summary in the PR description explaining the structural change: is this replacing a custom proto encoder with `opentelemetry-proto::tonic` types? Are any code paths (e.g. backpressure, retry, batching) being removed in the process? Reviewers should not approve a -374-line change on the strength of one regression test without seeing what was removed.",
      "confidence": 0.7
    },
    {
      "id": "F8",
      "title": "Regression test design with full AnyValue branch coverage and clear failure messages",
      "severity": "PRAISE",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The test deliberately exercises three `AnyValue` discriminator branches (stringValue, intValue, boolValue), each with an assertion message that explains the bug class being guarded against rather than just the expected value. The header comment also explicitly states why the test is in a separate integration file (process-global OnceLock isolation) and why the assertion contract is 'full resource preservation, not just service.name' — preventing a future tempting single-field band-aid from passing this test. This is the right shape for a regression test tied to a specific incident.",
      "suggestion": "No action needed."
    }
  ]
}

Callouts

Enrichment unavailable for this review.

Local pre-emptive Bridgebuilder pass (claude-opus-4-7 via persona at
grimoires/bridgebuilder/BEAUVOIR.md) surfaced 6 net new findings beyond
the round-1 review-sprint adversarial dissent. Round-3 fixes:

F1 (MEDIUM, dep placement) — `tokio-stream` is only used by the test
file's `TcpListenerStream::new(listener)`. Moved from `[dependencies]`
to `[dev-dependencies]`. `opentelemetry-proto` + `tonic` stay in
`[dependencies]` (they're used by production code in
`engine/src/workers/observability/otel.rs` for proto types,
`TraceServiceClient`, `Channel`, `ClientTlsConfig`).

F2 (MEDIUM, OnceLock test fragility) — `SDK_SPAN_FORWARDER` is a
process-global `OnceLock`; a second test in the same integration
binary calling `init_sdk_span_forwarder` would silently no-op against
the first test's endpoint. Added a doc-comment on the static
explicitly warning future contributors to place new tests in their
own `engine/tests/*.rs` file and pointing at the
`#[cfg(test)] pub fn reset_for_test()` escape hatch if parameterised
endpoints are needed.

F3 (LOW, 50ms sleep flake) — replaced the fixed `tokio::time::sleep`
in `spawn_mock_collector` with a bounded `TcpStream::connect` probe
loop (2s deadline, `yield_now` between attempts). On loaded CI
runners the 50ms heuristic could lose the race with tonic's accept-
loop polling; the probe is deterministic.

F4 (LOW, swallowed panics) — replaced `.unwrap()` on the spawned
mock-server task with `.expect("mock OTLP collector failed to start;
see tonic transport error above")`. The detached task would silently
disappear on bind/serve failure; now the panic message surfaces.

F6 (LOW, pass-through fidelity) — added two assertions to the
forwarder test:
- `resource.attributes.len() == 5` catches a forwarder bug that
  ADDS attributes (e.g. injecting a hop-level `forwarder.id`).
- `span.name == "forwarder.preserves.resource"` catches a forwarder
  bug that rewrites the span name. Without this, a name-rewrite
  defect would pass all the existing attribute assertions.

Findings F5 (SAMPLED-default semantics) and F7 (-374 net lines) were
triaged as already-addressed — F5 is documented in commit `76f0b05e`
and the round-1 reviewer.md; F7 is explained in the PR description as
the dead-code removal of `convert_otlp_to_span_data`.

Bridgebuilder review preserved on PR thread for audit trail; full
findings at iii-hq#1618 (review).

Sweep: 1542 lib tests pass, 1 forwarder integration test pass with
extended assertions (5-attr count + span-name check), cargo fmt
clean, clippy clean on touched files.

Ridden with [Loa](https://github.com/0xHoneyJar/loa)
@deep-name
Copy link
Copy Markdown
Contributor Author

local self-review summary before maintainer queue:

ran our own bridgebuilder pass (claude-opus-4-7 via custom reviewer persona) plus two rounds of adversarial cross-model dissent (gpt-5.3-codex for both review + audit phases). 4 model-passes total, 3 distinct concerns surfaced + addressed before this hit your queue:

Round Tool Model Verdict
review-sprint round 1 adversarial-review (review) gpt-5.3-codex BLOCKING: DISS-001 (TLS regression for https:// collectors), DISS-002 (trace-flags default mismatch)
review-sprint round 2 adversarial-review (review) gpt-5.3-codex clean — both blockers resolved
audit-sprint adversarial-review (audit, no reviewer context) gpt-5.3-codex clean
local-self-review bridgebuilder claude-opus-4-7 F1-F4 + F6 addressed in commit 1bc8af32 (dep placement, OnceLock test isolation, sleep→connect-probe, .unwrap→.expect, pass-through fidelity assertions)

artifacts kept in our state zone for traceability:

  • grimoires/loa/a2a/sprint-bug-1/engineer-feedback.md
  • grimoires/loa/a2a/sprint-bug-1/auditor-sprint-feedback.md
  • grimoires/loa/a2a/sprint-bug-1/adversarial-review.json
  • grimoires/loa/a2a/sprint-bug-1/adversarial-audit.json
  • the bridgebuilder review above (8 findings, of which 5 actionable, 1 praise, 2 already-addressed)

ready when you are. happy to drop the bridgebuilder bot comment above if it's noisy — it's preserved here as the audit trail showing the issues were caught locally, not surfaced by your team's review cycle.

@deep-name deep-name marked this pull request as ready for review May 11, 2026 04:40
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
engine/src/workers/observability/otel.rs (1)

1775-1781: ⚡ Quick win

Comment overstates "parity" — in-memory storage doesn't default flags to SAMPLED.

The doc claims the SAMPLED default mirrors "the in-memory storage path's default at the top of ingest_otlp_json", but the storage path stores flags: span.flags unchanged (Option) — missing flags stay None there, not 1. The forwarder-side default to 1 is defensible on its own (the proto field is non-optional u32, and defaulting to 0 would mark all SDK-origin spans as unsampled at downstream collectors), and the integration test pins it. Just trim the parity framing so future readers don't go looking for a matching default in the storage code that doesn't exist.

📝 Suggested wording tweak
-        // Parity with the in-memory storage path's default at the top of
-        // `ingest_otlp_json` — a span arriving without an explicit `flags`
-        // field is treated as SAMPLED (W3C trace-flags bit 0 = 0x01).
-        // Defaulting to 0 here would mark such spans as unsampled on the
-        // forwarder hop while storage records them as sampled, producing
-        // inconsistent behaviour across the two halves of the same call.
+        // A span arriving without an explicit `flags` field is treated
+        // as SAMPLED (W3C trace-flags bit 0 = 0x01). The proto field is
+        // a non-optional u32; defaulting to 0 here would mark every
+        // SDK-origin span without explicit flags as unsampled at
+        // downstream samplers/collectors that honour the SAMPLED bit.
+        // The in-memory storage path keeps `flags` as Option<u32> and
+        // serialises it via `skip_serializing_if`, so the two halves
+        // express the same intent in different shapes.
         flags: span.flags.unwrap_or(1),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@engine/src/workers/observability/otel.rs` around lines 1775 - 1781, Update
the comment next to the flags assignment (flags: span.flags.unwrap_or(1) in
otel.rs) to remove the incorrect claim of parity with the in-memory storage
path; instead note that the forwarder defaults to 1 because the proto field is
non-optional and defaulting to 0 would mark SDK-origin spans unsampled, and
clarify that the storage path (ingest_otlp_json) preserves Option<u32>
(span.flags) and does not default missing flags to 1.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@engine/src/workers/observability/otel.rs`:
- Around line 1775-1781: Update the comment next to the flags assignment (flags:
span.flags.unwrap_or(1) in otel.rs) to remove the incorrect claim of parity with
the in-memory storage path; instead note that the forwarder defaults to 1
because the proto field is non-optional and defaulting to 0 would mark
SDK-origin spans unsampled, and clarify that the storage path (ingest_otlp_json)
preserves Option<u32> (span.flags) and does not default missing flags to 1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8ccda45d-028d-4d49-9a7b-ed43547dd6a0

📥 Commits

Reviewing files that changed from the base of the PR and between 9daa7bc and 1bc8af3.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • engine/Cargo.toml
  • engine/src/workers/observability/otel.rs
  • engine/tests/otel_forwarder_resource_test.rs

@deep-name
Copy link
Copy Markdown
Contributor Author

status update for the maintainer queue:

#1618 stays scoped to the forwarder. ready for review.

@ytallo ytallo self-requested a review May 12, 2026 00:38
@ytallo
Copy link
Copy Markdown
Contributor

ytallo commented May 12, 2026

Hey @deep-name, thanks for your contribution one more time. Can you consider add these tests on your PR?

Feature: SDK span forwarder honors OTEL_EXPORTER_OTLP_HEADERS
  As an operator running iii behind an auth-required OTLP collector
    (Grafana Cloud, Honeycomb, SigNoz Cloud, or any API-keyed deployment)
  I want SDK-ingested spans forwarded by the engine to carry the same
    auth metadata as engine-originated spans
  So that worker-forwarded telemetry does not silently get rejected at
    the collector while engine telemetry keeps flowing

  Background:
    Given a mock OTLP gRPC collector listening on an ephemeral loopback port
    And the collector captures the tonic MetadataMap of every Export request

  Scenario: Authorization header from env var lands on the wire
    Given the environment variable OTEL_EXPORTER_OTLP_HEADERS is set to
      "authorization=Bearer-test-token-12345"
    And the SDK span forwarder has been initialized against the mock collector
    When a well-formed OTLP/JSON payload with one valid span is ingested via
      ingest_otlp_json
    Then the mock collector receives exactly one ExportTraceServiceRequest
    And that request carries an "authorization" metadata entry
    And
Feature: One bad span does not poison the rest of the forwarded batch
  As an operator whose SDKs occasionally emit malformed trace/span IDs
    (instrumentation bugs, truncated context propagation, custom emitters)
  I want the forwarder to drop only the malformed spans
  So that valid telemetry in the same payload still reaches the collector
    instead of being rejected wholesale when one span has empty ID bytes

  Background:
    Given a mock OTLP gRPC collector listening on an ephemeral loopback port
    And the collector captures every ExportTraceServiceRequest it receives

  Scenario: Mixed batch of valid and malformed spans is partially forwarded
    Given the SDK span forwarder has been initialized against the mock collector
    And an OTLP/JSON payload contains two spans:
      | name           | trace_id                          | span_id            | validity |
      | bad-trace-id   | not-hex                           | also-not-hex       | invalid  |
      | valid-span     | 0af7651916cd43dd8448eb211c80319c  | b7ad6b7169203331   | valid    |
    When the payload is ingested via ingest_otlp_json
    Then the mock collector receives at least one ExportTraceServiceRequest
    And no span on the wire has an empty trace_id
    And no span on the wire has an empty span_id
    And the captured stream contains exactly one span
    And that span has name "valid-span"
    And that span has trace_id 0af7651916cd43dd8448eb211c80319c
    And that span has span_id b7ad6b7169203331

@deep-name
Copy link
Copy Markdown
Contributor Author

both scenarios are spot-on @ytallo — these are the next two regression classes the raw-tonic switch introduced beyond TLS. on it.

quick read so you know what's landing:

scenario 1 (OTEL_EXPORTER_OTLP_HEADERS): confirmed regression. opentelemetry_otlp::SpanExporter::builder().with_tonic() parses this env var automatically (per the OTel spec) and injects each key=value pair as a tonic MetadataMap entry on every export. our replacement channel doesn't. fix: parse the env var at init_sdk_span_forwarder time, stash the parsed pairs alongside the channel, attach a tonic interceptor that adds them to each TraceServiceClient::export request.

scenario 2 (one bad span doesn't poison the batch): the audit-sprint feedback called out the "relay malformed spans, let collector validate" tradeoff as deliberate, but you're right that empty-byte IDs on the wire are a regression vs the pre-fix convert_otlp_to_span_data filtering, and "valid spans get rejected with the batch" is a worse outcome than "malformed spans dropped silently". fix: otlp_span_to_proto becomes fallible — Option<ProtoSpan>, returns None when hex::decode fails on trace_id/span_id. caller drops Nones before building the ExportTraceServiceRequest. PR description gets updated to remove the "relays malformed spans" callout since we're restoring the pre-filter.

shipping both scenarios as runnable rust tests against the existing tonic mock + an HTTPS-fronted mock variant for header propagation. should have the commit batch up in the next hour or so.

… spans

Addresses both gherkin scenarios from @ytallo's iii-hq#1618 review:

Scenario 1 — OTEL_EXPORTER_OTLP_HEADERS regression. The pre-fix
`opentelemetry_otlp::SpanExporter::builder().with_tonic()` parsed
`OTEL_EXPORTER_OTLP_HEADERS` (and the signal-specific
`OTEL_EXPORTER_OTLP_TRACES_HEADERS` override) automatically per the
OTel SDK spec, injecting each `key=value` pair as a tonic
`MetadataMap` entry on every export. The raw-`tonic::Channel`
replacement in earlier commits lost that behaviour silently — every
API-keyed collector (Grafana Cloud, Honeycomb, SigNoz Cloud) was
about to reject forwarded spans at auth.

Changes for scenario 1:
- New `SdkForwarderState` struct wrapping the `tonic::Channel` plus
  the parsed `Vec<(MetadataKey<Ascii>, MetadataValue<Ascii>)>` pairs.
  Replaces the bare-`Channel` `OnceLock<Channel>` static.
- New `parse_otlp_headers_env()` reads
  `OTEL_EXPORTER_OTLP_TRACES_HEADERS` first (signal-specific takes
  precedence per OTel spec), falls back to `OTEL_EXPORTER_OTLP_HEADERS`.
  Comma-split on entries, `=`-split on key/value, invalid pairs
  warn-and-skip rather than fail init (matches `opentelemetry-otlp`'s
  tolerance).
- Forwarder dispatch builds a `tonic::Request` and attaches every
  parsed pair to `metadata_mut()` before calling
  `TraceServiceClient::export(...)`.

Scenario 2 — one bad span no longer poisons the batch. The previous
`otlp_span_to_proto` was infallible: it called `hex_decode_or_empty`
on malformed `trace_id` / `span_id` strings and emitted spans with
empty-byte IDs on the wire. Most collectors reject any batch
containing such spans wholesale, so a single bad SDK emitter
(instrumentation bug, truncated context propagation, custom client)
could blackhole otherwise-valid telemetry from the same payload.

Changes for scenario 2:
- `otlp_span_to_proto` now returns `Option<ProtoSpan>` and returns
  `None` whenever `trace_id` or `span_id` fail to hex-decode OR
  decode to wrong byte lengths (trace_id MUST be 16 bytes, span_id
  MUST be 8 — surfaced by adversarial review DISS-001 in this round
  as a gap in the original hex-only filter: `hex::decode("01")`
  succeeds and produces a 1-byte trace_id that collectors still
  reject).
- New `OTLP_TRACE_ID_BYTES`/`OTLP_SPAN_ID_BYTES` consts so the spec
  numbers have a single point of truth.
- `otlp_scope_spans_to_proto` now `filter_map`s over the spans,
  dropping `None`s so valid spans in the same batch survive.
- `parent_span_id` validates the same way but falls back to empty
  (rather than dropping the whole span) when malformed — the
  orphaned-parent case is recoverable downstream.
- `hex_decode_or_empty` helper deleted (only consumer is link
  metadata, which inlines `hex::decode(...).unwrap_or_default()` —
  links to malformed-ID remote spans are recoverable, the link's
  parent span itself was already validated).
- Forwarder-dispatch comment updated to reflect the new
  drop-malformed-spans contract (was previously "relays everything,
  collector validates"; now "drops malformed, valid spans survive").

Tests:
- `engine/tests/otel_forwarder_headers_test.rs` (new) — sets
  `OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer-test-token-12345,x-tenant=acme`
  before init, asserts both metadata entries land on the
  `ExportTraceServiceRequest`'s tonic `MetadataMap`. Single test per
  binary to avoid env-var contamination across parallel tests.
- `engine/tests/otel_forwarder_partial_batch_test.rs` (new) — ingests
  a three-span payload (non-hex IDs, hex-but-wrong-length IDs, valid
  IDs), asserts the collector receives exactly one span on the wire
  (the valid one) with spec-correct ID lengths (16/8 bytes).
- Existing `engine/tests/otel_forwarder_resource_test.rs` unchanged
  and still green; the spec-length filter doesn't disturb its
  happy-path payload.

Sweep: 1542 lib tests passed, 3 forwarder integration tests passed
(resource preservation + headers env var + partial-batch filtering),
cargo fmt + clippy clean on touched files. Adversarial cross-model
re-review (gpt-5.3-codex) returned `findings: []` / status `clean`
after the spec-length tightening.

Credits: @ytallo for the gherkin scenarios surfacing both regressions
on iii-hq#1618 before they reached production. @alsruf36 for the end-to-end
validation earlier in the review cycle that anchored the contract
this PR is pinning.

Ridden with [Loa](https://github.com/0xHoneyJar/loa)
Copy link
Copy Markdown
Contributor Author

@deep-name deep-name left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Analytical review of #1618. Enrichment pass was unavailable; findings are unenriched.

Findings

{
  "schema_version": 1,
  "findings": [
    {
      "id": "F1",
      "title": "Integration tests rely on global SDK_SPAN_FORWARDER OnceLock and cannot coexist",
      "severity": "MEDIUM",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "Both new integration test files (otel_forwarder_resource_test.rs and otel_forwarder_partial_batch_test.rs, plus the pre-existing otel_forwarder_headers_test.rs) call init_sdk_span_forwarder(&endpoint) which writes to a process-global OnceLock. The comment in each file asserts that 'Owns its own process via being an integration test file, so the process-global SDK_SPAN_FORWARDER OnceLock is fresh on every run.' This is true only because cargo compiles each integration test file in tests/ as a separate binary. However, if any test file in the future contains more than one #[tokio::test] that calls init_sdk_span_forwarder with a different endpoint, all subsequent tests will silently send to the first endpoint (OnceLock semantics) and the assertions about 'exactly one ExportTraceServiceRequest' will become non-deterministic across runs. There is no test-level guardrail (no serial_test marker, no assertion that the OnceLock was uninitialized) preventing future contributors from breaking this invariant.",
      "suggestion": "Either (a) add a doc-comment near init_sdk_span_forwarder warning that the OnceLock requires one-test-per-binary, with a brief explanation, or (b) expose a #[cfg(test)] reset helper so multiple tests per binary become safe. Option (b) is more robust.",
      "confidence": 0.75
    },
    {
      "id": "F2",
      "title": "Mock collector spawn task swallows panics on Server::builder errors after readiness",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The tokio::spawn task that runs the tonic Server uses .expect() on serve_with_incoming. If the server errors mid-test (e.g., port reclaimed, OS resource exhaustion), the spawned task panics in the background but the main test continues, producing confusing 'received 0 requests' failures rather than the underlying transport error. Same pattern in otel_forwarder_partial_batch_test.rs.",
      "suggestion": "Capture the JoinHandle and abort()/check it at end-of-test, or use a oneshot channel to surface server startup errors to the main test thread.",
      "confidence": 0.6
    },
    {
      "id": "F3",
      "title": "Readiness probe leaves a stray connect socket against tonic accept loop",
      "severity": "LOW",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The TCP connect-then-drop readiness probe at line ~62 establishes a real TCP connection to the tonic server that is then dropped without an HTTP/2 handshake. tonic/h2 may log a spurious warning ('connection error: not a valid HTTP/2 connection preface' or similar) on stderr for each test. Not a correctness bug but adds noise to test output.",
      "suggestion": "Acceptable as-is; if the noise becomes problematic, switch to a lightweight in-process readiness signal (oneshot sent from inside the spawn before serve_with_incoming).",
      "confidence": 0.4
    },
    {
      "id": "F4",
      "title": "Excellent regression-test design with explicit contract assertions",
      "severity": "PRAISE",
      "category": "testing",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "The resource-preservation test pins not only the originally-broken service.name field but also tests three AnyValue discriminator branches (string, int, bool), span-name pass-through, attribute count (catches injection), and SAMPLED-flag default behavior. The partial-batch test goes further: it pins both hex-decode-failure AND hex-decodes-but-wrong-length cases, citing the spec lengths (16 bytes / 8 bytes) explicitly. Failure messages reference issue numbers and explain the intended invariant, making future debugging significantly easier.",
      "suggestion": "No change required."
    },
    {
      "id": "F5",
      "title": "Production tonic dependency added with tls-native-roots — confirm intent",
      "severity": "LOW",
      "category": "dependencies",
      "file": "engine/Cargo.toml",
      "description": "tonic is added to the [dependencies] section (not [dev-dependencies]) with features = [\"transport\", \"tls-native-roots\"]. The new test files use tonic, but they only need tonic at dev-time. If the forwarder implementation in the truncated otel.rs needs tonic directly (not just transitively through opentelemetry-otlp which already pulls tonic), that is fine. Otherwise, this is dependency bloat and should move to [dev-dependencies]. Cannot fully verify without the truncated otel.rs diff.",
      "suggestion": "Verify whether engine/src/workers/observability/otel.rs imports tonic directly. If only the integration tests need it, move tonic + opentelemetry-proto to [dev-dependencies].",
      "confidence": 0.55
    },
    {
      "id": "F6",
      "title": "Typo in test comment",
      "severity": "LOW",
      "category": "nit",
      "file": "engine/tests/otel_forwarder_resource_test.rs",
      "description": "'preceeding' is misspelled (should be 'preceding') in the comment block at the attribute-count assertion.",
      "suggestion": "Fix spelling.",
      "confidence": 0.95
    },
    {
      "id": "F7",
      "title": "Duplicated mock-collector helper across three test files",
      "severity": "LOW",
      "category": "maintainability",
      "file": "engine/tests/otel_forwarder_partial_batch_test.rs",
      "description": "CapturingTraceService, TraceService impl, and spawn_mock_collector are duplicated verbatim across otel_forwarder_resource_test.rs, otel_forwarder_partial_batch_test.rs, and (per the truncated diff) otel_forwarder_headers_test.rs. Cargo integration tests can share code via a tests/common/mod.rs module; this would also enforce that the readiness/spawn semantics stay in lockstep across the suite.",
      "suggestion": "Extract the mock collector into engine/tests/common/mod.rs and have each integration test `mod common;` it.",
      "confidence": 0.85
    },
    {
      "id": "F8",
      "title": "Truncated otel.rs change is the actual fix — cannot verify behavior without it",
      "severity": "REFRAME",
      "category": "review-scope",
      "file": "engine/src/workers/observability/otel.rs",
      "description": "The substantive fix (449 added, 696 removed lines in engine/src/workers/observability/otel.rs) is truncated from the diff. The tests pin behavior but the production change itself — including the `otlp_span_to_proto` returning Option, the resource conversion path, and the SAMPLED-default logic — is unreviewed. Findings here cover only the test/dependency surface; the core fix needs a separate pass.",
      "suggestion": "Request the full otel.rs diff for the next review pass to verify: (1) hex-decode length checks on trace_id (16) and span_id (8), (2) resource attribute conversion preserves all AnyValue variants, (3) SAMPLED-default flag handling, (4) no unwrap()/expect() on adversarial input.",
      "confidence": 0.9
    }
  ]
}

Callouts

Enrichment unavailable for this review.

Addresses two Bridgebuilder F-findings from the local self-review pass
on PR iii-hq#1618 head a3c37fb:

F7 (LOW, maintainability) — `CapturingTraceService`, the `TraceService`
impl, the spawn helper, and the readiness probe were duplicated across
all three `otel_forwarder_*` integration tests (~60 lines each, three
copies). When the `SdkForwarderState` shape changed in earlier rounds
the three helpers drifted out of lockstep until the partial-batch test
caught it. Extracted to `engine/tests/common/forwarder_mock.rs` with
two variants:

- `spawn_body_capturing_collector` — used by the resource-preservation
  and partial-batch tests (both need the `ExportTraceServiceRequest`
  body).
- `spawn_metadata_capturing_collector` — used by the headers test
  (needs `tonic::MetadataMap` instead of body).

Both share the same readiness-probe contract (bounded
`TcpStream::connect` retry; see file-level doc-comment for the
rationale that replaced the 50ms-sleep heuristic earlier in the
review cycle). Registered in `engine/tests/common/mod.rs`
alongside the existing `http_helpers` / `queue_helpers` modules.

Net effect: tests/ shrinks by ~210 lines (210 deletions vs 32
insertions, after the new helper module is counted).

F6 (NIT, spelling) — "preceeding" → "preceding" in the comment block
on the attribute-count assertion in `otel_forwarder_resource_test.rs`.

Deferred F-findings (with rationale): F1 (OnceLock fragility) was
already addressed in round-3 via the doc-comment on
`SDK_SPAN_FORWARDER`; BB couldn't see otel.rs in the truncated diff.
F2/F3 (server task panics / readiness probe noise) — `.expect()` was
added in round-3 and the noise is acceptable per BB's own confidence
(0.4/0.6). F5 (dep placement) — `tonic` IS used in production code
(`Channel`, `Endpoint`, `MetadataKey/Value`, `ClientTlsConfig`); the
`[dependencies]` placement is correct. F8 (truncated otel.rs review
scope) — adversarial cross-model dissent (`gpt-5.3-codex`) reviewed
the full diff and returned `findings: []` after the round-4
spec-length tightening.

Sweep: 1542 lib tests passed, 3 forwarder integration tests passed
(all three now via the shared helper module), cargo fmt + clippy
clean on touched files.

Ridden with [Loa](https://github.com/0xHoneyJar/loa)
@deep-name
Copy link
Copy Markdown
Contributor Author

both gherkin scenarios closed end-to-end @ytallo. round summary across 2 commits:

a3c37fb — scenarios proper

scenario 1 (OTEL_EXPORTER_OTLP_HEADERS): new SdkForwarderState struct wraps the tonic channel + the parsed Vec<(MetadataKey<Ascii>, MetadataValue<Ascii>)> pairs. parse_otlp_headers_env() reads OTEL_EXPORTER_OTLP_TRACES_HEADERS first (signal-specific precedence per the OTel SDK spec), falls back to OTEL_EXPORTER_OTLP_HEADERS; comma-split + =-split + warn-and-skip on invalid pairs (matches opentelemetry-otlp's tolerance). forwarder dispatch attaches every parsed pair to tonic::Request::metadata_mut() before TraceServiceClient::export. test: engine/tests/otel_forwarder_headers_test.rs sets authorization=Bearer-test-token-12345,x-tenant=acme before init, asserts both metadata entries land on the captured request.

scenario 2 (one bad span doesn't poison the batch): otlp_span_to_proto is now fallible — returns None for spans whose trace_id/span_id fail to hex-decode OR decode to non-spec byte lengths (16 / 8). otlp_scope_spans_to_proto filter_maps them out. caught one gap mid-review: my initial filter was hex-only, so traceId: "01" would have decoded to a 1-byte trace_id and slipped through. spec-length check (OTLP_TRACE_ID_BYTES = 16 / OTLP_SPAN_ID_BYTES = 8) added — test now covers all three failure modes (non-hex, hex-but-wrong-length, valid) in one mixed batch.

380d684 — test maintainability nits from local bridgebuilder pass

extracted the duplicated mock-collector helpers into engine/tests/common/forwarder_mock.rs (spawn_body_capturing_collector + spawn_metadata_capturing_collector, plus the shared readiness probe). net tests/ shrinks by ~50 lines and keeps the three forwarder regression tests in lockstep when the forwarder shape changes in the future.

cross-model adversarial dissent (gpt-5.3-codex) was clean against the final state.

sweep: 1542 lib tests + 3 forwarder integration tests pass, cargo fmt + clippy clean on touched files. ready when you are.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@engine/src/workers/observability/otel.rs`:
- Around line 713-756: parse_otlp_headers_env currently uses header values
verbatim; update it to percent-decode the value string before constructing the
MetadataValue so OTLP percent-encoded headers (e.g. Authorization=Basic%20...)
are decoded per the spec: after trimming v in parse_otlp_headers_env, run a
percent-decode (e.g. via percent_encoding::percent_decode_str or similar) and if
decoding fails log a warning (like the existing MetadataValue error branch,
include header_name = k) and skip the entry; then pass the decoded string/bytes
into MetadataValue::<Ascii>::try_from instead of the original v; preserve the
existing header-name parsing/validation and warnings for invalid name/value.
- Around line 1926-1934: The otlp_link_to_proto function currently uses
hex::decode(...).unwrap_or_default() which can produce short byte vectors (e.g.,
1 byte) and cause OTLP collectors to reject the whole batch; change the trace_id
and span_id handling so you decode the hex string, check the decoded length
(trace_id must be exactly 16 bytes, span_id must be exactly 8 bytes), and if the
length is invalid replace with an empty Vec<u8> (preserving the previous
"recoverable metadata loss" behavior) instead of using unwrap_or_default; update
the trace_id/span_id expressions in otlp_link_to_proto to perform this
decode+length-check logic (and keep trace_state as
link.trace_state.clone().unwrap_or_default()).

In `@engine/tests/otel_forwarder_headers_test.rs`:
- Around line 99-170: The test
forwarder_injects_otel_exporter_otlp_headers_into_export_metadata must
explicitly guard against existing OTEL_EXPORTER_OTLP_TRACES_HEADERS precedence:
capture the current values of OTEL_EXPORTER_OTLP_HEADERS and
OTEL_EXPORTER_OTLP_TRACES_HEADERS, clear or set both appropriately before
calling iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint)
(ensuring the desired OTEL_EXPORTER_OTLP_HEADERS is set and
OTEL_EXPORTER_OTLP_TRACES_HEADERS is cleared), and then restore the original
values at the end of the test; this guarantees deterministic header precedence
for the assertions without changing init_sdk_span_forwarder itself.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5a3a7220-e44a-4749-8520-077514a8310d

📥 Commits

Reviewing files that changed from the base of the PR and between 1bc8af3 and a3c37fb.

📒 Files selected for processing (3)
  • engine/src/workers/observability/otel.rs
  • engine/tests/otel_forwarder_headers_test.rs
  • engine/tests/otel_forwarder_partial_batch_test.rs

Comment on lines +713 to +756
fn parse_otlp_headers_env() -> Vec<(MetadataKey<Ascii>, MetadataValue<Ascii>)> {
let raw = std::env::var("OTEL_EXPORTER_OTLP_TRACES_HEADERS")
.or_else(|_| std::env::var("OTEL_EXPORTER_OTLP_HEADERS"))
.ok();
let Some(raw) = raw else {
return Vec::new();
};
if raw.trim().is_empty() {
return Vec::new();
}
raw.split(',')
.filter_map(|pair| {
let pair = pair.trim();
if pair.is_empty() {
return None;
}
}
let (k, v) = pair.split_once('=')?;
let k = k.trim();
let v = v.trim();
let key = match MetadataKey::<Ascii>::from_bytes(k.as_bytes()) {
Ok(k) => k,
Err(e) => {
tracing::warn!(
error = %e,
header_name = k,
"OTEL_EXPORTER_OTLP_HEADERS entry has invalid header name; skipping"
);
return None;
}
};
let value = match MetadataValue::<Ascii>::try_from(v) {
Ok(v) => v,
Err(e) => {
tracing::warn!(
error = %e,
header_name = k,
"OTEL_EXPORTER_OTLP_HEADERS entry has invalid header value; skipping"
);
return None;
}
};
Some((key, value))
})
.collect()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS, are values required to be URL-decoded before sending request headers?

💡 Result:

No. The OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS values are not something you should “URL-decode” as part of sending the HTTP/gRPC request headers. What the spec requires is that the environment variable value itself is formatted like W3C Baggage (key=value,key2=value2), and that the value portion is URL-encoded; SDKs should percent-decode that value into the actual header value they send on the wire. The header key should not be percent-decoded; only the value is treated as URL-encoded baggage. [1][2][3] In practice: if you set OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic%20..." then the SDK should decode %20 to a space before sending the Authorization header. [4][5][6][7] So: you (the user) should provide URL-encoded header values in the env var; the SDK performs the decoding when constructing the actual request headers. You generally do not URL-decode yourself before exporting the env vars. [1][2][3]

Citations:


Add percent-decoding for OTEL header environment variable values per OTLP specification.

The OTEL_EXPORTER_OTLP_HEADERS and OTEL_EXPORTER_OTLP_TRACES_HEADERS environment variables require header values to be percent-encoded by the user; the SDK must percent-decode them before constructing request headers. The current code (lines 723–745) parses values verbatim without decoding, which breaks headers with encoded characters (e.g., Authorization=Basic%20... will send literal %20 instead of a space). This violates the OTLP specification and causes interoperability issues with collectors.

Proposed fix
             let v = v.trim();
+            let v = match urlencoding::decode(v) {
+                Ok(decoded) => decoded.into_owned(),
+                Err(e) => {
+                    tracing::warn!(
+                        error = %e,
+                        header_name = k,
+                        "OTEL_EXPORTER_OTLP_HEADERS entry has invalid percent-encoding; skipping"
+                    );
+                    return None;
+                }
+            };
 
             let value = match MetadataValue::<Ascii>::try_from(v.as_str()) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@engine/src/workers/observability/otel.rs` around lines 713 - 756,
parse_otlp_headers_env currently uses header values verbatim; update it to
percent-decode the value string before constructing the MetadataValue so OTLP
percent-encoded headers (e.g. Authorization=Basic%20...) are decoded per the
spec: after trimming v in parse_otlp_headers_env, run a percent-decode (e.g. via
percent_encoding::percent_decode_str or similar) and if decoding fails log a
warning (like the existing MetadataValue error branch, include header_name = k)
and skip the entry; then pass the decoded string/bytes into
MetadataValue::<Ascii>::try_from instead of the original v; preserve the
existing header-name parsing/validation and warnings for invalid name/value.

Comment on lines +1926 to +1934
fn otlp_link_to_proto(link: &OtlpSpanLink) -> opentelemetry_proto::tonic::trace::v1::span::Link {
// Links to malformed-ID remote spans fall back to empty bytes here
// (rather than dropping the link) — the parent span itself was already
// validated by `otlp_span_to_proto`, so a bad link is a recoverable
// metadata loss, not a wholesale span-loss event.
opentelemetry_proto::tonic::trace::v1::span::Link {
trace_id: hex::decode(&link.trace_id).unwrap_or_default(),
span_id: hex::decode(&link.span_id).unwrap_or_default(),
trace_state: link.trace_state.clone().unwrap_or_default(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In OTLP Trace Span.Link, must trace_id be exactly 16 bytes and span_id exactly 8 bytes, and can collectors reject entire exports when these lengths are invalid?

💡 Result:

Yes, in OTLP Trace Span.Link, trace_id must be exactly 16 bytes and span_id exactly 8 bytes, as defined in the official OpenTelemetry protobuf schema for the Span.Link message: "A unique identifier of a trace that this linked span is part of. The ID is a 16-byte array." for trace_id and "A unique identifier for the linked span. The ID is an 8-byte array." for span_id [1][2]. This matches the specification for Span trace_id and span_id [3][4][5]. These fields are bytes with fixed semantic lengths; incorrect lengths violate the spec and are treated as invalid [6][7].

Citations:


🏁 Script executed:

cat -n engine/src/workers/observability/otel.rs | sed -n '1920,1940p'

Repository: iii-hq/iii

Length of output: 1121


🏁 Script executed:

rg "OTLP_TRACE_ID_BYTES|OTLP_SPAN_ID_BYTES" engine/src/workers/observability/otel.rs

Repository: iii-hq/iii

Length of output: 281


🏁 Script executed:

rg "fn otlp_span_to_proto" engine/src/workers/observability/otel.rs -A 20

Repository: iii-hq/iii

Length of output: 1053


🏁 Script executed:

rg "\.len\(\)" engine/src/workers/observability/otel.rs | head -20

Repository: iii-hq/iii

Length of output: 924


Add link ID length validation to prevent malformed IDs from poisoning batch exports.

The otlp_link_to_proto function accepts hex-decoded IDs of any length via unwrap_or_default(). Short IDs like hex::decode("01") produce 1-byte values that violate the OTLP spec (trace_id must be 16 bytes, span_id must be 8 bytes) and trigger collector-side rejection of the entire export. The parent function otlp_span_to_proto already validates these lengths correctly; otlp_link_to_proto must do the same.

Proposed fix
 fn otlp_link_to_proto(link: &OtlpSpanLink) -> opentelemetry_proto::tonic::trace::v1::span::Link {
     // Links to malformed-ID remote spans fall back to empty bytes here
     // (rather than dropping the link) — the parent span itself was already
     // validated by `otlp_span_to_proto`, so a bad link is a recoverable
     // metadata loss, not a wholesale span-loss event.
+    let trace_id = hex::decode(&link.trace_id)
+        .ok()
+        .filter(|b| b.len() == OTLP_TRACE_ID_BYTES)
+        .unwrap_or_default();
+    let span_id = hex::decode(&link.span_id)
+        .ok()
+        .filter(|b| b.len() == OTLP_SPAN_ID_BYTES)
+        .unwrap_or_default();
+
     opentelemetry_proto::tonic::trace::v1::span::Link {
-        trace_id: hex::decode(&link.trace_id).unwrap_or_default(),
-        span_id: hex::decode(&link.span_id).unwrap_or_default(),
+        trace_id,
+        span_id,
         trace_state: link.trace_state.clone().unwrap_or_default(),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@engine/src/workers/observability/otel.rs` around lines 1926 - 1934, The
otlp_link_to_proto function currently uses hex::decode(...).unwrap_or_default()
which can produce short byte vectors (e.g., 1 byte) and cause OTLP collectors to
reject the whole batch; change the trace_id and span_id handling so you decode
the hex string, check the decoded length (trace_id must be exactly 16 bytes,
span_id must be exactly 8 bytes), and if the length is invalid replace with an
empty Vec<u8> (preserving the previous "recoverable metadata loss" behavior)
instead of using unwrap_or_default; update the trace_id/span_id expressions in
otlp_link_to_proto to perform this decode+length-check logic (and keep
trace_state as link.trace_state.clone().unwrap_or_default()).

Comment on lines +99 to +170
async fn forwarder_injects_otel_exporter_otlp_headers_into_export_metadata() {
// Order is load-bearing: the env var MUST be set BEFORE
// init_sdk_span_forwarder is called, because the parsing happens at
// forwarder-init time (matching the upstream `opentelemetry-otlp`
// builder semantics). Setting it post-init would not retroactively
// populate `SdkForwarderState.headers`.
//
// # Safety: this integration test binary is single-threaded with
// respect to env-var mutation — only one test exists in the file,
// and `SDK_SPAN_FORWARDER`'s `OnceLock` ensures we only init once.
unsafe {
std::env::set_var(
"OTEL_EXPORTER_OTLP_HEADERS",
"authorization=Bearer-test-token-12345,x-tenant=acme",
);
}

let (endpoint, received_metadata) = spawn_mock_collector().await;

iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint);

iii::workers::observability::otel::ingest_otlp_json(MINIMAL_PAYLOAD)
.await
.expect("ingest_otlp_json should accept a well-formed payload");

let deadline = std::time::Instant::now() + Duration::from_secs(5);
while std::time::Instant::now() < deadline {
if !received_metadata.lock().await.is_empty() {
break;
}
tokio::time::sleep(Duration::from_millis(20)).await;
}

let captured = received_metadata.lock().await;
assert_eq!(
captured.len(),
1,
"Mock collector should have received exactly one ExportTraceServiceRequest"
);

let metadata = &captured[0];

// The authorization header from OTEL_EXPORTER_OTLP_HEADERS must
// land on the wire as a tonic Ascii metadata entry. Without this
// the upstream-spec env var is silently ignored and every API-keyed
// collector rejects forwarded spans at auth.
let auth = metadata
.get("authorization")
.expect("authorization metadata MUST be set from OTEL_EXPORTER_OTLP_HEADERS");
assert_eq!(
auth.to_str().unwrap(),
"Bearer-test-token-12345",
"authorization metadata value did not match OTEL_EXPORTER_OTLP_HEADERS"
);

// Second pair confirms the comma-split parser handles multiple
// entries (single-pair could pass with a naive `=`-only split).
let tenant = metadata
.get("x-tenant")
.expect("x-tenant metadata MUST be set; comma-split parser regression?");
assert_eq!(
tenant.to_str().unwrap(),
"acme",
"x-tenant metadata value did not match OTEL_EXPORTER_OTLP_HEADERS"
);

// Cleanup so a future test binary inheriting this process's env (if
// any) doesn't leak the test credential. # Safety: same single-test
// file rationale as the set_var above.
unsafe {
std::env::remove_var("OTEL_EXPORTER_OTLP_HEADERS");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test isolation is incomplete for header precedence.

Because OTEL_EXPORTER_OTLP_TRACES_HEADERS overrides OTEL_EXPORTER_OTLP_HEADERS, this test can flake if the process inherits *_TRACES_HEADERS. Clear (and restore) both vars around init to make the assertion deterministic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@engine/tests/otel_forwarder_headers_test.rs` around lines 99 - 170, The test
forwarder_injects_otel_exporter_otlp_headers_into_export_metadata must
explicitly guard against existing OTEL_EXPORTER_OTLP_TRACES_HEADERS precedence:
capture the current values of OTEL_EXPORTER_OTLP_HEADERS and
OTEL_EXPORTER_OTLP_TRACES_HEADERS, clear or set both appropriately before
calling iii::workers::observability::otel::init_sdk_span_forwarder(&endpoint)
(ensuring the desired OTEL_EXPORTER_OTLP_HEADERS is set and
OTEL_EXPORTER_OTLP_TRACES_HEADERS is cleared), and then restore the original
values at the end of the test; this guarantees deterministic header precedence
for the assertions without changing init_sdk_span_forwarder itself.

@ytallo
Copy link
Copy Markdown
Contributor

ytallo commented May 13, 2026

This PR touches engine code. For us to accept it, we need your explicit confirmation in this thread that you license your changes to the engine portion of the codebase under Apache 2.0.

Please reply with:
“I agree that my contributions in this PR to the engine code are licensed under Apache 2.0.”

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OTLP span forwarding drops resource attributes, causing service.name to appear as <nil>

2 participants