Centaur Codex execution summaries do not project useful command/tool/text observations #359

@GoonMachine

Summary

Codex executions can contain substantial command and assistant activity in raw events, while Centaur execution summaries report zero commands, zero tool results, zero assistant text, and zero token usage.

I have lower confidence in this one than in the other two issues. It may reflect incomplete Codex telemetry support, or it may be expected behavior, but given the other Codex parsing/delivery issues, it looks like Centaur is under-projecting some Codex event shapes.

Observed

Example execution: exe_69d2b0c925414cd1

Raw event shape counts:

raw_event_count: 2748

item.agentMessage.delta: 1937
item.commandExecution.outputDelta: 468
item.started commandExecution: 82
item.completed commandExecution: 82
item.completed agentMessage: 44
item.started agentMessage: 44
item.completed reasoning: 42
item.started reasoning: 42

But the execution summary reported:

observation_event_count: 0
assistant_text_events: 0
assistant_text_chars: 0
command_events: 0
tool_result_events: 0
tool_error_events: 0
file_change_events: 0
total_tokens: 0

Other Codex executions showed the same pattern:

exe_f1c44cb9a8f64852: raw_event_count 1411, observation_event_count 0
exe_cc37103e753f4122: raw_event_count 800, observation_event_count 0
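
The pattern above can be checked mechanically. The following is a minimal sketch, assuming execution summaries are available as dictionaries; the `execution_id` key and the `find_under_projected` helper are hypothetical names for illustration, not part of Centaur. The counter names and values are the ones reported above.

```python
from typing import Any, Iterable, Mapping


def find_under_projected(summaries: Iterable[Mapping[str, Any]]) -> list[str]:
    """Return execution ids that have raw events but no projected observations."""
    flagged = []
    for summary in summaries:
        raw = int(summary.get("raw_event_count") or 0)
        observed = int(summary.get("observation_event_count") or 0)
        if raw > 0 and observed == 0:
            # "execution_id" is a hypothetical key name for this sketch.
            flagged.append(str(summary["execution_id"]))
    return flagged


# Values taken from the executions listed in this issue.
summaries = [
    {"execution_id": "exe_69d2b0c925414cd1", "raw_event_count": 2748, "observation_event_count": 0},
    {"execution_id": "exe_f1c44cb9a8f64852", "raw_event_count": 1411, "observation_event_count": 0},
    {"execution_id": "exe_cc37103e753f4122", "raw_event_count": 800, "observation_event_count": 0},
]
print(find_under_projected(summaries))
```

All three executions are flagged by this check, since each has a non-zero raw event count and zero observation events.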

Expected

If Centaur stores Codex raw events, the execution summary should project useful observation counters from those events where possible.

For example:

  • item.completed with item.type = commandExecution should count as command execution activity.
  • item.commandExecution.outputDelta should be usable for command output diagnostics.
  • item.agentMessage.delta, or item.completed with item.type = agentMessage, should count as assistant text.
  • file changes and usage/token events should be projected if present.
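
The projection described in the list above might look roughly like the sketch below. It is not Centaur's implementation: it assumes each raw event is a dictionary with a `type` field and an optional nested `item`, and that the field names `text` and `delta` carry message and output content. Those names and `project_codex_counters` are assumptions for illustration. `command_output_chars` is a hypothetical extra counter, not one of the existing summary fields.

```python
from typing import Any, Iterable, Mapping


def project_codex_counters(raw_events: Iterable[Mapping[str, Any]]) -> dict[str, int]:
    """Derive summary counters from raw Codex events (event shape is assumed)."""
    counters = {
        "command_events": 0,
        "assistant_text_events": 0,
        "assistant_text_chars": 0,
        "command_output_chars": 0,  # hypothetical counter for output diagnostics
    }
    for event in raw_events:
        event_type = event.get("type")
        item = event.get("item") or {}
        item_type = item.get("type")
        if event_type == "item.completed" and item_type == "commandExecution":
            counters["command_events"] += 1
        elif event_type == "item.completed" and item_type == "agentMessage":
            counters["assistant_text_events"] += 1
            # "text" is an assumed field name for the completed message body.
            counters["assistant_text_chars"] += len(item.get("text") or "")
        elif event_type == "item.commandExecution.outputDelta":
            # "delta" is an assumed field name for the streamed output chunk.
            counters["command_output_chars"] += len(event.get("delta") or "")
    return counters
```

Counting only completed agentMessage items, rather than every delta, avoids double counting the same assistant text. Whether deltas or completed items are the better source is a design choice this sketch does not settle.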

Alternatively, if this is intentionally not supported for Codex, the behavior should be documented clearly so downstream workflows know to query raw Codex events directly.

Why It Matters

Trace workflows that look for failed or inefficient tool use cannot rely on Centaur summaries. They have to scrape raw Codex events or sandbox logs, even though the database already contains enough raw signal to identify command and tool behavior.

Possible Root Cause

The Codex normalizer appears to pass through many item.* events, while the observability projection expects canonical event types such as command_execution, assistant, tool, file_change, and usage.

Relevant code paths to inspect:

  • services/api/api/sandbox/normalize.py
  • services/api/api/observability.py
  • services/api/api/runtime_control.py
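
If the mismatch is as suspected, one possible fix is a small mapping step in the normalizer. The sketch below is only an assumption about how that could work: `CANONICAL_BY_ITEM_TYPE` and `canonical_type` are hypothetical names, the event shape (`type` plus nested `item.type`) is assumed, and only the two item types with an obvious canonical counterpart in this issue are mapped.

```python
from typing import Any, Mapping, Optional

# Assumed mapping from Codex item types to the canonical types named above.
CANONICAL_BY_ITEM_TYPE = {
    "commandExecution": "command_execution",
    "agentMessage": "assistant",
}


def canonical_type(event: Mapping[str, Any]) -> Optional[str]:
    """Return the canonical type for a completed Codex item, or None if unmapped."""
    if event.get("type") != "item.completed":
        return None
    item = event.get("item") or {}
    return CANONICAL_BY_ITEM_TYPE.get(item.get("type"))
```

Events that return None here, such as reasoning items or delta events, would still pass through unchanged, so this would not change what is stored, only what the projection can recognize.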
