Skip to content

LLM cost tracking with built-in pricing#49

Open
mert-cemri wants to merge 2 commits into
skydiscover-ai:mainfrom
mert-cemri:cost_tracking
Open

LLM cost tracking with built-in pricing#49
mert-cemri wants to merge 2 commits into
skydiscover-ai:mainfrom
mert-cemri:cost_tracking

Conversation

@mert-cemri

Copy link
Copy Markdown
Collaborator

Summary

  • Track input/output tokens and USD cost per LLM call, aggregated across the full discovery run
  • Built-in pricing table for 50+ models (OpenAI GPT-5.x/4.x/o-series, Gemini, Claude) with prefix matching; config overrides take precedence
  • Thread-safe CostTracker shared across generation, evaluator, and guide pools (including Evox nested controllers and agentic mode)
  • CLI dashboard at run end showing best score + cost breakdown by category; llm_cost_summary.json saved to output dir; DiscoveryResult.llm_cost_summary for API users

Bug fixes

  • AdaEvolveController.__init__ was re-creating self.llms without the cost_tracker, silently dropping all tracking in adaevolve mode
  • Agentic mode was discarding aggregated usage when wrapping results into LLMResponse
  • Fixed indentation in _run_iteration kwargs unpacking

Test plan

  • 168 tests pass (including 9 new cost tracking tests + all upstream tests after rebase)
  • Verified with live gpt-5 runs on circle_packing and signal_processing (adaevolve, 5 iterations each)
  • Confirmed dollar costs auto-resolve from built-in table when no config pricing is set

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements comprehensive LLM cost and token usage tracking by introducing a CostTracker utility and expanding the LLMResponse data structure. It adds pricing parameters to model configurations and updates the OpenAI provider to calculate costs using either explicit settings or a built-in pricing table. The CLI and API now report these metrics in their summaries. A review comment identifies an opportunity to reduce code duplication in skydiscover/llm/openai.py by extracting a helper method for constructing LLMResponse objects, which would improve maintainability across the various API interaction methods.

Comment thread skydiscover/llm/openai.py Outdated
Comment on lines +199 to +216
input_tokens, output_tokens, total_tokens = extract_usage_counts(getattr(response, "usage", None))
input_cost_usd, output_cost_usd, total_cost_usd = compute_token_costs(
input_tokens,
output_tokens,
self.input_price_per_million_tokens,
self.output_price_per_million_tokens,
)
return LLMResponse(
text=response.choices[0].message.content or "",
model_name=getattr(response, "model", None) or self.model,
usage_source="api" if getattr(response, "usage", None) is not None else None,
input_tokens=input_tokens,
output_tokens=output_tokens,
total_tokens=total_tokens,
input_cost_usd=input_cost_usd,
output_cost_usd=output_cost_usd,
total_cost_usd=total_cost_usd,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block of code for creating an LLMResponse with cost and usage information is repeated in _call_api, _call_api_via_responses, and _generate_with_image. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method.

For example, you could create a _build_llm_response method:

def _build_llm_response(
    self,
    response_obj: Any,
    text: str,
    image_path: Optional[str] = None,
) -> LLMResponse:
    input_tokens, output_tokens, total_tokens = extract_usage_counts(
        getattr(response_obj, "usage", None)
    )
    input_cost_usd, output_cost_usd, total_cost_usd = compute_token_costs(
        input_tokens,
        output_tokens,
        self.input_price_per_million_tokens,
        self.output_price_per_million_tokens,
    )
    return LLMResponse(
        text=text,
        image_path=image_path,
        model_name=getattr(response_obj, "model", None) or self.model,
        usage_source="api" if getattr(response_obj, "usage", None) is not None else None,
        input_tokens=input_tokens,
        output_tokens=output_tokens,
        total_tokens=total_tokens,
        input_cost_usd=input_cost_usd,
        output_cost_usd=output_cost_usd,
        total_cost_usd=total_cost_usd,
    )

Then you could simplify this method and the others to just call this helper, for instance:

return self._build_llm_response(response, response.choices[0].message.content or "")
References
  1. When accessing configurations that might return None, it is acceptable to use defensive patterns like 'or []' even if getattr has a default value, as this guards against explicit null values in the configuration.

@mert-cemri mert-cemri changed the title Add end-to-end LLM cost tracking with built-in pricing LLM cost tracking with built-in pricing Mar 28, 2026
shubham3-ucb
shubham3-ucb previously approved these changes Mar 28, 2026

@shubham3-ucb shubham3-ucb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Superseded by final review below.)

@shubham3-ucb shubham3-ucb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Superseded by final review below.)

@shubham3-ucb shubham3-ucb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Superseded by final review below.)

@shubham3-ucb shubham3-ucb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we tested

Unit tests: 168/168 pass. 60 additional microtests (thread safety, prefix matching, float precision, empty tracker, agentic aggregation, partial price overrides). All pass.

Live runs on circle_packing (2 iters each):

Model Strategy Calls Tokens (in/out/total) Reported cost Correct?
gpt-4o-mini topk 2 11,371 / 1,424 / 12,795 $0.0026 Yes
gpt-4.1-nano adaevolve 2 14,510 / 3,226 / 17,736 $0.0027 Yes
gemini-2.5-flash topk 2 13,083 / 3,749 / 33,937 $0.0000 No (see #1 and #2)
gemini-3-pro-preview topk 4 43,572 / 4,749 / 120,112 $0.1441 No (see #1)
claude-sonnet-4-6 topk 3 20,792 / 13,936 / 34,728 $0.2714 Yes

Also verified: empty tracker (all LLM calls fail) renders cleanly, JSON output and checkpoint cost snapshots correct.

Code review

Architecture is clean. Traced every path:

  • No double-counting in agentic mode
  • AdaEvolve pool re-creation bug fix is correct
  • Evox nested controller passes cost_tracker through correctly
  • Thread safety is solid
  • Prefix matching handles all edge cases

Category tracking across strategies

Checked what LLM call types each strategy uses and whether they show up separately in by_category:

Default/topk: generation only (unless llm_as_judge is on, then also evaluator).

AdaEvolve: generation (main iteration calls) + guide (paradigm breakthrough calls when stagnating) + evaluator (if llm_as_judge). All tracked separately.

Evox: The main solution controller and the nested search evolution controller share the same cost_tracker, which is correct for total cost. But both controllers label their pools with the same category names (generation, guide, evaluator), so the breakdown doesn't distinguish solution-side vs search-side calls. Same applies to label/variation operator generation (line 300) which goes through search_controller.guide_llms and lands in guide.

Not a bug (totals are right), but worth considering: the search controller's pools could use distinct categories like search_generation, search_guide so the Evox cost breakdown shows where money actually went.

Two things to fix

1. Gemini thinking tokens not billed

Gemini does not include thinking tokens in completion_tokens. They only show up in total_tokens:

prompt_tokens=11, completion_tokens=1, total_tokens=104

On the gemini-3-pro run, reported cost was $0.14 but actual cost is ~$1.00 (71,791 thinking tokens unbilled). 7x undercount. OpenAI and Anthropic are fine (they include everything in completion_tokens).

Simplest fix: use total_tokens - prompt_tokens as the output token count for cost calculation. This gives the correct answer for all providers:

Provider completion_tokens total - prompt Thinking gap
OpenAI (incl. o-series) 83 83 0
Gemini thinking 1 93 92
Anthropic 10 10 0

2. No warning when pricing is unknown

gemini-2.5-flash silently reports $0 with no indication pricing is missing.

Fixes:

  • Add gemini-2.5-pro and gemini-2.5-flash to DEFAULT_MODEL_PRICING
  • Log a warning in OpenAILLM.__init__ when pricing can't be resolved:
if self.input_price_per_million_tokens is None:
    logger.warning("No pricing for model '%s', costs will show $0. Set input/output_price_per_million_tokens in config.", model_cfg.name)

Both are small. Happy to re-review once they're in.

@shubham3-ucb shubham3-ucb dismissed stale reviews from themself March 28, 2026 21:27

Superseded by final review.

…mport

- simulator.py: __transfer_time was not computing actual transfer times.
  The inner loop iterated over path edges but never accumulated a result,
  leaving partition_time as -inf. This cascaded into __total_cost via
  runtime_s, producing -inf total costs. Fixed by computing the bottleneck
  (minimum flow) along each path and deriving transfer time from it.

- evaluator.py: Removed stale top-level `from initial_program import
  search_algorithm` which fails inside Docker (initial_program is not on
  the Python path in the container). The import is unnecessary since
  programs are loaded dynamically via importlib in evaluate().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants