LLM cost tracking with built-in pricing by mert-cemri · Pull Request #49 · skydiscover-ai/skydiscover

mert-cemri · 2026-03-28T19:50:14Z

Summary

Track input/output tokens and USD cost per LLM call, aggregated across the full discovery run
Built-in pricing table for 50+ models (OpenAI GPT-5.x/4.x/o-series, Gemini, Claude) with prefix matching; config overrides take precedence
Thread-safe CostTracker shared across generation, evaluator, and guide pools (including Evox nested controllers and agentic mode)
CLI dashboard at run end showing best score + cost breakdown by category; llm_cost_summary.json saved to output dir; DiscoveryResult.llm_cost_summary for API users

Bug fixes

AdaEvolveController.__init__ was re-creating self.llms without the cost_tracker, silently dropping all tracking in adaevolve mode
Agentic mode was discarding aggregated usage when wrapping results into LLMResponse
Fixed indentation in _run_iteration kwargs unpacking

Test plan

168 tests pass (including 9 new cost tracking tests + all upstream tests after rebase)
Verified with live gpt-5 runs on circle_packing and signal_processing (adaevolve, 5 iterations each)
Confirmed dollar costs auto-resolve from built-in table when no config pricing is set

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request implements comprehensive LLM cost and token usage tracking by introducing a CostTracker utility and expanding the LLMResponse data structure. It adds pricing parameters to model configurations and updates the OpenAI provider to calculate costs using either explicit settings or a built-in pricing table. The CLI and API now report these metrics in their summaries. A review comment identifies an opportunity to reduce code duplication in skydiscover/llm/openai.py by extracting a helper method for constructing LLMResponse objects, which would improve maintainability across the various API interaction methods.

gemini-code-assist · 2026-03-28T19:52:09Z

+        input_tokens, output_tokens, total_tokens = extract_usage_counts(getattr(response, "usage", None))
+        input_cost_usd, output_cost_usd, total_cost_usd = compute_token_costs(
+            input_tokens,
+            output_tokens,
+            self.input_price_per_million_tokens,
+            self.output_price_per_million_tokens,
+        )
+        return LLMResponse(
+            text=response.choices[0].message.content or "",
+            model_name=getattr(response, "model", None) or self.model,
+            usage_source="api" if getattr(response, "usage", None) is not None else None,
+            input_tokens=input_tokens,
+            output_tokens=output_tokens,
+            total_tokens=total_tokens,
+            input_cost_usd=input_cost_usd,
+            output_cost_usd=output_cost_usd,
+            total_cost_usd=total_cost_usd,
+        )


This block of code for creating an LLMResponse with cost and usage information is repeated in _call_api, _call_api_via_responses, and _generate_with_image. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method.

For example, you could create a _build_llm_response method:

def _build_llm_response( self, response_obj: Any, text: str, image_path: Optional[str] = None, ) -> LLMResponse: input_tokens, output_tokens, total_tokens = extract_usage_counts( getattr(response_obj, "usage", None) ) input_cost_usd, output_cost_usd, total_cost_usd = compute_token_costs( input_tokens, output_tokens, self.input_price_per_million_tokens, self.output_price_per_million_tokens, ) return LLMResponse( text=text, image_path=image_path, model_name=getattr(response_obj, "model", None) or self.model, usage_source="api" if getattr(response_obj, "usage", None) is not None else None, input_tokens=input_tokens, output_tokens=output_tokens, total_tokens=total_tokens, input_cost_usd=input_cost_usd, output_cost_usd=output_cost_usd, total_cost_usd=total_cost_usd, )

Then you could simplify this method and the others to just call this helper, for instance:

return self._build_llm_response(response, response.choices[0].message.content or "")

References

When accessing configurations that might return None, it is acceptable to use defensive patterns like 'or []' even if getattr has a default value, as this guards against explicit null values in the configuration.

shubham3-ucb

(Superseded by final review below.)

shubham3-ucb

(Superseded by final review below.)

shubham3-ucb

(Superseded by final review below.)

shubham3-ucb

What we tested

Unit tests: 168/168 pass. 60 additional microtests (thread safety, prefix matching, float precision, empty tracker, agentic aggregation, partial price overrides). All pass.

Live runs on circle_packing (2 iters each):

Model	Strategy	Calls	Tokens (in/out/total)	Reported cost	Correct?
gpt-4o-mini	topk	2	11,371 / 1,424 / 12,795	$0.0026	Yes
gpt-4.1-nano	adaevolve	2	14,510 / 3,226 / 17,736	$0.0027	Yes
gemini-2.5-flash	topk	2	13,083 / 3,749 / 33,937	$0.0000	No (see #1 and #2)
gemini-3-pro-preview	topk	4	43,572 / 4,749 / 120,112	$0.1441	No (see #1)
claude-sonnet-4-6	topk	3	20,792 / 13,936 / 34,728	$0.2714	Yes

Also verified: empty tracker (all LLM calls fail) renders cleanly, JSON output and checkpoint cost snapshots correct.

Code review

Architecture is clean. Traced every path:

No double-counting in agentic mode
AdaEvolve pool re-creation bug fix is correct
Evox nested controller passes cost_tracker through correctly
Thread safety is solid
Prefix matching handles all edge cases

Category tracking across strategies

Checked what LLM call types each strategy uses and whether they show up separately in by_category:

Default/topk: generation only (unless llm_as_judge is on, then also evaluator).

AdaEvolve: generation (main iteration calls) + guide (paradigm breakthrough calls when stagnating) + evaluator (if llm_as_judge). All tracked separately.

Evox: The main solution controller and the nested search evolution controller share the same cost_tracker, which is correct for total cost. But both controllers label their pools with the same category names (generation, guide, evaluator), so the breakdown doesn't distinguish solution-side vs search-side calls. Same applies to label/variation operator generation (line 300) which goes through search_controller.guide_llms and lands in guide.

Not a bug (totals are right), but worth considering: the search controller's pools could use distinct categories like search_generation, search_guide so the Evox cost breakdown shows where money actually went.

Two things to fix

1. Gemini thinking tokens not billed

Gemini does not include thinking tokens in completion_tokens. They only show up in total_tokens:

prompt_tokens=11, completion_tokens=1, total_tokens=104

On the gemini-3-pro run, reported cost was $0.14 but actual cost is ~$1.00 (71,791 thinking tokens unbilled). 7x undercount. OpenAI and Anthropic are fine (they include everything in completion_tokens).

Simplest fix: use total_tokens - prompt_tokens as the output token count for cost calculation. This gives the correct answer for all providers:

Provider	completion_tokens	total - prompt	Thinking gap
OpenAI (incl. o-series)	83	83	0
Gemini thinking	1	93	92
Anthropic	10	10	0

2. No warning when pricing is unknown

gemini-2.5-flash silently reports $0 with no indication pricing is missing.

Fixes:

Add gemini-2.5-pro and gemini-2.5-flash to DEFAULT_MODEL_PRICING
Log a warning in OpenAILLM.__init__ when pricing can't be resolved:

if self.input_price_per_million_tokens is None:
    logger.warning("No pricing for model '%s', costs will show $0. Set input/output_price_per_million_tokens in config.", model_cfg.name)

Both are small. Happy to re-review once they're in.

Superseded by final review.

…mport - simulator.py: __transfer_time was not computing actual transfer times. The inner loop iterated over path edges but never accumulated a result, leaving partition_time as -inf. This cascaded into __total_cost via runtime_s, producing -inf total costs. Fixed by computing the bottleneck (minimum flow) along each path and deriving transfer time from it. - evaluator.py: Removed stale top-level `from initial_program import search_algorithm` which fails inside Docker (initial_program is not on the Python path in the container). The import is unnecessary since programs are loaded dynamically via importlib in evaluate(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Mar 28, 2026

View reviewed changes

mert-cemri changed the title ~~Add end-to-end LLM cost tracking with built-in pricing~~ LLM cost tracking with built-in pricing Mar 28, 2026

mert-cemri requested review from lynnliu030 and shubham3-ucb March 28, 2026 20:03

mert-cemri force-pushed the cost_tracking branch from f6298a2 to 43393ab Compare March 28, 2026 20:04

shubham3-ucb previously approved these changes Mar 28, 2026

View reviewed changes

shubham3-ucb reviewed Mar 28, 2026

View reviewed changes

shubham3-ucb previously requested changes Mar 28, 2026

View reviewed changes

shubham3-ucb requested changes Mar 28, 2026

View reviewed changes

add cost tracking

160c442

mert-cemri force-pushed the cost_tracking branch from 43393ab to 160c442 Compare March 29, 2026 00:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLM cost tracking with built-in pricing#49

LLM cost tracking with built-in pricing#49
mert-cemri wants to merge 2 commits into
skydiscover-ai:mainfrom
mert-cemri:cost_tracking

mert-cemri commented Mar 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Mar 28, 2026

Uh oh!

shubham3-ucb left a comment •

edited

Loading

Uh oh!

shubham3-ucb left a comment •

edited

Loading

Uh oh!

shubham3-ucb left a comment •

edited

Loading

Uh oh!

shubham3-ucb left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mert-cemri commented Mar 28, 2026

Summary

Bug fixes

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

shubham3-ucb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shubham3-ucb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shubham3-ucb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shubham3-ucb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

What we tested

Code review

Category tracking across strategies

Two things to fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shubham3-ucb left a comment •

edited

Loading

shubham3-ucb left a comment •

edited

Loading

shubham3-ucb left a comment •

edited

Loading

shubham3-ucb left a comment •

edited

Loading