LLM cost tracking with built-in pricing#49
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements comprehensive LLM cost and token usage tracking by introducing a CostTracker utility and expanding the LLMResponse data structure. It adds pricing parameters to model configurations and updates the OpenAI provider to calculate costs using either explicit settings or a built-in pricing table. The CLI and API now report these metrics in their summaries. A review comment identifies an opportunity to reduce code duplication in skydiscover/llm/openai.py by extracting a helper method for constructing LLMResponse objects, which would improve maintainability across the various API interaction methods.
| input_tokens, output_tokens, total_tokens = extract_usage_counts(getattr(response, "usage", None)) | ||
| input_cost_usd, output_cost_usd, total_cost_usd = compute_token_costs( | ||
| input_tokens, | ||
| output_tokens, | ||
| self.input_price_per_million_tokens, | ||
| self.output_price_per_million_tokens, | ||
| ) | ||
| return LLMResponse( | ||
| text=response.choices[0].message.content or "", | ||
| model_name=getattr(response, "model", None) or self.model, | ||
| usage_source="api" if getattr(response, "usage", None) is not None else None, | ||
| input_tokens=input_tokens, | ||
| output_tokens=output_tokens, | ||
| total_tokens=total_tokens, | ||
| input_cost_usd=input_cost_usd, | ||
| output_cost_usd=output_cost_usd, | ||
| total_cost_usd=total_cost_usd, | ||
| ) |
There was a problem hiding this comment.
This block of code for creating an LLMResponse with cost and usage information is repeated in _call_api, _call_api_via_responses, and _generate_with_image. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method.
For example, you could create a _build_llm_response method:
def _build_llm_response(
self,
response_obj: Any,
text: str,
image_path: Optional[str] = None,
) -> LLMResponse:
input_tokens, output_tokens, total_tokens = extract_usage_counts(
getattr(response_obj, "usage", None)
)
input_cost_usd, output_cost_usd, total_cost_usd = compute_token_costs(
input_tokens,
output_tokens,
self.input_price_per_million_tokens,
self.output_price_per_million_tokens,
)
return LLMResponse(
text=text,
image_path=image_path,
model_name=getattr(response_obj, "model", None) or self.model,
usage_source="api" if getattr(response_obj, "usage", None) is not None else None,
input_tokens=input_tokens,
output_tokens=output_tokens,
total_tokens=total_tokens,
input_cost_usd=input_cost_usd,
output_cost_usd=output_cost_usd,
total_cost_usd=total_cost_usd,
)Then you could simplify this method and the others to just call this helper, for instance:
return self._build_llm_response(response, response.choices[0].message.content or "")References
- When accessing configurations that might return None, it is acceptable to use defensive patterns like 'or []' even if getattr has a default value, as this guards against explicit null values in the configuration.
f6298a2 to
43393ab
Compare
There was a problem hiding this comment.
What we tested
Unit tests: 168/168 pass. 60 additional microtests (thread safety, prefix matching, float precision, empty tracker, agentic aggregation, partial price overrides). All pass.
Live runs on circle_packing (2 iters each):
| Model | Strategy | Calls | Tokens (in/out/total) | Reported cost | Correct? |
|---|---|---|---|---|---|
| gpt-4o-mini | topk | 2 | 11,371 / 1,424 / 12,795 | $0.0026 | Yes |
| gpt-4.1-nano | adaevolve | 2 | 14,510 / 3,226 / 17,736 | $0.0027 | Yes |
| gemini-2.5-flash | topk | 2 | 13,083 / 3,749 / 33,937 | $0.0000 | No (see #1 and #2) |
| gemini-3-pro-preview | topk | 4 | 43,572 / 4,749 / 120,112 | $0.1441 | No (see #1) |
| claude-sonnet-4-6 | topk | 3 | 20,792 / 13,936 / 34,728 | $0.2714 | Yes |
Also verified: empty tracker (all LLM calls fail) renders cleanly, JSON output and checkpoint cost snapshots correct.
Code review
Architecture is clean. Traced every path:
- No double-counting in agentic mode
- AdaEvolve pool re-creation bug fix is correct
- Evox nested controller passes cost_tracker through correctly
- Thread safety is solid
- Prefix matching handles all edge cases
Category tracking across strategies
Checked what LLM call types each strategy uses and whether they show up separately in by_category:
Default/topk: generation only (unless llm_as_judge is on, then also evaluator).
AdaEvolve: generation (main iteration calls) + guide (paradigm breakthrough calls when stagnating) + evaluator (if llm_as_judge). All tracked separately.
Evox: The main solution controller and the nested search evolution controller share the same cost_tracker, which is correct for total cost. But both controllers label their pools with the same category names (generation, guide, evaluator), so the breakdown doesn't distinguish solution-side vs search-side calls. Same applies to label/variation operator generation (line 300) which goes through search_controller.guide_llms and lands in guide.
Not a bug (totals are right), but worth considering: the search controller's pools could use distinct categories like search_generation, search_guide so the Evox cost breakdown shows where money actually went.
Two things to fix
1. Gemini thinking tokens not billed
Gemini does not include thinking tokens in completion_tokens. They only show up in total_tokens:
prompt_tokens=11, completion_tokens=1, total_tokens=104
On the gemini-3-pro run, reported cost was $0.14 but actual cost is ~$1.00 (71,791 thinking tokens unbilled). 7x undercount. OpenAI and Anthropic are fine (they include everything in completion_tokens).
Simplest fix: use total_tokens - prompt_tokens as the output token count for cost calculation. This gives the correct answer for all providers:
| Provider | completion_tokens | total - prompt | Thinking gap |
|---|---|---|---|
| OpenAI (incl. o-series) | 83 | 83 | 0 |
| Gemini thinking | 1 | 93 | 92 |
| Anthropic | 10 | 10 | 0 |
2. No warning when pricing is unknown
gemini-2.5-flash silently reports $0 with no indication pricing is missing.
Fixes:
- Add gemini-2.5-pro and gemini-2.5-flash to
DEFAULT_MODEL_PRICING - Log a warning in
OpenAILLM.__init__when pricing can't be resolved:
if self.input_price_per_million_tokens is None:
logger.warning("No pricing for model '%s', costs will show $0. Set input/output_price_per_million_tokens in config.", model_cfg.name)Both are small. Happy to re-review once they're in.
43393ab to
160c442
Compare
…mport - simulator.py: __transfer_time was not computing actual transfer times. The inner loop iterated over path edges but never accumulated a result, leaving partition_time as -inf. This cascaded into __total_cost via runtime_s, producing -inf total costs. Fixed by computing the bottleneck (minimum flow) along each path and deriving transfer time from it. - evaluator.py: Removed stale top-level `from initial_program import search_algorithm` which fails inside Docker (initial_program is not on the Python path in the container). The import is unnecessary since programs are loaded dynamically via importlib in evaluate(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
CostTrackershared across generation, evaluator, and guide pools (including Evox nested controllers and agentic mode)llm_cost_summary.jsonsaved to output dir;DiscoveryResult.llm_cost_summaryfor API usersBug fixes
AdaEvolveController.__init__was re-creatingself.llmswithout thecost_tracker, silently dropping all tracking in adaevolve modeLLMResponse_run_iterationkwargs unpackingTest plan
🤖 Generated with Claude Code