feat(inference): add tokenization utilities for prompt caching #4168
+902
−1
OpenAI-Compatible Prompt Caching Feature - Phase 1 (Tokenization Utilities for Prompt Caching)
Summary
This PR implements Tokenization Utilities for the OpenAI-compatible prompt caching feature. It provides token counting functionality to determine whether prompts are cacheable (≥1024 tokens), with support for OpenAI models, Llama models, and multimodal content.
This is the second in a series of progressive PRs toward implementing prompt caching. It has no dependency on PR #4166.
Strategy
This PR follows the implementation strategy for extending the Llama Stack OpenAI-compatible API to support prompt caching (as per OpenAI's implementation) while integrating with MLflow's prompt registry for prompt management and versioning.
cached_tokens in usage statistics
Related Issue
Part of prompt caching implementation - Phase 1 of #4166
Changes
Core Implementation
src/llama_stack/providers/utils/inference/tokenization.py
Token Counting API:
count_tokens(messages, model, exact=True) - Main API for counting tokens in messages (see the signature sketch at the end of this section)
get_tokenization_method(model) - Returns the tokenization method used for a model
clear_tokenizer_cache() - Clears the tokenizer cache for testing/memory management
Model Support:
OpenAI models (exact via tiktoken), e.g. gpt-4-turbo-2024-04-09
Llama models (exact via transformers): meta-llama/Llama-3.x-*, meta-llama/Llama-4.x-*, meta-llama/Meta-Llama-3-*
Unknown models (character-based estimation fallback)
Multimodal Content Support:
Performance Optimization:
Error Handling:
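For reference, a minimal sketch of the public surface described above; the type hints, docstrings, and return values are assumptions for illustration and may not match the module exactly.

```python
# Sketch of the described API surface; type hints and docstrings are assumptions.
from typing import Any


def count_tokens(messages: list[dict[str, Any]], model: str, exact: bool = True) -> int:
    """Count tokens in OpenAI-style chat messages for the given model."""
    ...


def get_tokenization_method(model: str) -> str:
    """Return the tokenization method used for a model (tiktoken, transformers, or estimation)."""
    ...


def clear_tokenizer_cache() -> None:
    """Clear cached tokenizer instances (testing / memory management)."""
    ...
```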
Tests
tests/unit/providers/utils/inference/test_tokenization.py
34 comprehensive test cases covering:
Test Coverage:
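To illustrate the shape of the test module, a hedged example; the actual test names, fixtures, and assertions in test_tokenization.py may differ.

```python
# Illustrative pytest cases; the real tests in
# tests/unit/providers/utils/inference/test_tokenization.py may differ.
from llama_stack.providers.utils.inference.tokenization import (
    clear_tokenizer_cache,
    count_tokens,
)


def test_count_tokens_openai_model_is_positive():
    messages = [{"role": "user", "content": "Hello, world!"}]
    assert count_tokens(messages, model="gpt-4-turbo-2024-04-09", exact=True) > 0


def test_unknown_model_uses_character_estimation():
    # ~4 chars/token fallback: 4096 characters should land near 1024 tokens.
    messages = [{"role": "user", "content": "x" * 4096}]
    assert count_tokens(messages, model="totally-unknown-model", exact=False) >= 512


def teardown_module(module):
    clear_tokenizer_cache()
```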
Dependencies
pyproject.toml (modified)
tiktoken version constraint set to >=0.8.0 (was already in dependencies)
transformers already available in the type_checking group (optional, for Llama models)
Testing
Unit Tests
Results:
Architecture Notes
Design Decisions
Model-specific tokenization: Uses native tokenizers (tiktoken, transformers) for accuracy where available, with graceful fallback to estimation
LRU caching: Tokenizer instances are expensive to create (~100 ms on first load), so we cache up to 10 tokenizers using Python's functools.lru_cache (sketched after this list)
Multimodal support: Image token estimation based on GPT-4V benchmarks (85 tokens low-res, 170 high-res) with detail-level detection
Conservative estimation: Unknown models use a conservative 4 characters/token ratio to avoid undercounting
Async-compatible: While functions are not async (tokenization is CPU-bound), they're designed to be called from async contexts without blocking
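The caching and estimation decisions above can be summarized in a short sketch; the function and constant names here are illustrative, not necessarily those used in tokenization.py.

```python
# Illustrative sketch of the caching/estimation design, not the PR's exact code.
from functools import lru_cache

CHARS_PER_TOKEN = 4        # conservative ratio for unknown models
IMAGE_TOKENS_LOW = 85      # GPT-4V benchmark, low-detail image
IMAGE_TOKENS_HIGH = 170    # GPT-4V benchmark, high-detail image


@lru_cache(maxsize=10)
def _load_tiktoken_encoding(model: str):
    """Cache encodings: the first load costs roughly 100 ms."""
    import tiktoken

    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        return tiktoken.get_encoding("cl100k_base")


def _estimate_text_tokens(text: str) -> int:
    """Character-based estimate used when no native tokenizer is available."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def _estimate_image_tokens(detail: str = "low") -> int:
    """Image token estimate with detail-level detection (85 low / 170 high)."""
    return IMAGE_TOKENS_HIGH if detail == "high" else IMAGE_TOKENS_LOW
```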
Token Counting Accuracy
Performance Characteristics
Tokenizer caching via @lru_cache(maxsize=10)
Security Considerations
Checklist
llama_stack.log
Examples
Basic Text Counting
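A minimal usage sketch, assuming the import path mirrors the file location and that count_tokens returns an integer count:

```python
from llama_stack.providers.utils.inference.tokenization import count_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]

# Exact counting via tiktoken for an OpenAI model.
n_tokens = count_tokens(messages, model="gpt-4-turbo-2024-04-09", exact=True)
print(f"Prompt uses {n_tokens} tokens")
```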
Multimodal Content
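A sketch of counting a multimodal message, assuming OpenAI-style image_url content parts; the image portion is estimated (85/170 tokens by detail level) rather than tokenized exactly:

```python
from llama_stack.providers.utils.inference.tokenization import count_tokens

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/photo.png", "detail": "high"},
            },
        ],
    }
]

n_tokens = count_tokens(messages, model="gpt-4-turbo-2024-04-09")
print(f"Estimated multimodal prompt size: {n_tokens} tokens")
```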
Checking Cacheability
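A sketch of the cacheability check from the summary (prompts of at least 1024 tokens are cacheable); the MIN_CACHEABLE_TOKENS constant and is_cacheable helper are hypothetical, defined here only for illustration:

```python
from llama_stack.providers.utils.inference.tokenization import count_tokens

MIN_CACHEABLE_TOKENS = 1024  # threshold from the PR summary; not a module constant


def is_cacheable(messages, model: str) -> bool:
    """Return True if the prompt is long enough to qualify for prompt caching."""
    return count_tokens(messages, model) >= MIN_CACHEABLE_TOKENS


messages = [{"role": "user", "content": "..."}]  # long prompt elided
print(is_cacheable(messages, "meta-llama/Llama-3.3-70B-Instruct"))
```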
Unknown Models (Estimation)
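For a model without a known tokenizer, the utilities fall back to character-based estimation (~4 chars/token); the model id below is hypothetical:

```python
from llama_stack.providers.utils.inference.tokenization import count_tokens

messages = [{"role": "user", "content": "A prompt for a model with no native tokenizer."}]

# "my-org/custom-model" is a made-up id; it triggers the estimation fallback.
approx = count_tokens(messages, model="my-org/custom-model", exact=False)
print(f"Approximate token count: {approx}")
```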
Check Tokenization Method
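A sketch of inspecting which tokenization path a model resolves to, assuming get_tokenization_method returns a descriptive string (e.g. tiktoken, transformers, or an estimation marker):

```python
from llama_stack.providers.utils.inference.tokenization import get_tokenization_method

for model in (
    "gpt-4-turbo-2024-04-09",
    "meta-llama/Llama-3.3-70B-Instruct",
    "my-org/custom-model",  # hypothetical id, resolves to estimation
):
    print(model, "->", get_tokenization_method(model))
```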