
⚡️ Speed up function _estimate_string_tokens by 221% #2156


Open · wants to merge 5 commits into base: main
Conversation

misrasaurabh1
Contributor

📄 221% (2.21x) speedup for _estimate_string_tokens in pydantic_ai_slim/pydantic_ai/models/function.py

⏱️ Runtime: 5.29 milliseconds → 1.65 milliseconds (best of 89 runs)

📝 Explanation and details

Hotspots

  • Most time is spent calling re.split() for every string or string-like object, which is expensive.
  • Checking isinstance for common types on each iteration.
  • tokens += 0 operations are no-ops and can be removed.
  • The regex can be precompiled.
  • str.split() with None as delimiter is often much faster and covers most whitespace splitting, which is likely enough here.
  • content.strip() is being called redundantly for every string, which we can optimize.

Optimizations made:

  • Precompiled the regex, so it’s not recompiled per call.
  • Removed all tokens += 0 and unnecessary else branches (no effect).
  • Minimized calls to .strip() and .split() to once per string instance.
  • Dropped extraneous isinstance checks.
  • Moved logic into clear branches, optimizing the common paths.

Correctness verification report:

Test                              Status
⚙️ Existing Unit Tests            🔘 None Found
🌀 Generated Regression Tests     64 Passed
⏪ Replay Tests                   🔘 None Found
🔎 Concolic Coverage Tests        🔘 None Found
📊 Tests Coverage                 85.7%
🌀 Generated Regression Tests and Runtime
import re
from collections.abc import Sequence

# imports
import pytest  # used for our unit tests
from pydantic_ai.models.function import _estimate_string_tokens


# Dummy classes to mimic pydantic_ai.messages for testing
class AudioUrl:
    def __init__(self, url):
        self.url = url

class ImageUrl:
    def __init__(self, url):
        self.url = url

class BinaryContent:
    def __init__(self, data):
        self.data = data
from pydantic_ai.models.function import _estimate_string_tokens

# unit tests

# ------------------------ BASIC TEST CASES ------------------------

def test_empty_string_returns_zero():
    # Test that empty string returns 0 tokens
    codeflash_output = _estimate_string_tokens("") # 349ns -> 378ns (7.67% slower)

def test_simple_sentence():
    # Test a simple sentence
    codeflash_output = _estimate_string_tokens("Hello world") # 3.63μs -> 2.42μs (49.8% faster)

def test_sentence_with_punctuation():
    # Test sentence with punctuation
    codeflash_output = _estimate_string_tokens("Hello, world.") # 3.28μs -> 2.45μs (33.9% faster)

def test_sentence_with_multiple_spaces():
    # Test sentence with multiple spaces
    codeflash_output = _estimate_string_tokens("Hello    world") # 3.57μs -> 2.39μs (49.4% faster)

def test_sentence_with_mixed_delimiters():
    # Test sentence with various delimiters
    codeflash_output = _estimate_string_tokens('Hello, world: "Python".') # 3.83μs -> 2.73μs (40.3% faster)

def test_string_with_leading_and_trailing_spaces():
    # Leading/trailing whitespace should not affect token count
    codeflash_output = _estimate_string_tokens("   Hello world   ") # 3.52μs -> 2.39μs (47.2% faster)

def test_string_with_only_delimiters():
    # Only delimiters should result in zero tokens
    codeflash_output = _estimate_string_tokens(" ,.:  ") # 3.34μs -> 2.05μs (62.7% faster)

def test_string_with_newlines_and_tabs():
    # Newlines and tabs are whitespace and should be treated as delimiters
    codeflash_output = _estimate_string_tokens("Hello\nworld\tPython") # 3.81μs -> 2.75μs (38.5% faster)

# ------------------------ EDGE TEST CASES ------------------------

def test_none_input_returns_zero():
    # None input should return 0 tokens
    codeflash_output = _estimate_string_tokens(None) # 338ns -> 366ns (7.65% slower)

def test_empty_list_returns_zero():
    # Empty list should return 0 tokens
    codeflash_output = _estimate_string_tokens([]) # 358ns -> 358ns (0.000% faster)

def test_list_of_empty_strings():
    # List of empty strings should return 0 tokens
    codeflash_output = _estimate_string_tokens(["", "", ""]) # 5.70μs -> 1.97μs (190% faster)

def test_list_of_strings():
    # List of strings should sum token counts
    codeflash_output = _estimate_string_tokens(["Hello world", "Python is great"]) # 6.52μs -> 3.66μs (78.0% faster)

def test_list_with_string_and_empty_string():
    # List with a string and an empty string
    codeflash_output = _estimate_string_tokens(["Hello world", ""]) # 6.68μs -> 3.06μs (119% faster)

def test_list_with_audio_and_image_url():
    # AudioUrl and ImageUrl should contribute 0 tokens
    audio = AudioUrl("http://audio.url")
    image = ImageUrl("http://image.url")
    codeflash_output = _estimate_string_tokens([audio, image]) # 2.66μs -> 1.19μs (123% faster)

def test_list_with_string_and_audio_url():
    # String and AudioUrl, only string counts
    audio = AudioUrl("http://audio.url")
    codeflash_output = _estimate_string_tokens(["Hello world", audio]) # 5.87μs -> 3.18μs (84.8% faster)

def test_list_with_binary_content():
    # BinaryContent's token count is len(data)
    binary = BinaryContent(b"abcde")
    codeflash_output = _estimate_string_tokens([binary]) # 2.18μs -> 1.00μs (117% faster)

def test_list_with_string_and_binary_content():
    # Both string and BinaryContent contribute
    binary = BinaryContent(b"abcde")
    codeflash_output = _estimate_string_tokens(["Hello", binary]) # 5.65μs -> 2.60μs (117% faster)

def test_list_with_all_types():
    # All types together
    audio = AudioUrl("http://audio.url")
    image = ImageUrl("http://image.url")
    binary = BinaryContent(b"xyz")
    codeflash_output = _estimate_string_tokens(["Hi there", audio, image, binary, "Python."]) # 7.98μs -> 3.91μs (104% faster)

def test_list_with_unexpected_type():
    # Unexpected type should contribute 0 tokens
    class Dummy: pass
    codeflash_output = _estimate_string_tokens(["Hello", Dummy()])

def test_string_with_only_spaces():
    # String of only spaces returns 0
    codeflash_output = _estimate_string_tokens("     ") # 2.44μs -> 1.40μs (74.8% faster)

def test_string_with_unicode_characters():
    # Unicode characters should be counted as tokens
    codeflash_output = _estimate_string_tokens("你好 世界") # 5.61μs -> 4.50μs (24.6% faster)

def test_string_with_mixed_unicode_and_ascii():
    # Mixed unicode and ascii
    codeflash_output = _estimate_string_tokens("hello 世界") # 4.64μs -> 3.56μs (30.4% faster)

def test_list_with_unicode_strings():
    # List with unicode strings
    codeflash_output = _estimate_string_tokens(["你好 世界", "hello"]) # 8.25μs -> 4.64μs (78.0% faster)

def test_binary_content_empty():
    # BinaryContent with empty data
    binary = BinaryContent(b"")
    codeflash_output = _estimate_string_tokens([binary]) # 1.96μs -> 1.02μs (91.7% faster)

def test_list_with_nested_empty_lists():
    # Nested empty lists are not supported, but should not crash
    codeflash_output = _estimate_string_tokens([[]]) # 2.03μs -> 905ns (124% faster)

# ------------------------ LARGE SCALE TEST CASES ------------------------

def test_long_string_1000_words():
    # Test a string with 1000 words
    long_str = "word " * 1000
    codeflash_output = _estimate_string_tokens(long_str.strip()) # 121μs -> 111μs (8.44% faster)

def test_list_of_1000_strings():
    # List of 1000 single-word strings
    string_list = ["word"] * 1000
    codeflash_output = _estimate_string_tokens(string_list) # 701μs -> 191μs (266% faster)

def test_list_of_1000_empty_strings():
    # List of 1000 empty strings
    string_list = [""] * 1000
    codeflash_output = _estimate_string_tokens(string_list) # 596μs -> 89.6μs (566% faster)

def test_list_of_500_strings_and_500_binary():
    # 500 strings, 500 BinaryContent (each with 2 bytes)
    string_list = ["hi"] * 500
    binary_list = [BinaryContent(b"ab")] * 500
    codeflash_output = _estimate_string_tokens(string_list + binary_list) # 504μs -> 112μs (349% faster)

def test_list_of_1000_audio_and_image_urls():
    # 500 AudioUrl, 500 ImageUrl
    audio_list = [AudioUrl("a")] * 500
    image_list = [ImageUrl("b")] * 500
    codeflash_output = _estimate_string_tokens(audio_list + image_list) # 278μs -> 45.4μs (513% faster)

def test_list_of_mixed_types_large():
    # 333 strings, 333 binary (3 bytes), 334 images
    string_list = ["hello world"] * 333  # 2 tokens each
    binary_list = [BinaryContent(b"xyz")] * 333  # 3 tokens each
    image_list = [ImageUrl("img")] * 334  # 0 tokens each
    expected = 333*2 + 333*3 + 334*0
    codeflash_output = _estimate_string_tokens(string_list + binary_list + image_list) # 512μs -> 150μs (242% faster)

def test_performance_large_string():
    # Test that large string doesn't crash or hang (not a strict performance test)
    long_str = "a " * 999 + "a"
    codeflash_output = _estimate_string_tokens(long_str) # 66.2μs -> 66.3μs (0.202% slower)

def test_performance_large_list():
    # Test that large list doesn't crash or hang
    string_list = ["a b c"] * 333  # 3 tokens each
    codeflash_output = _estimate_string_tokens(string_list) # 295μs -> 108μs (172% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import re
from collections.abc import Sequence

# imports
import pytest  # used for our unit tests
from pydantic_ai.models.function import _estimate_string_tokens


# Dummy classes to mimic pydantic_ai.messages
class AudioUrl:
    def __init__(self, url):
        self.url = url

class ImageUrl:
    def __init__(self, url):
        self.url = url

class BinaryContent:
    def __init__(self, data: bytes):
        self.data = data

UserContent = (str, AudioUrl, ImageUrl, BinaryContent)
from pydantic_ai.models.function import _estimate_string_tokens

# unit tests

# 1. Basic Test Cases

def test_empty_string_returns_zero():
    # Should return 0 for empty string
    codeflash_output = _estimate_string_tokens("") # 290ns -> 389ns (25.4% slower)

def test_simple_string_single_word():
    # Single word string
    codeflash_output = _estimate_string_tokens("hello") # 2.94μs -> 1.91μs (53.9% faster)

def test_simple_string_multiple_words():
    # Multiple words separated by spaces
    codeflash_output = _estimate_string_tokens("hello world") # 3.37μs -> 2.24μs (50.8% faster)

def test_string_with_punctuation():
    # String with punctuation that should be split
    codeflash_output = _estimate_string_tokens('hello, world.') # 3.89μs -> 2.76μs (40.9% faster)

def test_string_with_multiple_separators():
    # String with multiple spaces and punctuation
    codeflash_output = _estimate_string_tokens('hello,  world.  foo:bar') # 3.73μs -> 2.76μs (34.9% faster)

def test_string_with_leading_trailing_spaces():
    # Leading and trailing whitespace should not affect token count
    codeflash_output = _estimate_string_tokens('   hello world   ') # 3.48μs -> 2.44μs (42.2% faster)

def test_string_with_only_separators():
    # String with only separators should result in 0 tokens
    codeflash_output = _estimate_string_tokens(' , . : ') # 3.10μs -> 2.18μs (42.1% faster)

def test_sequence_of_strings():
    # Sequence of strings should sum their token counts
    content = ["hello world", "foo,bar"]
    codeflash_output = _estimate_string_tokens(content) # 6.31μs -> 3.40μs (85.7% faster)

def test_sequence_of_strings_and_empty_string():
    # Sequence with empty string should not affect total
    content = ["hello world", "", "foo"]
    codeflash_output = _estimate_string_tokens(content) # 7.43μs -> 3.56μs (109% faster)

def test_string_with_multiple_spaces_between_words():
    # Multiple spaces between words should not create empty tokens
    codeflash_output = _estimate_string_tokens("hello    world") # 3.54μs -> 2.45μs (44.5% faster)

# 2. Edge Test Cases

def test_none_input_returns_zero():
    # None input should return 0
    codeflash_output = _estimate_string_tokens(None) # 348ns -> 335ns (3.88% faster)

def test_sequence_empty_list_returns_zero():
    # Empty sequence should return 0
    codeflash_output = _estimate_string_tokens([]) # 326ns -> 351ns (7.12% slower)

def test_sequence_of_only_non_str_content():
    # Sequence of only AudioUrl/ImageUrl should return 0
    content = [AudioUrl("http://a.com"), ImageUrl("http://b.com")]
    codeflash_output = _estimate_string_tokens(content) # 3.06μs -> 1.28μs (139% faster)

def test_sequence_with_binary_content():
    # BinaryContent should use length of bytes as tokens
    content = [BinaryContent(b"abcde")]
    codeflash_output = _estimate_string_tokens(content) # 2.28μs -> 1.14μs (101% faster)

def test_sequence_mixed_types():
    # Mixed sequence: string, AudioUrl, BinaryContent, ImageUrl
    content = [
        "hello there",
        AudioUrl("http://audio.com"),
        BinaryContent(b"xyz"),
        ImageUrl("http://img.com"),
        "foo:bar"
    ]
    # "hello there" -> 2, BinaryContent -> 3, "foo:bar" -> 2
    codeflash_output = _estimate_string_tokens(content) # 8.83μs -> 4.19μs (110% faster)

def test_sequence_with_empty_binary_content():
    # BinaryContent with empty bytes
    content = [BinaryContent(b"")]
    codeflash_output = _estimate_string_tokens(content) # 2.15μs -> 1.09μs (97.2% faster)

def test_string_with_consecutive_separators():
    # Multiple consecutive separators should not create empty tokens
    codeflash_output = _estimate_string_tokens("hello,,,  world...foo") # 3.65μs -> 2.67μs (36.6% faster)

def test_sequence_with_unexpected_type():
    # Sequence with an unexpected type should be ignored (added as 0)
    class Dummy: pass
    content = ["hello", Dummy()]
    codeflash_output = _estimate_string_tokens(content)

def test_string_with_unicode_and_non_ascii():
    # Unicode characters should be treated as part of tokens
    codeflash_output = _estimate_string_tokens("héllo wørld") # 4.34μs -> 3.10μs (39.8% faster)

def test_string_with_newline_and_tab():
    # Newlines and tabs are whitespace and should be split
    codeflash_output = _estimate_string_tokens("hello\nworld\tfoo") # 3.88μs -> 2.71μs (43.0% faster)

def test_string_with_only_whitespace():
    # String with only whitespace should return 0
    codeflash_output = _estimate_string_tokens("    \t\n   ") # 2.33μs -> 1.45μs (60.8% faster)

def test_sequence_of_strings_with_whitespace_only():
    # Sequence with whitespace-only strings
    content = ["   ", "\t", "\n"]
    codeflash_output = _estimate_string_tokens(content) # 6.33μs -> 2.28μs (177% faster)

def test_string_with_colon_and_period():
    # Colons and periods are separators
    codeflash_output = _estimate_string_tokens("foo:bar.baz") # 3.60μs -> 2.77μs (30.2% faster)

def test_string_with_quotes():
    # Quotes are separators
    codeflash_output = _estimate_string_tokens('foo "bar" baz') # 3.52μs -> 2.58μs (36.5% faster)

def test_string_with_mixed_separators():
    # All separators together
    codeflash_output = _estimate_string_tokens('foo, bar: "baz".') # 4.15μs -> 2.80μs (48.2% faster)

# 3. Large Scale Test Cases

def test_long_string():
    # Very long string with 1000 words
    long_str = "word " * 1000
    codeflash_output = _estimate_string_tokens(long_str.strip()) # 122μs -> 121μs (0.830% faster)

def test_large_sequence_of_strings():
    # Sequence of 1000 single-word strings
    content = ["hello"] * 1000
    codeflash_output = _estimate_string_tokens(content) # 754μs -> 188μs (299% faster)

def test_large_sequence_of_mixed_content():
    # Sequence of 500 strings and 500 BinaryContent (each 2 bytes)
    content = ["foo bar"] * 500 + [BinaryContent(b"xy")] * 500
    # 500*2 tokens from strings + 500*2 from BinaryContent
    codeflash_output = _estimate_string_tokens(content) # 599μs -> 174μs (243% faster)

def test_large_sequence_with_audio_and_image():
    # Sequence with 333 strings, 333 AudioUrl, 334 ImageUrl
    content = (
        ["foo bar"] * 333 +
        [AudioUrl("http://a.com")] * 333 +
        [ImageUrl("http://b.com")] * 334
    )
    # Only strings counted: 333*2 = 666
    codeflash_output = _estimate_string_tokens(content) # 493μs -> 134μs (267% faster)

def test_large_binary_content():
    # Single BinaryContent with 999 bytes
    content = [BinaryContent(b"x" * 999)]
    codeflash_output = _estimate_string_tokens(content) # 2.11μs -> 1.05μs (101% faster)

def test_large_mixed_string_with_all_separators():
    # Large string with all separators and 500 tokens
    s = ("foo, bar. baz: " * 125).strip()  # 4 tokens per repeat, 125*4=500
    codeflash_output = _estimate_string_tokens(s) # 39.2μs -> 38.5μs (1.69% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-_estimate_string_tokens-mcs8yg4q and push.

Codeflash

codeflash-ai bot and others added 4 commits July 6, 2025 22:31

#### Further possible optimization, if you don't need exact punctuation splitting.
If you're willing to change the token estimation (using whitespace instead of the full punctuation split), you can swap out `_TOKEN_SPLIT_RE.split(foo.strip())` to simply `foo.strip().split()` and drop all regex, which is **much** faster. But this does **relax** the original tokenization logic.

Let me know if you want it even **faster** with that change, or if you need to preserve the splitting on punctuation!
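As an illustration of that trade-off, a quick comparison of the two splitting strategies (the pattern is the one used in this PR; the sample string is arbitrary):

```python
import re

_TOKEN_SPLIT_RE = re.compile(r'[\s",.:]+')

text = 'Hello, world: "Python".'

# Full punctuation-aware split (the PR's behavior). Note the trailing
# empty string produced when the text ends in a separator.
regex_tokens = _TOKEN_SPLIT_RE.split(text.strip())
# → ['Hello', 'world', 'Python', '']

# Whitespace-only split (faster, but punctuation stays attached to words).
ws_tokens = text.strip().split()
# → ['Hello,', 'world:', '"Python".']
```

The two strategies can disagree on token counts (4 vs 3 here), so swapping in `str.split()` changes the estimate, not just the speed.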
        tokens += len(_TOKEN_SPLIT_RE.split(part.strip()))
    elif isinstance(part, BinaryContent):
        tokens += len(part.data)
    # We don't need explicit handling for AudioUrl or ImageUrl, since they add 0
Contributor


I agree we can skip the tokens += 0, but we should keep the original todo comment, as image/audio URL parts actually do add tokens; we just don't count them here.

Contributor Author


fixed

return tokens


_TOKEN_SPLIT_RE = re.compile(r'[\s",.:]+')
Member


This will have some overhead at import time; it's small, but it'll add up if we do this with all regular expressions. Should we stick with re.split(r'[\s",.:]+', part.strip()), as it'll cache the regex the first time it's run?

Contributor Author


To validate the performance characteristics, I ran an experiment where I replaced the current suggestion with inline re.split and timed it on the generated test set, so the only change is the global re.compile vs inline re.split:
global re.compile time -> 1.68ms
inline re.split -> 2.57ms
Yes, re.split does cache the compiled regex for future use, but the cache lookup has overhead that can be high, especially when called in a loop. In my experience with optimizations discovered with codeflash, I've seen re.compile be faster.
In this case, since the regex is used multiple times and in a loop, I would recommend regex compilation. Although it's your decision.
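A minimal way to reproduce this kind of comparison with timeit (a sketch, not the exact codeflash harness; absolute numbers will vary by machine):

```python
import re
import timeit

_PATTERN = r'[\s",.:]+'
_COMPILED = re.compile(_PATTERN)

words = 'hello, world: foo. bar ' * 50


def split_compiled():
    # Uses the module-level precompiled pattern directly.
    return _COMPILED.split(words.strip())


def split_inline():
    # re.split caches the compiled pattern internally, but each call still
    # pays for the cache lookup and argument handling.
    return re.split(_PATTERN, words.strip())


t_compiled = timeit.timeit(split_compiled, number=2000)
t_inline = timeit.timeit(split_inline, number=2000)
print(f"compiled: {t_compiled:.4f}s  inline: {t_inline:.4f}s")
```

Both calls return identical token lists; only the per-call dispatch cost differs.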

@misrasaurabh1 misrasaurabh1 requested a review from DouweM July 16, 2025 05:34