
Speed up blib2to3 tokenization using generic python function #4540

Closed
moogician opened this issue Dec 28, 2024 · 2 comments · Fixed by #4541
Labels
T: bug Something isn't working

Comments

@moogician
Contributor

Describe the bug

is_fstring_start uses builtins.any with a generator expression for prefix matching, which is slow and measurably slows down f-string tokenization.

def is_fstring_start(token: str) -> bool:
    # using `any` with a `<genexpr>` is too slow
    return builtins.any(token.startswith(prefix) for prefix in fstring_prefix)

To Reproduce

Run this minimal reproducing script:

import cProfile, pstats, io
from blib2to3.pgen2 import tokenize

profiler = cProfile.Profile()
example = io.StringIO(','.join(['f"X"']*10000)).readline
profiler.enable()
tokenize.tokenize(example, lambda *_: None)
profiler.disable()

pstats.Stats(profiler).sort_stats(pstats.SortKey.TIME).print_stats("black", "src", 10)

The profiling output looks like this:

         720011 function calls in 0.133 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    40001    0.040    0.000    0.124    0.000 black/src/blib2to3/pgen2/tokenize.py:559(generate_tokens)
   190000    0.021    0.000    0.036    0.000 black/src/blib2to3/pgen2/tokenize.py:466(<genexpr>)
        1    0.008    0.008    0.133    0.133 black/src/blib2to3/pgen2/tokenize.py:280(tokenize_loop)
    10000    0.004    0.000    0.048    0.000 black/src/blib2to3/pgen2/tokenize.py:463(is_fstring_start)
    59997    0.003    0.000    0.003    0.000 black/src/blib2to3/pgen2/tokenize.py:528(current)
    10000    0.001    0.000    0.002    0.000 black/src/blib2to3/pgen2/tokenize.py:534(leave_fstring)
    10000    0.001    0.000    0.002    0.000 black/src/blib2to3/pgen2/tokenize.py:531(enter_fstring)
        1    0.000    0.000    0.133    0.133 black/src/blib2to3/pgen2/tokenize.py:260(tokenize)
        2    0.000    0.000    0.000    0.000 black/src/blib2to3/pgen2/tokenize.py:525(is_in_fstring_expression)
        1    0.000    0.000    0.000    0.000 black/src/blib2to3/pgen2/tokenize.py:522(__init__)

The <genexpr> in is_fstring_start typically accounts for around 15-20% of the total tokenization time, which is significant and easily optimizable.
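The overhead can be demonstrated in isolation with a small timeit micro-benchmark. This is a hedged sketch: the `prefixes` tuple here is a hypothetical stand-in for blib2to3's actual `fstring_prefix` collection, not copied from the source.

```python
import timeit

# Hypothetical stand-in for blib2to3's fstring_prefix (assumption).
prefixes = ("f'", 'f"', "rf'", 'rf"', "fr'", 'fr"')
token = 'f"X"'

# Current approach: any() over a generator expression (Python-level loop).
genexpr = timeit.timeit(
    lambda: any(token.startswith(p) for p in prefixes), number=100_000
)

# Proposed approach: str.startswith with a tuple of prefixes (C-level loop).
tuple_call = timeit.timeit(
    lambda: token.startswith(prefixes), number=100_000
)

print(f"genexpr: {genexpr:.3f}s  tuple: {tuple_call:.3f}s")
```

On a typical CPython build the tuple form runs several times faster, since it avoids allocating a generator and re-entering the interpreter loop per prefix.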

Environment

  • Black's version: [main]
  • OS and Python version: [Mac/Python 3.12.6]

Proposed Solution
Change fstring_prefix to a tuple and call token.startswith(fstring_prefix) directly, since str.startswith natively accepts a tuple of prefixes.

cc. @JelleZijlstra @tusharsadhwani

@moogician moogician added the T: bug Something isn't working label Dec 28, 2024
@tusharsadhwani
Contributor

Sounds good, would you like to send a PR?

@moogician
Contributor Author

Sure!
