
Speed up blib2to3 tokenization using generic python function #4540

Closed
moogician opened this issue Dec 28, 2024 · 2 comments · Fixed by #4541
Labels
T: bug Something isn't working

Comments

@moogician
Contributor

Describe the bug

is_fstring_start uses builtins.any with a generator expression for prefix matching, which is slow and measurably slows down f-string tokenization.

def is_fstring_start(token: str) -> bool:
    # using `any` with a `<genexpr>` is too slow
    return builtins.any(token.startswith(prefix) for prefix in fstring_prefix)

To Reproduce

Run this minimal reproducing script:

import cProfile, pstats, io
from blib2to3.pgen2 import tokenize

profiler = cProfile.Profile()
example = io.StringIO(','.join(['f"X"']*10000)).readline
profiler.enable()
tokenize.tokenize(example, lambda *_: None)
profiler.disable()

pstats.Stats(profiler).sort_stats(pstats.SortKey.TIME).print_stats("black", "src", 10)

The profiling output looks like this:

         720011 function calls in 0.133 seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    40001    0.040    0.000    0.124    0.000 black/src/blib2to3/pgen2/tokenize.py:559(generate_tokens)
   190000    0.021    0.000    0.036    0.000 black/src/blib2to3/pgen2/tokenize.py:466(<genexpr>)
        1    0.008    0.008    0.133    0.133 black/src/blib2to3/pgen2/tokenize.py:280(tokenize_loop)
    10000    0.004    0.000    0.048    0.000 black/src/blib2to3/pgen2/tokenize.py:463(is_fstring_start)
    59997    0.003    0.000    0.003    0.000 black/src/blib2to3/pgen2/tokenize.py:528(current)
    10000    0.001    0.000    0.002    0.000 black/src/blib2to3/pgen2/tokenize.py:534(leave_fstring)
    10000    0.001    0.000    0.002    0.000 black/src/blib2to3/pgen2/tokenize.py:531(enter_fstring)
        1    0.000    0.000    0.133    0.133 black/src/blib2to3/pgen2/tokenize.py:260(tokenize)
        2    0.000    0.000    0.000    0.000 black/src/blib2to3/pgen2/tokenize.py:525(is_in_fstring_expression)
        1    0.000    0.000    0.000    0.000 black/src/blib2to3/pgen2/tokenize.py:522(__init__)

The <genexpr> in is_fstring_start typically accounts for around 15-20% of the total tokenization time, which is significant and easily optimizable.
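The overhead can be demonstrated in isolation with a small timeit micro-benchmark. This is a hedged sketch: the `prefixes` tuple here is a hypothetical stand-in for blib2to3's actual `fstring_prefix` collection, not copied from the source.

```python
import timeit

# Hypothetical stand-in for blib2to3's fstring_prefix (assumption).
prefixes = ("f'", 'f"', "rf'", 'rf"', "fr'", 'fr"')
token = 'f"X"'

# Current approach: any() over a generator expression (Python-level loop).
genexpr = timeit.timeit(
    lambda: any(token.startswith(p) for p in prefixes), number=100_000
)

# Proposed approach: str.startswith with a tuple of prefixes (C-level loop).
tuple_call = timeit.timeit(
    lambda: token.startswith(prefixes), number=100_000
)

print(f"genexpr: {genexpr:.3f}s  tuple: {tuple_call:.3f}s")
```

On a typical CPython build the tuple form runs several times faster, since it avoids allocating a generator and re-entering the interpreter loop per prefix.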

Environment

  • Black's version: [main]
  • OS and Python version: [Mac/Python 3.12.6]

Proposed Solution
Change fstring_prefix to a tuple and call token.startswith(fstring_prefix) directly, since str.startswith natively accepts a tuple of prefixes.

cc. @JelleZijlstra @tusharsadhwani

@moogician moogician added the T: bug Something isn't working label Dec 28, 2024
@tusharsadhwani
Contributor

Sounds good, would you like to send a PR?

@moogician
Contributor Author

Sure!
