Skip to content

Make syntax highlight generation scale better. #5642

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

paul-ollis
Copy link
Contributor

The TextArea widget code queries the entires syntax tree for each edit, using the tree-sitter Query.captures method. This has potential scaling issues, but such issues are exacerbated by the fact the Query.captures method scales very poorly with the number of line it is asked to generate captures for. It appears to be quadratic or something similar I think - I strongly suspect a bug in tree-sitter or its python bindings.

On my laptop, this makes editing a 25,000 line Python file painfully unresponsive. A 25,000 lines Python file is large, but not entirely unreasonable. I actually became aware of this behaviour developing a LaTeX editor, which showed symptoms after about 1,000 lines.

This change replaces the plain TextArea._highlights dictionary with a dictonary-like class that lazily performs item access to build hilghlight information for small blocks of (50) lines at a time. As well as keeping the TextArea very much more responsive, it will reduce the average memory requirements for larger documents.

During regression testing, I discovered that the per-line highlights are not necessarily in the correct (best) order for syntax highlighting. For example, highlighting within string expressions can lost. So I added suitable sorting. This required that the snapshots for some tests to be updated.

Please review the following checklist.

  • Docstrings on all new or modified functions / classes
  • [n/a] Updated documentation
  • Updated CHANGELOG.md (where appropriate)

The TextArea widget code queries the entires syntax tree for each edit,
using the tree-sitter Query.captures method. This has potential scaling
issues, but such issues are exacerbated by the fact the Query.captures
method scales very poorly with the number of line it is asked to
generate captures for. It appears to be quadratic or something similar I
think - I strongly suspect a bug in tree-sitter or its python bindings.

On my laptop, this makes editing a 25,000 line Python file painfully
unresponsive. A 25,000 lines Python file is large, but not entirely
unreasonable. I actually became aware of this behaviour developing a
LaTeX editor, which showed symptoms after about 1,000 lines.

This change replaces the plain TextArea._highlights dictionary with a
dictonary-like class that lazily performs item access to build
hilghlight information for small blocks of (50) lines at a time. As well
as keeping the TextArea very much more responsive, it will reduce the
average memory requirements for larger documents.

During regression testing, I discovered that the per-line highlights are
not necessarily in the correct (best) order for syntax highlighting. For
example, highlighting within string expressions can lost. So I added
suitable sorting. This required that the snapshots for some tests to be
updated.
@paul-ollis
Copy link
Contributor Author

In case it is of interest.

For the LaTeX editor I was developing when I found this issue, I also implemented a spell checker and search function. It was very easy to extend the new HighlightMap class so that I could overlay misspelled word and matching text on top of the syntax highlighting.

@darrenburns
Copy link
Member

Love it! Exposing the highlight mapping is also super useful.

@paul-ollis
Copy link
Contributor Author

I forgot to upload this earlier. It is the large Python file and a simple application that I have been using for ad-hoc responsiveness testing - before and after. To use, just run the mini_editor and type.

demo.zip

I have not tried to add any tests due to fact that reliable automated performance testing is hard and not necessarily very helpful in this case. I can look into this if you feel that it is worth trying to automatically catch performance regression.

@darrenburns
Copy link
Member

Just played with the demo - very nice! Editing a 25k line Python file with syntax highlighting and almost no perceptible delay cc @willmcgugan :)

@paul-ollis
Copy link
Contributor Author

I am glad you like it, but if you were wondering why I my branch has a '1' on the end...

Try typing in a docstring at the top of the file - i.e. on a blank line open with triple quote (""") and then typing the contents leaving the end triple quote missing. I have a textarea-speedup-2 branch for this mostly ready to go. I need to finish some tweaks to the tests before I can prepare a PR.

@darrenburns
Copy link
Member

Awesome - I just made a PR on your fork with some suggestions, here: paul-ollis#1

Feel free to leave comments. A few type hinting changes but also I suggested a change to how the "uncovered_lines" part works - although I could be misunderstanding your implementation, I feel like we don't need to be constructing all those range objects on each edit and can instead calculate the "range" a line index belongs to on the fly.

For what it's worth, we can likely do even better here if we can get this to work:

for changed_range in tree.changed_ranges(new_tree):
    print("Changed range:")
    print(f"  Start point {changed_range.start_point}")
    print(f"  Start byte {changed_range.start_byte}")
    print(f"  End point {changed_range.end_point}")
    print(f"  End byte {changed_range.end_byte}")

In early versions of the Python tree-sitter bindings, this method wasn't working correctly.

For each edit, I wanted to use this method to return the document ranges affected by it, so that we could only re-highlight those parts. I'm not sure if that has since been fixed though - we only upgraded to the new bindings relatively recently.

@paul-ollis
Copy link
Contributor Author

Thanks for the PR @darrenburns. FYI idea behind the dict of ranges was to make getitem as efficient as possible at the expense of some memory - not too much because the only one range object is created for each 50 lines. However, based on you PR, I see no merit in keeping the scheme - your code is shorter and cleaner. I have incorporated your changes with some modifications, which, I think, make the code cleaner still.

I have also renamed _build_highlight_map to _reset_highlights, but you might be able to come up with a better name.

@paul-ollis
Copy link
Contributor Author

@darrenburns. Re tree.changed_ranges, my other branch (PR imminent) is a good place to try some experiments. I will have a play and let you know what I find (on the next PR).

@paul-ollis
Copy link
Contributor Author

It seems that Tree.changed_ranges does not always do what is required to be useful here.

I am looking at alternative approaches, which may take me a while.

@paul-ollis paul-ollis marked this pull request as draft April 11, 2025 12:02
@paul-ollis
Copy link
Contributor Author

paul-ollis commented Apr 11, 2025

Converted this to draft because it is basically subsumed into #5645. When the status of #5645 is settled this can be closed or readied for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants