A much faster and more memory-efficient XML parser #962

Rouslan · 2023-11-10T02:46:26Z

Rouslan
Nov 10, 2023

I know someone was working on replacing the XML parser with one based on lxml, but how about one written in C? I started working on an XML parser generator that takes a simplified schema file written in JSON and outputs an extension module written in C.

The module uses Expat and Python's stable ABI. All complex types are their own classes and perfect hash lookup functions are generated for the element names, attribute names and enumerations. The generated code is meant to be easily readable and a fully-typed stub file will be generated.
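
To illustrate the perfect-hash idea for readers unfamiliar with it, here is a minimal Python sketch. It is not code from the generator (which emits C); the hash function (seeded FNV-1a), the table size and the list of element names are all assumptions chosen for the example. The point is only that, for a fixed set of names known at generation time, a seed can be searched for such that every name lands in its own slot, so a lookup needs one hash and one string comparison.

```python
def fnv1a(data: bytes, seed: int) -> int:
    """32-bit FNV-1a, using a caller-supplied seed as the initial state."""
    h = seed & 0xFFFFFFFF
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def find_perfect_seed(names, table_size, max_seed=100_000):
    """Brute-force search for a seed that gives every name a distinct slot."""
    for seed in range(1, max_seed):
        slots = {fnv1a(n.encode(), seed) % table_size for n in names}
        if len(slots) == len(names):
            return seed
    raise ValueError("no collision-free seed found; try a larger table")


# Hypothetical subset of element names, fixed at "generation" time.
NAMES = ["para", "ref", "sect1", "sect2", "highlight", "sp"]
TABLE_SIZE = 11  # assumption: a small prime a bit larger than len(NAMES)
SEED = find_perfect_seed(NAMES, TABLE_SIZE)

# Fill the table once; unused slots stay None.
TABLE = [None] * TABLE_SIZE
for _name in NAMES:
    TABLE[fnv1a(_name.encode(), SEED) % TABLE_SIZE] = _name


def is_known_element(name: str) -> bool:
    """One hash plus one comparison, with no probing or chaining."""
    return TABLE[fnv1a(name.encode(), SEED) % TABLE_SIZE] == name
```

A code generator would run the search once and bake the resulting seed and table into the emitted source, so none of the search cost is paid at parse time.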

I already have "compound.xsd" converted to the JSON schema and the project is able to generate most of the needed code. Before I finish and start trying to incorporate it into Breathe, I want to know what people's thoughts are, and if there are any objections to such a solution.

I'm also aware that the original author of Breathe is working on a Rust-based rewrite, but I think having most of the code in Python is preferable.

michaeljones · 2023-11-11T09:17:03Z

michaeljones
Nov 11, 2023
Maintainer

I think that sounds like a really interesting technical challenge and something you should work on if it excites you. I don't think it would be sensible for the project to adopt such a system, though, as I suspect the extra code and maintenance burden would outweigh the potential performance benefits over using lxml. After all, lxml is itself a set of Python bindings to C libraries. I can see that something generated from the xsd files has the potential to outperform a generic solution, but I suspect it would be too much for the project to take on.

I'm not much involved with the project though so others might come to another conclusion. Sorry to be a bit of a downer!

0 replies

Rouslan · 2023-11-11T16:35:48Z

Rouslan
Nov 11, 2023
Author

I started this, fully aware that the maintainers might not want to burden themselves with any C code, so I won't be too disappointed if this is rejected.

Anyway, I should mention, in case people thought otherwise, that the code generator itself is written in Python. As an alternative, the generator could be made to output the equivalent Python code. It would still be a significant improvement over the current parser because it would use the Expat parser directly, building the output as it parses, with all the generated classes using __slots__ and having complete type annotations.
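
A minimal sketch of what "using Expat directly and building the output as it parses" could look like in pure Python. This is not the generated code: the `Member` class, the `parse_members` function and the tiny sample document are hypothetical, and real Doxygen types have many more fields. It only shows the shape of the approach: Expat callbacks fill slotted, annotated objects, with no intermediate DOM.

```python
from __future__ import annotations

import xml.parsers.expat


class Member:
    """Hypothetical generated class; __slots__ avoids a per-instance __dict__."""

    __slots__ = ("kind", "name")

    def __init__(self, kind: str) -> None:
        self.kind: str = kind
        self.name: str = ""


def parse_members(data: bytes) -> list[Member]:
    """Build Member objects straight from Expat callbacks."""
    members: list[Member] = []
    text: list[str] = []

    def start(tag: str, attrs: dict[str, str]) -> None:
        if tag == "memberdef":
            members.append(Member(attrs.get("kind", "")))
        elif tag == "name":
            text.clear()  # start collecting the text of this <name>

    def chars(chunk: str) -> None:
        text.append(chunk)

    def end(tag: str) -> None:
        if tag == "name" and members:
            members[-1].name = "".join(text)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(data, True)  # True marks this as the final chunk
    return members


SAMPLE = (
    b'<sectiondef><memberdef kind="function">'
    b"<name>foo</name></memberdef></sectiondef>"
)
```

Calling `parse_members(SAMPLE)` returns a list with a single `Member` whose `kind` is `"function"` and whose `name` is `"foo"`.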

3 replies

vermeeren Dec 4, 2023
Maintainer

@Rouslan How feasible would it be to have two XML parsing back-ends so that the user can choose? Admittedly, my time for Breathe has been very limited for quite a while, but I still see options for something like this to be merged.

What seems smoothest would be merging it, but off by default. Get people to test for a while, then turn on by default. We may then either retain both implementations long-term depending on effort, or eventually remove the old one completely. Probably depends on user feedback.

The mechanism could be a runtime config option, completely separate branches with two releases per Breathe version, or whatever else takes the least effort. We could also formally release a test version with the new parser, for example.

In any case many thanks for your work on this!

Edit: To clarify, Breathe has limited tests and limited time for testing, so it's really on a best-effort basis. Pushing out updates that may break edge cases because we cannot test everything is the current reality and imo better than stagnation.

Rouslan Dec 4, 2023
Author

It would have to be a separate branch or a pre-release version. The new back-end has a different interface, I'm adding type annotations all over the place and the filters have been rewritten. So much code has been touched that to have it configurable inside one branch would require having two copies of the entire package in the branch. That being said, I would prefer to do what makes the most sense, even if it's a little more effort; so what would be the best way?

Anyway, speaking of tests, I have just started adding a new suite of tests that should help a lot. I'm taking files from breathe/examples/specific, adding a minimal reStructuredText file for each example and building it with the original Breathe with the output format set to XML (the XML format is basically a dump of the docutils nodes). The test suite works by building the same files in temporary directories and comparing the XML output. The XML files are compared as streams of tags and text nodes, with certain tags and attributes filtered out and all text nodes filtered with str.strip.
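
The comparison described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual test code: the ignored tag and attribute names are hypothetical, and the sketch assumes that an ignored tag is dropped together with everything inside it.

```python
import xml.parsers.expat

# Hypothetical filter sets; the names the real suite ignores may differ.
IGNORED_TAGS = frozenset({"target"})
IGNORED_ATTRS = frozenset({"ids", "source"})


def xml_events(data: bytes) -> list:
    """Flatten an XML document into a list of tag and text events."""
    events: list = []
    skipping = 0  # nesting depth inside an ignored element

    def start(tag, attrs):
        nonlocal skipping
        if skipping or tag in IGNORED_TAGS:
            skipping += 1
            return
        kept = tuple(sorted(
            (k, v) for k, v in attrs.items() if k not in IGNORED_ATTRS
        ))
        events.append(("start", tag, kept))

    def end(tag):
        nonlocal skipping
        if skipping:
            skipping -= 1
            return
        events.append(("end", tag))

    def chars(chunk):
        stripped = chunk.strip()  # same normalization as str.strip
        if not skipping and stripped:
            events.append(("text", stripped))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # merge adjacent text chunks into one callback
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)
    return events


def same_output(expected: bytes, actual: bytes) -> bool:
    """True if two documents match after filtering and stripping."""
    return xml_events(expected) == xml_events(actual)
```

Comparing event streams rather than raw bytes means differences in indentation, build paths or generated IDs do not cause spurious failures.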

By the way, are the files in breathe/examples/specific used anywhere? If they have a use, it's fine but if not, now that I'm using them for testing, it would be slightly more convenient to put each example in its own folder under breathe/tests/data.

Rouslan Dec 11, 2023
Author

Never mind about the breathe/examples/specific files. I see now that they are used in the documentation.

Rouslan · 2023-11-17T19:57:43Z

Rouslan
Nov 17, 2023
Author

My fork is at https://github.com/Rouslan/breathe/tree/c_parser

I still have a long way to go before the fork is usable, but the part that generates the C module is working. You can run python3 setup.py build to generate (and compile) the module. If you want to compile it, you need to make sure the Expat library and headers can be found by the compiler. Later, I'll add command-line options to setup.py for specifying the locations of Expat's library and headers.

As I'm replacing all references to the old parser classes, I'm also updating and adding more type annotations.

Any comments or criticisms are appreciated.

0 replies

Rouslan · 2023-11-27T00:02:38Z

Rouslan
Nov 27, 2023
Author

The link is now https://github.com/Rouslan/breathe (I merged the c_parser branch into the main branch of my fork).

It's mostly working now. Right now I need to fix some filters (all class members are being emitted regardless of the given options). On that topic: would anyone object to having the high-level filter objects (subclasses of Selector, Accessor and Filter) removed and the filters replaced with simple callback functions? I can probably come up with a way to make the high-level objects type-safe, but they really don't seem necessary in the first place. Even with the provided domain-specific language, they don't seem to be more readable or significantly more concise than regular functions, and functions would be faster anyway. Even the apparent functional quality is betrayed, because one of the filters is impure (it has side effects), which caught me by surprise.
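
To make the proposal concrete, here is a minimal sketch of a filter written as a plain callback. Everything in it is hypothetical: the `Member` tuple stands in for a parsed Doxygen member, and the option names only loosely mirror directive options. It is not the code in the fork.

```python
from typing import Callable, NamedTuple


class Member(NamedTuple):
    """Hypothetical stand-in for a parsed Doxygen member."""

    name: str
    prot: str  # "public", "protected" or "private"
    documented: bool


MemberFilter = Callable[[Member], bool]


def make_member_filter(
    *, private: bool = False, undocumented: bool = False
) -> MemberFilter:
    """Return a pure predicate built from directive-style options."""

    def accept(member: Member) -> bool:
        if member.prot == "private" and not private:
            return False
        if not member.documented and not undocumented:
            return False
        return True

    return accept


members = [
    Member("a", "public", True),
    Member("b", "private", True),
    Member("c", "public", False),
]
default_filter = make_member_filter()
visible = [m.name for m in members if default_filter(m)]
```

With the default options, `visible` contains only `"a"`. The predicate is an ordinary typed function, so a type checker can verify it and it has no hidden side effects.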

1 reply

michaeljones Nov 27, 2023
Maintainer

Thanks, super impressive. Much respect.

I think it is very reasonable to change the Selector/Accessor/Filter stuff to normal functions. It was an experiment I did a long time ago and has its own particular charm but isn't necessarily the best way to go about doing it and certainly isn't common.

Rouslan · 2023-12-11T02:51:50Z

Rouslan
Dec 11, 2023
Author

The fork is now able to generate identical HTML output for the Pigweed project (* with one trivial exception, see below), a collection of libraries with a lot of files, making extensive use of Sphinx and Breathe.

Currently, all the tests pass, including the ones I added. Later, I'm going to run the tests with Coverage.py to see what the tests' blind spots are and add more tests if needed. I'll also look up how to compile it according to the manylinux project so that binary wheels can be provided for Linux (in addition to Windows).

* The exception is one method documented with the "param" command. The method has an unnamed argument and the "param" command is incorrectly given the type of the parameter in place of the name. This causes the original Breathe to omit the "[in]" qualifier in the output, for some reason.

0 replies

igrr · 2023-12-18T07:53:33Z

igrr
Dec 18, 2023

We tried using the fork to build the documentation of the ESP-IDF project, and the build time went down from 50 to 30 minutes. This is a very nice improvement. Thanks, @Rouslan, for your work!

I've noticed a few parser warnings (unexpected element "sect2", also "sp", "highlight") but haven't so far seen any impact on the resulting documentation.

1 reply

Rouslan Dec 19, 2023
Author

Thanks for trying it out!

Regarding the unexpected element warnings: a few elements appear outside of what should be their containing elements according to the schema. I believe the original parser silently ignores them. I'll have to investigate before I know the correct way to handle them.

Rouslan · 2023-12-24T23:24:35Z

Rouslan
Dec 24, 2023
Author

I have just removed an enormous bottleneck in the original code. Projects with a lot of code should be significantly faster now.

0 replies

Rouslan · 2024-01-15T00:02:05Z

Rouslan
Jan 15, 2024
Author

I have replaced the parser written in C with an equivalent one written in Python. It looks like the parser was never the problem; my fork was faster because I rewrote the "filters" and "finders".

I still tried to make the new parser run as fast as I could. Its code is generated using the same system that generated the C parser. Interestingly, it's only about 5 times slower than the C one. On my machine, the new parser can still parse 2000 XML files, totaling 33 MB, in 1.2 seconds. The memory usage should be about the same: the C parser's output stored all values in Python types, and the values from the new parser either use __slots__ or are named tuples.
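
As a small illustration of the memory point, the sketch below contrasts an ordinary class with a slotted class and a named tuple. The class names and fields are hypothetical and not taken from the generated parser. The ordinary class carries a per-instance `__dict__`; the other two do not.

```python
import sys
from typing import NamedTuple


class PlainRef:
    """Ordinary class: every instance has its own __dict__."""

    def __init__(self, refid: str, text: str) -> None:
        self.refid = refid
        self.text = text


class SlotRef:
    """Slotted class: attributes live in fixed slots, with no __dict__."""

    __slots__ = ("refid", "text")

    def __init__(self, refid: str, text: str) -> None:
        self.refid = refid
        self.text = text


class TupleRef(NamedTuple):
    """Named tuple: immutable, also without a per-instance __dict__."""

    refid: str
    text: str


def shallow_size(obj: object) -> int:
    """Size of the object plus its __dict__, if it has one (values excluded)."""
    size = sys.getsizeof(obj)
    if hasattr(obj, "__dict__"):
        size += sys.getsizeof(obj.__dict__)
    return size


for cls in (PlainRef, SlotRef, TupleRef):
    print(cls.__name__, shallow_size(cls("id1", "label")))
```

The exact byte counts depend on the Python version and platform, so none are quoted here.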

0 replies
