[WIP] Lxml Parsing #891

AnthonyDiGirolamo · 2023-02-22T04:52:28Z

Hi all! My team is looking to use doxygen with breathe and noticed very long build times, about ~60 seconds per rst file.

I started hacking on the parser stuff and re-ran generateDS 2.41.1 against doxygen 1.9.6 xsd files.

So far ./tests/runtests.sh passes but I can't yet run sphinx.

tests/test_renderer.py::test_find_nodes PASSED
tests/test_renderer.py::test_find_node PASSED
tests/test_renderer.py::test_render_func PASSED
tests/test_renderer.py::test_render_typedef PASSED
tests/test_renderer.py::test_render_c_typedef PASSED
tests/test_renderer.py::test_render_c_function_typedef PASSED
tests/test_renderer.py::test_render_using_alias PASSED
tests/test_renderer.py::test_render_const_func PASSED
tests/test_renderer.py::test_render_lvalue_func PASSED
tests/test_renderer.py::test_render_rvalue_func PASSED
tests/test_renderer.py::test_render_const_lvalue_func PASSED
tests/test_renderer.py::test_render_const_rvalue_func PASSED
tests/test_renderer.py::test_render_variable_initializer PASSED
tests/test_renderer.py::test_render_define_initializer PASSED
tests/test_renderer.py::test_render_define_no_initializer PASSED
tests/test_renderer.py::test_render_innergroup PASSED
tests/test_renderer.py::test_resolve_overrides PASSED
tests/test_renderer.py::test_ellipsis PASSED
tests/test_utils.py::TestUtils::test_definition_without_template_args PASSED
tests/test_utils.py::TestUtils::test_param_decl PASSED
tests/test_utils.py::TestPathHandler::test_path_handler PASSED

The error I'm getting in our sphinx run is

  lots more docutils stack traces...
  ...
  File "$PROJECT/venv/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2096, in directive
    return self.run_directive(
           ^^^^^^^^^^^^^^^^^^^
  File "$PROJECT/venv/lib/python3.11/site-packages/docutils/parsers/rst/states.py", line 2146, in run_directive
    result = directive_instance.run()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tonymd/code/breathe/breathe/directives/class_like.py", line 52, in run
    finder.filter_(finder_filter, matches)
  File "/home/tonymd/code/breathe/breathe/finder/factory.py", line 45, in filter_
    item_finder.filter_([_FakeParentNode()], filter_, matches)
  File "/home/tonymd/code/breathe/breathe/finder/index.py", line 18, in filter_
    compound_finder.filter_(node_stack, filter_, matches)
  File "/home/tonymd/code/breathe/breathe/finder/index.py", line 65, in filter_
    finder.filter_(node_stack, filter_, matches)
  File "/home/tonymd/code/breathe/breathe/finder/compound.py", line 10, in filter_
    compound_finder = self.item_finder_factory.create_finder(self.data_object.compounddef)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tonymd/code/breathe/breathe/finder/factory.py", line 29, in create_finder
    return self.finders[data_object.node_type](self.project_info, data_object, self)
                        ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'node_type'

Any ideas?

…generateDS 2.41.1

michaeljones · 2023-02-24T11:45:53Z

Thanks for taking a look at this. Unfortunately it is a non-trivial task. We stopped regenerating the xml parsing code when we starting manually adjusting it. Regenerating it now will likely require a lot of changes which is hard to achieve in a Python project which doesn't have a lot of type safety or good test coverage. Though clearly these files are at the heart of Breathe's performance and memory problems.

In this case, it is hard to know if this is the first error of a small number of potential problems or a very large number.

If this project was well funded and supported then we might have the time and energy to work through such a large change but we have not been successful at securing such funding. Partly due to a lack of knowledge of how to go about it.

Personally my preference for languages has also drifted away from Python. I find large Python code bases hard to maintain and whilst the new optional typing can help with that it is still not as robust as other languages.

I have started on a re-write of Breathe using Rust as the backend and a slim Python layer connected via PyO3 bindings. The hope is that it'll be significantly faster and less memory intensive whilst also being more maintainable due to Rust's type system. I have a simple proof of concept but it is a long way from matching the current functionality but Breathe though I would like to get it there.

In order to attempt to make it a more sustainable project, I intend to license it under the Parity license which makes it free to use for open source projects but will require a commercial license for closed source work. I realise this is not optimal for all of Breathe's current users but will hopefully give the project a clearer future and potential to expand in scope to deliver more solutions in this space.

igrr · 2023-03-10T10:47:10Z

@AnthonyDiGirolamo our team is also interested in contributing to improved performance, since our documentation builds are currently taking several hours.

I have tried your branch to see if I can make any progress, but ran into some issues which look like something more basic than the issue you have described in the PR:

(backtrace)

  File "/Users/ivan/.espressif/python_env/idf5.1_py3.9_env/lib/python3.9/site-packages/docutils/parsers/rst/states.py", line 2146, in run_directive
    result = directive_instance.run()
  File "/Users/ivan/e/3rdparty/breathe/breathe/directives/item.py", line 53, in run
    finder.filter_(finder_filter, matches)
  File "/Users/ivan/e/3rdparty/breathe/breathe/finder/factory.py", line 45, in filter_
    item_finder.filter_([_FakeParentNode()], filter_, matches)
  File "/Users/ivan/e/3rdparty/breathe/breathe/finder/index.py", line 18, in filter_
    compound_finder.filter_(node_stack, filter_, matches)
  File "/Users/ivan/e/3rdparty/breathe/breathe/finder/index.py", line 63, in filter_
    file_data = self.compound_parser.parse(self.data_object.refid)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/__init__.py", line 77, in parse
    result = compound.parse(filename)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compound.py", line 954, in parse
    rootObj.build(rootNode)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 1382, in build
    self._buildChildren(child, node, nodeName_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 1397, in _buildChildren
    obj_.build(child_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 1996, in build
    self._buildChildren(child, node, nodeName_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 2129, in _buildChildren
    obj_.build(child_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 3728, in build
    self._buildChildren(child, node, nodeName_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 3750, in _buildChildren
    obj_.build(child_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 4587, in build
    self._buildChildren(child, node, nodeName_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compound.py", line 168, in _buildChildren
    super(memberdefTypeSub, self)._buildChildren(child_, node, nodeName_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 4900, in _buildChildren
    obj_.build(child_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 5542, in build
    self._buildChildren(child, node, nodeName_, gds_collector_=gds_collector_)
  File "/Users/ivan/e/3rdparty/breathe/breathe/parser/compoundsuper.py", line 5557, in _buildChildren
    obj_.build(child_, gds_collector_=gds_collector_)
TypeError: build() got an unexpected keyword argument 'gds_collector_'

It looks like the call to build doesnt pass the node name argument, but the declaration of build expects it? Is it possible that you haven't staged/committed some of the changes you did?

AnthonyDiGirolamo · 2023-03-10T16:26:21Z

@igrr Thanks for trying it!

Those are the types of exceptions I was working through. I got through the ones caused by the unit tests but not running an actual sphinx generation.

I started this effort by re-running the latest generate-ds version which provided new indexsuper.py and compoundsuper.py files.
http://www.davekuhlman.org/generateDS.html
Problem with that is the internal API those files provide have changed since the last time this was done (in 2009 I think) 2362fc9

The latest generate-ds version uses lxml for parsing which is what speeds everything up.

For what it's worth I setup our breathe config to only run Doxygen on files with doxygen comment blocks. This reduces the number of xml output files that breathe has to parse which cut down our generation time drastically. Here's the change that did that: docs: Support Doxygen style comment blocks

igrr · 2023-03-11T11:51:27Z

Right, but aren't both compound.py and compoundsuper.py in your PR auto-generated? I would expect that this would make them internally consistent. The API changes would then only be important outside of the parser, e.g. in breathe/finder/*.py. Whereas in the backtrace I got it looks like parser/compoundsuper.py is calling another function in parser/compoundsuper.py, but the function is expecting more arguments than the caller is passing.

I guess I'll also try to run generateDS locally and see how far I can get.

AnthonyDiGirolamo · 2023-03-11T14:27:41Z

As far as I can tell there was some code in compound.py that was written by the breathe team. The auto generated output provided empty functions that can be overridden by users. I copied that all over but it's not all updated for the newer generate DS api.

oxysoft · 2023-10-25T01:14:32Z

Alright folks we need to move forward with this, there is no way that we can accept in 2023 an hour to build motherfucking text files. (no one's fault, of course) I am happy to contribute, but already from the get-go I'm confused what the ruckus is all about. If my assessment is correct, our goal is to parse XML and build a structure of objects that is native to Sphinx, then we store these into 'doctree' files or pass them to Sphinx so it can run it's HTML rendering magic. So what is so hard about switching over to LXML? Both are still XML underneath, so it should be a matter of updating the API calls everywhere, isn't it?

AnthonyDiGirolamo · 2023-11-13T19:01:27Z

Hey @oxysoft yep that's right. This is still a problem for our sphinx build too. We try to limit doxygen to only run on the files we care about to limit what breathe parses. It worked well when there were less files but now it's getting too big and slow. (List of inputs https://cs.opensource.google/pigweed/pigweed/+/main:docs/BUILD.gn;l=113-197 )

Hopefully we can revisit this soon.

vermeeren · 2023-12-04T12:19:06Z

Side note that there is also the approach of #962. I am not sure what is preferable at this point in time or how far both of the implementations are.

AnthonyDiGirolamo added 4 commits February 21, 2023 19:24

Regenerate indexsuper.py and compoundsuper.py for doxygen 1.9.6 with …

2a60a6f

…generateDS 2.41.1

Add missing node_types

1d24705

node.get_argsstring() can return None which doesn't work with ''.join()

8fddfba

Update index.py

d8eb866

This was referenced Feb 22, 2023

slowness #439

Open

Breathe parsing XML slow #544

Closed

Memory Consumption of doxyindex & doxygenclass #315

Open

reneme mentioned this pull request Apr 12, 2023

Use Breathe for API Reference documentation randombit/botan#3377

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Lxml Parsing #891

[WIP] Lxml Parsing #891

AnthonyDiGirolamo commented Feb 22, 2023 •

edited

Loading

michaeljones commented Feb 24, 2023 •

edited

Loading

igrr commented Mar 10, 2023

AnthonyDiGirolamo commented Mar 10, 2023

igrr commented Mar 11, 2023

AnthonyDiGirolamo commented Mar 11, 2023

oxysoft commented Oct 25, 2023 •

edited

Loading

AnthonyDiGirolamo commented Nov 13, 2023

vermeeren commented Dec 4, 2023

[WIP] Lxml Parsing #891

Are you sure you want to change the base?

[WIP] Lxml Parsing #891

Conversation

AnthonyDiGirolamo commented Feb 22, 2023 • edited Loading

michaeljones commented Feb 24, 2023 • edited Loading

igrr commented Mar 10, 2023

AnthonyDiGirolamo commented Mar 10, 2023

igrr commented Mar 11, 2023

AnthonyDiGirolamo commented Mar 11, 2023

oxysoft commented Oct 25, 2023 • edited Loading

AnthonyDiGirolamo commented Nov 13, 2023

vermeeren commented Dec 4, 2023

AnthonyDiGirolamo commented Feb 22, 2023 •

edited

Loading

michaeljones commented Feb 24, 2023 •

edited

Loading

oxysoft commented Oct 25, 2023 •

edited

Loading