
bug/Large docx xml encoding terminates lxml #3887

Open · DhenPadilla opened this issue Jan 23, 2025 · 0 comments
Labels: bug (Something isn't working)

DhenPadilla commented Jan 23, 2025
Describe the bug
An unusually large XML part inside a .docx file causes the partition step to fail while parsing the document.

To Reproduce
Unfortunately, I can't share the .docx or the extracted .xml that reproduces this, since the document that triggered the bug is under NDA.
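As a stand-in, here is a minimal synthetic sketch that should trip the same family of libxml2 limits. It assumes the failure comes from libxml2's default ~10 MB guard on huge inputs/text nodes (the error below fires at column ~10.1 million); the 11 MB payload and element name are made up for illustration, and the exact error text may differ from the one in the trace.

```python
from lxml import etree

# Build one XML text node larger than libxml2's default ~10 MB limit.
# (Size and element name are illustrative, not taken from the real document.)
huge_text = "A" * 11_000_000
xml = ("<root>" + huge_text + "</root>").encode("utf-8")

try:
    # Default lxml parser: huge_tree is off, so libxml2's limit should trigger.
    etree.fromstring(xml)
except etree.XMLSyntaxError as exc:
    print("default parser failed:", exc)

# A parser built with huge_tree=True lifts the limit and parses the same bytes.
parser = etree.XMLParser(huge_tree=True)
root = etree.fromstring(xml, parser)
print("huge_tree parser parsed:", root.tag)
```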

Expected behavior
Expected partition to run on this .docx file and return elements without issue.

Screenshots
N/A (stack trace included below)

Environment Info

OS version:  macOS-15.1-arm64-arm-64bit
Python version:  3.11.4
unstructured version:  0.16.13
unstructured-inference version:  0.8.1
pytesseract version:  0.3.13
Torch version:  2.5.1

Additional context

I'm not sure what the best solution would be, but I'm flagging it since other users may run into the same thing. A possible workaround sketch is included after the stack trace.

Here's a stack trace from where we call partition on the docx file:


File "/Users/dhen/vectorizer/utils/file_loader.py", line 119, in get_file_elements
    elements = partition(filename=file_path)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/auto.py", line 279, in partition
    elements = partition(filename=filename, file=file, **partitioning_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/common/metadata.py", line 162, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 146, in partition_docx
    elements = _DocxPartitioner.iter_document_elements(opts)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 354, in iter_document_elements
    if self._document_contains_sections
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/utils.py", line 154, in __get__
    value = self._fget(obj)
            ^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 516, in _document_contains_sections
    return bool(self._document.sections)
                ^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/utils.py", line 154, in __get__
    value = self._fget(obj)
            ^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 506, in _document
    return self._opts.document
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/utils.py", line 154, in __get__
    value = self._fget(obj)
            ^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 195, in document
    return docx.Document(self._docx_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
                                         ^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/package.py", line 129, in open
    Unmarshaller.unmarshal(pkg_reader, package, PartFactory)
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/package.py", line 193, in unmarshal
    parts = Unmarshaller._unmarshal_parts(pkg_reader, package, part_factory)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/package.py", line 209, in _unmarshal_parts
    parts[partname] = part_factory(partname, content_type, reltype, blob, package)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/part.py", line 196, in __new__
    return PartClass.load(partname, content_type, blob, package)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/part.py", line 231, in load
    element = parse_xml(blob)
              ^^^^^^^^^^^^^^^
  File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/oxml/parser.py", line 29, in parse_xml
    return cast("BaseOxmlElement", etree.fromstring(xml, oxml_parser))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/etree.pyx", line 3306, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1995, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1882, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1164, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 672, in lxml.etree._raiseParseError
  File "<string>", line 2
lxml.etree.XMLSyntaxError: Memory allocation failed : Huge input lookup, line 2, column 10115679
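For context, the final error fires at line 2, column 10,115,679, just past 10,000,000, which is libxml2's default cap on huge inputs/text nodes; lxml exposes an opt-out via etree.XMLParser(huge_tree=True). Since the trace shows python-docx parsing the part with its module-level parser in docx/oxml/parser.py, one possible stopgap is to swap that parser for a huge_tree-enabled one before calling partition. This is only a sketch that pokes at python-docx internals, and the element_class_lookup name is an assumption that may differ between versions:

```python
from lxml import etree
import docx.oxml.parser as docx_oxml_parser  # module shown in the traceback

# Workaround sketch: rebuild python-docx's module-level oxml_parser with
# huge_tree=True so libxml2 accepts parts past its default ~10 MB limits.
# WARNING: relies on python-docx internals; element_class_lookup is assumed to
# be the module-level lookup that maps WordprocessingML tags to oxml classes.
patched = etree.XMLParser(remove_blank_text=True, huge_tree=True)
patched.set_element_class_lookup(docx_oxml_parser.element_class_lookup)
docx_oxml_parser.oxml_parser = patched  # parse_xml() reads this module global

# Subsequent partition(filename=...) calls should reach the same parse_xml()
# shown in the trace, now using the more permissive parser.
```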

