Describe the bug
A .docx file containing an unusually large XML part fails at the partition step while the document is being parsed.
To Reproduce
Unfortunately, I can't share a .docx or .xml file that reproduces this, because the document that triggered the bug is under NDA.
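For anyone else who hits this and also can't share the file, the sizes of the XML parts inside the .docx can still be reported without revealing any content. This is a minimal sketch using only the standard library; the helper name `largest_parts` is made up for illustration and is not part of unstructured or python-docx.

```python
import zipfile


def largest_parts(docx_path, top=5):
    """Return (part name, uncompressed size in bytes) for the biggest parts.

    A .docx file is a ZIP archive, so the part sizes can be read from the
    archive directory without parsing any XML.
    """
    with zipfile.ZipFile(docx_path) as zf:
        infos = sorted(zf.infolist(), key=lambda info: info.file_size, reverse=True)
        return [(info.filename, info.file_size) for info in infos[:top]]


if __name__ == "__main__":
    import sys

    # Usage: python largest_parts.py path/to/file.docx
    for name, size in largest_parts(sys.argv[1]):
        print(f"{size:>12}  {name}")
```

A part well over 10 MB uncompressed would be a likely suspect, although total part size alone does not show whether a single text node or attribute is the one that is too long.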
Expected behavior
Expected to run partition on this docx file and get elements without an issue.
I'm not sure what the best solution is, but I'm flagging this because other users may run into it.
Here's a stack trace from where we call partition on the docx file:
File "/Users/dhen/vectorizer/utils/file_loader.py", line 119, in get_file_elements
elements = partition(filename=file_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/auto.py", line 279, in partition
elements = partition(filename=filename, file=file, **partitioning_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/common/metadata.py", line 162, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 146, in partition_docx
elements = _DocxPartitioner.iter_document_elements(opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 354, in iter_document_elements
if self._document_contains_sections
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/utils.py", line 154, in __get__
value = self._fget(obj)
^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 516, in _document_contains_sections
return bool(self._document.sections)
^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/utils.py", line 154, in __get__
value = self._fget(obj)
^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 506, in _document
return self._opts.document
^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/utils.py", line 154, in __get__
value = self._fget(obj)
^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/unstructured/partition/docx.py", line 195, in document
return docx.Document(self._docx_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/api.py", line 27, in Document
document_part = cast("DocumentPart", Package.open(docx).main_document_part)
^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/package.py", line 129, in open
Unmarshaller.unmarshal(pkg_reader, package, PartFactory)
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/package.py", line 193, in unmarshal
parts = Unmarshaller._unmarshal_parts(pkg_reader, package, part_factory)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/package.py", line 209, in _unmarshal_parts
parts[partname] = part_factory(partname, content_type, reltype, blob, package)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/part.py", line 196, in __new__
return PartClass.load(partname, content_type, blob, package)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/opc/part.py", line 231, in load
element = parse_xml(blob)
^^^^^^^^^^^^^^^
File "/Users/dhen/vectorizer/env/lib/python3.11/site-packages/docx/oxml/parser.py", line 29, in parse_xml
return cast("BaseOxmlElement", etree.fromstring(xml, oxml_parser))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "src/lxml/etree.pyx", line 3306, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 1995, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1882, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1164, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 672, in lxml.etree._raiseParseError
File "<string>", line 2
lxml.etree.XMLSyntaxError: Memory allocation failed : Huge input lookup, line 2, column 10115679
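The column in the error (10115679) is just past 10,000,000, which appears consistent with libxml2's default limit on the length of a single text node. lxml lets a parser opt out of that limit with `huge_tree=True`. The sketch below only shows that lxml option in isolation; the function name is hypothetical, and whether unstructured or python-docx should enable it (and how) is an open question, since `huge_tree` also disables limits that protect against maliciously large input.

```python
from lxml import etree


def parse_xml_allowing_huge_nodes(xml: bytes) -> etree._Element:
    """Parse XML with libxml2's size limits lifted (hypothetical helper).

    huge_tree=True turns off the parser's safety limits, so this should
    only be used on documents from a trusted source.
    """
    parser = etree.XMLParser(huge_tree=True, resolve_entities=False)
    return etree.fromstring(xml, parser)


if __name__ == "__main__":
    # A single text node of about 11 MB, larger than the default limit.
    payload = b"<root><t>" + b"a" * (11 * 1024 * 1024) + b"</t></root>"
    root = parse_xml_allowing_huge_nodes(payload)
    print(root.tag, len(root[0].text))
```

This does not change how `partition` behaves; it only illustrates the parser setting that the "Huge input lookup" error points to.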