Describe the bug
When I call partition on a .docx file, it fails with the error unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.ZIP file type.
To Reproduce
Download the attached Document.docx file and run the code below:
from unstructured.partition.auto import partition
elements = partition("Document.docx")
Expected behavior
Partitioning should complete without any errors.
Screenshots
NA
Environment Info
OS version: macOS-14.6.1-arm64-arm-64bit
Python version: 3.10.15
unstructured version: None
unstructured-inference version: 0.8.7
pytesseract is not installed
Torch version: 2.6.0
Detectron2 is not installed
[notice] A new release of pip is available: 23.2.1 -> 25.0.1
[notice] To update, run: python3 -m pip install --upgrade pip
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'
Additional context
On further debugging, I found that to recognize a .docx file, unstructured opens it as a ZIP archive and looks for a specific member named word/document.xml. In certain versions of Office 365, however, .docx files are produced with word/document2.xml instead, so the check fails and the file is classified as a plain ZIP. Adding this name to the list of checks in https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py#L749 fixes the issue.
I tried this change locally and found no issues.
I looked for an explanation in the official Microsoft docs, but found no explicit mention of this naming.
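To check which name a given file uses, the archive can be inspected with the standard library alone. The following is a minimal sketch, not unstructured's actual detection code: the helper names find_word_document_parts and looks_like_docx are hypothetical, and the accepted list simply mirrors the two member names mentioned above.

```python
import zipfile


def find_word_document_parts(path):
    """Return ZIP members that look like a Word main document part.

    Hypothetical diagnostic helper; it is not part of unstructured.
    """
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    return sorted(
        name
        for name in names
        if name.startswith("word/document") and name.endswith(".xml")
    )


def looks_like_docx(path, accepted=("word/document.xml", "word/document2.xml")):
    """Sketch of a name-based check that also accepts word/document2.xml.

    The `accepted` tuple is an assumption based on the names in this report.
    """
    if not zipfile.is_zipfile(path):
        return False
    return any(part in accepted for part in find_word_document_parts(path))
```

Calling find_word_document_parts("Document.docx") on the attached file should show whether it contains word/document.xml or word/document2.xml.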