Description
Describe the bug
Calling partition on a .docx file fails with the error unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.ZIP file type.
To Reproduce
Download the attached file and run the code below:
from unstructured.partition.auto import partition
elements = partition("Document.docx")
Expected behavior
Partitioning should complete without any errors.
Screenshots
NA
Environment Info
OS version: macOS-14.6.1-arm64-arm-64bit
Python version: 3.10.15
unstructured version: None
unstructured-inference version: 0.8.7
pytesseract is not installed
Torch version: 2.6.0
Detectron2 is not installed
[notice] A new release of pip is available: 23.2.1 -> 25.0.1
[notice] To update, run: python3 -m pip install --upgrade pip
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'
Additional context
On further debugging, I found that to detect a .docx file, unstructured unzips the file and looks for a specific member called word/document.xml. However, some versions of Office 365 produce .docx files that contain word/document2.xml instead. Adding this name to the list of checks in https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py#L749 fixes the issue.
I tested this change locally and found no issues.
I also looked for an explanation in the official Microsoft docs, but found no explicit mention of the word/document2.xml naming.
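As a possible alternative to hard-coding both member names, here is a minimal sketch of how the main document part could be located from the package relationships instead. It is not how unstructured currently does detection; find_main_document_part and looks_like_docx are hypothetical helper names, and the sketch assumes the file follows the Open Packaging Conventions layout, where _rels/.rels contains a relationship whose type ends in /relationships/officeDocument.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by OPC relationship files such as _rels/.rels.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
# Suffix of the relationship type that points at the main document part.
# Matching on the suffix is an assumption made to tolerate namespace variants.
OFFICE_DOCUMENT_SUFFIX = "/relationships/officeDocument"


def find_main_document_part(file):
    """Return the archive path of the main document part, or None.

    `file` may be a path or a seekable binary file-like object.
    (Hypothetical helper, not part of unstructured.)
    """
    with zipfile.ZipFile(file) as archive:
        if "_rels/.rels" not in archive.namelist():
            return None
        root = ET.fromstring(archive.read("_rels/.rels"))
    for rel in root.iter(f"{RELS_NS}Relationship"):
        if rel.get("Type", "").endswith(OFFICE_DOCUMENT_SUFFIX):
            # Targets may be written with a leading slash; strip it so the
            # result matches the names returned by ZipFile.namelist().
            return rel.get("Target", "").lstrip("/")
    return None


def looks_like_docx(file):
    """Heuristic: the main document part lives under word/."""
    part = find_main_document_part(file)
    return part is not None and part.startswith("word/")
```

If the observation above is right, find_main_document_part("Document.docx") would be expected to return word/document2.xml for the attached file, so a check along these lines would not depend on the exact member name.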