bug/unable to process .docx file from office 365 #3937

Open
srisudarsan opened this issue Mar 3, 2025 · 1 comment
Labels
bug Something isn't working

Comments


srisudarsan commented Mar 3, 2025

Describe the bug
When I call partition on a .docx file, it fails with the error unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.ZIP file type.

To Reproduce
Download the attached Document.docx file and run the code below:

from unstructured.partition.auto import partition
elements = partition("Document.docx")

Expected behavior
Partitioning should complete without any errors.

Screenshots
NA

Environment Info
OS version: macOS-14.6.1-arm64-arm-64bit
Python version: 3.10.15
unstructured version: None
unstructured-inference version: 0.8.7
pytesseract is not installed
Torch version: 2.6.0
Detectron2 is not installed

[notice] A new release of pip is available: 23.2.1 -> 25.0.1
[notice] To update, run: python3 -m pip install --upgrade pip

PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'

Additional context
On further debugging, I found that to recognize a .docx file, unstructured opens it as a zip archive and looks for a specific member named word/document.xml. In certain versions of Office 365, however, .docx files come with word/document2.xml instead, so the check fails and the file is classified as a plain ZIP. Adding this name to the list of checks in https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py#L749 fixes the issue.
I tried this change locally and found no issues.

I tried to confirm this naming in the official Microsoft docs, but they do not mention it explicitly.
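
To check whether a given file is affected, the archive members can be listed with the standard library. This is only a diagnostic sketch: main_document_parts is a hypothetical helper name, not part of unstructured, and the prefix match on word/document is an assumption based on the two names seen above.

```python
import io
import zipfile


def main_document_parts(docx_source):
    """Return archive members that look like the main Word document part.

    Hypothetical helper for diagnosis only; not an unstructured API.
    Accepts a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Assumption: the main part is named word/document*.xml
            if name.startswith("word/document") and name.endswith(".xml")
        )


# Demo on an in-memory archive that mimics the affected layout.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("word/document2.xml", "<document/>")
    archive.writestr("word/styles.xml", "<styles/>")
demo.seek(0)
print(main_document_parts(demo))
```

Running main_document_parts("Document.docx") on the attached file should show word/document2.xml rather than word/document.xml if the diagnosis above is right.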

@srisudarsan srisudarsan added the bug Something isn't working label Mar 3, 2025
@srisudarsan
Author

While searching for this, I came across similar issues in other libraries. A few of them, for reference:
open-xml-templating/docxtemplater#366
neilharvey/FileSignatures#46

The same problem could also affect other Office file formats such as PPTX and XLSX.

Let me know what you think. I can potentially suggest a fix for this too.
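
One possible direction, sketched here under assumptions: instead of matching fixed member names, read the package manifest [Content_Types].xml and look for the content type of the main part, which does not depend on the part being called document.xml or document2.xml. sniff_ooxml_kind and MAIN_PART_TYPES are hypothetical names for illustration, not existing unstructured code, and the mapping deliberately leaves out macro-enabled and template variants.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OPC content-types manifest.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Main-part content types for the three common formats (not exhaustive).
MAIN_PART_TYPES = {
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document.main+xml": "docx",
    "application/vnd.openxmlformats-officedocument"
    ".presentationml.presentation.main+xml": "pptx",
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet.main+xml": "xlsx",
}


def sniff_ooxml_kind(source):
    """Return "docx", "pptx", "xlsx", or None for a path or binary file object.

    Hypothetical helper: decides by declared content type, not by part name.
    """
    with zipfile.ZipFile(source) as archive:
        try:
            manifest = archive.read("[Content_Types].xml")
        except KeyError:
            return None  # a zip, but not an OPC package
    root = ET.fromstring(manifest)
    for override in root.iter(CT_NS + "Override"):
        kind = MAIN_PART_TYPES.get(override.get("ContentType"))
        if kind is not None:
            return kind
    return None
```

With this approach a file whose main part is word/document2.xml would still be reported as docx, as long as its manifest declares the usual main-document content type for that part.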
