bug/unable to process .docx file from office 365 #3937

Closed
@srisudarsan

Describe the bug
When I call partition on a .docx file, it fails with the error unstructured.partition.common.UnsupportedFileFormatError: Partitioning is not supported for the FileType.ZIP file type.

To Reproduce
Download the attached Document.docx file and run the code below:

from unstructured.partition.auto import partition
elements = partition("Document.docx")

Expected behavior
Partitioning should complete without any errors.

Screenshots
NA

Environment Info
OS version: macOS-14.6.1-arm64-arm-64bit
Python version: 3.10.15
unstructured version: None
unstructured-inference version: 0.8.7
pytesseract is not installed
Torch version: 2.6.0
Detectron2 is not installed

[notice] A new release of pip is available: 23.2.1 -> 25.0.1
[notice] To update, run: python3 -m pip install --upgrade pip
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
Traceback (most recent call last):
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'

Additional context
On further debugging, I found that to recognize a .docx file, unstructured unzips it and looks for a specific member named word/document.xml. In certain versions of Office 365, however, .docx files are produced with word/document2.xml instead, so the file is classified as a plain ZIP. Adding this name to the list of checks in https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py#L749 fixes the issue.
I tried this change locally and found no issues.

I looked for an explanation in the official Microsoft docs, but found no explicit mention of this naming.
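
To illustrate the kind of check described above, here is a minimal sketch using only the standard library. It is not the code in filetype.py; the names DOCX_MAIN_PARTS and looks_like_docx are hypothetical and chosen for this example, and the only assumption taken from this issue is that word/document2.xml should be accepted alongside word/document.xml.

```python
import io
import zipfile

# Hypothetical list of member names treated as the Word main-document part.
# "word/document2.xml" is the extra name proposed in this issue.
DOCX_MAIN_PARTS = ("word/document.xml", "word/document2.xml")


def looks_like_docx(file) -> bool:
    """Return True if `file` (a path or binary file object) is a ZIP archive
    containing one of the known Word main-document parts."""
    if not zipfile.is_zipfile(file):
        return False
    with zipfile.ZipFile(file) as archive:
        names = set(archive.namelist())
    return any(part in names for part in DOCX_MAIN_PARTS)
```

With a check like this, looks_like_docx("Document.docx") should return True for the attached file if it does contain word/document2.xml, whereas a check limited to word/document.xml would fall through to the generic ZIP type.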

Labels: bug