bug/<UnstructuredFileLoader Fails on Markdown Files (.md) – AttributeError in partition_md> #3935

Open
kaustubh-darekar opened this issue Feb 24, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@kaustubh-darekar

I’m encountering an issue when using UnstructuredFileLoader to process a Markdown (.md) file.

The loader throws an AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing' when calling partition_md internally.

Steps to Reproduce:

1. Install unstructured and langchain-community:
    pip install unstructured langchain-community

2. Run the following code against the attached file, sparql-language-ref.md:
    from langchain_community.document_loaders import UnstructuredFileLoader

    file_path = "sparql-language-ref.md"
    loader = UnstructuredFileLoader(file_path, mode="elements", autodetect_encoding=True)
    pages = loader.load()

3. Observed Error:
    AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'

Expected Behavior:
• The Markdown file should be successfully loaded and parsed into elements.
• If the file has processing instructions, they should be ignored or handled gracefully without causing a crash.
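To check whether the attached file actually contains processing-instruction-like markup, a quick scan of the raw Markdown can help. This is only a diagnostic sketch: the helper name find_processing_instructions and the pattern (a "<?" opener followed by anything up to the next ">") are my own assumptions about what the HTML parser would treat as a processing instruction, not something taken from unstructured.

```python
import re

# Assumed pattern: "<?" followed by any characters up to the next ">".
PI_PATTERN = re.compile(r"<\?[^>]*>")


def find_processing_instructions(markdown_text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_text) for each '<?...>' sequence found."""
    hits = []
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        for match in PI_PATTERN.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits
```

Calling it on the contents of sparql-language-ref.md (read with open(...).read()) should list the lines worth inspecting; an empty list would suggest the crash has a different trigger.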

Actual Behavior:
• The process crashes with an AttributeError in partition_md, specifically at:
while q and q[0].is_phrasing:
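Until this is handled inside partition_md, one possible workaround is to escape every "<?" opener in a copy of the file before loading it, so the HTML parser sees plain text rather than a processing instruction. The sketch below is untested against unstructured 0.16.23 and assumes the crash is triggered by "<?" sequences in the Markdown source, which this report does not confirm. The helper names are hypothetical, and the replacement is blunt: a "<?" inside a fenced code block will come out as the literal text "&lt;?".

```python
import tempfile
from pathlib import Path


def escape_processing_instructions(markdown_text: str) -> str:
    """Replace every '<?' opener with '&lt;?' so it is read as text, not markup."""
    return markdown_text.replace("<?", "&lt;?")


def write_sanitized_copy(src_path: str) -> str:
    """Write an escaped copy of src_path to a temporary .md file and return its path."""
    text = Path(src_path).read_text(encoding="utf-8", errors="replace")
    with tempfile.NamedTemporaryFile(
        "w", suffix=".md", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(escape_processing_instructions(text))
    return tmp.name
```

The returned path can then be passed to the loader in place of the original, for example UnstructuredFileLoader(write_sanitized_copy(file_path), mode="elements"). The temporary file is not deleted automatically.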

Environment Details:
• unstructured version: 0.16.23
• langchain_community version: 0.3.18
• Python version: 3.10
• OS: Ubuntu 22.04 (WSL/Cloud-based environment)

Would appreciate any insights or a workaround for this issue! Thanks! 🙌

@kaustubh-darekar added the bug label on Feb 24, 2025