bug/Incorrectly classifying Chinese text as "title" #3930

Open
anth0nyhak1m opened this issue Feb 19, 2025 · 0 comments
Labels
bug Something isn't working

Hello unstructured team,

I am currently using unstructured with PaddleOCR to pre-process Chinese documents.

The goal is to strip away any paratext in the document and only keep main body text.
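
To make the goal concrete, here is a minimal sketch of the kind of filtering step involved. It assumes the elements have been serialized to dictionaries with "type" and "text" keys (the shape unstructured uses for its JSON output). The choice of which categories count as body text, and the helper names `keep_body_text` and `category_counts`, are illustrative assumptions, not part of the library.

```python
from collections import Counter

# Element categories treated as main body text. This set is an assumption
# made for illustration; adjust it to the documents at hand.
BODY_TYPES = {"NarrativeText", "ListItem"}


def keep_body_text(elements):
    """Return the text of elements whose category is considered body text."""
    return [
        el["text"]
        for el in elements
        if el.get("type") in BODY_TYPES and el.get("text", "").strip()
    ]


def category_counts(elements):
    """Tally how many elements fall into each category."""
    return Counter(el.get("type") for el in elements)
```

A filter like this depends entirely on the classifier's labels: if every element comes back as "Title" or "UncategorizedText", `keep_body_text` returns an empty list, and `category_counts` makes that skew easy to see.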

The problem is that all of the text is falsely classified as either "title" or "uncategorizedtext".

I am using:

paddleocr>=2.9.1 (not sure if this is being utilized at all by unstructured)
paddlepaddle==3.0.0b1
pymupdf>=1.25.3
unstructured-ingest>=0.5.5
unstructured-paddleocr==2.8.1.0
unstructured[pdf]>=0.16.21

I saw in the release notes that this problem was addressed in unstructured v0.16.12.

The code I am using is the example code here: https://docs.unstructured.io/open-source/how-to/set-ocr-agent

But instead of importing partition_image from unstructured.partition.image, I am importing partition_pdf from unstructured.partition.pdf.
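
For illustration, the setup described above presumably looks roughly like the following sketch. The file path passed to the function, the `strategy="hi_res"` choice, the `languages=["chi_sim"]` value, and the helper names `partition_chinese_pdf` and `summarize` are all assumptions; whether the PaddleOCR agent honors the `languages` argument is not confirmed here.

```python
import os

# Select the PaddleOCR agent via the OCR_AGENT environment variable,
# set before unstructured is imported.
os.environ["OCR_AGENT"] = (
    "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
)


def partition_chinese_pdf(path):
    """Partition a PDF with unstructured (path is supplied by the caller)."""
    # Imported lazily so the OCR_AGENT setting above is already in place.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(
        filename=path,
        strategy="hi_res",      # assumption: layout-model strategy with OCR
        languages=["chi_sim"],  # assumption: Simplified Chinese documents
    )


def summarize(elements):
    """Pair each element's category with the start of its text for inspection."""
    return [(el.category, el.text[:40]) for el in elements]
```

Calling `summarize(partition_chinese_pdf("example.pdf"))` on a sample file (the filename is hypothetical) lists the category assigned to each element, which is where the "title" and "uncategorizedtext" labels show up.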

Please let me know whether I am using your library correctly, whether there are any parameters I can set to improve the classification, or whether this is simply a current limitation. Does the API do a better job?

Thanks in advance.

anth0nyhak1m added the bug label on Feb 19, 2025