Hello unstructured team,
I am currently using unstructured with PaddleOCR to pre-process Chinese documents.
The goal is to strip away any paratext in the document and keep only the main body text.
The problem is that all of the text is incorrectly classified as either "Title" or "UncategorizedText".
I am using:
paddleocr>=2.9.1 (not sure whether unstructured uses this at all)
paddlepaddle==3.0.0b1
pymupdf>=1.25.3
unstructured-ingest>=0.5.5
unstructured-paddleocr==2.8.1.0
unstructured[pdf]>=0.16.21
I saw in the release notes that this problem was addressed in unstructured v0.16.12.
The code I am using is the example code here: https://docs.unstructured.io/open-source/how-to/set-ocr-agent
But instead of importing partition_image from unstructured.partition.image, I am importing partition_pdf from unstructured.partition.pdf.
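Roughly, the setup looks like the sketch below. It is not copied from the docs example: the `strategy="hi_res"` and `languages=["chi_sim"]` arguments are assumptions for illustration, and `partition_chinese_pdf` and `keep_body_text` are hypothetical helper names. The filter shows what I am ultimately trying to do, which is to keep only elements whose category is "NarrativeText".

```python
import os
from types import SimpleNamespace

# Select the PaddleOCR agent through the OCR_AGENT environment variable
# before unstructured is imported.
os.environ["OCR_AGENT"] = (
    "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
)


def partition_chinese_pdf(path):
    """Partition a PDF with unstructured (hypothetical wrapper).

    Assumption: hi_res strategy and the Tesseract-style code "chi_sim"
    for Simplified Chinese; adjust both to your own setup.
    """
    # Imported lazily so the filter below can be tried without unstructured.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, strategy="hi_res", languages=["chi_sim"])


def keep_body_text(elements, body_categories=("NarrativeText",)):
    """Return the text of elements whose category marks them as body text."""
    return [el.text for el in elements if el.category in body_categories]


# Stand-in elements that mimic the `category` and `text` attributes,
# so the filter can be demonstrated without running OCR.
sample = [
    SimpleNamespace(category="Title", text="第一章"),
    SimpleNamespace(category="NarrativeText", text="这是正文。"),
    SimpleNamespace(category="UncategorizedText", text="12"),
]
print(keep_body_text(sample))
```

With real output this would be called as `keep_body_text(partition_chinese_pdf("my_document.pdf"))`, where the file name is a placeholder. In my case the filter returns nothing useful, because no element comes back as "NarrativeText".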
Could you let me know whether I am using the library correctly, and whether there are any hyperparameters I can set to improve the classification? Or is this simply a current limitation, and does the API do a better job?
Thanks in advance.