ValueError: attempt to get argmin of an empty sequence when converting PDF with no headings #88
Comments
The error occurs because the heading_font_sizes array, which is derived from clustering the font sizes larger than the mode, is empty. This can happen if the document has no font sizes larger than the mode font size, or if the clustering fails due to insufficient data.
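To illustrate the first case, here is a minimal sketch of how such an array can end up empty. The font sizes are made up, and the filtering is a simplification that leaves out the clustering step; it is not raglite's actual code.

```python
import numpy as np

# Hypothetical font sizes for a PDF in which every span uses the body size.
font_sizes = np.array([10.0, 10.0, 10.0, 10.0])

# Mode of the font sizes, i.e. the most frequent value.
values, counts = np.unique(font_sizes, return_counts=True)
mode_font_size = values[np.argmax(counts)]

# Candidate heading sizes are those strictly larger than the mode.
# Nothing qualifies here, so the array is empty.
heading_font_sizes = font_sizes[font_sizes > mode_font_size]
print(heading_font_sizes.size)
```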
I've tried to fix this and it worked by doing the following:
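A guard along these lines would avoid the crash. This is only a sketch and not necessarily the change made in the fork: nearest_heading_index is a hypothetical helper that does not exist in raglite, and returning None for "no heading" is an assumption about how the caller would want to handle it.

```python
import numpy as np


def nearest_heading_index(heading_font_sizes, span_font_size):
    """Return the index of the heading size closest to span_font_size.

    Returns None when no heading sizes were detected, instead of
    letting np.argmin raise on an empty array.
    """
    sizes = np.asarray(heading_font_sizes, dtype=float)
    if sizes.size == 0:
        # No headings detected: let the caller treat the span as body text.
        return None
    return int(np.argmin(np.abs(sizes - span_font_size)))
```

With this helper, a caller would set the heading level only when the result is not None, and leave the span as regular text otherwise.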
I've forked the project and applied the modifications. I can create a PR if you want. https://github.com/rafikoham/raglite Yours faithfully, Rafik Adonis Hammoutène
When attempting to convert a PDF document that doesn't contain any detectable headings using document_to_markdown, a ValueError is raised:
ValueError: attempt to get argmin of an empty sequence
This error occurs within the add_heading_level_metadata function in raglite's _markdown.py. It is triggered when the heading_font_sizes array is empty, which indicates that no heading font sizes could be determined from the PDF.
I believe it is trying to find the index of the minimum value in an empty NumPy array (heading_font_sizes). This suggests that my PDF file either has no detectable headings or that their font sizes are not being recognized correctly.
Here's the PDF file:
groovy.pdf
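The failure can be reproduced outside raglite with a few lines. This is a minimal sketch; the span font size is a made-up value, and the expression mirrors the one on the failing line of the traceback.

```python
import numpy as np

heading_font_sizes = np.array([])  # no heading font sizes were detected
span_font_size = 12.0  # hypothetical font size of the current span

try:
    # Same expression as the failing line in add_heading_level_metadata.
    np.argmin(np.abs(heading_font_sizes - span_font_size))
except ValueError as exc:
    message = str(exc)
    print(message)
```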
------------- Full error --------------------------------------------------
ValueError Traceback (most recent call last)
in <cell line: 0>()
1 doc_path=Path(file_path)
----> 2 doc=document_to_markdown(doc_path)
3 doc
4 frames
/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in document_to_markdown(doc_path)
202 # Parse the PDF with pdftext and convert it to Markdown.
203 pages = dictionary_output(doc_path, sort=True, keep_chars=False)
--> 204 doc = "\n\n".join(parsed_pdf_to_markdown(pages))
205 else:
206 try:
/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in parsed_pdf_to_markdown(pages)
184
185 # Add heading level metadata.
--> 186 pages = add_heading_level_metadata(pages)
187 # Add emphasis metadata.
188 pages = add_emphasis_metadata(pages)
/usr/local/lib/python3.11/dist-packages/raglite/_markdown.py in add_heading_level_metadata(pages)
75 idx = 6
76 else:
---> 77 idx = np.argmin(np.abs(heading_font_sizes - span_font_size)) # type: ignore[assignment]
78 span["md"]["heading_level"] = idx + 1
79 heading_level[idx] += len(span["text"])
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in argmin(a, axis, out, keepdims)
1323 """
1324 kwds = {'keepdims': keepdims} if keepdims is not np._NoValue else {}
-> 1325 return _wrapfunc(a, 'argmin', axis=axis, out=out, **kwds)
1326
1327
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
57
58 try:
---> 59 return bound(*args, **kwds)
60 except TypeError:
61 # A TypeError occurs if the object does have such a method in its
ValueError: attempt to get argmin of an empty sequence