library.add_files() only adding .docx file to the library and ignoring .pdf files #49
In the snippet below, I'm attempting to load a collection of documents into a new library. The folder has 1 .docx document and 15 .pdf documents. However, it only seems to be loading the .docx file while ignoring the .pdf files:
This gives me:
It looks like I can just use add_pdf() to get around this, but I'm hoping to understand what I'm doing wrong with add_files(). That would be a much more convenient method to use!
Replies: 2 comments 1 reply
Hi @jcalebsmith. Yes, add_files should be picking up the PDF files. As a test, it would be worth calling add_pdf() directly, if it's not too much work.

If that also fails to pick up the PDF files, here's what is likely happening: the PDF parser expects to find parseable text in the PDFs. If it doesn't (and instead finds mostly embedded images, or not enough text to generate decent-sized "blocks" from), it will skip the given PDF. Is it possible that your PDFs are mostly images or have very little text? If so, you can try OCR: library.add_pdf_by_ocr(input_folder). This is an explicit action because OCR can be a memory-intensive operation.

If your PDFs do contain good amounts of text, and you believe they should be getting parsed by the normal PDF parser, it would be great if you could share one or more with us. If they contain sensitive data, you can send them directly to me at [email protected] and we'll handle them with care and delete them when done with any troubleshooting.
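Before testing the parsers, it can help to confirm what is actually in the input folder, so the counts can be compared with what the library reports. Below is a minimal sketch using only the Python standard library; the helper name count_by_extension is hypothetical and is not part of llmware.

```python
import os
from collections import Counter


def count_by_extension(folder):
    """Count regular files in `folder` by lowercase extension.

    Hypothetical helper for diagnosis only; not part of llmware.
    Subdirectories are ignored and the scan is not recursive.
    """
    counts = Counter()
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            # splitext returns ('name', '.ext'); lowercase so .PDF and .pdf match
            ext = os.path.splitext(name)[1].lower()
            counts[ext] += 1
    return counts
```

Calling count_by_extension on the folder passed to add_files should, for the case described in the question, show one entry under '.docx' and fifteen under '.pdf'. If it doesn't, the path is the first thing to check.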
Ahh, I think I see the problem. In the code above, add_files(...) is only called in the branch where the library doesn't exist yet. Once the library has been created, running that code again won't call add_files(...) again. Can you try moving that line out of the else, like this:
...
# Load the library, or create it if it doesn't exist
if Library().check_if_library_exists(library_name):
    # Load the library
    library = Library().load_library(library_name)
else:
    print(f' > Creating library {library_name}...')
    # Create the library
    library = Library().create_new_library(library_name)
# Add files to the library (now runs whether the library was loaded or created)
library.add_files(os.path.join(folder_path, 'test-data'))
...
And then try running the code again.
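The control-flow difference can be seen in isolation with a small stand-in. This is only a sketch: FakeLibrary, run_original and run_fixed are hypothetical names used for illustration, and the class is an in-memory stub, not the llmware Library class.

```python
class FakeLibrary:
    """Hypothetical in-memory stand-in; NOT the llmware Library class."""

    _store = {}  # library name -> list of folders passed to add_files

    def __init__(self, name):
        self.name = name

    @classmethod
    def exists(cls, name):
        return name in cls._store

    @classmethod
    def create(cls, name):
        cls._store[name] = []
        return cls(name)

    @classmethod
    def load(cls, name):
        return cls(name)

    def add_files(self, folder):
        # Record the call so we can count how often files were added
        self._store[self.name].append(folder)


def run_original(name, folder):
    # add_files sits inside the else branch: it runs only on first creation
    if FakeLibrary.exists(name):
        lib = FakeLibrary.load(name)
    else:
        lib = FakeLibrary.create(name)
        lib.add_files(folder)
    return lib


def run_fixed(name, folder):
    # add_files sits after the if/else: it runs on every invocation
    if FakeLibrary.exists(name):
        lib = FakeLibrary.load(name)
    else:
        lib = FakeLibrary.create(name)
    lib.add_files(folder)
    return lib
```

Running run_original twice with the same name records a single add_files call, while running run_fixed twice records two. With the original structure, files placed in the folder after the first run would never be passed to add_files.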