Feat/pdf support #23
base: main
Walkthrough
This pull request adds PDF file support and improved amount extraction to a document OCR application. The changes add PDF processing, implement a new amount extraction function, and update the application to handle PDF files alongside images. New dependencies support these features, and environment configuration files are updated to protect sensitive information.
Sequence Diagram
sequenceDiagram
participant User
participant Streamlit
participant PDFConverter
participant OCRModel
participant AmountExtractor
User->>Streamlit: Upload PDF/Image
Streamlit->>PDFConverter: Convert PDF to Image
PDFConverter-->>Streamlit: Image Bytes
Streamlit->>OCRModel: Send Image for Text Extraction
OCRModel-->>Streamlit: Extracted Text
Streamlit->>AmountExtractor: Clean and Validate Amount
AmountExtractor-->>Streamlit: Processed Amount
Streamlit->>User: Display Extraction Results
Actionable comments posted: 0
🧹 Nitpick comments (4)
llama-ocr/app.py (4)
30-50: Consider broader locale handling or specialized currency parsing.
The `extract_amount` function removes all non-numeric characters, handles multiple decimal points, and returns a float. This approach is straightforward but can produce wrong values for some inputs: negative amounts (written with a minus sign or in parentheses), currency symbols such as '€', and locale formats that use a comma as the decimal separator. If you plan to support more complex currency formats, consider a specialized library (e.g., Babel), or handle negative values and locale-specific separators explicitly.
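A minimal sketch of what more tolerant parsing could look like, using only the standard library. The name `parse_amount` and the separator heuristic are assumptions for illustration, not the PR's `extract_amount`: when both `.` and `,` appear, the later one is treated as the decimal separator, and a lone separator followed by exactly three digits is treated as a thousands separator.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse a currency string into a Decimal; return None if no amount is found.

    Hypothetical helper: the separator rules below are a heuristic, not a
    full locale implementation.
    """
    s = text.strip()
    first_digit = re.search(r"\d", s)
    if first_digit is None:
        return None

    # Negative if written in accounting style "(45.00)" or with a minus
    # sign somewhere before the first digit, e.g. "-$12.50".
    negative = (s.startswith("(") and s.endswith(")")) or "-" in s[: first_digit.start()]

    # Keep only digits and the two candidate separators.
    digits = re.sub(r"[^\d.,]", "", s)
    last_dot, last_comma = digits.rfind("."), digits.rfind(",")

    if last_dot != -1 and last_comma != -1:
        # Both present: the one that appears last is the decimal separator.
        decimal_sep = "." if last_dot > last_comma else ","
    else:
        sep = "." if last_dot != -1 else ","
        tail = digits.rsplit(sep, 1)[-1]
        # A single separator not followed by exactly three digits is decimal;
        # otherwise assume it groups thousands (or is absent).
        decimal_sep = sep if digits.count(sep) == 1 and len(tail) != 3 else None

    if decimal_sep is None:
        normalized = re.sub(r"[.,]", "", digits)
    else:
        thousands = "," if decimal_sep == "." else "."
        normalized = digits.replace(thousands, "").replace(decimal_sep, ".")

    try:
        value = Decimal(normalized)
    except InvalidOperation:
        return None
    return -value if negative else value
```

`Decimal` is used instead of `float` to avoid binary rounding on monetary values. Inputs such as "1,234" remain inherently ambiguous without a known locale, which is where a library like Babel would be the safer choice.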
51-60: Note that only the first page of the PDF is converted.
Currently, `pdf_to_image` processes only the first page. If multi-page PDF support is needed in the future, consider looping over pages and combining or selecting the desired page(s).
Do you want a snippet showing how to handle multiple pages automatically?
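One possible shape for multi-page handling, sketched with PyMuPDF. The names `pdf_to_images` and `page_indices`, the `max_pages` cap, and the 150 DPI default are assumptions for illustration and do not come from the PR. The `dpi` argument to `get_pixmap` assumes a reasonably recent PyMuPDF release.

```python
from typing import List, Optional


def page_indices(page_count: int, max_pages: Optional[int] = None) -> List[int]:
    """Return the zero-based page indices to render, capped at max_pages."""
    if page_count <= 0:
        return []
    limit = page_count if max_pages is None else max(0, min(max_pages, page_count))
    return list(range(limit))


def pdf_to_images(pdf_bytes: bytes, max_pages: Optional[int] = None, dpi: int = 150) -> List[bytes]:
    """Render each selected PDF page to PNG bytes (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so page_indices works without it

    images: List[bytes] = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for index in page_indices(doc.page_count, max_pages):
            pixmap = doc.load_page(index).get_pixmap(dpi=dpi)
            images.append(pixmap.tobytes("png"))
    return images
```

Capping the page count keeps a very long PDF from triggering one OCR call per page. The caller would then decide whether to run OCR on every returned image or let the user pick a page.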
71-79: Catching and displaying PDF conversion errors.
Using a try-except block for PDF conversion is a good approach to handle malformed or encrypted PDFs. Consider logging errors for debugging to track repeated conversion failures.
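A small sketch of the logging suggestion, assuming the converter is passed in as a callable. The wrapper name `convert_with_logging` and the logger name are hypothetical; the app's actual try-except block may be structured differently.

```python
import logging
from typing import Callable, Optional

# Hypothetical logger name; any module-level logger would do.
logger = logging.getLogger("llama_ocr.pdf")


def convert_with_logging(
    convert: Callable[[bytes], bytes],
    pdf_bytes: bytes,
    filename: str = "<upload>",
) -> Optional[bytes]:
    """Run a PDF converter, logging the full traceback if it fails."""
    try:
        return convert(pdf_bytes)
    except Exception:
        # logger.exception records the stack trace at ERROR level.
        logger.exception("PDF conversion failed for %s (%d bytes)", filename, len(pdf_bytes))
        return None
```

The caller can still show a friendly message in the UI when the result is `None`, while the log keeps the traceback for later debugging.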
92-114: Add error checks for unexpected model output.
The call to `ollama.chat` uses a prompt expecting a single numeric amount. If the model returns text that doesn't contain a clear amount, the extraction logic may fail. Handling unexpected format or empty results would improve reliability.
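As an illustration, the model's reply could be validated before it reaches the amount extraction logic. This sketch assumes the caller has already pulled the reply text out of the response (for example via `response["message"]["content"]`); the function name `amount_text_from_reply` and the "exactly one numeric token" rule are assumptions, not part of the PR.

```python
import re
from typing import Optional


def amount_text_from_reply(content: Optional[str]) -> Optional[str]:
    """Return the single numeric token in a model reply, or None.

    None is returned when the reply is missing, empty, contains no number,
    or contains more than one number (treated as ambiguous).
    """
    if not isinstance(content, str) or not content.strip():
        return None
    # A candidate starts with a digit and may contain digits, dots, commas.
    candidates = re.findall(r"\d[\d.,]*", content)
    if len(candidates) != 1:
        return None
    # Drop trailing punctuation picked up from sentence endings.
    return candidates[0].rstrip(".,")
```

When this returns `None`, the app could show a warning and skip updating the stored result instead of passing unusable text to the parser.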
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- llama-ocr/.gitignore (1 hunks)
- llama-ocr/app.py (3 hunks)
- llama-ocr/requirements.txt (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- llama-ocr/.gitignore
- llama-ocr/requirements.txt
🔇 Additional comments (7)
llama-ocr/app.py (7)
20-20: UI layout approach looks fine.
Defining two columns with different widths is a neat approach to place the clear button. Make sure `col1` is utilized for any additional content if needed.
61-69: Good user interface for file upload.
Users can upload either images or PDFs, which is flexible. You might want to limit maximum file size or handle extremely large PDFs to prevent performance issues.
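A sketch of a size check that could run before any conversion. The 10 MB limit and the helper name `upload_size_error` are assumptions chosen for illustration. In Streamlit, the uploaded file object exposes its size in bytes as `uploaded_file.size`.

```python
from typing import Optional

MAX_UPLOAD_MB = 10  # hypothetical limit; tune for the deployment


def upload_size_error(size_bytes: int, max_mb: float = MAX_UPLOAD_MB) -> Optional[str]:
    """Return an error message if the upload exceeds the limit, else None."""
    limit_bytes = int(max_mb * 1024 * 1024)
    if size_bytes > limit_bytes:
        size_mb = size_bytes / (1024 * 1024)
        return f"File is {size_mb:.1f} MB; the limit is {max_mb:g} MB."
    return None
```

In the app this might be used as `msg = upload_size_error(uploaded_file.size)` followed by `st.error(msg)` when a message is returned. Streamlit's `server.maxUploadSize` configuration option offers a server-wide cap as an alternative.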
80-83: Direct image loading looks solid.
Loading images with PIL is straightforward. You might eventually want to check or warn if the image is very large to help control memory usage.
84-88: Streamlined conversion to PNG bytes.
Converting the image to PNG ensures a consistent format for OCR. In some scenarios, preserving the original format may be necessary (e.g., to maintain higher fidelity). For now, this is fine.
120-120: Clear presentation of extracted result.
Displaying the extracted amount from session state is clear and straightforward. Good job!
126-126: Footer branding looks neat.
Providing a direct link to report issues is helpful for users to submit bug reports and feedback.
5-6: Validate library versions and security patches.
Adding PyMuPDF (`fitz`) and `re` is appropriate for PDF processing and regex-based text cleaning. To ensure security and stability, consider verifying that you are using the latest release of PyMuPDF and that no high-severity vulnerabilities exist in it.
Below is a script you can run to search for advisories on PyMuPDF:
✅ Verification successful
PyMuPDF and re libraries are secure and up-to-date
PyMuPDF is at the latest version (1.25.2) with no known security vulnerabilities reported in GitHub's advisory database. The `re` module is a built-in Python library that's maintained as part of Python's standard library.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for known PyMuPDF vulnerabilities
curl -s https://pypi.org/pypi/PyMuPDF/json | jq '.info.version'
gh api graphql -f query='
{
  securityVulnerabilities(first: 5, ecosystem: PIP, package: "pymupdf") {
    nodes {
      advisory {
        summary
        severity
        publishedAt
      }
      vulnerableVersionRange
      firstPatchedVersion {
        identifier
      }
    }
  }
}'

Length of output: 413
Summary by CodeRabbit
New Features
Dependencies
Chores
`.gitignore` to exclude `.env` files