Use Open Source to Extract Data #27

Open
5 tasks
jillcfoley1 opened this issue Feb 27, 2025 · 0 comments
jillcfoley1 commented Feb 27, 2025

As a system, I leverage an open-source OCR tool to automate data extraction so that the product team can evaluate the tool's effectiveness compared to AWS Textract.

AC:

  • Explore and evaluate open-source OCR tool options, focusing on essential capabilities such as character recognition accuracy and ease of integration. Identify the highest-merit open-source tool for initial experimentation, prioritizing efficiency, measurable outcomes, and automation potential.
  • There is a second application that uses the open-source tool, in addition to the AWS Textract application.
  • The new open-source tool is integrated into the new application.
  • The team has designed and executed side-by-side comparison testing across both applications and has an understanding of the accuracy of both. Evaluate measurable outcomes such as accuracy rates, processing speed, and error handling capabilities.
  • The team has a documented understanding of the comparative effectiveness, measurable savings potential, and any efficiencies gained through the use of an open-source tool.

Technical Details
Technical Stack Considerations:

  • Identify compatible OCR libraries (e.g., Tesseract, PaddleOCR, or other open-source alternatives to Textract).
  • Ensure selected tool supports required file types (PDF, TIFF, PNG, JPG).
  • Implement logging and structured reporting for extracted data, errors, and confidence scores.
  • Where possible, containerize the OCR service for consistent deployment across environments.
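
One possible shape for the structured logging bullet above, as a minimal sketch. The `PageResult` schema, its field names, and the `ocr_eval` logger name are assumptions made for illustration; the issue does not define them. The idea is that each page result is written as one JSON line so that results from either engine can be parsed the same way later.

```python
import json
import logging
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical logger name; not defined anywhere in the issue.
logger = logging.getLogger("ocr_eval")


@dataclass
class PageResult:
    """One OCR result for a single page (assumed schema, for illustration)."""
    engine: str                  # e.g. "tesseract" or "textract"
    source_file: str
    page: int
    text: str
    confidence: Optional[float]  # None if the engine reports no score
    error: Optional[str] = None  # populated when the page failed


def log_page_result(result: PageResult) -> str:
    """Serialize a page result as one JSON line, log it, and return the line."""
    line = json.dumps(asdict(result), sort_keys=True)
    # Failed pages are logged at ERROR so they are easy to filter.
    level = logging.ERROR if result.error else logging.INFO
    logger.log(level, line)
    return line
```

Keeping the engine name inside every record means the same log stream can hold output from both applications.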

Data Pipeline Integration:

  • Ingest source files via standardized input folder structure, API call, or message queue.
  • Normalize outputs (structured JSON, XML, or CSV) to align with product team’s existing comparison scripts.
  • Implement error handling for failed pages, unsupported formats, and language mismatches.
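
The error-handling bullet above could be approached roughly as follows. This is a sketch under stated assumptions: `ocr_page` stands in for whichever engine call is eventually chosen, and the record layout (`pages`, `errors`) is hypothetical rather than the product team's actual comparison format. Only the file types listed earlier in this issue are accepted.

```python
from pathlib import Path

# File types named in this issue (PDF, TIFF, PNG, JPG), with common suffix variants.
SUPPORTED_SUFFIXES = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


class UnsupportedFormatError(ValueError):
    """Raised when a source file has a suffix the pipeline does not accept."""


def check_supported(path: str) -> str:
    """Return the lowercased suffix, or raise if the file type is unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise UnsupportedFormatError(f"unsupported file type: {suffix or '(none)'}")
    return suffix


def extract_document(path, pages, ocr_page):
    """Run ocr_page on each page and record failures instead of aborting.

    ocr_page is a placeholder callable (page -> text) for the chosen engine.
    """
    record = {"source_file": path, "pages": [], "errors": []}
    try:
        check_supported(path)
    except UnsupportedFormatError as exc:
        record["errors"].append({"page": None, "error": str(exc)})
        return record
    for number, page in enumerate(pages, start=1):
        try:
            record["pages"].append({"page": number, "text": ocr_page(page)})
        except Exception as exc:  # one failed page should not stop the document
            record["errors"].append({"page": number, "error": str(exc)})
    return record
```

The returned dictionary can be passed to `json.dumps` as-is, which keeps the normalized output format independent of the OCR engine.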

Testing and Validation:

  • Create a standardized test corpus representative of real-world data.
  • Compare outputs using measurable criteria:
    ◦ Character-level accuracy
    ◦ Field-level completeness
    ◦ Confidence score variance
    ◦ Processing time per page
  • Use automated comparison scripts to generate side-by-side accuracy reporting.
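
Two of the criteria above can be computed with a few lines of standard-library Python. The sketch below assumes character-level accuracy is defined as one minus the character error rate (Levenshtein edit distance divided by the length of the ground-truth text), and that a field counts as complete when it is present and non-blank. Both definitions are assumptions; the issue does not fix them, and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 - CER, clamped at 0.0; the reference is the ground-truth text."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


def field_completeness(expected_fields, extracted: dict) -> float:
    """Fraction of expected fields that are present and non-blank."""
    if not expected_fields:
        return 1.0
    filled = 0
    for name in expected_fields:
        value = extracted.get(name)
        if value is not None and str(value).strip():
            filled += 1
    return filled / len(expected_fields)
```

Running the same functions over the output of both applications against a shared ground truth gives directly comparable numbers per document.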

Reporting Artifact:

  • Deliver structured evaluation report covering:
    ◦ Tool selection rationale
    ◦ Accuracy comparison (open-source vs AWS Textract)
    ◦ Processing performance (speed and failure rates)
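
A side-by-side summary for the report could be assembled along these lines. This is only a sketch: the per-document run format (`accuracy`, `seconds`, `pages`) and the function names are hypothetical, and a document is treated as failed when its `accuracy` is `None`.

```python
from statistics import mean


def summarize(engine: str, runs: list) -> dict:
    """Aggregate per-document runs for one engine.

    Each run is assumed to look like
    {"accuracy": float or None (None = failed), "seconds": float, "pages": int}.
    """
    ok = [r for r in runs if r["accuracy"] is not None]
    total_pages = sum(r["pages"] for r in runs)
    return {
        "engine": engine,
        "documents": len(runs),
        "failure_rate": (len(runs) - len(ok)) / len(runs) if runs else 0.0,
        "mean_accuracy": mean(r["accuracy"] for r in ok) if ok else None,
        "seconds_per_page": (sum(r["seconds"] for r in runs) / total_pages
                             if total_pages else None),
    }


def _fmt(value) -> str:
    """Render a metric value for the plain-text table."""
    if value is None:
        return "n/a"
    if isinstance(value, float):
        return f"{value:.3f}"
    return str(value)


def side_by_side(a: dict, b: dict) -> str:
    """Render two summaries as a plain-text comparison table."""
    lines = [f"{'metric':<18}{a['engine']:>14}{b['engine']:>14}"]
    for key in ("documents", "failure_rate", "mean_accuracy", "seconds_per_page"):
        lines.append(f"{key:<18}{_fmt(a[key]):>14}{_fmt(b[key]):>14}")
    return "\n".join(lines)
```

The tool selection rationale remains a written section; only the accuracy and performance rows lend themselves to being generated this way.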

Hypothesis (OCR validation): We believe that integrating a merit-based open-source OCR tool can streamline operations and reduce waste by replacing AWS Textract, delivering equivalent or superior accuracy without compromising essential performance standards.

@em-herrick em-herrick moved this from Backlog to Sprint Targets in Document Extractor POC Mar 3, 2025
@jillcfoley1 jillcfoley1 changed the title Use Tesseract Open Source to Extract Data Use Open Source to Extract Data Mar 3, 2025
@jillcfoley1 jillcfoley1 added the dev label Mar 3, 2025
@halprin halprin moved this from Sprint Targets to In Progress in Document Extractor POC Mar 11, 2025
Status: In Progress

3 participants