Use Open Source to Extract Data #27

jillcfoley1 · 2025-02-27T17:27:10Z

As a system, I leverage an open-source OCR tool to automate data extraction so that the product team can evaluate the tool's effectiveness compared to AWS Textract.

AC:

Explore and evaluate open-source OCR tool options, focusing on essential capabilities such as character recognition accuracy and ease of integration. Identify the highest-merit open-source tool for initial experimentation, prioritizing efficiency, measurable outcomes, and automation potential.
There is a second application that uses the open-source tool, in addition to the AWS Textract application.
The new open-source tool is integrated into the new application.
The team has designed and executed side-by-side comparison testing across both applications and has an understanding of the accuracy of both. Evaluate measurable outcomes such as accuracy rates, processing speed, and error handling capabilities.
The team has a documented understanding of the comparative effectiveness, measurable savings potential, and any efficiencies gained through the use of an open-source tool.

Technical Details
Technical Stack Considerations:

Identify compatible OCR libraries (e.g., Tesseract, PaddleOCR, or other open-source alternatives to Textract).
Ensure the selected tool supports the required file types (PDF, TIFF, PNG, JPG).
Implement logging and structured reporting for extracted data, errors, and confidence scores.
Where possible, containerize the OCR service for consistent deployment across environments.
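
A minimal sketch of what the file-type check and structured logging could look like, using only the Python standard library. The `PageResult` fields, the function names, and the JSON-lines log format are assumptions for illustration; nothing here is prescribed by this issue or by any OCR library.

```python
import json
import logging
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional

# File types listed in this issue; extension spellings are an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}


@dataclass
class PageResult:
    """Hypothetical per-page record shared by both OCR applications."""
    source_file: str
    page: int
    engine: str
    text: str = ""
    mean_confidence: Optional[float] = None
    error: Optional[str] = None


def is_supported(path: str) -> bool:
    """Return True when the file extension is one of the required types."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def to_log_record(result: PageResult) -> str:
    """Serialize a page result as one JSON line for structured reporting."""
    return json.dumps(asdict(result), sort_keys=True)


def log_result(logger: logging.Logger, result: PageResult) -> None:
    """Log failed pages at ERROR and successful pages at INFO."""
    level = logging.ERROR if result.error else logging.INFO
    logger.log(level, to_log_record(result))
```

Emitting one JSON object per page keeps extracted text, errors, and confidence scores in a single stream that later comparison tooling can parse line by line.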

Data Pipeline Integration:

Ingest source files via standardized input folder structure, API call, or message queue.
Normalize outputs (structured JSON, XML, or CSV) to align with the product team’s existing comparison scripts.
Implement error handling for failed pages, unsupported formats, and language mismatches.
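
One possible shape for the normalization and error-handling steps above, sketched under assumptions. The Textract side reads the `Blocks`, `BlockType`, `Text`, `Confidence`, and `Page` fields of a Textract response. The Tesseract side assumes the dict of parallel lists that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns. The common schema (`page`, `text`, `confidence`) and the function names are hypothetical and would need to match whatever the existing comparison scripts expect.

```python
from typing import Any, Callable, Dict, List

Line = Dict[str, Any]


def normalize_textract(response: Dict[str, Any]) -> List[Line]:
    """Keep only LINE blocks and map them onto the common schema."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        lines.append({
            "page": block.get("Page", 1),  # may be absent for single-page calls
            "text": block.get("Text", ""),
            "confidence": round(float(block.get("Confidence", 0.0)), 2),
        })
    return lines


def normalize_tesseract(data: Dict[str, List[Any]]) -> List[Line]:
    """Group word-level rows into lines and average their confidences."""
    grouped: Dict[tuple, List[tuple]] = {}
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if conf < 0 or not word.strip():
            continue  # conf of -1 marks non-word layout rows
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        grouped.setdefault(key, []).append((word, conf))
    lines = []
    for key in sorted(grouped):
        words = grouped[key]
        lines.append({
            "page": key[0],
            "text": " ".join(w for w, _ in words),
            "confidence": round(sum(c for _, c in words) / len(words), 2),
        })
    return lines


def run_page(extract: Callable[[Any], List[Line]], page_input: Any) -> Dict[str, Any]:
    """Run one page and record a failure instead of aborting the batch."""
    try:
        return {"status": "ok", "error": None, "lines": extract(page_input)}
    except Exception as exc:  # broad on purpose: any page failure is reported
        return {"status": "failed", "error": f"{type(exc).__name__}: {exc}", "lines": []}
```

Both normalizers keep confidence on a 0 to 100 scale, so the two applications can be compared without further conversion.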

Testing and Validation:

Create a standardized test corpus representative of real-world data.
Compare outputs using measurable criteria: character-level accuracy, field-level completeness, confidence score variance, and processing time per page.
Use automated comparison scripts to generate side-by-side accuracy reporting.
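
A hedged sketch of how three of these criteria might be computed. Character-level accuracy is taken here as 1 minus the character error rate (Levenshtein edit distance divided by reference length), which is one common definition but an assumption about what the team intends. The function names and the field-dictionary shape are hypothetical.

```python
import time
from typing import Any, Callable, Dict, List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 - CER, clamped at 0 when the OCR output is much longer than the truth."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


def field_completeness(expected_fields: Sequence[str], extracted: Dict[str, Any]) -> float:
    """Share of expected fields that came back with a non-empty value."""
    if not expected_fields:
        return 1.0
    found = 0
    for name in expected_fields:
        value = extracted.get(name)
        if value is not None and str(value).strip():
            found += 1
    return found / len(expected_fields)


def seconds_per_page(extract: Callable[[Any], Any], pages: List[Any]) -> float:
    """Average wall-clock time per page for a given extraction callable."""
    if not pages:
        return 0.0
    start = time.perf_counter()
    for page in pages:
        extract(page)
    return (time.perf_counter() - start) / len(pages)
```

Running the same functions over the outputs of both applications, against the same ground-truth corpus, gives numbers that can be placed side by side.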

Reporting Artifact:

Deliver a structured evaluation report covering tool selection rationale, accuracy comparison (open-source vs. AWS Textract), and processing performance (speed and failure rates).
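
As a rough illustration of the quantitative part of such a report, the sketch below aggregates per-document results into one summary row per engine and renders it as CSV. The record keys (`engine`, `char_accuracy`, `seconds_per_page`, `failed`) are assumed names, not an agreed schema.

```python
import csv
import io
from statistics import mean
from typing import Any, Dict, List


def summarize(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Aggregate per-document records into one summary row per engine."""
    by_engine: Dict[str, List[Dict[str, Any]]] = {}
    for record in records:
        by_engine.setdefault(record["engine"], []).append(record)

    summary = []
    for engine in sorted(by_engine):
        rows = by_engine[engine]
        ok = [r for r in rows if not r.get("failed")]
        summary.append({
            "engine": engine,
            "documents": len(rows),
            "failure_rate": round((len(rows) - len(ok)) / len(rows), 3),
            # Means cover successful documents only; None when all failed.
            "mean_char_accuracy":
                round(mean(r["char_accuracy"] for r in ok), 3) if ok else None,
            "mean_seconds_per_page":
                round(mean(r["seconds_per_page"] for r in ok), 3) if ok else None,
        })
    return summary


def to_csv(summary: List[Dict[str, Any]]) -> str:
    """Render the summary rows as CSV text for the evaluation report."""
    if not summary:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(summary[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(summary)
    return buffer.getvalue()
```

Failure rate is reported separately from accuracy so that an engine is not rewarded for silently skipping hard documents.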

Hypothesis (OCR validation): We believe that replacing AWS Textract with a merit-based open-source OCR tool can streamline operations and reduce waste while delivering equivalent or superior accuracy, without compromising essential performance standards.

jillcfoley1 moved this to Backlog in Document Extractor POC Feb 27, 2025

jillcfoley1 added this to Document Extractor POC Feb 27, 2025

em-herrick moved this from Backlog to Sprint Targets in Document Extractor POC Mar 3, 2025

jillcfoley1 changed the title ~~Use Tesseract Open Source to Extract Data~~ Use Open Source to Extract Data Mar 3, 2025

jillcfoley1 added the dev label Mar 3, 2025

jillcfoley1 assigned halprin and lrichardson-flexion Mar 3, 2025

halprin moved this from Sprint Targets to In Progress in Document Extractor POC Mar 11, 2025
