Identify Document Type Using AI (Back-End) #21

samanthannoor · 2025-02-26T19:53:22Z

User Story: As a system, I need to automate document classification by identifying the type of each uploaded document (W-2, 1099, or pay stub) so that I can extract the relevant fields efficiently and report measurable outcomes.

Acceptance Criteria:

System ingests the uploaded document and automatically evaluates its structure, layout artifacts, and content patterns using an LLM hosted on Amazon Bedrock (or another AI model).
System applies classification algorithms to predict document type (W-2, 1099, pay stub), leveraging pre-trained AI models and any additional programmatic rules necessary to enhance classification effectiveness.
The predicted document type is displayed to the user via the user interface and is also logged for downstream reporting.
A confidence score for classification is logged for evaluation.
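
The "additional programmatic rules" mentioned in the criteria above could be as simple as keyword checks that run alongside the model. The following is a minimal sketch under that assumption; the `RULES` patterns and the `rule_based_type` helper are hypothetical and not part of this story.

```python
import re
from typing import Optional

# Hypothetical keyword rules, one pattern per supported document type.
RULES = [
    ("W-2", re.compile(r"\bW-?2\b|wage and tax statement", re.IGNORECASE)),
    ("1099", re.compile(r"\b1099(-[A-Z]+)?\b", re.IGNORECASE)),
    ("pay stub", re.compile(r"\bpay\s?stub\b|\bpay period\b|\bnet pay\b", re.IGNORECASE)),
]


def rule_based_type(text: str) -> Optional[str]:
    """Return a document type only when exactly one rule matches."""
    matches = [label for label, pattern in RULES if pattern.search(text)]
    # Zero or several matches are treated as "no opinion" so the model decides.
    return matches[0] if len(matches) == 1 else None
```

Returning `None` on ambiguous input keeps the rules as a supplement to the model rather than a replacement for it.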

This story establishes the essential foundation for a broader document intelligence capability, enabling future programmatic evaluation of extracted field accuracy, fraud detection, confidence level benchmarking, and automation effectiveness across the document lifecycle.

Technical Details
Input Handling:

Documents uploaded via API or front-end UI.
Files are normalized to a standard input format (PDF, TIFF, or PNG).
System stores document metadata artifacts, including filename, size, file type, and upload timestamp.
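
As an illustration of the metadata listed above, here is a minimal sketch of a record built at upload time. The `DocumentMetadata` class, the `build_metadata` helper, and the extension-to-type map are assumptions for illustration. The sketch only validates the file extension; it does not perform the format normalization itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Assumption: extension-to-type map mirroring the three formats named above.
ALLOWED_TYPES = {
    ".pdf": "application/pdf",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".png": "image/png",
}


@dataclass(frozen=True)
class DocumentMetadata:
    filename: str
    size_bytes: int
    file_type: str
    uploaded_at: str  # ISO 8601 timestamp in UTC


def build_metadata(filename: str, content: bytes,
                   now: Optional[datetime] = None) -> DocumentMetadata:
    """Capture filename, size, file type, and upload timestamp."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_TYPES:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return DocumentMetadata(
        filename=Path(filename).name,  # drop any directory components
        size_bytes=len(content),
        file_type=ALLOWED_TYPES[suffix],
        uploaded_at=timestamp,
    )
```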

Classification Pipeline:

Document text and layout are extracted using an OCR preprocessing pipeline (could leverage Amazon Textract or equivalent).
Extracted data is passed to an LLM on Amazon Bedrock for semantic and structural classification.
Classification logic uses prompt engineering with document type exemplars to maximize classification accuracy.
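
The exemplar-based prompt described above might be assembled and its reply parsed roughly as follows. This is a sketch, not the team's implementation: the exemplar snippets, the JSON reply format, and the model-reported `confidence` field are all assumptions, and the model call itself is left out.

```python
import json
from typing import Tuple

DOCUMENT_TYPES = ("W-2", "1099", "pay stub")

# Hypothetical exemplars; real ones would come from labelled sample documents.
EXEMPLARS = [
    ("Form W-2 Wage and Tax Statement. Box 1: Wages, tips, other compensation", "W-2"),
    ("Form 1099-NEC. Box 1: Nonemployee compensation", "1099"),
    ("Pay period 01/01 to 01/15. Gross pay, net pay, YTD deductions", "pay stub"),
]


def build_prompt(ocr_text: str) -> str:
    """Build a few-shot prompt that asks for a JSON-only answer."""
    lines = [
        "Classify the document as one of: " + ", ".join(DOCUMENT_TYPES) + ".",
        'Reply with JSON only, for example {"document_type": "W-2", "confidence": 0.9}.',
        "",
        "Examples:",
    ]
    for text, label in EXEMPLARS:
        lines.append(f"Text: {text}\nType: {label}")
    lines += ["", "Document to classify:", ocr_text]
    return "\n".join(lines)


def parse_classification(model_reply: str) -> Tuple[str, float]:
    """Validate the model's JSON reply and return (type, confidence)."""
    data = json.loads(model_reply)  # raises ValueError on malformed JSON
    doc_type = data["document_type"]
    if doc_type not in DOCUMENT_TYPES:
        raise ValueError(f"unexpected document type: {doc_type!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return doc_type, confidence
```

Rejecting any label outside the three supported types keeps a malformed or off-list reply from reaching the UI or the logs.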

Confidence Scoring:

AI-generated classification results include native confidence scores from the model.
System logs both raw confidence scores and any post-processed confidence evaluation (e.g., adjusted scores based on prior classification patterns).
Confidence scores and classification outputs are captured in a reporting artifact for evaluation by the product team.
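
One possible reading of "adjusted scores based on prior classification patterns" is to scale the raw score by how often past predictions of the same type were confirmed in manual review. The sketch below assumes that interpretation; `ReviewHistory`, `adjusted_confidence`, `score_record`, and the smoothing constant are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ReviewHistory:
    correct: int  # past predictions of this type confirmed by a reviewer
    total: int    # past predictions of this type that were reviewed


def adjusted_confidence(raw: float, history: ReviewHistory,
                        prior_weight: int = 5) -> float:
    """Scale the raw score by a smoothed historical accuracy."""
    if not 0.0 <= raw <= 1.0:
        raise ValueError(f"raw confidence out of range: {raw}")
    if not 0 <= history.correct <= history.total:
        raise ValueError("history counts are inconsistent")
    # Optimistic smoothing (an assumption): with no history the factor is 1.0,
    # so the raw score passes through unchanged.
    accuracy = (history.correct + prior_weight) / (history.total + prior_weight)
    return raw * accuracy


def score_record(raw: float, history: ReviewHistory) -> Dict[str, float]:
    """Return both values so each can be logged side by side."""
    return {
        "raw_confidence": raw,
        "adjusted_confidence": adjusted_confidence(raw, history),
    }
```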

Output and Reporting:

Predicted document type and confidence score displayed to the user in the UI.
Full processing log (including predicted type, confidence score, and raw OCR text if needed) saved to system logs for future evaluation.
Classification results are included in effectiveness reports, in line with broader program management objectives for evaluating AI-enabled automation.
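
A processing log entry of the kind described above could be written as one JSON line per document. This is a minimal sketch; the field names, the logger name, and the `document_id` parameter are assumptions.

```python
import json
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("document_classifier")  # hypothetical logger name


def build_log_record(document_id: str, predicted_type: str, confidence: float,
                     ocr_text: Optional[str] = None) -> Dict[str, Any]:
    """Assemble the fields saved for later evaluation."""
    record: Dict[str, Any] = {
        "document_id": document_id,
        "predicted_type": predicted_type,
        "confidence": confidence,
    }
    # Raw OCR text is included only when the caller passes it explicitly.
    if ocr_text is not None:
        record["ocr_text"] = ocr_text
    return record


def log_classification(document_id: str, predicted_type: str, confidence: float,
                       ocr_text: Optional[str] = None) -> str:
    """Emit one JSON line per classified document and return it."""
    line = json.dumps(
        build_log_record(document_id, predicted_type, confidence, ocr_text),
        sort_keys=True,
    )
    logger.info(line)
    return line
```

Making the OCR text opt-in reflects the "if needed" wording above: W-2 and 1099 forms carry taxpayer identification numbers, so the raw text is sensitive.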

Quality Assurance:

Periodic manual sampling of classified documents is performed to validate classification accuracy and confirm that the automation meets its performance standards.
If discrepancies exceed predefined thresholds, the models and/or rules are re-evaluated and, where needed, retrained or revised.
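
The sampling and threshold check above can be sketched as follows. The helper names and the 5% default threshold are assumptions; the story does not specify a threshold value.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_for_review(document_ids: Sequence[str], sample_size: int,
                      seed: Optional[int] = None) -> List[str]:
    """Pick a random subset of classified documents for manual review."""
    rng = random.Random(seed)
    k = min(sample_size, len(document_ids))
    return rng.sample(list(document_ids), k)


def discrepancy_rate(reviewed: Sequence[Tuple[str, str]]) -> float:
    """Share of (predicted_type, reviewer_type) pairs that disagree."""
    if not reviewed:
        return 0.0
    mismatches = sum(1 for predicted, actual in reviewed if predicted != actual)
    return mismatches / len(reviewed)


def needs_reevaluation(reviewed: Sequence[Tuple[str, str]],
                       threshold: float = 0.05) -> bool:
    """True when the discrepancy rate exceeds the (assumed) threshold."""
    return discrepancy_rate(reviewed) > threshold
```

For example, one mismatch in four reviewed documents gives a discrepancy rate of 0.25, which exceeds the assumed 0.05 threshold and would flag the models and rules for re-evaluation.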

samanthannoor moved this to Backlog in Document Extractor POC Feb 26, 2025

samanthannoor added this to Document Extractor POC Feb 26, 2025

jillcfoley1 changed the title ~~Identify Document Type Using AI (Optional – Back-End)~~ Identify Document Type Using AI (Back-End) Mar 3, 2025

jillcfoley1 added the dev label Mar 3, 2025

samanthannoor commented Feb 26, 2025 • edited by jillcfoley1
