A Streamlit application that combines Oracle Document Understanding (OCR) with Large Language Models to automatically extract all fields from documents without requiring predefined field specifications.
## Features

- Dynamic Field Extraction: Automatically identifies and extracts all key-value pairs from documents
- Multi-language Support: Handles documents in English, Korean, Japanese, Chinese, Arabic, German, and more
- Smart Line Item Grouping: Properly structures invoice line items in tables
- Document Classification: Automatically classifies documents (Invoice, Receipt, Contract, etc.)
- Confidence Scoring: Provides extraction confidence levels
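At a high level, the OCR-then-LLM extraction described above can be sketched as follows. This is a minimal illustration only: the prompt wording is an assumption, and no OCI calls are made here (the real app sends the OCR text to OCI Generative AI):

```python
import json

def build_extraction_prompt(ocr_text: str) -> str:
    """Ask the LLM for every key-value pair it can find,
    with no predefined field list (illustrative prompt, not the app's actual one)."""
    return (
        "Extract ALL key-value pairs from the following document text. "
        "Also classify the document type (Invoice, Receipt, Contract, ...) "
        "and report a confidence score between 0 and 1. "
        "Respond with JSON only, shaped like "
        '{"document_type": ..., "confidence": ..., "fields": {...}}.\n\n'
        f"Document text:\n{ocr_text}"
    )

def parse_llm_response(raw: str) -> dict:
    """Parse the LLM's JSON reply, tolerating surrounding prose."""
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start : end + 1])

# Example with a canned LLM reply (no OCI calls made here):
reply = '{"document_type": "Invoice", "confidence": 0.92, "fields": {"Total": "$120.00"}}'
result = parse_llm_response(reply)
print(result["document_type"])  # Invoice
```

Because the prompt asks for all key-value pairs rather than a fixed schema, the same flow works across document types and languages without per-document configuration.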
## Prerequisites

- Python 3.8+
- Oracle Cloud Infrastructure (OCI) account with Document Understanding service access
- OCI Generative AI service access
## Installation

1. Clone the repository

   ```bash
   git clone <repository-url>
   cd du-to-text-llm
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure OCI credentials

   - Set up your OCI config file at `~/.oci/config`
   - Update `config.py` with your:
     - COMPARTMENT_ID
     - OCI model IDs for LLMs
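The `config.py` referenced above might look like the following sketch. All values are placeholders, and any variable names beyond `COMPARTMENT_ID` are assumptions based on the description, not the project's actual settings:

```python
# config.py -- fill in with values from your OCI tenancy (placeholders shown)
COMPARTMENT_ID = "ocid1.compartment.oc1..aaaaexampleexample"

# OCI Generative AI model OCIDs for the supported LLM providers
# (variable names are illustrative)
LLAMA_MODEL_ID = "ocid1.generativeaimodel.oc1..aaaaexamplellama"
COHERE_MODEL_ID = "ocid1.generativeaimodel.oc1..aaaaexamplecohere"

# OCI region to call (adjust to your tenancy's region)
REGION = "us-chicago-1"
```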
## Usage

Run the Streamlit application:

```bash
streamlit run pages/run_DU_plus_LLM.py
```

Then:
- Upload a PDF or image document
- Select OCR language (Auto-detect recommended)
- Choose LLM provider (Meta Llama 3.3 or Cohere Command R)
- View extracted fields organized by category
- Optionally save results to JSON
## Output

Extracted data is saved to the `outputs/` directory as JSON files containing:
- Document classification
- All extracted key-value pairs
- Line items (for invoices/receipts)
- Confidence scores
- OCR raw data (optional)
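For illustration, a saved output file might look like the example below. All field names and values are invented; the actual JSON shape is determined by the app:

```json
{
  "document_type": "Invoice",
  "confidence": 0.92,
  "fields": {
    "Invoice Number": "INV-2024-001",
    "Invoice Date": "2024-05-01",
    "Total": "$1,250.00"
  },
  "line_items": [
    {"description": "Consulting services", "quantity": 10, "unit_price": "$125.00"}
  ]
}
```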
## Project Structure

```
├── pages/
│   └── 01_DU_plus_LLM_dynamic.py   # Main Streamlit app
├── config.py                       # Configuration settings
├── requirements.txt                # Python dependencies
├── .gitignore                      # Git ignore rules
└── README.md                       # This file
```
## License

Proprietary