brnil6/du-to-text-llm

Document Understanding + LLM Dynamic Extraction

A Streamlit application that combines Oracle Document Understanding (OCR) with Large Language Models to automatically extract all fields from documents without requiring predefined field specifications.

Features

  • Dynamic Field Extraction: Automatically identifies and extracts all key-value pairs from documents
  • Multi-language Support: Handles documents in English, Korean, Japanese, Chinese, Arabic, German, and more
  • Smart Line Item Grouping: Properly structures invoice line items in tables
  • Document Classification: Automatically classifies documents (Invoice, Receipt, Contract, etc.)
  • Confidence Scoring: Provides extraction confidence levels

Prerequisites

  • Python 3.8+
  • Oracle Cloud Infrastructure (OCI) account with Document Understanding service access
  • OCI Generative AI service access

Setup

  1. Clone the repository

    git clone <repository-url>
    cd du-to-text-llm
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure OCI credentials

    • Set up your OCI config file at ~/.oci/config
    • Update config.py with your:
      • COMPARTMENT_ID
      • OCI model IDs for LLMs
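
As a rough illustration of step 4, a `config.py` could look like the sketch below. The variable names other than `COMPARTMENT_ID` (`OCI_CONFIG_PATH`, `OCI_CONFIG_PROFILE`, `LLM_MODEL_IDS`, `missing_settings`) and the `OCI_COMPARTMENT_ID` environment variable are assumptions made for this example, and the angle-bracket values are placeholders to replace with your own OCIDs. The actual file in the repository may be organized differently.

```python
# config.py -- hypothetical layout; the real variable names may differ.
import os

# OCID of the compartment that holds your Document Understanding resources.
COMPARTMENT_ID = os.environ.get(
    "OCI_COMPARTMENT_ID", "ocid1.compartment.oc1..<your-compartment-ocid>"
)

# Location and profile of the standard OCI config file.
OCI_CONFIG_PATH = os.path.expanduser("~/.oci/config")
OCI_CONFIG_PROFILE = "DEFAULT"

# Model identifiers for the two LLM providers offered in the UI (placeholders).
LLM_MODEL_IDS = {
    "meta_llama": "<your-llama-model-ocid>",
    "cohere_command_r": "<your-cohere-model-ocid>",
}


def missing_settings():
    """Return the names of settings that still hold a placeholder value."""
    values = {"COMPARTMENT_ID": COMPARTMENT_ID}
    for key, model_id in LLM_MODEL_IDS.items():
        values[f"LLM_MODEL_IDS[{key}]"] = model_id
    return [name for name, value in values.items() if "<" in value]
```

Calling `missing_settings()` before starting the app is one way to catch a placeholder that was never filled in.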

Usage

Run the Streamlit application:

streamlit run pages/01_DU_plus_LLM_dynamic.py

Then:

  1. Upload a PDF or image document
  2. Select OCR language (Auto-detect recommended)
  3. Choose LLM provider (Meta Llama 3.3 or Cohere Command R)
  4. View extracted fields organized by category
  5. Optionally save results to JSON
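
Between the OCR step and the field display, the app has to ask the LLM for fields without naming them in advance and then read back a structured reply. The sketch below shows one way that could work using only the standard library. The function names, the prompt wording, and the JSON keys (`document_type`, `fields`, `line_items`, `confidence`) are assumptions for illustration, not the repository's actual implementation.

```python
import json
import re


def build_extraction_prompt(ocr_text):
    """Ask the model for every field it can find, not a predefined list."""
    return (
        "Extract every key-value pair from the document text below.\n"
        "Reply with a single JSON object with the keys "
        '"document_type", "fields", "line_items" and "confidence".\n\n'
        "Document text:\n" + ocr_text
    )


def parse_llm_response(raw):
    """Parse the JSON object in a model reply, tolerating surrounding prose."""
    # Take the span from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))
    # Fill in missing sections so display code can rely on them.
    data.setdefault("document_type", "Unknown")
    data.setdefault("fields", {})
    data.setdefault("line_items", [])
    return data
```

Because the prompt does not list field names, the same code path can serve invoices, receipts, and contracts; the trade-off is that the reply must be validated, which is why the parser supplies defaults for missing sections.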

Output

Extracted data is saved to the outputs/ directory as JSON files containing:

  • Document classification
  • All extracted key-value pairs
  • Line items (for invoices/receipts)
  • Confidence scores
  • OCR raw data (optional)
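
To make the shape of a saved file more concrete, here is a minimal sketch of writing one result to `outputs/`. The key names (`document_type`, `fields`, `line_items`, `confidence`, `ocr_raw`), the `save_result` helper, and the file-naming scheme are assumptions for this example; the files the app actually produces may use a different layout.

```python
import json
from datetime import datetime
from pathlib import Path


def save_result(result, source_name, output_dir="outputs", include_raw_ocr=False):
    """Write one extraction result to <output_dir>/<stem>_<timestamp>.json."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = dict(result)
    if not include_raw_ocr:
        payload.pop("ocr_raw", None)  # raw OCR data is optional in the saved file
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = out_dir / f"{Path(source_name).stem}_{stamp}.json"
    # ensure_ascii=False keeps Korean, Japanese, Arabic, etc. readable in the file.
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


# Hypothetical result for a small invoice.
example = {
    "document_type": "Invoice",
    "fields": {"Invoice Number": "INV-001", "Total": "120.00"},
    "line_items": [{"description": "Widget", "quantity": "2", "amount": "120.00"}],
    "confidence": "high",
    "ocr_raw": {"pages": []},
}
```

Calling `save_result(example, "scan.pdf")` would create the `outputs/` directory if needed and return the path of the new JSON file.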

Project Structure

├── pages/
│   └── 01_DU_plus_LLM_dynamic.py  # Main Streamlit app
├── config.py                       # Configuration settings
├── requirements.txt                # Python dependencies
├── .gitignore                      # Git ignore rules
└── README.md                       # This file

License

Proprietary

About

Streamlit app that runs documents through OCI Document Understanding and passes the extracted text to text-based LLMs
