Fast, accurate document translation that preserves formatting, structure, and technical terms using OpenAI GPT models. Ideal for educational content and technical documents.

Prasaderp/AI-Document-Translator

AI Document Translator

Intelligent document translation powered by OpenAI that preserves formatting, structure, and context

Key Features | Demo | Quick Start | Usage | How It Works | API

Python FastAPI OpenAI spaCy


Overview

AI Document Translator is a production-ready translation service designed for educational content and technical documents. Built with FastAPI and powered by OpenAI's GPT models, it goes beyond basic translation by preserving document structure, mathematical formulas, and technical terminology while delivering natural, contextually-aware translations.


Key Features

📄 Structure Preservation

Maintains the original document formatting, including headings, paragraphs, tables, lists, and embedded images.

🧠 Smart Entity Recognition

Uses advanced NLP (spaCy) to automatically identify and preserve proper nouns, technical terms, and brand names.

🔢 Mathematical Integrity

Keeps all mathematical formulas, expressions, and numerical data completely unchanged during translation.

✅ Quality Assurance

Built-in quality scoring (0-40 scale) that checks each translation for accuracy and naturalness and automatically retries segments scoring below 30, up to 3 attempts per segment.

⚡ Real-time Progress

WebSocket-based live progress tracking with visual feedback showing translation status and quality metrics.

🚀 High Performance

Efficient async architecture with concurrent processing, rate limiting, and automatic file cleanup.


Demo

Before & After Translation

Original Document (English)

Mathematical questions with formulas and structure

Translated Document (Hindi)

Perfect structure and formula preservation

Notice how all formulas, numbers, and structural elements remain perfectly intact while the natural language is translated seamlessly.




Technology Stack

Backend Architecture

FastAPI (Async Web Framework)
├── OpenAI GPT-4o (Quality Assessment)
├── OpenAI GPT-4o-mini (Translation)
├── spaCy (Entity Recognition)
├── python-docx (Document Processing)
└── WebSocket (Real-time Communication)

Key Technologies:

  • FastAPI: High-performance async REST API
  • OpenAI API: State-of-the-art language models
  • spaCy: Industrial-strength NLP
  • python-docx: DOCX manipulation

Frontend Architecture

Modern Web Stack
├── Vanilla JavaScript (No Dependencies)
├── WebSocket Client (Live Updates)
├── CSS3 (Responsive Design)
└── Drag-and-Drop API

Features:

  • Zero framework overhead
  • Real-time progress updates
  • Responsive design
  • Intuitive drag-and-drop interface

Multi-Layer Processing Pipeline

graph LR
    A[Document Upload] --> B[Structure Analysis]
    B --> C[Entity Detection]
    C --> D[Text Masking]
    D --> E[Batch Translation]
    E --> F[Quality Validation]
    F --> G{Quality OK?}
    G -->|Yes| H[Reconstruct Document]
    G -->|No| E
    H --> I[Download]

Detailed Architecture Layers

Layer | Responsibility | Technology
Document Processing | Extract text while preserving metadata | python-docx, lxml
NLP Layer | Identify and mask protected entities | spaCy en_core_web_sm
Translation Layer | Async batch processing with rate limiting | OpenAI API, asyncio
Quality Assessment | Validate translation accuracy | GPT-4o scoring
Reconstruction | Apply translations with original formatting | python-docx styles



Quick Start

Prerequisites

✓ Python 3.8 or higher
✓ OpenAI API key (Get yours at platform.openai.com)
✓ Modern web browser (Chrome, Firefox, Safari, Edge)

Installation

# 1. Clone the repository
git clone https://github.com/Prasaderp/AI-Document-Translator.git
cd AI-Document-Translator

# 2. Create virtual environment
python -m venv venv

# 3. Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

# 4. Install dependencies
pip install -r requirements.txt

# 5. Download spaCy model
python -m spacy download en_core_web_sm

# 6. Start the server
uvicorn fastapi_app:app --reload

First Translation

  1. Open http://localhost:8000 in your browser
  2. Enter your OpenAI API key and click Set
  3. Drag and drop a .docx file or click to upload
  4. Select target language (Hindi, Tamil, or Telugu)
  5. Click Translate and watch the real-time progress
  6. Download your translated document



Usage

Basic Translation Workflow

  1. Upload Document: drop a .docx file or browse for one
  2. Select Language: choose the target language
  3. Configure Options: add terms to preserve
  4. Translate & Download: get your translated file

Advanced: Preserving Custom Terms

Specify technical terms, brand names, or acronyms that should remain untranslated:

AcmeCorp, ProjectX, API, OAuth, MongoDB, React, Kubernetes

Supports:

  • Comma-separated lists
  • Multi-line entries
  • Mixed case sensitivity
  • Automatic boundary detection
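
The list handling described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual parser: the function names `parse_retain_terms` and `build_term_pattern` are hypothetical, and "mixed case sensitivity" is assumed here to mean case-insensitive matching.

```python
import re
from typing import List


def parse_retain_terms(raw: str) -> List[str]:
    """Split a comma- or newline-separated string into unique, trimmed terms."""
    seen = set()
    terms = []
    for piece in re.split(r"[,\n]", raw):
        term = piece.strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


def build_term_pattern(terms: List[str]) -> "re.Pattern":
    """Compile one regex that matches any term as a whole word."""
    if not terms:
        raise ValueError("at least one term is required")
    # Longest terms first so a longer term wins over a shorter prefix of it.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # The lookarounds approximate boundary detection: a match must not be
    # glued to a neighbouring word character (so "API" does not hit "RAPID").
    # IGNORECASE is an assumption about how case is meant to be handled.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)
```

For example, `parse_retain_terms("AcmeCorp, ProjectX,\nAPI")` yields three terms, and the compiled pattern finds `OAuth` in "uses OAuth," but not in "OAuth2".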

Supported Languages

Language | Code | Status
Hindi | hi | ✅ Active
Tamil | ta | ✅ Active
Telugu | te | ✅ Active

Add More Languages: Simply modify the dropdown in web/index.html




How It Works

Translation Pipeline

┌─────────────────────────────────────────────────────────────────┐
│  Step 1: Document Analysis                                      │
│  ├─ Extract paragraphs (body, headers, footers, tables)         │
│  └─ Identify translatable segments                              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  Step 2: Entity Masking                                         │
│  ├─ Detect proper nouns with spaCy                              │
│  ├─ Mask user-specified terms                                   │
│  └─ Create token map (e.g., <<UT0>>, <<NE1>>)                   │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  Step 3: Concurrent Translation                                 │
│  ├─ Batch process with semaphore (10 concurrent)                │
│  ├─ Rate limiting (950 RPM)                                     │
│  └─ Real-time progress updates via WebSocket                    │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  Step 4: Quality Validation                                     │
│  ├─ Score each translation (0-40 scale)                         │
│  ├─ Automatic retry if below threshold (30)                     │
│  └─ Maximum 3 attempts per segment                              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  Step 5: Document Reconstruction                                │
│  ├─ Unmask protected terms                                      │
│  ├─ Apply translations preserving formatting                    │
│  └─ Generate output DOCX file                                   │
└─────────────────────────────────────────────────────────────────┘
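
Steps 2 and 5 (masking and unmasking) can be illustrated with a minimal sketch. It assumes the named entities have already been detected (for instance by spaCy) and are passed in as plain strings; `mask_terms` and `unmask_terms` are hypothetical names, and the real implementation in translator.py may differ. Only the `<<UT0>>` / `<<NE1>>` token shape is taken from the pipeline above.

```python
import re
from typing import Dict, List, Tuple


def mask_terms(
    text: str, user_terms: List[str], named_entities: List[str]
) -> Tuple[str, Dict[str, str]]:
    """Replace protected terms with placeholder tokens and return the token map."""
    token_map: Dict[str, str] = {}
    counter = 0  # shared counter, matching the <<UT0>>, <<NE1>> example
    for prefix, terms in (("UT", user_terms), ("NE", named_entities)):
        # Longest first (ties broken alphabetically) so longer terms are masked
        # before any shorter term they contain.
        for term in sorted(set(terms), key=lambda t: (-len(t), t)):
            pattern = re.compile(rf"(?<!\w){re.escape(term)}(?!\w)")
            if not pattern.search(text):
                continue
            token = f"<<{prefix}{counter}>>"
            counter += 1
            token_map[token] = term
            text = pattern.sub(lambda _match, tok=token: tok, text)
    return text, token_map


def unmask_terms(text: str, token_map: Dict[str, str]) -> str:
    """Restore the original terms after translation."""
    for token, term in token_map.items():
        text = text.replace(token, term)
    return text
```

The masked text is what gets sent for translation; because the tokens are not natural-language words, the model is expected to pass them through unchanged, and `unmask_terms` restores the originals afterwards.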

Quality Scoring Breakdown

Each translation is evaluated across four dimensions:

Dimension | Score range | Evaluation criteria
Accuracy | 0-10 | Preserves original meaning, numbers, formulas
Clarity | 0-10 | Easy to understand for target audience
Naturalness | 0-10 | Sounds natural in target language
Educational Fit | 0-10 | Appropriate for students/learners

Total Score: 0-40 points | Threshold: 30 points | Auto-retry if below threshold
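
The scoring and retry rule can be sketched as follows. This is a hedged illustration rather than the project's code: `QualityScore` and `translate_with_retries` are hypothetical names, the `translate` and `score` callables stand in for the OpenAI calls, and keeping the best-scoring attempt when none reaches the threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

QUALITY_THRESHOLD = 30  # minimum acceptable total, from the section above
MAX_ATTEMPTS = 3        # maximum attempts per segment


@dataclass
class QualityScore:
    accuracy: int         # each dimension is scored 0-10
    clarity: int
    naturalness: int
    educational_fit: int

    @property
    def total(self) -> int:
        return self.accuracy + self.clarity + self.naturalness + self.educational_fit


def translate_with_retries(
    segment: str,
    translate: Callable[[str], str],
    score: Callable[[str, str], QualityScore],
) -> Tuple[str, QualityScore]:
    """Translate a segment, retrying while the total score is below threshold."""
    best: Optional[Tuple[str, QualityScore]] = None
    for _ in range(MAX_ATTEMPTS):
        candidate = translate(segment)
        quality = score(segment, candidate)
        if best is None or quality.total > best[1].total:
            best = (candidate, quality)
        if quality.total >= QUALITY_THRESHOLD:
            break
    assert best is not None  # MAX_ATTEMPTS >= 1 guarantees one attempt
    return best
```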




API

REST Endpoints

Translation Job

POST /api/translate
Content-Type: multipart/form-data

Parameters:
  file            : File      (required) - DOCX file to translate
  target_language : string    (required) - Target language name
  retain_terms    : string    (optional) - Comma-separated terms
  api_key         : string    (required) - OpenAI API key

Response:
  {
    "job_id": "550e8400-e29b-41d4-a716-446655440000"
  }
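
A client could assemble this request using only the Python standard library. The sketch below is an assumption-laden example: `build_translate_request` is a hypothetical helper, the base URL is the local development address from Quick Start, and only the field names come from the parameter list above.

```python
import urllib.request
import uuid
from typing import Optional

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_translate_request(
    base_url: str,
    docx_bytes: bytes,
    filename: str,
    target_language: str,
    api_key: str,
    retain_terms: Optional[str] = None,
) -> urllib.request.Request:
    """Build a multipart/form-data POST for /api/translate without sending it."""
    boundary = uuid.uuid4().hex
    fields = {"target_language": target_language, "api_key": api_key}
    if retain_terms:
        fields["retain_terms"] = retain_terms

    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    # The file part carries the raw DOCX bytes.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: {DOCX_MIME}\r\n\r\n'.encode("utf-8")
    )
    parts.append(docx_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{base_url}/api/translate",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the result to `urllib.request.urlopen` and decoding the JSON body should give the `job_id` shown in the response above.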

Job Status

GET /api/status/{job_id}

Response:
  {
    "job_id": "uuid",
    "status": "running",
    "progress": 65.5,
    "avg_quality": 35.2,
    "elapsed_seconds": 45.3
  }
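
When a WebSocket connection is not available, the status endpoint can be polled instead. The following is a minimal sketch, assuming that any `status` value other than `"running"` is terminal; the README only documents `"running"`, so that rule and the helper names `fetch_status` and `wait_for_job` are assumptions.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict


def fetch_status(base_url: str, job_id: str) -> Dict[str, Any]:
    """GET /api/status/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/status/{job_id}") as response:
        return json.load(response)


def wait_for_job(
    job_id: str,
    get_status: Callable[[str], Dict[str, Any]],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job leaves the "running" state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(job_id)
        # Assumption: anything other than "running" means the job has finished.
        if status["status"] != "running":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout_seconds}s")
        sleep(interval_seconds)
```

A typical call would be `wait_for_job(job_id, lambda j: fetch_status("http://localhost:8000", j))`.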

Download Result

GET /api/download/{job_id}

Response:
  Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
  Content-Disposition: attachment; filename="translated_document.docx"

WebSocket Endpoints

Real-time Progress

ws://localhost:8000/ws/progress/{job_id}

// Message Types
{
  type: "progress",
  progress: 45.5,
  avg_quality: 34.2,
  elapsed_seconds: 12.3
}

{
  type: "completed",
  progress: 100,
  avg_quality: 36.8,
  elapsed_seconds: 67.4,
  download_url: "/api/download/{job_id}"
}

{
  type: "error",
  message: "Translation failed: API quota exceeded"
}
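
A client might dispatch on the `type` field of these messages as sketched below. The example assumes each message arrives as a JSON text frame with the fields shown above; `describe_progress_message` is a hypothetical helper, and the WebSocket connection itself is left out so the snippet stays dependency-free.

```python
import json


def describe_progress_message(raw: str) -> str:
    """Turn one progress-socket message into a human-readable status line."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "progress":
        return (
            f"{message['progress']:.1f}% done, "
            f"average quality {message['avg_quality']:.1f}/40, "
            f"{message['elapsed_seconds']:.1f}s elapsed"
        )
    if kind == "completed":
        return (
            f"Finished in {message['elapsed_seconds']:.1f}s "
            f"(average quality {message['avg_quality']:.1f}/40); "
            f"download from {message['download_url']}"
        )
    if kind == "error":
        return f"Failed: {message['message']}"
    # Unknown types are reported rather than raised, so newer servers do not
    # break older clients.
    return f"Ignoring unknown message type: {kind!r}"
```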



Configuration

Environment Variables

Create a .env file (optional):

OPENAI_API_KEY=sk-proj-...

Users can also provide API keys directly through the web interface.

Performance Tuning

⚙️ Customize Performance Parameters

In fastapi_app.py:

# File cleanup settings
ttl_seconds = 12 * 3600      # File retention: 12 hours
interval_seconds = 1800       # Cleanup interval: 30 minutes
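
To show how these two settings might interact, here is a small sketch of a TTL-based sweep. It assumes one directory per job under a common root and uses the directory's modification time as its age; both are assumptions, and `remove_expired_jobs` is a hypothetical name rather than a function from fastapi_app.py.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional

TTL_SECONDS = 12 * 3600  # mirrors ttl_seconds above


def remove_expired_jobs(
    jobs_root: Path, ttl_seconds: int = TTL_SECONDS, now: Optional[float] = None
) -> List[str]:
    """Delete job directories older than ttl_seconds and return their names."""
    now = time.time() if now is None else now
    removed = []
    # Materialise the listing first so deletion does not disturb iteration.
    for job_dir in list(jobs_root.iterdir()):
        if job_dir.is_dir() and now - job_dir.stat().st_mtime > ttl_seconds:
            shutil.rmtree(job_dir, ignore_errors=True)
            removed.append(job_dir.name)
    return sorted(removed)
```

A background task would then call this function once every `interval_seconds`.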

In translator.py:

quality_threshold = 30        # Minimum quality score (0-40)
max_retries = 3              # Translation retry attempts
concurrency_limit = 10       # Parallel API requests
rpm_limit = 950              # Requests per minute
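
The concurrency and rate settings can be combined as in the sketch below. It is one possible arrangement, not necessarily what translator.py does: a semaphore caps in-flight requests, and a simple fixed-interval pacer spaces request starts at least 60 / rpm seconds apart. `RateLimiter`, `translate_all`, and the `translate_one` callable are hypothetical names.

```python
import asyncio
import time
from typing import Awaitable, Callable, List

CONCURRENCY_LIMIT = 10  # mirrors concurrency_limit above
RPM_LIMIT = 950         # mirrors rpm_limit above


class RateLimiter:
    """Fixed-interval pacer: request starts are at least 60 / rpm seconds apart."""

    def __init__(self, rpm: int) -> None:
        self._interval = 60.0 / rpm
        self._next_slot = 0.0

    async def wait(self) -> None:
        # No await happens between reading and updating _next_slot, so this
        # reservation is atomic within a single event loop.
        now = time.monotonic()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        await asyncio.sleep(slot - now)


async def translate_all(
    segments: List[str],
    translate_one: Callable[[str], Awaitable[str]],
    concurrency: int = CONCURRENCY_LIMIT,
    rpm: int = RPM_LIMIT,
) -> List[str]:
    """Translate all segments concurrently, returning results in input order."""
    semaphore = asyncio.Semaphore(concurrency)
    limiter = RateLimiter(rpm)

    async def worker(segment: str) -> str:
        async with semaphore:      # at most `concurrency` requests in flight
            await limiter.wait()   # and no more than `rpm` starts per minute
            return await translate_one(segment)

    return list(await asyncio.gather(*(worker(s) for s in segments)))
```

Lowering `concurrency` or `rpm` trades throughput for headroom against the OpenAI rate limits of a given account.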



Performance Metrics

  • 📊 1000+ paragraphs: efficient handling
  • ⚡ 50-100 segments/min: translation speed
  • 🔄 10 concurrent requests: parallel processing
  • 🕐 12h auto-cleanup: file retention

Security & Privacy

🔒 Security Features

  • API keys never stored on server
  • Client-side API key validation
  • Isolated job processing directories
  • Automatic file cleanup after 12 hours

🛡️ Privacy Guarantees

  • No data persistence beyond the 12-hour retention window
  • Files deleted on cancellation
  • No API key logging
  • Temporary processing only



Troubleshooting

❌ Translation fails with API error

Solutions:

  • Verify your OpenAI API key is valid
  • Check API quota and billing status at platform.openai.com
  • Ensure stable internet connection
  • Try again with a smaller document

❌ spaCy model not found

Solutions:

python -m spacy download en_core_web_sm

Then restart the server:

uvicorn fastapi_app:app --reload

❌ Document formatting issues

Solutions:

  • Ensure input is valid .docx format (not .doc)
  • Complex formatting may require manual adjustment
  • Images and diagrams are preserved but not translated
  • Check if the document uses unsupported features



Contributing

Contributions are welcome! Whether you're fixing bugs, adding features, or improving documentation, your help makes this project better.

How to Contribute:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

Built with these amazing technologies:

  • OpenAI - Powerful language models (GPT-4o, GPT-4o-mini)
  • FastAPI - Modern async web framework
  • spaCy - Industrial-strength NLP
  • python-docx - Document manipulation


Made with care for educational content creators and translators worldwide

Report Bug | Request Feature | Documentation


If you find this project helpful, please consider giving it a ⭐
