Transform PDF, DOCX, and TXT documents into professional audiobooks with AI-powered text enhancement and high-quality text-to-speech.
- 📄 Multi-format Support: PDF, DOCX, TXT file processing
- 🤖 AI Enhancement: Gemini API transforms text into engaging audiobook narration
- 🎙️ Premium TTS: Edge TTS with multiple voice styles (storytelling, authoritative, conversational)
- 🔍 Smart Q&A: RAG-powered document search and question answering
- 🌐 Modern UI: React frontend with real-time progress tracking
- ⚡ Fast Processing: Optimized pipeline with caching and batch processing
- Python 3.8+
- Node.js 16+
- Gemini API key
- Clone the repository
  git clone https://github.com/AabidMK/Audiobook_generator_-Infosys_Internship_Aug2025.git
  cd Audiobook_generator_-Infosys_Internship_Aug2025
- Install Python dependencies
  pip install -r requirements.txt
- Setup environment
  # Create .env file
  echo "GEMINI_API_KEY=your_api_key_here" > .env
- Install frontend dependencies
  cd frontend
  npm install
  cd ..
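Before starting the backend, it can help to confirm that the key written to `.env` is actually readable. The following is a minimal sketch using only the standard library; `load_env_file` and `require_gemini_key` are hypothetical helper names, not functions from this repository, and the parser handles only plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Blank lines and lines starting with '#' are skipped. Inline comments
    and multi-line values are not handled (simplified on purpose).
    """
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_gemini_key(path=".env"):
    """Return the Gemini API key, preferring the process environment."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key or key == "your_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing or still set to the placeholder")
    return key
```

Calling `require_gemini_key()` from the project root raises a clear error if the placeholder value was never replaced.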
Option 1: Full Stack (Recommended)
run_app.bat # Windows
# or
python run_localhost.py # Cross-platform
Option 2: Backend Only
python start_api.py
Option 3: Manual Setup
# Terminal 1 - Backend
python start_api.py
# Terminal 2 - Frontend
cd frontend
npm run dev
- Upload Document: Drag & drop or select PDF/DOCX/TXT files
- Generate Audiobook: Choose voice style and click generate
- Download: Get enhanced text and high-quality audio files
- Ask Questions: Use RAG system to query document content
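Question answering over a document usually starts by splitting the text into overlapping chunks that can be embedded and searched. The sketch below shows one common way to do that; it assumes character-based chunking with illustrative sizes, and `chunk_text` is a hypothetical name. The repository's `rag.py` may chunk differently.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears intact in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk would then be embedded and stored in the vector database so that a question can be matched against the most relevant passages.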
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    React UI     │────│     FastAPI      │────│   Gemini API    │
│   (Frontend)    │    │    (Backend)     │    │  (Enhancement)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                   ┌────────────┼────────────┐
                   │            │            │
             ┌─────▼────┐ ┌─────▼────┐ ┌─────▼────┐
             │ Edge TTS │ │ ChromaDB │ │   Text   │
             │ (Audio)  │ │  (RAG)   │ │Extraction│
             └──────────┘ └──────────┘ └──────────┘
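The diagram can be read as a pipeline coordinated by the backend. The sketch below wires the four components together as injected callables; the function and class names are hypothetical, and the order of the steps (in particular, indexing the raw text rather than the enhanced text) is an assumption rather than something the diagram specifies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AudiobookResult:
    enhanced_text: str
    audio_path: str


def build_audiobook(
    document_path: str,
    extract: Callable[[str], str],
    enhance: Callable[[str], str],
    synthesize: Callable[[str, str], str],
    index: Callable[[str], None],
    voice_style: str = "storytelling",
) -> AudiobookResult:
    """Run the document through each box in the architecture diagram."""
    raw_text = extract(document_path)               # Text Extraction
    index(raw_text)                                 # ChromaDB (RAG); assumed to index raw text
    enhanced = enhance(raw_text)                    # Gemini API (Enhancement)
    audio_path = synthesize(enhanced, voice_style)  # Edge TTS (Audio)
    return AudiobookResult(enhanced_text=enhanced, audio_path=audio_path)
```

Passing the components in as functions keeps each one replaceable, which makes it straightforward to test the flow with stubs before any API key or TTS service is available.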
- Backend: FastAPI, Python 3.8+
- Frontend: React, Vite, Axios
- AI/ML: Google Gemini API, ChromaDB, Sentence Transformers
- TTS: Microsoft Edge TTS
- Text Processing: PyMuPDF, python-docx, BeautifulSoup
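To illustrate how the text processing libraries listed above fit together, here is a hedged sketch that picks an extractor by file extension. It is not the code in `enhanced_extraction.py`; the function names are hypothetical, and PyMuPDF and python-docx are imported lazily so that plain TXT files work without them installed.

```python
from pathlib import Path


def extract_pdf(path):
    import fitz  # PyMuPDF, imported lazily

    doc = fitz.open(path)
    try:
        # Concatenate the plain text of every page.
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


def extract_docx(path):
    from docx import Document  # python-docx, imported lazily

    return "\n".join(p.text for p in Document(path).paragraphs)


def extract_txt(path):
    return Path(path).read_text(encoding="utf-8", errors="replace")


EXTRACTORS = {".pdf": extract_pdf, ".docx": extract_docx, ".txt": extract_txt}


def extract_text(path):
    """Dispatch to the extractor that matches the file extension."""
    suffix = Path(path).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return extractor(str(path))
```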
├── main.py # FastAPI backend server
├── audiobook_generator.py # Core audiobook logic
├── enhanced_extraction.py # Text extraction
├── rag.py # RAG system
├── frontend/ # React application
├── requirements.txt # Python dependencies
└── README.md # This file
GEMINI_API_KEY=your_gemini_api_key
LM_STUDIO_BASE_URL=http://localhost:1234 # Optional
- storytelling: Warm, expressive (default)
- authoritative: Deep, confident
- conversational: Natural, friendly
- narrative: Smooth, professional
- dramatic: Dynamic, emotional
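One way to implement named styles like these is a lookup table that maps each style to a voice and prosody settings. The sketch below is illustrative only: the voice names, rate, and pitch values are assumptions chosen for the example, not the mapping this project uses, and `resolve_voice_style` is a hypothetical helper.

```python
# Illustrative mapping only; the project's real voices and settings may differ.
VOICE_STYLES = {
    "storytelling": {"voice": "en-US-AriaNeural", "rate": "-5%", "pitch": "+0Hz"},
    "authoritative": {"voice": "en-US-GuyNeural", "rate": "-10%", "pitch": "-5Hz"},
    "conversational": {"voice": "en-US-JennyNeural", "rate": "+0%", "pitch": "+0Hz"},
    "narrative": {"voice": "en-GB-RyanNeural", "rate": "-5%", "pitch": "+0Hz"},
    "dramatic": {"voice": "en-US-ChristopherNeural", "rate": "+5%", "pitch": "+5Hz"},
}

DEFAULT_STYLE = "storytelling"


def resolve_voice_style(style=None):
    """Return a copy of the settings for a style, falling back to the default."""
    name = (style or DEFAULT_STYLE).strip().lower()
    if name not in VOICE_STYLES:
        raise ValueError(
            f"Unknown voice style {name!r}; choose one of {sorted(VOICE_STYLES)}"
        )
    return dict(VOICE_STYLES[name])
```

Returning a copy keeps callers from accidentally mutating the shared table when they tweak a setting for a single request.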
# Check ports
netstat -ano | findstr :8000
# Kill processes
taskkill /F /PID <PID>
- Indexing failed: Check file permissions and format
- Audio generation failed: Verify Edge TTS installation
- API errors: Validate Gemini API key
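The `netstat` and `taskkill` commands above are Windows-specific. As a cross-platform alternative for the port check, the following sketch uses Python's standard `socket` module; `is_port_in_use` is a hypothetical helper, and port 8000 is taken from the command above.

```python
import socket


def is_port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    state = "in use" if is_port_in_use(8000) else "free"
    print(f"Port 8000 is {state}")
```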
- Text Processing: ~2-5 seconds per page
- AI Enhancement: ~10-30 seconds per chunk
- Audio Generation: ~1-2x real-time speed
- Supported File Size: Up to 50MB documents
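The per-page and per-chunk figures above can be combined into a rough time estimate. This sketch covers only text processing and AI enhancement; audio generation is left out because its figure is relative to audio length rather than document size. The function name and the example page and chunk counts are hypothetical.

```python
def estimate_processing_seconds(pages, chunks):
    """Return a (low, high) estimate in seconds from the figures listed above.

    Text processing: 2-5 seconds per page.
    AI enhancement: 10-30 seconds per chunk.
    """
    if pages < 0 or chunks < 0:
        raise ValueError("pages and chunks must be non-negative")
    low = pages * 2 + chunks * 10
    high = pages * 5 + chunks * 30
    return low, high


low, high = estimate_processing_seconds(pages=20, chunks=8)
print(f"Estimated time before audio generation: {low}-{high} seconds")
```

For a 20-page document split into 8 chunks, this gives 120 to 340 seconds, or roughly 2 to 6 minutes before audio generation begins.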
- Fork the repository
- Create feature branch (git checkout -b feature/amazing-feature)
- Commit changes (git commit -m 'Add amazing feature')
- Push to branch (git push origin feature/amazing-feature)
- Open Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Infosys Internship Program - Project opportunity
- Google Gemini API - AI text enhancement
- Microsoft Edge TTS - High-quality speech synthesis
- ChromaDB - Vector database for RAG
Made with ❤️ during Infosys Internship 2025