Extracts tables from PDF documents and converts them to CSV. Upload a PDF, and the app runs DETR table detection followed by PaddleOCR to extract table contents.
```
Browser → React (Vite) → Flask API → Celery worker → Redis
                                          │
                              PyMuPDF → DETR → PaddleOCR → CSV
```
- Frontend: React 19 + TypeScript + Tailwind CSS, served via Vite (dev) or Nginx (prod)
- Backend: Flask 3 handles uploads and status polling; Celery processes PDFs asynchronously
- Models: `TahaDouaji/detr-doc-table-detection` for table detection, PaddleOCR for text extraction
- Broker: Redis (job queue + result backend)
- Python ≥ 3.12 with Poetry
- Node.js ≥ 20 with npm
- Redis (`redis-server` or `brew install redis`)
```bash
make install
redis-server
make run-all
```
This starts Flask (port 5000), the Celery worker, and the Vite dev server (port 3000) in the background. Open http://localhost:3000.
Or start each service individually in separate terminals:
```bash
make run-backend    # Flask on :5000
make run-worker     # Celery worker
make run-frontend   # Vite dev server on :3000
```
Note: The Celery worker downloads PaddleOCR and DETR model weights on first run (~1–2 GB). This takes a few minutes but only happens once.
- Docker with the Compose plugin
```bash
docker compose up --build
```
Open http://localhost. The backend healthcheck waits for models to load before the frontend comes up — first start takes ~60–90s.
To rebuild only the worker (e.g. after changing tasks.py):
```bash
docker compose up --build worker
```
To stop and remove volumes:
```bash
docker compose down -v
```
The deploy/ directory contains two scripts for deploying to a Linux server over SSH.
Installs Docker CE and configures the firewall (ports 22, 80, 443) on a fresh Ubuntu 22.04/24.04 server:
```bash
./deploy/provision.sh <host> [user] [ssh-key]

# Examples:
./deploy/provision.sh 192.168.1.100
./deploy/provision.sh 192.168.1.100 ubuntu ~/.ssh/my_key.pem
```
Syncs code via rsync and restarts containers:
```bash
./deploy/deploy.sh <host> [user] [ssh-key]

# Examples:
./deploy/deploy.sh 192.168.1.100
./deploy/deploy.sh 192.168.1.100 ubuntu ~/.ssh/my_key.pem
```
Safe to run repeatedly — only changed files are transferred. The app will be live at `http://<host>` when the script completes.
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/upload` | Upload a PDF. Returns `{job_id}`. |
| `GET` | `/status/<job_id>` | Poll job status. Returns state + step label, or result on completion. |
| `GET` | `/download/<filename>` | Download the output CSV. |
| `GET` | `/health` | Health check. |
| State | Meaning |
|---|---|
| `pending` | Job queued, not yet started |
| `progress` | Processing — `step` field describes current stage |
| `success` | Done — `filename` and `tables_found` fields present |
| `failure` | Error — `error` field contains the message |
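A polling loop over these states might look like the following sketch (field names are taken from the table above; the exact JSON shape is otherwise assumed):

```python
import time
import requests

def wait_for_result(base: str, job_id: str, poll_secs: float = 2.0) -> dict:
    """Poll /status/<job_id> until the job reaches success or failure."""
    while True:
        data = requests.get(f"{base}/status/{job_id}").json()
        state = data["state"]
        if state == "success":
            return data            # carries filename and tables_found
        if state == "failure":
            raise RuntimeError(data.get("error", "unknown error"))
        if state == "progress":
            print("step:", data.get("step"))
        # "pending": queued, nothing to report yet
        time.sleep(poll_secs)
```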
- PDF → images — PyMuPDF at 300 DPI
- Preprocessing — denoising, CLAHE contrast enhancement, unsharp masking, adaptive binarization
- Table detection — DETR (`TahaDouaji/detr-doc-table-detection`) finds and crops table regions
- OCR — PaddleOCR runs on each cropped table
- CSV assembly — OCR results grouped into rows by y-coordinate and written to CSV
Both ML models load once per worker process at startup.
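The row-grouping step can be sketched as follows; the `(x, y, text)` box format and the pixel tolerance are simplifying assumptions:

```python
import csv

def boxes_to_rows(boxes, y_tol=10):
    """Group OCR boxes (x, y, text) into rows of text, top to bottom, left to right."""
    rows = []  # each entry: (anchor_y, [(x, text), ...])
    for x, y, text in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(y - rows[-1][0]) <= y_tol:
            rows[-1][1].append((x, text))   # same row: y within tolerance
        else:
            rows.append((y, [(x, text)]))   # start a new row
    return [[t for _, t in sorted(cells)] for _, cells in rows]

def write_csv(boxes, path):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(boxes_to_rows(boxes))
```

For example, `boxes_to_rows([(120, 52, "B1"), (10, 50, "A1"), (10, 90, "A2")])` returns `[["A1", "B1"], ["A2"]]`: the first two boxes share a row because their y-coordinates are within tolerance.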
```bash
cd backend
poetry run python src/main.py                                          # Flask on :5000
PYTHONPATH=src poetry run celery -A celery_app worker --loglevel=info  # Celery worker
```

```bash
cd frontend
npm run dev     # Vite dev server on :3000
npm run build   # Production build
npm run lint    # ESLint
```
Vite proxies `/api/*` to http://localhost:5000, stripping the `/api` prefix before forwarding.
Locust is included as a dev dependency:
```bash
cd backend
poetry run locust -f locustfile.py --host http://localhost:5000
```

```
eco399/
├── backend/
│   ├── src/
│   │   ├── main.py           # Flask app + routes
│   │   ├── celery_app.py     # Celery instance
│   │   ├── tasks.py          # process_pdf Celery task
│   │   └── paddlepaddle.py   # ML models + image processing pipeline
│   ├── uploads/              # Incoming PDFs (auto-created)
│   ├── outputs/              # Output CSVs (auto-created)
│   └── pyproject.toml
├── frontend/
│   ├── src/
│   │   ├── main.tsx
│   │   └── App.tsx
│   └── package.json
├── deploy/
│   ├── provision.sh          # One-time server setup
│   └── deploy.sh             # Code sync + container restart
├── docker-compose.yml
├── Dockerfile                # Backend image
└── Makefile
```