eco399 — PDF to CSV Converter

Extracts tables from PDF documents and converts them to CSV. Upload a PDF, and the app runs DETR table detection followed by PaddleOCR to extract table contents.

Architecture

Browser → React (Vite) → Flask API → Celery worker → Redis
                                          │
                              PyMuPDF → DETR → PaddleOCR → CSV

Frontend: React 19 + TypeScript + Tailwind CSS, served via Vite (dev) or Nginx (prod)
Backend: Flask 3 handles uploads and status polling; Celery processes PDFs asynchronously
Models: TahaDouaji/detr-doc-table-detection for table detection, PaddleOCR for text extraction
Broker: Redis (job queue + result backend)

Running Locally

Prerequisites

Python ≥ 3.12 with Poetry
Node.js ≥ 20 with npm
Redis (redis-server or brew install redis)

Install dependencies

make install

Start Redis

redis-server

Start all services

make run-all

This starts Flask (port 5000), the Celery worker, and the Vite dev server (port 3000) in the background. Open http://localhost:3000.

Or start each service individually in separate terminals:

make run-backend    # Flask on :5000
make run-worker     # Celery worker
make run-frontend   # Vite dev server on :3000

Note: The Celery worker downloads PaddleOCR and DETR model weights on first run (~1–2 GB). This takes a few minutes but only happens once.

Running with Docker

Prerequisites

Docker with the Compose plugin

Build and start

docker compose up --build

Open http://localhost. The backend healthcheck waits for models to load before the frontend comes up — first start takes ~60–90s.

To rebuild only the worker (e.g. after changing tasks.py):

docker compose up --build worker

To stop and remove volumes:

docker compose down -v

Deploying to a Remote Server

The deploy/ directory contains two scripts for deploying to a Linux server over SSH.

1. Provision (one-time)

Installs Docker CE and configures the firewall (ports 22, 80, 443) on a fresh Ubuntu 22.04/24.04 server:

./deploy/provision.sh <host> [user] [ssh-key]

# Examples:
./deploy/provision.sh 192.168.1.100
./deploy/provision.sh 192.168.1.100 ubuntu ~/.ssh/my_key.pem

2. Deploy

Syncs code via rsync and restarts containers:

./deploy/deploy.sh <host> [user] [ssh-key]

# Examples:
./deploy/deploy.sh 192.168.1.100
./deploy/deploy.sh 192.168.1.100 ubuntu ~/.ssh/my_key.pem

Safe to run repeatedly — only changed files are transferred. The app will be live at http://<host> when the script completes.

API

| Method | Endpoint | Description |
| --- | --- | --- |
| `POST` | `/upload` | Upload a PDF. Returns `{job_id}`. |
| `GET` | `/status/<job_id>` | Poll job status. Returns state + step label, or result on completion. |
| `GET` | `/download/<filename>` | Download the output CSV. |
| `GET` | `/health` | Health check. |

Status response states

| State | Meaning |
| --- | --- |
| `pending` | Job queued, not yet started |
| `progress` | Processing: `step` field describes current stage |
| `success` | Done: `filename` and `tables_found` fields present |
| `failure` | Error: `error` field contains the message |
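
The states above suggest a simple polling loop on the client side. The following is a minimal sketch, not the project's own client: it assumes the state is returned under a JSON key named `state` (the key name is not confirmed by this README), and `fetch_status` and `wait_for_result` are hypothetical helper names. The fetch function is injectable so the loop can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:5000"  # Flask port used in this README


def fetch_status(job_id: str) -> dict:
    """GET /status/<job_id> and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/status/{job_id}") as resp:
        return json.load(resp)


def wait_for_result(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_status,
    interval: float = 1.0,
    max_polls: int = 600,
) -> dict:
    """Poll until the job reports `success` or `failure`."""
    for _ in range(max_polls):
        payload = fetch(job_id)
        state = payload.get("state")  # assumed key name
        if state == "success":
            return payload  # carries `filename` and `tables_found`
        if state == "failure":
            raise RuntimeError(payload.get("error", "unknown error"))
        if state == "progress":
            print("step:", payload.get("step"))
        # `pending` and `progress` both mean: wait, then ask again.
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

On `success`, the returned `filename` can then be passed to `/download/<filename>`.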

Processing Pipeline

1. PDF → images: PyMuPDF at 300 DPI
2. Preprocessing: denoising, CLAHE contrast enhancement, unsharp masking, adaptive binarization
3. Table detection: DETR (TahaDouaji/detr-doc-table-detection) finds and crops table regions
4. OCR: PaddleOCR runs on each cropped table
5. CSV assembly: OCR results grouped into rows by y-coordinate and written to CSV
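
To give a sense of what 300 DPI means for image size: PDF page dimensions are measured in points, with 72 points per inch, so rendering at 300 DPI scales each side by 300/72. The sketch below is plain arithmetic and does not call PyMuPDF; `rendered_size` is a hypothetical helper, and the renderer's own rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))  # (2550, 3300)
```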

Both ML models load once per worker process at startup.
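
The CSV assembly step (grouping OCR results into rows by y-coordinate) can be sketched as follows. This is an illustration under stated assumptions, not the code in paddlepaddle.py: the `Word` type, the `y_tolerance` threshold, and both function names are hypothetical, and cells are simply emitted left to right within each row rather than aligned to detected columns.

```python
import csv
import io
from typing import NamedTuple


class Word(NamedTuple):
    x: float   # left edge of the OCR box, in pixels
    y: float   # vertical centre of the OCR box, in pixels
    text: str


def group_into_rows(words: list[Word], y_tolerance: float = 10.0) -> list[list[str]]:
    """Group words whose y-coordinates are close, then order each row by x."""
    rows: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Compare against the first word of the current row to avoid drift.
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize rows to CSV text; the csv module handles quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


words = [
    Word(120, 52, "Qty"),
    Word(10, 50, "Item"),
    Word(10, 98, "Bolt"),
    Word(118, 101, "40"),
]
print(rows_to_csv(group_into_rows(words)), end="")
# Item,Qty
# Bolt,40
```

A fixed tolerance is the simplest choice; a real table with varying font sizes would likely need a tolerance derived from box heights.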

Development

Backend only

cd backend
poetry run python src/main.py                                          # Flask on :5000
PYTHONPATH=src poetry run celery -A celery_app worker --loglevel=info  # Celery worker

Frontend only

cd frontend
npm run dev    # Vite dev server on :3000
npm run build  # Production build
npm run lint   # ESLint

Vite proxies /api/* to http://localhost:5000, stripping the /api prefix before forwarding.
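
As an illustration of the path mapping described above (not Vite's actual implementation, which lives in the Vite config), the rewrite can be modelled in a few lines. The function name `rewrite_proxy_path` is hypothetical.

```python
from typing import Optional


def rewrite_proxy_path(path: str, prefix: str = "/api") -> Optional[str]:
    """Return the path forwarded to Flask, or None if the request is not proxied."""
    if path == prefix or path.startswith(prefix + "/"):
        # Drop the prefix; a bare "/api" maps to the backend root.
        return path[len(prefix):] or "/"
    return None


print(rewrite_proxy_path("/api/upload"))         # /upload
print(rewrite_proxy_path("/api/status/abc123"))  # /status/abc123
print(rewrite_proxy_path("/index.html"))         # None
```

So a browser request to http://localhost:3000/api/upload reaches Flask as http://localhost:5000/upload.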

Load testing

Locust is included as a dev dependency:

cd backend
poetry run locust -f locustfile.py --host http://localhost:5000

Project Structure

eco399/
├── backend/
│   ├── src/
│   │   ├── main.py          # Flask app + routes
│   │   ├── celery_app.py    # Celery instance
│   │   ├── tasks.py         # process_pdf Celery task
│   │   └── paddlepaddle.py  # ML models + image processing pipeline
│   ├── uploads/             # Incoming PDFs (auto-created)
│   ├── outputs/             # Output CSVs (auto-created)
│   └── pyproject.toml
├── frontend/
│   ├── src/
│   │   ├── main.tsx
│   │   └── App.tsx
│   └── package.json
├── deploy/
│   ├── provision.sh         # One-time server setup
│   └── deploy.sh            # Code sync + container restart
├── docker-compose.yml
├── Dockerfile               # Backend image
└── Makefile
