AI-Powered Synthetic Medical Data Generation & Privacy Evaluation Platform
Generate privacy-preserving synthetic medical datasets using Large Language Models with built-in utility and privacy risk assessment.
- Overview
- Features
- Architecture
- Tech Stack
- Getting Started
- Usage
- API Reference
- Evaluation Pipeline
- Privacy Assessment
- Project Structure
- Contributing
- License
MedGen addresses a critical challenge in healthcare AI: the scarcity of accessible medical data due to privacy regulations (HIPAA, GDPR). My groupmates and I came up with the idea for our CS3264 project at NUS, and I have continued to work on it, extending its functionality as LLMs have grown in analytical power. By leveraging state-of-the-art Large Language Models with Retrieval-Augmented Generation (RAG), MedGen generates high-quality synthetic medical datasets that:
- ✅ Preserve statistical properties of original data
- ✅ Maintain utility for machine learning tasks
- ✅ Minimize privacy risks (singling out, linkability, inference attacks)
- ✅ Enable safe data sharing for research and development
- Unified Dataset Hub: Manage all datasets from a central location
- Sample Datasets: Pre-loaded medical datasets (Pima Diabetes, Diabetes Prediction, Andrew's Diabetes)
- Save & Organize: Save generated datasets with custom names and descriptions
- One-Click Activation: Instantly switch between datasets for analysis
- Preview & Delete: Preview any dataset or remove saved ones
- Dual Generation Modes:
  - ⚡ Fast Mode: Single API call batch generation (~5-10 seconds for 10-50 rows)
  - 🧠 Deep Mode: Feature-by-feature RAG-enhanced generation (slower but more context-aware)
- LLM-Powered Generation: Uses GPT-4o-mini with customizable parameters
- Auto-Batching: Automatic batching for large requests (>25 rows)
- Real-time Progress: Live progress updates during generation
- CSV Auto-Detection: Automatic delimiter detection (comma, semicolon, tab, pipe)
- Interactive Data Explorer: Upload, view, and analyze CSV datasets
- Statistical Analysis: Automatic computation of distributions, correlations, and summary statistics
- Rich Visualizations: Charts and graphs powered by Recharts
- Download Synthetic Data: Export only the generated rows
- Download Combined Data: Export original + synthetic merged datasets
- Save for Later: Persist generated datasets for future use
- Multi-Model Comparison: Evaluate with KNN, MLP, Naive Bayes, Random Forest, SGD, and SVM
- Automated Pipeline: Split → Train → Generate → Compare workflow
- Performance Metrics: Accuracy, precision, recall, F1-score, confusion matrices
- Anonymeter Integration: Industry-standard privacy risk metrics
- Singling Out Risk: Probability of uniquely identifying individuals
- Linkability Risk: Risk of linking records across datasets
- Inference Risk: Risk of inferring sensitive attributes
- Material-UI v7 Design: Clean, responsive interface with cyberpunk dark theme
- Sidebar Navigation: Quick access to all features
- Real-time Updates: Live generation progress and status
- Natural Language Queries: Ask questions about your data in plain English
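
The auto-batching feature listed above (requests of more than 25 rows are split across several calls) can be illustrated with a minimal sketch. The function name `plan_batches` and the fixed batch size of 25 are assumptions made for illustration; the actual batching logic in `generate_data.py` may differ.

```python
def plan_batches(total_rows: int, batch_size: int = 25) -> list[int]:
    """Split a request for `total_rows` rows into per-call batch sizes.

    Hypothetical helper: batch_size=25 mirrors the ">25 rows" threshold
    mentioned in the feature list, not a value read from the codebase.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full_batches, remainder = divmod(total_rows, batch_size)
    batches = [batch_size] * full_batches
    if remainder:
        # The last call only asks for the rows that are still missing.
        batches.append(remainder)
    return batches


print(plan_batches(60))  # [25, 25, 10]
```

Each entry would correspond to one LLM call, so a 60-row request becomes three calls whose results are concatenated.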

System architecture:

    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                             Frontend (React 19)                             │
    │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐           │
    │  │   Home   │ │ Datasets │ │ Explorer │ │ Analysis │ │ Generate │ ...       │
    │  └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘           │
    └─────────────────────────────────────────────────────────────────────────────┘
                                           │
                                           ▼
    ┌─────────────────────────────────────────────────────────────────────────────┐
    │                             Backend (Flask API)                             │
    │  ┌────────────────┐  ┌────────────────┐  ┌────────────────────────────┐     │
    │  │    Dataset     │  │    Generate    │  │    Evaluation Pipeline     │     │
    │  │   Management   │  │    Service     │  │   (ML Models + Privacy)    │     │
    │  └────────────────┘  └────────────────┘  └────────────────────────────┘     │
    │          │                   │                         │                    │
    │  ┌───────▼───────────────────▼─────────────────────────▼─────────────────┐  │
    │  │                          Data Storage Layer                           │  │
    │  │    ./data/saved_datasets/ │ ./data/generated/ │ ./data/chroma_db/     │  │
    │  └───────────────────────────────────────────────────────────────────────┘  │
    └─────────────────────────────────────────────────────────────────────────────┘
                                           │
                            ┌──────────────┼──────────────┐
                            ▼              ▼              ▼
                      ┌──────────┐   ┌──────────┐ ┌──────────────┐
                      │ ChromaDB │   │  OpenAI  │ │  Anonymeter  │
                      │ (Vector) │   │   API    │ │  (Privacy)   │
                      └──────────┘   └──────────┘ └──────────────┘

| Technology | Purpose |
|---|---|
| Python 3.11+ | Core language |
| Flask 3.1 | REST API server |
| LlamaIndex | RAG framework |
| ChromaDB | Vector database for embeddings |
| OpenAI GPT-4o-mini | Synthetic data generation |
| scikit-learn | ML model evaluation |
| Anonymeter | Privacy risk assessment |
| Pandas/NumPy | Data processing |
| Technology | Purpose |
|---|---|
| React 19 | UI framework |
| Material-UI v7 | Component library |
| Recharts | Data visualization |
| Framer Motion | Animations |
| Axios | HTTP client |
| React Router v7 | Navigation |
- Python 3.11 or 3.12
- Node.js 18+ and npm
- OpenAI API key
- Clone the repository

      git clone https://github.com/SomneelSaha2042/MedGen
      cd MedGen

- Set up Python environment

      # Using uv (recommended)
      pip install uv
      uv sync

      # Or using pip
      pip install -r requirements.txt

- Install frontend dependencies

      cd frontend
      npm install
      cd ..

- Configure environment variables

      cp .env.example .env
      # Edit .env and add your OpenAI API key

- Run the application

      # Start backend (terminal 1)
      uv run python backend.py

      # Start frontend (terminal 2)
      cd frontend && npm start

- Access the application
  - Frontend: http://localhost:3000
  - Backend API: http://localhost:5000
  - Health Check: http://localhost:5000/health

With Make:

    make install    # Install all dependencies
    make dev        # Run both backend and frontend
    make backend    # Run backend only
    make frontend   # Run frontend only
    make clean      # Clean generated files

With Docker:

    docker-compose up --build

Navigate to the Datasets page to:
- View all available sample datasets
- Activate a dataset with one click
- Save generated data for later use
- Preview any dataset before activating
Go to Data Explorer and upload your own CSV file. The platform automatically detects the delimiter (comma, semicolon, tab, or pipe).
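
Delimiter detection of this kind can be sketched with the standard library's `csv.Sniffer`. This is only an illustration of the idea, not necessarily how MedGen's backend does it; the helper name `detect_delimiter` and the comma fallback are assumptions.

```python
import csv

# Candidate delimiters taken from the feature list: comma, semicolon, tab, pipe.
CANDIDATE_DELIMITERS = ",;\t|"


def detect_delimiter(sample: str, default: str = ",") -> str:
    """Guess the delimiter of a CSV text sample (hypothetical helper)."""
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=CANDIDATE_DELIMITERS)
    except csv.Error:
        # Sniffer could not decide (e.g. single-column data): fall back.
        return default
    return dialect.delimiter


sample = "age;bmi;outcome\n50;33.6;1\n31;26.6;0\n"
print(repr(detect_delimiter(sample)))
```

In practice only the first few kilobytes of an uploaded file would need to be passed in as `sample`.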
Go to Data Generation and configure:
- Generation Mode: Fast (batch) or Deep (feature-by-feature)
- Number of samples: How many synthetic rows to generate
- Temperature (0.1-2.0): Controls randomness
- Top-P (0.1-1.0): Nucleus sampling threshold
- Frequency Penalty: Reduces repetitive patterns
- Max Tokens: Maximum tokens per API call
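
The settings above map naturally onto a validated request payload. The sketch below is illustrative: the function `build_generation_payload` is hypothetical, the JSON field names are taken from the `/generate_data` curl example in this README, the numeric ranges come from the list above, and `"deep"` as the second mode value is an assumption. The penalty setting is left out because its field name and range are not documented here.

```python
def build_generation_payload(
    num_samples: int,
    temperature: float = 0.7,
    top_p: float = 0.9,
    max_tokens: int = 4096,
    mode: str = "fast",
) -> dict:
    """Validate generation settings and return a JSON-ready dict."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    if not 0.1 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.1 and 2.0")
    if not 0.1 <= top_p <= 1.0:
        raise ValueError("top_p must be between 0.1 and 1.0")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if mode not in ("fast", "deep"):  # "deep" is an assumed value
        raise ValueError("mode must be 'fast' or 'deep'")
    return {
        "numSamples": num_samples,
        "temperature": temperature,
        "topP": top_p,
        "maxTokens": max_tokens,
        "generationMode": mode,
    }


print(build_generation_payload(50))
```

Rejecting out-of-range values on the client side avoids spending an API call on a request the model provider would refuse anyway.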
After generation:
- Download as CSV (synthetic only or combined)
- Use for Analysis to switch to the generated data
- Save for Later to store in your dataset library
Use the Analysis page to:
- View statistical distributions
- Generate charts and visualizations
- Compare original vs synthetic data
Use the Query Interface to ask questions about your data in plain English, powered by RAG.
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/datasets` | List all datasets (sample + saved) |
| `POST` | `/datasets/<id>/activate` | Activate a dataset for analysis |
| `POST` | `/datasets/save` | Save generated data as new dataset |
| `DELETE` | `/datasets/<id>` | Delete a saved dataset |
| `GET` | `/datasets/<id>/preview` | Preview dataset (first 100 rows) |

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/generate_data` | Start synthetic data generation |
| `GET` | `/generation_status` | Check generation progress |
| `GET` | `/get_generated_data` | Retrieve generated data |
| `GET` | `/download_data?type=<type>` | Download as CSV (synthetic/combined/original) |
| `POST` | `/use_generated_data` | Switch to generated data for analysis |

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/upload` | Upload CSV dataset |
| `GET` | `/check_csv_status` | Check if CSV is loaded |
| `POST` | `/delete_current_csv` | Remove current CSV |
| `GET` | `/sample_datasets` | List sample datasets |
| `POST` | `/use_sample_dataset` | Use a sample dataset |

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/stats_query` | Get statistical analysis |
| `POST` | `/stream_analysis` | Stream analysis results |
| `POST` | `/query_csv` | Execute pandas query |

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check endpoint |
| `GET` | `/data_availability` | Check available data |

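
For Python clients, requests against the endpoints listed above can be prepared with the standard library alone. This is a sketch under assumptions: the helper names are hypothetical, the base URL is the local default from the setup instructions, and the response formats are not covered because they are not documented in this README.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # local default from the setup steps


def generate_request(payload: dict, base_url: str = BASE_URL) -> Request:
    """Prepare a POST /generate_data request with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        f"{base_url}/generate_data",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def download_request(kind: str = "synthetic", base_url: str = BASE_URL) -> Request:
    """Prepare a GET /download_data?type=<kind> request."""
    if kind not in ("synthetic", "combined", "original"):
        raise ValueError("kind must be synthetic, combined, or original")
    query = urlencode({"type": kind})
    return Request(f"{base_url}/download_data?{query}", method="GET")


req = download_request("combined")
print(req.get_method(), req.full_url)
```

With the backend running, either object can be sent with `urllib.request.urlopen(req)`.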

Start a generation run:

    curl -X POST http://localhost:5000/generate_data \
      -H "Content-Type: application/json" \
      -d '{
        "numSamples": 50,
        "temperature": 0.7,
        "topP": 0.9,
        "repetitionPenalty": 1.1,
        "maxTokens": 4096,
        "generationMode": "fast"
      }'

Save the generated data as a dataset:

    curl -X POST http://localhost:5000/datasets/save \
      -H "Content-Type: application/json" \
      -d '{
        "name": "My Study Data",
        "description": "100 synthetic diabetes records",
        "type": "combined"
      }'

The evaluation pipeline (`basic_eval_pipeline.py`) performs:
- Data Splitting: 80% training / 20% test
- Original Training: Train 6 ML models on original training data
- Synthetic Generation: Generate synthetic data matching training set size
- Synthetic Training: Train same models on synthetic data
- Evaluation: Compare both on the held-out test set
- Visualization: Generate comparison plots and metrics
- K-Nearest Neighbors (KNN)
- Multi-Layer Perceptron (MLP)
- Naive Bayes
- Random Forest
- Stochastic Gradient Descent (SGD)
- Support Vector Machine (SVM)
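
The split, train, and compare idea behind the pipeline can be shown in a dependency-free sketch. Everything here is a stand-in chosen for illustration: a hand-rolled 1-nearest-neighbour classifier replaces the six scikit-learn models, toy Gaussian blobs replace a medical dataset, and `jitter` (noisy copies of training rows) replaces LLM generation. It is not the code in `basic_eval_pipeline.py`.

```python
import random

Row = tuple[list[float], int]  # (feature vector, class label)


def make_toy_rows(n: int = 200, seed: int = 42) -> list[Row]:
    """Two well-separated Gaussian blobs standing in for a real dataset."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        rows.append(([rng.gauss(center, 0.5), rng.gauss(center, 0.5)], label))
    return rows


def split_rows(rows: list[Row], test_fraction: float = 0.2, seed: int = 0):
    """Shuffle and split into (train, test); 80/20 by default."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def jitter(rows: list[Row], noise: float = 0.1, seed: int = 1) -> list[Row]:
    """Placeholder for synthetic generation: noisy copies of training rows."""
    rng = random.Random(seed)
    return [([v + rng.gauss(0.0, noise) for v in x], y) for x, y in rows]


def predict_1nn(train: list[Row], x: list[float]) -> int:
    """Label of the training row closest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return nearest[1]


def accuracy(train: list[Row], test: list[Row]) -> float:
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


rows = make_toy_rows()
train, test = split_rows(rows)
synthetic = jitter(train)  # same size as the training set
report = {
    "trained_on_original": accuracy(train, test),
    "trained_on_synthetic": accuracy(synthetic, test),
}
print(report)
```

The point of the comparison is that both models are scored on the same held-out real rows, so a small gap between the two numbers suggests the synthetic data kept its utility.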

Run the single-dataset evaluation:

    uv run python basic_eval_pipeline.py

Run the multi-dataset evaluation:

    uv run python multi_dataset_pipeline.py

MedGen uses Anonymeter for privacy risk evaluation:
- Singling out risk: measures the probability that a synthetic record can be used to uniquely identify an individual in the original dataset.
- Linkability risk: assesses whether records in the synthetic dataset can be linked to records in external datasets.
- Inference risk: evaluates the risk of inferring sensitive attributes about individuals from the synthetic data.
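
Before running an attack-based evaluation, a much cheaper sanity check is to count how many synthetic rows are verbatim copies of original rows. The sketch below shows that check only; it is not one of Anonymeter's metrics and does not replace them, and the function name `exact_copy_rate` is hypothetical.

```python
def exact_copy_rate(original: list[tuple], synthetic: list[tuple]) -> float:
    """Fraction of synthetic rows that appear verbatim in the original data."""
    if not synthetic:
        raise ValueError("synthetic must not be empty")
    seen = set(original)  # rows must be hashable, e.g. tuples of values
    copies = sum(row in seen for row in synthetic)
    return copies / len(synthetic)


original = [(50, 33.6, 1), (31, 26.6, 0), (32, 23.3, 1)]
synthetic = [(50, 33.6, 1), (47, 30.1, 0), (29, 24.8, 0), (31, 26.6, 0)]
print(exact_copy_rate(original, synthetic))  # 0.5
```

A non-zero rate means the generator has reproduced real records outright, which is worth investigating before looking at the finer-grained risk scores.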

Run the privacy evaluation:

    uv run python anonymeter_privacy_eval.py

Project structure:

    MedGen/
    ├── backend.py                     # Flask API server (main entry point)
    ├── generate_data.py               # LLM synthetic data generation (fast + deep modes)
    ├── rag.py                         # RAG system with ChromaDB
    ├── basic_eval_pipeline.py         # ML evaluation pipeline
    ├── multi_dataset_pipeline.py      # Multi-dataset evaluation
    ├── anonymeter_privacy_eval.py     # Privacy risk assessment
    ├── preprocess.py                  # Data preprocessing utilities
    ├── dquery.py                      # Feature analysis with LLM
    │
    ├── frontend/                      # React frontend application
    │   ├── src/
    │   │   ├── components/            # React components
    │   │   │   ├── Home.js            # Landing page
    │   │   │   ├── DatasetManager.js  # Dataset management UI
    │   │   │   ├── DataExplorer.js    # Data upload and preview
    │   │   │   ├── DataGeneration.js  # Generation interface
    │   │   │   ├── Analysis.js        # Data analysis & charts
    │   │   │   ├── Database.js        # Database info
    │   │   │   ├── Sidebar.js         # Navigation sidebar
    │   │   │   └── ...
    │   │   ├── services/
    │   │   │   └── api.js             # API client with all endpoints
    │   │   └── App.js                 # Main app with routing
    │   └── package.json
    │
    ├── data/                          # Runtime data storage
    │   ├── saved_datasets/            # User-saved datasets
    │   ├── generated/                 # Generated synthetic data
    │   ├── chroma_db/                 # ChromaDB vector store
    │   └── features/                  # Feature documents for RAG
    │
    ├── evals/                         # Evaluation module
    │   ├── models/                    # ML model implementations
    │   │   ├── knn.py
    │   │   ├── mlp.py
    │   │   ├── naivebayes.py
    │   │   ├── randomforest.py
    │   │   ├── sgd.py
    │   │   └── svm.py
    │   ├── dataset/                   # Evaluation datasets
    │   └── pristine_datasets/         # Original unmodified datasets
    │
    ├── datasets/                      # Sample datasets
    ├── results/                       # Generated results and plots
    ├── multi_dataset_results/         # Multi-dataset evaluation results
    │
    ├── .vscode/                       # VS Code configuration
    │   ├── launch.json                # Debug configurations
    │   └── settings.json              # Editor settings
    │
    ├── pyproject.toml                 # Python project configuration (uv)
    ├── requirements.txt               # Python dependencies
    ├── Makefile                       # Build automation
    ├── docker-compose.yml             # Docker configuration
    ├── Dockerfile                     # Backend container
    └── .env.example                   # Environment template


The overall user workflow:

    ┌─────────────────────────────────────────────────────────────────────┐
    │                            User Workflow                            │
    └─────────────────────────────────────────────────────────────────────┘
                                       │
               ┌───────────────────────┼───────────────────────┐
               ▼                       ▼                       ▼
      ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
      │ Select Dataset  │     │  Upload Custom  │     │   Use Sample    │
      │   (Datasets)    │     │   (Explorer)    │     │     Dataset     │
      └────────┬────────┘     └────────┬────────┘     └────────┬────────┘
               │                       │                       │
               └───────────────────────┼───────────────────────┘
                                       ▼
                            ┌─────────────────────┐
                            │   Active Dataset    │
                            │  (RAG Index Built)  │
                            └──────────┬──────────┘
                                       │
               ┌───────────────────────┼───────────────────────┐
               ▼                       ▼                       ▼
      ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
      │     Analyze     │     │    Generate     │     │      Query      │
      │   (Analysis)    │     │  (Generation)   │     │   (Database)    │
      └─────────────────┘     └────────┬────────┘     └─────────────────┘
                                       │
                                       ▼
                            ┌─────────────────────┐
                            │   Generated Data    │
                            │  (Synthetic Rows)   │
                            └──────────┬──────────┘
                                       │
               ┌───────────────────────┼───────────────────────┐
               ▼                       ▼                       ▼
      ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
      │    Download     │     │     Use for     │     │      Save       │
      │     as CSV      │     │    Analysis     │     │    for Later    │
      └─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                               │
                                                               ▼
                                                    ┌─────────────────────┐
                                                    │   Saved Datasets    │
                                                    │ (Datasets Library)  │
                                                    └─────────────────────┘

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built as part of CS3264 coursework at the National University of Singapore
- Uses Anonymeter for privacy evaluation
- Powered by OpenAI GPT-4o-mini
- UI components from Material-UI
- RAG framework by LlamaIndex
Made with ❤️ for privacy-preserving healthcare AI