Production-grade AI health query service with adaptive routing, NVIDIA LLM ensemble reasoning, and Gemini final aggregation.
- Classifier (fast NVIDIA model) labels query type, complexity, risk, and routing needs.
- Router decides between fast response, ensemble reasoning, or emergency bypass.
- Ensemble runs multiple NVIDIA models in parallel and validates strict JSON outputs.
- Gemini Judge aggregates only overlapping, safe conclusions and adds a medical disclaimer.
- Safety enforces strict medical constraints and emergency escalation.
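The routing step in the list above can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `router/decision.py`: the `Classification` fields, the label values, the `Route` names, and the policy itself are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    FAST = "fast"
    ENSEMBLE = "ensemble"
    EMERGENCY = "emergency"


@dataclass(frozen=True)
class Classification:
    # Hypothetical label set; the real classifier output may differ.
    query_type: str
    complexity: str  # assumed values: "low", "medium", "high"
    risk: str        # assumed values: "low", "medium", "high", "emergency"


def decide_route(c: Classification) -> Route:
    """Pick a route from classifier labels (illustrative policy only)."""
    if c.risk == "emergency":
        # Emergency queries skip every model call.
        return Route.EMERGENCY
    if c.risk == "low" and c.complexity == "low":
        # Simple, low-risk queries go to the single fast model.
        return Route.FAST
    # Everything else gets the parallel ensemble plus the judge.
    return Route.ENSEMBLE
```

Checking the emergency label first means no other label can override an escalation.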
health_ai/
+-- app.py
+-- router/
| +-- classifier.py
| +-- decision.py
+-- models/
| +-- nvidia_client.py
| +-- nemotron.py
| +-- mistral.py
| +-- qwen.py
+-- ensemble/
| +-- runner.py
| +-- validator.py
| +-- normalizer.py
+-- judge/
| +-- gemini.py
+-- safety/
| +-- rules.py
+-- schemas/
| +-- ensemble_response.json
+-- utils/
| +-- env.py
| +-- logger.py
+-- .env.example
+-- requirements.txt
+-- README.md
- Create and activate a virtual environment.
- Install dependencies:
pip install -r health_ai/requirements.txt
- Create health_ai/.env from health_ai/.env.example and set your keys. You can set a single NVIDIA_API_KEY for all models or override it per model with NVIDIA_API_KEY_*.
uvicorn health_ai.app:app --host 0.0.0.0 --port 8000 --reload
- GET /health
- POST /query
Example:
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query":"I have a headache and mild nausea for two days"}'
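The same request can be sent from Python using only the standard library. This is a sketch: the helper names `build_query_request` and `send_query` are made up for the example, and it assumes the service replies with a JSON body, which the text above does not specify.

```python
import json
import urllib.request


def build_query_request(query: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(query: str) -> dict:
    """Send the request; requires the service to be running locally."""
    with urllib.request.urlopen(build_query_request(query), timeout=30) as resp:
        # Assumption: the response body is JSON.
        return json.loads(resp.read().decode("utf-8"))
```

Calling `send_query("I have a headache and mild nausea for two days")` mirrors the curl command.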
- Responses contain no diagnoses, prescriptions, or medication names.
- Emergency queries bypass models and return immediate escalation.
- All responses include a medical disclaimer.
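A rough sketch of how the emergency bypass and the disclaimer rule could fit together follows. The patterns, the message wording, and the function names are all assumptions for illustration; the actual rules in `safety/rules.py` are not shown in this document, and a real deployment would need a far more careful emergency detector than a keyword list.

```python
import re
from typing import Callable

# Hypothetical patterns and wording, chosen only for this example.
EMERGENCY_PATTERNS = [
    r"\bchest pain\b",
    r"\bcan'?t breathe\b",
    r"\bcannot breathe\b",
    r"\bunconscious\b",
]
DISCLAIMER = "This information is not medical advice. Consult a healthcare professional."
ESCALATION = "This may be an emergency. Contact your local emergency services now."


def is_emergency(query: str) -> bool:
    """Return True if any emergency pattern matches the lowercased query."""
    text = query.lower()
    return any(re.search(p, text) for p in EMERGENCY_PATTERNS)


def with_disclaimer(answer: str) -> str:
    """Append the disclaimer unless the answer already ends with it."""
    if answer.endswith(DISCLAIMER):
        return answer
    return f"{answer}\n\n{DISCLAIMER}"


def handle(query: str, answer_fn: Callable[[str], str]) -> str:
    """Bypass answer_fn entirely for emergencies; always add the disclaimer."""
    if is_emergency(query):
        return with_disclaimer(ESCALATION)
    return with_disclaimer(answer_fn(query))
```

Here `answer_fn` stands in for whichever model path the router selected. It is never called when the query looks like an emergency.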
NVIDIA_API_KEY=...
NVIDIA_API_KEY_CLASSIFIER=...
NVIDIA_API_KEY_FAST=...
NVIDIA_API_KEY_NEMOTRON=...
NVIDIA_API_KEY_MISTRAL=...
NVIDIA_API_KEY_QWEN=...
NVIDIA_BASE_URL=https://integrate.api.nvidia.com/v1
NVIDIA_TIMEOUT_S=20
GEMINI_API_KEY=...
GEMINI_MODEL=gemini-1.5-pro
LOG_LEVEL=INFO
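The per-model override described in the setup steps could be resolved as sketched below: look for `NVIDIA_API_KEY_<MODEL>` first, then fall back to the shared `NVIDIA_API_KEY`. The function name and the `env` parameter are assumptions; the actual loader in `utils/env.py` may work differently.

```python
import os
from typing import Mapping, Optional


def resolve_nvidia_key(model: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the key for `model`, preferring the per-model variable."""
    # Passing `env` explicitly keeps the function testable; default is the process environment.
    source = os.environ if env is None else env
    key = source.get(f"NVIDIA_API_KEY_{model.upper()}") or source.get("NVIDIA_API_KEY")
    if not key:
        raise RuntimeError(f"No NVIDIA API key configured for model {model!r}")
    return key
```

For example, with only `NVIDIA_API_KEY` and `NVIDIA_API_KEY_QWEN` set, `resolve_nvidia_key("qwen")` returns the Qwen-specific key and `resolve_nvidia_key("mistral")` returns the shared one.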
- Ensemble outputs are strictly validated against schemas/ensemble_response.json.
- If any model flags high severity, the system escalates immediately.
- Gemini aggregation only uses overlapping, safe content.
- No API keys are stored in code.
- Use environment variables or a secrets manager in production.
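The validation, escalation, and overlap notes above can be illustrated with a small standard-library sketch. Several things here are assumptions: the fields `severity` and `advice` are hypothetical stand-ins for whatever `schemas/ensemble_response.json` defines, the hand-written field check replaces real JSON Schema validation, and exact string matching replaces the Gemini judge's aggregation.

```python
import json
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical required fields and their types.
REQUIRED_FIELDS = {"severity": str, "advice": list}


def parse_strict(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if it fails the assumed checks."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Strict: exactly the required keys, no extras, no omissions.
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        return None
    if not all(isinstance(data[k], t) for k, t in REQUIRED_FIELDS.items()):
        return None
    if not all(isinstance(a, str) for a in data["advice"]):
        return None
    return data


def aggregate(raw_outputs: List[str], min_agree: int = 2) -> Dict[str, object]:
    """Escalate on any high severity; otherwise keep advice shared by >= min_agree models."""
    valid = [d for d in map(parse_strict, raw_outputs) if d is not None]
    if any(d["severity"] == "high" for d in valid):
        return {"escalate": True, "advice": []}
    # Count each advice item once per model, then keep the overlapping ones.
    counts = Counter(a for d in valid for a in set(d["advice"]))
    overlap = sorted(a for a, n in counts.items() if n >= min_agree)
    return {"escalate": False, "advice": overlap}
```

Invalid outputs are dropped before aggregation, so a malformed reply from one model cannot contribute to the final answer.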