A simple semantic caching layer for LLM responses using vector embeddings and Qdrant. Responses to semantically similar queries are served from the cache, reducing API costs and latency.
```
User Query → Embedding Model → Vector Search → Cache Hit/Miss
                                                     ↓ (on miss)
                                              LLM API (Ollama)
                                                     ↓
                                               Store in Cache
```
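In code, this flow amounts to a single route that checks the cache before falling back to the model. The sketch below is only illustrative: the module layout, helper names, and the in-memory stand-in for the vector cache are assumptions, not the repository's actual `app/main.py` (the Qdrant-backed lookup is sketched further down, alongside the cache-decision steps).

```python
# Illustrative sketch of the request path above; helper names and the
# in-memory stand-in cache are assumptions, not the project's actual code.
import time
from typing import Optional

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
_cache: dict[str, str] = {}  # stand-in for the Qdrant-backed vector cache


class ChatRequest(BaseModel):
    prompt: str


def cache_lookup(prompt: str) -> Optional[str]:
    # The real lookup embeds the prompt and runs a similarity search in Qdrant;
    # an exact-match dict keeps this sketch self-contained.
    return _cache.get(prompt)


def cache_store(prompt: str, response: str) -> None:
    _cache[prompt] = response


@app.post("/chat")
async def chat(req: ChatRequest) -> dict:
    start = time.perf_counter()

    cached = cache_lookup(req.prompt)
    if cached is not None:
        return {"response": cached, "source": "cached",
                "latency": round(time.perf_counter() - start, 3)}

    # Cache miss: ask the local Ollama instance for a completion.
    async with httpx.AsyncClient(timeout=60.0) as client:
        r = await client.post(
            "http://localhost:11434/api/generate",
            json={"model": "qwen2.5:0.5b", "prompt": req.prompt, "stream": False},
        )
    answer = r.json()["response"]

    cache_store(req.prompt, answer)
    return {"response": answer, "source": "llm",
            "latency": round(time.perf_counter() - start, 3)}
```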
- Python 3.9+
- Docker & Docker Compose
- Ollama running on `localhost:11434` with the `qwen2.5:0.5b` model
- Clone the repository

  ```bash
  git clone <repo-url>
  cd semantic-cacher
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Start Qdrant

  ```bash
  docker-compose up -d
  ```

- Start Ollama (if not already running)

  ```bash
  ollama serve
  ollama pull qwen2.5:0.5b
  ```

- Run the API server

  ```bash
  uvicorn app.main:app --reload
  ```
The API will be available at http://localhost:8000
POST /chat
- Request body: `{"prompt": "Your question here"}`
- Response: `{"response": "...", "source": "cached|llm", "latency": 0.123}`
GET /health
- Check if the service is running and models are loaded
```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?"}'
```

Run the benchmark script to test cache performance:

```bash
python scripts/benchmark.py
```

This will send a series of queries (some repeated) and show cache hit rates and latency improvements.
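`scripts/benchmark.py` ships with the repository; the loop below is only a sketch of the kind of measurement it performs (mixed fresh and repeated prompts, per-request latency, hit-rate tally), not the actual script.

```python
# Illustrative benchmark loop (not the repository's scripts/benchmark.py).
import requests

URL = "http://localhost:8000/chat"
prompts = [
    "What is the capital of France?",
    "Explain semantic caching in one sentence.",
    "What is the capital of France?",      # exact repeat -> expected cache hit
    "What's the capital city of France?",  # paraphrase -> may also hit
]

hits, latencies = 0, []
for prompt in prompts:
    body = requests.post(URL, json={"prompt": prompt}, timeout=60).json()
    hits += body["source"] == "cached"
    latencies.append(body["latency"])
    print(f"{body['source']:>6}  {body['latency']:.3f}s  {prompt}")

print(f"hit rate: {hits / len(prompts):.0%}, "
      f"mean latency: {sum(latencies) / len(latencies):.3f}s")
```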
- Query embedding: Incoming queries are converted to 384-dimensional vectors using `all-MiniLM-L6-v2`
- Similarity search: Qdrant searches for similar vectors in the cache
- Cache decision: If similarity > 0.85, return the cached response; otherwise call the LLM
- Cache storage: New LLM responses are stored with their query embeddings
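Put together, the lookup/store logic looks roughly like the sketch below, using `sentence-transformers` and `qdrant-client`. The collection name, payload keys, and the `call_llm` callback are assumptions for illustration, not the project's actual code.

```python
# Sketch of the cache decision described above (names are illustrative).
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

COLLECTION = "semantic_cache"  # assumed collection name
THRESHOLD = 0.85               # cosine-similarity cutoff from the list above

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
qdrant = QdrantClient(url="http://localhost:6333")

# Create the collection on first run.
existing = {c.name for c in qdrant.get_collections().collections}
if COLLECTION not in existing:
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )


def cached_answer(prompt: str, call_llm) -> tuple:
    """Return (response, source); call_llm is any prompt -> str function."""
    vector = encoder.encode(prompt).tolist()

    # Similarity search: fetch the single closest cached query.
    hits = qdrant.search(collection_name=COLLECTION, query_vector=vector, limit=1)
    if hits and hits[0].score > THRESHOLD:
        return hits[0].payload["response"], "cached"

    # Cache miss: call the LLM and store the new response with its embedding.
    response = call_llm(prompt)
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(id=str(uuid.uuid4()), vector=vector,
                            payload={"prompt": prompt, "response": response})],
    )
    return response, "llm"
```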
- FastAPI - Web framework
- Qdrant - Vector database
- Sentence Transformers - Embedding generation
- Ollama - Local LLM inference
MIT