A simple semantic caching layer for LLM responses using vector embeddings and Qdrant. Responses to semantically similar queries are served from the cache, reducing API costs and latency.
```
User Query → Embedding Model → Vector Search → Cache Hit/Miss
                                                     ↓ (on miss)
                                              LLM API (Ollama)
                                                     ↓
                                               Store in Cache
```
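In code, this flow amounts to a single route that checks the cache before falling back to the model. The sketch below is only illustrative: the module layout, helper names, and the in-memory stand-in for the vector cache are assumptions, not the repository's actual `app/main.py` (the Qdrant-backed lookup is sketched further down, alongside the cache-decision steps).

```python
# Illustrative sketch of the request path above; helper names and the
# in-memory stand-in cache are assumptions, not the project's actual code.
import time
from typing import Optional

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
_cache: dict[str, str] = {}  # stand-in for the Qdrant-backed vector cache


class ChatRequest(BaseModel):
    prompt: str


def cache_lookup(prompt: str) -> Optional[str]:
    # The real lookup embeds the prompt and runs a similarity search in Qdrant;
    # an exact-match dict keeps this sketch self-contained.
    return _cache.get(prompt)


def cache_store(prompt: str, response: str) -> None:
    _cache[prompt] = response


@app.post("/chat")
async def chat(req: ChatRequest) -> dict:
    start = time.perf_counter()

    cached = cache_lookup(req.prompt)
    if cached is not None:
        return {"response": cached, "source": "cached",
                "latency": round(time.perf_counter() - start, 3)}

    # Cache miss: ask the local Ollama instance for a completion.
    async with httpx.AsyncClient(timeout=60.0) as client:
        r = await client.post(
            "http://localhost:11434/api/generate",
            json={"model": "qwen2.5:0.5b", "prompt": req.prompt, "stream": False},
        )
    answer = r.json()["response"]

    cache_store(req.prompt, answer)
    return {"response": answer, "source": "llm",
            "latency": round(time.perf_counter() - start, 3)}
```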
- Python 3.9+
- Docker & Docker Compose
- Ollama running on `localhost:11434` with the `qwen2.5:0.5b` model
- Clone the repository

  ```bash
  git clone <repo-url>
  cd semantic-cacher
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Start Qdrant

  ```bash
  docker-compose up -d
  ```

- Start Ollama (if not already running)

  ```bash
  ollama serve
  ollama pull qwen2.5:0.5b
  ```

- Run the API server

  ```bash
  uvicorn app.main:app --reload
  ```
The API will be available at http://localhost:8000
POST /chat
- Request body: `{"prompt": "Your question here"}`
- Response: `{"response": "...", "source": "cached|llm", "latency": 0.123}`
GET /health
- Check if the service is running and models are loaded
```bash
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?"}'
```

Run the benchmark script to test cache performance:

```bash
python scripts/benchmark.py
```

This will send a series of queries (some repeated) and show cache hit rates and latency improvements.
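`scripts/benchmark.py` ships with the repository; the loop below is only a sketch of the kind of measurement it performs (mixed fresh and repeated prompts, per-request latency, hit-rate tally), not the actual script.

```python
# Illustrative benchmark loop (not the repository's scripts/benchmark.py).
import requests

URL = "http://localhost:8000/chat"
prompts = [
    "What is the capital of France?",
    "Explain semantic caching in one sentence.",
    "What is the capital of France?",      # exact repeat -> expected cache hit
    "What's the capital city of France?",  # paraphrase -> may also hit
]

hits, latencies = 0, []
for prompt in prompts:
    body = requests.post(URL, json={"prompt": prompt}, timeout=60).json()
    hits += body["source"] == "cached"
    latencies.append(body["latency"])
    print(f"{body['source']:>6}  {body['latency']:.3f}s  {prompt}")

print(f"hit rate: {hits / len(prompts):.0%}, "
      f"mean latency: {sum(latencies) / len(latencies):.3f}s")
```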
- Query embedding: Incoming queries are converted to 384-dimensional vectors using `all-MiniLM-L6-v2`
- Similarity search: Qdrant searches for similar vectors in the cache
- Cache decision: If similarity > 0.85, return the cached response; otherwise call the LLM
- Cache storage: New LLM responses are stored with their query embeddings
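Put together, the lookup/store logic looks roughly like the sketch below, using `sentence-transformers` and `qdrant-client`. The collection name, payload keys, and the `call_llm` callback are assumptions for illustration, not the project's actual code.

```python
# Sketch of the cache decision described above (names are illustrative).
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

COLLECTION = "semantic_cache"  # assumed collection name
THRESHOLD = 0.85               # cosine-similarity cutoff from the list above

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
qdrant = QdrantClient(url="http://localhost:6333")

# Create the collection on first run.
existing = {c.name for c in qdrant.get_collections().collections}
if COLLECTION not in existing:
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )


def cached_answer(prompt: str, call_llm) -> tuple:
    """Return (response, source); call_llm is any prompt -> str function."""
    vector = encoder.encode(prompt).tolist()

    # Similarity search: fetch the single closest cached query.
    hits = qdrant.search(collection_name=COLLECTION, query_vector=vector, limit=1)
    if hits and hits[0].score > THRESHOLD:
        return hits[0].payload["response"], "cached"

    # Cache miss: call the LLM and store the new response with its embedding.
    response = call_llm(prompt)
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(id=str(uuid.uuid4()), vector=vector,
                            payload={"prompt": prompt, "response": response})],
    )
    return response, "llm"
```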
- FastAPI - Web framework
- Qdrant - Vector database
- Sentence Transformers - Embedding generation
- Ollama - Local LLM inference
MIT