A smart semantic cache for high-scale GenAI workloads.
Warning: v0.1.0 is currently in Alpha and is not yet production-ready. Significant improvements in stability, performance, and configuration are coming in v0.2.0.
In production, a large percentage of LLM requests are repetitive:
- RAG applications: Variations of the same employee questions
- AI Agents: Repeated reasoning steps or tool calls
- Support Bots: Thousands of similar customer queries
Every redundant request means extra token cost and extra latency.
Why pay your LLM provider multiple times for the same answer?
PromptCache is a lightweight middleware that sits between your application and your LLM provider. It uses semantic understanding to detect when a new prompt has the same intent as a previous one — and returns the cached result instantly.
| Metric | Without Cache | With PromptCache | Benefit |
|---|---|---|---|
| Cost per 1,000 Requests | ≈ $30 | ≈ $6 | Lower cost |
| Avg Latency | ~1.5s | ~300ms | Faster UX |
| Throughput | Limited by provider rate limits | Cache hits bypass provider limits | Better scale |
Exact numbers vary by model, usage patterns, and configuration, but the pattern holds across real workloads: semantic caching dramatically reduces cost and latency.
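Those figures are consistent with a simple back-of-the-envelope model in which cache hits cost (almost) nothing and return near-instantly, so spend and latency scale with the miss rate. A sketch, assuming an illustrative 80% hit rate and a $0.03 per-call price (neither is a measured figure):

```python
# Back-of-the-envelope model behind the table above.
# The hit rate and per-call price are illustrative assumptions, not measurements.
requests = 1_000
hit_rate = 0.80               # fraction of requests answered from the cache

price_per_llm_call = 0.03     # ≈ $30 per 1,000 uncached requests
llm_latency_s = 1.5           # average provider latency
hit_latency_s = 0.0           # a local cache hit is treated as near-instant

cost = requests * (1 - hit_rate) * price_per_llm_call
avg_latency_ms = 1000 * (hit_rate * hit_latency_s + (1 - hit_rate) * llm_latency_s)

print(f"cost per {requests} requests: ${cost:.2f}")   # $6.00
print(f"average latency: {avg_latency_ms:.0f} ms")    # 300 ms
```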
Naive semantic caches can be risky — they may return incorrect answers when prompts look similar but differ in intent.
PromptCache uses a two-stage verification strategy to ensure accuracy:
- High similarity → direct cache hit
- Low similarity → skip the cache and forward the request to the provider
- Gray zone → intent check using a small, cheap verification model
This ensures cached responses are semantically correct, not just “close enough”.
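As a rough sketch of that flow (illustrative only: the threshold values, types, and function names below are assumptions, not PromptCache's actual internals), the lookup reduces to a dual-threshold check, with the intent check reserved for the gray zone:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CachedAnswer:
    prompt: str
    response: str


# Illustrative thresholds; the real values are PromptCache configuration details.
HIGH_SIMILARITY = 0.92  # at or above: treat as the same intent
LOW_SIMILARITY = 0.75   # below: treat as a different intent


def decide(
    new_prompt: str,
    nearest: Optional[CachedAnswer],
    similarity: float,
    same_intent: Callable[[str, str], bool],
) -> Optional[str]:
    """Return a cached response, or None to forward the request to the provider."""
    if nearest is None or similarity < LOW_SIMILARITY:
        return None                      # clear miss: call the LLM
    if similarity >= HIGH_SIMILARITY:
        return nearest.response          # clear hit: serve from cache
    # Gray zone: a small, cheap model confirms the two prompts share intent
    if same_intent(new_prompt, nearest.prompt):
        return nearest.response
    return None
```

Because the cheap verifier only runs on the narrow band of ambiguous matches, its extra cost stays small compared to a full LLM call.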
PromptCache works as a drop-in replacement for the OpenAI API.
```bash
# Clone the repo
git clone https://github.com/messkan/prompt-cache.git
cd prompt-cache

# Run with Docker Compose
export OPENAI_API_KEY=your_key_here
docker-compose up -d
```

Simply change the `base_url` in your SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # Point to PromptCache
    api_key="sk-...",
)

# First request → goes to the LLM provider
client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum physics"}],
)

# Semantically similar request → served from PromptCache
client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "How does quantum physics work?"}],
)
```

No code changes. Just point your client to PromptCache.
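If you prefer not to modify code at all, recent versions of the official OpenAI Python SDK also read the base URL and key from the environment, so an existing application can be redirected with configuration alone (the URL below assumes the default port from the Docker Compose setup above):

```python
import os
from openai import OpenAI

# The OpenAI Python SDK (v1+) falls back to OPENAI_BASE_URL and OPENAI_API_KEY
# when base_url / api_key are not passed explicitly.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"  # PromptCache endpoint
os.environ["OPENAI_API_KEY"] = "sk-..."

client = OpenAI()  # no constructor arguments needed
```

In practice you would set these variables in your shell or deployment environment rather than in code.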
Built for speed, safety, and reliability:
- Pure Go implementation (high concurrency, minimal overhead)
- BadgerDB for fast embedded persistent storage
- In-memory caching for ultra-fast responses
- OpenAI-compatible API for seamless integration
Already available in v0.1.0:
- Docker setup
- In-memory & BadgerDB storage
- Smart semantic verification (dual-threshold + intent check)
- OpenAI API compatibility
On the roadmap for v0.2.0 and beyond:
- Core improvements: bug fixes and performance optimizations.
- Public API: Improve cache create/delete operations.
- Enhanced Configuration:
- Configurable "gray zone" fallback model (enable/disable env var).
- User-definable similarity thresholds with sensible defaults.
- Built-in support for Claude & Mistral APIs
- Clustered mode (Raft or gossip-based replication)
- Custom embedding backends (Ollama, local models)
- Rate-limiting & request shaping
- Web dashboard (hit rate, latency, cost metrics)
We are working hard to reach v1.0.0! If you find this project useful, please give it a ⭐️ on GitHub and consider contributing. Your support helps us ship v0.2.0 and v1.0.0 faster!
MIT License.
