The plug-and-play memory layer for smart, contextual agents
Memlayer adds persistent, intelligent memory to any LLM in just 3 lines of code, enabling agents that recall context across conversations, extract structured knowledge, and surface relevant information when it matters.
<100ms Fast Search • Noise-Aware Memory Gate • Multi-Tier Retrieval Modes • 100% Local • Zero Config
- Features
- Quick Start
- Key Concepts
- Memory Modes
- Search Tiers
- Providers
- Advanced Features
- Examples
- Performance
- Documentation
- Contributing
- Universal LLM Support: Works with OpenAI, Claude, Gemini, Ollama, and LM Studio models
- Plug-and-play: Install with `pip install memlayer` and get started in minutes — minimal setup required
- Intelligent Memory Filtering: Three operation modes (LOCAL/ONLINE/LIGHTWEIGHT) automatically filter important information
- Hybrid Search: Combines vector similarity + knowledge graph traversal for accurate retrieval
- Three Search Tiers: Fast (<100ms), Balanced (<500ms), Deep (<2s) optimized for different use cases
- Knowledge Graph: Automatically extracts entities, relationships, and facts from conversations
- Proactive Reminders: Schedule tasks and get automatic reminders when they're due
- Built-in Observability: Trace every search operation with detailed performance metrics
- Flexible Storage: ChromaDB (vector) + NetworkX (graph) or graph-only mode
- Production Ready: Serverless-friendly with fast cold starts using online mode
pip install memlayer

from memlayer import OpenAI
# Initialize with memory capabilities
client = OpenAI(
model="gpt-4.1-mini",
storage_path="./memories",
user_id="user_123"
)
# Store information automatically
client.chat([
{"role": "user", "content": "My name is Alice and I work at TechCorp"}
])
# Retrieve information automatically (no manual prompting needed!)
response = client.chat([
{"role": "user", "content": "Where do I work?"}
])
# Response: "You work at TechCorp."

That's it! Memlayer automatically:
- ✅ Filters salient information using ML-based classification
- ✅ Extracts structured facts, entities, and relationships
- ✅ Stores memories in hybrid vector + graph storage
- ✅ Retrieves relevant context for each query
- ✅ Injects memories seamlessly into LLM context
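Because memories live under `storage_path`, they persist across sessions. A minimal sketch, assuming the same `storage_path` and `user_id` are reused in a later run:

```python
from memlayer import OpenAI

# A fresh client pointed at the same storage picks up earlier memories.
client = OpenAI(
    model="gpt-4.1-mini",
    storage_path="./memories",  # same path as the first session
    user_id="user_123",         # same user as the first session
)

response = client.chat([
    {"role": "user", "content": "What's my name and where do I work?"}
])
# Expected to recall Alice and TechCorp from the earlier conversation.
```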
Not all conversation content is worth storing. Memlayer uses salience gates to intelligently filter:
- ✅ Save: Facts, preferences, user info, decisions, relationships
- ❌ Skip: Greetings, acknowledgments, filler words, meta-conversation
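A rough sketch of the gate in action, reusing the client from the Quick Start (illustrative only; actual decisions depend on the configured mode and threshold):

```python
# Small talk: expected to be skipped by the salience gate
client.chat([{"role": "user", "content": "Hey, how's it going?"}])

# User-specific facts and preferences: expected to be saved
client.chat([{
    "role": "user",
    "content": "I'm allergic to peanuts and I prefer morning meetings."
}])
```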
Memories are stored in two complementary systems:
- Vector Store (ChromaDB): Semantic similarity search for facts
- Knowledge Graph (NetworkX): Entity relationships and structured knowledge
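A small sketch of the two kinds of questions this combination supports; routing is handled internally, and graph traversal is only engaged on the deep search tier (see Search Tiers below):

```python
# Fact recall, served by semantic similarity search over the vector store
client.chat([{"role": "user", "content": "What company do I work for?"}])

# Relationship-style question; on the deep tier the knowledge graph is
# traversed to pull in connected entities as well
client.chat([{
    "role": "user",
    "content": "Use deep search: How is Alice connected to TechCorp?"
}])
```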
After each conversation, background threads:
- Extract facts, entities, and relationships using LLM
- Store facts in vector database with embeddings
- Build knowledge graph with entities and relationships
- Index everything for fast retrieval
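The kind of structured output this pipeline stores can be previewed synchronously with `analyze_and_extract_knowledge` (covered again under Advanced Features); a quick sketch:

```python
kg = client.analyze_and_extract_knowledge(
    "Bob joined TechCorp in March and reports to Alice."
)
# Facts are embedded into the vector store; entities and relationships
# become nodes and edges in the knowledge graph.
print(kg["facts"])
print(kg["entities"])
print(kg["relationships"])
```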
Memlayer offers three modes that control both memory filtering (salience) and storage:
client = Ollama(salience_mode="local")

- Filtering: Sentence-transformers ML model (high accuracy)
- Storage: ChromaDB (vector) + NetworkX (graph)
- Startup: ~10s (model loading)
- Best for: High-volume production, offline apps
- Cost: Free (no API calls)
client = OpenAI(salience_mode="online")

- Filtering: OpenAI embeddings API (high accuracy)
- Storage: ChromaDB (vector) + NetworkX (graph)
- Startup: ~2s (no model loading!)
- Best for: Serverless, cloud functions, fast cold starts
- Cost: ~$0.0001 per operation
client = OpenAI(salience_mode="lightweight")

- Filtering: Keyword-based (medium accuracy)
- Storage: NetworkX only (no vector storage!)
- Startup: <1s (instant)
- Best for: Prototyping, testing, low-resource environments
- Cost: Free (no embeddings at all)
Performance Comparison:

| Mode        | Startup Time | Accuracy | API Cost   | Storage      |
|-------------|--------------|----------|------------|--------------|
| LOCAL       | ~10s         | High     | Free       | Vector+Graph |
| ONLINE      | ~2s          | High     | $0.0001/op | Vector+Graph |
| LIGHTWEIGHT | <1s          | Medium   | Free       | Graph-only   |
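One way to pick a mode at deployment time is to read it from the environment; a minimal sketch (the `MEMLAYER_MODE` variable name is just an example, not part of the library):

```python
import os
from memlayer import OpenAI

# Default to the fast-cold-start online mode; override per environment,
# e.g. MEMLAYER_MODE=local for an offline, high-volume deployment.
mode = os.getenv("MEMLAYER_MODE", "online")

client = OpenAI(
    model="gpt-4.1-mini",
    storage_path="./memories",
    user_id="user_123",
    salience_mode=mode,
)
```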
Memlayer provides three search tiers optimized for different latency requirements:
# Automatic - LLM chooses based on query complexity
response = client.chat([{"role": "user", "content": "What's my name?"}])

- 2 vector search results
- No graph traversal
- Perfect for: Real-time chat, simple factual recall
# Automatic - handles most queries well
response = client.chat([{"role": "user", "content": "Tell me about my projects"}])

- 5 vector search results
- No graph traversal
- Perfect for: General conversation, most use cases
# Explicit request or auto-detected for complex queries
response = client.chat([{
"role": "user",
"content": "Use deep search: Tell me everything about Alice and her relationships"
}])

- 10 vector search results
- Graph traversal enabled (entity extraction + 1-hop relationships)
- Perfect for: Research, "tell me everything", multi-hop reasoning
Memlayer works with all major LLM providers:
from memlayer import OpenAI
client = OpenAI(
model="gpt-4.1-mini", # or gpt-4.1, gpt-5, etc.
storage_path="./memories",
user_id="user_123"
)

from memlayer import Claude
client = Claude(
model="claude-4-sonnet",
storage_path="./memories",
user_id="user_123"
)

from memlayer import Gemini
client = Gemini(
model="gemini-2.5-flash",
storage_path="./memories",
user_id="user_123"
)

from memlayer import Ollama
client = Ollama(
host="http://localhost:11434",
model="qwen3:14b", # or llama3.2, mistral, etc.
storage_path="./memories",
user_id="user_123",
salience_mode="local" # Run 100% offline!
)

from memlayer import LMStudio
client = LMStudio(
host="http://localhost:11434/v1",
model="qwen/qwen3-14b",
storage_path="./memories",
user_id="user_123",
)

All providers share the same API - switch between them seamlessly!
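Because the constructors take the same memory-related arguments, switching providers can be reduced to a small factory; a sketch (the `make_client` helper is illustrative, not part of Memlayer):

```python
from memlayer import OpenAI, Claude, Gemini, Ollama

# Hypothetical helper: pick a provider by name, keep memory settings identical.
def make_client(provider: str, storage_path: str = "./memories", user_id: str = "user_123"):
    common = {"storage_path": storage_path, "user_id": user_id}
    if provider == "openai":
        return OpenAI(model="gpt-4.1-mini", **common)
    if provider == "claude":
        return Claude(model="claude-4-sonnet", **common)
    if provider == "gemini":
        return Gemini(model="gemini-2.5-flash", **common)
    if provider == "ollama":
        return Ollama(host="http://localhost:11434", model="qwen3:14b",
                      salience_mode="local", **common)
    raise ValueError(f"Unknown provider: {provider}")

client = make_client("ollama")
response = client.chat([{"role": "user", "content": "What's my name?"}])
```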
# User schedules a task
client.chat([{
"role": "user",
"content": "Remind me to submit the report next Friday at 9am"
}])
# Later, when the task is due, Memlayer automatically injects it
response = client.chat([{"role": "user", "content": "What should I do today?"}])
# Response includes: "Don't forget to submit the report - it's due today at 9am!"

response = client.chat(messages)
# Inspect search performance
if client.last_trace:
print(f"Search tier: {client.last_trace.events[0].metadata.get('tier')}")
print(f"Total time: {client.last_trace.total_duration_ms}ms")
for event in client.last_trace.events:
print(f" {event.event_type}: {event.duration_ms}ms")# Control memory filtering strictness
client = OpenAI(
salience_threshold=-0.1 # Permissive (saves more)
# salience_threshold=0.0 # Balanced (default)
# salience_threshold=0.1 # Strict (saves less)
)

# Manually extract structured knowledge
kg = client.analyze_and_extract_knowledge(
"Alice leads Project Phoenix in the London office. The project uses Python and React."
)
print(kg["facts"]) # ["Alice leads Project Phoenix", ...]
print(kg["entities"]) # [{"name": "Alice", "type": "Person"}, ...]
print(kg["relationships"]) # [{"subject": "Alice", "predicate": "leads", "object": "Project Phoenix"}]Explore the examples/ directory for comprehensive examples:
# Getting started
python examples/01_basics/getting_started.py

# Try all three search tiers
python examples/02_search_tiers/fast_tier_example.py
python examples/02_search_tiers/balanced_tier_example.py
python examples/02_search_tiers/deep_tier_example.py
# Compare them side-by-side
python examples/02_search_tiers/tier_comparison.py

# Proactive task reminders
python examples/03_features/task_reminders.py
# Knowledge graph visualization
python examples/03_features/test_knowledge_graph.py

# Compare salience modes
python examples/04_benchmarks/compare_salience_modes.py

# Try different LLM providers
python examples/05_providers/openai_example.py
python examples/05_providers/claude_example.py
python examples/05_providers/gemini_example.py
python examples/05_providers/ollama_example.py

See examples/README.md for full documentation.
Real-world startup times from benchmarks:
| Mode        | First Use | Savings       | Trade-off          |
|-------------|-----------|---------------|--------------------|
| LIGHTWEIGHT | ~5s       | No embeddings | No semantic search |
| ONLINE      | ~5s       | 5s faster     | Small API cost     |
| LOCAL       | ~10s      | No API cost   | 11s model loading  |
Typical query latencies:
| Tier     | Latency    | Vector Results | Graph | Use Case         |
|----------|------------|----------------|-------|------------------|
| Fast     | 50-150ms   | 2              | No    | Real-time chat   |
| Balanced | 200-600ms  | 5              | No    | General use      |
| Deep     | 800-2500ms | 10             | Yes   | Research queries |
Background processing (non-blocking):
| Step                 | Time  | Async                   |
|----------------------|-------|-------------------------|
| Salience filtering   | ~10ms | Yes                     |
| Knowledge extraction | ~1-2s | Yes (background thread) |
| Vector storage       | ~50ms | Yes                     |
| Graph storage        | ~20ms | Yes                     |
| Total (non-blocking) | ~0ms  | User doesn't wait!      |
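A quick way to see the non-blocking behavior is to time a `chat()` call; the elapsed time should roughly track the LLM response alone, since extraction and storage continue on a background thread (a sketch, actual numbers vary):

```python
import time

start = time.perf_counter()
client.chat([{"role": "user", "content": "I moved to Berlin last month."}])
elapsed_ms = (time.perf_counter() - start) * 1000

# Knowledge extraction (~1-2s) and storage happen after this returns,
# so the measured latency is dominated by the LLM call itself.
print(f"chat() returned in {elapsed_ms:.0f}ms")
```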
- Basics Overview - Architecture, components, and how Memlayer works
- Quickstart Guide - Get up and running in 5 minutes
- Streaming Mode - Complete guide to streaming responses
- API Reference - Complete API documentation with all methods and parameters
- Providers Overview - Compare all providers, choose the right one
- Ollama Setup - Run completely offline with local models
- OpenAI - OpenAI configuration
- Claude - Anthropic Claude setup
- Gemini - Google Gemini configuration
- Examples Index - Comprehensive examples by category
- Provider Examples - Provider comparison and usage
The project exposes several runtime/configuration knobs you can tune to match latency, cost, and accuracy trade-offs. Detailed docs for each area live in the docs/ folder:
- docs/tuning/operation_mode.md — Architecture deep dive: how to choose between `online`, `local`, and `lightweight` modes, performance implications, storage composition, and deployment strategies.
- docs/tuning/intervals.md — Scheduler and curation interval configuration (`scheduler_interval_seconds`, `curation_interval_seconds`) and practical guidance.
- docs/tuning/salience_threshold.md — How to adjust `salience_threshold` and the expected behavior.
- docs/services/consolidation.md — Consolidation pipeline internals and how to call it programmatically (including `update_from_text`).
- docs/services/curation.md — How memory curation works, archiving rules, and how to run/stop the curation service.
- docs/storage/chroma.md — ChromaDB notes: metadata types, connection handling, and Windows file-lock guidance.
- docs/storage/networkx.md — Knowledge graph persistence, expected node schemas, and backup/restore tips.
Use these docs when tuning for production; each provides detailed, practical guidance.
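As a hedged illustration only (the knob names come from the docs above, but passing them to the client constructor here is an assumption; check docs/tuning/ for where each is actually configured):

```python
from memlayer import OpenAI

# Assumed keyword arguments and example values; verify the exact
# configuration surface in the linked docs before relying on this.
client = OpenAI(
    model="gpt-4.1-mini",
    storage_path="./memories",
    user_id="user_123",
    salience_mode="online",            # docs/tuning/operation_mode.md
    salience_threshold=0.0,            # docs/tuning/salience_threshold.md
    scheduler_interval_seconds=60,     # docs/tuning/intervals.md
    curation_interval_seconds=3600,    # docs/tuning/intervals.md
)
```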
# Clone repository
git clone https://github.com/divagr18/memlayer.git
cd memlayer
# Install dependencies
pip install -e .
# Run tests
python -m pytest tests/
# Run examples
python examples/01_basics/getting_started.py

memlayer/
├── memlayer/ # Core library
│ ├── wrappers/ # LLM provider wrappers
│ ├── storage/ # Storage backends (ChromaDB, NetworkX)
│ ├── services.py # Search & consolidation services
│ ├── ml_gate.py # Salience filtering
│ └── embedding_models.py # Embedding model implementations
├── examples/ # Organized examples by category
│ ├── 01_basics/
│ ├── 02_search_tiers/
│ ├── 03_features/
│ ├── 04_benchmarks/
│ └── 05_providers/
├── tests/ # Tests and benchmarks
├── docs/ # Documentation
└── README.md # This file
Contributions are welcome! Here's how you can help:
- Report bugs - Open an issue with reproduction steps
- Suggest features - Share your use case and requirements
- Submit PRs - Fix bugs, add features, improve docs
- Share examples - Show us what you've built!
Please keep PRs focused and include tests for new features.
- Author/Maintainer: Divyansh Agrawal
- Email: [email protected]
- GitHub: divagr18
- Issues: Report bugs or request features via GitHub Issues
For security vulnerabilities, please email directly with SECURITY in the subject line instead of opening a public issue.
MIT License - see LICENSE for details.
- Built with ChromaDB for vector storage
- Uses NetworkX for knowledge graph operations
- Powered by sentence-transformers for local embeddings
- Supports OpenAI, Anthropic, Google Gemini, and Ollama
Made with ❤️ for the AI community
Give your LLMs memory. Try Memlayer today!
