
Semantic Memory Kit


Lightweight semantic search over your AI agent's memory files. No vector database. No API calls. Runs on CPU.

Packaged Pro version available: download on Gumroad (€29).

Main site: https://datis-agent.com

Free vs Pro

This repository is the free core version:

  • semantic_memory.py library
  • Base README + one basic example
  • Local semantic retrieval with no vector DB

The Gumroad Pro pack includes everything above, plus:

  • 3 ready-to-use integrations (Anthropic, OpenAI, Ollama)
  • Reindex-on-change script
  • Memory dedupe script
  • Agent starter template
  • Prompt recipes + troubleshooting guide

If you just want to learn or build a minimal setup, this free repo is enough. If you want to implement faster in real agents with less trial-and-error, use Pro.

The Problem

Most AI agents treat memory as a file you append to and eventually load into context. This fails in two ways:

  1. Too much context — loading everything hits token limits and costs money
  2. Keyword search misses intent — searching for "payment setup" won't find "configured Stripe for billing"

The Solution

Embed your memory files locally using all-MiniLM-L6-v2 (22MB, runs on CPU). At query time, encode the query and retrieve the most relevant chunks by cosine similarity. Because retrieval is semantic rather than keyword-based, "remote access" matches "VPN tunneling" even without shared keywords.

from semantic_memory import SemanticMemory

mem = SemanticMemory("~/.agent/memory")
mem.index()

results = mem.query("Stripe payment integration", top_k=3)
# → [0.847] stripe_notes.md: Set up Stripe webhook handler. Use idempotency keys...
# → [0.731] payments.md: Stripe requires HTTPS in live mode...
# → [0.612] decisions.md: Chose Stripe over PayPal due to better API documentation...

Install

pip install sentence-transformers numpy

Then copy semantic_memory.py into your project.

Usage

Index your memory files

from semantic_memory import SemanticMemory

mem = SemanticMemory("~/.agent/memory")
mem.index()
# → Indexing 205 chunks from 12 files...
# → Index built. 205 chunks ready.

Supports .md, .txt, .json files. Index is cached to .semantic_index.json — no re-indexing unless files change.
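
How the cache decides whether to rebuild is an implementation detail of semantic_memory.py; a minimal sketch of the idea, assuming simple modification-time checks (hypothetical helper, not the library's API):

# Hypothetical change check (not the library's actual internals): rebuild the
# index when any memory file is newer than the cached .semantic_index.json.
from pathlib import Path

def index_is_stale(memory_dir: str, cache_name: str = ".semantic_index.json") -> bool:
    root = Path(memory_dir).expanduser()
    cache = root / cache_name
    if not cache.exists():
        return True
    built_at = cache.stat().st_mtime
    for path in root.rglob("*"):
        if path.suffix in {".md", ".txt", ".json"} and path.name != cache_name:
            if path.stat().st_mtime > built_at:
                return True   # a memory file changed after the index was built
    return False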

Query by meaning

results = mem.query("what did we decide about the database?", top_k=5)
for r in results:
    print(f"[{r.score:.3f}] {r.source}: {r.text[:100]}")

Get formatted context for prompt injection

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

context = mem.query_and_format("current project priorities", top_k=4)

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=f"You are an assistant.\n\nMemory:\n{context}",
    messages=[{"role": "user", "content": user_message}]
)

Full example with Ollama

import ollama
from semantic_memory import SemanticMemory

mem = SemanticMemory("~/.agent/memory")
mem.index()

def chat(message):
    context = mem.query_and_format(message, top_k=4)
    return ollama.chat(
        model="mistral",
        messages=[
            {"role": "system", "content": f"Memory:\n{context}"},
            {"role": "user", "content": message}
        ]
    )["message"]["content"]

CLI

python semantic_memory.py ~/.agent/memory --index
python semantic_memory.py ~/.agent/memory "what is the status of the API project?"

How It Works

Memory files (.md / .txt / .json)
        │
        ▼
  ┌───────────┐
  │  Chunker  │  Split into ~400-char overlapping segments
  └─────┬─────┘
        │
        ▼
  ┌───────────┐
  │  Encoder  │  all-MiniLM-L6-v2 (22MB, CPU-only)
  └─────┬─────┘
        │
        ▼
  ┌──────────────────────┐
  │  .semantic_index.json│  Chunks + embeddings stored locally
  └─────┬────────────────┘
        │
  Query │  encode → cosine similarity → top-K
        ▼
  Ranked relevant chunks (with score + source file)
        │
        ▼
  Inject into model context

Benchmark

Measured on MacBook Air M4, CPU-only, no GPU:

Operation                     Result
Index build (2,000 chunks)    13.2s
Avg query time                42ms
Min query time                8ms
Index cache load              ~0.4s
Model size                    22MB
RAM while loaded              ~180MB

When to Use This vs a Vector Database

Use this when:

  • Fewer than ~10,000 memory chunks
  • Zero infrastructure — no server, no Docker, no account
  • Privacy matters (all data stays local)
  • Single-process agent

Use Chroma / Pinecone / pgvector when:

  • More than 10,000 chunks
  • Multiple processes need concurrent access
  • Sub-10ms latency required at scale

Requirements

sentence-transformers>=2.2.0
numpy>=1.21.0

Python 3.8+. No other dependencies.

License

MIT


Extended examples including multi-agent setups, custom indexing patterns, and OpenAI function calling integration: Gumroad
