
Semantic Memory Kit


Lightweight semantic search over your AI agent's memory files. No vector database. No API calls. Runs on CPU.

Packaged Pro version available: download on Gumroad (€29).

Main site: https://datis-agent.com

Free vs Pro

This repository is the free core version:

  • semantic_memory.py library
  • Base README + one basic example
  • Local semantic retrieval with no vector DB

The Gumroad Pro pack includes everything above, plus:

  • 3 ready-to-use integrations (Anthropic, OpenAI, Ollama)
  • Reindex-on-change script
  • Memory dedupe script
  • Agent starter template
  • Prompt recipes + troubleshooting guide

If you just want to learn or build a minimal setup, this free repo is enough. If you want to implement faster in real agents with less trial-and-error, use Pro.

The Problem

Most AI agents treat memory as a file you append to and eventually load into context. This fails in two ways:

  1. Too much context — loading everything hits token limits and costs money
  2. Keyword search misses intent — searching for "payment setup" won't find "configured Stripe for billing"

The Solution

Embed your memory files locally using all-MiniLM-L6-v2 (22MB, runs on CPU). At query time, encode the query and retrieve the most relevant chunks by cosine similarity. Because retrieval is semantic rather than keyword-based, "remote access" matches "VPN tunneling" even without shared keywords.

from semantic_memory import SemanticMemory

mem = SemanticMemory("~/.agent/memory")
mem.index()

results = mem.query("Stripe payment integration", top_k=3)
# → [0.847] stripe_notes.md: Set up Stripe webhook handler. Use idempotency keys...
# → [0.731] payments.md: Stripe requires HTTPS in live mode...
# → [0.612] decisions.md: Chose Stripe over PayPal due to better API documentation...

Install

pip install sentence-transformers numpy

Then copy semantic_memory.py into your project.

Usage

Index your memory files

from semantic_memory import SemanticMemory

mem = SemanticMemory("~/.agent/memory")
mem.index()
# → Indexing 205 chunks from 12 files...
# → Index built. 205 chunks ready.

Supports .md, .txt, .json files. Index is cached to .semantic_index.json — no re-indexing unless files change.
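
How the cache decides whether to rebuild is an implementation detail of semantic_memory.py; a minimal sketch of the idea, assuming simple modification-time checks (hypothetical helper, not the library's API):

# Hypothetical change check (not the library's actual internals): rebuild the
# index when any memory file is newer than the cached .semantic_index.json.
from pathlib import Path

def index_is_stale(memory_dir: str, cache_name: str = ".semantic_index.json") -> bool:
    root = Path(memory_dir).expanduser()
    cache = root / cache_name
    if not cache.exists():
        return True
    built_at = cache.stat().st_mtime
    for path in root.rglob("*"):
        if path.suffix in {".md", ".txt", ".json"} and path.name != cache_name:
            if path.stat().st_mtime > built_at:
                return True   # a memory file changed after the index was built
    return False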

Query by meaning

results = mem.query("what did we decide about the database?", top_k=5)
for r in results:
    print(f"[{r.score:.3f}] {r.source}: {r.text[:100]}")

Get formatted context for prompt injection

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

context = mem.query_and_format("current project priorities", top_k=4)

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=f"You are an assistant.\n\nMemory:\n{context}",
    messages=[{"role": "user", "content": user_message}]
)

Full example with Ollama

import ollama
from semantic_memory import SemanticMemory

mem = SemanticMemory("~/.agent/memory")
mem.index()

def chat(message):
    context = mem.query_and_format(message, top_k=4)
    return ollama.chat(
        model="mistral",
        messages=[
            {"role": "system", "content": f"Memory:\n{context}"},
            {"role": "user", "content": message}
        ]
    )["message"]["content"]

CLI

python semantic_memory.py ~/.agent/memory --index
python semantic_memory.py ~/.agent/memory "what is the status of the API project?"

How It Works

Memory files (.md / .txt / .json)
        │
        ▼
  ┌───────────┐
  │  Chunker  │  Split into ~400-char overlapping segments
  └─────┬─────┘
        │
        ▼
  ┌───────────┐
  │  Encoder  │  all-MiniLM-L6-v2 (22MB, CPU-only)
  └─────┬─────┘
        │
        ▼
  ┌──────────────────────┐
  │  .semantic_index.json│  Chunks + embeddings stored locally
  └─────┬────────────────┘
        │
  Query │  encode → cosine similarity → top-K
        ▼
  Ranked relevant chunks (with score + source file)
        │
        ▼
  Inject into model context

Benchmark

Measured on MacBook Air M4, CPU-only, no GPU:

Operation                     Result
Index build (2,000 chunks)    13.2s
Avg query time                42ms
Min query time                8ms
Index cache load              ~0.4s
Model size                    22MB
RAM while loaded              ~180MB

When to Use This vs a Vector Database

Use this when:

  • Fewer than ~10,000 memory chunks
  • Zero infrastructure — no server, no Docker, no account
  • Privacy matters (all data stays local)
  • Single-process agent

Use Chroma / Pinecone / pgvector when:

  • More than 10,000 chunks
  • Multiple processes need concurrent access
  • Sub-10ms latency required at scale

Requirements

sentence-transformers>=2.2.0
numpy>=1.21.0

Python 3.8+. No other dependencies.

License

MIT


Extended examples including multi-agent setups, custom indexing patterns, and OpenAI function calling integration: Gumroad
