This repository investigates how external memory modules influence transformer performance on tasks that require longer context retention.
Standard transformers process sequences through attention mechanisms that scale quadratically with sequence length. As sequences grow, the model's ability to maintain information from earlier tokens degrades. This becomes problematic for tasks like bracket matching, copying patterns, or tracking repeated elements where early tokens must influence later predictions.
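The quadratic cost is easy to see directly: each attention head compares every query against every key, producing a score matrix with one entry per token pair. A minimal illustration (not code from this repository):

```python
import numpy as np

def attention_score_entries(seq_len: int, d_model: int = 64) -> int:
    """Number of entries in one head's attention score matrix."""
    q = np.random.randn(seq_len, d_model)  # queries, one per token
    k = np.random.randn(seq_len, d_model)  # keys, one per token
    scores = q @ k.T                       # shape: (seq_len, seq_len)
    return scores.size

# Doubling the sequence length quadruples the score matrix:
print(attention_score_entries(128))  # 16384
print(attention_score_entries(256))  # 65536
```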
External memory provides a separate storage mechanism that can retain information across longer distances without the same computational constraints. The question is whether this actually improves performance on tasks that require long-range dependencies.
The baseline transformer is a standard decoder-only architecture with embedding, positional encoding, transformer layers, and output projection.
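The four components listed above can be sketched in plain numpy. This is an illustrative stand-in with made-up hyperparameters, not the repo's actual implementation (the per-layer transform below replaces a full attention block for brevity):

```python
import numpy as np

class BaselineTransformerSketch:
    """Illustrative decoder skeleton: embedding -> positional
    encoding -> stacked layers -> output projection."""

    def __init__(self, vocab_size=100, d_model=32, n_layers=2, max_len=512):
        rng = np.random.default_rng(0)
        self.embed = rng.normal(size=(vocab_size, d_model))
        self.pos = self._sinusoidal(max_len, d_model)
        # Stand-ins for full transformer layers (attention + MLP).
        self.layers = [rng.normal(scale=0.02, size=(d_model, d_model))
                       for _ in range(n_layers)]
        self.out_proj = rng.normal(scale=0.02, size=(d_model, vocab_size))

    @staticmethod
    def _sinusoidal(max_len, d_model):
        pos = np.arange(max_len)[:, None]
        i = np.arange(d_model)[None, :]
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

    def forward(self, tokens):
        h = self.embed[tokens] + self.pos[: len(tokens)]
        for w in self.layers:
            h = np.tanh(h @ w)       # simplified layer transform
        return h @ self.out_proj     # logits over the vocabulary
```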
The memory-augmented model wraps the baseline transformer and adds an external key-value memory store. During forward passes, the model writes token representations and hidden states to memory, then retrieves relevant past states when processing later tokens. Retrieval uses nearest-key matching based on attention queries.
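The write/retrieve cycle described above can be sketched as a simple key-value store with nearest-key (maximum dot-product) lookup. Class and method names here are hypothetical, not the repo's API:

```python
import numpy as np

class KeyValueMemory:
    """External key-value memory with nearest-key retrieval."""

    def __init__(self, d_model: int):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def write(self, key: np.ndarray, value: np.ndarray) -> None:
        # Store one (key, value) pair per processed token.
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def read(self, query: np.ndarray) -> np.ndarray:
        # Nearest-key matching: return the value whose key has the
        # highest dot-product similarity with the attention query.
        if len(self.keys) == 0:
            return np.zeros_like(query)
        scores = self.keys @ query
        return self.values[int(np.argmax(scores))]
```

At each step the wrapper would `write` the current hidden state, then `read` with the attention query before producing the next prediction.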
Generate synthetic datasets with long-range dependencies:

```bash
python data/generate_sequences.py
```

Train the baseline model:

```bash
python train/train_baseline.py
```

Train the memory-augmented model:

```bash
python train/train_memory.py
```

Compare results:

```bash
python benchmarks/compare.py
```

Generate performance plots:

```bash
python plots/plot_memory.py
```

At small scale, the memory model typically shows modest improvements in effective context length. The baseline model's performance degrades more quickly as sequence length increases, while the memory model maintains lower loss for longer sequences. The improvement is most noticeable on tasks with explicit long-range dependencies like bracket matching or pattern copying.
The memory module adds computational overhead, so the benefit must outweigh this cost. On simple random token sequences, the difference may be minimal. The synthetic datasets in this repository are designed to highlight cases where memory provides measurable advantages.
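A bracket-matching task of the kind the datasets target can be sketched as follows. This is a hypothetical generator for illustration, not `data/generate_sequences.py` itself:

```python
import random

def make_bracket_sequence(depth: int, filler_len: int, seed: int = 0) -> str:
    """Nested brackets separated by filler tokens: predicting each
    closing bracket requires remembering an opener far earlier in
    the sequence, which is exactly where memory should help."""
    rng = random.Random(seed)
    filler = "".join(rng.choice("abcd") for _ in range(filler_len))
    return "(" * depth + filler + ")" * depth

seq = make_bracket_sequence(depth=3, filler_len=8)
# The final ')' must match a '(' depth + filler_len - 1 tokens earlier.
assert seq.count("(") == seq.count(")") == 3
```

Increasing `filler_len` stretches the dependency distance, which is how a dataset like this can probe effective context length.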