memory-architectures

This repository investigates how a simple external memory store influences the performance of small transformer models on tasks that require longer context retention.

Why Memory Matters

Standard transformers process sequences through attention, whose compute and memory cost scale quadratically with sequence length. As sequences grow, the model's ability to retain information from earlier tokens degrades. This becomes problematic for tasks like bracket matching, pattern copying, or tracking repeated elements, where early tokens must influence later predictions.
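
As a rough illustration of the scaling (not code from this repository): the attention score matrix for a sequence of length n has n × n entries, so doubling the length quadruples the cost.

import torch

# Attention scores form an (n, n) matrix, so cost grows quadratically with n.
n, d = 512, 64
q = torch.randn(n, d)   # queries
k = torch.randn(n, d)   # keys
scores = q @ k.T        # shape (512, 512): 262,144 entries
# Doubling n to 1024 yields a (1024, 1024) matrix: roughly 1.05M entries.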

External memory provides a separate storage mechanism that can retain information across longer distances without the same computational constraints. The question is whether this actually improves performance on tasks that require long-range dependencies.

Model Differences

The baseline transformer is a standard decoder-only architecture with embedding, positional encoding, transformer layers, and output projection.
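
A minimal sketch of such a baseline, assuming PyTorch; the module and hyperparameter names here are illustrative rather than the repository's actual definitions:

import torch
import torch.nn as nn

class BaselineTransformer(nn.Module):
    """Decoder-only LM: embedding -> positional encoding -> transformer layers -> output projection."""

    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask keeps the stack decoder-only: each position attends only to earlier ones.
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device), diagonal=1
        )
        h = self.layers(h, mask=mask)
        return self.out(h)  # (batch, seq, vocab) logits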

The memory-augmented model wraps the baseline transformer and adds an external key-value memory store. During forward passes, the model writes token representations and hidden states to memory, then retrieves relevant past states when processing later tokens. Retrieval uses nearest-key matching based on attention queries.
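
A rough sketch of how a key-value store with nearest-key retrieval can work; the class name, capacity, and top-k retrieval below are assumptions for illustration, not the repository's actual interface:

import torch
import torch.nn.functional as F

class KeyValueMemory:
    """External store: write (key, value) pairs, retrieve values whose keys best match a query."""

    def __init__(self, d_model, capacity=1024):
        self.keys = torch.zeros(0, d_model)
        self.values = torch.zeros(0, d_model)
        self.capacity = capacity

    def write(self, keys, values):
        # Append new entries, dropping the oldest ones once capacity is exceeded.
        self.keys = torch.cat([self.keys, keys])[-self.capacity:]
        self.values = torch.cat([self.values, values])[-self.capacity:]

    def read(self, queries, top_k=4):
        # Nearest-key matching: attend over the top-k stored keys most similar to each query.
        if self.keys.size(0) == 0:
            return torch.zeros_like(queries)
        scores = queries @ self.keys.T                 # (n_queries, n_stored)
        k = min(top_k, self.keys.size(0))
        top_scores, idx = scores.topk(k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)        # (n_queries, k)
        return (weights.unsqueeze(-1) * self.values[idx]).sum(dim=-2)

In a wrapper like the one described above, the retrieved vectors would typically be merged back into the transformer's hidden states (for example by addition or concatenation) before the output projection.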

Usage

Generate synthetic datasets with long-range dependencies:

python data/generate_sequences.py
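
As a toy example of the kind of long-range dependency these datasets target (a copy task; the actual script may format its data differently):

import random

def make_copy_example(pattern_len=8, gap_len=64, vocab=tuple(range(2, 12)), sep=1):
    """Emit a pattern, a long stretch of filler tokens, then the pattern again.

    The model must carry the pattern across the gap, which is exactly what the
    external memory is meant to help with.
    """
    pattern = [random.choice(vocab) for _ in range(pattern_len)]
    filler = [0] * gap_len
    return pattern + [sep] + filler + [sep] + pattern

print(make_copy_example(pattern_len=4, gap_len=8))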

Train the baseline model:

python train/train_baseline.py

Train the memory-augmented model:

python train/train_memory.py

Compare results:

python benchmarks/compare.py

Generate performance plots:

python plots/plot_memory.py

Expected Observations

At small scale, the memory model typically shows modest improvements in effective context length. The baseline model's performance degrades more quickly as sequence length increases, while the memory model maintains lower loss for longer sequences. The improvement is most noticeable on tasks with explicit long-range dependencies like bracket matching or pattern copying.

The memory module adds computational overhead, so the benefit must outweigh this cost. On simple random token sequences, the difference may be minimal. The synthetic datasets in this repository are designed to highlight cases where memory provides measurable advantages.
