This repository investigates how external memory modules influence transformer performance on tasks that require longer context retention.
Standard transformers process sequences through attention mechanisms that scale quadratically with sequence length. As sequences grow, the model's ability to maintain information from earlier tokens degrades. This becomes problematic for tasks like bracket matching, copying patterns, or tracking repeated elements where early tokens must influence later predictions.
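The quadratic cost is easy to see directly: each attention head compares every query against every key, producing a score matrix with one entry per token pair. A minimal illustration (not code from this repository):

```python
import numpy as np

def attention_score_entries(seq_len: int, d_model: int = 64) -> int:
    """Number of entries in one head's attention score matrix."""
    q = np.random.randn(seq_len, d_model)  # queries, one per token
    k = np.random.randn(seq_len, d_model)  # keys, one per token
    scores = q @ k.T                       # shape: (seq_len, seq_len)
    return scores.size

# Doubling the sequence length quadruples the score matrix:
print(attention_score_entries(128))  # 16384
print(attention_score_entries(256))  # 65536
```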
External memory provides a separate storage mechanism that can retain information across longer distances without the same computational constraints. The question is whether this actually improves performance on tasks that require long-range dependencies.
The baseline transformer is a standard decoder-only architecture with embedding, positional encoding, transformer layers, and output projection.
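The four components listed above can be sketched in plain numpy. This is an illustrative stand-in with made-up hyperparameters, not the repo's actual implementation (the per-layer transform below replaces a full attention block for brevity):

```python
import numpy as np

class BaselineTransformerSketch:
    """Illustrative decoder skeleton: embedding -> positional
    encoding -> stacked layers -> output projection."""

    def __init__(self, vocab_size=100, d_model=32, n_layers=2, max_len=512):
        rng = np.random.default_rng(0)
        self.embed = rng.normal(size=(vocab_size, d_model))
        self.pos = self._sinusoidal(max_len, d_model)
        # Stand-ins for full transformer layers (attention + MLP).
        self.layers = [rng.normal(scale=0.02, size=(d_model, d_model))
                       for _ in range(n_layers)]
        self.out_proj = rng.normal(scale=0.02, size=(d_model, vocab_size))

    @staticmethod
    def _sinusoidal(max_len, d_model):
        pos = np.arange(max_len)[:, None]
        i = np.arange(d_model)[None, :]
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

    def forward(self, tokens):
        h = self.embed[tokens] + self.pos[: len(tokens)]
        for w in self.layers:
            h = np.tanh(h @ w)       # simplified layer transform
        return h @ self.out_proj     # logits over the vocabulary
```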
The memory-augmented model wraps the baseline transformer and adds an external key-value memory store. During forward passes, the model writes token representations and hidden states to memory, then retrieves relevant past states when processing later tokens. Retrieval uses nearest-key matching based on attention queries.
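The write/retrieve cycle described above can be sketched as a simple key-value store with nearest-key (maximum dot-product) lookup. Class and method names here are hypothetical, not the repo's API:

```python
import numpy as np

class KeyValueMemory:
    """External key-value memory with nearest-key retrieval."""

    def __init__(self, d_model: int):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def write(self, key: np.ndarray, value: np.ndarray) -> None:
        # Store one (key, value) pair per processed token.
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])

    def read(self, query: np.ndarray) -> np.ndarray:
        # Nearest-key matching: return the value whose key has the
        # highest dot-product similarity with the attention query.
        if len(self.keys) == 0:
            return np.zeros_like(query)
        scores = self.keys @ query
        return self.values[int(np.argmax(scores))]
```

At each step the wrapper would `write` the current hidden state, then `read` with the attention query before producing the next prediction.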
Generate synthetic datasets with long-range dependencies:

```bash
python data/generate_sequences.py
```

Train the baseline model:

```bash
python train/train_baseline.py
```

Train the memory-augmented model:

```bash
python train/train_memory.py
```

Compare results:

```bash
python benchmarks/compare.py
```

Generate performance plots:

```bash
python plots/plot_memory.py
```

At small scale, the memory model typically shows modest improvements in effective context length. The baseline model's performance degrades more quickly as sequence length increases, while the memory model maintains lower loss for longer sequences. The improvement is most noticeable on tasks with explicit long-range dependencies like bracket matching or pattern copying.
The memory module adds computational overhead, so the benefit must outweigh this cost. On simple random token sequences, the difference may be minimal. The synthetic datasets in this repository are designed to highlight cases where memory provides measurable advantages.
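A bracket-matching task of the kind the datasets target can be sketched as follows. This is a hypothetical generator for illustration, not `data/generate_sequences.py` itself:

```python
import random

def make_bracket_sequence(depth: int, filler_len: int, seed: int = 0) -> str:
    """Nested brackets separated by filler tokens: predicting each
    closing bracket requires remembering an opener far earlier in
    the sequence, which is exactly where memory should help."""
    rng = random.Random(seed)
    filler = "".join(rng.choice("abcd") for _ in range(filler_len))
    return "(" * depth + filler + ")" * depth

seq = make_bracket_sequence(depth=3, filler_len=8)
# The final ')' must match a '(' depth + filler_len - 1 tokens earlier.
assert seq.count("(") == seq.count(")") == 3
```

Increasing `filler_len` stretches the dependency distance, which is how a dataset like this can probe effective context length.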