Independent PyTorch implementation and analysis of the Perceiver architecture
(Perceiver: General Perception with Iterative Attention, Jaegle et al., DeepMind, ICML 2021)
This repository contains a clean, minimal implementation of the Perceiver architecture, focusing on its core design principles: cross-attention bottlenecks, latent Transformer processing, and Fourier positional encodings. The goal is architectural understanding and analysis rather than full-scale reproduction of the original DeepMind results.
Self-attention in standard Transformers scales quadratically with the number of input elements, which makes it impractical for raw, high-dimensional inputs such as images, audio waveforms, point clouds, or multimodal data.
The Perceiver addresses this limitation by:
- Introducing a learned latent bottleneck
- Using asymmetric cross-attention from inputs to latents
- Decoupling model depth from input dimensionality
This project studies these ideas through a lightweight, interpretable implementation.
This repository provides:
- Independent Perceiver-style implementation in PyTorch
- Fourier positional encodings for 2D images
- Cross-attention + latent self-attention stack
- End-to-end training on CIFAR-10
- Training curves, qualitative predictions, and analysis
- A concise written report summarizing insights and limitations
This work is not a strict reproduction of the original DeepMind implementation. Instead, it is a deliberate, minimal implementation designed to validate and study the Perceiver’s architectural mechanisms.
Repository layout:
Perceiver-Architecture-Study/
│
├── data/ # CIFAR-10 dataset (auto-downloaded)
├── Code Implementation.ipynb # Main PyTorch implementation notebook
├── Code Implementation (PDF).pdf # Exported notebook (read-only)
├── Original Paper (PDF).pdf # Jaegle et al., ICML 2021
├── Reimplementation-Report.pdf # One-page technical analysis & insights
├── requirements.txt # Python dependencies
├── README.md # This file
└── LICENSE
Design choices in this study:
- Latent array: 128 latents × 256 dimensions
- Single cross-attention block (inputs → latents)
- 4 latent self-attention blocks
- Mean-pooled latents → classification head
- Optimizer: AdamW
- Dataset: CIFAR-10
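The design choices listed above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the notebook's code: the class name `TinyPerceiver`, the number of heads, the feed-forward width, and the latent initialization scale are assumptions, and positional encodings are omitted (in practice they would be concatenated to the input channels before cross-attention).

```python
import torch
import torch.nn as nn


class TinyPerceiver(nn.Module):
    """Hypothetical minimal Perceiver-style classifier (illustrative only)."""

    def __init__(self, input_dim, num_latents=128, latent_dim=256,
                 num_self_attn_blocks=4, num_heads=4, num_classes=10):
        super().__init__()
        # Learned latent array, shared across all examples in a batch.
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim) * 0.02)
        # Single cross-attention block: queries come from the latents,
        # keys and values come from the inputs (kdim/vdim = input channels).
        self.cross_norm = nn.LayerNorm(latent_dim)
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, num_heads, kdim=input_dim, vdim=input_dim,
            batch_first=True)
        # Latent self-attention stack; its cost does not depend on input size.
        self.latent_blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                latent_dim, num_heads, dim_feedforward=4 * latent_dim,
                dropout=0.0, batch_first=True)
            for _ in range(num_self_attn_blocks)
        ])
        self.head = nn.Linear(latent_dim, num_classes)

    def forward(self, x):
        # x: (batch, M, input_dim), one token per input element.
        z = self.latents.unsqueeze(0).expand(x.shape[0], -1, -1)
        attended, _ = self.cross_attn(self.cross_norm(z), x, x,
                                      need_weights=False)
        z = z + attended                      # residual update of the latents
        for block in self.latent_blocks:
            z = block(z)
        return self.head(z.mean(dim=1))       # mean-pool latents, then classify


model = TinyPerceiver(input_dim=3)
images = torch.randn(2, 3, 32, 32)            # a CIFAR-10-sized dummy batch
tokens = images.flatten(2).transpose(1, 2)    # (2, 1024, 3): one token per pixel
logits = model(tokens)
print(logits.shape)                           # torch.Size([2, 10])
```

Only the cross-attention block ever sees the 1024 input tokens; every later layer operates on the 128 latents.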
Despite its simplicity and lack of convolutional priors, the model:
- Trains stably
- Shows smooth loss curves
- Reaches ~52% validation accuracy in 10 epochs
This behavior aligns with claims in the original paper regarding scalability and generality.
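Let the input be a high-dimensional array of $M$ elements with $C$ channels each:
$$X \in \mathbb{R}^{M \times C}$$
Standard self-attention over $X$ has quadratic complexity:
$$\mathcal{O}(M^2)$$
which becomes infeasible for large inputs.
The Perceiver introduces a learned latent array:
$$Z \in \mathbb{R}^{N \times D}, \quad N \ll M$$
This latent space acts as an information bottleneck.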
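Asymmetric cross-attention projects input information into the latent space:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
with queries computed from the latents $Z$, keys and values computed from the inputs $X$, and $d$ the query/key dimension:
$Q = ZW_Q$, $K = XW_K$, $V = XW_V$
Resulting update:
$$Z' = \mathrm{Attention}(ZW_Q, XW_K, XW_V)$$
Complexity, for $M$ input elements and $N$ latents:
$$\mathcal{O}(MN)$$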
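To make the shapes concrete, here is a minimal single-head sketch of cross-attention in which the latents query the inputs. The function name `cross_attention`, the projection width `d = 64`, and the random weights are illustrative assumptions; the notebook may use multiple heads, residual connections, and layer normalization.

```python
import math
import torch


def cross_attention(z, x, w_q, w_k, w_v):
    """Single-head cross-attention: latents z attend to inputs x.

    z: (N, D) latent array, x: (M, C) input array.
    w_q: (D, d), w_k: (C, d), w_v: (C, d) projection matrices.
    """
    q = z @ w_q                                  # (N, d) queries from latents
    k = x @ w_k                                  # (M, d) keys from inputs
    v = x @ w_v                                  # (M, d) values from inputs
    scores = q @ k.T / math.sqrt(q.shape[-1])    # (N, M): N rows, not M
    weights = torch.softmax(scores, dim=-1)      # each row sums to 1 over inputs
    return weights @ v, weights                  # (N, d), (N, M)


torch.manual_seed(0)
M, C, N, D, d = 1024, 3, 128, 256, 64            # illustrative sizes
x, z = torch.randn(M, C), torch.randn(N, D)
w_q = torch.randn(D, d) / math.sqrt(D)
w_k = torch.randn(C, d) / math.sqrt(C)
w_v = torch.randn(C, d) / math.sqrt(C)

out, weights = cross_attention(z, x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # torch.Size([128, 64]) torch.Size([128, 1024])
```

The score matrix has shape N × M rather than M × M, which is where the linear dependence on input size comes from.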
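The latents are processed by a Transformer stack of $L$ blocks:
$$Z_{\ell+1} = \mathrm{TransformerBlock}(Z_\ell), \quad \ell = 1, \dots, L$$
Per-layer complexity, for $N$ latents:
$$\mathcal{O}(N^2)$$
This formulation decouples model depth from input size.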
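A quick back-of-the-envelope count illustrates the savings for the configuration used in this study. The sketch assumes one input token per CIFAR-10 pixel (M = 1024), N = 128 latents, and 4 latent blocks, and it counts only attention score-matrix entries for a single head, ignoring projections and feed-forward layers. The helper name `attention_score_entries` is hypothetical.

```python
def attention_score_entries(num_queries, num_keys):
    """Entries in one attention score matrix (queries x keys)."""
    return num_queries * num_keys


M = 32 * 32   # input elements: one token per CIFAR-10 pixel
N = 128       # latents
L = 4         # latent self-attention blocks

full_self_attn = attention_score_entries(M, M)     # one layer over raw inputs
cross_attn = attention_score_entries(N, M)         # inputs -> latents, once
latent_self_attn = attention_score_entries(N, N)   # one latent block
perceiver_total = cross_attn + L * latent_self_attn

print(full_self_attn)     # 1048576
print(cross_attn)         # 131072
print(latent_self_attn)   # 16384
print(perceiver_total)    # 196608
```

Under these assumptions, the whole Perceiver-style stack computes fewer score entries than a single self-attention layer over the raw pixels, and each extra latent block adds only 16,384 more.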
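Since attention is permutation-invariant, spatial structure is injected using Fourier features:
$$\gamma(p) = \left[\sin(\pi f_1 p), \cos(\pi f_1 p), \dots, \sin(\pi f_K p), \cos(\pi f_K p)\right]$$
where $p \in [-1, 1]$ is a normalized coordinate along one spatial axis and $f_1, \dots, f_K$ are the frequency bands.
These encodings allow the model to learn spatial relationships without convolutional inductive biases.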
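As a dependency-free illustration, the sketch below builds sine/cosine Fourier features for every pixel position of a 2D grid. The function names, the number of bands (8), the maximum frequency (32), and the linear spacing of frequencies between 1 and half the maximum are assumptions loosely following the original paper; the notebook's exact settings may differ.

```python
import math


def fourier_features(pos, num_bands, max_freq):
    """Encode a scalar position in [-1, 1] as interleaved sin/cos features."""
    # Frequencies spaced linearly from 1 to max_freq / 2 (assumed spacing).
    if num_bands == 1:
        freqs = [1.0]
    else:
        step = (max_freq / 2 - 1.0) / (num_bands - 1)
        freqs = [1.0 + k * step for k in range(num_bands)]
    feats = []
    for f in freqs:
        feats.append(math.sin(math.pi * f * pos))
        feats.append(math.cos(math.pi * f * pos))
    return feats


def encode_grid(height, width, num_bands, max_freq):
    """Return H*W encodings, each concatenating row and column features."""
    def coords(n):
        # n evenly spaced coordinates covering [-1, 1] (requires n >= 2).
        return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

    ys, xs = coords(height), coords(width)
    return [
        fourier_features(y, num_bands, max_freq)
        + fourier_features(x, num_bands, max_freq)
        for y in ys for x in xs
    ]


enc = encode_grid(32, 32, num_bands=8, max_freq=32)
print(len(enc), len(enc[0]))  # 1024 32
```

Each of the 1024 pixel positions gets 2 axes × 8 bands × 2 (sin, cos) = 32 features, which would typically be concatenated to the pixel's RGB values before cross-attention.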
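The final representation is obtained via latent aggregation:
$$\bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i$$
where $z_i$ is the $i$-th latent vector (a row of the final latent array) and $N$ is the number of latents. The pooled vector $\bar{z}$ is passed to the classification head.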
Key observations:
- Latent bottlenecks dramatically reduce attention complexity
- Depth can be increased independently of input size
- Fourier positional encodings provide spatial structure without convolutions
- In early epochs, validation accuracy exceeds training accuracy, which is consistent with strong regularization that is active only during training
- Accuracy is limited primarily by latent size, depth, and training time
Understanding scalable perception architectures is critical for:
- Processing raw, high-bandwidth sensory data
- Feeding learned representations into downstream control or RL systems
- Bridging perception and decision-making in autonomous systems
This study complements broader work on controllable perception and perception-to-control pipelines.
Install dependencies with:
pip install -r requirements.txt
Recommended Python version: Python ≥ 3.9
- Jaegle et al., Perceiver: General Perception with Iterative Attention, ICML 2021
https://arxiv.org/abs/2103.03206
Ayushman Mishra
Robotics & Control Systems
https://github.com/aymisxx
linkedin.com/in/aymisxx