Hypothesis: robot policies that can draw on persistent experience — not just the current observation window — will generalize better to tasks requiring spatial recall, object permanence, and long-horizon reasoning.
This repository is a ground-up implementation track toward that hypothesis: rebuild the core ingredients of modern VLA policies one component at a time, in a controlled simulation where each architectural change is attributable and measurable. Once the generative action model is in place, memory is the next experimental variable.
The name reflects the end goal. Current "memory" in most policies is just the context window — a fixed lookback of frames. pi-e is built to eventually test what happens when you replace or augment that with richer, persistent experience.
Each stage is implemented from scratch and evaluated before the next is added:
| # | Component | Status |
|---|---|---|
| 1 | Expert policy (rule-based) | done |
| 2 | Behavior cloning (single-frame) | done |
| 3 | DAgger (covariate shift correction) | done |
| 4 | Action chunking (open-loop + receding horizon) | done |
| 5 | ACT-style transformer decoder with action queries | done |
| 6 | ViT encoder + transformer decoder | done |
| 7 | Evaluation harness with rollout metrics | done |
| 8 | Flow-matching action head | done |
| 9 | Language conditioning (VLA) | next |
| 10 | Memory experiments | upcoming |
2D interception task:
- Observation: RGB image
- Action: `(dx, dy)` end-effector velocity
- Goal: move the blue end-effector to capture the red bouncing ball
The environment is intentionally simple so that model and control changes are attributable and measurable. Complexity is added to the policy, not the environment.
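As an illustration of the kind of rule-based expert such a task admits, here is a minimal proportional-pursuit sketch: chase a short linear extrapolation of the ball and emit a clipped velocity. The function name, the `lookahead` parameter, and the control law are hypothetical stand-ins, not the repo's actual expert.

```python
import numpy as np

def expert_action(ee_pos, ball_pos, ball_vel, lookahead=5.0, max_speed=1.0):
    """Hypothetical rule-based expert: aim at a short linear extrapolation
    of the ball's position and emit a speed-clipped (dx, dy) velocity."""
    target = ball_pos + lookahead * ball_vel   # predict where the ball is heading
    direction = target - ee_pos
    dist = np.linalg.norm(direction)
    if dist < 1e-8:
        return np.zeros(2)                     # already on target
    return direction / dist * min(max_speed, dist)  # clip speed, avoid overshoot
```

A policy this simple is exactly why the environment works as a baseline: the learned policies have a well-defined optimum to be measured against.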
Evaluated over 300 rollouts per policy with a fixed seed and fixed environment settings.
| Policy | Steps to Capture | Path Inefficiency | Completion Rate | Params |
|---|---|---|---|---|
| Expert | 16.83 ± 10.25 | 1.06 ± 0.07 | 1.000 | N/A |
| BC + DAgger | 31.55 ± 26.58 | 1.38 ± 0.95 | 0.910 | 4.2M |
| ACT (RH4) | 23.98 ± 15.22 | 1.20 ± 0.26 | 1.000 | 69K |
| ViT (RH4) | 25.39 ± 15.90 | 1.24 ± 0.52 | 1.000 | 287K |
| Flow Matching (RH4) | 34.91 ± 21.31 | 1.44 ± 1.02 | 0.990 | ~290K |
| Random | 89.99 ± 26.47 | 14.54 ± 13.45 | 0.157 | N/A |
Takeaways:
- Receding-horizon ACT and ViT reach expert-level completion at 69K and 287K parameters respectively.
- ACT is 60× more parameter-efficient than the flatten+MLP BC baseline while outperforming it on every metric.
- Transformer-based policies close most of the gap to the expert while using far fewer parameters than MLP heads.
- Flow matching reaches 0.99 completion rate but is slower per step than direct regression — expected on a unimodal task where regression has an inherent advantage.
Full breakdown including action chunking, open-loop variants, smoothness, and trajectory length: `notes/10_baseline_metrics.md`.
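For concreteness, here is one plausible reading of the path-inefficiency metric: total distance traveled divided by the straight-line start-to-capture distance, so 1.0 is a perfectly direct path. This is an assumed definition for illustration; the exact metric lives in `eval/`.

```python
import numpy as np

def path_inefficiency(positions):
    """Assumed definition: total distance traveled by the end-effector
    divided by the straight-line distance from start to capture point.
    1.0 = perfectly direct; larger = more wandering."""
    positions = np.asarray(positions, dtype=float)
    segment_lengths = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    direct = np.linalg.norm(positions[-1] - positions[0])
    return segment_lengths.sum() / max(direct, 1e-8)
```

Under this definition the expert's 1.06 means its paths are within a few percent of straight lines, while random's 14.54 reflects near-aimless wandering.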
Side-by-side rollout videos (BC + DAgger vs. ACT RH4 vs. ViT RH4): `notes/videos/`.
BC → DAgger: Expert demonstrations undersample failure-recovery states. DAgger corrects covariate shift by labeling states the learned policy actually visits, not just states the expert visits.
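The DAgger loop above can be sketched in a few lines. The environment and training interfaces (`env_reset`, `env_step`, `train_fn`) are hypothetical stand-ins for the repo's actual APIs; the key point is that the expert labels the states the *learner* visits.

```python
def dagger(env_reset, env_step, expert_fn, train_fn, rounds=3, horizon=100):
    """Minimal DAgger sketch: roll out the current learned policy, have the
    expert label every visited state, aggregate, and retrain on the union."""
    states, labels = [], []
    policy = expert_fn  # round 0 rollouts follow the expert (plain BC data)
    for _ in range(rounds):
        s = env_reset()
        for _ in range(horizon):
            states.append(s)
            labels.append(expert_fn(s))       # expert labels the visited state
            s, done = env_step(s, policy(s))  # ...but the learner picks the action
            if done:
                break
        policy = train_fn(states, labels)     # retrain on the aggregated dataset
    return policy
```

Because later rounds roll out the learner itself, the dataset drifts toward the learner's own state distribution, which is precisely the covariate-shift correction.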
Action chunking: Open-loop chunk execution is surprisingly fragile on dynamic targets. Receding-horizon execution (re-plan every 4 steps) recovers most of the gap at no architectural cost.
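The receding-horizon scheme is just a control loop around the chunk predictor: predict a full chunk, execute only the first few actions, re-plan from the new observation. `predict_chunk` and `env_step` below are hypothetical stand-ins for the policy and environment.

```python
import numpy as np

def receding_horizon_rollout(predict_chunk, env_step, obs,
                             steps=12, actions_per_inference=4):
    """Receding-horizon sketch: predict a chunk (e.g. shape (H, 2)), execute
    only the first `actions_per_inference` actions, then re-plan."""
    executed = []
    while len(executed) < steps:
        chunk = predict_chunk(obs)                 # full predicted chunk
        for a in chunk[:actions_per_inference]:    # execute only a prefix
            obs = env_step(obs, a)
            executed.append(a)
            if len(executed) >= steps:
                break
    return executed
```

Open-loop execution is the degenerate case where `actions_per_inference` equals the chunk length, which is exactly what proves fragile against a bouncing target.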
Transformer decoder (ACT): Learned action queries replace the flattened CNN output + MLP head. The result is better performance at 60× fewer parameters — the decoder adds structure that the MLP cannot.
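The core of the learned-query mechanism is a single cross-attention step: H fixed, learned query vectors attend over the encoder's feature tokens, and each attended vector is projected to one action of the chunk. The sketch below uses numpy with random stand-in weights; it shows the dataflow, not the repo's trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_action_chunk(queries, memory, w_out):
    """ACT-style decoding sketch: H learned action queries (H, D) cross-attend
    over encoder features (T, D); each attended vector is projected to (dx, dy)."""
    attn = softmax(queries @ memory.T / np.sqrt(queries.shape[-1]))  # (H, T)
    attended = attn @ memory                                          # (H, D)
    return attended @ w_out                                           # (H, 2)
```

Because the H queries are parameters rather than a flattened feature vector fed to a wide MLP, the head's size is independent of the spatial resolution of the encoder output, which is where most of the 60× parameter saving comes from.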
ViT encoder: Replacing the CNN with a patch-based ViT encoder gives comparable performance at a higher parameter cost, but establishes the backbone for language conditioning later.
Flow matching: A generative action head that transforms Gaussian noise into action chunks by integrating a learned velocity field. Reaches 0.99 completion but trails direct regression on speed and path efficiency — on a unimodal task this is expected. The architecture is now in place for generative VLA behavior.
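Inference for a flow-matching head reduces to numerically integrating the learned velocity field from noise at t=0 to an action chunk at t=1. A minimal Euler-integration sketch, where `velocity_field` stands in for the trained network:

```python
import numpy as np

def sample_action_chunk(velocity_field, chunk_shape, n_steps=10, seed=0):
    """Flow-matching inference sketch: draw x_0 ~ N(0, I) and integrate the
    learned velocity field v(x, t) with Euler steps from t=0 to t=1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=chunk_shape)   # start from Gaussian noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_field(x, t)  # Euler step along the flow
    return x
```

The per-step cost of this integration loop, versus a single forward pass for direct regression, is the speed gap noted above; the payoff is the ability to represent multimodal action distributions, which this unimodal task does not reward.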
Multi-frame stacking (abandoned): Frame stacking (concatenating along channels, à la DQN) was implemented and discarded. A CNN treats all channels equally; it provides no temporal inductive bias. Attention over frame sequences or action chunking is the right path.
```
pi/
├── env/              # simulation environment
├── expert/           # rule-based expert policy
├── policy/           # learned policies (BC, chunking, ACT, ViT)
│   ├── bc_policy.py
│   ├── action_chunking_policy.py
│   ├── act_policy.py
│   ├── vit_policy.py
│   └── flow_matching_policy.py
├── data/             # datasets + dataloaders
├── scripts/          # training / collection scripts
├── eval/             # metric computation + evaluation runners
├── experiments/      # ablation matrix, run log, findings journal
├── notes/            # implementation notes, metrics, videos
└── visualize.py      # interactive visualization
```
```bash
# Visualize expert
python visualize.py

# Train
python scripts/train.py

# Evaluate
python eval/run_eval.py
```
- Task simplicity: the interception environment does not yet test long-horizon planning, memory, or complex contact dynamics — by design, but a real constraint on what the current results claim.
- Script ergonomics: train/eval entrypoints are script-based and not yet unified into a single configurable CLI.
- Dataset scope: demonstrations are from one toy domain; no cross-task or multi-domain pretraining.
- No language conditioning yet: VLA behavior is planned but not implemented.
- Language-conditioned control — minimal language conditioning on top of the ViT/flow matching backbone.
- Harder tasks — partial observability, multi-step goals, variable dynamics.
- Memory experiments — test whether persistent experience (episodic memory, spatial recall) improves policies on tasks that exceed a single observation window. The frame-lookback context that current policies use is a lower bound on what memory could be.
- Evaluation: `eval/run_eval.py`
- Environment: `MovingObjectEnv` (`env/moving_object.py`)
- Episodes per policy: 300
- Max steps: 100
- Seed: 42
- RH4 variants share the same checkpoint; only `actions_per_inference=4` changes.
- Full metric details: `notes/10_baseline_metrics.md`
- Experiment log: `experiments/runs.csv`
- Implementation writeups: `notes/`
- Baseline metrics + charts: `notes/10_baseline_metrics.md`
- Flow matching notes: `notes/11_flow_matching.md`
- Rollout videos: `notes/videos/`
- Experiment workflow: `experiments/README.md`