Objective: Empirically test whether Sparse Autoencoders (SAEs) can recover true hierarchical structure (specifically, bracket nesting depth) from LLM activations, and use feature steering to check that the recovered feature is causally involved in the model's behavior.
This repository demonstrates a complete Mechanistic Interpretability pipeline that uses TransformerLens and SAELens to audit a model's internal structural logic, then clamps a specific SAE feature to induce targeted structural amnesia.
- Feature Identification: Isolated a distinct structural "depth feature" (Feature ID 22574, Layer 4 of `gpt2-small-res-jb`).
- Linear Scaling: The feature's activation magnitude scales almost perfectly linearly with the depth of unclosed brackets, which is consistent with the SAE having recovered a hierarchical depth-tracking mechanism rather than a flat token representation.
- Causal Steering: Clamping this feature to `0.0` during the forward pass degraded the model's structural logic: at shallow depths (1 and 2) it no longer closed the nested sequence. Deeper depths (3 and 4) were unaffected, suggesting that depth is encoded redundantly across features.
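
The linear-scaling claim above can be checked with an ordinary least-squares fit of activation against depth. The following is a minimal sketch, not the repository's code: the helper name `fit_line` is hypothetical, and the example values are a synthetic toy series, not measurements of Feature 22574.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all depths are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # A constant series has no variance to explain; report R^2 as 1.0.
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r2


# Synthetic toy series (NOT real activations): exactly linear in depth.
depths = [1, 2, 3, 4]
toy_activations = [0.5 * d for d in depths]
slope, intercept, r2 = fit_line(depths, toy_activations)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

In practice `toy_activations` would be replaced by the feature's activation at the final prompt position for each depth. An R² close to 1 supports the "almost linear" description, although a good fit on only a handful of depths is weak evidence on its own.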
- Model: `gpt2-small`
- Mechanistic Interpretability: `TransformerLens`, `SAELens`
- Framework: PyTorch
The pipeline evaluates the model on few-shot-prompted bracket sequences at nesting depths 1 through 4. Once baseline structural accuracy is confirmed at 100%, an intervention hook suppresses Feature 22574 by clamping it to 0.0.
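
The clamping step can be illustrated on a toy SAE in plain Python. This is a sketch under stated assumptions, not the repository's hook: the function names and toy weights are hypothetical, the SAE is assumed to be a standard ReLU encoder with a linear decoder, and adding the reconstruction error back is one common design choice that the source does not confirm. In the actual pipeline the same arithmetic would run on PyTorch tensors inside a TransformerLens forward hook.

```python
def encode(x, w_enc, b_enc):
    """Toy SAE encoder: ReLU(x @ w_enc + b_enc)."""
    return [
        max(0.0, sum(xi * w_enc[i][j] for i, xi in enumerate(x)) + b_enc[j])
        for j in range(len(b_enc))
    ]


def decode(feats, w_dec, b_dec):
    """Toy SAE decoder: feats @ w_dec + b_dec."""
    return [
        sum(fj * w_dec[j][i] for j, fj in enumerate(feats)) + b_dec[i]
        for i in range(len(b_dec))
    ]


def clamp_feature(x, feature_id, value, w_enc, b_enc, w_dec, b_dec):
    """Return the activation x with one SAE feature forced to `value`.

    Assumption: the part of x that the SAE fails to reconstruct is added
    back, so only the clamped feature's contribution changes.
    """
    feats = encode(x, w_enc, b_enc)
    recon = decode(feats, w_dec, b_dec)
    error = [xi - ri for xi, ri in zip(x, recon)]
    clamped = list(feats)
    clamped[feature_id] = value
    steered = decode(clamped, w_dec, b_dec)
    return [si + ei for si, ei in zip(steered, error)]


# Hypothetical 2-dimensional toy weights (identity encoder and decoder).
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
B_ENC = [0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]
B_DEC = [0.0, 0.0]

resid = [3.0, -1.0]
print(clamp_feature(resid, 0, 0.0, W_ENC, B_ENC, W_DEC, B_DEC))
```

Preserving the error term keeps the intervention targeted: without it, the steered run would also lose everything the SAE cannot reconstruct, and any degradation could not be attributed to the clamped feature alone.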
============================================================
STEERING EXPERIMENT: Clamping Depth Feature 22574 to 0.0
============================================================
Depth 1:
Expected: ')'
Baseline: ' )' ✓
Steered: ' (' ✗
Depth 2:
Expected: ')'
Baseline: ' )' ✓
Steered: ' (' ✗
Depth 3:
Expected: ')'
Baseline: ' )' ✓
Steered: ' )' ✓
Depth 4:
Expected: ')'
Baseline: ' )' ✓
Steered: ' )' ✓
============================================================
SUMMARY
============================================================
Baseline accuracy: 4/4 (100%)
Steered accuracy: 2/4 (50%)
Degradation: 2 failures induced
✓ SUCCESS: Steering suppressed the depth feature, breaking bracket prediction.
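
The summary figures in the log can be reproduced from the per-depth rows with a small scoring routine. This is a minimal sketch with hypothetical helper names. It assumes predictions are compared after stripping whitespace, since the log counts `' )'` as correct against an expected `')'`, and it assumes an induced failure means a depth where the baseline was correct and the steered run was not.

```python
# Rows copied from the steering log: (depth, expected, baseline, steered).
RESULTS = [
    (1, ")", " )", " ("),
    (2, ")", " )", " ("),
    (3, ")", " )", " )"),
    (4, ")", " )", " )"),
]


def is_correct(expected, predicted):
    # Assumption: leading/trailing whitespace in the token is ignored.
    return predicted.strip() == expected.strip()


def summarize(results):
    """Count correct predictions per condition and failures caused by steering."""
    baseline_ok = sum(is_correct(exp, base) for _, exp, base, _ in results)
    steered_ok = sum(is_correct(exp, steer) for _, exp, _, steer in results)
    induced = sum(
        is_correct(exp, base) and not is_correct(exp, steer)
        for _, exp, base, steer in results
    )
    return {
        "total": len(results),
        "baseline_correct": baseline_ok,
        "steered_correct": steered_ok,
        "failures_induced": induced,
    }


summary = summarize(RESULTS)
total = summary["total"]
print(f"Baseline accuracy: {summary['baseline_correct']}/{total}")
print(f"Steered accuracy: {summary['steered_correct']}/{total}")
print(f"Degradation: {summary['failures_induced']} failures induced")
```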