SAE Structural Steering: Diagnosing Hierarchy in GPT-2

Objective: Empirically test whether Sparse Autoencoders (SAEs) can recover hierarchical structure (specifically, bracket nesting depth) from LLM activations, and check via feature steering whether the recovered feature is causally involved in the model's predictions.

This repository walks through a mechanistic interpretability pipeline built on TransformerLens and SAELens: it locates an SAE feature that tracks bracket depth, then clamps that feature during the forward pass to see whether the model loses track of nesting.

🔬 Key Findings

Feature Identification: Isolated a distinct structural "depth feature" (Feature ID 22574, Layer 4 of gpt2-small-res-jb).
Linear Scaling: The feature's activation magnitude increases approximately linearly with the number of unclosed brackets, which suggests the SAE recovered a depth-tracking feature rather than a flat token representation.
Causal Steering: Clamping this feature to 0.0 during the forward pass broke bracket closing at depths 1 and 2, where the model predicted ' (' instead of ' )'. Predictions at depths 3 and 4 were unaffected, which may indicate that other features redundantly encode depth.
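
The linear-scaling finding can be checked with an ordinary least-squares fit of activation against depth. The following is a minimal sketch, not the notebook's code: `unclosed_depth` and `linear_fit` are hypothetical helper names, the prompts are illustrative, and the activation values are made-up placeholders that only demonstrate the call. Real values would come from reading Feature 22574 at the final token of each prompt.

```python
def unclosed_depth(text: str) -> int:
    """Count brackets that have been opened but not yet closed in `text`."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
    return depth


def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Illustrative prompts with 1 to 4 unclosed brackets.
prompts = ["( a", "( ( a", "( ( ( a", "( ( ( ( a"]
depths = [unclosed_depth(p) for p in prompts]

# PLACEHOLDER activations, not measured values from the notebook.
placeholder_acts = [2.0, 4.1, 5.9, 8.0]

slope, intercept, r_squared = linear_fit(depths, placeholder_acts)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.3f}")
```

An R^2 close to 1 with a clearly positive slope would support the "approximately linear" reading; a flat or saturating curve would not.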

🛠️ Tech Stack

Model: gpt2-small
Mechanistic Interpretability: TransformerLens, SAELens
Framework: PyTorch

📊 Empirical Results

The pipeline evaluates the model on few-shot bracket-sequence prompts at nesting depths 1 through 4. Once the unmodified model is confirmed to predict the closing bracket at every depth (4/4), an intervention hook clamps Feature 22574 to 0.0 and the same prompts are rerun.
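
Clamping an SAE feature typically means editing only that feature's contribution to the activation and leaving everything else, including the SAE's reconstruction error, untouched. The sketch below shows that arithmetic on a toy SAE written with plain Python lists. It is an assumption about how such an intervention could work, not the notebook's implementation: `encode`, `clamp_feature`, the identity weights, and the 2-dimensional activation are all hypothetical, and the decoder bias is omitted for brevity. In the real pipeline the same update would presumably run inside a TransformerLens forward hook on the layer 4 residual stream, using the SAELens weights.

```python
def encode(act, w_enc, b_enc):
    """Toy SAE encoder: ReLU(act @ w_enc + b_enc), one value per feature.

    w_enc is indexed as w_enc[model_dim][feature].
    """
    return [
        max(0.0, sum(a * w_enc[d][f] for d, a in enumerate(act)) + b_enc[f])
        for f in range(len(b_enc))
    ]


def clamp_feature(act, w_enc, b_enc, w_dec, feature_id, value=0.0):
    """Return a copy of `act` with one SAE feature forced to `value`.

    Only the chosen feature's decoder direction (w_dec[feature][model_dim])
    is added or removed, so the other features and the SAE's reconstruction
    error pass through unchanged.
    """
    current = encode(act, w_enc, b_enc)[feature_id]
    delta = value - current
    return [a + delta * w_dec[feature_id][d] for d, a in enumerate(act)]


# Hypothetical toy SAE: 2 model dimensions, 2 features, identity weights.
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0]]

act = [3.0, 1.5]  # stand-in for one residual-stream vector
steered = clamp_feature(act, w_enc, b_enc, w_dec, feature_id=0, value=0.0)

print("features before:", encode(act, w_enc, b_enc))
print("features after: ", encode(steered, w_enc, b_enc))
```

Editing the activation by a delta along the decoder direction, rather than replacing it with the SAE's full reconstruction, keeps the intervention from also injecting reconstruction error into the forward pass.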

============================================================
STEERING EXPERIMENT: Clamping Depth Feature 22574 to 0.0
============================================================

Depth 1: 
    Expected:   ')'
    Baseline:   ' )' ✅
    Steered:    ' (' ❌

Depth 2: 
    Expected:   ')'
    Baseline:   ' )' ✅
    Steered:    ' (' ❌

Depth 3: 
    Expected:   ')'
    Baseline:   ' )' ✅
    Steered:    ' )' ✅

Depth 4: 
    Expected:   ')'
    Baseline:   ' )' ✅
    Steered:    ' )' ✅

============================================================
SUMMARY
============================================================
Baseline accuracy: 4/4 (100%)
Steered accuracy:  2/4 (50%)
Degradation:       2 failures induced

✅ SUCCESS: Steering suppressed the depth feature, breaking bracket prediction.
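
The ✅ marks in the log suggest that predictions are compared after stripping whitespace, since the model emits ' )' with a leading space while the expected token is ')'. The following minimal sketch reproduces the summary figures from the predictions logged above. The `score` helper and the strip-based comparison are assumptions about the notebook, not its actual code.

```python
def score(predictions, expected=")"):
    """Count predictions that equal `expected` once whitespace is stripped."""
    return sum(pred.strip() == expected for pred in predictions)


# Next-token predictions for depths 1 through 4, copied from the log.
baseline = [" )", " )", " )", " )"]
steered = [" (", " (", " )", " )"]

n = len(baseline)
baseline_correct = score(baseline)
steered_correct = score(steered)

print(f"Baseline accuracy: {baseline_correct}/{n} ({baseline_correct / n:.0%})")
print(f"Steered accuracy:  {steered_correct}/{n} ({steered_correct / n:.0%})")
print(f"Degradation:       {baseline_correct - steered_correct} failures induced")
```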
