sparse-autoencoder-training
davila7/claude-code-templates · updated Apr 8, 2026
SAELens is the primary library for training and analyzing Sparse Autoencoders (SAEs), a technique for decomposing polysemantic neural network activations into sparse, interpretable features. It builds on Anthropic's research on monosemanticity.
SAELens: Sparse Autoencoders for Mechanistic Interpretability
GitHub: jbloomAus/SAELens (1,100+ stars)
The Problem: Polysemanticity & Superposition
Individual neurons in neural networks are often polysemantic: they activate in multiple, semantically distinct contexts. This happens because models use superposition to represent more features than they have neurons, which makes neuron-level interpretation difficult.
SAEs address this by decomposing dense activations into sparse features that tend to be monosemantic. Typically only a small number of features activate for any given input, and each feature ideally corresponds to one interpretable concept.
When to Use SAELens
Use SAELens when you need to:
- Discover interpretable features in model activations
- Understand what concepts a model has learned
- Study superposition and feature geometry
- Perform feature-based steering or ablation
- Analyze safety-relevant features (deception, bias, harmful content)
Consider alternatives when:
- You need basic activation analysis → Use TransformerLens directly
- You want causal intervention experiments → Use pyvene or TransformerLens
- You need production steering → Consider direct activation engineering
Installation
pip install sae-lens
Requirements: Python 3.10+, transformer-lens>=2.0.0
Core Concepts
What SAEs Learn
SAEs are trained to reconstruct model activations through a sparse bottleneck:
Input Activation → Encoder → Sparse Features → Decoder → Reconstructed Activation
   (d_model)          ↓    (d_sae >> d_model)     ↓             (d_model)
                  sparsity                 reconstruction
                   penalty                      loss
Loss Function: MSE(original, reconstructed) + L1_coefficient × L1(features)
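As a rough illustration of the loss above, the sketch below runs one forward pass of a tiny ReLU autoencoder in plain Python and computes both terms. The `sae_loss` helper, the weights, and the toy sizes are hypothetical; SAELens itself works on PyTorch tensors and may normalize or weight the terms differently.

```python
def relu(x):
    return max(0.0, x)


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coefficient):
    """Return (total, mse, l1) for a single activation vector x.

    Hypothetical helper: W_enc has shape [d_sae][d_model] and
    W_dec has shape [d_model][d_sae].
    """
    # Encoder: features = ReLU(W_enc @ x + b_enc)
    features = [
        relu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Decoder: reconstruction = W_dec @ features + b_dec
    recon = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(W_dec, b_dec)
    ]
    # Reconstruction term: mean squared error over the d_model dimensions
    mse = sum((xi - ri) ** 2 for xi, ri in zip(x, recon)) / len(x)
    # Sparsity term: L1 norm of the feature activations
    l1 = sum(abs(f) for f in features)
    return mse + l1_coefficient * l1, mse, l1


# Toy example with d_model = 2 and d_sae = 3
x = [1.0, 2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

total, mse, l1 = sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coefficient=8e-5)
print(f"mse={mse:.4f} l1={l1:.4f} total={total:.6f}")
```

In this toy case the third feature is switched off by the ReLU, the reconstruction is exact, and the only remaining cost is the sparsity penalty on the two active features.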
Key Validation (Anthropic Research)
In "Towards Monosemanticity", human evaluators found 70% of SAE features genuinely interpretable. Features discovered include:
- DNA sequences, legal language, HTTP requests
- Hebrew text, nutrition statements, code syntax
- Sentiment, named entities, grammatical structures
Workflow 1: Loading and Analyzing Pre-trained SAEs
Step-by-Step
from transformer_lens import HookedTransformer
from sae_lens import SAE

# 1. Load model and pre-trained SAE
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
# Note: the three-value return matches older SAELens releases; newer releases
# may return only the SAE, so check the version you have installed.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

# 2. Get model activations
tokens = model.to_tokens("The capital of France is Paris")
_, cache = model.run_with_cache(tokens)
activations = cache["resid_pre", 8]  # [batch, pos, d_model]

# 3. Encode to SAE features
sae_features = sae.encode(activations)  # [batch, pos, d_sae]
print(f"Active features: {(sae_features > 0).sum().item()}")

# 4. Find top features for each position
for pos in range(tokens.shape[1]):
    top_features = sae_features[0, pos].topk(5)
    token = model.to_str_tokens(tokens[0, pos:pos+1])[0]
    print(f"Token '{token}': features {top_features.indices.tolist()}")

# 5. Reconstruct activations and measure the residual norm
reconstructed = sae.decode(sae_features)
reconstruction_error = (activations - reconstructed).norm()
Available Pre-trained SAEs
| Release | Model | Layers |
|---|---|---|
| `gpt2-small-res-jb` | GPT-2 Small | Multiple residual streams |
| `gemma-2b-res` | Gemma 2B | Residual streams |
| Various on HuggingFace | Search tag `saelens` | Various |
Checklist
- Load model with TransformerLens
- Load matching SAE for target layer
- Encode activations to sparse features
- Identify top-activating features per token
- Validate reconstruction quality
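The raw residual norm from step 5 grows with sequence length and activation scale, so it is hard to judge on its own. One simple way to validate reconstruction quality is to divide it by the norm of the original activations. The `relative_error` helper below is hypothetical and uses plain lists so the arithmetic is easy to follow; with tensors the equivalent expression would be `(activations - reconstructed).norm() / activations.norm()`.

```python
import math


def relative_error(original, reconstructed):
    """L2 norm of the residual divided by the L2 norm of the original."""
    residual_sq = sum((o - r) ** 2 for o, r in zip(original, reconstructed))
    original_sq = sum(o ** 2 for o in original)
    return math.sqrt(residual_sq / original_sq)


# Made-up vectors standing in for one activation and its reconstruction
original = [3.0, 4.0]
reconstructed = [3.0, 3.0]
err = relative_error(original, reconstructed)
print(f"relative reconstruction error: {err:.2f}")
```

A value near 0 means the SAE reproduces the activation almost exactly; a value near 1 means the residual is as large as the activation itself.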
Workflow 2: Training a Custom SAE
Step-by-Step
from sae_lens import SAE, LanguageModelSAERunnerConfig, SAETrainingRunner

# 1. Configure training
cfg = LanguageModelSAERunnerConfig(
    # Model
    model_name="gpt2-small",
    hook_name="blocks.8.hook_resid_pre",
    hook_layer=8,
    d_in=768,  # Model dimension
    # SAE architecture
    architecture="standard",  # or "gated", "topk"
    d_sae=768 * 8,  # Expansion factor of 8
    activation_fn="relu",
    # Training
    lr=4e-4,
    l1_coefficient=8e-5,  # Sparsity penalty
    l1_warm_up_steps=1000,
    train_batch_size_tokens=4096,
    training_tokens=100_000_000,
    # Data
    dataset_path="monology/pile-uncopyrighted",
    context_size=128,
    # Logging
    log_to_wandb=True,
    wandb_project="sae-training",
    # Checkpointing
    checkpoint_path="checkpoints",
    n_checkpoints=5,
)

# 2. Train
trainer = SAETrainingRunner(cfg)
sae = trainer.run()

# 3. Evaluate
# Note: `trainer.metrics` is kept as written in the original example. Confirm it
# exists in your SAELens version; the same metrics are logged to W&B when
# log_to_wandb=True.
print(f"L0 (avg active features): {trainer.metrics['l0']}")
print(f"CE Loss Recovered: {trainer.metrics['ce_loss_score']}")
Key Hyperparameters
| Parameter | Typical Value | Effect |
|---|---|---|
| `d_sae` | 4-16× d_model | More features, higher capacity |
| `l1_coefficient` | 5e-5 to 1e-4 | Higher = sparser, less accurate |
| `lr` | 1e-4 to 1e-3 | Standard optimizer LR |
| `l1_warm_up_steps` | 500-2000 | Prevents early feature death |
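To make the table more concrete, the sketch below derives `d_sae` from an expansion factor and ramps the L1 coefficient over `l1_warm_up_steps`. The linear ramp and the `l1_at_step` helper are assumptions made for illustration; the schedule SAELens actually applies may differ.

```python
def l1_at_step(step, l1_coefficient, l1_warm_up_steps):
    """Linearly ramp the sparsity penalty from 0 up to its final value.

    Hypothetical helper; assumes a linear warm-up schedule.
    """
    if l1_warm_up_steps <= 0:
        return l1_coefficient
    return l1_coefficient * min(1.0, step / l1_warm_up_steps)


d_model = 768
expansion_factor = 8
d_sae = d_model * expansion_factor  # dictionary size used in the config above

for step in (0, 500, 1000, 5000):
    coeff = l1_at_step(step, l1_coefficient=8e-5, l1_warm_up_steps=1000)
    print(f"step {step}: l1 coefficient = {coeff:.1e}")
```

Starting with a small penalty lets the encoder find useful directions before sparsity pressure builds, which is the reasoning behind the "prevents early feature death" entry.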
Evaluation Metrics
| Metric | Target | Meaning |
|---|---|---|
| L0 | 50-200 | Average active features per token |
| CE Loss Score | 80-95% | Cross-entropy recovered vs original |
| Dead Features | <5% | Features that never activate |
| Explained Variance | >90% | Reconstruction quality |
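Three of these metrics can be computed directly from a batch of feature activations and reconstructions. The sketch below uses common definitions on small made-up lists; the helper names are hypothetical, and SAELens's own evaluation code may define or aggregate the metrics differently, for example over a much larger token sample. CE Loss Score is left out because it requires running the model.

```python
def l0_per_token(features):
    """Average number of active (non-zero) features per token."""
    return sum(sum(1 for f in row if f > 0) for row in features) / len(features)


def dead_feature_fraction(features):
    """Fraction of features that never activate on any token in the sample."""
    d_sae = len(features[0])
    dead = sum(1 for j in range(d_sae) if all(row[j] <= 0 for row in features))
    return dead / d_sae


def explained_variance(originals, reconstructions):
    """1 - (residual sum of squares / variance around the per-dimension mean)."""
    n, d = len(originals), len(originals[0])
    means = [sum(row[j] for row in originals) / n for j in range(d)]
    residual = sum(
        (o - r) ** 2
        for o_row, r_row in zip(originals, reconstructions)
        for o, r in zip(o_row, r_row)
    )
    total = sum((o - m) ** 2 for row in originals for o, m in zip(row, means))
    return 1.0 - residual / total


# Made-up data: 3 tokens x 4 features, and 2 tokens x 2 activation dimensions
features = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 1.2, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0],
]
originals = [[1.0, 0.0], [3.0, 2.0]]
reconstructions = [[1.0, 0.5], [3.0, 1.5]]

l0 = l0_per_token(features)
dead = dead_feature_fraction(features)
ev = explained_variance(originals, reconstructions)
print(f"L0={l0:.2f} dead={dead:.0%} explained_variance={ev:.3f}")
```

A feature counts as dead here only relative to the sample it was measured on, so the fraction is most meaningful when the sample covers many tokens.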
Checklist
- Choose target layer and hook point
- Set expansion factor (d_sae = 4-16× d_model)
- Tune L1 coefficient for desired sparsity
- Enable L1 warm-up to prevent dead features
- Monitor metrics during training (W&B)
- Validate L0 and CE loss recovery
- Check dead feature ratio
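For the CE loss recovery item, one common definition compares the model's cross-entropy loss in three settings: unmodified, with the SAE reconstruction spliced in at the hook point, and with the hooked activation zero-ablated. The `ce_loss_score` helper and the loss values below are hypothetical, and the exact formula your SAELens version uses should be checked against its evaluation code.

```python
def ce_loss_score(clean_loss, sae_loss, zero_ablation_loss):
    """Fraction of the loss gap (zero-ablation vs. clean) recovered by the SAE.

    1.0 means splicing in the SAE costs nothing; 0.0 means it is no better
    than zeroing the activation.
    """
    return (zero_ablation_loss - sae_loss) / (zero_ablation_loss - clean_loss)


# Made-up losses for illustration only
score = ce_loss_score(clean_loss=3.0, sae_loss=3.2, zero_ablation_loss=7.0)
print(f"CE loss score: {score:.0%}")
```

Under this definition a small increase in loss can still correspond to a high score when zero-ablation is very damaging, so it helps to look at the raw loss difference as well.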
Workflow 3: Feature Analysis and Steering
Analyzing Individual Features
from transformer_lens import HookedTransformer
from sae_lens import SAE
import torch
model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device="cuda"
)

# Find what activates a specific feature
feature_idx = 1234
test_texts = [
    "The scientist conducted an experiment",
    "I love chocolate cake",
    "The code compiles successfully",
    "Paris is beautiful in spring",
]
for text in test_texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)
    features = sae.encode(cache["resid_pre", 8])
    # Peak activation of the chosen feature across all token positions
    activation = features[0, :, feature_idx].max().item()
    # Assumed completion: report the peak activation for each text
    print(f"{activation:.3f}: {text}")
How to use sparse-autoencoder-training on Cursor
AI-first code editor with Composer
1. Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- Cursor installed and configured on your development machine
- Node.js version 16.0+ with npm package manager (verify with `node --version`)
- Active project directory or workspace where you want to add sparse-autoencoder-training
2. Execute installation command
Execute the skills CLI command in your project's root directory to begin installation:
$ npx skills add https://github.com/davila7/claude-code-templates --skill sparse-autoencoder-training
The skills CLI fetches sparse-autoencoder-training from GitHub repository davila7/claude-code-templates and configures it for Cursor.
3. Select Cursor when prompted
The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:
◆ Which agents do you want to install to?
│
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Cursor
│ • Windsurf
4. Verify installation
Confirm successful installation by checking the skill directory location:
.cursor/skills/sparse-autoencoder-training
Reload or restart Cursor to activate sparse-autoencoder-training. Access the skill through slash commands (e.g., /sparse-autoencoder-training) or your agent's skill management interface.
⚠ Security & Verification Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.