autoresearch
supercent-io/skills-template · updated Apr 8, 2026
"The researcher's job shifts from writing Python to writing Markdown." — Andrej Karpathy
Autoresearch is an autonomous ML experimentation framework. An AI agent iteratively modifies train.py, runs fixed 5-minute GPU experiments, evaluates each run with a single metric (val_bpb), and keeps a commit only when that metric improves (git ratcheting). The result: wake up to 100+ logged experiments and a model whose val_bpb only ever improves.
When to use this skill
- Setting up autoresearch on a GPU machine for the first time
- Writing or refining `program.md` research directives for the agent
- Launching an overnight autonomous experiment loop
- Interpreting `results.tsv` to understand what the agent found
- Configuring the system for constrained hardware (limited VRAM)
- Understanding the ratcheting mechanism and git workflow
- Porting to Apple Silicon (MLX) or Windows RTX
Core Architecture
Human authors program.md
│
▼
Agent reads program.md + train.py
│
▼
Agent modifies train.py → git commit
│
▼
uv run train.py (exactly 300 seconds)
│
▼
Extract val_bpb + peak_vram_mb
│
┌────┴────┐
improved? no improvement
│ │
keep commit git reset HEAD~1
│ │
└──────┬───────┘
│
log to results.tsv
│
▼
repeat ∞
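The keep/revert decision in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: `Outcome` and `ratchet` are hypothetical names, and the list of outcomes stands in for real runs of `uv run train.py` followed by the git commands.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Outcome:
    """Result of one experiment; val_bpb is None when the run crashed."""
    commit: str
    val_bpb: Optional[float]


def ratchet(baseline: float, outcomes: Iterable[Outcome]) -> Tuple[float, List[Tuple[str, str]]]:
    """Apply the keep/revert rule to a sequence of experiment outcomes.

    Returns the best val_bpb seen and a (commit, status) log. A real loop
    would revert the commit for every status other than "keep".
    """
    best = baseline
    log: List[Tuple[str, str]] = []
    for outcome in outcomes:
        if outcome.val_bpb is None:
            status = "crash"            # always reverted
        elif outcome.val_bpb < best:    # lower val_bpb is better
            status = "keep"
            best = outcome.val_bpb
        else:
            status = "discard"          # reverted
        log.append((outcome.commit, status))
    return best, log


# Stubbed outcomes (hypothetical commit hashes and scores).
best, log = ratchet(0.9979, [
    Outcome("a3f2c91", 0.9697),
    Outcome("b8e1d04", 0.9821),
    Outcome("c1a5f30", None),
])
print(best, log)
```

Because `best` only changes on a strict improvement, the retained branch can never get worse, which is the ratchet property.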
Mutable vs. Immutable Files
| File | Agent access | Purpose |
|---|---|---|
| `train.py` | Read + Write | Model, optimizer, training loop (~630 lines) |
| `program.md` | Read-only | Human research directives |
| `prepare.py` | Read-only | Data pipeline + evaluate_bpb() harness |
| `constants.py` | Read-only | TIME_BUDGET=300, MAX_SEQ_LEN, EVAL_TOKENS |
| `pyproject.toml` | Read-only | Locked dependencies (no new packages) |
| `results.tsv` | Append | All experiments: kept and discarded |
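One way to enforce the access table above is to check which files an experiment commit touched before running it. The sketch below is an assumption about how such a guard could look. `WRITABLE` and `contract_violations` are hypothetical names, and in practice the list of changed paths might come from `git diff --name-only HEAD~1`.

```python
from typing import Iterable, List

# Paths the agent may change, mirroring the table above. results.tsv is
# append-only; this name-based check does not verify the append-only part.
WRITABLE = {"train.py", "results.tsv"}


def contract_violations(changed_files: Iterable[str]) -> List[str]:
    """Return the changed paths that the agent is not allowed to touch."""
    return sorted(path for path in changed_files if path not in WRITABLE)


print(contract_violations(["train.py", "prepare.py", "pyproject.toml"]))
```

An empty list means the commit stayed inside the contract; anything else is a reason to revert before spending a 5-minute run on it.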
Instructions
Step 1: Install Prerequisites
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone https://github.com/karpathy/autoresearch
cd autoresearch
# Install locked dependencies
uv sync
Step 2: Prepare Data (One-Time, ~2 Minutes)
# Downloads FineWeb-Edu parquet shards, trains BPE tokenizer
# Last shard is reserved for validation — never seen during training
uv run prepare.py
For constrained hardware, edit prepare.py before running:
# Lower MAX_SEQ_LEN for GPUs with limited VRAM
MAX_SEQ_LEN = 256 # default: 2048
Step 3: Run a Baseline Experiment
# Single 5-minute experiment to verify setup
uv run train.py > run.log 2>&1
# Extract key metrics
grep "^val_bpb:\|^peak_vram_mb:" run.log
Expected output:
val_bpb: 0.9979
peak_vram_mb: 38420
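The same extraction can be done in Python when the metrics feed a script rather than a terminal. This is a small sketch that assumes the `name: value` line format shown in the expected output; `parse_metrics` is an illustrative helper, not part of the repository.

```python
import re
from typing import Dict

# Matches whole lines such as "val_bpb: 0.9979" or "peak_vram_mb: 38420".
METRIC_LINE = re.compile(r"^(val_bpb|peak_vram_mb):[ \t]*([0-9.]+)[ \t]*$", re.MULTILINE)


def parse_metrics(log_text: str) -> Dict[str, float]:
    """Collect metric lines from a run log; the last occurrence of a name wins."""
    return {name: float(value) for name, value in METRIC_LINE.findall(log_text)}


sample = "step 100 loss 3.21\nval_bpb: 0.9979\npeak_vram_mb: 38420\n"
metrics = parse_metrics(sample)
print(metrics)
```

An empty dictionary is a useful signal on its own: the run ended before printing its metrics, which usually means a crash or a timeout.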
Step 4: Author program.md
program.md is the human-written research charter the agent reads at the start of every loop iteration. Write it as precise Markdown instructions:
# Research Program
## Goal
Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.
## Current Baseline
val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)
## Directions to Explore
1. Attention variants: MLA, GQA, sliding window, local-global hybrid
2. Layer types: MoE FFN layers, SwiGLU activations
3. Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule
4. Architectural depth/width tradeoffs within VRAM budget
## Constraints
- Must complete within 300 seconds
- Peak VRAM must stay under 39GB
- No new packages (use only what is in pyproject.toml)
- Do not modify prepare.py or constants.py
## Notes from Previous Runs
- Depth-12 improvements transfer to depth-24 (scale-invariant gains)
- RoPE positional encoding outperformed learned embeddings (-0.008 val_bpb)
Effective program.md principles:
- Be specific about what to explore — vague directives waste experiments
- Record what has already been tried (prevents redundant experiments)
- Note hardware constraints explicitly
- Use the current best `val_bpb` as a reference point
Step 5: Run the Autonomous Agent Loop
Point your AI agent (Claude Code, Codex, etc.) at the repository with program.md as its research context. The agent will:
- Read `program.md` + current `train.py`
- Hypothesize an improvement
- Modify `train.py` + commit
- Execute `uv run train.py` (300 seconds)
- Extract `val_bpb`; keep or revert via git
- Append to `results.tsv`
- Repeat
With Claude Code (OMC):
# From inside autoresearch/
# Give Claude the context: "Run the autoresearch loop following program.md"
With Claude Code CLI directly:
claude "Follow program.md. Run autonomous research loop on train.py.
Execute: uv run train.py, extract val_bpb, keep improvements, revert failures.
Log everything to results.tsv. Do not stop until I say so."
Step 6: Monitor Results
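# Live monitoring during a run
watch -n 30 "tail -20 results.tsv"
# Count experiments by status (skip the header row)
awk -F'\t' 'NR>1 {print $4}' results.tsv | sort | uniq -c
# Find the best experiments (kept rows only, lowest val_bpb first)
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2,2n | head -5
# Show the most recent kept commits; the branch tip is the current best
git log --oneline -5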
Step 7: Interpret results.tsv
commit val_bpb memory_gb status description
a3f2c91 0.9697 37.2 keep SwiGLU activation + depth-12
b8e1d04 0.9821 38.1 discard MoE 4-expert: marginal gain
c1a5f30 crash — crash OOM: sequence length 4096
| Status | Meaning |
|---|---|
| `keep` | val_bpb improved; commit retained on branch |
| `discard` | No improvement; git reset HEAD~1 applied |
| `crash` | OOM, syntax error, or timeout; always reverted |
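For a quick programmatic read of the log, the status column can be tallied with the standard `csv` module. The sketch below assumes a tab-separated file with the header row shown above. `SAMPLE_TSV` is modeled on the sample rows (with `-` standing in for the missing memory value), and `summarize` is a hypothetical helper.

```python
import csv
import io
from collections import Counter
from typing import Optional, Tuple

# Rows modeled on the sample above; a real script would read results.tsv.
SAMPLE_TSV = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a3f2c91\t0.9697\t37.2\tkeep\tSwiGLU activation + depth-12\n"
    "b8e1d04\t0.9821\t38.1\tdiscard\tMoE 4-expert: marginal gain\n"
    "c1a5f30\tcrash\t-\tcrash\tOOM: sequence length 4096\n"
)


def summarize(tsv_text: str) -> Tuple[Counter, Optional[dict]]:
    """Count rows per status and return the kept row with the lowest val_bpb."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    counts = Counter(row["status"] for row in rows)
    # Only kept rows are parsed as floats, so "crash" placeholders are never converted.
    kept = [row for row in rows if row["status"] == "keep"]
    best = min(kept, key=lambda row: float(row["val_bpb"])) if kept else None
    return counts, best


counts, best = summarize(SAMPLE_TSV)
print(dict(counts))
print(best["commit"], best["val_bpb"])
```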
Examples
Example 1: Overnight Run Summary
Session summary: 126 experiments, 18 improvements
Best val_bpb: 0.9697 (started: 0.9979)
Top improvements:
- SwiGLU activation: -0.012 val_bpb
- GQA with 4 KV heads: -0.009 val_bpb
- Muon momentum 0.92→0.95: -0.006 val_bpb
Example 2: Low-VRAM Configuration (6GB GPU)
# In prepare.py — edit before uv run prepare.py
MAX_SEQ_LEN = 256 # was 2048
EVAL_TOKENS = 2_097_152 # was 20_971_520 (scale down proportionally)
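The scaling arithmetic can be checked directly. The sketch below assumes the evaluation budget is scaled by the same factor as the sequence length; `scale_eval_tokens` is a hypothetical helper, not something defined in `prepare.py`.

```python
def scale_eval_tokens(default_tokens: int, default_seq_len: int, new_seq_len: int) -> int:
    """Scale the evaluation token budget by the same factor as the sequence
    length, rounded down to a whole number of sequences."""
    scaled = default_tokens * new_seq_len // default_seq_len
    return (scaled // new_seq_len) * new_seq_len


strict = scale_eval_tokens(20_971_520, 2048, 256)
print(strict)                      # strict 8x reduction
print(20_971_520 // 2_097_152)     # reduction factor used in the example above
```

A strict 8x reduction gives 2,621,440 tokens. The value in the example, 2_097_152 (2**21), is a 10x reduction, so "proportionally" there is approximate. Either choice is workable as long as it stays fixed for the whole session.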
Example 3: Extract Experiments by Category
# Find all attention-related experiments
grep -i "attention\|GQA\|MLA\|MHA" results.tsv
# List only improvements, lowest val_bpb first
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2,2n
Available scripts
Run from inside the autoresearch repository directory:
| Script | Purpose | Usage |
|---|---|---|
| `setup.sh` | One-time environment setup | `bash scripts/setup.sh [--seq-len 512]` |
| `run-experiment.sh` | Single 5-min experiment + metric extraction | `bash scripts/run-experiment.sh` |
| `run-loop.sh` | Autonomous loop: run → keep/revert → repeat | `bash scripts/run-loop.sh [--max 20]` |
| `show-results.sh` | Human-readable results.tsv report | `bash scripts/show-results.sh [--top 10]` |
| `check-hardware.sh` | GPU/CUDA/uv availability check (JSON output) | `bash scripts/check-hardware.sh` |
# Typical overnight session
bash scripts/check-hardware.sh
bash scripts/setup.sh --seq-len 512 # adjust for your VRAM
# Edit program.md with your research directives
bash scripts/run-loop.sh --max 100 --desc "session-1"
bash scripts/show-results.sh --kept-only
References
Detailed documentation in references/:
| File | Contents |
|---|---|
| `references/architecture.md` | System design, immutability contract, git ratcheting, key design decisions |
| `references/program-md-guide.md` | How to write effective program.md directives; full template + principles |
| `references/hardware-config.md` | VRAM settings by GPU, memory optimization techniques, troubleshooting |
Best practices
- Write `program.md` before running: the agent is only as good as its directives, and vague programs waste compute
- Start with the baseline: always run `uv run train.py` manually before launching the loop to confirm the setup works
- Keep `MAX_SEQ_LEN` in `prepare.py` consistent: changing it mid-run invalidates val_bpb comparisons
- Never modify `prepare.py` or `constants.py`: the evaluation harness must stay fixed for results to be meaningful
- Scale improvements before committing: test that a depth-12 improvement also holds at depth-24 before treating it as a fundamental gain
- Commit `program.md` updates: version-control your research directives alongside `results.tsv` for reproducibility
- Monitor VRAM: add `peak_vram_mb` constraints in `program.md` for your GPU's headroom
- No new dependencies: the agent cannot `pip install`; it can only use what is in `pyproject.toml`
Hardware Requirements
| Hardware | Status | Notes |
|---|---|---|
| H100 80GB | Recommended | Default config, full MAX_SEQ_LEN=2048 |
| A100 40GB | Supported | Lower MAX_SEQ_LEN if needed |
| RTX 4090 24GB | Community | Reduce MAX_SEQ_LEN to 512 |
| GTX 1660 Ti 6GB | Community fork | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |
| Apple Silicon (M-series) | MLX port | Community fork; different optimizer API |
| Windows RTX | Community | WSL2 + CUDA recommended |
Key Metrics Reference
| Metric | Direction | Description |
|---|---|---|
| `val_bpb` | Lower = better | Validation bits-per-byte; vocabulary-size-independent |
| `peak_vram_mb` | Lower = more headroom | Peak GPU memory during the training run |
| Experiments/hour | Higher = faster search | ~12 at TIME_BUDGET=300 |
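Bits-per-byte is independent of vocabulary size because the total loss is divided by the number of raw text bytes rather than the number of tokens. The sketch below shows the standard conversion, assuming the loss is a summed negative log-likelihood in nats. `bits_per_byte` and `experiments_per_hour` are illustrative helpers and are not the `evaluate_bpb()` harness in `prepare.py`.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte of text."""
    return total_nll_nats / (math.log(2) * total_bytes)


def experiments_per_hour(time_budget_s: int = 300) -> float:
    """Upper bound on experiments per hour, ignoring startup and evaluation overhead."""
    return 3600 / time_budget_s


# The token count never appears: two tokenizers that assign the same total
# likelihood to the same 1000 bytes get the same score.
print(round(bits_per_byte(693.147, 1000), 4))
print(experiments_per_hour())
```

The second helper reproduces the "~12 at TIME_BUDGET=300" figure in the table; real throughput will be somewhat lower once compilation, evaluation, and agent thinking time are included.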
How to use autoresearch on Cursor
AI-first code editor with Composer
Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- Cursor installed and configured on your development machine
- Node.js version 16.0+ with npm package manager (verify with `node --version`)
- Active project directory or workspace where you want to add autoresearch
Execute installation command
Execute the skills CLI command in your project's root directory to begin installation:
The skills CLI fetches autoresearch from GitHub repository supercent-io/skills-template and configures it for Cursor.
Select Cursor when prompted
The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:
Verify installation
Confirm successful installation by checking the skill directory location:
Reload or restart Cursor to activate autoresearch. Access the skill through slash commands (e.g., /autoresearch) or your agent's skill management interface.
Security & Verification Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.