Agent Skills went from Anthropic experiment to industry default in roughly twelve months. The clearest articulation of why—and how to ship them safely—is the 62-page whitepaper on Kaggle, authored by Tanvi Singhal, Gabriela Hernandez Larios, Debanshu Das, Lavi Nigam, and Smitha Kolam (curated by Shubham Saboo), with the open spec at agentskills.io.
This is not a vendor pitch. It is a builder's manual for two personas:
- Builders — use skills in Claude Code, Codex, Cursor, ChatGPT
- Developers — write, version, evaluate, and govern skill libraries
If you read one document after our Agent Skills complete guide, make it this one. When you are ready to install, ExplainX hosts a discoverable skills registry at /skills—ranked listings with copy-ready install commands across Claude Code, Cursor, Codex, and compatible agents.
TL;DR
| Topic | Takeaway |
|---|---|
| Format | Folder + SKILL.md + optional scripts/, references/, assets/ |
| Standard | agentskills.io — cross-platform |
| Core win | Procedural memory — how to do tasks, on demand |
| Token math | 50 skills ≈ 4K metadata + 2K active body vs 15K monolithic prompt |
| Bad skills | 19% of tasks worse with skill (SkillsBench) |
| vs MCP | MCP = hands; Skills = runbook — they compose |
| vs AGENTS.md | AGENTS.md = always on; Skills = load when matched |
| Marketplace | 40,000+ public listings by early 2026 |
| Discover | ExplainX skills registry — ranked, install-ready |
| Install | npx skills install github.com/google/skills |
Why Skills Exploded: Four Friction Points
The whitepaper names four problems that made multi-agent sprawl the default—and Skills the escape hatch:
1. Context rot
Dump every instruction into one system prompt → LLM performance degrades. Research cited includes Lost in the Middle (Liu et al., TACL 2024) and Context Rot (Chroma, 2025): accuracy falls as input grows—even below the context window limit.
Skills fix: load instructions only when triggered.
2. Procedural memory
LLM agents approximate three memory types:
| Memory type | Analog |
|---|---|
| Episodic | What happened in this chat |
| Semantic | Facts in weights / RAG |
| Procedural | How to execute workflows step-by-step |
Skills are the first credible procedural memory primitive for agents—the "runbook the experienced colleague hands you on day one."
3. Multi-agent overload
Early 2025 default: router + HR sub-agent + invoice sub-agent + slide-deck sub-agent → CI/CD hell.
Skills alternative: one general agent + library of specialists-on-demand. Maintain skills, not deployments.
When multi-agent still wins: genuine parallelism, different security postures, adversarial check-and-balance, heterogeneous models.
4. Portability
A markdown folder works anywhere with filesystem access—Claude Code, Codex, Cursor, Antigravity, ADK. No vendor lock-in at the artifact layer.
Skill Anatomy and Progressive Disclosure
Minimum viable skill:
my-skill/
├── SKILL.md # Required
├── scripts/ # Optional — deterministic code
├── references/ # Optional — load on demand
└── assets/ # Optional — templates, schemas
Three disclosure levels:
| Level | What loads | When |
|---|---|---|
| L1 Metadata | name + description | Every turn (~50 tokens/skill) |
| L2 Body | Full SKILL.md | Skill triggers |
| L3 Bundled | references/, assets/, scripts/ | Body references or executes |
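A minimal sketch of the three levels, assuming skills live as folders under one root directory. The function names (`load_index`, `load_body`, `load_reference`) and the tiny frontmatter parser are illustrative assumptions, not the spec's or any runtime's API; a real loader would use a YAML library.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillEntry:
    """L1 metadata: the only part that stays in context every turn."""
    name: str
    description: str
    path: Path  # folder that contains SKILL.md


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split SKILL.md text into (frontmatter fields, body).

    Toy parser (assumption): handles `key: value` and `key: |` blocks only,
    and folds block lines into one space-separated string.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    fields: dict[str, str] = {}
    key = None
    for line in lines[1:end]:
        if line.startswith((" ", "\t")) and key is not None:
            fields[key] = (fields[key] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            fields[key] = "" if value == "|" else value
    return fields, "\n".join(lines[end + 1:]).strip()


def load_index(root: Path) -> list[SkillEntry]:
    """L1: read name + description for every skill folder under root."""
    entries = []
    for skill_md in sorted(root.glob("*/SKILL.md")):
        fields, _ = split_frontmatter(skill_md.read_text(encoding="utf-8"))
        entries.append(SkillEntry(fields.get("name", skill_md.parent.name),
                                  fields.get("description", ""),
                                  skill_md.parent))
    return entries


def load_body(entry: SkillEntry) -> str:
    """L2: read the full body only after the skill has triggered."""
    text = (entry.path / "SKILL.md").read_text(encoding="utf-8")
    return split_frontmatter(text)[1]


def load_reference(entry: SkillEntry, relative: str) -> str:
    """L3: read a bundled file only when the body points at it."""
    return (entry.path / "references" / relative).read_text(encoding="utf-8")


# Demo on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "cafe-preparation"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "---\nname: cafe-preparation\ndescription: |\n"
        "  Calculates daily ingredient needs.\n"
        "  Use when estimating quantities.\n"
        "---\n# Steps\n1. Read the sales forecast.\n",
        encoding="utf-8",
    )
    index = load_index(Path(tmp))   # cheap: metadata only
    body = load_body(index[0])      # paid only after a trigger
```

The point of the split is that `load_index` runs for every skill on every turn, while `load_body` and `load_reference` run for at most the one skill that matched.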
Token economics example from the paper: 50 workflows as one prompt = 15,000 tokens/turn. As 50 skills = ~4,000 description tokens + ~2,000 active body ≈ 6,000 total—with 49 bodies on disk.
Anthropic examples cite reductions from ~150K → ~2K active tokens when converting workflows to skills—aligned with Matt Pocock v1.0's 63% savings via the same pattern.
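The paper's 50-workflow arithmetic can be checked in a few lines. The three constants are the figures quoted above; the function name is only illustrative.

```python
def per_turn_tokens(metadata_tokens: int, active_body_tokens: int) -> int:
    """Tokens a skill library costs on a turn where one skill is active."""
    return metadata_tokens + active_body_tokens


MONOLITHIC_PROMPT = 15_000  # 50 workflows in one system prompt
METADATA = 4_000            # descriptions for all 50 skills
ACTIVE_BODY = 2_000         # the single SKILL.md body that triggered

skills_total = per_turn_tokens(METADATA, ACTIVE_BODY)
savings = 1 - skills_total / MONOLITHIC_PROMPT

print(f"skills: {skills_total} tokens/turn")
print(f"saving vs monolithic prompt: {savings:.0%}")
```

On a turn where no skill triggers, the cost drops to the metadata alone.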
Case Study: mattpocock/skills at 135K Stars
The whitepaper's theory meets production in Matt Pocock's skills repo, the most-starred open-source skill library (135,000+ stars, 11,700+ forks, v1.0.1 June 2026). Full inventory: Matt Pocock agent skills guide.
Why it validates the whitepaper thesis
| Whitepaper concept | Pocock implementation |
|---|---|
| Procedural memory | /tdd, diagnosing-bugs, codebase-design encode how to engineer |
| Progressive disclosure | v1.0 loads compact index first; bodies on invoke (~63% token cut) |
| One agent + library | User-invoked orchestrators (/grill-me, /to-prd) call model-invoked discipline |
| Eval before ship | writing-great-skills + GLOSSARY.md; changesets for semver |
| Portability | npx skills@latest add mattpocock/skills across Claude Code, Cursor, Codex |
User-invoked vs model-invoked (composition pattern)
Pocock's v1.0 taxonomy (docs/invocation.md) is a live example of the whitepaper's DAG orchestration:
- User-invoked = orchestrate (/grill-with-docs, /ask-matt, /triage)
- Model-invoked = discipline (grilling, domain-modeling, tdd)
- Rule: user-invoked may call model-invoked; never another user-invoked
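The composition rule is easy to enforce mechanically. Below is a hedged sketch of a check a library maintainer might run in CI. The skill names come from the list above, but the call edges and the `invalid_calls` helper are hypothetical and not taken from the repo.

```python
from enum import Enum


class Invocation(Enum):
    USER = "user-invoked"
    MODEL = "model-invoked"


def invalid_calls(kinds: dict[str, Invocation],
                  calls: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return every (caller, callee) edge that breaks the stated rule.

    Only the rule from the text is encoded: a user-invoked skill may call
    a model-invoked skill but never another user-invoked skill.
    """
    return [
        (caller, callee)
        for caller, callee in calls
        if kinds[caller] is Invocation.USER and kinds[callee] is Invocation.USER
    ]


KINDS = {
    "/grill-with-docs": Invocation.USER,
    "/triage": Invocation.USER,
    "grilling": Invocation.MODEL,
    "domain-modeling": Invocation.MODEL,
}

# Hypothetical edges, for illustration only.
CALLS = [
    ("/grill-with-docs", "grilling"),         # allowed
    ("/grill-with-docs", "domain-modeling"),  # allowed
    ("/triage", "/grill-with-docs"),          # violates the rule
]

violations = invalid_calls(KINDS, CALLS)
```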
This is the production workaround for the skill-invocable tier gap Pocock flagged—helper skills stay model-invoked until the spec adds a third tier.
Four failure modes → skill library design
The README maps four engineering failures to skills—the same "one skill, one job" rule the whitepaper's cheatsheet recommends:
- Misalignment → /grill-me, /grill-with-docs
- Verbosity → CONTEXT.md + domain-modeling (ubiquitous language)
- Broken code → /tdd, diagnosing-bugs (feedback loops)
- Ball of mud → /improve-codebase-architecture, codebase-design
Contrast with GSD / BMAD / Spec-Kit: Pocock explicitly rejects process frameworks that own the workflow. Skills stay small, composable, adaptable—matching the whitepaper's portability argument.
Enterprise lessons from a solo-author monorepo
Even a one-person library demonstrates governance patterns enterprises need:
- /setup-matt-pocock-skills: one-time config (tracker, labels, docs) = onboarding skill
- Shared primitives: codebase-design + domain-modeling deduplicate vocabulary across skills (DRY for procedural memory)
- Breaking changes via semver: v1.0 removed zoom-out, renamed diagnose → diagnosing-bugs; dependents updated in same release
- Read/Draft/Act analog: /triage for read-only routing; /to-prd for draft artifacts; git guardrails before action-allowed git ops
If your team ships one reference implementation, study this repo before writing internal skills from scratch.
Two Paths to Build Your First Skill
Path A: Translate what you already know
Compliance runbooks, HR onboarding guides, 30-page ops docs—no coding required. Focus on YAML frontmatter:
---
name: cafe-preparation
description: |
  Calculates daily ingredient needs and generates prep sheets.
  Use when estimating quantities or shopping lists.
  Do NOT use for shift scheduling or accounting.
---
Description = routing algorithm. Front-load trigger keywords; state when NOT to use.
Naming: snake_case directories, kebab-case names, gerund form (processing-pdfs). Avoid utils, tools, vendor prefixes.
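These frontmatter rules lend themselves to a quick lint. The sketch below is an assumption-laden heuristic, not a spec validator: it checks for the literal phrases "Use when" and "Do NOT use" because the example above uses them, and `lint_frontmatter` is a made-up helper name.

```python
import re

KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
GENERIC_NAMES = {"utils", "tools"}  # names the text says to avoid


def lint_frontmatter(name: str, description: str) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if not KEBAB.match(name):
        problems.append("name is not kebab-case")
    if name in GENERIC_NAMES:
        problems.append("name is too generic")
    lowered = description.lower()
    if "use when" not in lowered:
        problems.append("description has no 'Use when' trigger clause")
    if "do not use" not in lowered:
        problems.append("description has no 'Do NOT use' boundary")
    return problems


GOOD_DESCRIPTION = (
    "Calculates daily ingredient needs and generates prep sheets. "
    "Use when estimating quantities or shopping lists. "
    "Do NOT use for shift scheduling or accounting."
)

print(lint_frontmatter("cafe-preparation", GOOD_DESCRIPTION))
print(lint_frontmatter("utils", "Helper functions."))
```

A string match cannot tell whether the trigger clause is any good; it only catches descriptions that never state one.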
Path B: Crystallize what the agent just did
Agent completes a reusable workflow → propose SKILL.md draft → human reviews. Tools: Anthropic skill-creator, Nous Hermes Agent, self-improving-agent-skills patterns.
Warning: Unreviewed agent-drafted skills are often worse than no skill.
Skills vs MCP vs AGENTS.md
One-line mental model from Appendix A:
System prompt = instinct. AGENTS.md = project README. Tools/MCP = hands. RAG = library. Skills = the runbook.
| Primitive | Role |
|---|---|
| MCP | Connect to external systems |
| Skill | Teach how to use those tools for a workflow |
| AGENTS.md | Always-on conventions + optional skill catalog router |
| Karpathy LLM Wiki | Compiled organizational knowledge (different layer) |
They compose, not compete.
Installing Skills Today (Three Paradigms)
- File drop: .agents/skills/ at project root (emerging convention); tool-specific paths vary. Community managers (skillport, openskills) symlink across CLIs.
- UI install: Web/enterprise registries; upload folders without terminal.
- Programmatic: Google ADK SkillToolset, auto-generates load_skill routing tools.
npx skills install github.com/google/skills
Google launched the official repo at github.com/google/skills at Cloud Next 2026—works across Antigravity, Codex, Claude Code, and compliant agents.
Evaluation: A Skill Without a Test Is a Hope
SkillsBench: 19% of tasks performed worse with a skill than without.
Four failure modes
| Mode | Symptom |
|---|---|
| Trigger | Wrong skill fires—or right one never fires |
| Execution | Triggers but wrong output / tool calls |
| Token budget | Huge body causes context rot when co-loaded |
| Regression | New skill breaks routing for existing library |
Vercel production data (cited)
- 56% non-invocation rate for skills expected to fire consistently
- Skill stripped of instructions: 58% pass vs 63% with no skill — active harm
- Passive AGENTS.md index: 100% pass vs 53% baseline — global context ≠ specialist skills
Target: 90% trigger accuracy on descriptions.
Evaluation toolkit (five patterns)
| Pattern | Use |
|---|---|
| Eval-as-unit-test | CI on every change |
| Golden dataset | Versioned input/output pairs in skill dir |
| LLM-as-judge | Rubric scoring at scale (swap positions for bias) |
| Adversarial red-team | Negative boundary + rephrasing tests |
| Canary/shadow | 1% live traffic before action-allowed |
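The "swap positions for bias" note in the LLM-as-judge row can be made concrete. This is a sketch under assumptions: the judge is any callable that returns "first" or "second", and the two stub judges stand in for a real model call, which is not shown.

```python
from typing import Callable

# (rubric, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, rubric: str, a: str, b: str) -> str:
    """Ask twice with positions swapped; only a consistent verdict counts."""
    forward = judge(rubric, a, b)    # a is shown first
    backward = judge(rubric, b, a)   # b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # verdict flipped with position, so treat it as bias


def always_first(rubric: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(rubric: str, first: str, second: str) -> str:
    """Stub judge that is position-independent (it just likes length)."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(always_first, "cites order id", "x", "yy"))
print(debiased_preference(prefers_longer, "cites order id", "short", "much longer answer"))
```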
Evaluation-Driven Development (EDD): write 3 JSON eval cases before drafting SKILL.md:
{
  "case_id": "refund_dup_charge_001",
  "input": "I was charged twice for order #4521 last Tuesday",
  "expected_skill": "refund_processor",
  "expected_tool_calls": [
    {"tool": "lookup_order", "args": {"order_id": "4521"}}
  ],
  "rubric": ["acknowledges duplicate", "cites order id"]
}
Critical: Never evaluate skills in isolation. Production co-loads 5–15 skills. A 5,000-word body that passes alone may fail in the library.
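A minimal sketch of scoring one observed run against an eval case like the one above. How `routed_skill` and `tool_calls` get captured depends on your runtime and is assumed here; rubric items are left to a judge and not scored.

```python
import json

CASE = json.loads("""
{
  "case_id": "refund_dup_charge_001",
  "input": "I was charged twice for order #4521 last Tuesday",
  "expected_skill": "refund_processor",
  "expected_tool_calls": [
    {"tool": "lookup_order", "args": {"order_id": "4521"}}
  ],
  "rubric": ["acknowledges duplicate", "cites order id"]
}
""")


def score_case(case: dict, routed_skill: str, tool_calls: list[dict]) -> dict[str, bool]:
    """Compare one recorded agent run against an eval case."""
    trigger_ok = routed_skill == case["expected_skill"]
    # Every expected call must appear among the observed calls (order ignored).
    tools_ok = all(expected in tool_calls for expected in case["expected_tool_calls"])
    return {"trigger": trigger_ok, "tools": tools_ok}


def trigger_accuracy(results: list[dict[str, bool]]) -> float:
    """Fraction of cases whose routing matched; the text targets 0.90."""
    return sum(r["trigger"] for r in results) / len(results)


good = score_case(CASE, "refund_processor",
                  [{"tool": "lookup_order", "args": {"order_id": "4521"}}])
bad = score_case(CASE, "return_policy", [])
print(good, bad, trigger_accuracy([good, bad]))
```

To respect the co-loading warning, the recorded runs should come from an agent with the whole library installed, not just the skill under test.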
Read / Draft / Act ladder
| Tier | Capability | Gate |
|---|---|---|
| Read-Only | Query, describe | 90% trigger accuracy |
| Draft-Only | Produce for human review | 20+ golden cases |
| Action-Allowed | Irreversible ops | Adversarial + pass^k + human sign-off |
pass^k: require success on k consecutive runs—not one lucky pass. GPT-4o: 61% pass^1 vs under 25% pass^8 on tau-bench.
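One common way to estimate pass^k from n recorded trials with c successes is the combinatorial ratio C(c, k) / C(n, k). As far as I know this is the estimator tau-bench reports, but treat that as an assumption. The sketch below scores independent trials rather than literally consecutive ones.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent runs of one task ALL succeed.

    C(c, k) / C(n, k): of all ways to pick k of the n recorded trials,
    the share made up only of successful ones.
    """
    if not 0 < k <= trials:
        raise ValueError("need 0 < k <= trials")
    return comb(successes, k) / comb(trials, k)


def suite_pass_hat_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average pass^k over tasks, given (successes, trials) for each task."""
    return sum(pass_hat_k(c, n, k) for c, n in per_task) / len(per_task)


# One flaky run out of eight barely dents pass^1 but zeroes pass^8.
print(pass_hat_k(7, 8, 1))
print(pass_hat_k(7, 8, 8))
```

This is why the metric suits action-allowed skills: a single failure in the sample is enough to block promotion at high k.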
Production: Skills Are the Unit of Improvement
Reverse-engineering Claude Code v2.1.88 (Liu, Zhao, Shang, Shen 2026): 98.4% of codebase is operational infrastructure—permissions, compaction, subagents—only 1.6% is the agent loop.
Implication: As models converge on reasoning, deterministic engineering around the model differentiates reliability. The composable unit is the skill.
| Improvement style | Cycle time | Context tax |
|---|---|---|
| Model swap | Days–weeks | None (weights) |
| System prompt edit | Minutes–hours | Every turn |
| Fine-tune | Weeks–months | None |
| New skill | Hours–days | On-demand only |
Google Agents CLI ships seven lifecycle skills (including scaffold, build, eval, deploy, publish, and observe): expertise lives in skills, while the runtime is commoditized.
Meta-Skills and Self-Improvement
Four buckets:
- Authoring — skill-creator, ADK skill factory
- Trace harvesting — successful runs → draft skills
- Improvement — SkillOptimizer, description tuning loops, Karpathy autoresearch
- Library evolution — Voyager-style skill growth from production traffic
Rules:
- Agent-written skills enter at draft tier always
- Human reviews first diffs—even when metrics improve
- Don't start with meta-skills on an empty library
This connects to Microsoft SkillOpt and to research on self-harness agents.
Composition: DAGs and Capability Profiles
Real workflows exceed one skill. The paper advocates:
- DAG orchestration — structured handoffs via file bus, not LLM context as database
- Capability profiles — swappable bundles of active skills + tools + model params
- Canonical node types — Generator, Reviewer/Gate, Pipeline, Inversion, Domain Context Wrapper
Context debt: capitalized "ALWAYS DO X" in descriptions → models ignore them. Shift intelligence left into testable scripts/.
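A rough sketch of DAG orchestration over a file bus, assuming each node is a plain function and each handoff is a JSON file in a shared directory. `run_dag`, the node names, and the file layout are illustrative choices, not the paper's design.

```python
import json
import tempfile
from graphlib import TopologicalSorter
from pathlib import Path
from typing import Callable

# A node receives its upstream outputs (parsed from disk) and returns a dict.
Node = Callable[[dict[str, dict]], dict]


def run_dag(nodes: dict[str, Node], deps: dict[str, set[str]], bus: Path) -> dict:
    """Run nodes in dependency order, handing results over as JSON files.

    Each output lands in <bus>/<name>.json; a node sees only the files of
    its declared dependencies. Returns the output of the last node run.
    """
    graph = {name: deps.get(name, set()) for name in nodes}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        inputs = {
            dep: json.loads((bus / f"{dep}.json").read_text(encoding="utf-8"))
            for dep in deps.get(name, set())
        }
        output = nodes[name](inputs)
        (bus / f"{name}.json").write_text(json.dumps(output), encoding="utf-8")
    return json.loads((bus / f"{order[-1]}.json").read_text(encoding="utf-8"))


def generator(_inputs: dict[str, dict]) -> dict:
    """Generator node: produces a draft artifact."""
    return {"draft": "Prep sheet: 12 kg flour"}


def reviewer(inputs: dict[str, dict]) -> dict:
    """Reviewer/Gate node: a deterministic check on the upstream draft."""
    draft = inputs["generator"]["draft"]
    return {"approved": "flour" in draft, "draft": draft}


with tempfile.TemporaryDirectory() as tmp:
    result = run_dag(
        {"generator": generator, "reviewer": reviewer},
        {"reviewer": {"generator"}},
        Path(tmp),
    )
print(result)
```

Because every handoff is a file, each step can be inspected, diffed, and replayed without keeping intermediate state in the model's context.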
Choosing Among 40,000+ Public Skills
Heuristics:
- First-party first: google/skills, anthropics/skills, stripe/ai, microsoft/skills
- Pin versions: community skills change without notice
- Audit like code: supply-chain hygiene; skills run in your context
| Source | Trust default |
|---|---|
| Vendor first-party | Trust; pin version |
| Org-curated private | Trust within org; PR review |
| Community | Audit before adopt; pin aggressively |
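One way to pin aggressively is to record a content hash when a skill passes review and refuse to load it if the folder later differs. This is a sketch of that idea only; `skill_digest` and `verify_pin` are hypothetical helpers, and where the pinned digest is stored (a lockfile, say) is left open.

```python
import hashlib
import tempfile
from pathlib import Path


def skill_digest(skill_dir: Path) -> str:
    """SHA-256 over every file's relative path and bytes, in sorted order."""
    digest = hashlib.sha256()
    files = sorted(p for p in skill_dir.rglob("*") if p.is_file())
    for path in files:
        digest.update(path.relative_to(skill_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def verify_pin(skill_dir: Path, pinned_digest: str) -> bool:
    """True only if the folder still matches the digest recorded at review."""
    return skill_digest(skill_dir) == pinned_digest


with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "return-policy"
    skill.mkdir()
    (skill / "SKILL.md").write_text("v1 instructions", encoding="utf-8")
    pinned = skill_digest(skill)                # record at review time
    unchanged_ok = verify_pin(skill, pinned)
    (skill / "SKILL.md").write_text("v2 instructions", encoding="utf-8")
    changed_ok = verify_pin(skill, pinned)      # upstream edit is detected

print(unchanged_ok, changed_ok)
```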
Marketplaces: SkillsMP (1.2M+ indexed), VoltAgent/awesome-agent-skills, addyosmani/agent-skills. For a curated, ranked directory with verification signals and one-click install paths, browse ExplainX at /skills—we host discoverable skills from many authors alongside first-party repos like Google and Anthropic.
Appendix B Preview: Retail as Canonical Enterprise Case
The whitepaper's retail case study argues: same agent runtime + same MCP APIs ≠ same customer experience. Domain skills are the moat.
Example library:
| Skill | Owner | Tier |
|---|---|---|
| project-guidance | Trades knowledge | Read-only |
| materials-list | Pro merchandising | Draft-only |
| return-policy | Customer service | Read-only (refunds = separate action skill) |
100 process variants: one agent + 100 skills (~5K metadata tokens) beats 100 subagents or one 1M-token prompt.
Query routing example: "Remodel kids' bathroom" → loads project-guidance → follow-up on delivery loads delivery-window → prior skills released. Unbounded capability, bounded active context.
The Five Rules (Cheatsheet)
- One skill, one job — if you need "and," split it
- Descriptions are the interface — spend more time here than the body
- Skills are dependencies — version, pin, PR review, tests
- Right team owns right skill — domain experts, not central AI bottleneck
- Runtime is interchangeable — portability is the value
Where to Start Tomorrow
From Appendix A:
- Record your best practitioner narrating three workflows (1 hour)
- Pick the most repeated; run without a skill; note failures
- Draft SKILL.md; write 3 eval cases first (EDD)
- Ship read-only tier; iterate description until 90% trigger accuracy
- Resist generating fifty skills on day one
Summary
The Kaggle Agent Skills whitepaper makes the case that the format is settled (agentskills.io) while the work around it—evaluation under co-load, meta-skills, DAG composition, enterprise governance—is just beginning.
Skills give agents procedural memory without context rot. They replace many multi-agent deployments with one agent + a versioned library. They fail predictably when descriptions are vague, bodies are bloated, or evals run in isolation.
Start small. Treat skills as code. Measure what you ship. Browse existing skills at ExplainX /skills before writing your twentieth from scratch.
Related Reading
- Browse Agent Skills on ExplainX
- Matt Pocock Agent Skills (135K+ stars)
- Matt Pocock Skills v1.0 — Progressive Disclosure
- What Are Agent Skills? Complete Guide
- What is MCP?
- Agent Markdown Files Guide
- Karpathy LLM Wiki Pattern
- Microsoft SkillOpt
- OpenAI Deployment Simulation
Whitepaper concepts cited from the Kaggle Agent Skills whitepaper (May 2026), authors Tanvi Singhal, Gabriela Hernandez Larios, Debanshu Das, Lavi Nigam, Smitha Kolam; curated by Shubham Saboo.