What is the Kaggle Agent Skills whitepaper?

A May 2026 whitepaper authored by Tanvi Singhal, Gabriela Hernandez Larios, Debanshu Das, Lavi Nigam, and Smitha Kolam (curated by Shubham Saboo) covering Agent Skills from first principles through production evaluation, meta-skills, composition, and enterprise governance. It is published on Kaggle and aligns with the open standard at agentskills.io.

What are the four problems Agent Skills solve?

Context rot from dumping all instructions into one system prompt; lack of procedural memory (how to do tasks step-by-step); multi-agent overload for workflows that do not need separate deployments; and portability—a folder with SKILL.md works across Claude Code, Codex, Cursor, and other agents with filesystem access.

How does progressive disclosure work in Agent Skills?

Level 1: metadata (name + description) always in context (~50 tokens per skill). Level 2: SKILL.md body loads only when the skill triggers. Level 3: scripts/, references/, and assets/ load or execute on demand without polluting the token window. 100 skills can cost ~5,000 tokens of metadata vs 15,000+ for one monolithic prompt.

How are Agent Skills different from MCP and AGENTS.md?

MCP provides reach—connecting agents to external systems (Drive, Salesforce, BigQuery). Skills provide know-how—how to think about a task. AGENTS.md is always loaded for project conventions; Skills load on demand. Best practice: tight AGENTS.md as router + skills library for specialist workflows.

What does SkillsBench say about bad skills?

SkillsBench (2025) found 19% of real-world agent tasks performed worse with a poorly designed skill than without one. Failure modes: trigger failure, execution failure, token budget failure, and regression when skills overlap. Vercel reported a skill stripped of instructions scored 5 points below no skill at all.

How should teams evaluate Agent Skills before production?

Test under co-loaded conditions (5–15 skills), not in isolation. Cover four gates: trigger accuracy (target 90%), execution quality and tool trajectory, regression in the existing library, and token budget. Graduate skills through Read-Only, Draft-Only, and Action-Allowed tiers with adversarial testing before irreversible actions.

Kaggle Agent Skills Whitepaper: Complete Guide 2026

Agent Skills went from Anthropic experiment to industry default in roughly twelve months. The clearest articulation of why—and how to ship them safely—is the 62-page whitepaper on Kaggle, authored by Tanvi Singhal, Gabriela Hernandez Larios, Debanshu Das, Lavi Nigam, and Smitha Kolam (curated by Shubham Saboo), with the open spec at agentskills.io.

This is not a vendor pitch. It is a builder's manual for two personas:

Builders — use skills in Claude Code, Codex, Cursor, ChatGPT
Developers — write, version, evaluate, and govern skill libraries

If you read one document after our Agent Skills complete guide, make it this one. When you are ready to install, ExplainX hosts a discoverable skills registry at /skills—ranked listings with copy-ready install commands across Claude Code, Cursor, Codex, and compatible agents.

TL;DR

Topic	Takeaway
Format	Folder + `SKILL.md` + optional `scripts/`, `references/`, `assets/`
Standard	agentskills.io — cross-platform
Core win	Procedural memory — how to do tasks, on demand
Token math	50 skills ≈ 4K metadata tokens vs 15K for one monolithic prompt
Bad skills	19% of tasks worse with skill (SkillsBench)
vs MCP	MCP = hands; Skills = runbook — they compose
vs AGENTS.md	AGENTS.md = always on; Skills = load when matched
Marketplace	40,000+ public listings by early 2026
Discover	ExplainX skills registry — ranked, install-ready
Install	`npx skills install github.com/google/skills`

Why Skills Exploded: Four Friction Points

The whitepaper names four problems that made multi-agent sprawl the default—and Skills the escape hatch:

1. Context rot

Dump every instruction into one system prompt → LLM performance degrades. Research cited includes Lost in the Middle (Liu et al., TACL 2024) and Context Rot (Chroma, 2025): accuracy falls as input grows—even below the context window limit.

Skills fix: load instructions only when triggered.

2. Procedural memory

LLMs approximate three kinds of memory:

Memory type	Analog
Episodic	What happened in this chat
Semantic	Facts in weights / RAG
Procedural	How to execute workflows step-by-step

Skills are the first credible procedural memory primitive for agents—the "runbook the experienced colleague hands you on day one."

3. Multi-agent overload

Early 2025 default: router + HR sub-agent + invoice sub-agent + slide-deck sub-agent → CI/CD hell.

Skills alternative: one general agent + library of specialists-on-demand. Maintain skills, not deployments.

When multi-agent still wins: genuine parallelism, different security postures, adversarial check-and-balance, heterogeneous models.

4. Portability

A markdown folder works anywhere with filesystem access—Claude Code, Codex, Cursor, Antigravity, ADK. No vendor lock-in at the artifact layer.

Skill Anatomy and Progressive Disclosure

Minimum viable skill:

my-skill/
├── SKILL.md          # Required
├── scripts/          # Optional — deterministic code
├── references/       # Optional — load on demand
└── assets/           # Optional — templates, schemas

Three disclosure levels:

Level	What loads	When
L1 Metadata	`name` + `description`	Every turn (~50 tokens/skill)
L2 Body	Full `SKILL.md`	Skill triggers
L3 Bundled	`references/`, `assets/`, `scripts/`	Body references or executes

Token economics example from the paper: 50 workflows as one prompt = 15,000 tokens/turn. As 50 skills = ~4,000 description tokens + ~2,000 active body ≈ 6,000 total—with 49 bodies on disk.
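The arithmetic behind that comparison can be sketched in a few lines. The per-item sizes below (300 tokens per workflow, 80 tokens per description) are assumptions back-derived from the quoted totals, not figures from the paper; note that the ~50 tokens per skill quoted for Level 1 would give 2,500 description tokens for 50 skills rather than ~4,000.

```python
def monolithic_tokens(n_workflows: int, tokens_per_workflow: int) -> int:
    """Every workflow's instructions ride along on every turn."""
    return n_workflows * tokens_per_workflow


def skills_tokens(n_skills: int, description_tokens: int, active_body_tokens: int) -> int:
    """All descriptions stay in context; only the triggered body is loaded."""
    return n_skills * description_tokens + active_body_tokens


# Assumed per-item sizes, chosen to reproduce the totals quoted above.
mono = monolithic_tokens(50, 300)       # 15,000 tokens per turn
skills = skills_tokens(50, 80, 2_000)   # 4,000 + 2,000 = 6,000 tokens per turn
print(mono, skills, f"{1 - skills / mono:.0%} fewer tokens per turn")
```

Under these assumptions the skills layout carries 60% fewer tokens per turn, and the gap widens as the library grows because only description tokens scale with the number of skills.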

Anthropic examples cite reductions from ~150K → ~2K active tokens when converting workflows to skills—aligned with Matt Pocock v1.0's 63% savings via the same pattern.

Case Study: mattpocock/skills at 135K Stars

The whitepaper's theory meets production in Matt Pocock's skills repo, the most-starred open-source skill library (135,000+ stars, 11,700+ forks, v1.0.1, June 2026). Full inventory: Matt Pocock agent skills guide.

Why it validates the whitepaper thesis

Whitepaper concept	Pocock implementation
Procedural memory	`/tdd`, `diagnosing-bugs`, `codebase-design` encode how to engineer
Progressive disclosure	v1.0 loads compact index first; bodies on invoke (~63% token cut)
One agent + library	User-invoked orchestrators (`/grill-me`, `/to-prd`) call model-invoked discipline
Eval before ship	`writing-great-skills` + `GLOSSARY.md`; changesets for semver
Portability	`npx skills@latest add mattpocock/skills` across Claude Code, Cursor, Codex

User-invoked vs model-invoked (composition pattern)

Pocock's v1.0 taxonomy (docs/invocation.md) is a live example of the whitepaper's DAG orchestration:

User-invoked = orchestrate (/grill-with-docs, /ask-matt, /triage)
Model-invoked = discipline (grilling, domain-modeling, tdd)
Rule: user-invoked may call model-invoked; never another user-invoked

This is the production workaround for the skill-invocable tier gap Pocock flagged—helper skills stay model-invoked until the spec adds a third tier.
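The composition rule is easy to check mechanically. The sketch below is a hypothetical validator, not something shipped in the repo: the registry layout and the call edges between skills are illustrative assumptions, and only the skill names and the rule itself come from the text above.

```python
from typing import Dict, List

# Hypothetical registry: skill name -> invocation kind and the skills it calls.
# The call edges are made up for illustration.
SKILLS: Dict[str, dict] = {
    "grill-with-docs": {"kind": "user", "calls": ["grilling", "domain-modeling"]},
    "triage": {"kind": "user", "calls": []},
    "grilling": {"kind": "model", "calls": []},
    "domain-modeling": {"kind": "model", "calls": []},
    "tdd": {"kind": "model", "calls": []},
}


def composition_violations(skills: Dict[str, dict]) -> List[str]:
    """Return one message per user-invoked skill that calls another user-invoked skill."""
    problems = []
    for name, spec in skills.items():
        if spec["kind"] != "user":
            continue
        for callee in spec["calls"]:
            if skills[callee]["kind"] == "user":
                problems.append(f"{name} -> {callee}: user-invoked may not call user-invoked")
    return problems


print(composition_violations(SKILLS))  # an empty list means the rule holds
```

A check like this could run in CI alongside the skill library so that a new orchestrator cannot quietly chain into another orchestrator.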

Four failure modes → skill library design

The README maps four engineering failures to skills—the same "one skill, one job" rule the whitepaper's cheatsheet recommends:

Misalignment → /grill-me, /grill-with-docs
Verbosity → CONTEXT.md + domain-modeling (ubiquitous language)
Broken code → /tdd, diagnosing-bugs (feedback loops)
Ball of mud → /improve-codebase-architecture, codebase-design

Contrast with GSD / BMAD / Spec-Kit: Pocock explicitly rejects process frameworks that own the workflow. Skills stay small, composable, adaptable—matching the whitepaper's portability argument.

Enterprise lessons from a solo-author monorepo

Even a one-person library demonstrates governance patterns enterprises need:

/setup-matt-pocock-skills — one-time config (tracker, labels, docs) = onboarding skill
Shared primitives — codebase-design + domain-modeling deduplicate vocabulary across skills (DRY for procedural memory)
Breaking changes via semver — v1.0 removed zoom-out, renamed diagnose → diagnosing-bugs; dependents updated in same release
Read/Draft/Act analog — /triage for read-only routing; /to-prd for draft artifacts; git guardrails before action-allowed git ops

If your team ships one reference implementation, study this repo before writing internal skills from scratch.

Two Paths to Build Your First Skill

Path A: Translate what you already know

Compliance runbooks, HR onboarding guides, 30-page ops docs—no coding required. Focus on YAML frontmatter:

---
name: cafe-preparation
description: |
  Calculates daily ingredient needs and generates prep sheets.
  Use when estimating quantities or shopping lists.
  Do NOT use for shift scheduling or accounting.
---

Description = routing algorithm. Front-load trigger keywords; state when NOT to use.

Naming: snake_case directories, kebab-case names, gerund form (processing-pdfs). Avoid utils, tools, vendor prefixes.
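A small linter can enforce these frontmatter habits before a skill is reviewed. This is a minimal sketch under stated assumptions: the parser handles only the tiny YAML subset used in the example (plain `key: value` pairs and `|` block scalars, joined with spaces), and the "use when" / "do not use" phrase checks are an illustrative heuristic, not a rule taken from the whitepaper or the spec.

```python
import re

SKILL_MD = """\
---
name: cafe-preparation
description: |
  Calculates daily ingredient needs and generates prep sheets.
  Use when estimating quantities or shopping lists.
  Do NOT use for shift scheduling or accounting.
---
# Body goes here
"""


def read_frontmatter(text: str) -> dict:
    """Parse a tiny YAML subset: `key: value` and `key: |` block scalars."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening --- delimiter")
    end = lines.index("---", 1)  # raises ValueError if the closing --- is missing
    fields, key = {}, None
    for line in lines[1:end]:
        if line.startswith((" ", "\t")) and key is not None:
            # Indented line: continuation of the current block scalar.
            fields[key] = (fields[key] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            fields[key] = "" if value in ("|", ">") else value
    return fields


def lint(fields: dict) -> list:
    """Return a list of warnings; an empty list means the checks passed."""
    warnings = []
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", fields.get("name", "")):
        warnings.append("name should be kebab-case")
    description = fields.get("description", "").lower()
    if "use when" not in description:
        warnings.append("description should say when to use the skill")
    if "do not use" not in description:
        warnings.append("description should say when NOT to use the skill")
    return warnings


print(lint(read_frontmatter(SKILL_MD)))  # [] for the example above
```

For real skill files a full YAML parser would be the safer choice; the hand-rolled reader here only keeps the sketch dependency-free.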

Path B: Crystallize what the agent just did

Agent completes a reusable workflow → propose SKILL.md draft → human reviews. Tools: Anthropic skill-creator, Nous Hermes Agent, self-improving-agent-skills patterns.

Warning: Unreviewed agent-drafted skills are often worse than no skill.

Skills vs MCP vs AGENTS.md

One-line mental model from Appendix A:

System prompt = instinct. AGENTS.md = project README. Tools/MCP = hands. RAG = library. Skills = the runbook.

Primitive	Role
MCP	Connect to external systems
Skill	Teach how to use those tools for a workflow
AGENTS.md	Always-on conventions + optional skill catalog router
Karpathy LLM Wiki	Compiled organizational knowledge (different layer)

They compose, not compete.

Installing Skills Today (Three Paradigms)

File drop — .agents/skills/ at project root (emerging convention); tool-specific paths vary. Community managers (skillport, openskills) symlink across CLIs.
UI install — Web/enterprise registries; upload folders without terminal.
Programmatic — Google ADK SkillToolset, auto-generates load_skill routing tools.

npx skills install github.com/google/skills

Google launched the official repo at github.com/google/skills at Cloud Next 2026—works across Antigravity, Codex, Claude Code, and compliant agents.

Evaluation: A Skill Without a Test Is a Hope

SkillsBench: 19% of tasks performed worse with a skill than without.

Four failure modes

Mode	Symptom
Trigger	Wrong skill fires—or right one never fires
Execution	Triggers but wrong output / tool calls
Token budget	Huge body causes context rot when co-loaded
Regression	New skill breaks routing for existing library

Vercel production data (cited)

56% non-invocation rate for skills expected to fire consistently
Skill stripped of instructions: 58% pass vs 63% with no skill — active harm
Passive AGENTS.md index: 100% pass vs 53% baseline — global context ≠ specialist skills

Target: 90% trigger accuracy on descriptions.

Evaluation toolkit (five patterns)

Pattern	Use
Eval-as-unit-test	CI on every change
Golden dataset	Versioned input/output pairs in skill dir
LLM-as-judge	Rubric scoring at scale (swap positions for bias)
Adversarial red-team	Negative boundary + rephrasing tests
Canary/shadow	1% live traffic before action-allowed

Evaluation-Driven Development (EDD): write 3 JSON eval cases before drafting SKILL.md:

{
  "case_id": "refund_dup_charge_001",
  "input": "I was charged twice for order #4521 last Tuesday",
  "expected_skill": "refund_processor",
  "expected_tool_calls": [
    {"tool": "lookup_order", "args": {"order_id": "4521"}}
  ],
  "rubric": ["acknowledges duplicate", "cites order id"]
}

Critical: Never evaluate skills in isolation. Production co-loads 5–15 skills. A 5,000-word body that passes alone may fail in the library.
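Eval cases in the shape shown above can be scored for trigger accuracy with very little code. In this sketch the `keyword_router` function is a stand-in for the real agent (in practice it would call the agent with the full co-loaded library), and every case except the first is a made-up example; the 90% target is the one quoted in this article.

```python
from typing import Callable, List

TARGET = 0.90  # trigger-accuracy target quoted in the text

# Hypothetical eval cases; only the first mirrors the JSON example above.
CASES = [
    {"input": "I was charged twice for order #4521 last Tuesday", "expected_skill": "refund_processor"},
    {"input": "Where is my package?", "expected_skill": "order_tracker"},
    {"input": "Tell me a joke", "expected_skill": None},  # negative case: nothing should fire
]


def keyword_router(text: str):
    """Stand-in for the real agent: returns the skill it would trigger, or None."""
    lowered = text.lower()
    if "charged" in lowered or "refund" in lowered:
        return "refund_processor"
    if "package" in lowered or "delivery" in lowered:
        return "order_tracker"
    return None


def trigger_accuracy(cases: List[dict], route: Callable) -> float:
    """Fraction of cases where the routed skill equals the expected one."""
    hits = sum(1 for case in cases if route(case["input"]) == case["expected_skill"])
    return hits / len(cases)


accuracy = trigger_accuracy(CASES, keyword_router)
print(f"trigger accuracy {accuracy:.0%}, gate {'passed' if accuracy >= TARGET else 'failed'}")
```

Including negative cases matters: a router that fires on everything would otherwise look perfect on a set made only of positive examples.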

Read / Draft / Act ladder

Tier	Capability	Gate
Read-Only	Query, describe	90% trigger accuracy
Draft-Only	Produce for human review	20+ golden cases
Action-Allowed	Irreversible ops	Adversarial + pass^k + human sign-off

pass^k: the probability that all k independent runs of the same task succeed, so one lucky pass is not enough. GPT-4o: 61% pass^1 vs under 25% pass^8 on tau-bench.
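For a single task with a fixed number of recorded runs, pass^k can be estimated with the combinatorial formula C(successes, k) / C(trials, k), which is the estimator used by tau-bench as far as I know; the run counts below are hypothetical and a benchmark-level score would average this value over all tasks.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent runs of one task all succeed.

    Uses the combinatorial estimator C(successes, k) / C(trials, k).
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


# Hypothetical task: 6 successes out of 8 recorded runs.
print(pass_hat_k(6, 8, 1))  # 0.75
print(pass_hat_k(6, 8, 4))  # 15 / 70, roughly 0.21
print(pass_hat_k(6, 8, 8))  # 0.0, because at least one run failed
```

The drop from 0.75 to roughly 0.21 at k = 4 shows why a skill that looks acceptable on a single run can still be too flaky for the Action-Allowed tier.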

Production: Skills Are the Unit of Improvement

Reverse-engineering Claude Code v2.1.88 (Liu, Zhao, Shang, Shen 2026): 98.4% of codebase is operational infrastructure—permissions, compaction, subagents—only 1.6% is the agent loop.

Implication: As models converge on reasoning, deterministic engineering around the model differentiates reliability. The composable unit is the skill.

Improvement style	Cycle time	Context tax
Model swap	Days–weeks	None (weights)
System prompt edit	Minutes–hours	Every turn
Fine-tune	Weeks–months	None
New skill	Hours–days	On-demand only

Google Agents CLI ships seven lifecycle skills (including scaffold, build, eval, deploy, publish, and observe): the expertise lives in the skills, while the runtime is commoditized.

Meta-Skills and Self-Improvement

Four buckets:

Authoring — skill-creator, ADK skill factory
Trace harvesting — successful runs → draft skills
Improvement — SkillOptimizer, description tuning loops, Karpathy autoresearch
Library evolution — Voyager-style skill growth from production traffic

Rules:

Agent-written skills enter at draft tier always
Human reviews first diffs—even when metrics improve
Don't start with meta-skills on an empty library

This connects to Microsoft SkillOpt and to research on self-harness agents.

Composition: DAGs and Capability Profiles

Real workflows exceed one skill. The paper advocates:

DAG orchestration — structured handoffs via file bus, not LLM context as database
Capability profiles — swappable bundles of active skills + tools + model params
Canonical node types — Generator, Reviewer/Gate, Pipeline, Inversion, Domain Context Wrapper
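The file-bus idea in the first item can be illustrated with a toy pipeline. Everything named here (the three nodes, the JSON-per-node layout, the `run` helper) is a hypothetical sketch rather than the paper's implementation; the point is only that each node reads its upstream results from disk instead of relying on the model's context to carry them.

```python
import json
import tempfile
from graphlib import TopologicalSorter
from pathlib import Path

# Hypothetical three-node pipeline: each node maps to the nodes it depends on.
DAG = {"generate": set(), "review": {"generate"}, "publish": {"review"}}


def generate(inputs: dict) -> dict:
    return {"draft": "Q3 summary"}


def review(inputs: dict) -> dict:
    draft = inputs["generate"]["draft"]
    return {"draft": draft, "approved": len(draft) > 0}


def publish(inputs: dict) -> dict:
    return {"published": inputs["review"]["approved"]}


NODES = {"generate": generate, "review": review, "publish": publish}


def run(dag: dict, nodes: dict, bus: Path) -> dict:
    """Run nodes in dependency order; every handoff is a JSON file on the bus."""
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        # Read upstream outputs from disk rather than carrying them in context.
        inputs = {dep: json.loads((bus / f"{dep}.json").read_text()) for dep in dag[name]}
        (bus / f"{name}.json").write_text(json.dumps(nodes[name](inputs)))
    return json.loads((bus / f"{order[-1]}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    result = run(DAG, NODES, Path(tmp))
print(result)  # {'published': True}
```

Because each handoff is a file, a failed node can be rerun or inspected without replaying the whole conversation, which is the practical argument for not treating LLM context as the database.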

Context debt: models tend to ignore capitalized "ALWAYS DO X" directives in descriptions. Shift that intelligence left into testable scripts/.

Choosing Among 40,000+ Public Skills

Heuristics:

First-party first — google/skills, anthropics/skills, stripe/ai, microsoft/skills
Pin versions — community skills change without notice
Audit like code — supply-chain hygiene; skills run in your context

Source	Trust default
Vendor first-party	Trust; pin version
Org-curated private	Trust within org; PR review
Community	Audit before adopt; pin aggressively
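One way to make "pin versions" concrete is to record a content digest of the skill folder at review time and compare it on every update. This is only a sketch of that idea, with a hypothetical `skill_digest` helper and a throwaway skill directory; pinning a git commit or a package version would serve the same purpose.

```python
import hashlib
import tempfile
from pathlib import Path


def skill_digest(skill_dir: Path) -> str:
    """SHA-256 over the relative path and contents of every file, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in skill_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(skill_dir).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "my-skill"
    skill.mkdir()
    (skill / "SKILL.md").write_text("---\nname: my-skill\n---\n")
    pinned = skill_digest(skill)             # record this at review time
    (skill / "SKILL.md").write_text("---\nname: my-skill\n---\nnew step\n")
    changed = skill_digest(skill) != pinned  # True: content drifted from the pin
print(changed)
```

A mismatch does not say whether the change is malicious, only that the reviewed content is no longer what would be loaded, which is the signal to re-audit.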

Marketplaces: SkillsMP (1.2M+ indexed), VoltAgent/awesome-agent-skills, addyosmani/agent-skills. For a curated, ranked directory with verification signals and one-click install paths, browse ExplainX at /skills—we host discoverable skills from many authors alongside first-party repos like Google and Anthropic.

Appendix B Preview: Retail as Canonical Enterprise Case

The whitepaper's retail case study argues: same agent runtime + same MCP APIs ≠ same customer experience. Domain skills are the moat.

Example library:

Skill	Owner	Tier
`project-guidance`	Trades knowledge	Read-only
`materials-list`	Pro merchandising	Draft-only
`return-policy`	Customer service	Read-only (refunds = separate action skill)

100 process variants: one agent + 100 skills (~5K metadata tokens) beats 100 subagents or one 1M-token prompt.

Query routing example: "Remodel kids' bathroom" → loads project-guidance → follow-up on delivery loads delivery-window → prior skills released. Unbounded capability, bounded active context.

The Five Rules (Cheatsheet)

One skill, one job — if you need "and," split it
Descriptions are the interface — spend more time here than the body
Skills are dependencies — version, pin, PR review, tests
Right team owns right skill — domain experts, not central AI bottleneck
Runtime is interchangeable — portability is the value

Where to Start Tomorrow

From Appendix A:

Record your best practitioner narrating three workflows (1 hour)
Pick the most repeated; run without a skill; note failures
Draft SKILL.md; write 3 eval cases first (EDD)
Ship read-only tier; iterate description until 90% trigger accuracy
Resist generating fifty skills on day one

Summary

The Kaggle Agent Skills whitepaper makes the case that the format is settled (agentskills.io) while the work around it—evaluation under co-load, meta-skills, DAG composition, enterprise governance—is just beginning.

Skills give agents procedural memory without context rot. They replace many multi-agent deployments with one agent + a versioned library. They fail predictably when descriptions are vague, bodies are bloated, or evals run in isolation.

Start small. Treat skills as code. Measure what you ship. Browse existing skills at ExplainX /skills before writing your twentieth from scratch.

