Agent Skills Whitepaper: Kaggle Guide to Procedural Memory for AI Agents

Google/Kaggle Agent Skills whitepaper (May 2026) by Debanshu Das, Tanvi Singhal, and team—progressive disclosure, eval tiers, Skills vs MCP, meta-skills, 40K+ marketplace skills, and the retail case study for enterprise builders.

12 min read · Yash Thakker
Tags: Agent Skills, Kaggle, Google, AI Agents, Procedural Memory

Agent Skills went from Anthropic experiment to industry default in roughly twelve months. The clearest articulation of why—and how to ship them safely—is the 62-page whitepaper on Kaggle, authored by Tanvi Singhal, Gabriela Hernandez Larios, Debanshu Das, Lavi Nigam, and Smitha Kolam (curated by Shubham Saboo), with the open spec at agentskills.io.

This is not a vendor pitch. It is a builder's manual for two personas:

  • Builders — use skills in Claude Code, Codex, Cursor, ChatGPT
  • Developers — write, version, evaluate, and govern skill libraries

If you read one document after our Agent Skills complete guide, make it this one. When you are ready to install, ExplainX hosts a discoverable skills registry at /skills—ranked listings with copy-ready install commands across Claude Code, Cursor, Codex, and compatible agents.


TL;DR

Topic | Takeaway
Format | Folder + SKILL.md + optional scripts/, references/, assets/
Standard | agentskills.io (cross-platform)
Core win | Procedural memory: how to do tasks, on demand
Token math | 50 skills ≈ 4K metadata tokens vs 15K monolithic prompt
Bad skills | 19% of tasks worse with skill (SkillsBench)
vs MCP | MCP = hands; Skills = runbook; they compose
vs AGENTS.md | AGENTS.md = always on; Skills = load when matched
Marketplace | 40,000+ public listings by early 2026
Discover | ExplainX skills registry (ranked, install-ready)
Install | npx skills install github.com/google/skills

Why Skills Exploded: Four Friction Points

The whitepaper names four problems that made multi-agent sprawl the default—and Skills the escape hatch:

1. Context rot

Dump every instruction into one system prompt → LLM performance degrades. Research cited includes Lost in the Middle (Liu et al., TACL 2024) and Context Rot (Chroma, 2025): accuracy falls as input grows—even below the context window limit.

Skills fix: load instructions only when triggered.

2. Procedural memory

LLMs approximate:

Memory type | Analog
Episodic | What happened in this chat
Semantic | Facts in weights / RAG
Procedural | How to execute workflows step-by-step

Skills are the first credible procedural memory primitive for agents—the "runbook the experienced colleague hands you on day one."

3. Multi-agent overload

Early 2025 default: router + HR sub-agent + invoice sub-agent + slide-deck sub-agent → CI/CD hell.

Skills alternative: one general agent + library of specialists-on-demand. Maintain skills, not deployments.

When multi-agent still wins: genuine parallelism, different security postures, adversarial check-and-balance, heterogeneous models.

4. Portability

A markdown folder works anywhere with filesystem access—Claude Code, Codex, Cursor, Antigravity, ADK. No vendor lock-in at the artifact layer.


Skill Anatomy and Progressive Disclosure

Minimum viable skill:

my-skill/
├── SKILL.md          # Required
├── scripts/          # Optional — deterministic code
├── references/       # Optional — load on demand
└── assets/           # Optional — templates, schemas

Three disclosure levels:

Level | What loads | When
L1 Metadata | name + description | Every turn (~50 tokens/skill)
L2 Body | Full SKILL.md | Skill triggers
L3 Bundled | references/, assets/, scripts/ | Body references or executes
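A minimal sketch of the first two disclosure levels, to make the table concrete. It is not the loader of any particular agent: the function names (`split_frontmatter`, `load_metadata`, `load_body`) are hypothetical, and the parser assumes flat single-line `key: value` frontmatter, where a real loader would use a YAML parser.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class SkillMetadata:
    """Level 1: the only part of a skill that sits in context every turn."""
    name: str
    description: str
    path: Path


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md text into (frontmatter fields, body).

    Assumption: frontmatter is single-line `key: value` pairs between two
    `---` markers. Multi-line YAML values are not handled in this sketch.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return fields, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}, text  # no closing marker: treat everything as body


def load_metadata(skills_root: Path) -> List[SkillMetadata]:
    """Level 1: read only name + description for every skill folder."""
    catalog = []
    for skill_file in sorted(skills_root.glob("*/SKILL.md")):
        fields, _ = split_frontmatter(skill_file.read_text(encoding="utf-8"))
        catalog.append(SkillMetadata(
            name=fields.get("name", skill_file.parent.name),
            description=fields.get("description", ""),
            path=skill_file,
        ))
    return catalog


def load_body(meta: SkillMetadata) -> str:
    """Level 2: read the full body only once the skill has triggered."""
    _, body = split_frontmatter(meta.path.read_text(encoding="utf-8"))
    return body.strip()


# Demo with a throwaway skill folder.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "cafe-preparation"
    skill_dir.mkdir()
    (skill_dir / "SKILL.md").write_text(
        "---\nname: cafe-preparation\n"
        "description: Generates daily prep sheets.\n---\n"
        "# Steps\n1. Read the menu.\n",
        encoding="utf-8",
    )
    catalog = load_metadata(Path(tmp))
    body = load_body(catalog[0])
```

Level 3 would follow the same pattern: files under references/ or scripts/ are opened only when the body points at them.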

Token economics example from the paper: 50 workflows as one prompt = 15,000 tokens/turn. As 50 skills = ~4,000 description tokens + ~2,000 active body ≈ 6,000 total—with 49 bodies on disk.
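The arithmetic in that example is easy to check. The sketch below uses only the figures quoted above; the variable names are mine, not the paper's.

```python
# Figures quoted from the whitepaper's token economics example.
MONOLITHIC_TOKENS = 15_000   # 50 workflows in one prompt, paid every turn
DESCRIPTION_TOKENS = 4_000   # metadata for all 50 skills, paid every turn
ACTIVE_BODY_TOKENS = 2_000   # the one skill body that is actually loaded
SKILL_COUNT = 50

skills_total = DESCRIPTION_TOKENS + ACTIVE_BODY_TOKENS
savings = 1 - skills_total / MONOLITHIC_TOKENS
tokens_per_description = DESCRIPTION_TOKENS / SKILL_COUNT

print(f"skills setup: {skills_total} tokens/turn")
print(f"saving vs monolithic prompt: {savings:.0%}")
print(f"implied tokens per description: {tokens_per_description:.0f}")
```

On these numbers the skills setup costs 6,000 tokens per turn, a 60% reduction. The 4,000-token figure implies about 80 tokens per description, somewhat above the ~50 tokens/skill quoted for L1 metadata, so treat both as rough estimates.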

Anthropic examples cite reductions from ~150K → ~2K active tokens when converting workflows to skills—aligned with Matt Pocock v1.0's 63% savings via the same pattern.


Case Study: mattpocock/skills at 135K Stars

The whitepaper's theory meets production in Matt Pocock's skills repo, the most-starred open-source skill library (135,000+ stars, 11,700+ forks, v1.0.1 June 2026). Full inventory: Matt Pocock agent skills guide.

Why it validates the whitepaper thesis

Whitepaper concept | Pocock implementation
Procedural memory | /tdd, diagnosing-bugs, codebase-design encode how to engineer
Progressive disclosure | v1.0 loads compact index first; bodies on invoke (~63% token cut)
One agent + library | User-invoked orchestrators (/grill-me, /to-prd) call model-invoked discipline
Eval before ship | writing-great-skills + GLOSSARY.md; changesets for semver
Portability | npx skills@latest add mattpocock/skills across Claude Code, Cursor, Codex

User-invoked vs model-invoked (composition pattern)

Pocock's v1.0 taxonomy (docs/invocation.md) is a live example of the whitepaper's DAG orchestration:

  • User-invoked = orchestrate (/grill-with-docs, /ask-matt, /triage)
  • Model-invoked = discipline (grilling, domain-modeling, tdd)
  • Rule: user-invoked may call model-invoked; never another user-invoked
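The rule is simple enough to check mechanically. Here is a small sketch of such a check; the skill names come from the list above, but the call edges in `CALLS` are invented for illustration and do not describe the real repo.

```python
from typing import Dict, List, Set, Tuple

# Names taken from the taxonomy above (slash prefixes dropped).
USER_INVOKED = {"grill-with-docs", "ask-matt", "triage"}
MODEL_INVOKED = {"grilling", "domain-modeling", "tdd"}

# Hypothetical call edges: which skills each skill delegates to.
CALLS: Dict[str, Set[str]] = {
    "grill-with-docs": {"grilling", "domain-modeling"},  # allowed
    "triage": {"ask-matt"},                              # breaks the rule
}


def find_violations(calls: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (caller, callee) pairs where a user-invoked skill
    calls another user-invoked skill."""
    return sorted(
        (caller, callee)
        for caller, callees in calls.items()
        if caller in USER_INVOKED
        for callee in callees
        if callee in USER_INVOKED
    )


violations = find_violations(CALLS)
print(violations)
```

A check like this could run in CI alongside the skill library so that orchestrators never silently chain into each other.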

This is the production workaround for the skill-invocable tier gap Pocock flagged—helper skills stay model-invoked until the spec adds a third tier.

Four failure modes → skill library design

The README maps four engineering failures to skills—the same "one skill, one job" rule the whitepaper's cheatsheet recommends:

  1. Misalignment → /grill-me, /grill-with-docs
  2. Verbosity → CONTEXT.md + domain-modeling (ubiquitous language)
  3. Broken code → /tdd, diagnosing-bugs (feedback loops)
  4. Ball of mud → /improve-codebase-architecture, codebase-design

Contrast with GSD / BMAD / Spec-Kit: Pocock explicitly rejects process frameworks that own the workflow. Skills stay small, composable, adaptable—matching the whitepaper's portability argument.

Enterprise lessons from a solo-author monorepo

Even a one-person library demonstrates governance patterns enterprises need:

  • /setup-matt-pocock-skills: one-time config (tracker, labels, docs) = onboarding skill
  • Shared primitives: codebase-design + domain-modeling deduplicate vocabulary across skills (DRY for procedural memory)
  • Breaking changes via semver: v1.0 removed zoom-out, renamed diagnose → diagnosing-bugs; dependents updated in same release
  • Read/Draft/Act analog: /triage for read-only routing; /to-prd for draft artifacts; git guardrails before action-allowed git ops

If your team ships one reference implementation, study this repo before writing internal skills from scratch.


Two Paths to Build Your First Skill

Path A: Translate what you already know

Compliance runbooks, HR onboarding guides, 30-page ops docs—no coding required. Focus on YAML frontmatter:

---
name: cafe-preparation
description: |
  Calculates daily ingredient needs and generates prep sheets.
  Use when estimating quantities or shopping lists.
  Do NOT use for shift scheduling or accounting.
---

Description = routing algorithm. Front-load trigger keywords; state when NOT to use.

Naming: snake_case directories, kebab-case names, gerund form (processing-pdfs). Avoid utils, tools, vendor prefixes.
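Because the description acts as the router, it is worth linting before anything else. The sketch below shows one way to do that; the checks are heuristics I chose from the advice above (kebab-case name, no generic names, an explicit "use when" and "do not use" clause), not rules taken from the spec.

```python
import re
from typing import List

KEBAB_CASE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")
GENERIC_NAMES = {"utils", "tools"}


def lint_frontmatter(name: str, description: str) -> List[str]:
    """Return a list of problems; an empty list means the checks pass."""
    problems = []
    if not KEBAB_CASE.fullmatch(name):
        problems.append("name is not kebab-case")
    if name in GENERIC_NAMES:
        problems.append("name is too generic")
    lowered = description.lower()
    if "use when" not in lowered:
        problems.append("description does not say when to use the skill")
    if "do not use" not in lowered:
        problems.append("description does not say when NOT to use the skill")
    return problems


good = lint_frontmatter(
    "cafe-preparation",
    "Calculates daily ingredient needs and generates prep sheets.\n"
    "Use when estimating quantities or shopping lists.\n"
    "Do NOT use for shift scheduling or accounting.",
)
bad = lint_frontmatter("Utils_Helper", "Helps with stuff.")
print(good)
print(bad)
```

String matching like this only catches missing boilerplate. Whether the description actually routes well still has to be measured with trigger evals.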

Path B: Crystallize what the agent just did

Agent completes a reusable workflow → propose SKILL.md draft → human reviews. Tools: Anthropic skill-creator, Nous Hermes Agent, self-improving-agent-skills patterns.

Warning: Unreviewed agent-drafted skills are often worse than no skill.


Skills vs MCP vs AGENTS.md

One-line mental model from Appendix A:

System prompt = instinct. AGENTS.md = project README. Tools/MCP = hands. RAG = library. Skills = the runbook.

Primitive | Role
MCP | Connect to external systems
Skill | Teach how to use those tools for a workflow
AGENTS.md | Always-on conventions + optional skill catalog router
Karpathy LLM Wiki | Compiled organizational knowledge (different layer)

They compose, not compete.


Installing Skills Today (Three Paradigms)

  1. File drop: .agents/skills/ at project root (emerging convention); tool-specific paths vary. Community managers (skillport, openskills) symlink across CLIs.

  2. UI install: web/enterprise registries; upload folders without a terminal.

  3. Programmatic: Google ADK SkillToolset, which auto-generates load_skill routing tools.

npx skills install github.com/google/skills

Google launched the official repo at github.com/google/skills at Cloud Next 2026—works across Antigravity, Codex, Claude Code, and compliant agents.


Evaluation: A Skill Without a Test Is a Hope

SkillsBench: 19% of tasks performed worse with a skill than without.

Four failure modes

Mode | Symptom
Trigger | Wrong skill fires, or the right one never fires
Execution | Triggers but wrong output / tool calls
Token budget | Huge body causes context rot when co-loaded
Regression | New skill breaks routing for existing library

Vercel production data (cited)

  • 56% non-invocation rate for skills expected to fire consistently
  • Skill stripped of instructions: 58% pass vs 63% with no skill — active harm
  • Passive AGENTS.md index: 100% pass vs 53% baseline — global context ≠ specialist skills

Target: 90% trigger accuracy on descriptions.

Evaluation toolkit (five patterns)

Pattern | Use
Eval-as-unit-test | CI on every change
Golden dataset | Versioned input/output pairs in skill dir
LLM-as-judge | Rubric scoring at scale (swap positions for bias)
Adversarial red-team | Negative boundary + rephrasing tests
Canary/shadow | 1% live traffic before action-allowed

Evaluation-Driven Development (EDD): write 3 JSON eval cases before drafting SKILL.md:

{
  "case_id": "refund_dup_charge_001",
  "input": "I was charged twice for order #4521 last Tuesday",
  "expected_skill": "refund_processor",
  "expected_tool_calls": [
    {"tool": "lookup_order", "args": {"order_id": "4521"}}
  ],
  "rubric": ["acknowledges duplicate", "cites order id"]
}

Critical: Never evaluate skills in isolation. Production co-loads 5–15 skills. A 5,000-word body that passes alone may fail in the library.
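To show how eval cases turn into a trigger-accuracy number, here is a minimal sketch. The first case mirrors the JSON above. The other cases, the `order_tracker` skill, and the keyword-based `route` function are hypothetical stand-ins. In practice `route` would call your agent with the full skill library co-loaded.

```python
import json
from typing import Callable, List, Optional

EVAL_CASES = json.loads("""
[
  {"case_id": "refund_dup_charge_001",
   "input": "I was charged twice for order #4521 last Tuesday",
   "expected_skill": "refund_processor"},
  {"case_id": "refund_rephrased_002",
   "input": "Please send my money back for the duplicate payment",
   "expected_skill": "refund_processor"},
  {"case_id": "tracking_003",
   "input": "Where is my package?",
   "expected_skill": "order_tracker"},
  {"case_id": "negative_boundary_004",
   "input": "What is the weather today?",
   "expected_skill": null}
]
""")

# Hypothetical router: a keyword table standing in for the real agent.
ROUTER_KEYWORDS = {
    "refund_processor": ("charged twice", "refund"),
    "order_tracker": ("package", "tracking"),
}


def route(user_input: str) -> Optional[str]:
    """Return the skill that would fire, or None if no skill matches."""
    text = user_input.lower()
    for skill, keywords in ROUTER_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return skill
    return None


def trigger_accuracy(cases: List[dict],
                     router: Callable[[str], Optional[str]]) -> float:
    """Fraction of cases where the routed skill equals the expected one."""
    hits = sum(router(c["input"]) == c["expected_skill"] for c in cases)
    return hits / len(cases)


accuracy = trigger_accuracy(EVAL_CASES, route)
misses = [c["case_id"] for c in EVAL_CASES
          if route(c["input"]) != c["expected_skill"]]
print(f"trigger accuracy: {accuracy:.0%}, misses: {misses}")
```

The rephrased refund case is the one that fails here. Rephrasing and negative-boundary cases are where weak descriptions usually show up.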

Read / Draft / Act ladder

Tier | Capability | Gate
Read-Only | Query, describe | 90% trigger accuracy
Draft-Only | Produce for human review | 20+ golden cases
Action-Allowed | Irreversible ops | Adversarial + pass^k + human sign-off
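The ladder can be written as a promotion check. This is a sketch under one assumption the table does not state outright: gates are cumulative, so a skill must clear each lower gate before it can reach a higher tier. The `SkillEvidence` fields are hypothetical names for the evidence each gate asks for.

```python
from dataclasses import dataclass


@dataclass
class SkillEvidence:
    trigger_accuracy: float   # fraction in [0, 1], measured under co-load
    golden_cases: int         # versioned input/output pairs that pass
    adversarial_passed: bool  # red-team suite passed
    pass_k_met: bool          # pass^k threshold met
    human_signoff: bool       # a named reviewer approved


def highest_tier(evidence: SkillEvidence) -> str:
    """Return the highest tier whose gate, and every lower gate, is met."""
    if evidence.trigger_accuracy < 0.90:
        return "none"
    if evidence.golden_cases < 20:
        return "read-only"
    if not (evidence.adversarial_passed
            and evidence.pass_k_met
            and evidence.human_signoff):
        return "draft-only"
    return "action-allowed"


print(highest_tier(SkillEvidence(0.95, 25, True, True, False)))
```

Keeping the check in code means a skill's tier follows from recorded evidence, not from a judgment call made at deploy time.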

pass^k: require success on all k independent runs of the same task, not one lucky pass. GPT-4o: 61% pass^1 vs under 25% pass^8 on tau-bench.
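For readers who want to compute it, here is a sketch of a pass^k estimate from n recorded trials of one task. It uses the combinatorial form C(c, k) / C(n, k), which is how I understand tau-bench to estimate the metric. Treat that as an assumption and check it against the benchmark's own code before relying on it.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k runs drawn from `trials` all succeed.

    successes: trials that passed (c)
    trials:    total independent trials run for this task (n)
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)


# Example: one task, 8 trials, 5 passes.
print(pass_hat_k(5, 8, 1))  # plain pass rate
print(pass_hat_k(5, 8, 2))
print(pass_hat_k(5, 8, 8))  # zero unless every trial passed
```

Averaging this value over all tasks gives the benchmark-level figure. The steep drop from k = 1 to k = 8 is the reliability gap the ladder is designed to catch.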


Production: Skills Are the Unit of Improvement

Reverse-engineering Claude Code v2.1.88 (Liu, Zhao, Shang, Shen 2026): 98.4% of codebase is operational infrastructure—permissions, compaction, subagents—only 1.6% is the agent loop.

Implication: As models converge on reasoning, deterministic engineering around the model differentiates reliability. The composable unit is the skill.

Improvement style | Cycle time | Context tax
Model swap | Days–weeks | None (weights)
System prompt edit | Minutes–hours | Every turn
Fine-tune | Weeks–months | None
New skill | Hours–days | On-demand only

Google Agents CLI ships seven lifecycle skills (among them scaffold, build, eval, deploy, publish, and observe): the expertise lives in skills, and the runtime is commoditized.


Meta-Skills and Self-Improvement

Four buckets:

  1. Authoring — skill-creator, ADK skill factory
  2. Trace harvesting — successful runs → draft skills
  3. Improvement — SkillOptimizer, description tuning loops, Karpathy autoresearch
  4. Library evolution — Voyager-style skill growth from production traffic

Rules:

  • Agent-written skills enter at draft tier always
  • Human reviews first diffs—even when metrics improve
  • Don't start with meta-skills on an empty library

Connects to Microsoft SkillOpt and self-harness agents research.


Composition: DAGs and Capability Profiles

Real workflows exceed one skill. The paper advocates:

  • DAG orchestration — structured handoffs via file bus, not LLM context as database
  • Capability profiles — swappable bundles of active skills + tools + model params
  • Canonical node types — Generator, Reviewer/Gate, Pipeline, Inversion, Domain Context Wrapper
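The file-bus idea can be sketched in a few lines: each node writes its output to a file, and downstream nodes read those files, not the shared conversation context. The node functions and the JSON-per-node layout below are hypothetical, chosen only to illustrate the Generator and Reviewer/Gate roles. `graphlib` is in the Python standard library from 3.9.

```python
import json
import tempfile
from graphlib import TopologicalSorter
from pathlib import Path
from typing import Callable, Dict, List, Set

Node = Callable[[Dict[str, dict]], dict]


def run_dag(nodes: Dict[str, Node], deps: Dict[str, Set[str]],
            bus_dir: Path) -> List[str]:
    """Run nodes in dependency order, handing off results through files."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        # Read upstream outputs from the bus, not from shared context.
        inputs = {
            dep: json.loads((bus_dir / f"{dep}.json").read_text(encoding="utf-8"))
            for dep in deps.get(name, set())
        }
        result = nodes[name](inputs)
        (bus_dir / f"{name}.json").write_text(json.dumps(result), encoding="utf-8")
    return order


def generator(_inputs: Dict[str, dict]) -> dict:
    return {"draft": "Prep sheet for Tuesday"}


def reviewer(inputs: Dict[str, dict]) -> dict:
    draft = inputs["generator"]["draft"]
    return {"draft": draft, "approved": "Prep sheet" in draft}


def publisher(inputs: Dict[str, dict]) -> dict:
    return {"published": inputs["reviewer"]["approved"]}


NODES = {"generator": generator, "reviewer": reviewer, "publisher": publisher}
DEPS = {"generator": set(), "reviewer": {"generator"}, "publisher": {"reviewer"}}

with tempfile.TemporaryDirectory() as tmp:
    order = run_dag(NODES, DEPS, Path(tmp))
    final = json.loads((Path(tmp) / "publisher.json").read_text(encoding="utf-8"))

print(order, final)
```

Because every handoff is a file, each step can be inspected, replayed, or tested in isolation without re-running the whole chain.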

Context debt: capitalized "ALWAYS DO X" in descriptions → models ignore them. Shift intelligence left into testable scripts/.


Choosing Among 40,000+ Public Skills

Heuristics:

  1. First-party first: google/skills, anthropics/skills, stripe/ai, microsoft/skills
  2. Pin versions: community skills change without notice
  3. Audit like code: supply-chain hygiene; skills run in your context

Source | Trust default
Vendor first-party | Trust; pin version
Org-curated private | Trust within org; PR review
Community | Audit before adopt; pin aggressively
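One low-tech way to pin a community skill is to record a content hash of the audited folder and refuse to load it if the hash changes. The sketch below illustrates that idea only. It is not a feature of any skills CLI, and `skill_digest` and `matches_pin` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def skill_digest(skill_dir: Path) -> str:
    """SHA-256 over every file's relative path and bytes, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in skill_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(skill_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def matches_pin(skill_dir: Path, pinned_digest: str) -> bool:
    """True only if the folder is byte-for-byte what was audited."""
    return skill_digest(skill_dir) == pinned_digest


# Demo: pin a throwaway skill, then change one file.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "return-policy"
    (skill_dir / "scripts").mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text(
        "---\nname: return-policy\n---\nSteps\n", encoding="utf-8")
    (skill_dir / "scripts" / "check.py").write_text(
        "print('ok')\n", encoding="utf-8")

    pinned = skill_digest(skill_dir)
    unchanged_ok = matches_pin(skill_dir, pinned)

    (skill_dir / "SKILL.md").write_text(
        "---\nname: return-policy\n---\nSteps (edited upstream)\n",
        encoding="utf-8")
    tampered_ok = matches_pin(skill_dir, pinned)

print(unchanged_ok, tampered_ok)
```

Pinning to a git commit or a release tag gives the same guarantee with less custom code. The hash approach is mainly useful for skills that arrive as loose folders.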

Marketplaces: SkillsMP (1.2M+ indexed), VoltAgent/awesome-agent-skills, addyosmani/agent-skills. For a curated, ranked directory with verification signals and one-click install paths, browse ExplainX at /skills—we host discoverable skills from many authors alongside first-party repos like Google and Anthropic.


Appendix B Preview: Retail as Canonical Enterprise Case

The whitepaper's retail case study argues: same agent runtime + same MCP APIs ≠ same customer experience. Domain skills are the moat.

Example library:

Skill | Owner | Tier
project-guidance | Trades knowledge | Read-only
materials-list | Pro merchandising | Draft-only
return-policy | Customer service | Read-only (refunds = separate action skill)

100 process variants: one agent + 100 skills (~5K metadata tokens) beats 100 subagents or one 1M-token prompt.

Query routing example: "Remodel kids' bathroom" → loads project-guidance → follow-up on delivery loads delivery-window → prior skills released. Unbounded capability, bounded active context.


The Five Rules (Cheatsheet)

  1. One skill, one job — if you need "and," split it
  2. Descriptions are the interface — spend more time here than the body
  3. Skills are dependencies — version, pin, PR review, tests
  4. Right team owns right skill — domain experts, not central AI bottleneck
  5. Runtime is interchangeable — portability is the value

Where to Start Tomorrow

From Appendix A:

  1. Record your best practitioner narrating three workflows (1 hour)
  2. Pick the most repeated; run without a skill; note failures
  3. Draft SKILL.md; write 3 eval cases first (EDD)
  4. Ship read-only tier; iterate description until 90% trigger accuracy
  5. Resist generating fifty skills on day one

Summary

The Kaggle Agent Skills whitepaper makes the case that the format is settled (agentskills.io) while the work around it—evaluation under co-load, meta-skills, DAG composition, enterprise governance—is just beginning.

Skills give agents procedural memory without context rot. They replace many multi-agent deployments with one agent + a versioned library. They fail predictably when descriptions are vague, bodies are bloated, or evals run in isolation.

Start small. Treat skills as code. Measure what you ship. Browse existing skills at ExplainX /skills before writing your twentieth from scratch.


Related Reading

Whitepaper concepts cited from the Kaggle Agent Skills whitepaper (May 2026), authors Tanvi Singhal, Gabriela Hernandez Larios, Debanshu Das, Lavi Nigam, Smitha Kolam; curated by Shubham Saboo.
