
Caveman skill: token economics, API pricing, and cutting verbose LLM output in agents

Caveman agent skill for terse Claude and GPT replies: 2026 OpenAI and Anthropic pricing, why output tokens dominate agent bills, and how the JuliusBrussee/caveman skill pairs with caching and routing.

7 min read · ExplainX Team
Caveman skill · LLM Optimization · Token Economics · Developer Tooling · Prompting · AI Agents

What is the Caveman skill?

The Caveman skill is an open agent skill (JuliusBrussee/caveman) that constrains how much filler prose assistants emit—lite, full, and ultra modes—while keeping code blocks and technical payloads intact. Install it like any other skill from the registry; it complements prompt caching, batch APIs, and model routing rather than replacing them.

Bottom line (April 2026): public API rate cards from OpenAI and Anthropic still charge more per token for output than for input on flagship coding models, and agent pipelines multiply every wasted completion across later turns. The Caveman skill targets low-value prose, not semantics. Recent preprint work (MD Azizul Hakim, arXiv:2604.00025, 11 Mar 2026) ties scale-dependent verbosity to benchmark errors and shows brevity constraints can recover large-model advantages.

Caveman skill — token economics and brevity for agents

Why this post exists

Most writing about LLM cost optimization stops at:

  1. "Use a cheaper model."
  2. "Make prompts shorter."

Both help, but they skip the systems view:

  • how per-token economics evolved from early GPT-4-class APIs to 2026 frontier listings
  • why output and carried conversation state dominate many coding and agent bills
  • when shorter answers are a reliability lever, not only a budget lever

Caveman is the concrete example; the through-line is token economics and measurement.

First principles: what you are actually paying for

For commercial APIs, cost is usually a weighted sum of:

  • Input tokens (full-price vs cached input where the provider supports reuse)
  • Output tokens (often 1.25–6× input price on comparable tiers—exact ratio depends on model and vendor)
  • Tool charges (hosted search, code execution, retrieval)
  • Tier modifiers (batch/async discounts, flex vs “priority,” data residency uplifts)

The expensive mistake is treating “tokens” as one scalar. Buckets and multipliers differ; optimizations that trim output help most when output is priced highest.
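The weighted-sum view above can be sketched as a small cost function. This is a minimal sketch, not any vendor's billing logic; the rate numbers below are placeholders.

```python
from dataclasses import dataclass

# Hypothetical rate card; the dollar figures are placeholders, not a
# real vendor's prices.
@dataclass
class RateCard:
    input_per_m: float         # $ per 1M full-price input tokens
    cached_input_per_m: float  # $ per 1M cached input tokens
    output_per_m: float        # $ per 1M output tokens

def call_cost(card: RateCard, fresh_in: int, cached_in: int,
              out_tokens: int, tool_fees: float = 0.0) -> float:
    """Cost of one API call as a weighted sum over token buckets."""
    return (fresh_in / 1e6 * card.input_per_m
            + cached_in / 1e6 * card.cached_input_per_m
            + out_tokens / 1e6 * card.output_per_m
            + tool_fees)

card = RateCard(input_per_m=2.50, cached_input_per_m=0.25, output_per_m=15.00)
# 40K fresh input + 160K cached input + 8K output under this sample card:
print(round(call_cost(card, fresh_in=40_000, cached_in=160_000, out_tokens=8_000), 4))
```

Note how the same 200K tokens of context costs a fraction of full price once most of it lands in the cached bucket; that is why the buckets, not the raw token count, drive the bill.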

Token cost history: anchors that still matter

Public archives and current listings tell a three-act story:

  1. Early frontier (2023): strongest general models shipped at tens of dollars per million input tokens—OpenAI’s archived tables include gpt-4-0613 at $30 / 1M input and $60 / 1M output.
  2. Efficiency wave (2024–2025): multimodal and “mini” classes pushed routine work toward sub-$5 / 1M input territory—e.g. gpt-4o-2024-05-13 at $5 / $15 per 1M and gpt-4o-mini announced at $0.15 / $0.60 (OpenAI, July 18, 2024).
  3. 2026 frontier snapshot: flagship SKUs remain output-heavy even as quality improves. As of April 2026, OpenAI’s published API pricing shows GPT-5.4 at $2.50 / 1M input, $0.25 / 1M cached input, $15.00 / 1M output (standard rates under 270K context); GPT-5.4 mini at $0.75 / $0.075 / $4.50; GPT-5.4 nano at $0.20 / $0.02 / $1.25. The same page notes Batch API saves 50% on eligible input and output, web search at $10.00 per 1,000 calls (search content tokens listed as free), and a +10% uplift for certain data-residency / regional endpoints on models released after March 5, 2026.

On Anthropic’s side, the April 2026 model pricing table lists, for example, Claude Sonnet 4.6 at $3 / 1M input and $15 / 1M output, Claude Opus 4.6 at $5 / $25, and Claude Haiku 4.5 at $1 / $5, with cache reads at 0.1× base input after a cache write and Batch API at 50% off both input and output for supported workloads—so the same “shrink repeated context + cut output” playbook applies across vendors.
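To make the cross-vendor comparison concrete, here is a small calculator using the April 2026 snapshot rates quoted above. Verify against the vendors' live pricing pages before relying on them; cached-input, batch, and tool fees are omitted for simplicity.

```python
# Rates are the April 2026 snapshot quoted in this post: (input $/1M, output $/1M).
RATES = {
    "gpt-5.4":           (2.50, 15.00),
    "gpt-5.4-mini":      (0.75, 4.50),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5":  (1.00, 5.00),
}

def monthly_cost(model: str, in_tok_m: float, out_tok_m: float) -> float:
    """Monthly spend for a workload measured in millions of tokens."""
    inp, outp = RATES[model]
    return in_tok_m * inp + out_tok_m * outp

# Same workload (500M input, 200M output) across tiers:
for model in RATES:
    print(model, monthly_cost(model, 500, 200))
```

Even at these rates, output carries most of the flagship-tier bill for output-heavy workloads, which is why trimming completions moves the needle more than trimming prompts.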

Net: unit costs fell, but output-token weight and tooled workflows keep waste material.

Why verbosity still hurts after price drops

Agentic coding stacks often chain:

  • planner call
  • router / tool-selection call
  • patch generation
  • explanation or review
  • retry loops

If each hop adds 20–40% conversational padding, you pay repeatedly in:

  • downstream input: prior verbose turns become context on the next call
  • latency and review drag
  • error surface: filler correlates with contradiction and “helpful” hedging
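The re-ingestion effect above is easy to simulate. The toy model below (an illustration, not a billing formula) assumes every hop's completion is carried as input into all later hops:

```python
def pipeline_tokens(base_output: int, padding: float, hops: int) -> int:
    """Total billed tokens when each hop's (padded) completion is re-read
    as input by every later hop in the chain."""
    total = 0
    carried = 0  # context accumulated from earlier turns
    for _ in range(hops):
        completion = int(base_output * (1 + padding))
        total += carried + completion  # pay for carried input + new output
        carried += completion          # next hop re-ingests this completion
    return total

lean = pipeline_tokens(base_output=500, padding=0.0, hops=5)
padded = pipeline_tokens(base_output=500, padding=0.3, hops=5)
print(lean, padded, round(padded / lean, 2))
```

Padding inflates not just each completion but every later hop's input, so the overhead is paid once as output and then repeatedly as context.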

Tokenization: why “word count” misleads

OpenAI’s consumer docs still use handy English heuristics: ~4 characters ≈ 1 token and ~75 words ≈ 100 tokens (see What are tokens?). Production caveats:

  1. Language and script change token efficiency.
  2. JSON, markdown fences, stack traces, and tool envelopes inflate tokens versus what humans “see.”

So a “short” visible answer can still be a large billed payload.
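For planning-stage budgeting, the ~4-characters-per-token heuristic from the docs can be wrapped in a helper. This is only a rough English-prose guess, not a tokenizer; JSON and code typically tokenize worse than this estimate suggests:

```python
def estimate_tokens(text: str) -> int:
    """Rough English-text estimate using the ~4 chars per token heuristic.
    A planning-stage guess only; use the provider's tokenizer for billing."""
    return max(1, round(len(text) / 4))

prose = "Sure! Here is a short answer."
payload = '{"status": "ok", "items": [1, 2, 3], "trace": null}'
print(estimate_tokens(prose), estimate_tokens(payload))
```

The structural characters in the JSON payload count toward the estimate even though a human "sees" very little content, which is the word-count trap in miniature.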

Research note: brevity as an intervention, not just aesthetics

In “Brevity Constraints Reverse Performance Hierarchies in Language Models” (MD Azizul Hakim; submitted 11 Mar 2026; arXiv:2604.00025), the author evaluates 31 models (~0.5B–405B) on 1,485 problems and reports that on 7.7% of items across five datasets, larger models trail smaller ones by 28.4 percentage points, a pattern attributed in part to verbosity-induced overelaboration. Causal interventions with brevity constraints raise large-model accuracy by ~26 percentage points and invert prior hierarchies on math and science subsets, with 7.7–15.9 point swings—supporting the deployment idea that prompt shape is a first-class control, not cosmetic.

Where Caveman fits

Caveman (see the Caveman skill on ExplainX and the project site) is a response-style constraint layer for agentic CLIs: modes like lite, full, ultra, add-ons for terse commits/reviews, and caveman-compress for shrinking session memory-style inputs. Architecturally it targets communication overhead, not reasoning capability: compress surface language, preserve semantic payload, measure quality.

On ExplainX: explore the full skills registry; for content-facing agent playbooks see SEO + GEO agent skills; to list your own skill, register and use the submission flow.

Cost math: a sanity model

Use:

monthly_cost ≈ Σ (input_tokens × input_rate + output_tokens × output_rate + tool_fees)

If style changes cut output tokens by fraction r without harming task success:

output_savings ≈ monthly_output_tokens × output_rate × r

Example: 200M output tokens/month at $10 / 1M output with r = 0.35 yields about $700/month saved on that slice alone—before counting downstream input shrinkage.
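The sanity model above translates directly into code; the figures below are the worked example from the text, not measured data:

```python
def output_savings(monthly_output_tokens: float, output_rate_per_m: float,
                   r: float) -> float:
    """Savings from cutting output tokens by fraction r, output side only.

    monthly_output_tokens: total output tokens per month
    output_rate_per_m:     $ per 1M output tokens
    r:                     fraction of output tokens eliminated (0..1)
    """
    return monthly_output_tokens / 1e6 * output_rate_per_m * r

# Worked example from the text: 200M output tokens at $10 / 1M, r = 0.35.
print(round(output_savings(200e6, 10.0, 0.35), 2))  # 700.0
```

Plug in your own output-token rate and measured r from an A/B run before trusting any projection; r varies a lot by task family.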

Platform mechanics teams overlook

Three levers compound with terse defaults:

  1. Cached / repeated system or document context (OpenAI cached-input rows; Anthropic cache hits at 0.1× base input after writes).
  2. Batch / async lanes (both vendors advertise 50% token discounts for eligible batch workloads in their public pricing docs as of April 2026).
  3. Model routing: frontier models only on high-ambiguity steps; mini / nano / Haiku-class for transforms and scaffolding.
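The routing lever can be sketched as a simple policy function. The model names reuse the tiers discussed above; the thresholds and task categories are illustrative assumptions, not a real router's API:

```python
# Minimal routing sketch; tier names and thresholds are illustrative.
ROUTES = {
    "frontier": "gpt-5.4",        # high-ambiguity planning / architecture
    "workhorse": "gpt-5.4-mini",  # patch generation, review
    "bulk": "gpt-5.4-nano",       # transforms, scaffolding, reformatting
}

def pick_model(task_kind: str, ambiguity: float) -> str:
    """Route by ambiguity first, then by task family."""
    if ambiguity > 0.7:
        return ROUTES["frontier"]
    if task_kind in ("transform", "scaffold", "format"):
        return ROUTES["bulk"]
    return ROUTES["workhorse"]

print(pick_model("architecture", 0.9))  # frontier tier
print(pick_model("transform", 0.2))     # bulk tier
```

The point is not this particular policy but that routing, caching, and terse output multiply: a nano-class call with a cached prompt and a Caveman-style completion is cheap on all three axes at once.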

Deployment playbook

  1. Baseline three regimes: default prompting; manual “be concise”; Caveman-style (or equivalent system policy).
  2. Jointly track cost, latency, and task success—not cost alone.
  3. Slice metrics by task family (debug, refactor, architecture, review).
  4. Keep an “expand” escape hatch (explain more, verbose sub-agent).
  5. Default terse where safe; escalate detail when confidence is low or stakeholders require auditability.
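Step 2 of the playbook (track cost, latency, and success jointly) can be captured in a toy scorecard; the regime names mirror step 1 and the sample numbers are made up:

```python
# Toy three-regime baseline; the metric values are illustrative, not measured.
regimes = {
    "default":      {"cost_usd": 120.0, "p50_latency_s": 9.1, "success": 0.86},
    "be_concise":   {"cost_usd": 95.0,  "p50_latency_s": 7.4, "success": 0.85},
    "caveman_lite": {"cost_usd": 71.0,  "p50_latency_s": 5.8, "success": 0.86},
}

def acceptable(r: dict, baseline: dict, max_success_drop: float = 0.02) -> bool:
    """A cheaper regime only wins if task success stays within tolerance."""
    return baseline["success"] - r["success"] <= max_success_drop

# Cheapest regime whose success rate holds up against the default:
winner = min(
    (name for name, r in regimes.items() if acceptable(r, regimes["default"])),
    key=lambda name: regimes[name]["cost_usd"],
)
print(winner)
```

Guarding the cost comparison with a success-rate tolerance is the whole point of step 2: a regime that wins on cost but quietly drops task success should never be promoted to the default.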

Failure modes

Brevity-first defaults fail when legal / compliance language must be explicit, learners need expository depth, or traceability belongs inside the reply. Apply verbosity selectively, not universally.

Caveman as pattern, not meme

Narrowly: “funny terse mode.” Properly: measured surface compression paired with routing, caching, and evaluation—now part of cost-aware agent design.

FAQ

These answers mirror the FAQ block in this page’s metadata (for search and AI overviews). Caveman skill install: explainx.ai/skills/JuliusBrussee/caveman/caveman.

  • Cost drivers (2026): input + output tokens (output often higher per token), cached input discounts, batch/flex tiers, tools (e.g. OpenAI web search $10 / 1,000 calls), regional uplifts (+10% on some OpenAI endpoints for post–Mar 5, 2026 models).
  • Why verbosity still hurts: chained agents re-ingest prior completions as context; Hakim (arXiv:2604.00025) shows brevity constraints can raise large-model accuracy on part of the benchmark set by ~26 points.
  • When not to default terse: compliance narrative, training depth, or audit text that must live in the reply—use route- or audience-specific verbosity.
