TL;DR: Single prompts are obsolete for serious software engineering. Anthropic's Boris Cherny explains that production AI coding requires harness engineering—building systems that run iterative loops where Claude observes, plans, acts, and reflects over hours or days. This approach helped Anthropic engineers ship 8x more code daily, with Claude authoring over 80% of merged production code by May 2026.
The Paradigm Shift: From Prompts to Loops
"You're not supposed to prompt Claude. You're supposed to build a system that prompts itself."
This statement from Boris Cherny, engineer at Anthropic, has sparked a fundamental rethinking of how developers should use AI coding assistants.
The problem with single prompts:
- Limited scope: Can't handle multi-file, multi-day projects
- No iteration: AI can't learn from mistakes or refine approach
- Context loss: Each prompt starts fresh without accumulated knowledge
- Human bottleneck: Developer must manually orchestrate every step
The solution: harness engineering—systems that autonomously prompt AI agents in iterative loops.
What is Harness Engineering?
Harness engineering is the practice of building frameworks that orchestrate AI agents through repeated observe-plan-act-reflect cycles.
The Core Loop
1. OBSERVE → Analyze current codebase, test results, error logs
2. PLAN → Determine next action based on observations
3. ACT → Execute code changes, run tests, make commits
4. REFLECT → Evaluate results, identify gaps, adjust strategy
5. REPEAT → Loop until task complete or timeout
This isn't a single prompt like "Add user authentication." It's a system that:
- Breaks down complex tasks into sub-tasks
- Executes each step autonomously
- Validates results before proceeding
- Adapts strategy based on outcomes
- Runs for hours or days without human intervention
Example: Adding Authentication (Traditional vs Loop-Based)
Traditional Single Prompt:
User: "Add JWT authentication to the API"
Claude: [Generates auth code in one file]
User: [Realizes it needs database migration, middleware, tests, docs]
User: "Now add the migration"
Claude: [Generates migration]
User: "Add middleware"
... 15 more manual prompts ...
Loop-Based Harness Engineering:
# Simplified harness pseudocode: `claude` and `CodebaseContext` are
# placeholders for your own model client and repository wrapper
task = "Add JWT authentication to the API"
max_turns = 50
context = CodebaseContext()
for turn in range(max_turns):
    # OBSERVE
    status = context.analyze_codebase()
    test_results = context.run_tests()
    # PLAN
    plan = claude.plan_next_action(task, status, test_results)
    # ACT
    if plan.action == "modify_file":
        context.edit_file(plan.file_path, plan.changes)
    elif plan.action == "run_migration":
        context.execute_migration(plan.migration_file)
    elif plan.action == "write_tests":
        context.create_test_file(plan.test_code)
    # Commit before checking for completion so the final change is not lost
    context.commit_changes(plan.commit_message)
    # REFLECT
    if plan.task_complete:
        break
Result: Claude autonomously:
- Adds JWT library dependencies
- Creates auth middleware
- Writes database migration for user tokens
- Updates API routes to use auth
- Writes integration tests
- Updates documentation
- Runs tests and fixes failing cases
- Commits with proper messages
All without human intervention beyond initial task specification.
The Anthropic Results: 8x Productivity, 80% AI-Authored Code
By May 2026, Anthropic engineers using harness engineering:
- 8x daily code output compared to traditional development
- 80%+ of merged production code authored by Claude
- Hours to days of autonomous execution per task
- 76% success rate on open-ended software tasks
What Changed?
Before (Single Prompts - Q1 2025):
- Engineer writes detailed spec
- Claude generates code
- Engineer manually integrates, tests, debugs
- Repeat 10-20 times per feature
- Result: 60% AI-generated code, 40% human
After (Harness Engineering - Q2 2026):
- Engineer specifies high-level goal
- Harness loop runs autonomously
- Claude observes, plans, acts, reflects
- Human reviews and approves final PR
- Result: 80% AI-generated code, 20% human (architecture + review)
How to Build Your Own Harness (Practical Guide)
Level 1: Simple Loop (1 hour implementation)
Start with a basic observe-act loop for repetitive tasks:
// Example: Auto-fix linting errors
async function lintFixLoop(maxIterations = 5) {
  for (let i = 0; i < maxIterations; i++) {
    // OBSERVE
    const lintResults = await runLinter();
    if (lintResults.errors.length === 0) break;
    // ACT
    const fixes = await claude.generateFixes(lintResults);
    await applyFixes(fixes);
    // REFLECT
    console.log(`Iteration ${i+1}: Fixed ${fixes.length} issues`);
  }
}
Use cases: Linting fixes, test debugging, dependency updates
Level 2: Multi-Step Task Decomposition (1 day implementation)
Add planning and task breakdown:
async function featureLoop(featureSpec: string, maxTasks = 25) {
  // PLAN
  const tasks = await claude.breakdownFeature(featureSpec);
  // for...of also visits tasks pushed during iteration, so follow-up
  // tasks queued in REFLECT get processed; maxTasks caps that growth
  let processed = 0;
  for (const task of tasks) {
    if (++processed > maxTasks) break;
    // OBSERVE
    const context = await analyzeCodebase();
    // PLAN SUB-ACTIONS
    const actions = await claude.planImplementation(task, context);
    // ACT
    for (const action of actions) {
      await executeAction(action);
      await runTests();
    }
    // REFLECT
    const taskComplete = await claude.validateCompletion(task);
    if (!taskComplete) {
      tasks.push(await claude.identifyGaps(task));
    }
  }
}
Use cases: Feature implementation, refactoring, migration tasks
Level 3: Autonomous Multi-Day Projects (1 week implementation)
Full harness with error recovery, checkpoints, and human approval gates:
interface HarnessConfig {
  task: string;
  maxTurns: number;
  checkpointInterval: number;
  humanApprovalRequired: string[]; // e.g., ["database_migration", "api_breaking_change"]
}
async function autonomousHarness(config: HarnessConfig) {
  const context = new ProjectContext();
  // A for loop advances the turn counter even when a plan is rejected,
  // so repeated rejections cannot spin forever
  for (let turn = 0; turn < config.maxTurns; turn++) {
    // OBSERVE
    const status = await context.fullAnalysis();
    // PLAN
    const plan = await claude.strategicPlan(config.task, status, turn);
    // HUMAN CHECKPOINT
    if (config.humanApprovalRequired.includes(plan.actionType)) {
      const approved = await requestHumanApproval(plan);
      if (!approved) continue; // a rejected plan still consumes a turn
    }
    // ACT
    try {
      await executeActionSafely(plan.action);
    } catch (error) {
      // ERROR RECOVERY: one recovery attempt; a second failure propagates
      const recovery = await claude.recoverFromError(error, context);
      await executeActionSafely(recovery.action);
    }
    // VALIDATE
    const testResults = await context.runFullTestSuite();
    // REFLECT
    const reflection = await claude.evaluateProgress(
      config.task,
      status,
      testResults,
      turn
    );
    if (reflection.taskComplete) break;
    if (reflection.stuck) {
      await requestHumanIntervention(reflection.issue);
    }
    // CHECKPOINT
    if (turn % config.checkpointInterval === 0) {
      await context.createCheckpoint();
    }
  }
  return context.generatePullRequest();
}
Use cases: Full feature development, complex refactors, multi-service changes
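The Level 3 harness calls context.createCheckpoint() without showing what a checkpoint might hold. Below is a minimal sketch in Python, assuming a checkpoint is nothing more than the task name, the turn counter, and a list of completed step names stored in a JSON file. The HarnessState class, its fields, and the file layout are hypothetical and only illustrate the save-and-resume idea.

```python
import json
import os
import tempfile
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class HarnessState:
    """Hypothetical harness state: just enough to resume a run."""
    task: str
    turn: int = 0
    completed_steps: list = field(default_factory=list)


def save_checkpoint(state: HarnessState, path: Path) -> None:
    """Write to a temp file first, then rename, so a crash cannot leave a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(asdict(state), tmp)
    os.replace(tmp_name, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path: Path, task: str) -> HarnessState:
    """Resume from the checkpoint if it exists and belongs to the same task."""
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
        if data.get("task") == task:
            return HarnessState(**data)
    return HarnessState(task=task)  # no usable checkpoint: start fresh
```

A harness would call load_checkpoint once at startup and save_checkpoint every checkpointInterval turns. A real checkpoint would likely also record a git commit hash so the working tree can be restored alongside the counters.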
The 14% CLAUDE.md Tax and How to Fix It
Boris Cherny highlighted a critical insight: 14% of developer productivity is lost to poorly structured CLAUDE.md files (or equivalent project context files).
The Problem
Bad CLAUDE.md:
# My Project
This is a web app for users.
## Stack
- React
- Node
- Postgres
## Instructions
Be helpful!
Result: Claude wastes turns asking basic questions about:
- Project structure
- Code style preferences
- Testing approach
- Deployment process
- Business logic context
The Solution: Structured Context
Good CLAUDE.md for harness engineering:
# Project Context for AI Agents
## Architecture Map
- `/app/*` - Next.js App Router (React Server Components)
- `/lib/db/*` - Prisma ORM, PostgreSQL schemas
- `/lib/api/*` - tRPC API routes
- `/components/*` - React components (shadcn/ui + Tailwind)
## Code Style (CRITICAL - Follow Exactly)
- Server components by default; 'use client' only when needed
- Prefer server actions over API routes for mutations
- Database queries only in server components or server actions
- All async functions must handle errors with try-catch
- Use Zod for all input validation
## Testing Strategy
- Unit tests: Vitest for pure functions
- Integration tests: Playwright for user flows
- Run `pnpm test` before any commit
- Coverage requirement: 70%+
## Common Patterns
### Adding a new API endpoint
1. Define Zod schema in `/lib/schemas`
2. Create tRPC procedure in `/lib/api/routers`
3. Write integration test in `__tests__/api`
4. Update OpenAPI docs if public endpoint
### Database changes
1. Modify schema in `prisma/schema.prisma`
2. Run `pnpm db:migrate:dev` to create migration
3. Update seed data if needed
4. Test migration rollback works
## Deployment
- Production: Vercel (auto-deploy on main branch)
- Staging: Railway (auto-deploy on develop branch)
- Never commit secrets - use `.env.local` and Vercel env vars
## Business Context
- Users are B2B SaaS companies (SMB to mid-market)
- Average deal size: $50K-200K/year
- Security/compliance critical: SOC2, GDPR
- Performance target: p95 page load < 2s
Impact: Reduces wasted turns by 60%, allows Claude to make informed decisions without asking.
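One way to keep a context file from drifting back toward the bad example is to check it for the sections the good example contains. The sketch below assumes second-level markdown headings ("## ...") mark sections. The required-section list mirrors the example above, and the function name is hypothetical.

```python
import re

# Section names taken from the example context file above; adjust per project.
REQUIRED_SECTIONS = [
    "Architecture Map",
    "Code Style",
    "Testing Strategy",
    "Common Patterns",
    "Deployment",
    "Business Context",
]


def missing_sections(markdown_text, required=REQUIRED_SECTIONS):
    """Return the required section names that have no matching '## ' heading."""
    headings = re.findall(r"^##\s+(.+?)\s*$", markdown_text, flags=re.MULTILINE)
    lowered = [h.lower() for h in headings]
    # Prefix match, so "Code Style (CRITICAL - Follow Exactly)" satisfies "Code Style".
    return [
        name for name in required
        if not any(h.startswith(name.lower()) for h in lowered)
    ]
```

Run against the bad example, this reports all six sections as missing. Run against the good example, it returns an empty list. It checks only that the headings exist, not that the content under them is useful.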
Real-World Success Stories
1. Developer Reports 76% Success Rate
Early adopters of harness engineering on Twitter report:
- 76% task completion on open-ended software projects
- 3-5x faster than manual development for complex features
- Reduced context-switching: Set task, review final PR hours later
2. Tutorials Going Viral
The community has created extensive guides:
- 24-minute workshop on harness engineering fundamentals
- Step-by-step loop design tutorials
- Open-source harness frameworks (LangGraph, AutoGPT-based)
3. Anthropic's Internal Adoption
By May 2026:
- Every Anthropic engineer uses harness-based workflows
- 80%+ production code written by Claude
- Human role shifted: Architecture, review, strategy—not implementation
The Criticism: Bugs, Waste, and Expertise Gaps
Not everyone is convinced. Critics raise valid concerns:
1. Loop Bugs Can Waste Hours
Poorly designed loops fail in several ways:
- Infinite loops: Claude keeps "fixing" the same issue differently
- Premature termination: Stops before the task is actually complete
- Wasted compute: Runs expensive API calls on low-value iterations
Mitigation:
- Set max turns (20-50 for most tasks)
- Add explicit termination conditions
- Monitor token usage and costs
- Implement circuit breakers for repeated failures
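Two of these mitigations, the turn cap and the circuit breaker, can live in one small guard object that the loop consults every turn. This is a sketch under stated assumptions: LoopGuard and its thresholds are hypothetical, and "repeated failure" is defined here as the same error message appearing on consecutive turns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopGuard:
    """Hypothetical guard that tells a harness loop when it must stop."""
    max_turns: int = 30
    max_repeats: int = 3  # identical consecutive failures before tripping
    turns: int = 0
    last_error: Optional[str] = None
    repeat_count: int = 0

    def record(self, error: Optional[str] = None) -> None:
        """Call once per turn with that turn's error message, or None on success."""
        self.turns += 1
        if error is None:
            self.repeat_count = 0
        elif error == self.last_error:
            self.repeat_count += 1
        else:
            self.repeat_count = 1
        self.last_error = error

    def stop_reason(self) -> Optional[str]:
        """Return why the loop should stop, or None if it may continue."""
        if self.repeat_count >= self.max_repeats:
            return "repeated_failure"
        if self.turns >= self.max_turns:
            return "max_turns"
        return None
```

Inside a loop, call guard.record(error) after each turn and break when guard.stop_reason() returns a value. Comparing raw error strings is deliberately crude. Normalizing line numbers or file paths out of the message first would catch more "same bug, different fix" cycles.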
2. Requires Prompt Engineering Expertise
Designing effective harnesses isn't beginner-friendly:
- Writing observation prompts that extract relevant context
- Structuring action spaces (which operations are allowed?)
- Calibrating reflection prompts to avoid hallucinated "success"
- Handling edge cases and error recovery
Reality: This is a new skill set. Teams need training and iteration.
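As one illustration of structuring an action space, a harness can validate every model-proposed action against an allowlist before executing it. The sketch below assumes actions arrive as plain dictionaries with a "type" key and, for edits, a "path" key. The action names and the writable directories are hypothetical.

```python
from pathlib import PurePosixPath

ALLOWED_ACTIONS = {"read_file", "edit_file", "run_tests"}  # hypothetical action names
WRITABLE_ROOTS = ("src", "tests")  # hypothetical project layout


def validate_action(action):
    """Return (ok, reason); reject anything outside the declared action space."""
    kind = action.get("type")
    if kind not in ALLOWED_ACTIONS:
        return False, f"action {kind!r} is not in the allowed set"
    if kind == "edit_file":
        path = PurePosixPath(action.get("path", ""))
        if path.is_absolute() or ".." in path.parts:
            return False, "path escapes the repository"
        if not path.parts or path.parts[0] not in WRITABLE_ROOTS:
            return False, "path is outside the writable directories"
    return True, "ok"
```

The rejection reason can be fed back to the model as the next observation, which tends to be more useful than silently dropping the action.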
3. Not All Tasks Suit Autonomous Loops
Bad fit for harness engineering:
- High-risk changes: Database migrations on production
- Creative/strategic work: Product vision, UX design philosophy
- Highly ambiguous tasks: "Make the app better"
Good fit:
- Well-defined scope: "Add email verification to signup flow"
- Testable outcomes: "All tests pass + no type errors"
- Repetitive patterns: "Update all API routes to use new auth middleware"
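A testable outcome such as "all tests pass + no type errors" can be turned into a completion check that does not rely on the model's own claim of success. The sketch below runs a list of commands and reports success only if every one exits with status 0. The demo commands are stand-ins so the example runs anywhere; a real project would substitute its own test and type-check commands (an assumption, not something the sources specify).

```python
import subprocess
import sys


def checks_pass(commands, timeout_s=600):
    """Return True only if every command exits with status 0."""
    for cmd in commands:
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except (subprocess.TimeoutExpired, OSError):
            # A hung or missing tool counts as a failed check, not a pass.
            return False
        if result.returncode != 0:
            return False
    return True


# Stand-in checks built from the current Python interpreter.
DEMO_CHECKS = [
    [sys.executable, "-c", "assert 1 + 1 == 2"],
    [sys.executable, "-c", "import json"],
]

if __name__ == "__main__":
    print("task complete" if checks_pass(DEMO_CHECKS) else "keep iterating")
```

Using exit codes as the termination condition keeps the reflect step honest: the loop ends because the checks passed, not because the model said they would.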
Tools That Support Harness Engineering
Native Support
- Claude Code CLI (Anthropic)
  - Built-in loop orchestration with the /goal command
  - Persistent context across turns
  - Tool use (file ops, bash, testing)
  - Reflection and planning prompts
- Cursor AI (Anysphere)
  - Agent mode with multi-turn execution
  - Composer for complex refactors
  - Integrated testing and linting loops
- Aider (Open Source)
  - Git-integrated AI pair programmer
  - Automatic commit loops
  - Context-aware file selection
Frameworks for Custom Harnesses
- LangGraph (LangChain)
  - State machine for agent workflows
  - Built-in checkpointing and recovery
  - Conditional loops and branching
- AutoGPT
  - Autonomous task execution
  - Memory and learning across runs
  - Plugin ecosystem for tool use
- CrewAI
  - Multi-agent orchestration
  - Role-based agent specialization
  - Shared context management
How to Get Started (This Week)
Day 1: Learn Loop Fundamentals
- Watch the 24-minute Anthropic workshop (search "Anthropic harness engineering")
- Read Boris Cherny's thread on loops vs prompts
- Analyze successful loop examples on GitHub
Day 2: Implement Simple Loop
- Choose a repetitive task (linting, test fixing, docs generation)
- Write a 10-turn observe-act loop using Claude API
- Test on small codebase
Day 3: Add Planning Layer
- Implement task decomposition (break feature into sub-tasks)
- Add reflection step (validate each sub-task completion)
- Test on medium-complexity feature
Day 4: Production Hardening
- Add error recovery loops
- Implement human approval gates for high-risk actions
- Set up monitoring and cost tracking
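For the cost-tracking step, a running estimate can be kept from per-turn token counts. This is a minimal sketch: the CostTracker class is hypothetical, and the prices are caller-supplied placeholders per million tokens, not real rates for any model.

```python
from dataclasses import dataclass


@dataclass
class CostTracker:
    """Hypothetical running cost estimate for one harness run."""
    input_price_per_mtok: float   # USD per million input tokens (placeholder)
    output_price_per_mtok: float  # USD per million output tokens (placeholder)
    budget_usd: float
    input_tokens: int = 0
    output_tokens: int = 0

    def add(self, input_tokens: int, output_tokens: int) -> None:
        """Record the token usage reported for one model call."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def spent_usd(self) -> float:
        total = (
            self.input_tokens * self.input_price_per_mtok
            + self.output_tokens * self.output_price_per_mtok
        )
        return total / 1_000_000

    @property
    def over_budget(self) -> bool:
        return self.spent_usd >= self.budget_usd
```

With placeholder prices of 2.0 and 10.0 and a budget of 1.0, recording 100,000 input and 20,000 output tokens gives an estimated spend of 0.40, so the loop continues. A harness would check over_budget at the top of each turn and stop or ask for approval once it trips.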
Day 5: Optimize Context (Fix the 14% Tax)
- Write comprehensive CLAUDE.md with architecture, patterns, business context
- Test that loops waste fewer turns asking basic questions
- Measure turn reduction
Weekend: Scale to Real Projects
- Run harness on actual feature from backlog
- Review final output, measure time savings
- Iterate based on gaps and failures
The Future: Loops Everywhere
The shift from single prompts to iterative loops isn't limited to coding.
Emerging patterns across domains:
- Marketing: Content loop generates blog, gets SEO analysis, refines, publishes
- Data science: Model training loop evaluates performance, adjusts hyperparameters, reruns
- Customer support: Ticket resolution loop analyzes issue, drafts response, validates with knowledge base
- Design: UI generation loop creates mockups, validates accessibility, iterates on feedback
By 2027, every serious AI application will be loop-based, not prompt-based.
Single prompts will remain for:
- Quick one-off questions
- Creative brainstorming
- Simple content generation
But complex work—software, analysis, research, strategy—will all use harness engineering.
Conclusion: Stop Prompting, Start Building
The developers shipping 8x more code aren't writing better prompts. They're building better systems.
The harness engineering mindset:
- Don't ask AI to solve problems. Build systems that solve problems using AI.
- Don't write prompts. Write loops that write prompts.
- Don't generate code. Orchestrate agents that generate, test, refine, and ship code.
If you're still manually prompting Claude for every change, you're using a sports car as a bicycle.
The playbook:
- Identify a complex, multi-step task in your workflow
- Design an observe-plan-act-reflect loop to automate it
- Implement with checkpoints, validation, and human gates
- Iterate based on failures and edge cases
- Scale to more tasks as you learn loop design
By end of 2026, harness engineering will be a core skill for every software engineer—as fundamental as Git, testing, or code review.
The question isn't whether you'll adopt it. It's whether you'll be an early mover or a late adopter.
Related Resources
- Agent Harness Engineering: Terminal Bench and LangChain 2026 - Deep dive into harness frameworks
- Claude Code Goal Command: Long-Running Agents 2026 - Using Claude's built-in loop orchestration
- Agent Markdown Files Complete Guide 2026 - Optimize CLAUDE.md for better context
- Karpathy Claude Code Guidelines: Andrej Karpathy Skills - Best practices from AI pioneers
Last updated: June 8, 2026 | Research sources: Boris Cherny (Anthropic), developer community reports, harness engineering tutorials, production deployment data