Zero-Shot vs Few-Shot vs Chain-of-Thought Prompting: Complete Guide 2026
There is a gap between knowing that prompt techniques exist and knowing which one to reach for in a specific situation. Zero-shot, few-shot, chain-of-thought, self-consistency, Tree of Thoughts: the vocabulary has expanded faster than the practical guidance around it.
This guide closes that gap. Each technique is explained from first principles, illustrated with concrete prompts you can copy, and placed in a decision framework that tells you when to use which. By the end, you will understand not just what each technique does but why it works — which is the only knowledge that transfers to new tasks.
What In-Context Learning Actually Is
Before comparing techniques, it is worth being precise about the mechanism underlying all of them.
In-context learning is the ability of a sufficiently large language model to adapt its behaviour based on examples provided within the prompt itself — without any change to the model's weights. You do not need to retrain, fine-tune, or even call a training API. You include examples, and the model adjusts.
This emerged as a surprise at scale. Early language models did not exhibit it. GPT-3 (2020) was among the first to demonstrate it convincingly: add a few labelled examples to the prompt, and performance on classification tasks jumps dramatically — sometimes matching fine-tuned models that saw thousands of examples.
Why does it happen? A common explanation is that during pre-training on enormous text corpora, the model encountered millions of patterns where a sequence of examples was followed by a natural continuation. It learned, at a statistical level, to infer the task from the examples and produce the right continuation. The examples in your prompt trigger this latent capability.
Critically: the model is not learning in the training sense. Weights do not update. What happens is pattern-matching at inference time. The examples narrow the model's probability distribution over outputs, steering it toward the format and label space you want. This distinction matters because it tells you what in-context learning can and cannot do: it can nudge a capable model toward a format; it cannot teach a model something it has no pre-training signal for.
Every prompting technique in this guide is a different way of exploiting this mechanism.
Zero-Shot Prompting
What It Is
Zero-shot prompting means giving the model a task description and asking it to perform the task — with no examples. The model relies entirely on knowledge and capability accumulated during pre-training.
Classify the sentiment of the following customer review as Positive, Negative, or Neutral.
Review: "The packaging was damaged but the product itself works perfectly."
Sentiment:
That is a zero-shot prompt. No examples of what Positive, Negative, or Neutral reviews look like. No sample outputs. Just the task and the input.
Why It Works (When It Does)
Zero-shot works when the task is well-represented in the model's pre-training data. Sentiment classification, translation between major languages, summarisation of standard documents, named-entity extraction, basic code generation — these are tasks the model has encountered in thousands of variations. The instruction alone is enough to activate the right behaviour.
Zero-shot also has a practical advantage: no examples means no token overhead, faster iteration, and no risk of examples biasing the model toward patterns that do not generalise.
When It Fails
Zero-shot fails in three predictable scenarios:
Unusual output format. If you want JSON with a specific schema — {"sentiment": "Negative", "confidence": 0.8, "reason": "..."} — zero-shot often produces something close but not exact. The model has a strong prior toward prose output, and without an example, it approximates rather than follows precisely.
Non-standard label space. Standard classification labels like Positive/Negative/Neutral work zero-shot because they are canonical. But if your categories are company-specific (BILLING_DISPUTE, FEATURE_REQUEST_PREMIUM, CHURN_RISK), the model has no training signal for them. Zero-shot will guess, usually poorly.
Strong conflicting priors. If the model's pre-training strongly associates a surface pattern with a different output, zero-shot cannot override that. Asking the model to classify a clearly sarcastic negative review as "Positive" because it is from a loyalty customer requires a nuanced instruction that zero-shot often cannot sustain.
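The format failure in particular is easy to detect mechanically. The sketch below checks a response against the sentiment schema mentioned above before anything downstream uses it. The function name `validate_sentiment_output` and the decision to report problems rather than repair them are illustrative assumptions, not part of any library.

```python
import json

ALLOWED_SENTIMENTS = {"Positive", "Negative", "Neutral"}
EXPECTED_KEYS = {"sentiment", "confidence", "reason"}

def validate_sentiment_output(raw: str) -> list[str]:
    """Return a list of problems found in a model response; empty means valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    if set(data) != EXPECTED_KEYS:
        problems.append(f"unexpected keys: {sorted(data)}")

    sentiment = data.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        problems.append("sentiment outside the label space")

    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        problems.append("confidence is not a number in [0, 1]")

    if not isinstance(data.get("reason"), str):
        problems.append("reason is not a string")
    return problems
```

If a zero-shot prompt regularly produces non-empty problem lists, that is a signal to anchor the schema with an example rather than to keep rewording the instruction.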
Zero-Shot Summarisation Example
Summarise the following technical incident report in exactly three bullet points.
Each bullet must start with a bolded category label: **Cause**, **Impact**, **Resolution**.
Do not include any additional text before or after the bullets.
Incident report:
On March 14 at 02:17 UTC, the authentication service began returning 503 errors
due to a misconfigured load balancer rule deployed at 01:45 UTC. Approximately
12,000 users were unable to log in for 43 minutes. The on-call engineer identified
the faulty rule at 02:58 UTC and rolled it back. Service restored at 03:00 UTC.
Full post-mortem scheduled for March 17.
Summary:
This will work reliably zero-shot because summarisation and bullet formatting are deeply represented in pre-training. But notice that even here, the prompt is precise about count, format, and constraints — zero-shot does not mean vague.
One-Shot Prompting
One-shot prompting adds exactly one example before the task. It occupies the space between zero-shot and few-shot: cheaper than a full few-shot battery, but more reliable than no examples at all.
The example functions as a template, not just an instruction. You are showing the model the exact shape of the output you want.
Classify the sentiment of the customer review. Output a JSON object with keys
"sentiment" (Positive/Negative/Neutral), "confidence" (0.0–1.0), and "reason" (one sentence).
Example:
Review: "Arrived two days late but the quality is outstanding."
Output: {"sentiment": "Positive", "confidence": 0.75, "reason": "Product quality outweighs delivery delay in the customer's assessment."}
Now classify:
Review: "The customer service team refused to process my refund despite the item being defective."
Output:
The single example has done something the instruction alone could not: it established the exact JSON schema, the confidence range, and the expected length and tone of the reason field. The model now has a concrete target to match.
Use one-shot when:
- The output format is unusual or precise
- The task is straightforward but the schema needs anchoring
- You cannot afford the token cost of multiple examples
Few-Shot Prompting
What It Is
Few-shot prompting provides 2–20 worked examples before the task. Each example shows an input and the correct output, letting the model infer the task, label space, format, and edge-case handling simultaneously.
Few-shot is the workhorse of production prompt engineering. For classification tasks, extraction tasks, and any task with a rigid output schema, few-shot consistently outperforms zero-shot by a wide margin.
Concrete Few-Shot Classification Prompt
You are a support ticket router. Classify each ticket into exactly one category:
BILLING, TECHNICAL, FEATURE_REQUEST, or ACCOUNT_ACCESS.
---
Ticket: "I was charged twice for my subscription this month."
Category: BILLING
Ticket: "The export to CSV function crashes when there are more than 500 rows."
Category: TECHNICAL
Ticket: "It would be great if we could schedule reports to run automatically."
Category: FEATURE_REQUEST
Ticket: "I can't log in — it says my account has been suspended but I haven't received any email."
Category: ACCOUNT_ACCESS
Ticket: "My invoice shows a different amount than what I agreed to in the contract."
Category: BILLING
Ticket: "The API keeps returning a 429 error even though I'm well under my rate limit."
Category: TECHNICAL
---
Now classify the following ticket:
Ticket: "I need to transfer my account to a different email address."
Category:
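Six examples cover all four categories. BILLING and TECHNICAL each get two examples (assumed here to be the more common classes), and the second BILLING example is a contract dispute rather than a simple double charge, so the model sees more than one shape of billing ticket. The model now has both the label space and a sense of the boundary between categories.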
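In practice a prompt like this is usually assembled from data rather than typed by hand, which also guarantees that every example uses the same labels and delimiters. A minimal sketch follows; the function name `build_few_shot_prompt` and its parameters are hypothetical, chosen only to mirror the ticket prompt above.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Ticket", output_label="Category"):
    """Assemble a few-shot prompt with identical formatting for every example.

    examples is a list of (input_text, label) pairs.
    """
    lines = [instruction, "---"]
    for text, label in examples:
        lines.append(f'{input_label}: "{text}"')
        lines.append(f"{output_label}: {label}")
    lines += [
        "---",
        f"Now classify the following {input_label.lower()}:",
        f'{input_label}: "{query}"',
        f"{output_label}:",  # left open for the model to complete
    ]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "You are a support ticket router. Classify each ticket into exactly one category.",
    [("I was charged twice for my subscription this month.", "BILLING"),
     ("The export to CSV function crashes.", "TECHNICAL")],
    "I need to transfer my account to a different email address.",
)
```

Passing a single pair in `examples` yields a one-shot prompt, so the same helper covers both cases.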
What Makes a Good Few-Shot Example
This is where most practitioners make mistakes. The quality of examples matters far more than the quantity.
Label distribution must reflect reality. If 70% of your actual inputs are TECHNICAL tickets, your examples should be roughly 70% TECHNICAL. If you show equal examples for each label but the real distribution is skewed, the model will be poorly calibrated. It will spread its predictions more evenly than reality warrants.
Diversity beats repetition. Two identical BILLING examples teach the model less than one standard billing example and one edge-case billing example (e.g., a refund request vs a pricing dispute). Cover the range of your input space, not just the easy centre.
Format consistency is non-negotiable. If your first three examples use Category: as the output label and example four uses Label:, the model will notice the inconsistency and may waver on which to use. Every example must follow identical formatting — same delimiters, same key names, same capitalisation.
Length should be controlled. Long examples consume context window budget that could be spent on the actual task. If an input can be shortened without losing the feature that makes it a good example, shorten it. On long-context models this matters less, but it is still good hygiene.
Include edge cases deliberately. The easy examples are the ones the model would get right anyway. The examples that move the needle are the ones at the decision boundary: a customer complaint that is simultaneously a billing issue and an account access issue, or a review whose positive wording is sarcastic. These boundary examples teach the model where you draw the line.
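Two of these criteria, label distribution and diversity, can be checked automatically before a prompt ships. The sketch below is one possible audit, not a standard tool: the name `audit_examples`, the 15-point tolerance, and the idea of supplying an `expected_share` mapping from your own traffic data are all assumptions.

```python
from collections import Counter

def audit_examples(examples, expected_share, tolerance=0.15):
    """Flag duplicate inputs and labels whose share drifts from the expected one.

    examples: list of (input_text, label) pairs.
    expected_share: dict mapping label -> expected fraction of real traffic.
    """
    warnings = []

    # Exact duplicates (ignoring case and surrounding whitespace) add no information.
    inputs = Counter(text.strip().lower() for text, _ in examples)
    for text, count in inputs.items():
        if count > 1:
            warnings.append(f"duplicate example input: {text!r}")

    # Compare the label mix in the examples with the mix expected in production.
    labels = Counter(label for _, label in examples)
    total = len(examples)
    for label, share in expected_share.items():
        actual = labels.get(label, 0) / total if total else 0.0
        if abs(actual - share) > tolerance:
            warnings.append(f"{label}: {actual:.0%} of examples vs {share:.0%} expected")

    for label in labels:
        if label not in expected_share:
            warnings.append(f"{label}: not in the expected label space")
    return warnings
```

This only catches literal duplicates; near-duplicates and missing edge cases still need a human read of the example set.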
Few-Shot Extraction Prompt
Extract structured data from the job posting. Output JSON only, no prose.
---
Posting: "Senior Backend Engineer at Moonshot Labs — Remote (US only) — $180k–$220k — 7+ years Python, Postgres, Kafka required — Apply by July 31"
Output: {"title": "Senior Backend Engineer", "company": "Moonshot Labs", "location": "Remote (US only)", "salary_range": "$180k–$220k", "required_skills": ["Python", "Postgres", "Kafka"], "min_experience_years": 7, "application_deadline": "July 31"}
Posting: "Product Designer, Fintech — Hybrid (London) — Competitive salary — 3–5 yrs experience, Figma expert, fintech background preferred — No specified deadline"
Output: {"title": "Product Designer", "company": null, "location": "Hybrid (London)", "salary_range": null, "required_skills": ["Figma"], "min_experience_years": 3, "application_deadline": null}
---
Posting: "Staff ML Engineer at VectorIQ — San Francisco, CA — $250k–$300k total comp — 10+ years experience, PyTorch, LLM fine-tuning, distributed systems — Rolling applications"
Output:
Notice the second example deliberately includes nulls: no company name, no salary, no deadline. Without that example, the model is more likely to invent plausible values than to return null. Edge cases in examples reduce hallucination of missing fields.
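Even with null-bearing examples, it can be worth checking extracted values against the source text. The following is a rough heuristic sketch, with `ungrounded_fields` as a hypothetical helper name: it flags any field whose value does not literally appear in the posting, so normalised or paraphrased values will be flagged even when they are right.

```python
import json

def ungrounded_fields(posting: str, output_json: str) -> list[str]:
    """Return keys whose extracted values cannot be found in the posting text.

    A heuristic only: it relies on case-insensitive substring matches.
    """
    extracted = json.loads(output_json)
    source = posting.lower()
    flagged = []
    for key, value in extracted.items():
        if value is None:
            continue  # null means "not stated", which is the desired output for gaps
        items = value if isinstance(value, list) else [value]
        if any(str(item).lower() not in source for item in items):
            flagged.append(key)
    return flagged
```

A flagged field is a prompt for review, not proof of hallucination; an empty result likewise does not prove the extraction is correct.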
Chain-of-Thought Prompting
The Core Idea
Chain-of-thought (CoT) prompting, introduced by Wei et al. in 2022, works on a simple insight: if you want the model to reason correctly, make it show its reasoning.
The original finding was striking. On multi-step math and logic benchmarks, large models performed dramatically better when prompted to reason step by step before giving the final answer. On some benchmarks, accuracy more than doubled. Follow-up work by Kojima et al. (2022) showed that the technique works not just with few-shot examples but with a single zero-shot addition: "Let's think step by step."
Why does generating intermediate steps help? Several mechanisms are at work:
- Sequential correction. Each generated token conditions all subsequent tokens. When the model writes out a correct intermediate step, the next step is more likely to be correct. Errors propagate less because each step is grounded in a written, verifiable intermediate.
- Computation vs recall. A model producing a direct answer to a math problem must "compress" all arithmetic into a single prediction. A model writing out calculations can effectively perform the arithmetic token by token, which is far more reliable.
- Longer effective computation. Transformer models have fixed depth: the number of layers limits how much "thinking" can happen for a single forward pass. But when the model generates a long reasoning chain, it effectively performs deeper computation across many forward passes, one per token.
Zero-Shot Chain-of-Thought
The simplest CoT technique requires no examples at all:
A store sells apples for $0.50 each and oranges for $0.75 each.
Emily buys 8 apples and 5 oranges. She pays with a $10 bill.
How much change does she receive?
Let's think step by step.
The phrase "Let's think step by step" is the entire intervention. The model will produce something like:
Step 1: Calculate the cost of apples.
8 apples × $0.50 = $4.00
Step 2: Calculate the cost of oranges.
5 oranges × $0.75 = $3.75
Step 3: Calculate the total cost.
$4.00 + $3.75 = $7.75
Step 4: Calculate the change.
$10.00 − $7.75 = $2.25
Emily receives $2.25 in change.
Without "Let's think step by step", the model is more likely to collapse the arithmetic and make an error. With it, the model traces each step, which catches arithmetic mistakes before they compound.
Alternatives that produce similar effects:
- "Walk me through your reasoning before answering."
- "Show your work."
- "First, think about what information you have. Then reason to the answer."
- "Think carefully before giving your final answer."
Few-Shot Chain-of-Thought
Few-shot CoT combines example provision with explicit reasoning chains. You write out not just the correct answer but the correct reasoning path, and the model learns to follow it.
Answer the following logic questions. Think step by step before giving the final answer.
---
Question: All mammals are warm-blooded. Dolphins are mammals. Are dolphins warm-blooded?
Reasoning: The first premise tells us all mammals share the property of being warm-blooded. The second premise establishes that dolphins belong to the category of mammals. Since dolphins are mammals, and all mammals are warm-blooded, dolphins must be warm-blooded.
Answer: Yes, dolphins are warm-blooded.
Question: If it rains, the ground gets wet. The ground is wet. Did it rain?
Reasoning: The premise only tells us that rain causes wet ground — if rain then wet ground. But wet ground can have other causes (sprinklers, flooding, spilled water). Wet ground is consistent with rain but does not prove rain was the cause. This is a logical fallacy known as affirming the consequent.
Answer: Not necessarily. The ground being wet is consistent with rain but does not prove it rained.
---
Question: No reptiles are warm-blooded. Snakes are reptiles. Are snakes warm-blooded?
Reasoning:
The few-shot reasoning chains show the model how to handle both valid syllogisms and logical fallacies. The model learns both the answer format and the type of reasoning to apply.
Few-shot CoT is generally stronger than zero-shot CoT, but it requires more work: you must write correct, explicit reasoning chains for each example. On complex domains such as medical reasoning, legal analysis, and multi-step code debugging, this investment pays off.
When to Use CoT vs Standard Prompting
| Task type | CoT helpful? | Reason |
|---|---|---|
| Multi-step arithmetic | Yes, strongly | Sequential correction of arithmetic errors |
| Logical deduction | Yes, strongly | Forces structured reasoning over recall |
| Factual recall | No | CoT adds tokens without improving accuracy |
| Simple classification | No | No reasoning steps involved |
| Creative writing | Rarely | Reasoning chains disrupt fluency |
| Code generation | Sometimes | Helps for algorithmic problems, not boilerplate |
| Summarisation | No | Task doesn't require stepwise reasoning |
The rule of thumb: use CoT whenever the task requires multi-step computation or logical dependency between steps. Avoid it when the task is primarily retrieval or generation without inter-step dependencies.
Self-Consistency: CoT With a Vote
A single chain-of-thought pass can still go wrong — the model commits to a reasoning path early and follows it even when it leads astray. Self-consistency, introduced by Wang et al. (2022), fixes this by generating multiple reasoning chains and taking the majority vote on the final answer.
The procedure:
- Run the same prompt 5–20 times at a slightly elevated temperature (0.5–0.8)
- Each run produces a different reasoning chain
- Extract the final answer from each chain
- Return the most common final answer
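# Pseudocode for self-consistency
# (model.generate and extract_answer are placeholders for your own client and parser)
from collections import Counter

answers = []
for _ in range(10):
    # Sample a fresh reasoning chain at a non-zero temperature
    response = model.generate(prompt + "\nLet's think step by step.", temperature=0.7)
    answers.append(extract_answer(response))

# Majority vote: the most common final answer wins
result = Counter(answers).most_common(1)[0][0]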
Why does this work? A problem usually has one correct answer that can be reached by several different valid reasoning paths, whereas faulty paths tend to scatter across many different wrong answers. Sampling diverse chains and taking a majority vote therefore amplifies the answer that independent lines of reasoning agree on.
On math and reasoning benchmarks, reported gains over single-chain CoT are often in the range of 5–15 percentage points, though the size of the gain varies by model and task. The cost is real, since you are running the model 5–20x, so reserve it for high-stakes single questions where latency and cost are acceptable tradeoffs.
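A runnable variant of the voting step is sketched below, with the model call abstracted away. `sample_answer` is a hypothetical zero-argument callable that runs one chain-of-thought pass and returns the extracted final answer (or `None` when nothing could be parsed). Returning the agreement ratio alongside the winner is an added design choice, not part of the original method description: a low ratio is a cheap signal that the vote was close.

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=10):
    """Call sample_answer() n_samples times and majority-vote the results.

    Returns (winning_answer, agreement), where agreement is the share of
    successfully parsed answers that match the winner.
    """
    raw = (sample_answer() for _ in range(n_samples))
    answers = [a for a in raw if a is not None]  # drop unparseable chains
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Canned answers stand in for five sampled reasoning chains.
canned = iter(["$2.25", "$2.75", "$2.25", None, "$2.25"])
answer, agreement = self_consistency(lambda: next(canned), n_samples=5)
print(answer, agreement)  # $2.25 0.75
```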
Least-to-Most Prompting
CoT works well for single-question reasoning but struggles when a problem has dependencies between subproblems — where you cannot solve step 4 without the answer to step 2.
Least-to-most prompting (Zhou et al., 2022) decomposes complex problems explicitly:
- First, prompt the model to decompose the problem into ordered subproblems
- Solve the easiest/earliest subproblem
- Use that answer as context to solve the next subproblem
- Continue until the full problem is solved
Problem: A company's revenue grew 20% in year 1, then fell 15% in year 2,
then grew 30% in year 3. If the starting revenue was $500,000,
what was the revenue at the end of year 3?
Step 1: What are the subproblems we need to solve in order?
Model output:
Subproblem 1: Calculate revenue after year 1 (20% growth from $500,000)
Subproblem 2: Calculate revenue after year 2 (15% decline from year 1 result)
Subproblem 3: Calculate revenue after year 3 (30% growth from year 2 result)
Then you feed each subproblem sequentially, using the previous answer as input to the next. The result is significantly more reliable than asking the model to solve the full problem in one CoT pass, because each subproblem is simpler and the intermediate answers are grounded in explicit, verified prior steps.
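The sequential feeding step can be sketched as a short loop. Here `solve` is a hypothetical callable standing in for a model call, and the `Q:`/`A:` layout of the running context is just one reasonable choice, not a prescribed format.

```python
def least_to_most(problem, subproblems, solve):
    """Solve subproblems in order, feeding earlier answers into later prompts.

    solve: hypothetical callable (prompt: str) -> str that wraps a model call.
    Returns a list of (subproblem, answer) pairs.
    """
    solved = []
    for sub in subproblems:
        # Everything answered so far becomes context for the next subproblem.
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in solved)
        prompt = f"Problem: {problem}\n{context}\nQ: {sub}\nA:"
        solved.append((sub, solve(prompt).strip()))
    return solved
```

For the revenue example, the three intermediate answers should come out as $600,000, $510,000, and $663,000 (500,000 × 1.20 × 0.85 × 1.30), which gives you fixed points to check each step against.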
Least-to-most excels at:
- Multi-step word problems with numerical dependencies
- Code that requires designing a solution before implementing it
- Research questions that require answering sub-questions in order
- Planning tasks where early decisions constrain later options
Tree of Thoughts: The Model as a Search Algorithm
Tree of Thoughts (ToT) (Yao et al., 2023) extends CoT from a linear chain to a tree structure. Instead of following one reasoning path to completion, the model:
- Generates multiple candidate next steps at each decision point
- Evaluates which candidates are most promising
- Expands promising branches further
- Backtracks from dead ends
- Returns the best complete path found
This turns the model into a search algorithm over reasoning space — closer to how humans solve hard problems (exploring options, abandoning bad approaches, returning to forks in the road) than to how standard CoT works (committing to a path and following it straight through).
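The control flow can be illustrated with a small breadth-first search that keeps only the best few partial solutions at each level. This is a simplified sketch, not a reference implementation: `expand`, `score`, and `is_goal` are hypothetical callables that in a real system would each wrap a model call, and backtracking is only implicit here (branches that fall out of the beam are abandoned).

```python
def tree_of_thoughts(root, expand, score, is_goal, beam_width=3, max_depth=5):
    """Beam-pruned breadth-first search over partial reasoning states.

    expand(state)  -> list of candidate next states
    score(state)   -> number, higher means more promising
    is_goal(state) -> True when the state is a complete solution
    Returns the first goal state found, or None.
    """
    frontier = [root]
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            for nxt in expand(state):
                if is_goal(nxt):
                    return nxt
                candidates.append(nxt)
        if not candidates:
            return None  # every branch was a dead end
        # Keep only the most promising branches for the next level.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return None


# Toy problem: pick numbers from (1, 3, 4) until they sum to exactly 10.
path = tree_of_thoughts(
    root=(),
    expand=lambda s: [s + (n,) for n in (1, 3, 4)],
    score=lambda s: -abs(10 - sum(s)),
    is_goal=lambda s: sum(s) == 10,
    beam_width=2,
)
```

The toy scoring function is exact; with a model acting as evaluator the scores are noisy, which is why a wider beam or a depth-first variant with explicit backtracking may be needed.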
ToT is overkill for most tasks but genuinely valuable for:
- Creative writing with specific constraints (word games, constrained poetry)
- Mathematical proofs where multiple approaches must be evaluated
- Planning problems with many interdependent decisions
- Any task where a single wrong turn early in reasoning invalidates the whole answer
The implementation complexity is higher than other techniques — you need to orchestrate multiple model calls, implement the branching and evaluation logic, and manage a growing tree of partial solutions. In 2026, several agent frameworks expose ToT-style reasoning natively, reducing the engineering overhead.
Decision Table: Which Technique for Which Task
| Task Type | Recommended Technique | Why |
|---|---|---|
| Sentiment classification | Zero-shot | Well-represented in pre-training; no special format |
| Custom-category classification | Few-shot | Model needs examples of non-standard labels |
| Extraction with rigid JSON schema | One-shot or few-shot | Schema needs anchoring via example |
| Summarisation | Zero-shot | Strong pre-training signal; format is flexible |
| Multi-step arithmetic | Zero-shot CoT | "Let's think step by step" is sufficient |
| Complex logic / syllogisms | Few-shot CoT | Domain reasoning chains clarify the reasoning style |
| Translation (major languages) | Zero-shot | Abundant pre-training data |
| Translation (rare languages or style) | Few-shot | Examples anchor the target register and style |
| High-stakes single question | Self-consistency | Majority vote over multiple chains |
| Multi-step with dependencies | Least-to-most | Sequential subproblem solving |
| Open-ended problem with many approaches | Tree of Thoughts | Branching exploration over reasoning space |
| Format-sensitive output | Few-shot | Examples are the most reliable format anchor |
| Code generation (algorithmic) | Zero-shot or CoT | Problem decomposition helps; examples often unnecessary |
| Named-entity extraction | Zero-shot | Standard task; add few-shot only if entity types are unusual |
Technique Comparison at a Glance
| Dimension | Zero-Shot | One-Shot | Few-Shot | CoT (Zero-Shot) | Few-Shot CoT | Self-Consistency |
|---|---|---|---|---|---|---|
| Token cost | Lowest | Low | Medium | Low | Medium-High | Very High |
| Latency | Lowest | Low | Low | Medium | Medium | Very High |
| Format control | Poor | Good | Excellent | Poor | Good | Good |
| Reasoning accuracy | Baseline | Baseline | Baseline | +++ | ++++ | +++++ |
| Example writing effort | None | Minimal | Medium | None | High | None |
| Best for | Standard tasks | Anchoring format | Custom labels/schemas | Multi-step reasoning | Domain reasoning | High-stakes single Q |
The 2026 Context: When the Model Does It For You
Modern frontier models have changed the calculus around manual prompting techniques. Claude's extended thinking mode, GPT-5's reasoning settings, and similar features in other models apply chain-of-thought-style reasoning automatically before producing a response. For many tasks, turning on the model's thinking mode eliminates the need to manually add "Let's think step by step."
This does not make understanding CoT obsolete — it makes it more important. When you understand why CoT works, you can:
- Diagnose when a thinking mode is failing (wrong reasoning style for the domain)
- Write few-shot CoT examples for domain-specific reasoning the model otherwise handles poorly
- Choose the right effort level / thinking budget for a given task
For more on selecting between reasoning modes and model variants, see Claude Effort Parameter and Model Selection.
The broader picture is context engineering — assembling everything the model sees (examples, retrieved documents, tool definitions, constraints) into the tightest, most signal-dense package possible. Few-shot prompting is one component of that assembly. See Context Engineering: Why Clean Prompts Matter for the full stack.
If you are building agent systems that call models in loops — where each call's output becomes the next call's input — the prompting techniques here are applied at every node of the loop. The Agent Harness Complete Guide covers how to structure those loops so examples and reasoning chains are passed effectively across steps.
For Claude-specific prompting patterns including the 4-block structure and XML conventions, Master Prompt Engineering with Claude covers the model-specific conventions that amplify everything discussed here.
Common Mistakes and How to Fix Them
Using few-shot when zero-shot is sufficient. If a task is standard (summarisation, translation, simple classification), few-shot examples add token cost without meaningful accuracy gains. Try zero-shot first; only switch to few-shot if quality is consistently off.
Picking examples that are too similar to each other. Five examples of the same type of BILLING ticket teach the model less than one each of five different billing scenarios. Diversity in examples matters more than quantity.
Inconsistent example formatting. Mixing Answer:, Output:, and Result: as output labels across examples confuses the model about what to produce. Pick one and use it consistently across every example.
Adding CoT to tasks that don't need reasoning. Chain-of-thought adds tokens and sometimes hurts performance on tasks that are primarily retrieval (factual questions where the answer is a single entity) or creative generation (where reasoning chains disrupt fluency). Apply CoT selectively.
Forgetting to extract the final answer in CoT. When using CoT, specify where the final answer should appear: "After your reasoning, provide the final answer on a line beginning with 'Final answer:'". Without this, parsing the answer from the reasoning chain programmatically is brittle.
Using self-consistency for every query. Self-consistency is 5–20x more expensive than a single call. It is worth the cost for high-stakes single questions; it is not worth it for bulk classification, extraction, or any task where a single well-prompted pass is already reliable.
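For the answer-extraction mistake, a small parser is usually enough once the prompt asks for a fixed marker. The sketch below assumes the "Final answer:" convention suggested above; the helper name `extract_final_answer` is illustrative. It takes the last matching line, on the assumption that a model which revises itself mid-chain states its settled answer last.

```python
import re

FINAL_ANSWER = re.compile(r"^\s*Final answer:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)

def extract_final_answer(response: str):
    """Return the text after the last 'Final answer:' line, or None if absent."""
    matches = FINAL_ANSWER.findall(response)
    return matches[-1] if matches else None
```

Returning `None` rather than guessing lets the caller decide whether to retry, discard the sample, or fall back to manual review.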
Putting It Together: A Practical Workflow
When approaching a new task, work through this sequence:
- Start zero-shot. Write the clearest possible instruction and try it on 10–20 representative inputs. If output quality is consistently good and format is correct, stop here.
- Add one example if format is wrong. If the model produces the right content but wrong format, one-shot is usually enough to anchor the schema. If one-shot doesn't fix it, move to few-shot.
- Move to few-shot if labels or categories are wrong. If the model is misclassifying, extracting wrong fields, or ignoring your label space, add 5–10 carefully selected examples covering the full distribution.
- Add CoT if multi-step reasoning is failing. If the task requires arithmetic, logic, or sequential dependencies, add "Let's think step by step" (zero-shot CoT) or write explicit reasoning chains into your examples (few-shot CoT).
- Add self-consistency for high-stakes single questions. If you need maximum accuracy on a specific question and cost/latency are acceptable, run self-consistency with 5–10 samples and take the majority vote.
- Consider least-to-most or ToT for complex structured problems. If CoT still fails because the problem has deep dependency structure or requires exploration, move to decomposition-based approaches.
This workflow avoids over-engineering. Most production tasks resolve at step 1 or 2. The more exotic techniques (self-consistency, ToT, least-to-most) are precision tools for specific hard problems, not defaults.
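The early steps of the workflow depend on telling a format failure apart from a content failure on a small labelled sample. One way to measure both at once is sketched below; `evaluate_prompt`, `run`, and `is_well_formed` are hypothetical names, and exact string match is a deliberately strict stand-in for whatever correctness check suits the task.

```python
def evaluate_prompt(run, dataset, is_well_formed=lambda out: True):
    """Score one prompt variant on a small labelled sample.

    run: hypothetical callable (input_text) -> model output string.
    dataset: non-empty list of (input_text, expected_output) pairs.
    Returns the fraction of exact matches and the fraction of well-formed outputs.
    """
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    correct = well_formed = 0
    for text, expected in dataset:
        out = run(text).strip()
        well_formed += bool(is_well_formed(out))
        correct += out == expected
    n = len(dataset)
    return {"accuracy": correct / n, "well_formed": well_formed / n}
```

A low `well_formed` score points toward adding a format-anchoring example, while well-formed but inaccurate outputs point toward few-shot examples or reasoning techniques.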
Example Prompt Library
A compact reference of ready-to-use prompts for each technique.
Zero-Shot Translation
Translate the following English text to formal Brazilian Portuguese.
Preserve technical terms in English.
Text: "The API endpoint accepts a JSON payload with a required 'query' field and an optional 'filters' array."
Translation:
Zero-Shot CoT for Logic
Determine whether the following argument is logically valid or invalid. Explain your reasoning step by step before giving your final verdict.
Argument: "All engineers know Python. Sarah knows Python. Therefore, Sarah is an engineer."
Let's think step by step.
Few-Shot Classification
Classify the priority of the following bug reports as P0 (production down), P1 (major feature broken), P2 (minor issue), or P3 (cosmetic).
Report: "Users cannot complete checkout — payment form throws a 500 error."
Priority: P0
Report: "The bulk export feature crashes for files over 100MB."
Priority: P1
Report: "Tooltip text on the settings page is truncated on smaller screens."
Priority: P3
Report: "Dark mode toggle doesn't persist across sessions."
Priority: P2
Now classify:
Report: "Search results return in random order instead of relevance order."
Priority:
Few-Shot CoT for Reasoning
Solve the following rate problems. Show your work step by step before giving the final answer.
Problem: A train travels at 60 mph. How long does it take to travel 150 miles?
Reasoning: Time = Distance ÷ Speed = 150 miles ÷ 60 mph = 2.5 hours.
Answer: 2.5 hours
Problem: A worker completes a task in 4 hours. A second worker completes the same task in 6 hours. How long does it take them working together?
Reasoning: Worker 1 completes 1/4 of the task per hour. Worker 2 completes 1/6 of the task per hour. Together: 1/4 + 1/6 = 3/12 + 2/12 = 5/12 per hour. Time = 1 ÷ (5/12) = 12/5 = 2.4 hours.
Answer: 2.4 hours
Problem: A pump fills a tank in 8 hours. A drain empties the same tank in 12 hours. If both are running simultaneously with the tank starting empty, how long until the tank is full?
Reasoning:
Understanding these techniques at the mechanism level — not just as named methods but as specific manipulations of the model's probability distribution — is what separates practitioners who can debug failing prompts from those who can only follow recipes. The techniques are composable: few-shot with CoT, self-consistency over few-shot CoT chains, least-to-most with CoT at each subproblem step. The decision about which combination to apply follows from understanding what each one actually does.