Most prompt engineers operate on vibes. They write a prompt, try it on five inputs, decide it looks good, and ship it. Then they discover at 3am that it breaks on a class of inputs they never tested, or that a model update quietly changed its behavior, or that version B is actually worse than version A for a reason they cannot explain.
Prompt evaluation, building a systematic way to measure whether your prompts are working, is what separates professional prompt engineering from guesswork. This guide shows you how to do it from scratch.
The Eval Mindset
Here is the mental model shift: treat prompts like code and evaluations like tests.
You would not ship a function without verifying it works on edge cases. You would not declare a refactor "done" without running the test suite. Prompts deserve the same discipline.
A prompt is a specification for model behavior. An eval set tests whether the model meets that specification. A regression suite guards against specification drift when models update. This is not extra work; it is the work that turns prompt engineering from a guess into an engineering discipline.
The alternative is superstition: you believe your prompt works because it worked the last three times you tried it. That belief will eventually be wrong, and you will not know until users tell you.
The Three Levels of Prompt Quality
Before you can measure quality, you need to know what you are measuring. There are three distinct dimensions:
Correctness: Does the prompt produce the right output? This is about accuracy: is the extracted entity correct, is the summary factually accurate, is the classification label right? Correctness is evaluated per-example against a known ground truth.
Consistency: Does the prompt produce consistent output across runs? LLMs are stochastic. The same input can produce different outputs. Consistency measures how much variance there is, both across multiple runs of the same input and across semantically similar inputs that should produce similar outputs.
Efficiency: Does the prompt use the minimum tokens needed? A prompt that achieves the same accuracy in 500 tokens as another does in 2,000 tokens is strictly better in production. This matters at scale, where token costs add up.
Most evaluation frameworks focus on correctness. Do not neglect the other two, especially for production systems where cost and consistency directly affect user experience.
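Consistency across runs can be put on a number. The sketch below is one simple way to do it, assuming you have already collected several outputs for the same input and that light normalization (whitespace, case) is acceptable for your task. The function name `consistency_rate` and the sample data are illustrative.

```python
from collections import Counter


def consistency_rate(outputs: list[str]) -> float:
    """Fraction of runs that agree with the most common normalized output."""
    if not outputs:
        raise ValueError("need at least one output")
    normalized = [o.strip().lower() for o in outputs]
    # most_common(1) returns [(value, count)] for the modal output.
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


# Five hypothetical runs of the same classification prompt on one input.
runs = ["Positive", "positive ", "Positive", "Neutral", "positive"]
rate = consistency_rate(runs)  # four of the five runs agree
```

This works for short, closed-form outputs such as labels. For free-form text you would likely swap the exact comparison for a similarity measure.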
Building an Eval Set
What Makes a Good Test Case
A test case has two parts: an input (what you send to the model) and a reference (how you judge the output). The reference can be:
- Golden output: An exact expected output, human-annotated ("The correct answer is X")
- Criteria: A set of conditions the output must meet ("Must contain a list, must be under 100 words, must not recommend a specific product")
- Rubric: A scoring guide ("Score 1-5 on helpfulness, accuracy, and conciseness")
Which reference type to use depends on your task:
| Task Type | Reference Type |
|---|---|
| Classification | Golden label (exact match) |
| Extraction (structured) | Golden JSON output |
| Summarization | Criteria or rubric |
| Q&A with factual answers | Golden answer |
| Open-ended generation | Rubric |
| Code generation | Test suite (run the code) |
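A criteria-style reference like the one described above can often be checked in plain code. This is a minimal sketch of the three example criteria (contains a list, under 100 words, no product recommendation). The product names in `BANNED_PRODUCTS` are made up, and substring matching is only a rough proxy for "recommends a product".

```python
import re

# Hypothetical product names; in practice this list comes from your own domain.
BANNED_PRODUCTS = ["AcmeCloud", "WidgetPro"]


def check_criteria(output: str) -> dict[str, bool]:
    """Check one output against three illustrative criteria."""
    # A line starting with "-", "*", or "1." counts as a list item.
    has_list = bool(re.search(r"^\s*(?:[-*]|\d+\.)\s+\S", output, re.MULTILINE))
    under_100_words = len(output.split()) < 100
    lowered = output.lower()
    no_product = not any(name.lower() in lowered for name in BANNED_PRODUCTS)
    return {
        "contains_list": has_list,
        "under_100_words": under_100_words,
        "no_product_recommendation": no_product,
    }


def passes(output: str) -> bool:
    """An output passes only if every criterion holds."""
    return all(check_criteria(output).values())


sample = "Consider these options:\n- Compare pricing tiers\n- Check the uptime history"
sample_ok = passes(sample)
```

Returning a per-criterion dict instead of a single boolean makes it easier to see which criterion a failing output violated.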
How Many Examples You Need
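50 is the minimum for a meaningful eval. Here is why: if your prompt has 80% accuracy, you need enough examples to tell that apart from 70% or 90%. With 50 examples, the 95% margin of error is roughly 11 points, so a 10-point difference is just about detectable. With 10 examples, the margin is roughly 25 points and the same difference is noise. Resolving a 5-point difference (80% vs 85%) takes closer to 200 examples.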
Practical sizing guidelines:
- Prototype: 20-30 examples covering happy path and obvious edge cases
- Pre-launch: 50-100 examples covering all input subgroups
- Production monitoring: 200+ examples, built up from real production inputs over time
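The sizing argument can be checked with a few lines of arithmetic. This sketch uses the normal approximation to the binomial confidence interval, which is a simplification (it is rough for very small samples or accuracies near 0% or 100%). The function name is illustrative.

```python
import math


def margin_of_error(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% confidence interval."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


# How precisely does an eval of size n pin down an 80% accuracy?
for n in (10, 50, 200):
    points = margin_of_error(0.8, n) * 100
    print(f"n={n:>3}: 80% accuracy, +/- {points:.1f} points")
```

At 80% accuracy this gives a margin of about 24.8 points for 10 examples, 11.1 points for 50, and 5.5 points for 200.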
Building the Initial Eval Set
Start by collecting real examples. If you have any existing outputs (from a previous system, from manual work, from a beta), those are gold. If you are building from scratch:
- Generate 20 diverse synthetic inputs that cover the breadth of the task
- Run them through your prompt, review each output manually, and annotate whether it is correct
- Identify the failure cases and add 10 more inputs specifically targeting those cases
- Add 5-10 edge cases: shortest possible input, longest possible input, ambiguous inputs, inputs in other languages if relevant
Do not use only inputs where your prompt already works well. The eval set should challenge the prompt, not validate it.
The Annotation Process
For golden outputs, have a human annotate the correct answer before running the model. This prevents you from anchoring to the model's output and calling it correct when it is not.
For rubrics, define the scoring criteria explicitly before you see any model output:
Helpfulness rubric:
5: Directly answers the question with specific, actionable information
4: Answers the question but misses one useful detail
3: Partially answers the question or is too general
2: Tangentially related but does not answer the question
1: Completely wrong or refuses to answer
If two human raters cannot agree on a score within 1 point using your rubric, the rubric is not specific enough. Refine it.
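The "within 1 point" check is easy to compute once two raters have scored the same outputs. A minimal sketch, with made-up scores:

```python
def within_one_agreement(rater_a: list[int], rater_b: list[int]) -> float:
    """Share of items where two raters' scores differ by at most 1 point."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    close = sum(1 for a, b in zip(rater_a, rater_b) if abs(a - b) <= 1)
    return close / len(rater_a)


# Hypothetical scores from two raters on the same six outputs.
rater_a = [5, 4, 3, 2, 5, 1]
rater_b = [4, 4, 1, 2, 5, 3]
agreement = within_one_agreement(rater_a, rater_b)  # 4 of 6 items agree
```

The items where the raters disagree by 2 or more points are the ones to read closely: they usually point at the rubric level whose description is too loose.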
Evaluation Methods
Exact Match
Best for: classification, extraction, tasks with a single correct answer.
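def exact_match_eval(prediction: str, golden: str) -> bool:
    # Normalize whitespace and case so trivial differences do not count as errors.
    return prediction.strip().lower() == golden.strip().lower()

def eval_suite_exact(predictions: list[str], goldens: list[str]) -> float:
    # zip() silently truncates, so mismatched lengths would skew the score.
    if not goldens or len(predictions) != len(goldens):
        raise ValueError("predictions and goldens must be non-empty and the same length")
    correct = sum(
        exact_match_eval(p, g)
        for p, g in zip(predictions, goldens)
    )
    return correct / len(goldens)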
Exact match is strict. For tasks with minor acceptable variations (extra whitespace, different capitalization), add normalization. For tasks where multiple answers are correct (entity extraction might accept different orderings), use set comparison.
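For the set-comparison case, a sketch like the following treats predictions and goldens as unordered sets and also reports precision and recall, which are more informative than a single pass/fail when the model gets most of the entities right. The function names are illustrative.

```python
def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(item.lower().split())


def set_match_eval(predicted: list[str], golden: list[str]) -> dict[str, float]:
    """Order-insensitive comparison with exact set match, precision, and recall."""
    pred = {normalize(p) for p in predicted}
    gold = {normalize(g) for g in golden}
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return {"exact": float(pred == gold), "precision": precision, "recall": recall}


# Same entities in a different order and casing still count as a match.
result = set_match_eval(["Paris", "berlin "], ["Berlin", "Paris"])
partial = set_match_eval(["Paris", "Rome"], ["Paris", "Berlin"])
```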
Fuzzy Match
Best for: generation tasks where the exact wording can vary.
from difflib import SequenceMatcher

def fuzzy_match_eval(prediction: str, golden: str, threshold: float = 0.8) -> bool:
    # Character-level similarity in [0, 1]; 1.0 means identical after lowercasing.
    ratio = SequenceMatcher(None, prediction.lower(), golden.lower()).ratio()
    return ratio >= threshold

# Or use sentence embeddings for semantic similarity
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(prediction: str, golden: str) -> float:
    # Cosine similarity between the two sentence embeddings.
    embeddings = model.encode([prediction, golden])
    return float(util.cos_sim(embeddings[0], embeddings[1]))
Fuzzy match is lenient. Use it as a first pass and manually review cases near the threshold; those are often the interesting failures.
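One way to make "review cases near the threshold" routine is to return a third outcome for borderline scores. This is a sketch only: the `band` width of 0.05 is an arbitrary assumption you would tune, and `triage_fuzzy` is an illustrative name.

```python
from difflib import SequenceMatcher


def triage_fuzzy(prediction: str, golden: str,
                 threshold: float = 0.8, band: float = 0.05) -> str:
    """Return 'pass', 'fail', or 'review' for scores within `band` of the threshold."""
    ratio = SequenceMatcher(None, prediction.lower(), golden.lower()).ratio()
    if abs(ratio - threshold) <= band:
        return "review"  # too close to call automatically
    return "pass" if ratio >= threshold else "fail"


verdicts = [
    triage_fuzzy("hello world", "hello world"),
    triage_fuzzy("abc", "xyz"),
    triage_fuzzy("abcdefghij", "abcdefghxy"),
]
```

Collecting the "review" cases into a queue gives you a short, high-value list to annotate by hand.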
LLM-as-Judge
Best for: open-ended tasks where there is no single correct answer. You use a separate LLM call to score each output.
import json

def llm_judge(
    input_text: str,
    output_text: str,
    criteria: str,
    judge_model: str = "claude-opus-4-5"
) -> dict:
    prompt = f"""You are evaluating the quality of an AI assistant's response.

INPUT:
{input_text}

RESPONSE TO EVALUATE:
{output_text}

EVALUATION CRITERIA:
{criteria}

Score the response on a scale of 1-5 for each criterion.
Return JSON with keys matching each criterion and integer values 1-5.
Also include "reasoning": a one-sentence explanation.
Return ONLY valid JSON, no prose.
"""
    # `llm` stands in for whatever client wrapper you use; it is not a specific library.
    response = llm.complete(judge_model, prompt)
    # json.loads raises ValueError if the judge returns anything other than JSON.
    return json.loads(response.content)

# Example usage
scores = llm_judge(
    input_text="Summarize the key risks of this investment",
    output_text="The main risk is market volatility...",
    criteria="""
- accuracy: Is the content factually accurate?
- completeness: Are all major points covered?
- conciseness: Is the response appropriately brief?
- format: Is the response well-structured?
"""
)
LLM-as-Judge: How to Do It Right
LLM-as-judge is powerful but requires care. A poorly designed judge produces scores that do not correlate with actual quality, which is worse than no eval at all, because it gives you false confidence.
The Judge Prompt Structure
Your judge prompt needs:
- Clear evaluation criteria with definitions
- An explicit scoring scale with anchor descriptions (what does a 3 look like vs a 4?)
- A request for reasoning before the score (not just a number)
- An instruction to evaluate the response on its own merits, not compare it to other responses
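The four requirements above can be folded into a reusable template. The sketch below is one possible shape, not a canonical judge prompt: the template wording, the single-criterion design, and the names `build_judge_prompt` and `parse_judge_reply` are all assumptions for illustration.

```python
import json

JUDGE_TEMPLATE = """You are evaluating one response on a single criterion.
Judge the response on its own merits. Do not compare it to other responses.

CRITERION: {name}
DEFINITION: {definition}

SCALE:
{anchors}

INPUT:
{input_text}

RESPONSE TO EVALUATE:
{output_text}

Write your reasoning first, then give the score.
Return only JSON: {{"reasoning": "<one or two sentences>", "score": <integer 1-5>}}
"""


def build_judge_prompt(name: str, definition: str, anchors: dict[int, str],
                       input_text: str, output_text: str) -> str:
    """Fill the template; anchors map each score to a description of that score."""
    anchor_lines = "\n".join(
        f"{score}: {text}" for score, text in sorted(anchors.items(), reverse=True)
    )
    return JUDGE_TEMPLATE.format(
        name=name, definition=definition, anchors=anchor_lines,
        input_text=input_text, output_text=output_text,
    )


def parse_judge_reply(reply: str) -> dict:
    """Parse the judge's JSON and reject scores outside the 1-5 scale."""
    data = json.loads(reply)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"reasoning": str(data["reasoning"]), "score": score}


prompt = build_judge_prompt(
    name="conciseness",
    definition="The response is as short as it can be without losing key content.",
    anchors={5: "No filler at all", 3: "Some repetition or padding", 1: "Mostly filler"},
    input_text="Summarize the key risks of this investment",
    output_text="The main risk is market volatility...",
)
```

Validating the score range in the parser catches a judge that drifts off the scale, which would otherwise silently distort your averages.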
Position Bias
When showing the judge a response, do not tell it which prompt generated the response. If you are comparing Prompt A vs Prompt B, the judge should only see the input and the output, not which prompt produced it. Order effects exist too: when the judge sees two responses side by side, it may favor whichever one comes first. For A/B comparisons, randomize the order.
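Randomizing the order is straightforward as long as you keep a key for mapping the judge's verdict back to the prompt. A minimal sketch, assuming the judge answers with "first" or "second" (that answer format and the function names are illustrative):

```python
import random


def blind_pair(output_a: str, output_b: str,
               rng: random.Random) -> tuple[list[str], dict[str, str]]:
    """Shuffle two outputs; return them plus a key mapping slot to prompt label."""
    labeled = [("A", output_a), ("B", output_b)]
    rng.shuffle(labeled)
    outputs = [text for _, text in labeled]
    key = {"first": labeled[0][0], "second": labeled[1][0]}
    return outputs, key


def unblind(judge_choice: str, key: dict[str, str]) -> str:
    """Map the judge's 'first'/'second' verdict back to prompt A or B."""
    return key[judge_choice]


rng = random.Random(0)  # seeded so the shuffle is reproducible across eval runs
outputs, key = blind_pair("output from prompt A", "output from prompt B", rng)
```

The judge only ever sees `outputs`; the `key` stays on your side until scoring is done.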
Calibrating the Judge
Before trusting your judge, run it on 20 examples where you also have human annotations. Calculate the correlation between judge scores and human scores. A well-calibrated judge should have correlation > 0.8. If it is lower, your judge prompt needs refinement.
Common causes of poor calibration:
- The judge's criteria are too vague
- The scoring scale is not anchored well enough
- The judge is being asked to evaluate too many criteria at once (use separate calls per criterion)
- The judge model is not strong enough for the task
When Human Evaluation Is Worth It
Use human evaluation for: initial calibration of your LLM judge, pre-launch validation of high-stakes features, and samples from production outputs once per quarter to check for drift. For everything else, the judge is fast and cheap enough to run automatically.
A/B Testing Prompts
You have Prompt A (the current version) and Prompt B (the proposed change). You want to know which is better.
from typing import Callable

from scipy import stats

def ab_test_prompts(
    prompt_a: str,
    prompt_b: str,
    eval_set: list[dict],
    scorer: Callable[..., float],
    model: str
) -> dict:
    scores_a = []
    scores_b = []
    for example in eval_set:
        # run_prompt is your own helper that sends the prompt plus input to the model.
        output_a = run_prompt(prompt_a, example["input"], model)
        output_b = run_prompt(prompt_b, example["input"], model)
        score_a = scorer(example["input"], output_a, example.get("golden"))
        score_b = scorer(example["input"], output_b, example.get("golden"))
        scores_a.append(score_a)
        scores_b.append(score_b)
    avg_a = sum(scores_a) / len(scores_a)
    avg_b = sum(scores_b) / len(scores_b)
    # Paired t-test: both prompts are scored on the same examples.
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    return {
        "prompt_a_score": avg_a,
        "prompt_b_score": avg_b,
        "winner": "B" if avg_b > avg_a else "A",
        "p_value": p_value,
        "significant": p_value < 0.05,
        # Guard against division by zero when prompt A scores 0 on average.
        "relative_improvement": (avg_b - avg_a) / avg_a * 100 if avg_a else None
    }
The p-value tells you how surprising the observed difference would be if the two prompts actually performed the same. A p-value < 0.05 means that a difference at least this large would show up less than 5% of the time by chance alone; it is not the probability that the result is random. With small eval sets (under 30 examples), treat any result as directional evidence rather than a definitive conclusion.
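If you would rather not depend on SciPy, or your scores are far from normally distributed (pass/fail scores, for example), a paired permutation test is one alternative. This is a swapped-in technique, not the t-test used above: it randomly flips the sign of each per-example difference and counts how often the shuffled mean is at least as extreme as the observed one. The function name and defaults are illustrative.

```python
import random


def paired_permutation_p(scores_a: list[float], scores_b: list[float],
                         n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the mean paired difference under random sign flips."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis, each difference is equally likely to be + or -.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


p_same = paired_permutation_p([1, 0, 1], [1, 0, 1])
p_clear = paired_permutation_p([0.0] * 20, [1.0] * 20)
```

When the two prompts score identically the function returns 1.0; when prompt B wins on every one of 20 examples the p-value is far below 0.05.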
Eval Tools in 2026
PromptFoo
PromptFoo is an open-source command-line tool for automated prompt testing. You define prompts and test cases in a YAML config, and it runs them and produces a comparison report.
# promptfooconfig.yaml
prompts:
  - "You are a helpful assistant. {{input}}"
  - "You are a concise assistant. Answer briefly. {{input}}"
# Provider IDs are illustrative; check the promptfoo provider docs for the exact format.
providers:
  - anthropic:claude-opus-4-5
  - openai:gpt-4o
tests:
  - vars:
      input: "Explain how transformers work"
    assert:
      - type: contains
        value: "attention"
      - type: llm-rubric
        value: "Response should be accurate and appropriate for a technical audience"
  - vars:
      input: "What is 2+2?"
    assert:
      - type: equals
        value: "4"
Run with: promptfoo eval
PromptFoo supports LLM-as-judge assertions, exact match, contains, regex, and custom assertion functions. It generates an HTML comparison report showing which prompt won on each test case.
LangSmith
LangSmith is LangChain's tracing and evaluation platform. It captures every LLM call in your application (prompt, response, latency, cost) and lets you build eval datasets from real production traces.
Key features for prompt evaluation:
- Annotate production outputs to build golden datasets from real traffic
- Run automated evaluators (LLM-as-judge) on any dataset
- Track prompt versions and compare metrics across versions
- Set up alerts when metrics drop below a threshold
Custom Eval Loops
For full control, build your own eval loop. This is 50-100 lines of Python and gives you exactly what you need:
import json
from pathlib import Path
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class EvalResult:
    input: str
    output: str
    golden: str | None
    score: float
    reasoning: str | None
    model: str
    prompt_version: str
    timestamp: str

def run_eval(
    prompt_template: str,
    eval_set_path: str,
    scorer,
    model: str,
    prompt_version: str
) -> list[EvalResult]:
    # The eval set is a JSON array of objects with "input" and optional "golden" keys.
    eval_set = json.loads(Path(eval_set_path).read_text())
    results = []
    for example in eval_set:
        prompt = prompt_template.format(**example)
        # call_llm is your own wrapper around the model API.
        output = call_llm(model, prompt)
        score_result = scorer(
            input=example["input"],
            output=output,
            golden=example.get("golden")
        )
        results.append(EvalResult(
            input=example["input"],
            output=output,
            golden=example.get("golden"),
            score=score_result["score"],
            reasoning=score_result.get("reasoning"),
            model=model,
            prompt_version=prompt_version,
            # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time.
            timestamp=datetime.now(timezone.utc).isoformat()
        ))
    return results

def summarize_eval(results: list[EvalResult]) -> dict:
    # Assumes a non-empty result list and scores normalized to the 0-1 range.
    scores = [r.score for r in results]
    return {
        "n": len(results),
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "pass_rate": sum(1 for s in scores if s >= 0.8) / len(scores),
        "failures": [
            asdict(r) for r in results if r.score < 0.8
        ]
    }
Regression Testing Prompts
Model updates break prompts. Anthropic and OpenAI both release new model versions regularly, and a model alias that tracks the latest version can change behavior underneath you. Even "minor" updates can shift behavior in ways that affect your carefully tuned prompts. The only way to catch this before your users do is automated regression testing.
Set up a CI job (GitHub Actions, any CI system) that:
- Runs your eval suite against the current prod model on every deploy
- Compares the score against the previous baseline
- Fails the deploy if the score drops more than 5% (or whatever threshold makes sense for your use case)
- Stores results in a time-series database so you can track trends
# .github/workflows/prompt-regression.yml
name: Prompt Regression Test
on:
  push:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
  schedule:
    - cron: '0 9 * * 1'  # Weekly on Monday morning (09:00 UTC)
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run prompt evals
        run: python evals/run_regression.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Check regression
        run: python evals/check_regression.py --threshold 0.05
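The workflow above calls a check script without showing it. The following is a hypothetical sketch of what its core could look like, assuming that the baseline and current results are stored as JSON files with a `mean_score` key and that the threshold is a relative drop. The file layout and function names are assumptions, not part of any tool.

```python
import json
from pathlib import Path


def has_regressed(baseline: float, current: float, threshold: float = 0.05) -> bool:
    """True when the score dropped by more than `threshold` relative to baseline."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - current) / baseline > threshold


def main(baseline_path: str, current_path: str, threshold: float) -> int:
    """Compare two result files; return a process exit code (1 fails the CI job)."""
    baseline = json.loads(Path(baseline_path).read_text())["mean_score"]
    current = json.loads(Path(current_path).read_text())["mean_score"]
    if has_regressed(baseline, current, threshold):
        print(f"Regression: {baseline:.3f} -> {current:.3f}")
        return 1
    print(f"OK: {baseline:.3f} -> {current:.3f}")
    return 0
```

In a real script you would parse `--threshold` with `argparse` and pass the return value of `main` to `sys.exit`, since a non-zero exit code is what makes the CI step fail.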
When a model update breaks your prompt, you have three options:
- Patch the prompt: add examples, clarify ambiguous instructions, adjust the format specification
- Pin the model version: use an older model version until you can patch (many APIs support this)
- Accept the regression: if the new behavior is actually an improvement in aggregate, update your evals to reflect the new baseline
Prompt Metrics That Matter
Track these four metrics for every production prompt:
Task completion rate: What percentage of inputs produce a usable output (not a refusal, not a format error, not a timeout)? This is your top-line metric.
Format adherence rate: If you asked for JSON, what percentage of outputs are valid, parseable JSON? If you asked for a bulleted list, what percentage actually contain bullets? This is often where regression appears first.
Refusal rate: How often does the model say "I cannot help with this" or "I am sorry, but..."? An increase in refusal rate usually means a model update changed the safety thresholds or you changed the content in a way that triggers filters.
Cost per successful completion: Total token cost divided by number of successful outputs. This helps you identify when a prompt is generating unnecessarily long outputs or requiring excessive retries.
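All four metrics fall out of a simple per-call log. This sketch assumes you record, for each call, whether the output was usable, whether it matched the requested format, whether it was a refusal, and what it cost. The `CallRecord` fields and the sample values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One production call; field names are illustrative."""
    usable: bool
    format_ok: bool
    refused: bool
    cost_usd: float


def prompt_metrics(records: list[CallRecord]) -> dict[str, float]:
    """Compute the four tracking metrics over a batch of call records."""
    n = len(records)
    if n == 0:
        raise ValueError("no records")
    successes = sum(r.usable for r in records)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "task_completion_rate": successes / n,
        "format_adherence_rate": sum(r.format_ok for r in records) / n,
        "refusal_rate": sum(r.refused for r in records) / n,
        # Cost of failed calls is included, so retries and refusals raise this number.
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }


records = [
    CallRecord(usable=True, format_ok=True, refused=False, cost_usd=0.02),
    CallRecord(usable=True, format_ok=True, refused=False, cost_usd=0.02),
    CallRecord(usable=False, format_ok=False, refused=False, cost_usd=0.03),
    CallRecord(usable=False, format_ok=True, refused=True, cost_usd=0.01),
]
metrics = prompt_metrics(records)
```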
A Practical Eval Workflow From Scratch
Here is the sequence to follow when building evals for a new prompt:
Week 1: Build the eval set
- Collect 30 real or realistic inputs
- Annotate 20 with golden outputs, use criteria for the other 10
- Run the prompt on all 30, manually review every output
- Add 10 more inputs targeting the failure patterns you found
Week 2: Automate evaluation
- Implement exact match or LLM-as-judge based on task type
- Run all 40 examples through automated eval
- Check correlation between your manual scores and automated scores
- Adjust the judge prompt or scoring criteria until correlation is > 0.8
Week 3: A/B test and regression guard
- Run your existing prompt (A) vs any proposed improvements (B) on the full eval set
- Pick the winner based on automated eval scores + manual spot-check of failures
- Set up the regression test to run weekly or on every deploy
- Document the baseline score and what constitutes a regression
Ongoing: Expand with production data
- Every week, sample 10 real production inputs and add them to the eval set with annotations
- Rerun evals after any model version change
- Track metrics monthly and investigate any downward trends
Common Mistakes in Prompt Evaluation
Evaluating on your training data. If you used 10 examples to develop and tune the prompt, do not also use those 10 examples to evaluate it. You will overestimate accuracy. Hold out a separate test set.
Only evaluating the happy path. Prompts that work on easy inputs often fail on edge cases. Make sure at least 20% of your eval set is adversarial or unusual inputs.
Trusting a judge you have not calibrated. An uncalibrated LLM judge can produce scores that are inversely correlated with human judgement. Always calibrate on at least 20 human-annotated examples before trusting judge scores.
Optimizing for eval score instead of real quality. If you tune your prompt specifically to score well on your eval set, you are overfitting to the eval. Maintain a separate "held-out" test set that you only use for final validation, not for iterative tuning.
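A held-out set only helps if examples never leak between the two groups. One way to guarantee that, sketched here under the assumption that every example has a stable ID, is to assign the split by hashing the ID, so the assignment stays the same as the eval set grows. The 20% fraction and the function name are illustrative.

```python
import hashlib


def assign_split(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically assign an example to 'dev' or 'holdout' by hashing its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "dev"


ids = [f"example-{i}" for i in range(1000)]
splits = [assign_split(i) for i in ids]
holdout_share = splits.count("holdout") / len(splits)
```

Tune prompts against the "dev" examples only, and run the "holdout" examples for final validation.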