What is an eval set for prompt engineering?

An eval set is a curated collection of test inputs paired with expected outputs (called golden outputs) or evaluation criteria. Just like a unit test suite for code, an eval set lets you objectively measure whether your prompt is producing correct results — and whether a change improved or broke things.

How many examples do I need in my eval set?

Start with 50. That is enough to spot obvious problems, run meaningful comparisons, and see statistical trends. If your task has important subgroups (different tones, different document types, different languages), make sure each subgroup has at least 10 examples. For critical production features, grow to 200+ over time.

What is LLM-as-judge and when should I use it?

LLM-as-judge means using a separate language model call to score your primary model's output. Use it when exact-match evaluation is too strict (there are many correct ways to phrase a good answer) but human evaluation is too slow or expensive. It works well for open-ended generation tasks like summarization, rewriting, and Q&A. Always calibrate your judge against human annotations before relying on it.

How do I run an A/B test between two prompts?

Run both prompts against the same eval set of inputs. For each input, score both outputs using your evaluation method (exact match, fuzzy match, or LLM-as-judge). Calculate the win rate for each prompt. Because both prompts are scored on the same inputs, use a paired statistical significance test (McNemar's test for binary metrics, a paired t-test for continuous scores) to confirm the difference is real and not noise.

What happens to my prompts when a model gets updated?

Model updates often change behavior in subtle ways — a phrase that worked perfectly on gpt-4o may produce worse results on the next version. This is why regression testing matters: run your eval suite automatically whenever you change the model or model version. If a model update breaks your prompt, you will know immediately instead of discovering it from user complaints.

What prompt metrics should I track in production?

The four most important: task completion rate (did the model produce a usable output?), format adherence rate (did the output match the expected format?), refusal rate (how often did the model decline to answer?), and cost per successful completion (total token cost, including failed attempts and retries, divided by the number of tasks that produced a valid result). Track these over time and alert on regressions.

Is human evaluation worth the cost?

Yes, for calibration and for high-stakes launches. Use human evaluation to calibrate your LLM judge (do human scores correlate with judge scores?) and before shipping any feature where errors have real consequences. For ongoing monitoring, LLM-as-judge is usually fast and cheap enough to run continuously.

Evaluating Prompts: How to Measure Prompt Quality in 2026 | explainx.ai Blog

Most prompt engineers operate on vibes. They write a prompt, try it on five inputs, decide it looks good, and ship it. Then they discover at 3am that it breaks on a class of inputs they never tested, or that a model update quietly changed its behavior, or that version B is actually worse than version A for a reason they cannot explain.

Prompt evaluation — building a systematic way to measure whether your prompts are working — is what separates professional prompt engineering from guesswork. This guide shows you how to do it from scratch.

The Eval Mindset

Here is the mental model shift: treat prompts like code and evaluations like tests.

You would not ship a function without verifying it works on edge cases. You would not declare a refactor "done" without running the test suite. Prompts deserve the same discipline.

A prompt is a specification for model behavior. An eval set tests whether the model meets that specification. A regression suite guards against specification drift when models update. This is not extra work — it is the work that turns prompt engineering from a guess into an engineering discipline.

The alternative is superstition: you believe your prompt works because it worked the last three times you tried it. That belief will eventually be wrong, and you will not know until users tell you.

The Three Levels of Prompt Quality

Before you can measure quality, you need to know what you are measuring. There are three distinct dimensions:

Correctness: Does the prompt produce the right output? This is about accuracy — is the extracted entity correct, is the summary factually accurate, is the classification label right? Correctness is evaluated per-example against a known ground truth.

Consistency: Does the prompt produce consistent output across runs? LLMs are stochastic. The same input can produce different outputs. Consistency measures how much variance there is — both across multiple runs of the same input and across semantically similar inputs that should produce similar outputs.
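
One way to put a number on run-to-run consistency is to call the prompt several times on the same input and measure how often the runs agree with the most common answer. The sketch below assumes a hypothetical `generate` callable that wraps your own model call; it is not part of any library, and the normalization step is only one reasonable choice.

```python
from collections import Counter
from typing import Callable


def consistency_rate(generate: Callable[[str], str], text: str, runs: int = 5) -> float:
    """Fraction of runs that match the most common (modal) output for one input."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Light normalization so whitespace and casing do not count as disagreement.
    outputs = [generate(text).strip().lower() for _ in range(runs)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / runs
```

A rate of 1.0 means every run produced the same normalized output. This works best for short outputs such as labels; for free-form text, a similarity measure would be needed instead of string equality.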

Efficiency: Does the prompt use the minimum tokens needed? A prompt that achieves the same accuracy in 500 tokens as another does in 2,000 tokens is strictly better in production. This matters at scale — token costs add up.
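
To make the scale effect concrete, here is a small arithmetic sketch. The price and call volume are made-up illustration values, not real pricing for any provider.

```python
def monthly_prompt_cost(prompt_tokens: int, calls_per_month: int, usd_per_million_tokens: float) -> float:
    """Input-token cost per month for a prompt of a given length."""
    return prompt_tokens * calls_per_month * usd_per_million_tokens / 1_000_000


# Hypothetical numbers: 1,000,000 calls per month at $3 per million input tokens.
short_prompt = monthly_prompt_cost(500, 1_000_000, 3.0)
long_prompt = monthly_prompt_cost(2_000, 1_000_000, 3.0)
print(f"500-token prompt:  ${short_prompt:,.0f}/month")
print(f"2000-token prompt: ${long_prompt:,.0f}/month")
```

Under these assumed numbers the longer prompt costs four times as much for the same accuracy.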

Most evaluation frameworks focus on correctness. Do not neglect the other two, especially for production systems where cost and consistency directly affect user experience.

Building an Eval Set

What Makes a Good Test Case

A test case has two parts: an input (what you send to the model) and a reference (how you judge the output). The reference can be:

Golden output: An exact expected output, human-annotated ("The correct answer is X")
Criteria: A set of conditions the output must meet ("Must contain a list, must be under 100 words, must not recommend a specific product")
Rubric: A scoring guide ("Score 1-5 on helpfulness, accuracy, and conciseness")
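
Criteria-style references can often be checked in plain code. The following is a minimal sketch of the three example criteria above; the function names, the list-detection regex, and the `banned_terms` parameter are illustrative assumptions, and real criteria will usually need task-specific checks.

```python
import re


def check_criteria(
    output: str,
    max_words: int = 100,
    banned_terms: tuple[str, ...] = (),
) -> dict[str, bool]:
    """Evaluate an output against three example criteria and report each result."""
    # A line counts as a list item if it starts with "-", "*", or a "1." style marker.
    has_list = any(
        re.match(r"^\s*(?:[-*]|\d+\.)\s+\S", line) for line in output.splitlines()
    )
    under_limit = len(output.split()) < max_words
    lowered = output.lower()
    no_banned = not any(term.lower() in lowered for term in banned_terms)
    return {
        "contains_list": has_list,
        "under_word_limit": under_limit,
        "no_banned_terms": no_banned,
    }


def passes(output: str, **kwargs) -> bool:
    """True only when every criterion is met."""
    return all(check_criteria(output, **kwargs).values())
```

Returning one boolean per criterion, rather than a single pass/fail, makes it easier to see which condition a failing output violated.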

Which reference type to use depends on your task:

Task Type	Reference Type
Classification	Golden label (exact match)
Extraction (structured)	Golden JSON output
Summarization	Criteria or rubric
Q&A with factual answers	Golden answer
Open-ended generation	Rubric
Code generation	Test suite (run the code)

How Many Examples You Need

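50 is the minimum for a meaningful eval. Here is why: a measured accuracy is only an estimate, and its uncertainty shrinks as the number of examples grows. At 80% accuracy on 50 examples, the 95% confidence interval runs from roughly 67% to 89%, so only large differences (on the order of 10 points or more) have a chance of standing out, and a 5-point gap such as 80% vs 85% or 75% does not. With 10 examples, the interval spans roughly 49% to 94%, and almost any difference is noise.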
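
To see how eval set size affects certainty, you can compute a confidence interval around a measured accuracy. This sketch uses the Wilson score interval, which is one standard choice for proportions; the function name is an assumption for illustration.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy of correct / n."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for correct, n in [(8, 10), (40, 50), (160, 200)]:
    low, high = wilson_interval(correct, n)
    print(f"{correct}/{n}: {low:.1%} to {high:.1%}")
```

All three cases have the same 80% point estimate, but the interval narrows as the eval set grows.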

Practical sizing guidelines:

Prototype: 20–30 examples covering happy path and obvious edge cases
Pre-launch: 50–100 examples covering all input subgroups
Production monitoring: 200+ examples, built up from real production inputs over time

Building the Initial Eval Set

Start by collecting real examples. If you have any existing outputs (from a previous system, from manual work, from a beta), those are gold. If you are building from scratch:

Generate 20 diverse synthetic inputs that cover the breadth of the task
Run them through your prompt, review each output manually, and annotate whether it is correct
Identify the failure cases and add 10 more inputs specifically targeting those cases
Add 5–10 edge cases: shortest possible input, longest possible input, ambiguous inputs, inputs in other languages if relevant

Do not use only inputs where your prompt already works well. The eval set should challenge the prompt, not validate it.

The Annotation Process

For golden outputs, have a human annotate the correct answer before running the model. This prevents you from anchoring to the model's output and calling it correct when it is not.

For rubrics, define the scoring criteria explicitly before you see any model output:

Helpfulness rubric:
5 — Directly answers the question with specific, actionable information
4 — Answers the question but misses one useful detail
3 — Partially answers the question or is too general
2 — Tangentially related but does not answer the question
1 — Completely wrong or refuses to answer

If two human raters cannot agree on a score within 1 point using your rubric, the rubric is not specific enough. Refine it.
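
The "within 1 point" check can be automated once two raters have scored the same outputs. A minimal sketch, with hypothetical function names:

```python
def within_one_agreement(rater_a: list[int], rater_b: list[int]) -> float:
    """Share of examples where the two raters differ by at most 1 point."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of examples")
    close = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b))
    return close / len(rater_a)


def disagreements(rater_a: list[int], rater_b: list[int]) -> list[int]:
    """Indexes of examples where the raters differ by more than 1 point."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if abs(a - b) > 1]
```

The indexes returned by `disagreements` are the examples worth discussing when refining the rubric.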

Evaluation Methods

Exact Match

Best for: classification, extraction, tasks with a single correct answer.

def exact_match_eval(prediction: str, golden: str) -> bool:
    return prediction.strip().lower() == golden.strip().lower()

def eval_suite_exact(predictions: list[str], goldens: list[str]) -> float:
    correct = sum(
        exact_match_eval(p, g) 
        for p, g in zip(predictions, goldens)
    )
    return correct / len(goldens)

Exact match is strict. For tasks with minor acceptable variations (extra whitespace, different capitalization), add normalization. For tasks where multiple answers are correct (entity extraction might accept different orderings), use set comparison.
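
A minimal sketch of the set comparison mentioned above, for extraction tasks where order does not matter. The helper names are illustrative, and the F1 variant is an added assumption for cases where partial credit is useful.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def set_match_eval(predicted: list[str], golden: list[str]) -> bool:
    """True when both lists contain the same items, ignoring order and duplicates."""
    return {normalize(p) for p in predicted} == {normalize(g) for g in golden}


def set_f1(predicted: list[str], golden: list[str]) -> float:
    """Partial-credit score: harmonic mean of precision and recall over items."""
    pred = {normalize(p) for p in predicted}
    gold = {normalize(g) for g in golden}
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```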

Fuzzy Match

Best for: generation tasks where the exact wording can vary.

from difflib import SequenceMatcher

def fuzzy_match_eval(prediction: str, golden: str, threshold: float = 0.8) -> bool:
    ratio = SequenceMatcher(None, prediction.lower(), golden.lower()).ratio()
    return ratio >= threshold

# Or use sentence embeddings for semantic similarity
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(prediction: str, golden: str) -> float:
    embeddings = model.encode([prediction, golden])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

Fuzzy match is lenient. Use it as a first pass and manually review cases near the threshold — those are often the interesting failures.
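
To find the cases near the threshold for manual review, a small helper like the following can flag them. The `margin` parameter and the function name are assumptions for illustration.

```python
from difflib import SequenceMatcher


def borderline_cases(
    pairs: list[tuple[str, str]],
    threshold: float = 0.8,
    margin: float = 0.1,
) -> list[tuple[int, float]]:
    """Return (index, ratio) for prediction/golden pairs scoring near the threshold."""
    flagged = []
    for i, (prediction, golden) in enumerate(pairs):
        ratio = SequenceMatcher(None, prediction.lower(), golden.lower()).ratio()
        if abs(ratio - threshold) <= margin:
            flagged.append((i, round(ratio, 3)))
    return flagged
```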

LLM-as-Judge

Best for: open-ended tasks where there is no single correct answer. You use a separate LLM call to score each output.

import json

# `llm` is a placeholder for your own model client wrapper; it is not defined here.
def llm_judge(
    input_text: str,
    output_text: str,
    criteria: str,
    judge_model: str = "claude-opus-4-5"
) -> dict:
    prompt = f"""You are evaluating the quality of an AI assistant's response.

INPUT:
{input_text}

RESPONSE TO EVALUATE:
{output_text}

EVALUATION CRITERIA:
{criteria}

Score the response on a scale of 1-5 for each criterion.
Return JSON with keys matching each criterion and integer values 1-5.
Also include "reasoning": a one-sentence explanation.

Return ONLY valid JSON, no prose.
"""
    
    response = llm.complete(judge_model, prompt)
    return json.loads(response.content)

# Example usage
scores = llm_judge(
    input_text="Summarize the key risks of this investment",
    output_text="The main risk is market volatility...",
    criteria="""
    - accuracy: Is the content factually accurate?
    - completeness: Are all major points covered?
    - conciseness: Is the response appropriately brief?
    - format: Is the response well-structured?
    """
)
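
In practice, a judge model sometimes wraps its JSON in extra text, or returns scores outside the requested range, and a bare `json.loads` call then fails or silently accepts bad data. The sketch below shows one defensive way to parse the reply; `parse_judge_json` is a hypothetical helper, not part of any library.

```python
import json
import re


def parse_judge_json(raw: str, criteria: list[str]) -> dict:
    """Extract the JSON object from a judge reply and validate each 1-5 score."""
    # Take everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(match.group(0))
    for name in criteria:
        score = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"criterion {name!r} has invalid score: {score!r}")
    return data
```

Raising on invalid replies, rather than defaulting to a score, keeps malformed judge output from quietly skewing averages.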

LLM-as-Judge: How to Do It Right

LLM-as-judge is powerful but requires care. A poorly designed judge produces scores that do not correlate with actual quality — which is worse than no eval at all, because it gives you false confidence.

The Judge Prompt Structure

Your judge prompt needs:

Clear evaluation criteria with definitions
An explicit scoring scale with anchor descriptions (what does a 3 look like vs a 4?)
A request for reasoning before the score (not just a number)
An instruction to evaluate the response on its own merits, not compare it to other responses

Position Bias

When showing the judge a response, do not tell it which prompt generated the response. If you are comparing Prompt A vs Prompt B, the judge should only see the input and the output — not which prompt produced it. Order effects exist too: the judge may favor whichever response it sees first. For A/B comparisons, randomize the order.
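
Order randomization for a pairwise comparison can be sketched as follows. The `judge` callable is a hypothetical wrapper around your judge model that returns 1 if the first response shown is better, 2 if the second is better, and 0 for a tie; it never sees which prompt produced which response.

```python
import random
from typing import Callable


def blind_pairwise(
    input_text: str,
    output_a: str,
    output_b: str,
    judge: Callable[[str, str, str], int],
    rng: random.Random,
) -> str:
    """Return "A", "B", or "tie" after showing the two outputs in random order."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)
    verdict = judge(input_text, first, second)
    if verdict == 0:
        return "tie"
    picked_first = verdict == 1
    # Map the positional verdict back to the original prompt label.
    if picked_first:
        return "B" if swapped else "A"
    return "A" if swapped else "B"
```

Passing in a seeded `random.Random` makes the shuffling reproducible across eval runs.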

Calibrating the Judge

Before trusting your judge, run it on 20 examples where you also have human annotations. Calculate the correlation between judge scores and human scores. A well-calibrated judge should have correlation > 0.8. If it is lower, your judge prompt needs refinement.

Common causes of poor calibration:

The judge's criteria are too vague
The scoring scale is not anchored well enough
The judge is being asked to evaluate too many criteria at once (use separate calls per criterion)
The judge model is not strong enough for the task

When Human Evaluation Is Worth It

Use human evaluation for: initial calibration of your LLM judge, pre-launch validation of high-stakes features, and samples from production outputs once per quarter to check for drift. For everything else, the judge is fast and cheap enough to run automatically.

A/B Testing Prompts

You have Prompt A (the current version) and Prompt B (the proposed change). You want to know which is better.

from typing import Callable

from scipy import stats

def ab_test_prompts(
    prompt_a: str,
    prompt_b: str,
    eval_set: list[dict],
    scorer: Callable[..., float],
    model: str
) -> dict:
    scores_a = []
    scores_b = []
    
    for example in eval_set:
        # run_prompt is your own helper that sends a prompt plus input to the model
        output_a = run_prompt(prompt_a, example["input"], model)
        output_b = run_prompt(prompt_b, example["input"], model)
        
        score_a = scorer(example["input"], output_a, example.get("golden"))
        score_b = scorer(example["input"], output_b, example.get("golden"))
        
        scores_a.append(score_a)
        scores_b.append(score_b)
    
    avg_a = sum(scores_a) / len(scores_a)
    avg_b = sum(scores_b) / len(scores_b)
    
    # Paired t-test: both prompts were scored on the same inputs
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    significant = p_value < 0.05
    
    return {
        "prompt_a_score": avg_a,
        "prompt_b_score": avg_b,
        # Only name a winner when the difference is statistically significant
        "winner": ("B" if avg_b > avg_a else "A") if significant else "inconclusive",
        "p_value": p_value,
        "significant": significant,
        # Avoid dividing by zero when prompt A averages 0
        "relative_improvement": (avg_b - avg_a) / avg_a * 100 if avg_a else None
    }

The p-value tells you how surprising the observed difference would be if the two prompts were actually equivalent. A p-value < 0.05 means that a difference at least this large would show up less than 5% of the time by chance alone. With small eval sets (under 30 examples), treat any result as directional evidence rather than a definitive conclusion.
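
The paired t-test suits continuous scores. For pass/fail metrics scored on the same inputs, an exact McNemar test is one common alternative: it looks only at the inputs where the two prompts disagree. The sketch below is a standard-library version; the function name is an assumption.

```python
from math import comb


def mcnemar_exact(passed_a: list[bool], passed_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail results."""
    if len(passed_a) != len(passed_b):
        raise ValueError("both prompts must be scored on the same inputs")
    # Discordant pairs: inputs where exactly one of the two prompts passed.
    only_a = sum(a and not b for a, b in zip(passed_a, passed_b))
    only_b = sum(b and not a for a, b in zip(passed_a, passed_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the prompts never disagree, so there is no evidence either way
    k = min(only_a, only_b)
    # Binomial tail probability under the null hypothesis of a 50/50 split.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

If prompt B fixes 10 inputs that prompt A failed and breaks none, the p-value is about 0.002; if each prompt wins on one input the other loses, it is 1.0.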

Eval Tools in 2026

PromptFoo

PromptFoo is an open-source command-line tool for automated prompt testing. You define prompts and test cases in a YAML config, and it runs them and produces a comparison report.

# promptfooconfig.yaml
prompts:
  - "You are a helpful assistant. {{input}}"
  - "You are a concise assistant. Answer briefly. {{input}}"

providers:
  - anthropic:claude-opus-4-5
  - openai:gpt-4o

tests:
  - vars:
      input: "Explain how transformers work"
    assert:
      - type: contains
        value: "attention"
      - type: llm-rubric
        value: "Response should be accurate and appropriate for a technical audience"
  
  - vars:
      input: "What is 2+2?"
    assert:
      - type: equals
        value: "4"

Run with: promptfoo eval

PromptFoo supports LLM-as-judge assertions, exact match, contains, regex, and custom assertion functions. It generates an HTML comparison report showing which prompt won on each test case.

LangSmith

LangSmith is LangChain's tracing and evaluation platform. It captures every LLM call in your application (prompt, response, latency, cost) and lets you build eval datasets from real production traces.

Key features for prompt evaluation:

Annotate production outputs to build golden datasets from real traffic
Run automated evaluators (LLM-as-judge) on any dataset
Track prompt versions and compare metrics across versions
Set up alerts when metrics drop below a threshold

Custom Eval Loops

For full control, build your own eval loop. This is 50–100 lines of Python and gives you exactly what you need:

import json
from pathlib import Path
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class EvalResult:
    input: str
    output: str
    golden: str | None
    score: float
    reasoning: str | None
    model: str
    prompt_version: str
    timestamp: str

def run_eval(
    prompt_template: str,
    eval_set_path: str,
    scorer,
    model: str,
    prompt_version: str
) -> list[EvalResult]:
    # The eval set is a JSON array of objects with "input" and optional "golden" keys
    eval_set = json.loads(Path(eval_set_path).read_text(encoding="utf-8"))
    results = []
    
    for example in eval_set:
        # Placeholders such as {input} in the template are filled from the example's keys
        prompt = prompt_template.format(**example)
        # call_llm is your own helper that returns the model's text output
        output = call_llm(model, prompt)
        
        # The scorer returns a dict with "score" and, optionally, "reasoning"
        score_result = scorer(
            input=example["input"],
            output=output,
            golden=example.get("golden")
        )
        
        results.append(EvalResult(
            input=example["input"],
            output=output,
            golden=example.get("golden"),
            score=score_result["score"],
            reasoning=score_result.get("reasoning"),
            model=model,
            prompt_version=prompt_version,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            timestamp=datetime.now(timezone.utc).isoformat()
        ))
    
    return results

def summarize_eval(results: list[EvalResult]) -> dict:
    scores = [r.score for r in results]
    return {
        "n": len(results),
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "pass_rate": sum(1 for s in scores if s >= 0.8) / len(scores),
        "failures": [
            asdict(r) for r in results if r.score < 0.8
        ]
    }

Regression Testing Prompts

Model updates break prompts. Anthropic and OpenAI both update their models continuously, and even "minor" updates can shift behavior in ways that affect your carefully tuned prompts. The only way to catch this before your users do is automated regression testing.

Set up a CI job (GitHub Actions, any CI system) that:

Runs your eval suite against the current prod model on every deploy
Compares the score against the previous baseline
Fails the deploy if the score drops more than 5% (or whatever threshold makes sense for your use case)
Stores results in a time-series database so you can track trends

# .github/workflows/prompt-regression.yml
name: Prompt Regression Test

on:
  push:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
  schedule:
    - cron: '0 9 * * 1'  # Weekly on Monday morning

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run prompt evals
        run: python evals/run_regression.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Check regression
        run: python evals/check_regression.py --threshold 0.05

When a model update breaks your prompt, you have three options:

Patch the prompt — add examples, clarify ambiguous instructions, adjust the format specification
Pin the model version — use an older model version until you can patch (many APIs support this)
Accept the regression — if the new behavior is actually an improvement in aggregate, update your evals to reflect the new baseline

Prompt Metrics That Matter

Track these four metrics for every production prompt:

Task completion rate: What percentage of inputs produce a usable output (not a refusal, not a format error, not a timeout)? This is your top-line metric.

Format adherence rate: If you asked for JSON, what percentage of outputs are valid, parseable JSON? If you asked for a bulleted list, what percentage actually contain bullets? This is often where regression appears first.

Refusal rate: How often does the model say "I cannot help with this" or "I am sorry, but..."? An increase in refusal rate usually means a model update changed the safety thresholds or you changed the content in a way that triggers filters.

Cost per successful completion: Total token cost divided by number of successful outputs. This helps you identify when a prompt is generating unnecessarily long outputs or requiring excessive retries.

A Practical Eval Workflow From Scratch

Here is the sequence to follow when building evals for a new prompt:

Week 1: Build the eval set

Collect 30 real or realistic inputs
Annotate 20 with golden outputs, use criteria for the other 10
Run the prompt on all 30, manually review every output
Add 10 more inputs targeting the failure patterns you found

Week 2: Automate evaluation

Implement exact match or LLM-as-judge based on task type
Run all 40 examples through automated eval
Check correlation between your manual scores and automated scores
Adjust the judge prompt or scoring criteria until correlation is > 0.8

Week 3: A/B test and regression guard

Run your existing prompt (A) vs any proposed improvements (B) on the full eval set
Pick the winner based on automated eval scores + manual spot-check of failures
Set up the regression test to run weekly or on every deploy
Document the baseline score and what constitutes a regression

Ongoing: Expand with production data

Every week, sample 10 real production inputs and add them to the eval set with annotations
Rerun evals after any model version change
Track metrics monthly and investigate any downward trends

Common Mistakes in Prompt Evaluation

Evaluating on your training data. If you used 10 examples to develop and tune the prompt, do not also use those 10 examples to evaluate it. You will overestimate accuracy. Hold out a separate test set.

Only evaluating the happy path. Prompts that work on easy inputs often fail on edge cases. Make sure at least 20% of your eval set is adversarial or unusual inputs.

Trusting a judge you have not calibrated. An uncalibrated LLM judge can produce scores that are inversely correlated with human judgement. Always calibrate on at least 20 human-annotated examples before trusting judge scores.

Optimizing for eval score instead of real quality. If you tune your prompt specifically to score well on your eval set, you are overfitting to the eval. Maintain a separate "held-out" test set that you only use for final validation, not for iterative tuning.
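
One way to keep a held-out set stable as the eval set grows is to assign each example to a split by hashing its ID, so an example never moves between splits. This is a sketch under the assumption that every example has a stable string ID; the function name and the 20% default are illustrative.

```python
import hashlib


def assign_split(example_id: str, held_out_fraction: float = 0.2) -> str:
    """Deterministically assign an example to "dev" or "held_out" by hashing its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "dev"
```

Tune prompts against the "dev" examples only, and run the "held_out" examples for final validation.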

The Eval Mindset

Here is the mental model shift: treat prompts like code and evaluations like tests.

You would not ship a function without verifying it works on edge cases. You would not declare a refactor "done" without running the test suite. Prompts deserve the same discipline.

The alternative is superstition: you believe your prompt works because it worked the last three times you tried it. That belief will eventually be wrong, and you will not know until users tell you.

The Three Levels of Prompt Quality

Before you can measure quality, you need to know what you are measuring. There are three distinct dimensions:

Most evaluation frameworks focus on correctness. Do not neglect the other two, especially for production systems where cost and consistency directly affect user experience.

Building an Eval Set

What Makes a Good Test Case

A test case has two parts: an input (what you send to the model) and a reference (how you judge the output). The reference can be:

Golden output: An exact expected output, human-annotated ("The correct answer is X")
Criteria: A set of conditions the output must meet ("Must contain a list, must be under 100 words, must not recommend a specific product")
Rubric: A scoring guide ("Score 1-5 on helpfulness, accuracy, and conciseness")

Which reference type to use depends on your task:

Task Type	Reference Type
Classification	Golden label (exact match)
Extraction (structured)	Golden JSON output
Summarization	Criteria or rubric
Q&A with factual answers	Golden answer
Open-ended generation	Rubric
Code generation	Test suite (run the code)

How Many Examples You Need

Practical sizing guidelines:

Prototype: 20–30 examples covering happy path and obvious edge cases
Pre-launch: 50–100 examples covering all input subgroups
Production monitoring: 200+ examples, built up from real production inputs over time

Building the Initial Eval Set

Start by collecting real examples. If you have any existing outputs (from a previous system, from manual work, from a beta), those are gold. If you are building from scratch:

Generate 20 diverse synthetic inputs that cover the breadth of the task
Run them through your prompt, review each output manually, and annotate whether it is correct
Identify the failure cases and add 10 more inputs specifically targeting those cases
Add 5–10 edge cases: shortest possible input, longest possible input, ambiguous inputs, inputs in other languages if relevant

Do not use only inputs where your prompt already works well. The eval set should challenge the prompt, not validate it.

The Annotation Process

For golden outputs, have a human annotate the correct answer before running the model. This prevents you from anchoring to the model's output and calling it correct when it is not.

For rubrics, define the scoring criteria explicitly before you see any model output:

Helpfulness rubric:
5 — Directly answers the question with specific, actionable information
4 — Answers the question but misses one useful detail
3 — Partially answers the question or is too general
2 — Tangentially related but does not answer the question
1 — Completely wrong or refuses to answer

If two human raters cannot agree on a score within 1 point using your rubric, the rubric is not specific enough. Refine it.

Evaluation Methods

Exact Match

Best for: classification, extraction, tasks with a single correct answer.

def exact_match_eval(prediction: str, golden: str) -> bool:
    return prediction.strip().lower() == golden.strip().lower()

def eval_suite_exact(predictions: list[str], goldens: list[str]) -> float:
    correct = sum(
        exact_match_eval(p, g) 
        for p, g in zip(predictions, goldens)
    )
    return correct / len(goldens)

Fuzzy Match

Best for: generation tasks where the exact wording can vary.

from difflib import SequenceMatcher

def fuzzy_match_eval(prediction: str, golden: str, threshold: float = 0.8) -> bool:
    ratio = SequenceMatcher(None, prediction.lower(), golden.lower()).ratio()
    return ratio >= threshold

# Or use sentence embeddings for semantic similarity
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity(prediction: str, golden: str) -> float:
    embeddings = model.encode([prediction, golden])
    return float(util.cos_sim(embeddings[0], embeddings[1]))

Fuzzy match is lenient. Use it as a first pass and manually review cases near the threshold — those are often the interesting failures.

LLM-as-Judge

Best for: open-ended tasks where there is no single correct answer. You use a separate LLM call to score each output.

def llm_judge(
    input_text: str,
    output_text: str,
    criteria: str,
    judge_model: str = "claude-opus-4-5"
) -> dict:
    prompt = f"""You are evaluating the quality of an AI assistant's response.

INPUT:
{input_text}

RESPONSE TO EVALUATE:
{output_text}

EVALUATION CRITERIA:
{criteria}

Score the response on a scale of 1-5 for each criterion.
Return JSON with keys matching each criterion and integer values 1-5.
Also include "reasoning": a one-sentence explanation.

Return ONLY valid JSON, no prose.
"""
    
    response = llm.complete(judge_model, prompt)
    return json.loads(response.content)

# Example usage
scores = llm_judge(
    input_text="Summarize the key risks of this investment",
    output_text="The main risk is market volatility...",
    criteria="""
    - accuracy: Is the content factually accurate?
    - completeness: Are all major points covered?
    - conciseness: Is the response appropriately brief?
    - format: Is the response well-structured?
    """
)

LLM-as-Judge: How to Do It Right

The Judge Prompt Structure

Your judge prompt needs:

Clear evaluation criteria with definitions
An explicit scoring scale with anchor descriptions (what does a 3 look like vs a 4?)
A request for reasoning before the score (not just a number)
An instruction to evaluate the response on its own merits, not compare it to other responses

Position Bias

Calibrating the Judge

Common causes of poor calibration:

The judge's criteria are too vague
The scoring scale is not anchored well enough
The judge is being asked to evaluate too many criteria at once (use separate calls per criterion)
The judge model is not strong enough for the task

When Human Evaluation Is Worth It

A/B Testing Prompts

You have Prompt A (the current version) and Prompt B (the proposed change). You want to know which is better.

def ab_test_prompts(
    prompt_a: str,
    prompt_b: str,
    eval_set: list[dict],
    scorer: callable,
    model: str
) -> dict:
    scores_a = []
    scores_b = []
    
    for example in eval_set:
        output_a = run_prompt(prompt_a, example["input"], model)
        output_b = run_prompt(prompt_b, example["input"], model)
        
        score_a = scorer(example["input"], output_a, example.get("golden"))
        score_b = scorer(example["input"], output_b, example.get("golden"))
        
        scores_a.append(score_a)
        scores_b.append(score_b)
    
    avg_a = sum(scores_a) / len(scores_a)
    avg_b = sum(scores_b) / len(scores_b)
    
    # Statistical significance test
    from scipy import stats
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    
    return {
        "prompt_a_score": avg_a,
        "prompt_b_score": avg_b,
        "winner": "B" if avg_b > avg_a else "A",
        "p_value": p_value,
        "significant": p_value < 0.05,
        "relative_improvement": (avg_b - avg_a) / avg_a * 100
    }

Eval Tools in 2026

PromptFoo

PromptFoo is an open-source command-line tool for automated prompt testing. You define prompts and test cases in a YAML config, and it runs them and produces a comparison report.

# promptfooconfig.yaml
prompts:
  - "You are a helpful assistant. {{input}}"
  - "You are a concise assistant. Answer briefly. {{input}}"

providers:
  - anthropic:claude-opus-4-5
  - openai:gpt-4o

tests:
  - vars:
      input: "Explain how transformers work"
    assert:
      - type: contains
        value: "attention"
      - type: llm-rubric
        value: "Response should be accurate and appropriate for a technical audience"
  
  - vars:
      input: "What is 2+2?"
    assert:
      - type: equals
        value: "4"

Run with: promptfoo eval

PromptFoo supports LLM-as-judge assertions, exact match, contains, regex, and custom assertion functions. It generates an HTML comparison report showing which prompt won on each test case.

LangSmith

Key features for prompt evaluation:

Annotate production outputs to build golden datasets from real traffic
Run automated evaluators (LLM-as-judge) on any dataset
Track prompt versions and compare metrics across versions
Set up alerts when metrics drop below a threshold

Custom Eval Loops

For full control, build your own eval loop. This is 50–100 lines of Python and gives you exactly what you need:

import json
from pathlib import Path
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class EvalResult:
    input: str
    output: str
    golden: str | None
    score: float
    reasoning: str | None
    model: str
    prompt_version: str
    timestamp: str

def run_eval(
    prompt_template: str,
    eval_set_path: str,
    scorer,
    model: str,
    prompt_version: str
) -> list[EvalResult]:
    eval_set = json.loads(Path(eval_set_path).read_text())
    results = []
    
    for example in eval_set:
        prompt = prompt_template.format(**example)
        output = call_llm(model, prompt)
        
        score_result = scorer(
            input=example["input"],
            output=output,
            golden=example.get("golden")
        )
        
        results.append(EvalResult(
            input=example["input"],
            output=output,
            golden=example.get("golden"),
            score=score_result["score"],
            reasoning=score_result.get("reasoning"),
            model=model,
            prompt_version=prompt_version,
            timestamp=datetime.utcnow().isoformat()
        ))
    
    return results

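def summarize_eval(results: list[EvalResult], pass_threshold: float = 0.8) -> dict:
    # Guard against an empty run, which would otherwise divide by zero.
    if not results:
        raise ValueError("no eval results to summarize")
    scores = [r.score for r in results]
    return {
        "n": len(results),
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        # Share of examples scoring at or above the pass threshold.
        "pass_rate": sum(1 for s in scores if s >= pass_threshold) / len(scores),
        # Full records for every failing example, ready for manual review.
        "failures": [
            asdict(r) for r in results if r.score < pass_threshold
        ]
    }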

Regression Testing Prompts

Set up a CI job (GitHub Actions or any other CI system) that:

Runs your eval suite against the current prod model on every deploy
Compares the score against the previous baseline
Fails the deploy if the score drops by more than 5% (or whatever threshold makes sense for your use case)
Stores results in a time-series database so you can track trends

# .github/workflows/prompt-regression.yml
name: Prompt Regression Test

on:
  push:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
  schedule:
    - cron: '0 9 * * 1'  # Weekly on Monday morning

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run prompt evals
        run: python evals/run_regression.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - name: Check regression
        run: python evals/check_regression.py --threshold 0.05
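
The workflow calls evals/check_regression.py without showing it. One way the core check could be written is sketched below. It assumes the 5% threshold means a relative drop in mean score against the stored baseline; if you mean an absolute drop in points, compare the raw difference instead. The function names are hypothetical.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Fraction by which the current score fell below the baseline.

    Returns 0.0 when the score held steady or improved.
    """
    if baseline <= 0:
        # A zero or negative baseline makes a relative drop meaningless.
        return 0.0
    return max(0.0, (baseline - current) / baseline)


def is_regression(baseline: float, current: float, threshold: float = 0.05) -> bool:
    """True when the relative drop exceeds the allowed threshold."""
    return relative_drop(baseline, current) > threshold


def exit_code(baseline: float, current: float, threshold: float = 0.05) -> int:
    """Exit status for CI: a non-zero value fails the job."""
    return 1 if is_regression(baseline, current, threshold) else 0
```

A script would load the baseline and current mean scores from wherever your eval run stores them, then pass the result of exit_code to sys.exit so that a regression fails the CI step.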

When a model update breaks your prompt, you have three options:

Patch the prompt — add examples, clarify ambiguous instructions, adjust the format specification
Pin the model version — use an older model version until you can patch (many APIs support this)
Accept the regression — if the new behavior is actually an improvement in aggregate, update your evals to reflect the new baseline

Prompt Metrics That Matter

Track these four metrics for every production prompt:

Task completion rate: What percentage of inputs produce a usable output (not a refusal, not a format error, not a timeout)? This is your top-line metric.
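
As a rough sketch, task completion rate can be computed from a per-output status label. The labels used here ("ok", "refusal", "format_error", "timeout") are hypothetical and stand in for however your pipeline classifies each output.

```python
from collections import Counter

USABLE = "ok"  # hypothetical label for an output that can be used as-is


def task_completion_rate(statuses: list[str]) -> float:
    """Share of outputs that are usable (not a refusal, format error, or timeout)."""
    if not statuses:
        raise ValueError("no outputs to measure")
    return sum(1 for s in statuses if s == USABLE) / len(statuses)


def failure_breakdown(statuses: list[str]) -> dict[str, int]:
    """Count of each non-usable status, to show where completions are lost."""
    return dict(Counter(s for s in statuses if s != USABLE))
```

The breakdown matters as much as the rate: a drop caused by timeouts calls for a different fix than a drop caused by refusals.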

A Practical Eval Workflow From Scratch

Here is the sequence to follow when building evals for a new prompt:

Week 1: Build the eval set

Collect 30 real or realistic inputs
Annotate 20 with golden outputs and write evaluation criteria for the other 10
Run the prompt on all 30, manually review every output
Add 10 more inputs targeting the failure patterns you found

Week 2: Automate evaluation

Implement exact match or LLM-as-judge based on task type
Run all 40 examples through automated eval
Check correlation between your manual scores and automated scores
Adjust the judge prompt or scoring criteria until the correlation exceeds 0.8

Week 3: A/B test and regression guard

Run your existing prompt (A) vs any proposed improvements (B) on the full eval set
Pick the winner based on automated eval scores + manual spot-check of failures
Set up the regression test to run weekly or on every deploy
Document the baseline score and what constitutes a regression
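
For the A/B step, a minimal sketch of a paired comparison is shown below. It tallies per-example wins and applies a two-sided sign test, which is a stand-in chosen here because it needs only the standard library. It is not the only valid test, and a chi-squared or t-test may suit your metric better. The inputs are assumed to be per-example scores for prompts A and B on the same eval set, in the same order.

```python
from math import comb


def tally(scores_a: list[float], scores_b: list[float]) -> tuple[int, int, int]:
    """Return (wins for A, wins for B, ties) across paired examples."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both prompts must be scored on the same examples")
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    return wins_a, wins_b, len(scores_a) - wins_a - wins_b


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test: chance of a split at least this lopsided if A and B were equal.

    Ties are excluded, as is standard for the sign test.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 6 wins for A, 1 for B, and 3 ties, the p-value is 0.125. That split looks decisive but is still consistent with noise at this sample size, which is why the manual spot-check of failures belongs in the decision.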

Ongoing: Expand with production data

Every week, sample 10 real production inputs and add them to the eval set with annotations
Rerun evals after any model version change
Track metrics monthly and investigate any downward trends

Common Mistakes in Prompt Evaluation

Only evaluating the happy path. Prompts that work on easy inputs often fail on edge cases. Make sure at least 20% of your eval set consists of adversarial or unusual inputs.
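
A quick way to enforce the 20% guideline is to tag each example and check the share whenever the eval set changes. The sketch below assumes a hypothetical "kind" field on each example with values such as "typical", "adversarial", or "unusual". Nothing in the eval set format described earlier requires that field.

```python
HARD_KINDS = {"adversarial", "unusual"}  # hypothetical tag values


def hard_case_share(eval_set: list[dict]) -> float:
    """Fraction of examples tagged as adversarial or unusual."""
    if not eval_set:
        raise ValueError("eval set is empty")
    hard = sum(1 for example in eval_set if example.get("kind") in HARD_KINDS)
    return hard / len(eval_set)


def has_enough_hard_cases(eval_set: list[dict], minimum: float = 0.2) -> bool:
    """True when hard cases make up at least the minimum share of the eval set."""
    return hard_case_share(eval_set) >= minimum
```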

