write-judge-prompt

hamelsmu/evals-skills · updated Apr 8, 2026


$ npx skills add https://github.com/hamelsmu/evals-skills --skill write-judge-prompt
summary

Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.

skill.md

Write LLM-as-Judge Prompt

Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.

Prerequisites

  • Error analysis is complete. The failure mode is identified.
  • You have human-labeled traces for this failure mode (at least 20 Pass and 20 Fail examples).
  • A code-based evaluator cannot check this failure mode. Exhaust code-based options before reaching for a judge — many failure modes that seem subjective reduce to keyword checks, regex, or API calls when you understand the domain. Example: detecting whether an AI interviewing coach suggests "general" questions (asking about typical behavior instead of a specific past event) seems to require semantic understanding, but in practice a keyword check for words like "usually," "typical," and "normally" could work quite well.
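The keyword check described in the interviewing-coach example can be sketched in a few lines. This is a minimal illustration, not part of the skill: the marker list and the function name `flags_general_question` are assumptions, and a real list should come from your own error analysis.

```python
import re

# Hypothetical marker list based on the words named above; extend it from
# your own error analysis.
GENERAL_MARKERS = ("usually", "typical", "typically", "normally")

# Word boundaries keep the check from firing inside longer, unrelated words.
_PATTERN = re.compile(r"\b(" + "|".join(GENERAL_MARKERS) + r")\b", re.IGNORECASE)


def flags_general_question(suggested_question: str) -> bool:
    """Return True if the question contains a 'general behavior' marker."""
    return _PATTERN.search(suggested_question) is not None


print(flags_general_question("How do you usually handle conflict?"))
print(flags_general_question("Tell me about a time you disagreed with a manager."))
```

If a check like this agrees with your human labels, no judge is needed for this failure mode.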

The Four Components

Every judge prompt requires exactly four components:

1. Task and Evaluation Criterion

State what the judge evaluates. One failure mode per judge.

You are an evaluator assessing whether a real estate assistant's email
uses the appropriate tone for the client's persona.

Not: "Evaluate whether the email is good" or "Rate the email quality from 1-5."

2. Pass/Fail Definitions

Outcomes are strictly binary: Pass or Fail. No Likert scales, no letter grades, no partial credit. Define exactly what constitutes Pass and Fail. These definitions come from your error analysis failure mode descriptions.

## Definitions

PASS: The email matches the expected communication style for the client persona:
- Luxury Buyers: formal language, emphasis on exclusive features, premium
  market positioning, no casual slang
- First-Time Homebuyers: warm and encouraging tone, educational explanations,
  avoids jargon, patient and supportive
- Investors: data-driven language, ROI-focused, market analytics, concise
  and professional

FAIL: The email uses a tone mismatched to the client persona. Examples:
- Using casual slang ("hey, check out this pad!") for a luxury buyer
- Using heavy financial jargon for a first-time homebuyer
- Using overly emotional language for an investor

3. Few-Shot Examples

Include labeled Pass and Fail examples from your human-labeled data.

## Examples

### Example 1: PASS
Client Persona: Luxury Buyer
Email: "Dear Mr. Harrington, I am pleased to present an exclusive listing
at 1200 Pacific Heights Drive. This distinguished property features..."
Critique: The email opens with a formal salutation and uses language
consistent with luxury positioning — "exclusive listing," "distinguished
property." No casual slang or informal phrasing. The tone matches the
luxury buyer persona throughout.
Result: Pass

### Example 2: FAIL
Client Persona: Luxury Buyer
Email: "Hey! Just found this awesome place you might like. It's got a
pool and stuff, super cool neighborhood..."
Critique: The greeting "Hey!" is informal. Phrases like "awesome place,"
"got a pool and stuff," and "super cool" are casual slang inappropriate
for a luxury buyer. The email reads like a text message, not a
professional communication for a high-end client.
Result: Fail

### Example 3: PASS (borderline)
Client Persona: First-Time Homebuyer
Email: "Hi Sarah, I found a property that might be a great fit for your
first home. The neighborhood has good schools nearby, and the monthly
payment would be similar to what you're currently paying in rent..."
Critique: The greeting is warm but not overly casual. The email explains
the property in relatable terms — comparing mortgage to rent, mentioning
schools — which is educational without being condescending. It avoids
jargon like "amortization" or "LTV ratio." While not deeply technical,
this matches the supportive tone expected for a first-time buyer.
Result: Pass

Rules for selecting examples:

  • Include at least one clear Pass, one clear Fail, and one borderline case. Borderline examples are the most valuable — they teach nuance.
  • Draw examples from the training split (10-20% of labeled data set aside for this purpose).
  • Any example used in the judge prompt must be excluded from dev and test sets. Using dev/test examples is data leakage.
  • Two to four examples are typical. Performance plateaus somewhere between 4 and 8 examples, so adding more rarely helps.
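One way to keep few-shot examples out of the dev and test sets is to split the labeled traces once, up front, and draw examples only from the training slice. The sketch below assumes a 15% training fraction (within the 10-20% range above); the dev fraction, the seed, and the function name `split_labeled_traces` are illustrative assumptions, not values from the skill.

```python
import random


def split_labeled_traces(traces, train_frac=0.15, dev_frac=0.40, seed=0):
    """Shuffle once, then cut into non-overlapping train / dev / test lists.

    train: source of few-shot examples for the judge prompt
    dev / test: held out; never copy these into the prompt
    """
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split stable

    n_train = max(1, round(len(shuffled) * train_frac))
    n_dev = round(len(shuffled) * dev_frac)

    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Hypothetical labeled traces; real ones would carry the input, output and label.
traces = [{"id": i} for i in range(100)]
train, dev, test = split_labeled_traces(traces)
print(len(train), len(dev), len(test))
```

Because the slices come from a single shuffled list, no trace can land in more than one split.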

4. Structured Output Format

Enforce structured output using your LLM provider's schema enforcement (e.g., response_format in OpenAI, tool definitions in Anthropic) or a library like Instructor or Outlines. If the provider doesn't support schema enforcement, specify the JSON schema in the prompt.

The output must include a critique before the verdict. Placing the critique first forces the judge to articulate its assessment before committing to a decision.

{
  "critique": "string — detailed assessment of the output against the criterion",
  "result": "Pass or Fail"
}

Critiques must be detailed, not terse. A good critique explains what specifically was correct or incorrect and references concrete evidence from the output. The critiques in your few-shot examples set the bar for the level of detail the judge will produce.
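When the provider cannot enforce a schema, the judge's reply still has to be checked before it is trusted. The following is a minimal sketch using only the standard library; the function name `parse_verdict` and the exact error messages are assumptions. It accepts the two-field shape shown above, with the critique first and a strictly binary result.

```python
import json


def parse_verdict(raw: str) -> dict:
    """Validate a judge reply of the form {"critique": ..., "result": ...}."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    # json.loads preserves key order, so this also checks critique-first.
    keys = list(data)
    if keys != ["critique", "result"]:
        raise ValueError(f"expected keys ['critique', 'result'] in order, got {keys}")

    critique, result = data["critique"], data["result"]
    if not isinstance(critique, str) or not critique.strip():
        raise ValueError("critique must be a non-empty string")
    if result not in ("Pass", "Fail"):
        raise ValueError(f"result must be 'Pass' or 'Fail', got {result!r}")
    return {"critique": critique, "result": result}


reply = '{"critique": "Formal salutation; no casual slang.", "result": "Pass"}'
print(parse_verdict(reply)["result"])
```

Rejecting anything other than the literal strings Pass and Fail keeps partial-credit answers such as "Mostly pass" from slipping into your metrics.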

Choosing What to Pass to the Judge

Feed only what the judge needs for an accurate decision:

Failure Mode              What the Judge Needs
------------------------  ----------------------------------------------
Tone mismatch             Client persona + generated email
Answer faithfulness       Retrieved context + generated answer
SQL correctness           User query + generated SQL + schema
Instruction following     System prompt rules + generated response
Tool call justification   Conversation history + tool call + tool result

For long documents, feed only the relevant snippet, not the entire document.
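The mapping above can be encoded so that each judge receives only its own inputs and nothing else from the trace. This is a sketch under assumptions: the field names, the `REQUIRED_FIELDS` table, and `select_judge_inputs` are hypothetical and should be renamed to match your trace format.

```python
# Hypothetical field names mirroring the table above.
REQUIRED_FIELDS = {
    "tone_mismatch": ("client_persona", "generated_email"),
    "answer_faithfulness": ("retrieved_context", "generated_answer"),
    "sql_correctness": ("user_query", "generated_sql", "schema"),
    "instruction_following": ("system_prompt_rules", "generated_response"),
    "tool_call_justification": ("conversation_history", "tool_call", "tool_result"),
}


def select_judge_inputs(failure_mode: str, trace: dict) -> dict:
    """Return only the trace fields the judge for this failure mode needs."""
    fields = REQUIRED_FIELDS[failure_mode]
    missing = [name for name in fields if name not in trace]
    if missing:
        raise KeyError(f"trace is missing fields for {failure_mode}: {missing}")
    return {name: trace[name] for name in fields}


trace = {
    "client_persona": "Luxury Buyer",
    "generated_email": "Dear Mr. Harrington, I am pleased to present...",
    "retrieved_listings": ["..."],  # not needed for the tone judge, so dropped
}
print(sorted(select_judge_inputs("tone_mismatch", trace)))
```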

Model Selection

Start with the most capable model available. The same model used for the main task can also serve as the judge, because the judge performs a different, narrower task. Optimize for cost later, once alignment with human labels is confirmed.
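Confirming alignment comes down to comparing the judge's verdicts with human labels on held-out traces. The sketch below is one possible measure, not the validate-evaluator procedure itself: it assumes agreement is reported separately for human-labeled Pass and Fail traces, and the function name `judge_alignment` is hypothetical.

```python
def judge_alignment(human_labels, judge_labels):
    """Agreement with human labels, split by the human verdict.

    pass_agreement: share of human-Pass traces the judge also passed
    fail_agreement: share of human-Fail traces the judge also failed
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")

    on_pass = [j for h, j in zip(human_labels, judge_labels) if h == "Pass"]
    on_fail = [j for h, j in zip(human_labels, judge_labels) if h == "Fail"]
    if not on_pass or not on_fail:
        raise ValueError("need at least one human Pass and one human Fail")

    return {
        "pass_agreement": sum(j == "Pass" for j in on_pass) / len(on_pass),
        "fail_agreement": sum(j == "Fail" for j in on_fail) / len(on_fail),
    }


human = ["Pass", "Pass", "Pass", "Pass", "Fail", "Fail"]
judge = ["Pass", "Pass", "Pass", "Fail", "Fail", "Pass"]
print(judge_alignment(human, judge))
```

Reporting the two rates separately shows whether a cheaper model is too lenient or too strict, which a single accuracy number would hide.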

Anti-Patterns

  • Vague criteria like "is this helpful?" Target a specific, observable failure mode from error analysis.
  • Holistic judge for the entire trace. A single judge covering multiple dimensions produces unactionable verdicts.
  • No few-shot examples. Without examples, the model won't know what counts as a failure in your application.
  • Dev/test examples used as few-shot. This is data leakage. Use only the training split.
  • Likert scales (1-5, letter grades, etc.). Binary pass/fail only. Likert scales produce scores that sound precise but can't be calibrated: annotators disagree on the difference between a 3 and a 4, and the judge inherits that noise. Binary forces you to define a clear decision boundary upfront, which makes inter-annotator agreement measurable and the judge's errors actionable. If you need to capture severity, use multiple binary judges (e.g., "factually wrong" and "dangerously wrong") rather than one ordinal scale.
  • Skipping validation. Measure alignment with human labels using validate-evaluator before trusting the judge.
  • Judges for specification failures without fixing the prompt first. If the prompt never asked for the behavior, add the instruction before building an evaluator. For critical requirements, a judge can still serve as a regression guard.
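The suggestion to capture severity with multiple binary judges instead of one ordinal scale can be sketched as follows. The severity labels and the function name are illustrative assumptions. Each argument is the Pass/Fail verdict from a separate judge, where Fail means that judge's failure mode was detected.

```python
def severity_from_binary_verdicts(factually_wrong: str, dangerously_wrong: str) -> str:
    """Combine two independent Pass/Fail verdicts into a severity label."""
    for verdict in (factually_wrong, dangerously_wrong):
        if verdict not in ("Pass", "Fail"):
            raise ValueError(f"verdicts must be 'Pass' or 'Fail', got {verdict!r}")

    if dangerously_wrong == "Fail":
        return "dangerous"  # hypothetical label: most severe outcome wins
    if factually_wrong == "Fail":
        return "incorrect"  # hypothetical label: wrong but not dangerous
    return "ok"


print(severity_from_binary_verdicts("Fail", "Pass"))
```

Each judge keeps its own Pass/Fail definition and can be validated against human labels on its own, which an ordinal 1-5 score does not allow.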

How to use write-judge-prompt on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add write-judge-prompt

2. Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills add https://github.com/hamelsmu/evals-skills --skill write-judge-prompt

The skills CLI fetches write-judge-prompt from the GitHub repository hamelsmu/evals-skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Cursor
│ • Windsurf

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/write-judge-prompt

Reload or restart Cursor to activate write-judge-prompt. Access the skill through slash commands (e.g., /write-judge-prompt) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.

Use Cases

User Story & Requirements Generation

Create detailed user stories, acceptance criteria, and feature specs

Example

Generate user stories for 'password reset feature' with acceptance criteria, edge cases, and test scenarios

Reduce spec writing time by 50%, ensure comprehensive coverage

Competitive Analysis

Research competitors, compare features, identify gaps

Example

Analyze 5 competitor products, create feature comparison matrix, suggest differentiation opportunities

Complete competitive research in 2 hours instead of 2 days

Roadmap Prioritization

Evaluate features using frameworks (RICE, ICE, Kano) and create prioritized backlogs

Example

Score 20 feature ideas using RICE framework, generate prioritized roadmap with rationale

Make data-driven prioritization decisions faster

Stakeholder Communication

Draft PRDs, status updates, and stakeholder presentations

Example

Create executive summary of Q3 roadmap, monthly progress report, feature launch announcement

Save 3-5 hours/week on communication overhead

Implementation Guide

Prerequisites

  • Claude Desktop or compatible AI client
  • Access to product documentation and roadmap tools (Jira, Notion, etc.)
  • Understanding of product management frameworks (RICE, Jobs-to-be-Done, etc.)
  • Stakeholder contact information and communication channels

Time Estimate

30-60 minutes to see productivity improvements

Installation Steps

  1. Install product management skill
  2. Start with user story generation for known feature
  3. Progress to competitive analysis: research 2-3 competitors
  4. Use for roadmap prioritization: apply RICE/ICE scoring
  5. Draft stakeholder communications and refine based on feedback
  6. Build template library for recurring PM tasks
  7. Share effective prompts with product team

Common Pitfalls

  • Not validating competitive research—verify facts before sharing
  • Accepting user stories without involving engineering team
  • Over-relying on frameworks without qualitative judgment
  • Not customizing outputs to company culture and communication style
  • Skipping stakeholder validation of generated requirements

Best Practices

✓ Do

  • Validate research and competitive analysis with real data
  • Collaborate with engineering when generating technical requirements
  • Customize frameworks and templates to your company context
  • Use skill for first drafts, refine with stakeholder input
  • Document successful prompt patterns for PM tasks
  • Combine AI efficiency with human judgment and intuition

✗ Don't

  • Don't publish competitive analysis without fact-checking
  • Don't finalize user stories without engineering review
  • Don't make prioritization decisions solely on AI scoring
  • Don't skip customer validation of generated requirements
  • Don't ignore company-specific context and culture

💡 Pro Tips

  • Provide context: company goals, constraints, customer feedback
  • Ask for alternatives: 'Show 3 ways to prioritize this roadmap'
  • Request stakeholder-specific formatting: 'Executive summary vs. engineering spec'
  • Use skill for 70% generation + 30% customization to company needs

When to Use This

✓ Use When

Use for user story writing, competitive research, roadmap prioritization, stakeholder communication, and PRD drafting. Best for reducing repetitive documentation and research work.

✗ Avoid When

Avoid for strategic product vision (requires deep customer empathy), pricing decisions (needs market and financial expertise), or when face-to-face customer discovery is more valuable than speed.

Learning Path

  1. Basic: user stories, feature specs, status updates
  2. Intermediate: competitive analysis, prioritization frameworks, PRDs
  3. Advanced: product strategy, go-to-market planning, OKR setting
  4. Expert: product vision, market positioning, business model innovation

Ratings

4.6 · 75 reviews
  • Advait Brown · Dec 28, 2024

    We added write-judge-prompt from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Ava Sharma · Dec 24, 2024

    write-judge-prompt fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Ama Singh · Dec 24, 2024

    Solid pick for teams standardizing on skills: write-judge-prompt is focused, and the summary matches what you get after install.

  • Pratham Ware · Dec 20, 2024

    Solid pick for teams standardizing on skills: write-judge-prompt is focused, and the summary matches what you get after install.

  • Advait Johnson · Dec 12, 2024

    write-judge-prompt has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Isabella Jackson · Dec 8, 2024

    Solid pick for teams standardizing on skills: write-judge-prompt is focused, and the summary matches what you get after install.

  • Mia Ndlovu · Nov 27, 2024

    We added write-judge-prompt from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Diego Reddy · Nov 19, 2024

    Solid pick for teams standardizing on skills: write-judge-prompt is focused, and the summary matches what you get after install.

  • William Robinson · Nov 15, 2024

    write-judge-prompt has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • William Thompson · Nov 15, 2024

    We added write-judge-prompt from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.
