validate-evaluator
hamelsmu/evals-skills · updated Apr 8, 2026
Calibrate an LLM judge against human judgment.
Validate Evaluator
Overview
- Split human-labeled data into train (10-20%), dev (40-45%), test (40-45%)
- Run judge on dev set and measure TPR/TNR
- Iterate on the judge until TPR and TNR > 90% on dev set
- Run once on held-out test set for final TPR/TNR
- Apply bias correction formula to production data
Prerequisites
- A built LLM judge prompt (from write-judge-prompt)
- Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
- Aim for ~50 Pass and ~50 Fail (balanced, even if real distribution is skewed)
- Labels must come from a domain expert, not outsourced annotators
- Candidate few-shot examples from your labeled data
Core Instructions
Step 1: Create Data Splits
Split human-labeled data into three disjoint sets:
| Split | Size | Purpose | Rules |
|---|---|---|---|
| Training | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
| Dev | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
| Test | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |
Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.

    from sklearn.model_selection import train_test_split

    # labeled_data is indexed by column name, e.g. a pandas DataFrame with a 'label' column

    # First split: separate test set
    train_dev, test = train_test_split(
        labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
    )

    # Second split: separate training examples from dev set
    train, dev = train_test_split(
        train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
    )
    # Result: ~15% train, ~45% dev, ~40% test

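Before using the splits, it can help to confirm that they are disjoint and that each one contains both classes. The following is a minimal sketch, assuming each example is a dict with a unique `id` and a `label`; those field names and the `check_splits` helper are hypothetical and not part of the original skill.

```python
from collections import Counter

def check_splits(train, dev, test):
    """Return per-split label counts; raise if an example id appears twice.

    Each split is a list of dicts with the hypothetical keys 'id' and 'label'.
    """
    splits = {"train": train, "dev": dev, "test": test}
    seen = {}  # example id -> name of the split where it was first seen
    for name, rows in splits.items():
        for row in rows:
            if row["id"] in seen:
                raise ValueError(
                    f"example {row['id']} appears in {seen[row['id']]} and {name}"
                )
            seen[row["id"]] = name
    return {name: Counter(r["label"] for r in rows) for name, rows in splits.items()}

# Toy data for illustration only
train = [{"id": 1, "label": "Pass"}, {"id": 2, "label": "Fail"}]
dev = [{"id": 3, "label": "Pass"}, {"id": 4, "label": "Fail"}]
test = [{"id": 5, "label": "Pass"}, {"id": 6, "label": "Fail"}]

counts = check_splits(train, dev, test)
print(counts["dev"])
```

A split with zero Fail examples would make TNR undefined, so the counts are worth checking before any judge run.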
Step 2: Run Evaluator on Dev Set
Run the judge on every example in the dev set. Compare predictions to human labels.
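A minimal sketch of this step follows. It assumes each dev example is a dict with a `trace` and a human `label`, and that the judge is wrapped in a callable that returns `"Pass"` or `"Fail"`. The names `run_judge`, `judge_fn`, and `toy_judge` are hypothetical; a real `judge_fn` would call the LLM with the judge prompt.

```python
def run_judge(examples, judge_fn):
    """Apply judge_fn to every example; return parallel human and judge label lists."""
    human_labels, judge_labels = [], []
    for ex in examples:
        verdict = judge_fn(ex["trace"])
        if verdict not in ("Pass", "Fail"):
            # Fail loudly on malformed judge output rather than guessing a label
            raise ValueError(f"unexpected judge output: {verdict!r}")
        human_labels.append(ex["label"])
        judge_labels.append(verdict)
    return human_labels, judge_labels

# Stand-in judge for illustration only (not an LLM call)
def toy_judge(trace):
    return "Fail" if "error" in trace.lower() else "Pass"

dev_examples = [
    {"trace": "Answered the billing question correctly", "label": "Pass"},
    {"trace": "Error: quoted the wrong refund policy", "label": "Fail"},
    {"trace": "Gave a vague, unhelpful reply", "label": "Fail"},
]

human_labels, judge_labels = run_judge(dev_examples, toy_judge)
print(list(zip(human_labels, judge_labels)))
```

Keeping the two label lists aligned by position makes them directly usable as inputs to the TPR/TNR computation in Step 3.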
Step 3: Measure TPR and TNR
TPR (True Positive Rate): When a human says Pass, how often does the judge also say Pass?
TPR = (judge says Pass AND human says Pass) / (human says Pass)
TNR (True Negative Rate): When a human says Fail, how often does the judge also say Fail?
TNR = (judge says Fail AND human says Fail) / (human says Fail)

    from sklearn.metrics import confusion_matrix

    tn, fp, fn, tp = confusion_matrix(human_labels, evaluator_labels,
                                      labels=['Fail', 'Pass']).ravel()
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)

Use TPR/TNR, not Precision/Recall or raw accuracy. These two metrics directly map to the bias correction formula. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth.
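For the human-vs-human case, Cohen's Kappa compares observed agreement with the agreement expected by chance. Below is a dependency-free sketch for two annotators who labeled the same traces. The `cohens_kappa` helper and the sample labels are illustrative and not part of the original skill.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Fraction of items where the annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

annotator_a = ["Pass", "Pass", "Fail", "Fail", "Pass", "Fail"]
annotator_b = ["Pass", "Fail", "Fail", "Fail", "Pass", "Fail"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")  # kappa = 0.67
```

Here the annotators agree on 5 of 6 items (observed 0.83) against a chance agreement of 0.50, which gives a Kappa of about 0.67.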
Step 4: Inspect Disagreements
Examine every case where the judge disagrees with human labels:
| Disagreement Type | Judge | Human | Fix |
|---|---|---|---|
| False Pass | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
| False Fail | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |
For each disagreement, determine whether to:
- Clarify wording in the judge prompt
- Swap or add few-shot examples from the training set
- Add explicit rules for the edge case
- Split the criterion into more specific sub-checks
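To review disagreements systematically, the dev examples can be grouped by disagreement type first. This is a sketch under the assumption that examples are dicts with `trace` and `label` keys and that judge labels are stored in a parallel list; `collect_disagreements` is a hypothetical helper.

```python
def collect_disagreements(examples, judge_labels):
    """Group examples where the judge and the human label differ."""
    false_pass, false_fail = [], []
    for ex, judged in zip(examples, judge_labels):
        if judged == "Pass" and ex["label"] == "Fail":
            false_pass.append(ex)  # judge too lenient
        elif judged == "Fail" and ex["label"] == "Pass":
            false_fail.append(ex)  # judge too strict
    return {"False Pass": false_pass, "False Fail": false_fail}

# Toy data for illustration only
examples = [
    {"trace": "trace A", "label": "Pass"},
    {"trace": "trace B", "label": "Fail"},
    {"trace": "trace C", "label": "Pass"},
    {"trace": "trace D", "label": "Fail"},
]
judge_labels = ["Pass", "Pass", "Fail", "Fail"]

groups = collect_disagreements(examples, judge_labels)
for kind, items in groups.items():
    print(kind, [ex["trace"] for ex in items])
```

Reading each group in one pass makes it easier to spot a shared cause, such as a missing rule or a misleading few-shot example.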
Step 5: Iterate
Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.
Stopping criteria:
- Target: TPR > 90% AND TNR > 90%
- Minimum acceptable: TPR > 80% AND TNR > 80%
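The two thresholds can be encoded as a small check that runs after each dev-set pass. This is a sketch only; `alignment_status` and its return strings are hypothetical, and the default thresholds are the ones listed above.

```python
def alignment_status(tpr, tnr, target=0.90, minimum=0.80):
    """Classify a dev-set result against the stopping criteria."""
    if tpr > target and tnr > target:
        return "target met"
    if tpr > minimum and tnr > minimum:
        return "minimum met"
    return "keep iterating"

print(alignment_status(0.92, 0.91))  # target met
print(alignment_status(0.92, 0.85))  # minimum met
print(alignment_status(0.95, 0.70))  # keep iterating
```

Both rates must clear the threshold. A high TPR does not compensate for a low TNR.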
If alignment stalls:
| Problem | Solution |
|---|---|
| TPR and TNR both low | Use a more capable LLM for the judge |
| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
| Consistently wrong on certain input types | Add targeted few-shot examples from training set |
| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |
Step 6: Final Measurement on Test Set
Run the judge exactly once on the held-out test set. Record final TPR and TNR.
Do not iterate on the judge after seeing test set results. If the judge needs more work, go back to Step 4 with new dev data.
Step 7 (Optional): Estimate True Success Rate (Rogan-Gladen Correction)
Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:
theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
Where:
- p_obs = fraction of unlabeled traces the judge scored as Pass
- TPR, TNR = from test set measurement
- theta_hat = corrected estimate of true success rate
Clip to [0, 1]. Invalid when TPR + TNR - 1 is near 0 (judge is no better than random).
Example:
- Judge TPR = 0.92, TNR = 0.88
- 500 production traces: 400 scored Pass -> p_obs = 0.80
- theta_hat = (0.80 + 0.88 - 1) / (0.92 + 0.88 - 1) = 0.68 / 0.80 = 0.85
- True success rate is ~85%, not the raw 80%
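The correction and the worked example above can be sketched as a short function. The name `corrected_success_rate` and the `min_denom` parameter are hypothetical. The `1e-6` guard mirrors the threshold used in the bootstrap code in Step 8, and how close to 0 counts as "near 0" remains a judgment call.

```python
def corrected_success_rate(p_obs, tpr, tnr, min_denom=1e-6):
    """Rogan-Gladen corrected pass rate, clipped to [0, 1]."""
    denom = tpr + tnr - 1
    if abs(denom) < min_denom:
        # The judge is no better than random, so the correction is undefined
        raise ValueError("TPR + TNR - 1 is too close to 0 for a valid correction")
    theta = (p_obs + tnr - 1) / denom
    return min(1.0, max(0.0, theta))

# Numbers from the example: 400 of 500 production traces scored Pass
theta_hat = corrected_success_rate(p_obs=400 / 500, tpr=0.92, tnr=0.88)
print(f"{theta_hat:.2f}")  # 0.85
```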
Step 8: Confidence Interval
Compute a bootstrap confidence interval. A point estimate alone is not enough.

    import numpy as np

    def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
        """Bootstrap 95% CI for corrected success rate."""
        human = np.array(human_labels)
        evals = np.array(eval_labels)
        n = len(human)
        estimates = []
        for _ in range(n_bootstrap):
            # Resample the labeled test examples with replacement
            idx = np.random.choice(n, size=n, replace=True)
            h = human[idx]
            e = evals[idx]
            tp = ((h == 'Pass') & (e == 'Pass')).sum()
            fn = ((h == 'Pass') & (e == 'Fail')).sum()
            tn = ((h == 'Fail') & (e == 'Fail')).sum()
            fp = ((h == 'Fail') & (e == 'Pass')).sum()
            tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
            tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
            denom = tpr_b + tnr_b - 1
            if abs(denom) < 1e-6:
                continue  # skip resamples where the correction is undefined
            theta = (p_obs + tnr_b - 1) / denom
            estimates.append(np.clip(theta, 0, 1))
        return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)

    lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
    print(f"95% CI: [{lower:.2f}, {upper:.2f}]")

Or use judgy (pip install judgy):

    from judgy import estimate_success_rate

    result = estimate_success_rate(
        human_labels=test_human_labels,
        evaluator_labels=test_eval_labels,
        unlabeled_labels=prod_eval_labels
    )
    print(f"Corrected rate: {result.estimate:.2f}")
    print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")

Practical Guidance
- Pin exact model versions for LLM judges (e.g., gpt-4o-2024-05-13, not gpt-4o). Providers update models without notice, causing silent drift.
- Re-validate after changing the judge prompt, switching models, or when production confidence intervals widen unexpectedly.
- Use ~100 labeled examples (50 Pass, 50 Fail). Below 60, confidence intervals become wide.
- One trusted domain expert is the most efficient labeling path. If not feasible, have two annotators label 20-50 traces independently and resolve disagreements before proceeding.
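- Improving TPR usually narrows the confidence interval more than improving TNR. The correction formula divides by TPR + TNR - 1, and the estimate's sensitivity to TPR error scales with the true success rate, so when most traces pass, a low or noisy TPR is amplified into wide CIs.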
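One way to see why TPR tends to matter more is to differentiate the Step 7 correction with respect to each rate. The derivation below is a sketch added for illustration, not part of the original skill. It treats p_obs as fixed and writes D for the denominator.

```latex
\hat{\theta} = \frac{p_{\mathrm{obs}} + \mathrm{TNR} - 1}{D},
\qquad D = \mathrm{TPR} + \mathrm{TNR} - 1

\frac{\partial \hat{\theta}}{\partial \mathrm{TPR}}
  = -\frac{p_{\mathrm{obs}} + \mathrm{TNR} - 1}{D^{2}}
  = -\frac{\hat{\theta}}{D}

\frac{\partial \hat{\theta}}{\partial \mathrm{TNR}}
  = \frac{D - (p_{\mathrm{obs}} + \mathrm{TNR} - 1)}{D^{2}}
  = \frac{1 - \hat{\theta}}{D}
```

An error in TPR therefore shifts the estimate in proportion to theta_hat, while an error in TNR shifts it in proportion to 1 - theta_hat. When the corrected pass rate is above 50%, uncertainty in TPR has the larger effect, and both effects grow as D approaches 0.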
Anti-Patterns
- Assuming judges "just work" without validation. A judge may consistently miss failures or flag passing traces.
- Using raw accuracy or percent agreement. Use TPR and TNR. With class imbalance, raw accuracy is misleading.
- Using dev/test examples as few-shot examples. This is data leakage.
- Reporting dev set performance as final accuracy. Dev numbers are optimistic. The test set gives the unbiased estimate.
- Raw judge scores without bias correction. If you report an aggregate pass rate, apply the Rogan-Gladen formula (Step 7).
- Point estimates without confidence intervals. A corrected rate of 85% could easily be 78-92% with small test sets. Report the range so stakeholders know how much to trust the number.
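To illustrate the raw-accuracy anti-pattern, here is a sketch with made-up data: a judge that says Pass for everything on a set with 95 Pass and 5 Fail labels. The `judge_metrics` helper is hypothetical.

```python
def judge_metrics(human_labels, judge_labels):
    """Return (accuracy, TPR, TNR), treating 'Pass' as the positive class."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h == "Pass" and j == "Pass" for h, j in pairs)
    tn = sum(h == "Fail" and j == "Fail" for h, j in pairs)
    n_pass = sum(h == "Pass" for h, _ in pairs)
    n_fail = sum(h == "Fail" for h, _ in pairs)
    if n_pass == 0 or n_fail == 0:
        raise ValueError("need at least one Pass and one Fail human label")
    return (tp + tn) / len(pairs), tp / n_pass, tn / n_fail

# Made-up imbalanced data: the judge never says Fail
human = ["Pass"] * 95 + ["Fail"] * 5
judge = ["Pass"] * 100

accuracy, tpr, tnr = judge_metrics(human, judge)
print(f"accuracy={accuracy:.2f} TPR={tpr:.2f} TNR={tnr:.2f}")
# accuracy=0.95 TPR=1.00 TNR=0.00
```

The 95% accuracy hides a judge that catches no failures at all, which the TNR of 0 makes obvious.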
How to use validate-evaluator on Cursor
Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- Cursor installed and configured on your development machine
- Node.js version 16.0+ with npm package manager (verify with node --version)
- Active project directory or workspace where you want to add validate-evaluator
Execute installation command
Execute the skills CLI command in your project's root directory to begin installation:
The skills CLI fetches validate-evaluator from GitHub repository hamelsmu/evals-skills and configures it for Cursor.
Select Cursor when prompted
The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:
Verify installation
Confirm successful installation by checking the skill directory location:
Reload or restart Cursor to activate validate-evaluator. Access the skill through slash commands (e.g., /validate-evaluator) or your agent's skill management interface.