eval-audit

hamelsmu/evals-skills · updated Apr 8, 2026


$ npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit
summary

Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.

skill.md

Eval Audit

Inspect an LLM eval pipeline and produce a prioritized list of problems with concrete next steps.

Overview

  1. Gather eval artifacts: traces, evaluator configs, judge prompts, labeled data, metrics dashboards
  2. Run diagnostic checks across six areas
  3. Produce a findings report ordered by impact, with each finding linking to a fix

Prerequisites

Access to eval artifacts (traces, evaluator configs, judge prompts, labeled data) via an observability MCP server or local files. If none exist, skip to "No Eval Infrastructure."

Connecting to Eval Infrastructure

Check whether the user has an observability MCP server connected (Phoenix, Braintrust, LangSmith, Truesight, or similar). If one is available, use it to pull traces, evaluator definitions, and experiment results. If not, ask for local files: CSVs, JSON trace exports, notebooks, or evaluation scripts.

Diagnostic Checks

Work through each area below. Inspect available artifacts, determine whether the problem exists, and record a finding if it does.

Prioritize findings by impact on the user's product. Present the most impactful findings first.

1. Error Analysis

Check: Has the user done systematic error analysis on real or synthetic traces?

Look for: labeled trace datasets, failure category definitions, notes from trace review. If evaluators exist but no documented failure categories, error analysis was likely skipped.

Finding if missing: Evaluators built without error analysis measure generic qualities ("helpfulness", "coherence") instead of actual failure modes. Start with error-analysis. If no traces exist yet, run generate-synthetic-data first.

See: Your AI Product Needs Evals, LLM Evals FAQ

Check: Were failure categories brainstormed or observed?

Generic labels borrowed from research ("hallucination score", "toxicity", "coherence") suggest brainstorming. Application-grounded categories ("missing query constraints", "wrong client tone", "fabricated property features") suggest observation.

Finding if brainstormed: Generic categories miss application-specific failures and produce evaluators that score well on paper but miss real problems. Re-do with error-analysis, starting from traces.

See: Who Validates the Validators?

2. Evaluator Design

Check: Are evaluators binary pass/fail?

Flag any that use Likert scales (1-5), letter grades (A-F), or numeric scores without a clear pass/fail threshold.

Finding if not binary: Likert scales are difficult to calibrate. Annotators disagree on the difference between a 3 and a 4, and judges inherit that noise. Consider converting to binary pass/fail with explicit definitions using write-judge-prompt.

See: Creating an LLM Judge That Drives Business Results

Check: Do LLM judge prompts target specific failure modes?

Flag any that evaluate holistically ("Is this response helpful?", "Rate the quality of this output").

Finding if vague: Holistic judges produce unactionable verdicts. Each judge should check exactly one failure mode with explicit pass/fail definitions and few-shot examples. Use write-judge-prompt.
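
As an illustration of what "one failure mode, explicit pass/fail definitions, few-shot examples" can look like in practice, here is a minimal sketch of a prompt builder. The `Example` class, the `build_judge_prompt` function, and the prompt wording are all hypothetical and are not taken from write-judge-prompt; the failure mode reuses the "fabricated property features" example from the error analysis section.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One labeled few-shot example (hypothetical structure)."""
    output: str
    verdict: str  # "Pass" or "Fail"
    reason: str


def build_judge_prompt(failure_mode: str, pass_def: str, fail_def: str,
                       examples: list[Example], output: str) -> str:
    """Assemble a judge prompt that checks exactly one failure mode."""
    shots = "\n\n".join(
        f"Output: {ex.output}\nReasoning: {ex.reason}\nVerdict: {ex.verdict}"
        for ex in examples
    )
    return (
        f"You are checking one failure mode: {failure_mode}.\n"
        f"PASS means: {pass_def}\n"
        f"FAIL means: {fail_def}\n\n"
        f"Examples:\n{shots}\n\n"
        "Now evaluate the output below. Give brief reasoning, then end with "
        "'Verdict: Pass' or 'Verdict: Fail'.\n\n"
        f"Output: {output}"
    )


prompt = build_judge_prompt(
    failure_mode="fabricated property features",
    pass_def="every feature mentioned appears in the listing data",
    fail_def="at least one feature mentioned is absent from the listing data",
    examples=[
        Example("The home has a pool.", "Fail", "The listing has no pool."),
        Example("The home has 3 bedrooms.", "Pass", "The listing shows 3 bedrooms."),
    ],
    output="The home has a two-car garage.",
)
```

Contrast this with a holistic prompt such as "Is this response helpful?": the verdict here points at a single named failure, so a Fail tells you what to fix.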

Check: Are code-based checks used where possible?

Flag LLM judges used for objectively checkable criteria: format validation, constraint satisfaction, keyword presence, schema conformance.

Finding if over-relying on judges: Replace objective checks with code (regex, parsing, schema validation, execution tests). Reserve LLM judges for criteria requiring interpretation.
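
The kinds of checks that code can handle are sketched below using only the Python standard library. The function names and the specific rules (required keys, word limit, banned pattern) are illustrative assumptions, not part of the skill.

```python
import json
import re


def check_json_keys(output: str, required_keys: set[str]) -> bool:
    """Schema conformance: output is a JSON object with all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def check_max_words(output: str, limit: int) -> bool:
    """Constraint satisfaction: output stays within a word budget."""
    return len(output.split()) <= limit


def check_pattern_absent(output: str, pattern: str) -> bool:
    """Keyword presence: pass only if the banned pattern never appears."""
    return re.search(pattern, output, flags=re.IGNORECASE) is None


good = '{"city": "Austin", "price": 450000}'
bad = "Sure! Here is the listing: city=Austin"

results = {
    "good_schema": check_json_keys(good, {"city", "price"}),
    "bad_schema": check_json_keys(bad, {"city", "price"}),
    "short_enough": check_max_words("Three bedrooms, two baths.", 10),
    "no_apology": check_pattern_absent("As an AI, I cannot...", r"\bas an ai\b"),
}
```

Checks like these are deterministic, cost nothing per run, and need no validation against human labels, which is why they are preferable wherever the criterion is objective.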

Check: Are similarity metrics used as primary evaluation?

Flag ROUGE, BERTScore, cosine similarity, or embedding distance used as the main evaluator for generation quality.

Finding if present: These metrics measure surface-level overlap, not correctness. They suit retrieval ranking but not generation evaluation. Replace with binary evaluators grounded in specific failure modes.
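
To see why overlap is not correctness, consider a hand-rolled unigram F1 score. This is a simplified stand-in for ROUGE-1, written for illustration rather than taken from any ROUGE library, and the sentences are made up.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (a rough stand-in for ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the meeting is at 3 pm on tuesday"
wrong_answer = "the meeting is at 5 pm on tuesday"

score = unigram_f1(wrong_answer, reference)
print(score)  # 0.875
```

Seven of the eight tokens match, so the score is high even though the one token that differs is the fact the user needed. A binary check for "states the wrong time" would fail this output outright.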

See: LLM Evals FAQ

3. Judge Validation

Check: Are LLM judges validated against human labels?

Look for: confusion matrices, TPR/TNR measurements, alignment scores. Judges running in production with no validation data are a critical finding.

Finding if unvalidated: An unvalidated judge may consistently miss failures or flag passing traces. Measure alignment using TPR and TNR on a held-out test set. Use validate-evaluator.

See: Creating an LLM Judge That Drives Business Results

Check: Is alignment measured with TPR/TNR or with raw accuracy?

Flag "accuracy", "percent agreement", or Cohen's Kappa as the primary alignment metric.

Finding if using accuracy: With class imbalance, raw accuracy is misleading: a judge that always says "Pass" gets 90% accuracy when 90% of traces pass but catches zero failures. Use TPR and TNR, which map directly to bias correction. Use validate-evaluator.
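
The always-"Pass" example can be worked through in a few lines. This sketch assumes "Pass" is the positive class, so TPR is the share of human-labeled passes the judge also passes and TNR is the share of human-labeled failures the judge also fails. The text does not spell out a correction formula; the `corrected_pass_rate` function below uses one common form (the Rogan-Gladen estimator) as an assumption.

```python
def accuracy(human: list[str], judge: list[str]) -> float:
    """Raw agreement between human labels and judge verdicts."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def tpr_tnr(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return (TPR, TNR). Assumes both classes occur in the human labels."""
    tp = sum(h == "Pass" and j == "Pass" for h, j in zip(human, judge))
    tn = sum(h == "Fail" and j == "Fail" for h, j in zip(human, judge))
    return tp / human.count("Pass"), tn / human.count("Fail")


def corrected_pass_rate(observed: float, tpr: float, tnr: float) -> float:
    """Estimate the true pass rate from the judge's observed pass rate."""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; cannot correct")
    return min(1.0, max(0.0, (observed + tnr - 1) / denom))


# 90 passing traces, 10 failing traces, and a judge that always says "Pass".
human = ["Pass"] * 90 + ["Fail"] * 10
always_pass = ["Pass"] * 100

acc = accuracy(human, always_pass)      # 0.9
tpr, tnr = tpr_tnr(human, always_pass)  # (1.0, 0.0)

# A usable judge: TPR 0.9, TNR 0.8, and it passes 76% of unlabeled traces.
estimate = corrected_pass_rate(0.76, 0.9, 0.8)  # about 0.8
```

The always-"Pass" judge looks fine at 90% accuracy, but its TNR of 0 exposes it, and the correction refuses to run because TPR + TNR - 1 is zero. For the usable judge, the observed 76% pass rate corrects to roughly 80%.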

Check: Is there a proper train/dev/test split?

Check whether few-shot examples in judge prompts come from the same data used to measure judge performance.

Finding if leaking: Using evaluation data as few-shot examples inflates alignment scores and hides real judge failures. Split into train (few-shot source), dev (iteration), and test (final measurement). Use validate-evaluator.
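
A minimal sketch of the split and a leakage check follows. The split fractions, the seed, and the idea of tracking examples by a trace ID are assumptions made for illustration; validate-evaluator may prescribe different proportions.

```python
import random


def split_labeled(trace_ids: list[str], seed: int = 0,
                  train_frac: float = 0.2, dev_frac: float = 0.4) -> dict[str, list[str]]:
    """Shuffle labeled trace IDs once and cut them into three disjoint sets."""
    ids = sorted(set(trace_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_dev = int(len(ids) * dev_frac)
    return {
        "train": ids[:n_train],                # source of few-shot examples
        "dev": ids[n_train:n_train + n_dev],   # used while iterating on the prompt
        "test": ids[n_train + n_dev:],         # touched only for final measurement
    }


def leaked_ids(few_shot_ids: list[str], eval_ids: list[str]) -> set[str]:
    """IDs that appear both in the judge prompt and in the evaluation set."""
    return set(few_shot_ids) & set(eval_ids)


splits = split_labeled([f"trace-{i}" for i in range(100)])
clean = leaked_ids(splits["train"], splits["test"])   # set()
dirty = leaked_ids(["trace-7"], ["trace-7", "trace-8"])  # {"trace-7"}
```

During an audit, the same `leaked_ids` idea applies in reverse: collect the IDs of the few-shot examples found in the judge prompt and intersect them with the data used to report alignment.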

4. Human Review Process

Check: Who is reviewing traces?

Determine whether domain experts or outsourced annotators are labeling data.

Finding if outsourced without domain expertise: General annotators catch formatting errors but miss domain-specific failures (wrong medical dosage, incorrect legal citation, mismatched property features). Involve a domain expert.

See: A Field Guide to Improving AI Products

Check: Are reviewers seeing full traces or just final outputs?

Finding if output-only: Reviewing only the final output hides where the pipeline broke. Show the full trace: input, intermediate steps, tool calls, retrieved context, and final output.

Check: How is data displayed to reviewers?

Flag raw JSON, unformatted text, or spreadsheets with trace data in cells.

Finding if raw format: Reviewers spend effort parsing data instead of judging quality. Show each piece of data in its natural representation: render markdown, syntax-highlight code, display tables as tables. Use build-review-interface.

See: LLM Evals FAQ

5. Labeled Data

Check: Is there enough labeled data?

For error analysis, ~100 traces is the rough target for saturation. For judge validation, ~50 Pass and ~50 Fail examples are needed for reliable TPR/TNR. If labeled data is sparse, collect more by sampling traces more effectively:

  • Random: Always include a random sample alongside other strategies to discover unknown issues.
  • Clustering: Group traces by semantic similarity and review representatives from each cluster.
  • Data analysis: Analyze statistics on latency, turns, tool calls, and tokens for outliers.
  • Classification: Use existing evals, a predictive model, or an LLM to surface problematic traces. Use with caution.
  • Feedback: Use explicit customer feedback (complaints, thumbs-down signals) to filter traces.
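
Three of the strategies above (random, data analysis, feedback) can be combined in a small sketch. The trace fields `id`, `latency_s`, and `thumbs_down`, and the "mean plus two standard deviations" outlier rule, are hypothetical choices for illustration; real trace exports will have different field names.

```python
import random
import statistics


def sample_for_review(traces: list[dict], n_random: int = 20, seed: int = 0) -> list[dict]:
    """Build a review queue: random sample plus feedback and latency outliers."""
    rng = random.Random(seed)
    # Random: always include an unbiased sample to surface unknown issues.
    picked = {t["id"]: t for t in rng.sample(traces, min(n_random, len(traces)))}
    # Feedback: add every trace with an explicit thumbs-down.
    for t in traces:
        if t.get("thumbs_down"):
            picked[t["id"]] = t
    # Data analysis: add traces whose latency is an outlier.
    latencies = [t["latency_s"] for t in traces]
    if len(latencies) >= 2:
        cutoff = statistics.mean(latencies) + 2 * statistics.stdev(latencies)
        for t in traces:
            if t["latency_s"] > cutoff:
                picked[t["id"]] = t
    return list(picked.values())


traces = [{"id": f"t{i}", "latency_s": 1.0, "thumbs_down": False} for i in range(50)]
traces.append({"id": "slow", "latency_s": 60.0, "thumbs_down": False})
traces.append({"id": "complaint", "latency_s": 1.0, "thumbs_down": True})

queue = sample_for_review(traces, n_random=10)
queue_ids = {t["id"] for t in queue}
```

Keying the queue by trace ID deduplicates traces that several strategies select, so reviewers never label the same trace twice.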

Finding if insufficient: Small datasets produce unreliable failure rates and wide confidence intervals. Use the sampling strategies above to collect more labeled data, or supplement with generate-synthetic-data.
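
To make "wide confidence intervals" concrete, the sketch below computes a Wilson score interval for an observed failure rate. The choice of the Wilson interval and the 95% level (z = 1.96) are assumptions for illustration; the skill does not name a specific method.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate observed in n labeled traces."""
    if n == 0:
        raise ValueError("need at least one labeled trace")
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


small = wilson_interval(2, 10)     # 20% failure rate from 10 traces
large = wilson_interval(20, 100)   # 20% failure rate from 100 traces
```

Both samples show a 20% failure rate, but with 10 traces the interval spans roughly 6% to 51%, while with 100 traces it narrows to roughly 13% to 29%.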

6. Pipeline Hygiene

Check: Is error analysis re-run after significant changes?

Check when error analysis was last performed relative to model switches, prompt rewrites, new features, or production incidents.

Finding if stale: Failure modes shift after pipeline changes, and evaluators built for the old pipeline miss new failure types. Re-run error analysis after every significant change.

Check: Are evaluators maintained?

Look for periodic re-validation of judges or refreshed evaluation datasets.

Finding if set-and-forget: Evaluators degrade as the pipeline evolves. Re-validate judges against fresh human labels and update eval datasets to reflect current usage.

No Eval Infrastructure

If the user has no eval artifacts (no traces, no evaluators, no labeled data):

  1. Start with error-analysis on a sample of real traces.
  2. If no production data exists, use generate-synthetic-data to create test inputs, run them through the pipeline, then apply error-analysis to the resulting traces.
  3. Do not recommend building evaluators, judges, or dashboards before completing error analysis.

Report Format

Present findings ordered by impact. For each:

### [Problem Title]
**Status:** [Problem exists / OK / Cannot determine]
[1-2 sentence explanation of the specific problem found]
**Fix:** [Concrete action, referencing a skill or article]

Group under the six diagnostic areas. Omit areas where no problems were found.
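
One way to assemble findings into this format is sketched below. The `Finding` class, the numeric `impact` field, and the `## ` area heading are hypothetical; the sketch also interprets "omit areas where no problems were found" as dropping any area whose checks all came back OK.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    area: str         # one of the six diagnostic areas
    title: str
    status: str       # "Problem exists", "OK", or "Cannot determine"
    explanation: str
    fix: str
    impact: int       # hypothetical rank; higher means more impactful


def render_report(findings: list[Finding]) -> str:
    """Group findings by area, drop all-OK areas, and order by impact."""
    by_area: dict[str, list[Finding]] = {}
    for f in findings:
        by_area.setdefault(f.area, []).append(f)
    kept = {a: fs for a, fs in by_area.items() if any(f.status != "OK" for f in fs)}
    ordered = sorted(kept.items(), key=lambda kv: -max(f.impact for f in kv[1]))
    lines: list[str] = []
    for area, fs in ordered:
        lines.append(f"## {area}")
        for f in sorted(fs, key=lambda f: -f.impact):
            lines += [f"### {f.title}", f"**Status:** {f.status}",
                      f.explanation, f"**Fix:** {f.fix}", ""]
    return "\n".join(lines).strip()


report = render_report([
    Finding("Evaluator Design", "Likert scales in use", "Problem exists",
            "Two judges score 1-5 with no pass threshold.",
            "Use write-judge-prompt.", impact=2),
    Finding("Judge Validation", "Judges unvalidated", "Problem exists",
            "No human labels exist for any production judge.",
            "Use validate-evaluator.", impact=3),
    Finding("Labeled Data", "Dataset size", "OK",
            "About 120 traces are labeled.", "None needed.", impact=0),
])
```

Here the Judge Validation finding is listed first because it has the highest impact, and the Labeled Data area is omitted entirely.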

Anti-Patterns

  • Running the audit as a checklist without inspecting actual artifacts.
  • Reporting generic advice disconnected from what was found in the user's pipeline.
  • Recommending evaluators before error analysis is complete.
  • Suggesting LLM judges for failures that code-based checks can handle.
  • Treating this audit as a one-time event. Re-audit after significant pipeline changes.
how to use eval-audit

How to use eval-audit on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add eval-audit
2. Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills add https://github.com/hamelsmu/evals-skills --skill eval-audit

The skills CLI fetches eval-audit from GitHub repository hamelsmu/evals-skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Windsurf
4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/eval-audit

Reload or restart Cursor to activate eval-audit. Access the skill through slash commands (e.g., /eval-audit) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.



general reviews

Ratings

4.7 · 27 reviews
  • Aditi Khan· Dec 24, 2024

    I recommend eval-audit for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Diego Taylor· Dec 24, 2024

    Keeps context tight: eval-audit is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Rahul Santra· Nov 19, 2024

    We added eval-audit from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Aanya Diallo· Nov 15, 2024

    Solid pick for teams standardizing on skills: eval-audit is focused, and the summary matches what you get after install.

  • Diego Abebe· Nov 15, 2024

    Registry listing for eval-audit matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Pratham Ware· Oct 10, 2024

    eval-audit fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Aanya Torres· Oct 6, 2024

    eval-audit has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Diego Farah· Oct 6, 2024

    eval-audit reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Ren Abebe· Sep 25, 2024

    Keeps context tight: eval-audit is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Yusuf Sharma· Sep 25, 2024

    I recommend eval-audit for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.
