5 AI Evals Traps Every AI PM Should Know About:
(and what actually works)
1. Relying on Generic Metrics
Trap: You treat "hallucination," "toxicity," "helpfulness" as success metrics.
Why it fails: generic metrics miss domain-specific failure modes and can create false confidence.
Do this instead: use generic metrics only to triage traces (sort, filter, surface weird cases). Let your real metrics emerge from the failure modes you actually observe. See the next point.
Example: You can’t fix "10% hallucinations." You can fix "fails to parse invoice dates in this format."
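A minimal sketch of "triage, don't score": sort traces by a generic metric purely to decide what to read first. The trace fields and score name are hypothetical.

```python
# Sketch: a generic score as a triage tool, not a success metric.
# Field names ("trace_id", "hallucination_score") are hypothetical.
traces = [
    {"trace_id": "t1", "hallucination_score": 0.1},
    {"trace_id": "t2", "hallucination_score": 0.9},
    {"trace_id": "t3", "hallucination_score": 0.4},
]

# Surface the weirdest cases first for manual review; the goal is to
# discover domain-specific failure modes, not to report the score.
review_queue = sorted(traces, key=lambda t: t["hallucination_score"], reverse=True)
for t in review_queue:
    print(t["trace_id"], t["hallucination_score"])
```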
2. Skipping Error Analysis
Trap: You jump straight to "build evals" without looking at data.
Why it fails: you end up measuring the wrong thing.
Do this instead: log traces, read ~100 diverse ones, open-code them with free-form notes, cluster the notes into failure modes, repeat until you hit saturation.
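The loop above can be sketched in a few lines: free-form notes from reading traces (open coding) get tallied into failure-mode clusters. The note strings below are illustrative, not a fixed taxonomy.

```python
from collections import Counter

# Sketch of the error-analysis loop: read traces, attach a free-form
# failure note to each bad one (open coding), then tally the notes.
coded_notes = [
    "wrong invoice date format", "missing required field",
    "wrong invoice date format", "ignored refund policy",
    "wrong invoice date format", "missing required field",
]

failure_modes = Counter(coded_notes)
# The biggest clusters become your first real eval metrics.
for mode, count in failure_modes.most_common():
    print(count, mode)
```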
3. Synthetic Data Without Hypotheses
Trap: "Generate 100 test queries" with no structure.
Why it fails: you get happy-path coverage and miss the real breakpoints.
Do this instead: start with hypotheses, define 3 dimensions, generate tuples, review them, then generate synthetic queries.
Example: (Persona = angry) x (task = refund) x (scenario = ambiguous).
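A minimal sketch of the tuple step: enumerate the cross product of your dimensions, review the tuples, and only then generate queries. The dimension values are illustrative.

```python
from itertools import product

# Sketch: hypothesis-driven dimensions, not "generate 100 random queries".
personas = ["angry", "confused", "polite"]
tasks = ["refund", "cancellation"]
scenarios = ["ambiguous", "out_of_policy"]

# 3 x 2 x 2 = 12 tuples to review before any synthetic data is generated.
tuples = list(product(personas, tasks, scenarios))
print(len(tuples))
print(tuples[0])  # e.g. ('angry', 'refund', 'ambiguous')
```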
4. The Agreement % Trap
Trap: "Our judge agrees with humans 90% of the time, so it's good."
Why it fails: agreement is easily inflated by class imbalance. It can look great while the judge misses the failures that matter.
Do this instead: track TPR (recall) and TNR, plus precision or F1 when useful. Track TPR per failure mode. Overall numbers lie.
Example: If only 10% of cases are true failures, a judge that almost never flags failures can still show high "agreement." TPR exposes it immediately.
High agreement is cheap. High TPR on real failures is the job.
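The example above in a few lines of Python, with made-up labels: 100 cases, 10 true failures, a judge that catches only one of them still hits 91% agreement.

```python
# Sketch: high agreement can hide a judge that never catches failures.
# 1 = true failure, 0 = pass. Labels are illustrative.
human = [1] * 10 + [0] * 90
judge = [1] * 1 + [0] * 9 + [0] * 90  # misses 9 of 10 real failures

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
tpr = sum(h == 1 and j == 1 for h, j in zip(human, judge)) / sum(human)
tnr = sum(h == 0 and j == 0 for h, j in zip(human, judge)) / (len(human) - sum(human))

print(agreement)  # 0.91 -- looks great
print(tpr)        # 0.10 -- the judge is useless on real failures
print(tnr)        # 1.00
```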
5. Fuzzy Labels and Bloated Taxonomies
Trap: Likert scales, overlapping categories, or "one giant rubric."
Why it fails: labels get noisy, inconsistent, and impossible to debug.
Do this instead: start with a set of binary failure modes, non-overlapping, each easy to apply consistently.
Examples: "Followed policy? Yes/No," "Included required field? Yes/No."
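A sketch of what binary, non-overlapping labels look like in practice. The check functions are illustrative stand-ins for real judges (human or LLM).

```python
# Sketch: a set of binary checks instead of one Likert-scale rubric.
# The string checks are hypothetical stand-ins for real judges.
def followed_policy(output: str) -> bool:
    return "per our policy" in output.lower()

def included_required_field(output: str) -> bool:
    return "order id:" in output.lower()

output = "Per our policy, refunds take 5 days. Order ID: 123."
labels = {
    "followed_policy": followed_policy(output),
    "included_required_field": included_required_field(output),
}
print(labels)  # each label is a debuggable yes/no, not a 1-5 score
```

Each check is cheap to apply consistently and, when it fails, points at exactly one thing to fix.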
—
Find this helpful?
AI evals and error analysis are the highest-ROI skills for anyone working on AI products.
My detailed post on both is here. No email, no paywall, no comments required: productcompass.pm/p/eva…
P.S. Want to go deeper?
I recommend the AI Evals for Engineers & PMs Course by Hamel Husain and Shreya Shankar (3,000+ students, #1 Maven, starts Jan 26). You will immediately get unlimited access to all materials, the community, and future cohorts.
A special discount: bit.ly/aievals2026