5 AI Evals Traps Every AI PM Should Know About:
(and what actually works)
1. Relying on Generic Metrics
Trap: You treat "hallucination," "toxicity," "helpfulness" as success metrics.
Why it fails: generic metrics miss domain-specific failure modes and can create false confidence.
Do this instead: use generic metrics only to triage traces (sort, filter, surface weird cases). Let your real metrics emerge from the failure modes you actually observe. See the next point.
Example: You can’t fix "10% hallucinations." You can fix "fails to parse invoice dates in this format."
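A minimal sketch of "triage, don't score": sort traces by a generic metric purely to decide what to read first. The trace fields and score name are hypothetical.

```python
# Sketch: a generic score as a triage tool, not a success metric.
# Field names ("trace_id", "hallucination_score") are hypothetical.
traces = [
    {"trace_id": "t1", "hallucination_score": 0.1},
    {"trace_id": "t2", "hallucination_score": 0.9},
    {"trace_id": "t3", "hallucination_score": 0.4},
]

# Surface the weirdest cases first for manual review; the goal is to
# discover domain-specific failure modes, not to report the score.
review_queue = sorted(traces, key=lambda t: t["hallucination_score"], reverse=True)
for t in review_queue:
    print(t["trace_id"], t["hallucination_score"])
```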
2. Skipping Error Analysis
Trap: You jump straight to "build evals" without looking at data.
Why it fails: you end up measuring the wrong thing.
Do this instead: log traces, read ~100 diverse ones, open-code them with free-form notes, cluster the notes into failure modes, repeat until you hit saturation.
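The loop above can be sketched in a few lines: free-form notes from reading traces (open coding) get tallied into failure-mode clusters. The note strings below are illustrative, not a fixed taxonomy.

```python
from collections import Counter

# Sketch of the error-analysis loop: read traces, attach a free-form
# failure note to each bad one (open coding), then tally the notes.
coded_notes = [
    "wrong invoice date format", "missing required field",
    "wrong invoice date format", "ignored refund policy",
    "wrong invoice date format", "missing required field",
]

failure_modes = Counter(coded_notes)
# The biggest clusters become your first real eval metrics.
for mode, count in failure_modes.most_common():
    print(count, mode)
```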
3. Synthetic Data Without Hypotheses
Trap: "Generate 100 test queries" with no structure.
Why it fails: you get happy-path coverage and miss the real breakpoints.
Do this instead: start with hypotheses, define 3 dimensions, generate tuples, review them, then generate synthetic queries.
Example: (Persona = angry) x (task = refund) x (scenario = ambiguous).
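A minimal sketch of the tuple step: enumerate the cross product of your dimensions, review the tuples, and only then generate queries. The dimension values are illustrative.

```python
from itertools import product

# Sketch: hypothesis-driven dimensions, not "generate 100 random queries".
personas = ["angry", "confused", "polite"]
tasks = ["refund", "cancellation"]
scenarios = ["ambiguous", "out_of_policy"]

# 3 x 2 x 2 = 12 tuples to review before any synthetic data is generated.
tuples = list(product(personas, tasks, scenarios))
print(len(tuples))
print(tuples[0])  # e.g. ('angry', 'refund', 'ambiguous')
```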
4. The Agreement % Trap
Trap: "Our judge agrees with humans 90% of the time, so it's good."
Why it fails: agreement is easily inflated by class imbalance. It can look great while the judge misses the failures that matter.
Do this instead: track TPR (recall) and TNR, plus precision or F1 when useful. Track TPR per failure mode. Overall numbers lie.
Example: If only 10% of cases are true failures, a judge that almost never flags failures can still show high "agreement." TPR exposes it immediately.
High agreement is cheap. High TPR on real failures is the job.
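The example above in a few lines of Python, with made-up labels: 100 cases, 10 true failures, a judge that catches only one of them still hits 91% agreement.

```python
# Sketch: high agreement can hide a judge that never catches failures.
# 1 = true failure, 0 = pass. Labels are illustrative.
human = [1] * 10 + [0] * 90
judge = [1] * 1 + [0] * 9 + [0] * 90  # misses 9 of 10 real failures

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
tpr = sum(h == 1 and j == 1 for h, j in zip(human, judge)) / sum(human)
tnr = sum(h == 0 and j == 0 for h, j in zip(human, judge)) / (len(human) - sum(human))

print(agreement)  # 0.91 -- looks great
print(tpr)        # 0.10 -- the judge is useless on real failures
print(tnr)        # 1.00
```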
5. Fuzzy Labels and Bloated Taxonomies
Trap: Likert scales, overlapping categories, or "one giant rubric."
Why it fails: labels get noisy, inconsistent, and impossible to debug.
Do this instead: start with a set of binary failure modes, non-overlapping, each easy to apply consistently.
Examples: "Followed policy? Yes/No," "Included required field? Yes/No."
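A sketch of what binary, non-overlapping labels look like in practice. The check functions are illustrative stand-ins for real judges (human or LLM).

```python
# Sketch: a set of binary checks instead of one Likert-scale rubric.
# The string checks are hypothetical stand-ins for real judges.
def followed_policy(output: str) -> bool:
    return "per our policy" in output.lower()

def included_required_field(output: str) -> bool:
    return "order id:" in output.lower()

output = "Per our policy, refunds take 5 days. Order ID: 123."
labels = {
    "followed_policy": followed_policy(output),
    "included_required_field": included_required_field(output),
}
print(labels)  # each label is a debuggable yes/no, not a 1-5 score
```

Each check is cheap to apply consistently and, when it fails, points at exactly one thing to fix.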
—
Find this helpful?
AI evals and error analysis are the highest-ROI skills for anyone working on AI products.
My detailed post on both is here. No email, no paywall, no comments required: productcompass.pm/p/eva…
P.S. Want to go deeper?
I recommend the AI Evals for Engineers & PMs Course by Hamel Husain and Shreya Shankar (3,000+ students, #1 Maven, starts Jan 26). You will immediately get unlimited access to all materials, the community, and future cohorts.
A special discount: bit.ly/aievals2026