AI evals are the most important new skill for PMs.
Here's OpenAI & Anthropic's step-by-step playbook:
Hamel Husain and Shreya Shankar walked me through their complete eval process on real production data.
🎬 Watch Now: youtu.be/J7N9FMouSKg
Available Everywhere
Spotify: open.spotify.com/show/7…
Apple: podcasts.apple.com/in/p…
✍️ Here are my favorite takeaways:
1. Everyone Needs Evals - Even if you dogfood well. Unless your application is a naive wrapper around a foundation model, you need systematic evals.
2. Your Demo Works, Production Doesn't - A user asks for a bathroom that is NOT connected; the AI returns ones that are. Markdown formatting leaking into text messages. Looking at traces catches all of this.
3. Error Analysis Is The Step Most Teams Skip - Review 100 traces, note problems, categorize, count. Takes 2-3 hours. Teaches you more than months of user interviews. (Sketch of the counting step below.)
4. Generic Metrics Are Useless - A generic "helpfulness" score won't catch your real problems. You need application-specific evals, and that requires PM involvement, not just engineering.
5. Build Binary Judges - Return true or false. Not 1-5 scales. Business decisions are binary anyway: either you fix something or you don't. (Judge sketch below.)
6. The Agreement Trap - 90% agreement sounds great until you realize a judge that guesses "pass" every time can hit that number. Measure TPR (true positive rate) and TNR (true negative rate) separately. Both must be above 80%. (Sketch below.)
7. Use Code When You Can - Format validation with regex. Tool-call validation with parameter checks. Save LLM judges for subjective quality. (Example checks below.)
8. PMs Must Own Error Analysis - Engineers usually lack the domain expertise to judge whether the product experience is good. You have the product taste. This is core PM work.
9. Start With Error Analysis First - Teams jump to dashboards without knowing what they're measuring. Error analysis is the foundation.
10. The Practice That Works - Instrument code. Review 100 traces. Categorize and count. Fix obvious issues. Build judges. Iterate monthly.
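A few rough Python sketches of the steps above — my own illustrations, not code from the episode. First, the categorize-and-count step from takeaway #3, assuming you've dumped your review notes into a CSV with a hand-filled category column (column names are placeholders):

```python
from collections import Counter
import csv

def count_failure_modes(path: str) -> Counter:
    """Tally failure categories noted while reviewing ~100 traces.
    Assumes a CSV with columns: trace_id, note, category."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            category = row["category"].strip().lower()
            if category:  # blank category = no problem found on this trace
                counts[category] += 1
    return counts

for category, n in count_failure_modes("trace_notes.csv").most_common():
    print(f"{category}: {n}")
```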
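For takeaway #5, a pass/fail judge can be just a prompt that is forced to answer one of two ways. A minimal sketch using the OpenAI Python SDK; the model name and the criterion are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's reply.
Criterion: the reply respects every constraint the user stated.
Answer with exactly one word: PASS or FAIL.

User message:
{user_message}

Assistant reply:
{assistant_reply}"""

def judge(user_message: str, assistant_reply: str) -> bool:
    """True = pass, False = fail. No 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_message=user_message, assistant_reply=assistant_reply)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```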
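For takeaway #6, agreement only means something once you split it into true positive rate and true negative rate against your own labels. A small sketch that also shows why the trap exists:

```python
def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """TPR = share of human-labeled passes the judge also passes.
    TNR = share of human-labeled failures the judge also fails."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    positives = sum(human)
    negatives = len(human) - positives
    return tp / positives, tn / negatives

# A judge that always says "pass" on a dataset where 90% of traces pass
# gets 90% agreement -- and a TNR of zero.
human_labels = [True] * 90 + [False] * 10
lazy_judge = [True] * 100
print(tpr_tnr(human_labels, lazy_judge))  # (1.0, 0.0)
```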
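And for takeaway #7, deterministic checks need no model at all. Two illustrative examples (the SMS and tool-call details are assumptions, not from the episode):

```python
import re

MARKDOWN_PATTERN = re.compile(r"(\*\*|__|^#{1,6}\s|\[[^\]]+\]\([^)]+\))", re.MULTILINE)

def sms_is_plain_text(reply: str) -> bool:
    """Fail replies containing markdown, which won't render in a text message."""
    return MARKDOWN_PATTERN.search(reply) is None

def tool_call_is_valid(call: dict) -> bool:
    """Check a hypothetical search_listings tool call for required, sane parameters."""
    args = call.get("arguments", {})
    return (
        call.get("name") == "search_listings"
        and isinstance(args.get("city"), str)
        and isinstance(args.get("max_price"), (int, float))
        and args["max_price"] > 0
    )
```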
👉 Watch the episode for all the live demos.
This is by far the most tactical video on the web.
What topics should I cover next in the pod?