AI evals are the most important new skill for PMs.
Here's OpenAI & Anthropic's step-by-step playbook:
Hamel Husain and Shreya Shankar walked me through their complete eval process on real production data.
🎬 Watch Now: youtu.be/J7N9FMouSKg
Available Everywhere
Spotify: open.spotify.com/show/7…
Apple: podcasts.apple.com/in/p…
✍️ Here are my favorite takeaways:
1. Everyone Needs Evals - Even if you dogfood well. Unless your application is a naive wrapper around a foundation model, you need systematic evals.
2. Your Demo Works, Production Doesn't - A user asks for a bathroom that is NOT connected; the AI returns ones that are. Markdown formatting leaking into text messages. Looking at traces catches all of this.
3. Error Analysis Is The Step Most Teams Skip - Review 100 traces, note problems, categorize, count. Takes 2-3 hours. Teaches you more than months of user interviews. (Sketch of the counting step below.)
4. Generic Metrics Are Useless - A generic "helpfulness" score won't catch your real problems. You need application-specific evals, and that requires PM involvement, not just engineering.
5. Build Binary Judges - Return true or false. Not 1-5 scales. Business decisions are binary anyway: either you fix something or you don't. (Judge sketch below.)
6. The Agreement Trap - 90% agreement sounds great until you realize a judge that guesses "pass" every time can hit that number. Measure TPR (true positive rate) and TNR (true negative rate) separately. Both must be above 80%. (Sketch below.)
7. Use Code When You Can - Format validation with regex. Tool-call validation with parameter checks. Save LLM judges for subjective quality. (Example checks below.)
8. PMs Must Own Error Analysis - Engineers usually lack the domain expertise to judge whether the product experience is good. You have the product taste. This is core PM work.
9. Start With Error Analysis First - Teams jump to dashboards without knowing what they're measuring. Error analysis is the foundation.
10. The Practice That Works - Instrument code. Review 100 traces. Categorize and count. Fix obvious issues. Build judges. Iterate monthly.
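A few rough Python sketches of the steps above — my own illustrations, not code from the episode. First, the categorize-and-count step from takeaway #3, assuming you've dumped your review notes into a CSV with a hand-filled category column (column names are placeholders):

```python
from collections import Counter
import csv

def count_failure_modes(path: str) -> Counter:
    """Tally failure categories noted while reviewing ~100 traces.
    Assumes a CSV with columns: trace_id, note, category."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            category = row["category"].strip().lower()
            if category:  # blank category = no problem found on this trace
                counts[category] += 1
    return counts

for category, n in count_failure_modes("trace_notes.csv").most_common():
    print(f"{category}: {n}")
```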
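For takeaway #5, a pass/fail judge can be just a prompt that is forced to answer one of two ways. A minimal sketch using the OpenAI Python SDK; the model name and the criterion are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's reply.
Criterion: the reply respects every constraint the user stated.
Answer with exactly one word: PASS or FAIL.

User message:
{user_message}

Assistant reply:
{assistant_reply}"""

def judge(user_message: str, assistant_reply: str) -> bool:
    """True = pass, False = fail. No 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_message=user_message, assistant_reply=assistant_reply)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```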
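For takeaway #6, agreement only means something once you split it into true positive rate and true negative rate against your own labels. A small sketch that also shows why the trap exists:

```python
def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """TPR = share of human-labeled passes the judge also passes.
    TNR = share of human-labeled failures the judge also fails."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    positives = sum(human)
    negatives = len(human) - positives
    return tp / positives, tn / negatives

# A judge that always says "pass" on a dataset where 90% of traces pass
# gets 90% agreement -- and a TNR of zero.
human_labels = [True] * 90 + [False] * 10
lazy_judge = [True] * 100
print(tpr_tnr(human_labels, lazy_judge))  # (1.0, 0.0)
```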
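And for takeaway #7, deterministic checks need no model at all. Two illustrative examples (the SMS and tool-call details are assumptions, not from the episode):

```python
import re

MARKDOWN_PATTERN = re.compile(r"(\*\*|__|^#{1,6}\s|\[[^\]]+\]\([^)]+\))", re.MULTILINE)

def sms_is_plain_text(reply: str) -> bool:
    """Fail replies containing markdown, which won't render in a text message."""
    return MARKDOWN_PATTERN.search(reply) is None

def tool_call_is_valid(call: dict) -> bool:
    """Check a hypothetical search_listings tool call for required, sane parameters."""
    args = call.get("arguments", {})
    return (
        call.get("name") == "search_listings"
        and isinstance(args.get("city"), str)
        and isinstance(args.get("max_price"), (int, float))
        and args["max_price"] > 0
    )
```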
👉 Watch the episode for all the live demos.
This is by far the most tactical video on the web.
What topics should I cover next in the pod?