Aakash Gupta (@aakashgupta): "AI Evals are the most important new skill for PMs. So I got Ankit Shukla to put on a masterclass: It's the most intuitive, first-principles based explanation of evals I have ever heard. 🎬 Watch Now: https://youtu.be/Raa3qjEBvKE Also available on: Spotify: https://open.…"

Aakash Gupta

Feb 19

Product Growth

AI Evals are the most important new skill for PMs.

So I got Ankit Shukla to put on a masterclass:

It's the most intuitive, first-principles based explanation of evals I have ever heard.

🎬 Watch Now: youtu.be/Raa3qjEBvKE

Also available on:

Spotify: open.spotify.com/show/7…
Apple: podcasts.apple.com/us/p…

🏆 Thanks to our sponsor Reforge Build

It's the best AI prototyping (the other most important new skill for PMs) software out there. Check it out with my link: reforge.com/aakash

Here were some of my key takeaways:

1. AI products have 5 components: language model, context engineering, tools, orchestration, and UX. The language model is the only non-deterministic piece, and that's exactly why evals exist.

2. Evals are the PRD for AI engineers. You define success criteria, expected behavior, and thresholds. Engineers hill-climb against those scores until they hit 80-90%. Then you ship.

3. There are 3 types of evals: code-based (word count, length, format), LLM-as-judge (tone, relevance, guardrails), and human review (domain expertise). Use the cheapest method that works. "When you can use a needle, why use a sword?"

4. The eval dataset is where most of your effort goes. Build it from 4 sources: production data, research, synthetic data from LLMs, and domain experts.

5. Offline evals run before launch. Online evals run in production. Both feed each other in a continuous loop. Set it and forget it = data drift.

6. Most teams use GPT-5.2 for tasks GPT Nano handles at 1/25th the cost. The only way to know a cheaper model works at the same quality? Evals.

7. Don't use averages for latency. Use P95/P99. If 10% of users get 10x worse performance, your average can still look fine.

8. Hard feedback (thumbs up/down) is obvious. Soft feedback is where the signal hides: users regenerating answers, not closing sessions, escalating to support.

9. Write the eval guidance, then let AI write the actual eval prompt. Studies show AI writes better prompts than humans. Put humans where humans are better, AI where AI is better.

10. Evals are not QA rebranded. QA informs. PMs transform. You're not flagging bugs, you're shaping product behavior across every edge case.
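
To make takeaways 2, 3, and 7 concrete, here is a minimal sketch in Python. It is not from the episode: the 50-word budget, the sample outputs, the 0.8 threshold, and the latency numbers are all made-up assumptions, and `word_count_ok`, `pass_rate`, and `percentile` are hypothetical helper names. It shows a code-based eval scored against a ship threshold, then compares mean latency with P95/P99.

```python
import math
import statistics

def word_count_ok(text: str, max_words: int = 50) -> bool:
    """Code-based eval: the response stays within a word budget (assumed: 50)."""
    return len(text.split()) <= max_words

def pass_rate(outputs: list[str], check) -> float:
    """Fraction of outputs that pass a deterministic check."""
    return sum(1 for o in outputs if check(o)) / len(outputs)

def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical model outputs; the second one blows the word budget.
outputs = ["Short answer.", "word " * 80, "Another concise reply.", "Fine.", "Also fine."]
THRESHOLD = 0.8  # assumed ship bar, in the spirit of "80-90%"

rate = pass_rate(outputs, word_count_ok)
print(f"pass rate: {rate:.0%}, ship: {rate >= THRESHOLD}")

# Hypothetical latencies in seconds: 90 fast requests, 10 that are 10x slower.
latencies = [1.0] * 90 + [10.0] * 10
print("mean:", statistics.mean(latencies))
print("median:", statistics.median(latencies))
print("p95:", percentile(latencies, 95))
print("p99:", percentile(latencies, 99))
```

With these made-up numbers the mean comes out at 1.9 seconds and the median at 1.0, while P95 and P99 both report the 10.0 seconds that one in ten requests actually took. The same `pass_rate` helper would work with any other deterministic check (length, format) passed in place of `word_count_ok`.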

➕ Follow Aakash Gupta for daily AI PM tips.

Don't miss the full episode for the live walkthrough.

Feb 19

12:50 AM
