
The CEO of the $800M eval platform just built an eval from scratch on my podcast. Live. No pre-written prompts, no pre-written data, no pre-written scoring functions.

Ankur Goyal runs Braintrust. Backed by a16z, Greylock, and ICONIQ. Used by Notion, Replit, Ramp, Cloudflare, and Dropbox. His team sees 10x more evals running than this time last year.

He started with a blank playground. Wrote a one-line system prompt: "You are a helpful assistant who answers questions from Linear." Had Opus auto-generate a test data set of questions about task workloads. Picked GPT 5 Nano as the model because it's cheap and fast.

First run: the model answered "How many tasks are assigned to me?" with "Happy to help with Linear. What would you like me to do?" Scores: 0 across the board.

So he connected Linear's MCP server, gave the model access to real tools, told it to stop asking clarifying questions and just use the tools. Created a scoring function with three levels instead of a vague numerical scale. Iterated on the system prompt with few-shot examples.
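
To make the three-level idea concrete, here is a minimal sketch. The level names, the score values, the phrase list, and the `grade` function are all my own assumptions for illustration, not what was built on the podcast, and a plain substring check stands in for whatever judging the real scorer did.

```python
from enum import Enum


class Level(Enum):
    """Three discrete outcomes instead of a vague 1-10 scale (values are assumed)."""
    FAIL = 0.0      # dodged the question or asked for clarification
    PARTIAL = 0.5   # attempted an answer, but the expected fact is missing
    PASS = 1.0      # answer contains the expected fact


# Hypothetical phrases that signal the model asked instead of answering.
CLARIFYING_PHRASES = ("what would you like", "could you clarify", "can you tell me more")


def grade(answer: str, expected: str) -> Level:
    """Map one model answer onto one of the three levels."""
    text = answer.lower()
    if expected.lower() in text:
        return Level.PASS
    if any(phrase in text for phrase in CLARIFYING_PHRASES):
        return Level.FAIL
    return Level.PARTIAL
```

With only three buckets, every score has a definition you can argue about, which is harder to do with a 6 versus a 7.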

Twenty minutes later: 0.75 across the board. Three parts of the workflow touched: data set, task function, scoring function. Each one improved through the same loop: run, look at the outputs, check them against your intuition, improve.
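
Those three parts fit in a few lines of plain Python. This is a sketch under my own assumptions: `run_eval`, `stub_task`, and `contains_expected` are hypothetical names, the expected answer is made up, and the stub stands in for a real model call. It is not Braintrust's API.

```python
from statistics import mean
from typing import Callable

Case = dict[str, str]


def run_eval(
    dataset: list[Case],
    task: Callable[[str], str],
    score: Callable[[str, str], float],
) -> list[dict]:
    """Run the task on every case and attach the output and its score."""
    rows = []
    for case in dataset:
        output = task(case["input"])
        rows.append({**case, "output": output, "score": score(output, case["expected"])})
    return rows


def stub_task(question: str) -> str:
    # Stand-in for a model call; mimics the unhelpful first-run answer.
    return "Happy to help with Linear. What would you like me to do?"


def contains_expected(output: str, expected: str) -> float:
    # Simplest possible scorer: 1.0 if the expected text appears, else 0.0.
    return 1.0 if expected.lower() in output.lower() else 0.0


# Made-up data set with one case; the expected value is illustrative only.
dataset = [{"input": "How many tasks are assigned to me?", "expected": "3"}]

rows = run_eval(dataset, stub_task, contains_expected)
average = mean(row["score"] for row in rows)
```

Each pass through the loop changes exactly one of the three arguments to `run_eval`, so you can tell which change moved the score.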

The quote that stuck with me: "If you only have evals that succeed, you don't know what problems there are." Ankur deliberately keeps evals that fail in his suite. When a new model drops, he reruns the failures first. Something interesting happens every time.
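
One way to do "failures first" is to sort cases by their last recorded score. The sketch below is an assumption on my part: `order_for_rerun`, the `history` mapping, and the pass threshold are hypothetical, not something shown in the episode.

```python
def order_for_rerun(
    history: dict[str, float],
    case_ids: list[str],
    pass_threshold: float = 1.0,
) -> list[str]:
    """Order cases so previous failures run first, lowest score first.

    Cases with no recorded score are treated as failures with score 0.0.
    """
    def sort_key(case_id: str) -> tuple[bool, float]:
        last_score = history.get(case_id, 0.0)
        # False sorts before True, so cases below the threshold come first.
        return (last_score >= pass_threshold, last_score)

    # sorted() is stable, so ties keep their original order.
    return sorted(case_ids, key=sort_key)


# Made-up scores from a previous run; "d" has never been run.
history = {"a": 1.0, "b": 0.0, "c": 0.5}
queue = order_for_rerun(history, ["a", "b", "c", "d"])
```

Here `queue` puts the zero-score and never-run cases at the front, so a new model's effect on known problems shows up before anything else.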

He also made a point I think every PM shipping AI features needs to hear. The further you are from the end user, the more you need structured evals. Anthropic's Claude Code team can vibe check because the engineers are the users. A healthcare AI company can't. Evals bridge that distance.

🎬 Watch the full demo: youtu.be/71qvIkO9d_A

Spotify: open.spotify.com/episod…

Apple: podcasts.apple.com/us/p…

If you're shipping AI features and your evals are still just you clicking around the product, watch this one. Zero to 0.75 in 20 minutes from a blank screen.

Mar 22 at 1:37 AM