Benchmarks tell us what an AI agent did; only logs reveal how and why. Hence our new paper, a collaboration of many researchers at the forefront of the science of AI agent evaluation, across academia, government, and nonprofits. arxiv.org/pdf/2605.08545

May 13 at 10:09 AM