The capability split is the useful part here. Aggregate scores hide the real question: what can the agent do, with which tools, under which permissions, and where does failure become action rather than just a bad answer?
Jun 30 at 12:42 AM