Numbers are essential for evaluating AI systems.
But they come with a problem we don’t talk about enough: the tool we use to generate those numbers is biased.