Here's another great paper on creating realistic / modern evals for agents with a slightly different flavor:
Gaia2 from MSL houses its evals within realistic agent environments. It purposely injects noise, dynamic / evolving constraints, and even temporal constraints into the eval environment, so agents are forced to resolve ambiguity, collaborate with other agents, and so on. This is a great reference for how static LLM benchmarks may evolve going forward.
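To make those ideas concrete, here's a minimal toy sketch of what "noise + dynamic constraints + temporal constraints" can look like in an eval environment. All names here are hypothetical illustrations for intuition, not the actual Gaia2 / ARE API:

```python
import random
from dataclasses import dataclass, field

# Toy sketch of a dynamic eval environment (hypothetical, not the Gaia2 API):
# tool results are occasionally corrupted (noise), new requirements appear
# mid-episode (dynamic constraints), and the episode must finish by a
# deadline (temporal constraint).

@dataclass
class ToyDynamicEnv:
    clock: int = 0                      # simulated time steps
    deadline: int = 10                  # temporal constraint: finish by t=10
    noise_rate: float = 0.2             # chance a tool call returns garbage
    constraints: list = field(default_factory=lambda: ["reply in English"])

    def step(self, action: str) -> dict:
        self.clock += 1
        # Temporal constraint: acting after the deadline fails the episode.
        if self.clock > self.deadline:
            return {"status": "failed", "reason": "deadline exceeded"}
        # Dynamic constraint: a new requirement appears mid-episode, and the
        # agent is expected to notice and adapt.
        if self.clock == 3:
            self.constraints.append("cc the manager on all emails")
        # Noise: a corrupted observation forces the agent to detect the
        # inconsistency and retry or ask for clarification.
        if random.random() < self.noise_rate:
            return {"status": "ok", "observation": "<corrupted payload>"}
        return {
            "status": "ok",
            "observation": f"executed: {action}",
            "active_constraints": list(self.constraints),
        }


env = ToyDynamicEnv()
for t in range(5):
    print(env.step(f"tool_call_{t}"))
```

The point of structuring evals this way is that a static prompt/answer pair can't test any of these behaviors: scoring has to happen over a trajectory, not a single completion.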