“Benchmarkmaxxing” is becoming a major issue as LLMs grow a bit too agentic, which works well for complex tasks but not for simple ones