A new paper on evaluating AI in the real world made me realise that we need to square a circle.
The paper proposed CIRCLE: a six-stage lifecycle for evaluating AI from a real-world perspective. Contextualize. Identify. Represent. Compare. Learn. Extend.
Sound familiar?
It should. The bones are similar to most major AI governance frameworks. NIST AI RMF. MAS' AI Risk Management Guidelines (AIRG). Same lifecycle logic. AI and risk identification. Assessment. Testing. Monitoring. Review.
So what's actually novel?
I hadn't thought about it this way before. But changing the subject of observation changes what you see.
Existing frameworks, including AIRG, which I wrote, watch the organisation and its AI systems. Does the firm have governance? Are controls in place? Is the model performing within thresholds? The unit of observation is the institution.
CIRCLE watches the people.
I was reminded of this in an interview I did this week with Master's students from LKYSPP. We were discussing Agentic AI in AML. One point I kept coming back to: what a model does in its training context is not the same as what it does in a specific bank's workflow. Institutions must conduct their own contextualised testing using their own data.
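To make that concrete, here is a minimal sketch of what contextualised testing could look like. Everything in it is illustrative: synthetic data standing in for a vendor benchmark and for the bank's own alerts, a toy classifier standing in for the model. The shape of the exercise is the point, not the numbers.

```python
# Contextualised testing sketch: one model, two evaluation contexts.
# All data is synthetic; names are illustrative, not from the paper.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# The "vendor" context: data resembling what the model was built against.
X_train, y_train = make_classification(n_samples=2000, n_features=10,
                                       random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Stand-in for the bank's own workflow: same task, shifted distribution.
X_bank, y_bank = make_classification(n_samples=2000, n_features=10,
                                     shift=0.8, random_state=1)

for label, X, y in [("training context", X_train, y_train),
                    ("bank context", X_bank, y_bank)]:
    preds = model.predict(X)
    print(f"{label}: precision={precision_score(y, preds):.2f}, "
          f"recall={recall_score(y, preds):.2f}")
```

A gap between the two rows is the whole point: benchmark performance does not transfer automatically.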
That's a system-side observation. But the deeper version of the same point is human-side. What does that model do to the analyst running it? The compliance officer reviewing its outputs? The team whose judgment is slowly being offloaded to a chain of agents?
These effects don't show up in model metrics. Even a perfectly performing model can be quietly eroding the human capabilities around it.
I also said something in that interview that connects here. Agentic AI risks are largely old risks in new clothes: familiar cybersecurity, operational, and resilience risks, amplified. The same controls apply.
CIRCLE makes a similar argument. The lifecycle is familiar. What's new is turning human-behavioural concerns (over-reliance, cognitive offloading) into named constructs that can actually be measured alongside model performance metrics.
AIRG names this risk: skill degradation due to over-reliance on Generative AI. But the focus of measurement in most governance frameworks is almost entirely system-side. CIRCLE is the methodology for the other half.
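What would measuring that other half even look like? Here is a minimal sketch. The log schema and the over-reliance proxy are my assumptions for illustration, not the paper's actual instruments.

```python
# Human-side measurement sketch: an over-reliance proxy from review logs.
# The schema and the metric definition are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Review:
    model_correct: bool   # ground truth, known after investigation
    analyst_agreed: bool  # did the analyst accept the model's call?

def over_reliance_rate(reviews):
    """Share of model errors the analyst waved through anyway."""
    errors = [r for r in reviews if not r.model_correct]
    return sum(r.analyst_agreed for r in errors) / len(errors) if errors else 0.0

# Toy log of four reviewed alerts.
logs = [
    Review(model_correct=True,  analyst_agreed=True),
    Review(model_correct=False, analyst_agreed=True),
    Review(model_correct=False, analyst_agreed=True),
    Review(model_correct=False, analyst_agreed=False),
]

print(f"over-reliance proxy: {over_reliance_rate(logs):.2f}")  # 0.67
```

Tracked over time, a rising value signals judgment being offloaded even while headline model metrics stay flat.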
So perhaps that's the circle we need to square.
#AIRiskManagement #AIGovernance #GenAI #AIRG #RealWorldAI