This is good work, and the LoCoMo audit is genuinely useful. The 6.4% error rate, the 63% judge acceptance of wrong answers, and the statistical noise in small categories are real problems that make current benchmark results unreliable. The Mem0 prompt finding ("make sure you don't say no information is found") is particularly damning. Glad someone documented it systematically.
I agree with every one of the ten design principles. Standardized answer generation, balanced categories, human-verified ground truth, adversarially validated judges, and explicit abstention scoring are all necessary. The two-track approach (standard + open) sounds like a promising compromise between comparability and architectural freedom.
I want to raise an axis the proposal doesn't cover: write integrity.
All six proposed categories test whether a system can retrieve the correct answer from a static corpus. The corpus is ingested once. Questions are asked against it. The system never writes to its own memory in a way that could corrupt what was previously stored. This matches how every existing benchmark works, including the ones the proposal correctly criticizes.
In production, agents write state across sessions. Facts change. Corrections overwrite previous values. Summarization merges records. Two sessions write different values for the same field, and last write wins with no record that the earlier value ever existed. The failure mode this creates (memory corruption) passes every retrieval guardrail: the retrieval is correct and the citation is grounded, but the stored data is wrong.
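To make the failure mode concrete, here is a minimal sketch (not any real memory system's API; names are illustrative) contrasting a last-write-wins store with a versioned one. Both return the same "correct" answer on retrieval; only one can show the earlier value ever existed.

```python
from datetime import datetime, timezone

# Last-write-wins: each update silently replaces the prior value.
flat_store: dict[str, str] = {}

def lww_write(key: str, value: str) -> None:
    flat_store[key] = value  # previous value is gone, no trace it existed

# Versioned alternative: append every write with a source and timestamp,
# so earlier values remain recoverable.
versioned_store: dict[str, list[dict]] = {}

def versioned_write(key: str, value: str, source: str) -> None:
    versioned_store.setdefault(key, []).append({
        "value": value,
        "source": source,
        "written_at": datetime.now(timezone.utc),
    })

def current(key: str) -> str:
    return versioned_store[key][-1]["value"]

# Two sessions write conflicting values for the same field.
lww_write("preferred_language", "Python")
lww_write("preferred_language", "TypeScript")  # earlier value is lost

versioned_write("preferred_language", "Python", source="session-1")
versioned_write("preferred_language", "TypeScript", source="session-2")

# Both stores now answer "TypeScript", and any retrieval benchmark
# scores that as correct -- but only the versioned store can say
# when the value changed and which session changed it.
```

The point is that retrieval correctness is identical in both cases; the difference only shows up when you ask history and provenance questions.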
I wrote about this yesterday here: markmhendrickson.substa…. Hallucination and memory corruption are two distinct failure modes. Hallucination is model-level (the LLM generates content with no basis in its input) whereas corruption is infrastructure-level (the stored data changed). Most systems cannot distinguish between them because the diagnostic tooling doesn't exist. No widely used benchmark tests whether stored facts survive a week of agent writes unchanged.
Your Category 4 (Supersession and correction) gets closest. "What's my current stance on using TypeScript?" when the user changed their mind is a real question. But it tests one thing: does the system return the current value? It doesn't test whether the previous value is still accessible, whether the system can tell you when the change happened and what triggered it, or whether the change was a user correction versus silent drift from an agent write. These are the questions that matter when something goes wrong in production and you need to trace what happened.
I've been building the benchmark WRIT (Write Integrity Test) to cover this axis: github.com/markmhendric…. It tests drift rate, detectability, temporal replay, provenance, and update fidelity across multi-session scenarios with memory events that evolve over time. It uses an adapter interface so any memory system can plug in. The core idea is that a system can score 95%+ on retrieval benchmarks but fail catastrophically on write integrity if it overwrites values on update, loses history, or can't trace provenance.
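For readers curious what "any memory system can plug in" could look like, here is a hedged sketch of an adapter surface a write-integrity benchmark might require. These names are illustrative, not WRIT's actual interface; the key design point is that the benchmark needs `history` and provenance, not just `read`.

```python
from typing import Optional, Protocol

class MemoryAdapter(Protocol):
    """Hypothetical adapter surface for a write-integrity benchmark.
    Method names are illustrative, not WRIT's actual API."""

    def write(self, key: str, value: str, source: str) -> None: ...

    def read(self, key: str) -> Optional[str]: ...

    def history(self, key: str) -> list[tuple[str, str]]:
        """All (value, source) pairs ever written, oldest first --
        what temporal replay and provenance checks would query."""
        ...

# A trivial in-memory implementation that keeps full history.
class InMemoryAdapter:
    def __init__(self) -> None:
        self._log: dict[str, list[tuple[str, str]]] = {}

    def write(self, key: str, value: str, source: str) -> None:
        self._log.setdefault(key, []).append((value, source))

    def read(self, key: str) -> Optional[str]:
        entries = self._log.get(key)
        return entries[-1][0] if entries else None

    def history(self, key: str) -> list[tuple[str, str]]:
        return list(self._log.get(key, []))
```

A system that satisfies `read` but returns an empty `history` would ace retrieval checks and fail the write-integrity ones, which is exactly the gap being described.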
I think the two efforts are complementary, not competing. Your proposal fixes the measurement rigor of retrieval benchmarks. WRIT tests what retrieval benchmarks structurally cannot detect. Together they'd cover both axes: can the system find the right fact (your proposal) and is the fact it found still correct (WRIT).
A few specific places I see overlap worth exploring:
- Abstention scoring. Your concrete weights (correct = 1.0, IDK = 0.10, wrong = 0.0) are more specific than WRIT's current "Abstention Quality" metric. I'd like to adopt or adapt something similar.
- Ground-truth rigor. Your <1% error target and human verification pipeline are ahead of where WRIT is on the human-authored scenarios. I'd be interested in collaborating on shared verification methodology.
- Standardized answer generation. WRIT's adapter interface standardizes the memory side but leaves the answer generation side to each adapter. Your standard/open track distinction is a better answer to this problem.
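The abstention weights from the first bullet above can be expressed as a small scoring function (a sketch; the verdict labels are my own, only the weights come from the proposal):

```python
def score_answer(verdict: str) -> float:
    """Score one answer under explicit abstention weighting:
    correct = 1.0, honest "I don't know" = 0.10, wrong = 0.0.
    Abstaining beats guessing wrong, but is far worse than
    answering correctly, so systems can't farm points by refusing."""
    weights = {"correct": 1.0, "idk": 0.10, "wrong": 0.0}
    return weights[verdict]

def benchmark_score(verdicts: list[str]) -> float:
    """Mean score across a run's verdicts."""
    return sum(score_answer(v) for v in verdicts) / len(verdicts)
```

For example, 7 correct answers, 2 abstentions, and 1 wrong answer out of 10 would score (7 x 1.0 + 2 x 0.10 + 1 x 0.0) / 10 = 0.72, noticeably below a naive "ignore abstentions" accuracy of 7/8.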
I'd be happy to discuss further, either here or directly. The benchmark landscape needs both better measurement and broader measurement. I'm glad to see we're both working at this from different angles!