The app for independent voices

New study shows LLM-as-a-judge can be fooled by “Master Key” examples: simple token sequences like “Thought process:” or “Let’s solve this problem step by step” or even single tokens such as colon or period. It causes the judge model to classify the response as correct. This affects general-purpose LLMs (e.g., GPT-4o, Claude 4, o1, Qwen2.5) and specialized judge models (e.g., OmniJudge, General-Verifier).

How to solve it? Add a small batch of “master examples” to the dataset for training the judge model (the examples can be easily generated by truncating the model’s answers).

Master-RM, a model trained with this method, catches 100% of adversarial examples while keeping 96% of answer accordance with the best general-purpose models.

Jul 21
at
3:58 PM

Log in or sign up

Join the most interesting and insightful discussions.