Recently, I’ve done a ton of reading on LLM-as-a-judge techniques (i.e., using an LLM to evaluate the output of another LLM). Here’s a reference of the best papers in this space:
(1) Early research: Research on LLM evaluators began with the release of GPT-4, which was (arguably) the first LLM powerful enough to reliably evaluate output quality. Around this time, several works explored the use of LLMs as evaluators:
- Sparks of AGI [1]: This paper broadly studies the behavior of GPT-4, finding that the model excels at nearly all tasks that were considered. As part of this analysis, authors use GPT-4 to evaluate the similarity of a model’s output to a reference output. This is the first work (as far as I know) that attempts to use GPT-4 as a judge.
- Open LLMs: After the release of LLaMA, a wave of imitation models followed. Many of them (Vicuna, LIMA, Guanaco, Tulu, Orca, and more) use LLMs to evaluate the quality of model outputs relative to ChatGPT.
- AlpacaEval [7]: In a similar timeframe, the AlpacaEval metric was proposed. AlpacaEval uses a fixed set of ~800 prompts and generates an output for each prompt with both a baseline model (GPT-4-Turbo) and the model being evaluated. Then, we prompt an LLM judge (GPT-4) to compare the quality of the two outputs for each prompt, allowing us to automatically compute a win rate.
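To make the win rate idea concrete, here is a minimal sketch of how pairwise verdicts could be turned into a single number. This is not the AlpacaEval implementation: the `judge` callable, its return labels ("candidate", "baseline", "tie"), and the choice to count a tie as half a win are all assumptions made for illustration, and `length_judge` is a toy stand-in for a real LLM call.

```python
from typing import Callable, Sequence

# Hypothetical judge signature (an assumption, not a real API):
# (prompt, baseline_output, candidate_output) -> "baseline" | "candidate" | "tie"
Judge = Callable[[str, str, str], str]


def win_rate(
    prompts: Sequence[str],
    baseline_outputs: Sequence[str],
    candidate_outputs: Sequence[str],
    judge: Judge,
) -> float:
    """Fraction of prompts on which the judge prefers the candidate model.

    Ties count as half a win (an illustrative convention).
    """
    if not (len(prompts) == len(baseline_outputs) == len(candidate_outputs)):
        raise ValueError("prompts and outputs must have the same length")
    if not prompts:
        raise ValueError("need at least one prompt")

    score = 0.0
    for prompt, baseline, candidate in zip(prompts, baseline_outputs, candidate_outputs):
        verdict = judge(prompt, baseline, candidate)
        if verdict == "candidate":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
    return score / len(prompts)


def length_judge(prompt: str, baseline: str, candidate: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer output."""
    if len(candidate) > len(baseline):
        return "candidate"
    if len(candidate) < len(baseline):
        return "baseline"
    return "tie"
```

In practice, `judge` would wrap a prompt to the judge model and parse its preference. The toy judge here simply rewards length, which is exactly the kind of shortcut a real judge should not take.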
(2) More formal analysis: After initial explorations of LLM-as-a-judge, researchers began to formalize these techniques and analyze them more deeply. Such work revealed that this technique is powerful but subject to interesting biases that are hard to detect.
- G-Eval [8] uses a chain-of-thought approach to evaluate output quality. First, the LLM is asked to output a set of steps for evaluating output on a particular task. Then, the LLM ingests this evaluation framework and executes the evaluation via a form-filling paradigm (i.e., just generating the score as an output).
- LLMs as an alternative to human evaluation [9]: The authors formally study the feasibility of using LLMs to replicate the human evaluation process, finding that the results of LLM evaluation are consistent with those of expert human evaluation for story generation and adversarial example generation tasks.
- LLM-as-a-judge [10]: Written by the creators of Vicuna, this paper formalizes the LLM-as-a-judge technique, proposing several setups for evaluating model outputs with an LLM. However, the authors also reveal several biases of LLM evaluators, including position bias, verbosity bias, self-enhancement bias, and limited reasoning capability.
- LLMs are not fair evaluators [11]: This paper studies bias within LLM evaluations, focusing upon position bias in particular. The authors find that altering the position of model outputs within the prompt used for the LLM judge can drastically change evaluation results, but we can mitigate this issue by randomizing the position of outputs within the prompt.
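As a rough illustration of the position randomization idea from the last bullet, the sketch below shuffles which output the judge sees first and then maps the positional verdict back to the original labels. It is a sketch under stated assumptions, not the method from [11]: the `judge` callable and its "first" / "second" return values are hypothetical, and `always_first` is a deliberately biased toy judge.

```python
import random
from typing import Callable

# Hypothetical judge signature (an assumption, not a real API):
# (prompt, first_output, second_output) -> "first" | "second"
PairJudge = Callable[[str, str, str], str]


def judge_with_random_order(
    prompt: str,
    output_a: str,
    output_b: str,
    judge: PairJudge,
    rng: random.Random,
) -> str:
    """Return "a" or "b", presenting the two outputs to the judge in random order."""
    swapped = rng.random() < 0.5
    first, second = (output_b, output_a) if swapped else (output_a, output_b)

    verdict = judge(prompt, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")

    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "b" if swapped else "a"
    return "a" if swapped else "b"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with extreme position bias: always prefers the first output."""
    return "first"
```

With `always_first` as the judge and a seeded `random.Random`, repeated calls split roughly evenly between "a" and "b", so the position preference no longer systematically favors one model. Randomization does not remove the bias from any single judgment; it only averages it out across many comparisons.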
(3) Specialized evaluators: Although this post focuses upon LLM-as-a-judge techniques, a large amount of research has also been published on the topic of training specialized LLMs for evaluation. The most popular example of this is the Prometheus series of models [12, 13]. However, several other examples exist, such as JudgeLM [14] or PandaLM [15].