I just published a writeup on using statistics to make LLM evals more reliable. Here are some basic statistics you can start using to measure the uncertainty of an evaluation result…
Find the full writeup here:
Evaluation scores. In an LLM evaluation, we run our LLM over a dataset of n questions, yielding a score for each question. The score for question i can be modeled as s_i = x_i + ϵ_i, where x_i is the expected score for that question and ϵ_i is a zero-mean random component that captures the variation we see from one run to the next.
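To make the score model concrete, here is a minimal sketch that assumes pass/fail scoring, so each score is a Bernoulli draw with success probability x_i. The value of x_i, the seed, and the helper name sample_score are hypothetical and chosen only for illustration.

```python
import random

def sample_score(x_i, rng):
    """Draw one pass/fail score whose expected value is x_i."""
    return 1 if rng.random() < x_i else 0

rng = random.Random(0)   # fixed seed so the sketch is repeatable
x_i = 0.8                # hypothetical expected score for a single question

# Re-run the same question many times: each draw is s_i = x_i + epsilon_i.
draws = [sample_score(x_i, rng) for _ in range(10_000)]
noise = [s - x_i for s in draws]        # epsilon_i for each draw
avg_noise = sum(noise) / len(noise)     # should be close to zero
```

Individual scores are always 0 or 1, but the noise term averages out to roughly zero across repeated runs, which is what "x_i is the expected score" means.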
Sample mean. The primary evaluation metric for our model is the average score on our evaluation dataset. Because this dataset is a finite sample of n questions, the average we compute is a sample mean x̄. It is an estimate of the “true” performance of our model on the task of interest, not the true value itself.
Standard error. If we repeatedly drew a fresh set of n questions and recomputed the sample mean, we would get a slightly different result each time. The distribution of these results is called the sampling distribution of the mean, and its standard deviation is called the standard error. The standard error measures the variability of our mean score (or evaluation result). If the questions in our evaluation set are independent and identically distributed (IID), we can estimate the standard error with the following expression:
SE = std / sqrt(n)
Where std is the standard deviation of evaluation scores (can be estimated with a sample standard deviation) and n is the number of evaluation questions.
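The expression above can be sketched in a few lines of Python using only the standard library. The scores list is made-up example data, and the function name standard_error is an assumption for illustration.

```python
import math
import statistics

def standard_error(scores):
    """Estimate the standard error of the mean score, assuming IID questions."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard deviation")
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(scores) / math.sqrt(n)

# Hypothetical per-question scores (1 = correct, 0 = incorrect).
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
se = standard_error(scores)
print(f"mean = {statistics.fmean(scores):.3f}, SE = {se:.3f}")
```

Because n sits under a square root, halving the standard error requires roughly four times as many questions.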
Confidence intervals. Once we have estimated the standard error, we can use it to quantify the uncertainty of our evaluation result. Assuming the sample mean is approximately normally distributed (reasonable when n is large), a 95% confidence interval has the following form:
x̄ ± 1.96 × SE
This confidence interval indicates that if we repeated the sampling procedure many times and recomputed the confidence interval each time, about 95% of the resulting intervals would contain the true mean score.
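The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: scores are pass/fail, the true mean of 0.7, the dataset size of 200, and the 1,000 simulated evals are all hypothetical, and the function names are chosen for illustration.

```python
import math
import random
import statistics

def confidence_interval(scores, z=1.96):
    """Return (mean, lower, upper) for a z-based interval around the mean score."""
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * se, mean + z * se

def coverage(true_mean=0.7, n=200, trials=1000, seed=0):
    """Fraction of simulated evals whose 95% interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each simulated eval draws n fresh pass/fail scores.
        scores = [1 if rng.random() < true_mean else 0 for _ in range(n)]
        _, lower, upper = confidence_interval(scores)
        hits += lower <= true_mean <= upper
    return hits / trials

rate = coverage()
print(f"intervals containing the true mean: {rate:.1%}")
```

The printed rate should land near 95%, though not exactly on it, since both the simulation and the normal approximation are inexact.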
Application to LLM evaluations. If we just report a mean evaluation score (a standard approach for LLMs), then we fail to capture the uncertainty of this score. As a result, it is hard to know whether an evaluation result is legitimate or just caused by noise.
Instead of just reporting a mean evaluation score with no notion of uncertainty, we can compute the standard error for this score as described above, as well as generate a confidence interval. Then, we can report the standard error, a confidence interval, and the number of questions n alongside our score to compare models in an uncertainty-aware manner.
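One possible way to put this into practice is a small helper that reports all four quantities together. The model names and scores below are fabricated purely for illustration, and summarize and format_report are hypothetical helper names, not part of any library.

```python
import math
import statistics

def summarize(scores, z=1.96):
    """Return n, mean, standard error, and a z-based confidence interval."""
    n = len(scores)
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)
    return {"n": n, "mean": mean, "se": se, "ci": (mean - z * se, mean + z * se)}

def format_report(name, scores):
    """Format one model's result with its uncertainty attached."""
    s = summarize(scores)
    lower, upper = s["ci"]
    return (f"{name}: mean={s['mean']:.3f}, SE={s['se']:.3f}, "
            f"95% CI=[{lower:.3f}, {upper:.3f}], n={s['n']}")

# Hypothetical pass/fail results for two models on the same 100 questions.
model_a = [1] * 70 + [0] * 30
model_b = [1] * 75 + [0] * 25

print(format_report("model_a", model_a))
print(format_report("model_b", model_b))
```

In this made-up example the two intervals overlap heavily, so the five-point gap in mean score is not, by itself, strong evidence that one model is better.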