I just published a blog post covering 30+ popular LLM evals / benchmarks and how they are created. Here are the common themes for success…
(1) Domain Taxonomy. Most popular LLM benchmarks categorize their data into a fixed set of domains / sub-domains. This makes it easy to granularly debug the LLM’s performance and naturally ensures that the benchmark is diverse. There are countless examples of this (e.g., MMLU, BIG-Bench, and GPQA to name a few).
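To make the debugging benefit concrete, here is a minimal sketch of per-domain score aggregation. The function name `accuracy_by_domain`, the `(domain, subdomain, is_correct)` record format, and the toy results are all assumptions for illustration; none of them come from a specific benchmark.

```python
from collections import defaultdict


def accuracy_by_domain(results):
    """Aggregate accuracy per domain and per domain/sub-domain.

    results: iterable of (domain, subdomain, is_correct) tuples
             (a hypothetical record format).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for domain, subdomain, is_correct in results:
        # Count each result once at the domain level and once at the sub-domain level.
        for key in (domain, f"{domain}/{subdomain}"):
            totals[key][0] += int(is_correct)
            totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


# Toy results, made up for illustration only.
results = [
    ("math", "algebra", True),
    ("math", "algebra", False),
    ("math", "geometry", True),
    ("physics", "mechanics", False),
]
report = accuracy_by_domain(results)
for key in sorted(report):
    print(f"{key}: {report[key]:.2f}")
```

A breakdown like this shows at a glance which sub-domain is dragging down an aggregate score, which a single overall accuracy number would hide.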
(2) Human annotation. Despite the prevalence of synthetic data within LLM research, nearly all successful evaluation benchmarks rely on human experts to annotate data in some way. Some benchmarks begin with questions written by human experts (e.g., FrontierMath), while others leverage human opinions to measure question difficulty or accuracy (e.g., GPQA). Even when synthetic data is being used, human verification is helpful (e.g., IFEval and IFBench).
(3) Model-in-the-loop. Humans play a huge role in the evaluation process, but augmenting their efforts with an LLM can be beneficial. For example, LLMs are often used for difficulty filtering by identifying the questions that they get wrong. Usually, we do this with a group of several LLMs to avoid bias (e.g., BIG-Bench Extra Hard). Trends in model performance also allow us to fit IRT models (e.g., FluidBenchmarking), identify less informative subsets of data (e.g., correctness-based clustering in tinyBenchmarks), or find mistakes to provide to humans for further review (e.g., DatBench filtering pipeline).
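As a rough sketch of difficulty filtering with a panel of models, we can keep only the questions that most of the panel gets wrong. The function `filter_hard_questions`, the `max_solve_rate` threshold, the exact-match scoring, and the toy predictions are all assumptions for illustration; real pipelines (e.g., the one used for BIG-Bench Extra Hard) may differ in the details.

```python
def filter_hard_questions(predictions, answers, max_solve_rate=0.5):
    """Keep questions solved by at most `max_solve_rate` of the panel models.

    predictions: {model_name: {question_id: predicted_answer}}
    answers:     {question_id: reference_answer}
    Both formats are hypothetical; scoring is simple exact match.
    """
    kept = []
    for qid, reference in answers.items():
        # A missing prediction counts as incorrect.
        num_correct = sum(
            preds.get(qid) == reference for preds in predictions.values()
        )
        solve_rate = num_correct / len(predictions)
        if solve_rate <= max_solve_rate:
            kept.append(qid)
    return kept


# Toy data, made up for illustration only.
answers = {"q1": "A", "q2": "B", "q3": "C"}
predictions = {
    "model_a": {"q1": "A", "q2": "B", "q3": "D"},
    "model_b": {"q1": "A", "q2": "C", "q3": "D"},
    "model_c": {"q1": "A", "q2": "C", "q3": "C"},
}
hard = filter_hard_questions(predictions, answers, max_solve_rate=0.34)
print(hard)  # q1 is dropped because every panel model solves it
```

Using a solve rate across several models, rather than a single model's mistakes, is what keeps the filtered set from being biased toward one model's blind spots.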
(4) Data quality. The best evaluation benchmarks tend to pull from high-quality data sources. For example, popular math benchmarks include questions taken directly from recognized math competitions (e.g., AIME / AMC), and reasoning benchmarks in the BIG-Bench family draw on vetted material, such as established datasets (as in BIG-Bench Extra Hard) or questions that are extensively verified through human review (as in the original BIG-Bench).
Questions written manually by human experts are another commonly used source of evaluation data, but we must implement measures to ensure their quality. The GPQA curation pipeline is a great example of an effective system for ensuring both data quality and difficulty.
(5) Realistic. Benchmarks are an imperfect proxy for what we actually care about: the true capabilities of an LLM. To narrow that gap, we must make our evaluation data as realistic as possible. For example, CursorBench pulls evaluation data from real coding sessions in Cursor and constantly releases new benchmark versions to better capture recent trends in agent usage.
(6) Evolution. The capabilities of frontier-level LLMs are advancing rapidly, which can lead to benchmark saturation. To remain relevant, a good benchmark must evolve (and improve) over time. One of the best examples of this trend is BIG-Bench, which was already saturated less than a year after its initial release but has since been followed by multiple new versions.
Math benchmarks have followed a similar trajectory, with early benchmarks (e.g., GSM8K and MATH) becoming saturated and much harder benchmarks (e.g., FrontierMath, OmniMath, and RealMath) being released over time. Some recent math benchmarks even evolve on their own by automatically pulling new problems from papers / forums.
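One simple way to watch for saturation is to check whether the best models are clustered near the top of the scale. The sketch below is only a heuristic under stated assumptions: the function `is_saturated`, the `ceiling` of 0.9, the choice of averaging the top three scores, and the example scores are all hypothetical and not taken from any real leaderboard.

```python
def is_saturated(scores, ceiling=0.9, top_k=3):
    """Return True when the mean of the top_k scores reaches `ceiling`.

    scores: accuracies in [0, 1], one per evaluated model.
    The ceiling and top_k defaults are arbitrary illustrative choices.
    """
    if not scores:
        return False
    top = sorted(scores, reverse=True)[:top_k]
    return sum(top) / len(top) >= ceiling


# Hypothetical scores, made up for illustration only.
older_benchmark_scores = [0.97, 0.96, 0.95, 0.80]
newer_benchmark_scores = [0.25, 0.12, 0.08]

print(is_saturated(older_benchmark_scores))  # top models are near the ceiling
print(is_saturated(newer_benchmark_scores))  # plenty of headroom left
```

Once a check like this starts firing, the benchmark no longer separates frontier models well, which is the signal to release a harder version.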
Link to post: