New blog post coming out tomorrow morning on LLM benchmarking.
The best way to understand how LLM benchmarks are created, and how to build a useful benchmark for our own task of interest, is to study the details of the most popular and effective LLM benchmarks.
This post will study the following properties of a wide variety of benchmarks:
1. How the data is sourced
2. How data quality is ensured
3. How model performance is measured
4. How each benchmark has evolved as models have improved
Although many LLM benchmarks exist, the most successful ones share a set of common properties that can be adopted as best practices:
- Creating a domain taxonomy so that the benchmark is structured and guaranteed to be diverse.
- Leveraging human expertise (for sourcing data, verification, and more).
- Using a model-in-the-loop approach to make data collection more efficient and ensure difficulty.
- Putting strict data quality checks in place.
- Making sure the benchmark is realistic and matches real-world usage of the LLM.
- Evolving the benchmark over time to enhance difficulty and capture new dimensions of performance.
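
To make a few of these practices concrete, here is a minimal Python sketch of how a taxonomy, basic quality checks, and a model-in-the-loop difficulty filter might fit together. Everything in it is an assumption made for illustration: the three-domain taxonomy, the `Item` fields, exact-match answer comparison, and the `toy_model` stand-in for a real LLM call are hypothetical and not taken from any specific benchmark.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    domain: str    # leaf of the (hypothetical) domain taxonomy
    question: str
    answer: str


# Hypothetical taxonomy; a real benchmark would define its own domains.
TAXONOMY = {"math", "coding", "law"}


def passes_quality_checks(item, seen_questions):
    """Basic checks: known domain, non-empty fields, no duplicate question."""
    normalized = " ".join(item.question.lower().split())
    if item.domain not in TAXONOMY:
        return False
    if not normalized or not item.answer.strip():
        return False
    if normalized in seen_questions:
        return False
    seen_questions.add(normalized)
    return True


def build_benchmark(candidates, reference_model):
    """Keep items that pass the checks and that the reference model gets wrong."""
    seen = set()
    kept = []
    for item in candidates:
        if not passes_quality_checks(item, seen):
            continue
        # Model-in-the-loop: drop items the reference model already answers correctly.
        if reference_model(item.question).strip() == item.answer.strip():
            continue
        kept.append(item)
    return kept


def coverage(items):
    """Count items per taxonomy domain, including domains with no items."""
    counts = Counter(item.domain for item in items)
    return {domain: counts.get(domain, 0) for domain in sorted(TAXONOMY)}


def toy_model(question):
    # Stand-in for a real LLM call: it only knows one answer.
    return "4" if question == "What is 2 + 2?" else "unknown"


candidates = [
    Item("math", "What is 2 + 2?", "4"),                          # too easy
    Item("math", "What is 17 * 23?", "391"),                      # kept
    Item("coding", "What does len([]) return in Python?", "0"),   # kept
    Item("coding", "what does  len([]) return in python?", "0"),  # duplicate
    Item("cooking", "How long to boil an egg?", "10 minutes"),    # not in taxonomy
]

benchmark = build_benchmark(candidates, toy_model)
print(coverage(benchmark))  # {'coding': 1, 'law': 0, 'math': 1}
```

The per-domain counts make coverage gaps visible (here, `law` has no items), which is one simple way a taxonomy helps keep a benchmark diverse. In practice the filter would likely use several reference models and a more forgiving answer comparison than exact string match.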