LLMEval: A Preliminary Study on How to Evaluate Large Language Models (AAAI 2024)
Published at AAAI 2024, this paper is the foundational work of the LLMEval series. While most evaluation research focuses on *what* tasks and *what* knowledge to test, this paper systematically addresses the often-overlooked third question: *how* to evaluate.
The Three Questions of LLM Evaluation
1. What to evaluate? — What tasks should LLMs be tested on?
2. Where to evaluate? — What domains and knowledge areas?
3. How to evaluate? — What standards, evaluators, scoring methods, and ranking systems?
This paper focuses squarely on question 3.
Methodology
The paper designs a comprehensive experimental framework comparing:
Evaluation Criteria
- Five assessment dimensions: Correctness, Fluency, Informativeness, Logic, and Harmlessness
- Item-based scoring vs. pairwise comparison
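To make the two annotation formats concrete, here is a minimal Python sketch of an item-based rating over the five dimensions and a pairwise head-to-head judgment. The data structures and the 1–3 scale are illustrative assumptions, not the paper's actual annotation tooling.

```python
from dataclasses import dataclass

# Dimension names follow the paper; the structures and 1-3 scale below are
# illustrative assumptions, not the authors' annotation tooling.
DIMENSIONS = ["correctness", "fluency", "informativeness", "logic", "harmlessness"]

@dataclass
class ItemScore:
    """Item-based scoring: one response rated on each dimension (hypothetical 1-3 scale)."""
    model: str
    scores: dict  # dimension name -> integer rating

@dataclass
class PairwiseJudgment:
    """Pairwise comparison: two responses to the same question, one preferred (or a tie)."""
    model_a: str
    model_b: str
    winner: str  # "a", "b", or "tie"

def item_total(item: ItemScore) -> int:
    """Aggregate an item-based annotation by summing its dimension ratings."""
    return sum(item.scores[d] for d in DIMENSIONS)
```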
Annotator Types
- Onsite staff — trained professional annotators
- Crowd-sourcing workers — paid task workers
- Public annotators — volunteer participants (2,186 individuals)
- GPT-4 — automated evaluation baseline
Scoring and Ranking Methods
- Absolute scoring vs. relative comparison
- Different aggregation and ranking algorithms
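The sketch below contrasts two ranking routes under these options: sorting models by mean absolute score versus an Elo-style rating built from pairwise wins. The K-factor and initial rating are standard Elo defaults chosen for illustration; the paper's exact aggregation settings may differ.

```python
from collections import defaultdict

def rank_by_mean_score(scores):
    """scores: list of (model, absolute_score); rank models by their average score."""
    totals, counts = defaultdict(float), defaultdict(int)
    for model, s in scores:
        totals[model] += s
        counts[model] += 1
    return sorted(totals, key=lambda m: totals[m] / counts[m], reverse=True)

def rank_by_elo(judgments, k=32, initial=1000.0):
    """judgments: list of (model_a, model_b, winner), winner in {'a', 'b', 'tie'}."""
    rating = defaultdict(lambda: initial)
    for a, b, winner in judgments:
        # Expected win probability for model a, then a standard Elo update.
        expected_a = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        rating[a] += k * (score_a - expected_a)
        rating[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return sorted(rating, key=rating.get, reverse=True)
```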
The LLMEval Dataset
The paper introduces the LLMEval dataset, collected through two major evaluation rounds:
LLMEval-1 (Round 1)
- 17 major categories, 453 questions
- Covers: factual QA, reading comprehension, framework generation, paragraph rewriting, summarization, mathematical problem-solving, reasoning, poetry generation, programming, and more
- 2,186 public participants, 243,337 manual annotations
LLMEval-2 (Round 2)
- 12 academic disciplines, 480 questions
- Professional domain evaluation with both objective and subjective questions
- Focus on tasks where students seek LLM assistance in their studies
- 57,511 GPT-4 automated evaluations
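For orientation, a single annotation record in such a dataset might look roughly like the following. The field names and values are hypothetical; the actual schema is documented in the LLMEval GitHub repositories linked below.

```python
# Hypothetical shape of one annotation record; field names are illustrative,
# not the released schema (see the LLMEval GitHub repositories for the real format).
example_record = {
    "round": "LLMEval-1",               # LLMEval-1 or LLMEval-2
    "category": "mathematical problem-solving",
    "question": "...",
    "model": "...",                      # model that produced the answer
    "answer": "...",
    "annotator_type": "public",          # onsite / crowd-sourced / public / gpt-4
    "scores": {"correctness": 3, "fluency": 3, "informativeness": 2,
               "logic": 3, "harmlessness": 3},
}
```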
10 Key Conclusions
The paper draws 10 empirically grounded conclusions about LLM evaluation methodology, covering:
- Agreement between human and automated evaluators (see the agreement sketch below)
- Impact of criteria selection on model rankings
- Reliability of different annotator types
- Trade-offs between scoring approaches
These findings laid the groundwork for LLMEval-Fair and LLMEval-Med.
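On the first of these points, agreement between two evaluators is commonly quantified with Cohen's kappa over paired labels. The function below is an illustrative implementation of that measure, not necessarily the exact statistic reported in the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (illustrative)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # degenerate case: both annotators always pick one label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Example: two annotators judging the winner of pairwise comparisons.
print(cohens_kappa(["a", "b", "tie", "a"], ["a", "b", "a", "a"]))  # ~0.56
```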
Resources
- Paper: arXiv:2312.07398
- AAAI Proceedings: DOI:10.1609/aaai.v38i17.29934
- LLMEval-1 Data: GitHub
- LLMEval-2 Data: GitHub