LLMEval: A Preliminary Study on How to Evaluate Large Language Models (AAAI 2024)
Published at AAAI 2024, this paper is the foundational work of the LLMEval series. While most evaluation research focuses on *what* tasks and *what* knowledge to test, this paper systematically addresses the often-overlooked third question: *how* to evaluate.
The Three Questions of LLM Evaluation
1. What to evaluate? — What tasks should LLMs be tested on?
2. Where to evaluate? — What domains and knowledge areas?
3. How to evaluate? — What standards, evaluators, scoring methods, and ranking systems?
This paper focuses squarely on question 3.
Methodology
The paper designs a comprehensive experimental framework comparing:
Evaluation Criteria
- Five assessment dimensions: Correctness, Fluency, Informativeness, Logic, and Harmlessness
- Item-based scoring vs. pairwise comparison
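To make the two annotation formats concrete, here is a minimal Python sketch of an item-based rating over the five dimensions and a pairwise head-to-head judgment. The data structures and the 1–3 scale are illustrative assumptions, not the paper's actual annotation tooling.

```python
from dataclasses import dataclass

# Dimension names follow the paper; the structures and 1-3 scale below are
# illustrative assumptions, not the authors' annotation tooling.
DIMENSIONS = ["correctness", "fluency", "informativeness", "logic", "harmlessness"]

@dataclass
class ItemScore:
    """Item-based scoring: one response rated on each dimension (hypothetical 1-3 scale)."""
    model: str
    scores: dict  # dimension name -> integer rating

@dataclass
class PairwiseJudgment:
    """Pairwise comparison: two responses to the same question, one preferred (or a tie)."""
    model_a: str
    model_b: str
    winner: str  # "a", "b", or "tie"

def item_total(item: ItemScore) -> int:
    """Aggregate an item-based annotation by summing its dimension ratings."""
    return sum(item.scores[d] for d in DIMENSIONS)
```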
Annotator Types
- Onsite staff — trained professional annotators
- Crowd-sourcing workers — paid task workers
- Public annotators — volunteer participants (2,186 individuals)
- GPT-4 — automated evaluation baseline
Scoring and Ranking Methods
- Absolute scoring vs. relative comparison
- Different aggregation and ranking algorithms
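The sketch below contrasts two ranking routes under these options: sorting models by mean absolute score versus an Elo-style rating built from pairwise wins. The K-factor and initial rating are standard Elo defaults chosen for illustration; the paper's exact aggregation settings may differ.

```python
from collections import defaultdict

def rank_by_mean_score(scores):
    """scores: list of (model, absolute_score); rank models by their average score."""
    totals, counts = defaultdict(float), defaultdict(int)
    for model, s in scores:
        totals[model] += s
        counts[model] += 1
    return sorted(totals, key=lambda m: totals[m] / counts[m], reverse=True)

def rank_by_elo(judgments, k=32, initial=1000.0):
    """judgments: list of (model_a, model_b, winner), winner in {'a', 'b', 'tie'}."""
    rating = defaultdict(lambda: initial)
    for a, b, winner in judgments:
        # Expected win probability for model a, then a standard Elo update.
        expected_a = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        rating[a] += k * (score_a - expected_a)
        rating[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return sorted(rating, key=rating.get, reverse=True)
```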
The LLMEval Dataset
The paper introduces the LLMEval dataset, collected through two major evaluation rounds:
LLMEval-1 (Round 1)
- 17 major categories, 453 questions
- Covers: factual QA, reading comprehension, framework generation, paragraph rewriting, summarization, mathematical problem-solving, reasoning, poetry generation, programming, and more
- 2,186 public participants, 243,337 manual annotations
LLMEval-2 (Round 2)
- 12 academic disciplines, 480 questions
- Professional domain evaluation with both objective and subjective questions
- Focus on tasks where students seek LLM assistance in their studies
- 57,511 GPT-4 automated evaluations
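For orientation, a single annotation record in such a dataset might look roughly like the following. The field names and values are hypothetical; the actual schema is documented in the LLMEval GitHub repositories linked below.

```python
# Hypothetical shape of one annotation record; field names are illustrative,
# not the released schema (see the LLMEval GitHub repositories for the real format).
example_record = {
    "round": "LLMEval-1",               # LLMEval-1 or LLMEval-2
    "category": "mathematical problem-solving",
    "question": "...",
    "model": "...",                      # model that produced the answer
    "answer": "...",
    "annotator_type": "public",          # onsite / crowd-sourced / public / gpt-4
    "scores": {"correctness": 3, "fluency": 3, "informativeness": 2,
               "logic": 3, "harmlessness": 3},
}
```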
10 Key Conclusions
The paper draws 10 empirically grounded conclusions about LLM evaluation methodology, covering:
- Agreement between human and automated evaluators (see the agreement sketch below)
- Impact of criteria selection on model rankings
- Reliability of different annotator types
- Trade-offs between scoring approaches
These findings laid the groundwork for LLMEval-Fair and LLMEval-Med.
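On the first of these points, agreement between two evaluators is commonly quantified with Cohen's kappa over paired labels. The function below is an illustrative implementation of that measure, not necessarily the exact statistic reported in the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (illustrative)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # degenerate case: both annotators always pick one label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Example: two annotators judging the winner of pairwise comparisons.
print(cohens_kappa(["a", "b", "tie", "a"], ["a", "b", "a", "a"]))  # ~0.56
```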
Resources
- Paper: arXiv:2312.07398
- AAAI Proceedings: DOI:10.1609/aaai.v38i17.29934
- LLMEval-1 Data: GitHub
- LLMEval-2 Data: GitHub