Leaderboard

Model performance rankings across all LLMEval benchmarks.

220K generative questions, 13 academic disciplines, and ~60 models evaluated over a 30-month longitudinal study.

| # | Model | Type | Organization | Relative | Absolute | Eng. | Econ. | Law | Sci. | Med. | Mgmt. |
|---|-------|------|--------------|----------|----------|------|-------|-----|------|------|-------|
| 1 | Doubao-1.5-Thinking-Pro | Closed | ByteDance | 100.00 | 93.67 | 9.47 | 9.67 | 9.77 | 9.23 | 8.97 | 9.53 |
| 2 | DeepSeek-R1 | Open | DeepSeek | 97.40 | 91.23 | 9.47 | 9.43 | 9.37 | 9.03 | 8.50 | 9.37 |
| 3 | Gemini-2.5-Pro | Closed | Google | 97.22 | 91.07 | 9.20 | 9.47 | 9.30 | 9.07 | 8.50 | 9.63 |
| 4 | Gemini-2.5-Pro-Thinking | Closed | Google | 97.15 | 91.00 | 9.13 | 9.50 | 9.47 | 9.20 | 8.30 | 9.63 |
| 5 | DeepSeek-V3 | Open | DeepSeek | 96.47 | 90.36 | 9.30 | 9.57 | 9.23 | 8.97 | 8.83 | 9.13 |
| 6 | Qwen3-235B | Open | Alibaba Cloud | 96.42 | 90.32 | 9.23 | 9.43 | 9.50 | 8.97 | 8.73 | 9.43 |
| 7 | Doubao-1.5-Pro | Closed | ByteDance | 95.68 | 89.62 | 8.83 | 9.03 | 9.43 | 8.83 | 8.60 | 9.27 |
| 8 | GLM-4.6 | Closed | Zhipu AI | 95.26 | 89.23 | 8.80 | 9.27 | 9.23 | 8.90 | 8.43 | 9.63 |
| 9 | QwQ-32B | Open | Alibaba Cloud | 94.51 | 88.53 | 8.30 | 9.46 | 9.33 | 8.65 | 8.57 | 9.46 |
| 10 | Kimi-K2 | Closed | Moonshot AI | 94.27 | 88.30 | 9.23 | 9.17 | 9.00 | 8.77 | 8.53 | 9.17 |
| 11 | GPT-5 | Closed | OpenAI | 93.84 | 87.90 | 8.83 | 9.37 | 8.87 | 8.90 | 8.50 | 9.10 |
| 12 | Claude-Sonnet-4.5-Thinking | Closed | Anthropic | 93.48 | 87.57 | 8.90 | 9.17 | 8.97 | 8.90 | 8.27 | 9.23 |
| 13 | o1 | Closed | OpenAI | 93.36 | 87.45 | 8.90 | 9.30 | 8.77 | 8.90 | 8.17 | 9.27 |
| 14 | Claude-Sonnet-4.5 | Closed | Anthropic | 93.31 | 87.40 | 8.80 | 8.97 | 8.73 | 8.97 | 8.13 | 9.10 |
| 15 | Gemini-2.5-Flash-Thinking | Closed | Google | 92.74 | 86.87 | 8.67 | 9.27 | 9.00 | 8.90 | 8.03 | 8.93 |
| 16 | DeepSeek-V3.2 | Open | DeepSeek | 92.27 | 86.43 | 8.73 | 9.13 | 8.70 | 8.87 | 8.53 | 9.33 |
| 17 | Qwen3-32B | Open | Alibaba Cloud | 92.22 | 86.38 | 8.43 | 9.10 | 9.10 | 8.67 | 7.70 | 9.47 |
| 18 | Claude-Sonnet-4-Thinking | Closed | Anthropic | 91.03 | 85.27 | 8.57 | 9.00 | 8.73 | 8.93 | 7.97 | 9.10 |
| 19 | Claude-Sonnet-4 | Closed | Anthropic | 91.00 | 85.24 | 8.57 | 8.80 | 8.70 | 8.80 | 8.17 | 9.03 |
| 20 | GPT-4o-Search | Closed | OpenAI | 89.40 | 83.74 | 8.27 | 8.77 | 8.67 | 8.20 | 8.27 | 8.80 |

Absolute scores are on a 0–100 scale; discipline scores are on a 0–10 scale. Relative scores use the SOTA model (Doubao-1.5-Thinking-Pro) as the 100% baseline. All 59 models are listed in the full appendix of LLMEval-Fair (ACL 2026).

About the Evaluations

LLMEval-Fair (ACL 2026): 220,000 graduate-level generative questions across 13 disciplines. Each model answers 1,000 randomly sampled questions. The absolute score (0–100) reflects raw performance; the relative score measures performance relative to the SOTA model. Discipline scores are reported on a 0–10 scale. GPT-4 Turbo scores each response against a 0–3 rubric.
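As an illustration of how the two headline numbers relate, here is a minimal sketch that recomputes a relative score from an absolute score, assuming the ratio-to-SOTA definition described above; the function name is illustrative and not part of the LLMEval codebase.

```python
# Minimal sketch (assumption: relative score = absolute score expressed as a
# percentage of the SOTA model's absolute score, per the description above).

def relative_score(absolute: float, sota_absolute: float) -> float:
    """Relative score (0-100) for a model with the given absolute score (0-100)."""
    return 100.0 * absolute / sota_absolute

# Example from the table: DeepSeek-R1 (91.23) vs. Doubao-1.5-Thinking-Pro (93.67)
print(round(relative_score(91.23, 93.67), 2))  # 97.4, i.e. 97.40 in the table
```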

LLMEval-Med (EMNLP 2025): 2,996 questions from real electronic health records and expert-designed clinical scenarios, covering five dimensions (MK, MLU, MR, MSE, MTG). The usability rate (%) is the proportion of responses scoring 4 or higher on the automated 0–5 dimensions, or 5 or higher on the human-evaluated 0–7 MTG dimension. Human-machine agreement: 92.36%.
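The usability rate described above is a simple thresholded proportion; the sketch below shows that computation under the stated thresholds (4+ on the automated 0–5 dimensions, 5+ on the human-evaluated 0–7 MTG dimension). The function name and sample scores are illustrative only.

```python
# Minimal sketch of the usability rate: the share of responses at or above a
# dimension's usability threshold (4 on the 0-5 automated scale, 5 on the
# 0-7 human-evaluated MTG scale, per the description above).

def usability_rate(scores: list[int], threshold: int) -> float:
    """Usability rate (%) for one dimension."""
    if not scores:
        return 0.0
    usable = sum(1 for s in scores if s >= threshold)
    return 100.0 * usable / len(scores)

# Hypothetical example: five automated-dimension scores on a 0-5 scale
print(usability_rate([5, 4, 3, 5, 2], threshold=4))  # 60.0
```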

LLMEval-1 (AAAI 2024): 17 categories, 453 questions, and five dimensions (correctness, fluency, informativeness, logic, harmlessness), each rated on a 0–3 scale. 2,186 annotators contributed 243,337 annotations. A pairwise comparison score (0–1) is also provided.

LLMEval-2 (AAAI 2024): 12 disciplines, 480 questions, with objective items (max 5 pts) and subjective items (max 14 pts); the total is normalized to 0–100.

Want to submit your model for evaluation? Contact us at mingzhang23@m.fudan.edu.cn.