Leaderboard
Model performance rankings across all LLMEval benchmarks.
220K generative questions, 13 academic disciplines, and 59 models over a 30-month longitudinal study.
| # | Model | Access | Organization | Relative | Absolute | Eng. | Econ. | Law | Sci. | Med. | Mgmt. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Doubao-1.5-Thinking-Pro | Closed | ByteDance | 100.00 | 93.67 | 9.47 | 9.67 | 9.77 | 9.23 | 8.97 | 9.53 |
| 2 | DeepSeek-R1 | Open | DeepSeek | 97.40 | 91.23 | 9.47 | 9.43 | 9.37 | 9.03 | 8.50 | 9.37 |
| 3 | Gemini-2.5-Pro | Closed | Google | 97.22 | 91.07 | 9.20 | 9.47 | 9.30 | 9.07 | 8.50 | 9.63 |
| 4 | Gemini-2.5-Pro-Thinking | Closed | Google | 97.15 | 91.00 | 9.13 | 9.50 | 9.47 | 9.20 | 8.30 | 9.63 |
| 5 | DeepSeek-V3 | Open | DeepSeek | 96.47 | 90.36 | 9.30 | 9.57 | 9.23 | 8.97 | 8.83 | 9.13 |
| 6 | Qwen3-235B | Open | Alibaba Cloud | 96.42 | 90.32 | 9.23 | 9.43 | 9.50 | 8.97 | 8.73 | 9.43 |
| 7 | Doubao-1.5-Pro | Closed | ByteDance | 95.68 | 89.62 | 8.83 | 9.03 | 9.43 | 8.83 | 8.60 | 9.27 |
| 8 | GLM-4.6 | Closed | Zhipu AI | 95.26 | 89.23 | 8.80 | 9.27 | 9.23 | 8.90 | 8.43 | 9.63 |
| 9 | QwQ-32B | Open | Alibaba Cloud | 94.51 | 88.53 | 8.30 | 9.46 | 9.33 | 8.65 | 8.57 | 9.46 |
| 10 | Kimi-K2 | Closed | Moonshot AI | 94.27 | 88.30 | 9.23 | 9.17 | 9.00 | 8.77 | 8.53 | 9.17 |
| 11 | GPT-5 | Closed | OpenAI | 93.84 | 87.90 | 8.83 | 9.37 | 8.87 | 8.90 | 8.50 | 9.10 |
| 12 | Claude-Sonnet-4.5-Thinking | Closed | Anthropic | 93.48 | 87.57 | 8.90 | 9.17 | 8.97 | 8.90 | 8.27 | 9.23 |
| 13 | o1 | Closed | OpenAI | 93.36 | 87.45 | 8.90 | 9.30 | 8.77 | 8.90 | 8.17 | 9.27 |
| 14 | Claude-Sonnet-4.5 | Closed | Anthropic | 93.31 | 87.40 | 8.80 | 8.97 | 8.73 | 8.97 | 8.13 | 9.10 |
| 15 | Gemini-2.5-Flash-Thinking | Closed | Google | 92.74 | 86.87 | 8.67 | 9.27 | 9.00 | 8.90 | 8.03 | 8.93 |
| 16 | DeepSeek-V3.2 | Open | DeepSeek | 92.27 | 86.43 | 8.73 | 9.13 | 8.70 | 8.87 | 8.53 | 9.33 |
| 17 | Qwen3-32B | Open | Alibaba Cloud | 92.22 | 86.38 | 8.43 | 9.10 | 9.10 | 8.67 | 7.70 | 9.47 |
| 18 | Claude-Sonnet-4-Thinking | Closed | Anthropic | 91.03 | 85.27 | 8.57 | 9.00 | 8.73 | 8.93 | 7.97 | 9.10 |
| 19 | Claude-Sonnet-4 | Closed | Anthropic | 91.00 | 85.24 | 8.57 | 8.80 | 8.70 | 8.80 | 8.17 | 9.03 |
| 20 | GPT-4o-Search | Closed | OpenAI | 89.40 | 83.74 | 8.27 | 8.77 | 8.67 | 8.20 | 8.27 | 8.80 |
Models are ranked by absolute score. Absolute scores are on a 0–100 scale; discipline scores on a 0–10 scale. Relative scores use the SOTA model (Doubao-1.5-Thinking-Pro) as the 100% baseline. The table shows the top 20 of the 59 models evaluated; the full list appears in the appendix of LLMEval-Fair (ACL 2026).
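The Relative column is just each model's absolute score expressed as a percentage of the SOTA baseline; a minimal sketch of the conversion (the function name is ours):

```python
def relative_score(absolute: float, sota_absolute: float = 93.67) -> float:
    """Express an absolute score as a percentage of the SOTA model's score.

    93.67 is Doubao-1.5-Thinking-Pro's absolute score, the table's baseline.
    """
    return round(100 * absolute / sota_absolute, 2)

# DeepSeek-R1: 91.23 absolute -> 97.40 relative, matching row 2 of the table.
assert relative_score(91.23) == 97.40
```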
About the Evaluations
LLMEval-Fair (ACL 2026): 220,000 graduate-level generative questions across 13 disciplines. Each model answers 1,000 randomly sampled questions. The absolute score (0–100) reflects raw performance; the relative score measures the gap to the SOTA model. Discipline scores are on a 0–10 scale. GPT-4 Turbo judges each question against a 0–3 rubric.
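How the judge's 0–3 rubric scores map to the 0–100 absolute score is not spelled out above; the sketch below takes the simplest reading, a per-question mean rescaled linearly, which is our assumption rather than a formula confirmed by the paper:

```python
def absolute_score(rubric_scores: list[int]) -> float:
    """Assumed aggregation: average the judge's 0-3 rubric scores over the
    1,000 sampled questions, then rescale linearly to 0-100."""
    assert all(0 <= s <= 3 for s in rubric_scores)
    return round(100 * sum(rubric_scores) / (3 * len(rubric_scores)), 2)

# e.g., 800 answers judged 3 and 200 judged 2 -> 93.33
print(absolute_score([3] * 800 + [2] * 200))
```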
LLMEval-Med (EMNLP 2025): 2,996 questions drawn from real electronic health records and expert-designed clinical scenarios, covering 5 dimensions (MK, MLU, MR, MSE, MTG). Usability rate (%) is the proportion of responses scoring 4 or higher (automated dimensions, 0–5 scale) or 5 or higher (MTG, human-evaluated, 0–7 scale). Human-machine agreement: 92.36%.
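A minimal sketch of the usability-rate computation as defined above (the function name and flat-list input are ours):

```python
def usability_rate(scores: list[float], human_evaluated_mtg: bool = False) -> float:
    """Percentage of responses meeting the usability threshold:
    >= 4 on the automated 0-5 scale, or >= 5 on the human-evaluated
    0-7 scale used for MTG."""
    threshold = 5 if human_evaluated_mtg else 4
    usable = sum(1 for s in scores if s >= threshold)
    return round(100 * usable / len(scores), 2)

print(usability_rate([5, 4, 3, 5, 2]))                          # 60.0 (automated)
print(usability_rate([6, 5, 4, 7], human_evaluated_mtg=True))   # 75.0 (MTG)
```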
LLMEval-1 (AAAI 2024): 17 categories and 453 questions, rated on five dimensions (correctness, fluency, informativeness, logic, harmlessness), each on a 0–3 scale. 2,186 annotators contributed 243,337 annotations. Pairwise comparison scores (0–1) are also provided.
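The exact roll-up of the 243,337 crowd annotations into per-dimension scores is not detailed here; one plausible aggregation, a straight per-dimension average, is sketched below (an assumption on our part):

```python
from statistics import mean

DIMENSIONS = ("correctness", "fluency", "informativeness", "logic", "harmlessness")

def dimension_means(annotations: list[dict[str, int]]) -> dict[str, float]:
    """Assumed aggregation: average each dimension's 0-3 ratings across all
    annotators who rated the question, with no annotator weighting."""
    return {d: mean(a[d] for a in annotations) for d in DIMENSIONS}
```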
LLMEval-2 (AAAI 2024): 12 disciplines and 480 questions, combining objective items (max 5 points) and subjective items (max 14 points); totals are normalized to a 0–100 scale.
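Assuming the normalization is a linear rescaling of the 19-point maximum (5 objective + 14 subjective), the conversion would look like this sketch (a hypothetical helper, not from the paper):

```python
def llmeval2_normalized(objective_pts: float, subjective_pts: float) -> float:
    """Assumed normalization: rescale the combined score (max 5 + 14 = 19)
    linearly onto 0-100."""
    return round(100 * (objective_pts + subjective_pts) / 19, 2)

print(llmeval2_normalized(4.5, 12.0))  # 86.84
```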
Want to submit your model for evaluation? Contact us at mingzhang23@m.fudan.edu.cn.