Leaderboard

Model performance rankings across all LLMEval benchmarks.

220K generative questions, 13 academic disciplines, ~60 models over a 30-month longitudinal study.

#	Model	Access	Organization	Relative	Absolute	Eng.	Econ.	Law	Sci.	Med.	Mgmt.
1	Doubao-1.5-Thinking-Pro	Closed	ByteDance	100.00	93.67	9.47	9.67	9.77	9.23	8.97	9.53
2	DeepSeek-R1	Open	DeepSeek	97.40	91.23	9.47	9.43	9.37	9.03	8.50	9.37
3	Gemini-2.5-Pro	Closed	Google	97.22	91.07	9.20	9.47	9.30	9.07	8.50	9.63
4	Gemini-2.5-Pro-Thinking	Closed	Google	97.15	91.00	9.13	9.50	9.47	9.20	8.30	9.63
5	DeepSeek-V3	Open	DeepSeek	96.47	90.36	9.30	9.57	9.23	8.97	8.83	9.13
6	Qwen3-235B	Open	Alibaba Cloud	96.42	90.32	9.23	9.43	9.50	8.97	8.73	9.43
7	Doubao-1.5-Pro	Closed	ByteDance	95.68	89.62	8.83	9.03	9.43	8.83	8.60	9.27
8	GLM-4.6	Closed	Zhipu AI	95.26	89.23	8.80	9.27	9.23	8.90	8.43	9.63
9	QwQ-32B	Open	Alibaba Cloud	94.51	88.53	8.30	9.46	9.33	8.65	8.57	9.46
10	Kimi-K2	Closed	Moonshot AI	94.27	88.30	9.23	9.17	9.00	8.77	8.53	9.17
11	GPT-5	Closed	OpenAI	93.84	87.90	8.83	9.37	8.87	8.90	8.50	9.10
12	Claude-Sonnet-4.5-Thinking	Closed	Anthropic	93.48	87.57	8.90	9.17	8.97	8.90	8.27	9.23
13	o1	Closed	OpenAI	93.36	87.45	8.90	9.30	8.77	8.90	8.17	9.27
14	Claude-Sonnet-4.5	Closed	Anthropic	93.31	87.40	8.80	8.97	8.73	8.97	8.13	9.10
15	Gemini-2.5-Flash-Thinking	Closed	Google	92.74	86.87	8.67	9.27	9.00	8.90	8.03	8.93
16	DeepSeek-V3.2	Open	DeepSeek	92.27	86.43	8.73	9.13	8.70	8.87	8.53	9.33
17	Qwen3-32B	Open	Alibaba Cloud	92.22	86.38	8.43	9.10	9.10	8.67	7.70	9.47
18	Claude-Sonnet-4-Thinking	Closed	Anthropic	91.03	85.27	8.57	9.00	8.73	8.93	7.97	9.10
19	Claude-Sonnet-4	Closed	Anthropic	91.00	85.24	8.57	8.80	8.70	8.80	8.17	9.03
20	GPT-4o-Search	Closed	OpenAI	89.40	83.74	8.27	8.77	8.67	8.20	8.27	8.80

Absolute scores are on a 0–100 scale; discipline scores are on a 0–10 scale. Relative scores use the SOTA model (Doubao-1.5-Thinking-Pro) as the 100% baseline. The data covers 59 models from the full appendix of LLMEval-Fair (ACL 2026).
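As a rough illustration of the 100% baseline, the sketch below assumes the relative score is simply the absolute score divided by the top model's absolute score, times 100. That formula is an assumption, not something the page states, though it appears consistent with the rows used here. The function name `relative_score` is hypothetical.

```python
def relative_score(absolute: float, sota_absolute: float) -> float:
    """Express an absolute score as a percentage of the SOTA model's absolute score.

    Assumed formula: 100 * absolute / sota_absolute, rounded to two decimals.
    """
    if sota_absolute <= 0:
        raise ValueError("sota_absolute must be positive")
    return round(100.0 * absolute / sota_absolute, 2)


# Absolute scores copied from three rows of the leaderboard table.
absolute_scores = {
    "Doubao-1.5-Thinking-Pro": 93.67,
    "DeepSeek-R1": 91.23,
    "GPT-4o-Search": 83.74,
}

# The SOTA baseline is the highest absolute score.
sota = max(absolute_scores.values())
relative = {name: relative_score(score, sota) for name, score in absolute_scores.items()}
```

Under this assumption, DeepSeek-R1 and GPT-4o-Search come out at roughly 97.40 and 89.40, matching the Relative column for those rows.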

About the Evaluations

LLMEval-Fair (ACL 2026): 220,000 graduate-level generative questions across 13 disciplines. Each model answers 1,000 randomly sampled questions. GPT-4 Turbo judges each question on a 0-3 rubric. The absolute score (0-100) reflects raw performance; the relative score expresses it against the SOTA model, which serves as the 100% baseline. Discipline scores are on a 10-point scale.

LLMEval-Med (EMNLP 2025): 2,996 questions drawn from real electronic health records and expert-designed clinical scenarios, covering 5 dimensions (MK, MLU, MR, MSE, MTG). Usability rate (%) is the proportion of responses scoring 4 or higher on the automated 0-5 scale, or 5 or higher on the human-evaluated 0-7 scale used for MTG. Human-machine agreement: 92.36%.
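The usability rate described above can be sketched as a simple threshold count. The function name `usability_rate` and the score lists are hypothetical and for illustration only; they are not LLMEval-Med data or code.

```python
def usability_rate(scores: list[float], threshold: float) -> float:
    """Return the percentage of responses whose score is at or above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    usable = sum(1 for score in scores if score >= threshold)
    return 100.0 * usable / len(scores)


# Hypothetical automated scores on the 0-5 scale; usable means 4 or higher.
automated_scores = [5, 4, 3, 4, 2]
automated_rate = usability_rate(automated_scores, threshold=4)

# Hypothetical human scores for MTG on the 0-7 scale; usable means 5 or higher.
mtg_scores = [7, 5, 4, 6]
mtg_rate = usability_rate(mtg_scores, threshold=5)
```

With these made-up inputs, three of five automated responses and three of four MTG responses clear their thresholds.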

LLMEval-1 (AAAI 2024): 453 questions in 17 categories, rated on five dimensions (correctness, fluency, informativeness, logic, harmlessness) using a 0-3 scale. 2,186 annotators contributed 243,337 annotations. Pairwise comparison results (0-1) are also provided.

LLMEval-2 (AAAI 2024): 480 questions across 12 disciplines, split into objective questions (max 5 pts) and subjective questions (max 14 pts). The total is normalized to 0-100.
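The page does not say how the LLMEval-2 total is normalized. One plausible reading, shown here purely as an assumption, is a linear rescale of the combined objective and subjective points against their combined maximum of 19. The function name `normalized_total` is hypothetical.

```python
MAX_OBJECTIVE = 5.0    # maximum objective points, as stated on the page
MAX_SUBJECTIVE = 14.0  # maximum subjective points, as stated on the page


def normalized_total(objective: float, subjective: float) -> float:
    """Rescale combined points to 0-100 (assumed linear normalization)."""
    if not 0 <= objective <= MAX_OBJECTIVE:
        raise ValueError("objective score out of range")
    if not 0 <= subjective <= MAX_SUBJECTIVE:
        raise ValueError("subjective score out of range")
    return 100.0 * (objective + subjective) / (MAX_OBJECTIVE + MAX_SUBJECTIVE)


# Hypothetical scores, not taken from the benchmark.
example_total = normalized_total(objective=4.0, subjective=10.5)
```

If the benchmark instead normalizes each part separately before combining them, the result would differ; the paper is the authority on the actual procedure.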

Want to submit your model for evaluation? Contact us at mingzhang23@m.fudan.edu.cn.