Leaderboard

Model performance rankings across all LLMEval benchmarks.

LLMEval-Logic: 246 Base and 190 Hard items (938 sub-questions), all Z3-audited logical-reasoning problems, evaluated on 14 frontier LLM configurations from 7 families. Scores are means over 3 runs.

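| # | Model | Mode | Type | Base | Hard | Hard Sub-Q | Form. (Free) | Form. (Fixed) |
|---|-------|------|------|------|------|------------|--------------|---------------|
| 1 | Gemini 3.1 Pro | Thinking | Proprietary | 74.0 | 37.5 | 76.0 | 45.1 | 58.1 |
| 2 | Claude Opus 4.6 | Thinking | Proprietary | 68.7 | 36.7 | 76.6 | 38.6 | 56.9 |
| 3 | Claude Opus 4.6 | No-Think | Proprietary | 69.0 | 36.0 | 74.4 | 43.5 | 50.0 |
| 4 | Gemini 3.1 Pro (low-think) | No-Think | Proprietary | 73.3 | 33.5 | 71.3 | 44.3 | 54.5 |
| 5 | GPT-5.4 Pro | No-Think | Proprietary | 71.8 | 33.0 | 70.0 | 32.9 | 60.2 |
| 6 | GPT-5.4 Pro | Thinking | Proprietary | 71.8 | 32.6 | 68.4 | 30.1 | 58.9 |
| 7 | Qwen 3.5 Plus | Thinking | Open-weight | 71.3 | 29.3 | 70.0 | 36.2 | 52.4 |
| 8 | Kimi K2.5 | Thinking | Open-weight | 72.9 | 28.1 | 68.4 | 36.2 | 59.4 |
| 9 | Hy3 preview | Thinking | Open-weight | 75.3 | 21.6 | 62.7 | 27.6 | 51.2 |
| 10 | Seed 2.0 Pro | Thinking | Proprietary | 75.5 | 20.4 | 63.3 | 35.8 | 56.9 |
| 11 | Seed 2.0 Pro | No-Think | Proprietary | 56.2 | 6.1 | 39.4 | 30.9 | 47.6 |
| 12 | Qwen 3.5 Plus | No-Think | Open-weight | 42.0 | 3.0 | 33.7 | 25.2 | 35.8 |
| 13 | Kimi K2.5 | No-Think | Open-weight | 37.9 | 1.8 | 27.0 | 27.2 | 38.6 |
| 14 | Hy3 preview | No-Think | Open-weight | 51.4 | 0.9 | 27.4 | 21.9 | 24.8 |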

Accuracy (%) on LLMEval-Logic across 14 frontier LLM configurations from 7 families, mean over 3 independent runs. Base = Item Acc. on the 246-item Base subset. Hard = Item Acc. (all sub-questions must match) on the 190-item adversarially hardened subset. Hard Sub-Q = per-sub-question accuracy (938 sub-questions). Form. = joint Z3 + rubric formalization accuracy under free / fixed symbol settings. Judge: gpt-5.1-chat (validated against Claude Opus 4.6 / Gemini 3.1 Pro, κ ∈ [0.873, 0.922]). Data from LLMEval-Logic (arXiv 2026).

About the Evaluations

LLMEval-Logic (arXiv 2026): Forward-authored Chinese logical reasoning benchmark, Z3-audited with expert rubrics. Base (246 items, single-question PL/FOL) reports Item Accuracy. Hard (190 items / 938 sub-questions, adversarially hardened) reports Item Accuracy (all sub-questions must match) and Sub-Q Accuracy. Each Base item is also scored at the formalization level under free / fixed symbol settings via a joint Z3 + rubric check. Judge: gpt-5.1-chat. Results are the mean over 3 independent runs.
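The difference between the two Hard metrics can be illustrated with a small sketch. It assumes per-sub-question correctness flags are already available from the judge. The function name `hard_scores`, the input layout, and the toy data are hypothetical and are not taken from the LLMEval-Logic code.

```python
from typing import Dict, List, Tuple


def hard_scores(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (item accuracy, sub-question accuracy) in percent.

    `results` maps a hypothetical item id to the correctness flags of its
    sub-questions. An item counts as correct only if every flag is True.
    """
    n_items = len(results)
    n_subq = sum(len(flags) for flags in results.values())
    item_correct = sum(all(flags) for flags in results.values())
    subq_correct = sum(sum(flags) for flags in results.values())
    return 100.0 * item_correct / n_items, 100.0 * subq_correct / n_subq


# Toy data (made up for illustration): 3 items, 9 sub-questions in total.
toy = {
    "item-1": [True, True, True],         # fully correct item
    "item-2": [True, False, True, True],  # one miss fails the whole item
    "item-3": [False, True],
}
item_acc, subq_acc = hard_scores(toy)
print(f"Item Acc: {item_acc:.1f}%  Sub-Q Acc: {subq_acc:.1f}%")
```

In this toy case 7 of 9 sub-questions are correct but only 1 of 3 items is, which is the same pattern behind the gap between the Hard and Hard Sub-Q columns.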

LLMEval-Fair (ACL 2026): 220,000 graduate-level generative questions across 13 disciplines. Each model answers 1,000 randomly sampled questions. The absolute score (0-100) reflects raw performance; the relative score measures the gap to the SOTA model. Discipline scores are on a 10-point scale. GPT-4 Turbo judges each question on a 0-3 rubric.

LLMEval-Med (EMNLP 2025): 2,996 questions from real electronic health records and expert-designed clinical scenarios, across 5 dimensions (MK, MLU, MR, MSE, MTG). Usability rate (%) = proportion of responses scoring 4+ (automated, 0-5) or 5+ (MTG, human-evaluated, 0-7). Human-machine agreement: 92.36%.
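As a minimal sketch of the usability-rate definition above: the thresholds (4 on the 0-5 automated scale, 5 on the 0-7 MTG scale) come from the description, while the function name `usability_rate` and the score lists are hypothetical examples.

```python
from typing import Sequence


def usability_rate(scores: Sequence[int], threshold: int) -> float:
    """Percentage of responses whose score is at or above `threshold`."""
    if not scores:
        raise ValueError("scores must not be empty")
    return 100.0 * sum(s >= threshold for s in scores) / len(scores)


# Made-up scores for illustration only.
automated_scores = [5, 4, 3, 2, 4]  # automated judge, 0-5 scale
mtg_scores = [7, 5, 4, 6]           # human-evaluated MTG, 0-7 scale

automated_rate = usability_rate(automated_scores, threshold=4)
mtg_rate = usability_rate(mtg_scores, threshold=5)
print(automated_rate, mtg_rate)
```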

LLMEval-1 (AAAI 2024): 17 categories, 453 questions, scored on five dimensions (correctness, fluency, informativeness, logic, harmlessness), each on a 0-3 scale. 2,186 annotators contributed 243,337 annotations. Pairwise comparison scores (0-1) are also provided.

LLMEval-2 (AAAI 2024): 12 disciplines, 480 questions, objective (max 5 pts) + subjective (max 14 pts). Total normalized to 0-100.
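The description does not spell out how the LLMEval-2 total is normalized. One plausible reading, shown here purely as an assumption, is a linear rescaling of the summed points against the 19-point maximum (5 objective + 14 subjective). The function name `normalized_total` and the example scores are hypothetical.

```python
OBJECTIVE_MAX = 5    # from the description: objective part, max 5 pts
SUBJECTIVE_MAX = 14  # from the description: subjective part, max 14 pts


def normalized_total(objective: float, subjective: float) -> float:
    """Rescale objective + subjective points to 0-100.

    ASSUMPTION: linear scaling by the combined maximum; the actual
    LLMEval-2 normalization may differ.
    """
    if not 0 <= objective <= OBJECTIVE_MAX:
        raise ValueError("objective score out of range")
    if not 0 <= subjective <= SUBJECTIVE_MAX:
        raise ValueError("subjective score out of range")
    return 100.0 * (objective + subjective) / (OBJECTIVE_MAX + SUBJECTIVE_MAX)


# Made-up scores for illustration: 4 of 5 objective, 10 of 14 subjective.
example = normalized_total(4, 10)
print(f"{example:.1f}")
```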

Want to submit your model for evaluation? Contact us at mingzhang23@m.fudan.edu.cn.