Leaderboard
Model performance rankings across all LLMEval benchmarks.
LLMEval-Logic: 246 Base and 190 Hard items (938 sub-questions), all Z3-audited logical-reasoning problems, evaluated on 14 frontier LLM configurations from 7 families. Scores are means over 3 runs.
| # | Model | Mode | Access | Base | Hard ↓ | Hard Sub-Q | Form. (Free) | Form. (Fixed) |
|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Thinking | Proprietary | 74.0 | 37.5 | 76.0 | 45.1 | 58.1 |
| 2 | Claude Opus 4.6 | Thinking | Proprietary | 68.7 | 36.7 | 76.6 | 38.6 | 56.9 |
| 3 | Claude Opus 4.6 | No-Think | Proprietary | 69.0 | 36.0 | 74.4 | 43.5 | 50.0 |
| 4 | Gemini 3.1 Pro (low-think) | No-Think | Proprietary | 73.3 | 33.5 | 71.3 | 44.3 | 54.5 |
| 5 | GPT-5.4 Pro | No-Think | Proprietary | 71.8 | 33.0 | 70.0 | 32.9 | 60.2 |
| 6 | GPT-5.4 Pro | Thinking | Proprietary | 71.8 | 32.6 | 68.4 | 30.1 | 58.9 |
| 7 | Qwen 3.5 Plus | Thinking | Open-weight | 71.3 | 29.3 | 70.0 | 36.2 | 52.4 |
| 8 | Kimi K2.5 | Thinking | Open-weight | 72.9 | 28.1 | 68.4 | 36.2 | 59.4 |
| 9 | Hy3 preview | Thinking | Open-weight | 75.3 | 21.6 | 62.7 | 27.6 | 51.2 |
| 10 | Seed 2.0 Pro | Thinking | Proprietary | 75.5 | 20.4 | 63.3 | 35.8 | 56.9 |
| 11 | Seed 2.0 Pro | No-Think | Proprietary | 56.2 | 6.1 | 39.4 | 30.9 | 47.6 |
| 12 | Qwen 3.5 Plus | No-Think | Open-weight | 42.0 | 3.0 | 33.7 | 25.2 | 35.8 |
| 13 | Kimi K2.5 | No-Think | Open-weight | 37.9 | 1.8 | 27.0 | 27.2 | 38.6 |
| 14 | Hy3 preview | No-Think | Open-weight | 51.4 | 0.9 | 27.4 | 21.9 | 24.8 |
Accuracy (%) on LLMEval-Logic across 14 frontier LLM configurations from 7 families, mean over 3 independent runs. Base = Item Acc. on the 246-item Base subset. Hard = Item Acc. (all sub-questions must match) on the 190-item adversarially hardened subset. Hard Sub-Q = per-sub-question accuracy (938 sub-questions). Form. = joint Z3 + rubric formalization accuracy under free / fixed symbol settings. Judge: gpt-5.1-chat (validated against Claude Opus 4.6 / Gemini 3.1 Pro, κ ∈ [0.873, 0.922]). Data from LLMEval-Logic (arXiv 2026).
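The gap between the Hard and Hard Sub-Q columns follows from how the two metrics are defined in the caption: an item counts only if every one of its sub-questions is correct, while sub-question accuracy pools all sub-questions. The Python sketch below illustrates that difference on made-up data. The function names, the `results` layout (item ID mapped to a list of per-sub-question correctness flags), and the toy values are assumptions for illustration, not the benchmark's actual scoring code.

```python
from typing import Dict, List


def item_accuracy(results: Dict[str, List[bool]]) -> float:
    """Percentage of items whose sub-questions are ALL correct."""
    if not results:
        return 0.0
    # An item with no sub-questions is treated as unsolved (assumption).
    solved = sum(1 for flags in results.values() if flags and all(flags))
    return 100.0 * solved / len(results)


def sub_question_accuracy(results: Dict[str, List[bool]]) -> float:
    """Percentage of individual sub-questions answered correctly."""
    total = sum(len(flags) for flags in results.values())
    if total == 0:
        return 0.0
    correct = sum(sum(flags) for flags in results.values())
    return 100.0 * correct / total


# Hypothetical results for three Hard-style items (not real benchmark data).
toy = {
    "item-1": [True, True, True, True],          # fully solved
    "item-2": [True, False, True, True],         # one miss fails the item
    "item-3": [True, True, False, False, True],  # two misses
}

print(f"Item accuracy:  {item_accuracy(toy):.1f}%")
print(f"Sub-Q accuracy: {sub_question_accuracy(toy):.1f}%")
```

In this toy case 10 of 13 sub-questions are correct, yet only 1 of 3 items is fully solved. The same all-or-nothing rule explains why a model can score far higher on Hard Sub-Q than on Hard.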
About the Evaluations
LLMEval-Logic (arXiv 2026): A forward-authored Chinese logical-reasoning benchmark, Z3-audited and scored with expert rubrics. Base (246 items, single-question PL/FOL) reports Item Accuracy; Hard (190 items / 938 sub-questions, adversarially hardened) reports Item Accuracy (all sub-questions must match) and Sub-Q Accuracy. Each Base item is also scored at the formalization level under free / fixed symbol settings via joint Z3 + rubric checks. Judge: gpt-5.1-chat; results are means over 3 independent runs.
LLMEval-Fair (ACL 2026): 220,000 graduate-level generative questions across 13 disciplines. Each model answers 1,000 randomly sampled questions. The absolute score (0-100) reflects raw performance; the relative score measures the gap to the SOTA model. Discipline scores are on a 10-point scale. GPT-4 Turbo judges each question on a 0-3 rubric.
LLMEval-Med (EMNLP 2025): 2,996 questions from real electronic health records and expert-designed clinical scenarios, across 5 dimensions (MK, MLU, MR, MSE, MTG). Usability rate (%) = proportion of responses scoring 4 or higher (automated, 0-5 scale) or 5 or higher (MTG, human-evaluated, 0-7 scale). Human-machine agreement: 92.36%.
LLMEval-1 (AAAI 2024): 17 categories and 453 questions, scored on five dimensions (correctness, fluency, informativeness, logic, harmlessness; 0-3 scale). 2,186 annotators contributed 243,337 annotations. Pairwise comparison scores (0-1) are also provided.
LLMEval-2 (AAAI 2024): 12 disciplines and 480 questions, with objective (max 5 pts) and subjective (max 14 pts) scoring. The total is normalized to 0-100.
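As a rough illustration of the usability rate defined for LLMEval-Med above, the sketch below computes the share of responses at or above a threshold: 4 on the automated 0-5 scale, 5 on the human-evaluated 0-7 MTG scale. The function name and the sample scores are hypothetical. How the benchmark aggregates rates across its five dimensions is not stated here, so the sketch handles one score list at a time.

```python
from typing import Sequence


def usability_rate(scores: Sequence[float], threshold: float) -> float:
    """Percentage of responses whose score is at or above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    usable = sum(1 for s in scores if s >= threshold)
    return 100.0 * usable / len(scores)


# Hypothetical scores, not real benchmark data.
automated_scores = [5, 4, 3, 4, 2]  # automated judge, 0-5 scale
mtg_scores = [7, 5, 4, 6]           # human-evaluated MTG, 0-7 scale

print(usability_rate(automated_scores, threshold=4))  # 3 of 5 responses usable
print(usability_rate(mtg_scores, threshold=5))        # 3 of 4 responses usable
```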
Want to submit your model for evaluation? Contact us at mingzhang23@m.fudan.edu.cn.