LLMEval-Fair Accepted to ACL 2026 Main Conference
We are thrilled to announce that "LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models" has been accepted to the main conference of ACL 2026.
Why LLMEval-Fair?
Existing LLM benchmarks face a fundamental challenge: test set leakage. Public benchmarks are easily scraped during pre-training, enabling "leaderboard hacking" and inflated scores that do not reflect true model capability. LLMEval-Fair tackles this head-on with multiple innovations.
Core Design
Anti-Cheating Mechanisms
- Non-public question sources — Questions sourced from non-public channels (undergraduate homework, mid-term/final exams, graduate entrance exams in PDF/Word format) to minimize pre-training contamination
- Random sampling — Each model completes 1,000 questions randomly sampled from a proprietary bank of 220,000
- Non-repeating evaluations — Models from the same institution receive different question sets across evaluations
- Sequential online delivery — Questions sent one-at-a-time, preventing bulk crawling
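The sampling and non-repetition mechanisms above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the function name, the per-institution tracking dictionary, and the use of question-ID sets are all assumptions made for the example.

```python
import random

BANK_SIZE = 220_000   # size of the proprietary question bank
SAMPLE_SIZE = 1_000   # questions drawn per evaluation run

def sample_questions(bank_ids, served, institution, k=SAMPLE_SIZE, seed=None):
    """Draw k question IDs at random, excluding any this institution has seen.

    `served` maps institution name -> set of previously served question IDs,
    so models from the same institution never receive repeated questions.
    """
    rng = random.Random(seed)
    seen = served.get(institution, set())
    candidates = [q for q in bank_ids if q not in seen]
    picked = rng.sample(candidates, k)
    served.setdefault(institution, set()).update(picked)
    return picked

bank = list(range(BANK_SIZE))
served = {}
first = sample_questions(bank, served, "lab-a", seed=0)
second = sample_questions(bank, served, "lab-a", seed=1)
# Two runs for the same institution share no questions.
assert len(first) == SAMPLE_SIZE and not set(first) & set(second)
```

With a 220,000-question bank and 1,000 questions per run, each evaluation exposes under 0.5% of the bank, which is what makes bulk memorization of the test set impractical.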
Generative QA Format
Unlike the multiple-choice format used by most benchmarks, LLMEval-Fair uses generative question-answering exclusively. This includes short answer, calculation, true/false, analysis, and essay questions — a format that far better reflects how users actually interact with LLMs.
Dual Scoring System
- Absolute Score (0–100): Raw model performance across 1,000 questions, normalized from a 0–3 per-question rubric
- Relative Score: Performance relative to the current SOTA model (Doubao-1.5-Thinking-Pro as the 100% baseline)
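The two scores above reduce to simple arithmetic. Here is a minimal sketch, assuming the absolute score is the mean per-question rubric score (0–3) rescaled to 0–100, and the relative score is a ratio against the SOTA model's absolute score; the function names are illustrative, not from the paper.

```python
def absolute_score(rubric_scores):
    """Map per-question rubric scores (each 0-3) onto a 0-100 scale."""
    assert all(0 <= s <= 3 for s in rubric_scores)
    return 100.0 * sum(rubric_scores) / (3 * len(rubric_scores))

def relative_score(model_abs, sota_abs):
    """Score relative to the current SOTA model, which defines 100%."""
    return 100.0 * model_abs / sota_abs

# A model scoring the full 3 points on all 1,000 questions gets 100.
assert absolute_score([3] * 1000) == 100.0
# A model at 72 absolute, against a SOTA model at 90, scores 80 relative.
assert relative_score(72.0, 90.0) == 80.0
```

Pinning the relative score to the current SOTA model keeps rankings comparable across evaluation rounds even as the sampled question sets differ.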
Scope
The benchmark covers 13 academic disciplines as defined by the Ministry of Education: Philosophy, Economics, Law, Education, Literature, History, Science, Engineering, Agriculture, Medicine, Military Science, Management, and Arts — with 50+ sub-disciplines.
Current Leaderboard
As of December 2025, nearly 60 models from major organizations — including OpenAI, Google, Anthropic, DeepSeek, ByteDance, Alibaba, Zhipu AI, and Moonshot AI — have been evaluated. See the full rankings on our Leaderboard page.
- Paper: arXiv:2508.05452
- Code & Data: GitHub
- Evaluation Platform: llmeval.com