LLMEval-Fair Accepted to ACL 2026 Main Conference
We are thrilled to announce that "LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models" has been accepted to the main conference of ACL 2026.
Why LLMEval-Fair?
Existing LLM benchmarks face a fundamental challenge: test set leakage. Public benchmarks are easily scraped during pre-training, enabling "leaderboard hacking" and inflated scores that do not reflect true model capability. LLMEval-Fair tackles this head-on with multiple innovations.
Core Design
Anti-Cheating Mechanisms
- Non-public question sources — Questions sourced from non-public channels (undergraduate homework, mid-term/final exams, graduate entrance exams in PDF/Word format) to minimize pre-training contamination
- Random sampling — Each model completes 1,000 questions randomly sampled from a proprietary bank of 220,000
- Non-repeating evaluations — Models from the same institution receive different question sets across evaluations
- Sequential online delivery — Questions sent one-at-a-time, preventing bulk crawling
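The sampling and non-repetition mechanisms above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the function name, the per-institution tracking dictionary, and the use of question-ID sets are all assumptions made for the example.

```python
import random

BANK_SIZE = 220_000   # size of the proprietary question bank
SAMPLE_SIZE = 1_000   # questions drawn per evaluation run

def sample_questions(bank_ids, served, institution, k=SAMPLE_SIZE, seed=None):
    """Draw k question IDs at random, excluding any this institution has seen.

    `served` maps institution name -> set of previously served question IDs,
    so models from the same institution never receive repeated questions.
    """
    rng = random.Random(seed)
    seen = served.get(institution, set())
    candidates = [q for q in bank_ids if q not in seen]
    picked = rng.sample(candidates, k)
    served.setdefault(institution, set()).update(picked)
    return picked

bank = list(range(BANK_SIZE))
served = {}
first = sample_questions(bank, served, "lab-a", seed=0)
second = sample_questions(bank, served, "lab-a", seed=1)
# Two runs for the same institution share no questions.
assert len(first) == SAMPLE_SIZE and not set(first) & set(second)
```

With a 220,000-question bank and 1,000 questions per run, each evaluation exposes under 0.5% of the bank, which is what makes bulk memorization of the test set impractical.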
Generative QA Format
Unlike the multiple-choice format used by most benchmarks, LLMEval-Fair uses generative question-answering exclusively. This includes short answer, calculation, true/false, analysis, and essay questions — a format that far better reflects how users actually interact with LLMs.
Dual Scoring System
- Absolute Score (0–100): Raw model performance across 1,000 questions, normalized from a 0–3 per-question rubric
- Relative Score: Performance relative to the current SOTA model (Doubao-1.5-Thinking-Pro as the 100% baseline)
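The two scores above reduce to simple arithmetic. Here is a minimal sketch, assuming the absolute score is the mean per-question rubric score (0–3) rescaled to 0–100, and the relative score is a ratio against the SOTA model's absolute score; the function names are illustrative, not from the paper.

```python
def absolute_score(rubric_scores):
    """Map per-question rubric scores (each 0-3) onto a 0-100 scale."""
    assert all(0 <= s <= 3 for s in rubric_scores)
    return 100.0 * sum(rubric_scores) / (3 * len(rubric_scores))

def relative_score(model_abs, sota_abs):
    """Score relative to the current SOTA model, which defines 100%."""
    return 100.0 * model_abs / sota_abs

# A model scoring the full 3 points on all 1,000 questions gets 100.
assert absolute_score([3] * 1000) == 100.0
# A model at 72 absolute, against a SOTA model at 90, scores 80 relative.
assert relative_score(72.0, 90.0) == 80.0
```

Pinning the relative score to the current SOTA model keeps rankings comparable across evaluation rounds even as the sampled question sets differ.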
Scope
The benchmark covers 13 academic disciplines as defined by the Ministry of Education: Philosophy, Economics, Law, Education, Literature, History, Science, Engineering, Agriculture, Medicine, Military Science, Management, and Arts — with 50+ sub-disciplines.
Current Leaderboard
As of December 2025, nearly 60 models from major organizations — including OpenAI, Google, Anthropic, DeepSeek, ByteDance, Alibaba, Zhipu AI, and Moonshot AI — have been evaluated. See the full rankings on our Leaderboard page.
- Paper: arXiv:2508.05452
- Code & Data: GitHub
- Evaluation Platform: llmeval.com