Blog

Updates and announcements from the LLMEval team.

May 12, 2026·LLMEval Team

LLMEval-Logic: Solver-Verified Chinese Logic Reasoning Benchmark (80% Public Release)

LLMEval-Logic is a Chinese logical reasoning benchmark that is double-audited by the Z3 SMT solver and human rubrics, then toughened via an adversarial-hardening agent loop. We are releasing 80% of the items (197 Base, 152 Hard, and 197 rubrics); the remaining 20% is held out as a private, contamination-resistant test set maintained by Fudan NLP Lab.

LLMEval-Logic · logical reasoning · Z3 · contamination-resistant

April 10, 2026·LLMEval Team

LLMEval-Fair Accepted to ACL 2026 Main Conference

Our paper on robust and fair LLM evaluation has been accepted to ACL 2026. With 220K generative questions across 13 disciplines and anti-cheating mechanisms, LLMEval-Fair sets a new standard for trustworthy model benchmarking.

ACL 2026 · LLMEval-Fair

November 1, 2025·LLMEval Team

LLMEval-Med: Physician-Validated Clinical Benchmark (EMNLP 2025)

LLMEval-Med introduces a benchmark of 2,996 clinical questions built from real electronic health records, with physician-validated evaluation covering five core medical dimensions.

EMNLP 2025 · LLMEval-Med · medical AI

June 14, 2024·LLMEval Team

2024 Gaokao Math: LLM Evaluation Special Report

Using the freshly released 2024 Chinese Gaokao math papers, which are highly original and were kept confidential before release, we evaluate leading LLMs with both LaTeX and escape-character prompts.

Gaokao · mathematics

March 24, 2024·LLMEval Team

LLMEval: How to Evaluate Large Language Models (AAAI 2024)

Our foundational paper at AAAI 2024 systematically studies the 'how to evaluate' question, comparing evaluation criteria, annotator types, scoring methods, and ranking systems across 20 LLMs with 2,186 participants.

AAAI 2024 · LLMEval · methodology