All publications from the LLMEval research series.
Under submission · 2026
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
LLMEval Team, Fudan NLP Lab (anonymous authors during peer review)
LLMEval-Logic is a Chinese logical reasoning benchmark built through a three-stage audit pipeline: (a) annotators authored items forward from real-world stories rather than templating backward from formulas, (b) a hand-written rubric checklist together with the Z3 SMT solver double-audited every natural-language → first-order-logic translation, and (c) a closed-loop adversarial hardening agent workflow discarded items that turned out to be too easy. The dataset has two paired splits — LLMEval-Logic-Base (single-question PL & FOL items with Z3-verified answers, gold formalisations and atom-level NL→FL rubrics) and LLMEval-Logic-Hard (multi-question / sub-question items covering enumeration / counting / uniqueness / alternative-solution / counterfactual reasoning). Three independent runs of 14 frontier LLMs under thinking / no-thinking configurations show the strongest model reaches only 37.5% Item Accuracy on Hard, leaving substantial headroom for frontier reasoning research. Following the contamination-resistant tradition of LLMEval-Fair, only 80% of the corpus is released publicly; the remaining 20% is held out as a private contamination-resistant test set maintained by Fudan NLP Lab.
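As a rough illustration of the solver-audit step in (b), the sketch below checks a toy propositional item with the z3 Python bindings by asserting the gold premises and testing whether the negated conclusion is unsatisfiable. The item, atoms, and answers are invented for the example; this is not the actual audit pipeline.

```python
# Minimal sketch of solver-side verification for one benchmark item.
# Requires `pip install z3-solver`; the premises and conclusion are toy
# examples, not actual LLMEval-Logic items.
from z3 import Bool, Solver, Implies, Not, unsat

def entailed(premises, conclusion):
    """Return True iff the premises entail the conclusion,
    i.e. (premises AND NOT conclusion) is unsatisfiable."""
    s = Solver()
    s.add(*premises)
    s.add(Not(conclusion))
    return s.check() == unsat

# Toy item: "If it rains the ground is wet; it rains."
rain, wet = Bool("rain"), Bool("wet")
premises = [Implies(rain, wet), rain]

print(entailed(premises, wet))        # True  -> gold answer is solver-verified
print(entailed(premises, Not(wet)))   # False -> contradictory answer rejected
```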
LLMEval-Fair addresses robustness and fairness concerns in LLM evaluation through a 30-month longitudinal study. Built on a proprietary bank of 220,000 graduate-level questions across 13 academic disciplines, it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts. A study of nearly 60 leading models reveals performance ceilings and exposes data contamination vulnerabilities undetectable by static benchmarks.
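A minimal sketch of the per-run dynamic sampling idea is given below; the bank schema, stratification by discipline, and sample size are illustrative assumptions, not the actual LLMEval-Fair implementation.

```python
# Sketch of drawing a fresh, unseen test set from a private question bank
# for each evaluation run. Field names and sizes are assumptions.
import random

def sample_run(bank, per_discipline=20, seed=None):
    """Draw a stratified sample (one stratum per discipline) for a single
    run, so no two runs reuse the same static test set."""
    rng = random.Random(seed)
    by_discipline = {}
    for q in bank:
        by_discipline.setdefault(q["discipline"], []).append(q)
    run = []
    for questions in by_discipline.values():
        run.extend(rng.sample(questions, min(per_discipline, len(questions))))
    rng.shuffle(run)
    return run
```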
evaluation · fairness · robustness · generative QA · longitudinal study
LLMEval-Med is a physician-validated benchmark for evaluating LLMs on real-world clinical tasks. It covers five core medical areas (Medical Knowledge, Language Understanding, Reasoning, Ethics & Safety, Text Generation) with 2,996 questions from real electronic health records and expert-designed clinical scenarios. An automated evaluation pipeline with expert-developed checklists is validated through human-machine agreement analysis. 13 LLMs across specialized, open-source, and closed-source categories are evaluated.
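The sketch below illustrates one way checklist-based automatic scoring can work: an LLM judge answers one yes/no question per expert checklist item, and the answer's score is the fraction of items satisfied. The judge callable, prompt wording, and checklist items are placeholders, not the LLMEval-Med pipeline.

```python
# Sketch of checklist-based automatic scoring against an expert rubric.
# `judge` is any callable that sends a prompt to an LLM and returns text;
# it and the prompt format are placeholders, not the paper's implementation.
def score_with_checklist(answer, checklist, judge):
    """Return the fraction of expert checklist items the answer satisfies."""
    hits = 0
    for item in checklist:
        prompt = (f"Answer:\n{answer}\n\n"
                  f"Does the answer satisfy: {item}? Reply YES or NO.")
        verdict = judge(prompt)
        hits += verdict.strip().upper().startswith("YES")
    return hits / len(checklist)
```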
This paper tackles the third crucial question in LLM evaluation: how to evaluate. We compare various criteria with both manual and automatic evaluation, drawing on onsite staff, crowdsourced workers, public annotators, and GPT-4 across different scoring and ranking systems. Twenty LLMs are evaluated, with 2,186 participants generating 243,337 manual annotations and 57,511 automated results. The paper proposes the LLMEval dataset (drawn from the LLMEval-1 and LLMEval-2 rounds) and draws 10 conclusions for future evaluation practices.
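As an illustration of how agreement between manual and automated annotations can be quantified, the sketch below computes Cohen's kappa between two label sequences; the labels are invented and the choice of metric is an assumption, not necessarily the one used in the paper.

```python
# Sketch of chance-corrected agreement between two annotation sources,
# e.g. human scores vs GPT-4 scores over the same items. Labels are toy data.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

print(cohens_kappa([3, 2, 3, 1], [3, 2, 2, 1]))  # ~0.636
```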
Phase II dataset: professional domain evaluation across 12 academic disciplines, 480 questions
Technical Report · 2024
LLMEval-Gaokao2024-Math: Chinese Large Language Model Evaluation, 2024 Gaokao Mathematics Special
LLMEval Team
This evaluation uses the 2024 Chinese National College Entrance Examination (Gaokao) mathematics papers as a benchmark for large language models. Because freshly released exam questions are highly original and kept confidential before administration, they form an excellent contamination-free test set. The evaluation covers both New Paper I and New Paper II, testing models with both LaTeX-formatted and escape-character-formatted prompts to reveal sensitivity to prompt formatting in mathematical contexts.
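The sketch below shows the kind of contrast between a LaTeX-formatted and an escape-character-formatted prompt that such sensitivity testing involves; the template and problem text are invented placeholders, not actual Gaokao items or the evaluation's prompts.

```python
# Sketch of the two prompt formats being contrasted. The problem text and
# template are invented placeholders.
problem_latex = r"Solve for $x$: $\frac{x^2 - 1}{x + 1} = 3$."
# "Escape-character" variant: backslashes doubled, as they might survive from
# a JSON payload that was not unescaped before being placed in the prompt.
problem_escaped = problem_latex.replace("\\", "\\\\")

TEMPLATE = ("You are a math assistant. Answer with the final value only.\n\n"
            "Problem: {problem}")

for variant, problem in [("latex", problem_latex), ("escaped", problem_escaped)]:
    print(f"--- {variant} ---")
    print(TEMPLATE.format(problem=problem))
```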