All publications from the LLMEval research series.
Under submission · 2026
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
LLMEval Team, Fudan NLP Lab (anonymous authors during peer review)
LLMEval-Logic is a Chinese logical reasoning benchmark built through a three-stage audit pipeline: (a) annotators authored items forward from real-world stories rather than templating backward from formulas, (b) a hand-written rubric checklist together with the Z3 SMT solver double-audited every natural-language → first-order-logic translation, and (c) a closed-loop adversarial hardening agent workflow discarded items that turned out to be too easy. The dataset has two paired splits — LLMEval-Logic-Base (single-question PL & FOL items with Z3-verified answers, gold formalisations and atom-level NL→FL rubrics) and LLMEval-Logic-Hard (multi-question / sub-question items covering enumeration / counting / uniqueness / alternative-solution / counterfactual reasoning). Three independent runs of 14 frontier LLMs under thinking / no-thinking configurations show the strongest model reaches only 37.5% Item Accuracy on Hard, leaving substantial headroom for frontier reasoning research. Following the contamination-resistant tradition of LLMEval-Fair, only 80% of the corpus is released publicly; the remaining 20% is held out as a private contamination-resistant test set maintained by Fudan NLP Lab.
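As a rough illustration of the solver-audit step in (b), the sketch below checks a toy propositional item with the z3 Python bindings by asserting the gold premises and testing whether the negated conclusion is unsatisfiable. The item, atoms, and answers are invented for the example; this is not the actual audit pipeline.

```python
# Minimal sketch of solver-side verification for one benchmark item.
# Requires `pip install z3-solver`; the premises and conclusion are toy
# examples, not actual LLMEval-Logic items.
from z3 import Bool, Solver, Implies, Not, unsat

def entailed(premises, conclusion):
    """Return True iff the premises entail the conclusion,
    i.e. (premises AND NOT conclusion) is unsatisfiable."""
    s = Solver()
    s.add(*premises)
    s.add(Not(conclusion))
    return s.check() == unsat

# Toy item: "If it rains the ground is wet; it rains."
rain, wet = Bool("rain"), Bool("wet")
premises = [Implies(rain, wet), rain]

print(entailed(premises, wet))        # True  -> gold answer is solver-verified
print(entailed(premises, Not(wet)))   # False -> contradictory answer rejected
```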
LLMEval-Fair addresses robustness and fairness concerns in LLM evaluation through a 30-month longitudinal study. Built on a proprietary bank of 220,000 graduate-level questions across 13 academic disciplines, it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts. A study of nearly 60 leading models reveals performance ceilings and exposes data contamination vulnerabilities undetectable by static benchmarks.
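A minimal sketch of the per-run dynamic sampling idea is given below; the bank schema, stratification by discipline, and sample size are illustrative assumptions, not the actual LLMEval-Fair implementation.

```python
# Sketch of drawing a fresh, unseen test set from a private question bank
# for each evaluation run. Field names and sizes are assumptions.
import random

def sample_run(bank, per_discipline=20, seed=None):
    """Draw a stratified sample (one stratum per discipline) for a single
    run, so no two runs reuse the same static test set."""
    rng = random.Random(seed)
    by_discipline = {}
    for q in bank:
        by_discipline.setdefault(q["discipline"], []).append(q)
    run = []
    for questions in by_discipline.values():
        run.extend(rng.sample(questions, min(per_discipline, len(questions))))
    rng.shuffle(run)
    return run
```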
evaluation · fairness · robustness · generative QA · longitudinal study
LLMEval-Med is a physician-validated benchmark for evaluating LLMs on real-world clinical tasks. It covers five core medical areas (Medical Knowledge, Language Understanding, Reasoning, Ethics & Safety, Text Generation) with 2,996 questions from real electronic health records and expert-designed clinical scenarios. An automated evaluation pipeline with expert-developed checklists is validated through human-machine agreement analysis. 13 LLMs across specialized, open-source, and closed-source categories are evaluated.
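The sketch below illustrates one way checklist-based automatic scoring can work: an LLM judge answers one yes/no question per expert checklist item, and the answer's score is the fraction of items satisfied. The judge callable, prompt wording, and checklist items are placeholders, not the LLMEval-Med pipeline.

```python
# Sketch of checklist-based automatic scoring against an expert rubric.
# `judge` is any callable that sends a prompt to an LLM and returns text;
# it and the prompt format are placeholders, not the paper's implementation.
def score_with_checklist(answer, checklist, judge):
    """Return the fraction of expert checklist items the answer satisfies."""
    hits = 0
    for item in checklist:
        prompt = (f"Answer:\n{answer}\n\n"
                  f"Does the answer satisfy: {item}? Reply YES or NO.")
        verdict = judge(prompt)
        hits += verdict.strip().upper().startswith("YES")
    return hits / len(checklist)
```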
This paper tackles the third crucial question in LLM evaluation: how to evaluate. We compare various criteria with both manual and automatic evaluation, drawing on onsite staff, crowdsourced workers, public annotators, and GPT-4 across different scoring and ranking systems. Twenty LLMs are evaluated, with 2,186 participants generating 243,337 manual annotations and 57,511 automated results. The paper proposes the LLMEval dataset (drawn from the LLMEval-1 and LLMEval-2 rounds) and draws 10 conclusions for future evaluation practices.
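As an illustration of how agreement between manual and automated annotations can be quantified, the sketch below computes Cohen's kappa between two label sequences; the labels are invented and the choice of metric is an assumption, not necessarily the one used in the paper.

```python
# Sketch of chance-corrected agreement between two annotation sources,
# e.g. human scores vs GPT-4 scores over the same items. Labels are toy data.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

print(cohens_kappa([3, 2, 3, 1], [3, 2, 2, 1]))  # ~0.636
```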
Phase II dataset: professional domain evaluation across 12 academic disciplines, 480 questions
Technical Report · 2024
LLMEval-Gaokao2024-Math: Chinese Large Language Model Evaluation, 2024 Gaokao Mathematics Special
LLMEval Team
This evaluation uses the 2024 Chinese National College Entrance Examination (Gaokao) mathematics papers as a benchmark for large language models. Because freshly released exam questions are highly original and kept confidential before administration, they form an excellent contamination-free test set. The evaluation covers both New Paper I and New Paper II, testing models with both LaTeX-formatted and escape-character-formatted prompts to reveal sensitivity to prompt formatting in mathematical contexts.
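The sketch below shows the kind of contrast between a LaTeX-formatted and an escape-character-formatted prompt that such sensitivity testing involves; the template and problem text are invented placeholders, not actual Gaokao items or the evaluation's prompts.

```python
# Sketch of the two prompt formats being contrasted. The problem text and
# template are invented placeholders.
problem_latex = r"Solve for $x$: $\frac{x^2 - 1}{x + 1} = 3$."
# "Escape-character" variant: backslashes doubled, as they might survive from
# a JSON payload that was not unescaped before being placed in the prompt.
problem_escaped = problem_latex.replace("\\", "\\\\")

TEMPLATE = ("You are a math assistant. Answer with the final value only.\n\n"
            "Problem: {problem}")

for variant, problem in [("latex", problem_latex), ("escaped", problem_escaped)]:
    print(f"--- {variant} ---")
    print(TEMPLATE.format(problem=problem))
```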