LLMEval-Logic: Solver-Verified Chinese Logic Reasoning Benchmark (80% Public Release)
Accompanying paper: *LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening*
We are releasing LLMEval-Logic, the logic-reasoning entry in the LLMEval series. The benchmark was built through a three-stage audit pipeline and is published in two paired splits — a Z3-verified single-question split and a multi-question adversarially hardened split.
Why another logic benchmark?
Existing logical reasoning benchmarks tend to fall into one of two failure modes:
1. Template-leak. Items are templated backward from formulas, so models that memorise the templates do well on the test set but generalise poorly to natural text.
2. Easy-answer drift. Once a model can pattern-match a few answer keywords, the benchmark stops differentiating frontier reasoners.
LLMEval-Logic is designed to attack both failure modes.
Three-stage audit pipeline
Stage (a) — forward authoring. Annotators write items *forward* from real-world situations (e.g. roommate scheduling, project sign-off, library borrowing) rather than templating *backward* from formulas. Each item carries Chinese natural-language premises and a question.
Stage (b) — solver + rubric double audit. For every Base item we ship a gold formalisation (parameters / premise / question in a solver-parseable JSON schema). Each natural-language-to-formal translation is double-checked by:
- a Z3 SMT solver running three judgments (`possible` / `necessary` / `enumerate_models`);
- a hand-written atom-level rubric organised into `logical_relation` / `stated_constraint` / `query_alignment` groups, so we can score whether a candidate formalisation preserves the original logical relations, constraints, and query, not just whether the final answer happens to match.
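To make the three judgments concrete, here is a minimal runnable sketch against a toy propositional item. The atoms, premises, and query are invented for illustration; only the judgment names mirror the text above, and the shipped JSON schema and gold formalisations remain the source of truth.

```python
# Minimal sketch of the three Z3 judgments on a toy item.
# The atoms/premises/query below are illustrative, not from the benchmark.
from z3 import And, Bools, Not, Or, Solver, is_true, sat

A, B, C = Bools("A B C")                    # atoms (the item's `parameters`)
premises = And(Or(A, B), Not(And(A, C)))    # the item's `premise`
query = B                                   # the item's `question`

def possible(premises, query):
    """Satisfiability: can the query hold together with the premises?"""
    s = Solver()
    s.add(premises, query)
    return s.check() == sat

def necessary(premises, query):
    """Entailment: does the query hold in every model of the premises?"""
    s = Solver()
    s.add(premises, Not(query))
    return s.check() != sat                 # no counter-model => necessary

def enumerate_models(premises, atoms):
    """All models of the premises over the given atoms, via blocking clauses."""
    s, models = Solver(), []
    s.add(premises)
    while s.check() == sat:
        m = s.model()
        models.append({str(a): is_true(m.eval(a, model_completion=True))
                       for a in atoms})
        # Exclude the assignment just found before re-solving.
        s.add(Or([a != m.eval(a, model_completion=True) for a in atoms]))
    return models

print(possible(premises, query), necessary(premises, query))   # True False
print(len(enumerate_models(premises, [A, B, C])))              # 4
```

Blocking-clause enumeration is the standard all-models idiom in Z3; it is presumably what backs the closed-world enumeration and counting sub-questions in the Hard split.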
Stage (c) — closed-loop adversarial hardening. A five-role agent workflow (Decider / Proposal / Review / Answering / Verification) rewrites high-accuracy Base items into harder multi-question variants under six structured strategies (branching, effective distractors, explicit uncertainty, set-valued output, counterfactual variants, alias/coreference shifts). A rewrite is admitted only when it passes both Review and Verification, producing the Hard split.
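The admission logic of the closed loop is simple to state even though the roles are LLM agents. Below is a minimal control-flow sketch with the five roles as placeholder callables; their prompts and implementations, and the `max_rounds` retry budget, are invented for illustration.

```python
# Sketch of the Stage (c) admit loop. Only the control flow mirrors the text:
# a rewrite enters the Hard split only if both Review and Verification pass.
STRATEGIES = [
    "branching", "effective_distractors", "explicit_uncertainty",
    "set_valued_output", "counterfactual_variants", "alias_coreference_shifts",
]

def harden(base_item, decider, propose, review, answer, verify, max_rounds=5):
    for _ in range(max_rounds):
        strategy = decider(base_item, STRATEGIES)  # Decider picks a strategy
        candidate = propose(base_item, strategy)   # Proposal rewrites the item
        if not review(candidate):                  # Review rejects malformed rewrites
            continue
        prediction = answer(candidate)             # Answering solves it fresh
        if verify(candidate, prediction):          # Verification checks the solve
            return candidate                       # admitted into the Hard split
    return None                                    # item stays Base-only
```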
What's in the public release
- LLMEval-Logic-Base — 197 single-question items (propositional and first-order logic) with Z3-verified answers, gold formalisations, and atom-level rubrics.
- LLMEval-Logic-Hard — 152 multi-question items (~752 sub-questions) covering closed-world enumeration / counting / uniqueness / alternative-solution / counterfactual reasoning.
- Rubrics — 197 per-problem checklists (one per Base item), each containing atomic yes/no statements that can be answered jointly by Z3 and an LLM judge.
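To make the rubric shape concrete, a hypothetical entry is sketched below. Every field name and statement is illustrative; the released files define the actual schema.

```python
# Hypothetical atom-level rubric entry (field names are illustrative only).
rubric = {
    "problem_id": "base-0042",
    "atoms": [
        {"group": "logical_relation",
         "statement": "'Only if B signs off does A ship' is encoded as an "
                      "implication, not a conjunction.",
         "judge": ["z3", "llm"]},   # answerable jointly by solver and LLM judge
        {"group": "stated_constraint",
         "statement": "Exactly one room is booked on Friday.",
         "judge": ["z3"]},
        {"group": "query_alignment",
         "statement": "The query asks whether the schedule is possible, "
                      "not necessary.",
         "judge": ["llm"]},
    ],
}
```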
The public 80% is produced by deterministic stratified sampling with seed=2026: Base is stratified by (logic_type, label_type) and Hard by sub-question-count bucket. This preserves the full corpus's distribution, so the public numbers stay comparable with the paper's full-corpus numbers.
| Asset | Full corpus | Public (this release) | Private holdout |
|---|---:|---:|---:|
| LLMEval-Logic-Base items | 246 | 197 | 49 |
| LLMEval-Logic-Hard items | 190 | 152 | 38 |
| Hard sub-questions | 938 | ~752 | ~186 |
| Per-problem rubric files | 246 | 197 | 49 |
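A deterministic split like the one above can be reproduced in a few lines. In this sketch only seed=2026 and the stratification axes come from the release notes; the field names, helper signature, and rounding policy are assumptions.

```python
# Sketch of a deterministic 80/20 stratified split (seed and axes from the text).
import random
from collections import defaultdict

def stratified_split(items, key, public_frac=0.8, seed=2026):
    rng = random.Random(seed)                 # fixed seed => reproducible split
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    public, holdout = [], []
    for _, bucket in sorted(strata.items()):  # sort strata for determinism
        rng.shuffle(bucket)
        cut = round(len(bucket) * public_frac)
        public += bucket[:cut]
        holdout += bucket[cut:]
    return public, holdout

# Base: stratify by (logic_type, label_type); Hard: by sub-question-count bucket.
# base_pub, base_hold = stratified_split(
#     base_items, key=lambda it: (it["logic_type"], it["label_type"]))
```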
Headline numbers (paper, full corpus)
Three independent runs of 14 frontier LLMs under thinking / no-thinking configurations:
| Model | Base Item Acc. (%) | Hard Item Acc. (%) | Hard Sub-Q Acc. (%) |
|---|---:|---:|---:|
| Gemini 3.1 Pro (thinking) | 74.0 ± 1.2 | 37.5 ± 3.8 | 76.0 ± 0.6 |
| Claude Opus 4.6 (thinking) | 68.7 ± 3.2 | 36.7 ± 2.6 | 76.6 ± 1.4 |
| Claude Opus 4.6 (no-think) | 69.0 ± 1.2 | 36.0 ± 2.4 | 74.4 ± 1.2 |
| GPT-5.4 Pro (thinking) | 71.8 ± 0.9 | 32.6 ± 4.3 | 68.4 ± 3.1 |
Even with the strongest reasoner, Hard Item Accuracy tops out at 37.5% — there is plenty of room left for frontier reasoning research.
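Assuming the ± figures are the mean and sample standard deviation of accuracy over the three runs (the usual convention, not confirmed here), the aggregation is one line:

```python
# Presumed aggregation behind the "74.0 ± 1.2"-style cells: mean and sample
# standard deviation over three independent runs (an assumption, not a spec).
from statistics import mean, stdev

def summarize(run_accuracies):            # e.g. accuracies in percent
    return f"{mean(run_accuracies):.1f} ± {stdev(run_accuracies):.1f}"

print(summarize([73.2, 75.1, 73.7]))      # -> "74.0 ± 1.0"
```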
Why a 20% private holdout?
Following the contamination-resistant tradition of LLMEval-Fair, we hold out 49 Base + 38 Hard items (and their rubrics) as a private test set maintained by Fudan NLP Lab. The holdout backs the official LLMEval-Logic leaderboard at llmeval.com and supports contamination audits years after the public release.
To submit a model for official evaluation against the holdout, please contact the Fudan NLP Lab.
Links
- Code & one-command pipeline:
- Public 80% dataset: