LLMEval-Logic: Solver-Verified Chinese Logic Reasoning Benchmark on arXiv
LLMEval-Logic is a Chinese logical reasoning benchmark audited both by the Z3 SMT solver and by human rubrics, then hardened through an adversarial agent loop. The paper is now on arXiv, alongside a public release covering 80% of the benchmark (197 Base + 154 Hard + 197 rubrics); the remaining 20% is held out as a private, contamination-resistant test set maintained by Fudan NLP Lab.
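To give a rough sense of what solver verification of a logic item means, here is a minimal sketch in plain Python. It is not the benchmark's pipeline and does not use Z3: it swaps in a brute-force truth-table check for the SMT solver, and the function name `entails` and the toy rain/wet item are hypothetical, chosen only for illustration. The idea is that a labeled answer counts as verified when no assignment satisfies all premises while falsifying the conclusion.

```python
from itertools import product


def entails(variables, premises, conclusion):
    """Return True if every assignment satisfying all premises
    also satisfies the conclusion (brute-force entailment check)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold, conclusion fails
    return True


# Hypothetical toy item: "If it rains, the ground is wet. It rains."
variables = ["rain", "wet"]
premises = [
    lambda e: (not e["rain"]) or e["wet"],  # rain -> wet
    lambda e: e["rain"],                    # rain
]

print(entails(variables, premises, lambda e: e["wet"]))      # True
print(entails(variables, premises, lambda e: not e["wet"]))  # False
```

A truth table grows exponentially with the number of variables, which is one reason a real audit would rely on a solver such as Z3 rather than enumeration.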