LLMEval-Med: A Physician-Validated, Real-World Clinical Benchmark (EMNLP 2025 Findings)
LLMEval-Med has been accepted at EMNLP 2025 Findings. This work addresses critical gaps in medical LLM evaluation by building a benchmark grounded in real clinical practice rather than exam-style questions.
Motivation
Current medical LLM benchmarks have three key limitations:
- Question design — mostly multiple-choice, which poorly tests clinical reasoning
- Data sources — often not derived from real clinical scenarios
- Evaluation methods — often simple answer matching, which inadequately assesses complex medical reasoning
LLMEval-Med addresses all three.
Benchmark Design
Five Core Medical Areas
1. Medical Knowledge — factual recall and understanding of medical concepts
2. Medical Language Understanding — comprehension of clinical text, reports, and records
3. Medical Reasoning — diagnostic reasoning, differential diagnosis, treatment planning
4. Medical Ethics and Safety — ethical decision-making and harm avoidance
5. Medical Text Generation — clinical note writing, patient communication, report generation
Real-World Clinical Data
- 2,996 questions derived from real-world electronic health records and expert-designed clinical scenarios
- Questions span multiple difficulty levels and clinical specialties (a hypothetical record layout is sketched below)
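For concreteness, here is a minimal sketch of what one benchmark item might look like. The field names and example values are assumptions for illustration only; consult the released HuggingFace dataset for the actual schema.

```python
# Hypothetical layout of a single LLMEval-Med-style item.
# Field names and values are illustrative, not the released schema.
from dataclasses import dataclass, field

@dataclass
class ClinicalItem:
    question: str          # open-ended clinical question
    reference_answer: str  # physician-written reference answer
    dimension: str         # one of the five core medical areas
    specialty: str         # clinical specialty, e.g. "cardiology"
    difficulty: str        # e.g. "easy" / "medium" / "hard"
    checklist: list[str] = field(default_factory=list)  # expert grading points

item = ClinicalItem(
    question="A 62-year-old presents with crushing substernal chest pain...",
    reference_answer="Obtain an ECG and troponin; treat as ACS until ruled out.",
    dimension="Medical Reasoning",
    specialty="cardiology",
    difficulty="medium",
    checklist=["Orders ECG", "Orders troponin", "Flags time-critical triage"],
)
```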
Evaluation Innovation
LLM-as-Judge with Physician Validation
Rather than simple answer matching, LLMEval-Med uses an automated evaluation pipeline (a sketch follows the list below) with:
- Expert-developed checklists incorporated into the LLM-as-Judge framework
- Human-machine agreement analysis to validate scoring reliability
- Dynamic refinement of checklists and prompts based on ongoing expert feedback
- 5-point scoring rubric from Unacceptable (1) to Accurate (5)
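The sketch below shows one way such a pipeline could be wired up: an expert checklist embedded in the judge prompt, and a Pearson-correlation check between judge and physician scores as a simple stand-in for the paper's human-machine agreement analysis. `call_judge_model`, the prompt wording, and the rubric text are placeholder assumptions, not the paper's actual artifacts.

```python
# Minimal sketch of checklist-based LLM-as-Judge scoring, assuming a
# chat-completion callable `call_judge_model(prompt) -> str`. Prompt
# wording, rubric text, and agreement metric are illustrative only.
from statistics import mean

RUBRIC = "Score the answer from 1 (Unacceptable) to 5 (fully correct and safe)."

def build_judge_prompt(question: str, answer: str, checklist: list[str]) -> str:
    """Embed the expert-developed checklist into the judge prompt."""
    items = "\n".join(f"- {c}" for c in checklist)
    return (
        "You are a physician grading a model's answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Checklist (each point should be satisfied):\n{items}\n"
        f"{RUBRIC}\nReply with a single integer."
    )

def judge_score(question, answer, checklist, call_judge_model) -> int:
    """Score one model answer with the LLM judge."""
    prompt = build_judge_prompt(question, answer, checklist)
    return int(call_judge_model(prompt).strip())

def human_machine_agreement(judge_scores, physician_scores) -> float:
    """Pearson correlation between judge and physician scores on the
    same answers; a stand-in for the paper's agreement analysis."""
    mj, mp = mean(judge_scores), mean(physician_scores)
    cov = sum((j - mj) * (p - mp) for j, p in zip(judge_scores, physician_scores))
    norm_j = sum((j - mj) ** 2 for j in judge_scores) ** 0.5
    norm_p = sum((p - mp) ** 2 for p in physician_scores) ** 0.5
    return cov / (norm_j * norm_p)
```

In this framing, low agreement on a batch of items would trigger revision of the checklist and prompt, matching the dynamic-refinement loop described above.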
Key Results
13 LLMs were evaluated across three categories (a sketch of per-category aggregation follows the list):
- Specialized medical models — fine-tuned for healthcare
- Open-source models — general-purpose open weights
- Closed-source models — proprietary API-based systems
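To illustrate how per-category comparisons might be tallied, here is a small aggregation sketch; the model names and scores are placeholders, not results from the paper.

```python
# Hypothetical per-category aggregation of mean judge scores (1-5 scale).
# Model names and scores are placeholders, not the paper's results.
from collections import defaultdict
from statistics import mean

results = [
    ("med-model-a", "specialized", 3.8),
    ("open-model-b", "open-source", 3.5),
    ("closed-model-c", "closed-source", 4.1),
]

by_category = defaultdict(list)
for model, category, score in results:
    by_category[category].append(score)

for category, scores in sorted(by_category.items()):
    print(f"{category:>13}: mean {mean(scores):.2f} across {len(scores)} model(s)")
```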
The evaluation provides evidence on when and how LLMs can be safely deployed in clinical settings; see the paper for the full per-model results.
Resources
- Paper: arXiv:2506.04078
- Dataset: HuggingFace
- Code: GitHub