LLMEval Team

LLMEval-Med: Physician-Validated Clinical Benchmark (EMNLP 2025)

Tags: EMNLP 2025 · LLMEval-Med · medical AI


LLMEval-Med has been accepted at EMNLP 2025 Findings. This work addresses critical gaps in medical LLM evaluation by building a benchmark grounded in real clinical practice rather than exam-style questions.

Motivation

Current medical LLM benchmarks have three key limitations:

  • Question design — mostly multiple-choice, a format that poorly probes open-ended clinical reasoning
  • Data sources — often exam-derived rather than drawn from real clinical scenarios
  • Evaluation methods — simple answer matching cannot adequately assess complex medical reasoning

LLMEval-Med addresses all three.

Benchmark Design

Five Core Medical Areas

1. Medical Knowledge — factual recall and understanding of medical concepts

2. Medical Language Understanding — comprehension of clinical text, reports, and records

3. Medical Reasoning — diagnostic reasoning, differential diagnosis, treatment planning

4. Medical Ethics and Safety — ethical decision-making and harm avoidance

5. Medical Text Generation — clinical note writing, patient communication, report generation

Real-World Clinical Data

  • 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios
  • Questions span multiple difficulty levels and clinical specialties

Evaluation Innovation

LLM-as-Judge with Physician Validation

Rather than simple answer matching, LLMEval-Med uses an automated evaluation pipeline with:

  • Expert-developed checklists incorporated into the LLM-as-Judge framework
  • Human-machine agreement analysis to validate scoring reliability
  • Dynamic refinement of checklists and prompts based on ongoing expert feedback
  • A 5-point scoring rubric, from Unacceptable (1) to Accurate (5)
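To make the checklist-based scoring concrete, here is a minimal sketch of how judge verdicts on expert checklist items could be mapped onto the 5-point rubric. The `ChecklistItem` type, the evenly spaced thresholds, and the mapping function are illustrative assumptions, not the paper's actual implementation (the real rubric is defined by physicians, and the `satisfied` flags would come from the LLM judge).

```python
# Hypothetical sketch: `ChecklistItem` and the fraction-to-score mapping
# are illustrative assumptions, not LLMEval-Med's actual implementation.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str   # criterion written by a physician expert
    satisfied: bool    # verdict produced by the LLM judge for one answer

def rubric_score(items: list[ChecklistItem]) -> int:
    """Map the fraction of satisfied checklist items onto the 1-5 rubric,
    from Unacceptable (1) to Accurate (5)."""
    if not items:
        raise ValueError("empty checklist")
    frac = sum(i.satisfied for i in items) / len(items)
    # Evenly spaced thresholds for illustration only.
    return 1 + round(frac * 4)

checklist = [
    ChecklistItem("states the correct diagnosis", True),
    ChecklistItem("lists key differential diagnoses", True),
    ChecklistItem("flags relevant safety concerns", False),
    ChecklistItem("avoids unsupported medical claims", True),
]
print(rubric_score(checklist))  # 4
```

In practice the judge would also return a justification per item, which is what enables the dynamic refinement of checklists and prompts described above.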

Key Results

13 LLMs were evaluated across three categories:

  • Specialized medical models — fine-tuned for healthcare
  • Open-source models — general-purpose open weights
  • Closed-source models — proprietary API-based systems

The evaluation reveals important insights about when and how LLMs can be safely deployed in clinical settings.
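For readers curious about the human-machine agreement analysis mentioned in the evaluation pipeline, one standard way to validate an LLM judge against physician scores is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is a minimal illustration under that assumption; the paper's exact agreement metric and data are not reproduced here.

```python
# Illustrative sketch: Cohen's kappa between physician and LLM-judge
# rubric scores on the same items (chance-corrected agreement).
from collections import Counter

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Agreement between two raters on the same items,
    corrected for agreement expected by chance."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and aligned")
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_exp = sum(ca[l] * cb[l] for l in labels) / (n * n)   # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example: physician scores vs. LLM-judge scores on six answers.
human = [5, 4, 3, 5, 2, 4]
model = [5, 4, 4, 5, 2, 3]
print(round(cohens_kappa(human, model), 2))  # 0.54
```

A kappa near 1 indicates the automated judge can stand in for physician scoring; low kappa signals the checklists or prompts need the refinement loop described above.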