LLMEval Team

LLMEval-Med: Physician-Validated Clinical Benchmark (EMNLP 2025)

Tags: EMNLP 2025 · LLMEval-Med · medical AI


LLMEval-Med has been accepted at EMNLP 2025 Findings. This work addresses critical gaps in medical LLM evaluation by building a benchmark grounded in real clinical practice rather than exam-style questions.

Motivation

Current medical LLM benchmarks have three key limitations:

  • Question design — mostly multiple-choice, a format that poorly probes open-ended clinical reasoning
  • Data sources — often exam-derived rather than drawn from real clinical scenarios
  • Evaluation methods — simple answer matching cannot adequately assess complex medical reasoning

LLMEval-Med addresses all three.

Benchmark Design

Five Core Medical Areas

1. Medical Knowledge — factual recall and understanding of medical concepts

2. Medical Language Understanding — comprehension of clinical text, reports, and records

3. Medical Reasoning — diagnostic reasoning, differential diagnosis, treatment planning

4. Medical Ethics and Safety — ethical decision-making and harm avoidance

5. Medical Text Generation — clinical note writing, patient communication, report generation

Real-World Clinical Data

  • 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios
  • Questions span multiple difficulty levels and clinical specialties

Evaluation Innovation

LLM-as-Judge with Physician Validation

Rather than simple answer matching, LLMEval-Med uses an automated evaluation pipeline with:

  • Expert-developed checklists incorporated into the LLM-as-Judge framework
  • Human-machine agreement analysis to validate scoring reliability
  • Dynamic refinement of checklists and prompts based on ongoing expert feedback
  • A 5-point scoring rubric, from Unacceptable (1) to Accurate (5)
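To make the checklist-based scoring concrete, here is a minimal sketch of how judge verdicts on expert checklist items could be mapped onto the 5-point rubric. The `ChecklistItem` type, the evenly spaced thresholds, and the mapping function are illustrative assumptions, not the paper's actual implementation (the real rubric is defined by physicians, and the `satisfied` flags would come from the LLM judge).

```python
# Hypothetical sketch: `ChecklistItem` and the fraction-to-score mapping
# are illustrative assumptions, not LLMEval-Med's actual implementation.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str   # criterion written by a physician expert
    satisfied: bool    # verdict produced by the LLM judge for one answer

def rubric_score(items: list[ChecklistItem]) -> int:
    """Map the fraction of satisfied checklist items onto the 1-5 rubric,
    from Unacceptable (1) to Accurate (5)."""
    if not items:
        raise ValueError("empty checklist")
    frac = sum(i.satisfied for i in items) / len(items)
    # Evenly spaced thresholds for illustration only.
    return 1 + round(frac * 4)

checklist = [
    ChecklistItem("states the correct diagnosis", True),
    ChecklistItem("lists key differential diagnoses", True),
    ChecklistItem("flags relevant safety concerns", False),
    ChecklistItem("avoids unsupported medical claims", True),
]
print(rubric_score(checklist))  # 4
```

In practice the judge would also return a justification per item, which is what enables the dynamic refinement of checklists and prompts described above.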

Key Results

13 LLMs were evaluated across three categories:

  • Specialized medical models — fine-tuned for healthcare
  • Open-source models — general-purpose open weights
  • Closed-source models — proprietary API-based systems

The evaluation reveals important insights about when and how LLMs can be safely deployed in clinical settings.
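For readers curious about the human-machine agreement analysis mentioned in the evaluation pipeline, one standard way to validate an LLM judge against physician scores is a chance-corrected agreement statistic such as Cohen's kappa. The sketch below is a minimal illustration under that assumption; the paper's exact agreement metric and data are not reproduced here.

```python
# Illustrative sketch: Cohen's kappa between physician and LLM-judge
# rubric scores on the same items (chance-corrected agreement).
from collections import Counter

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Agreement between two raters on the same items,
    corrected for agreement expected by chance."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and aligned")
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_exp = sum(ca[l] * cb[l] for l in labels) / (n * n)   # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Toy example: physician scores vs. LLM-judge scores on six answers.
human = [5, 4, 3, 5, 2, 4]
model = [5, 4, 4, 5, 2, 3]
print(round(cohens_kappa(human, model), 2))  # 0.54
```

A kappa near 1 indicates the automated judge can stand in for physician scoring; low kappa signals the checklists or prompts need the refinement loop described above.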