Award Date

12-15-2025

Degree Type

Thesis

Degree Name

Master of Science in Engineering (MSE)

Department

Electrical and Computer Engineering

First Committee Member

Brendan Morris

Second Committee Member

Mei Yang

Third Committee Member

Shahram Latifi

Fourth Committee Member

Mingon Kang

Number of Pages

139

Abstract

Although language models achieve over 90% accuracy on medical benchmarks, recent studies show that physicians cannot effectively use them to improve clinical reasoning. Current benchmarks test isolated factual recall, but clinical practice requires hierarchical navigation through diagnostic categories: starting broad and narrowing systematically, from chest pain to cardiovascular pathology to myocardial infarction to specific STEMI types. Existing evaluations cannot measure whether models preserve the taxonomic structure that is essential for clinical reasoning.

We introduce AnkiMedBench, built from 16,512 medical flashcards used by students preparing for licensing exams. Cards are organized across six hierarchy levels, ranging from 16 broad medical specialties to 672 specific diseases and conditions. We evaluate 30 models using two metrics: Flat F1 measures exact classification, while Hierarchical F1 gives partial credit when errors respect the medical taxonomy. This dual-metric approach reveals whether models achieve both clinical precision and structural coherence.
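
The difference between the two metrics can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so this assumes a common formulation: Flat F1 is micro-averaged over single-label predictions (which reduces to exact-match accuracy), and Hierarchical F1 compares the predicted and true labels after each is expanded with all of its ancestors. The toy taxonomy and the function names are hypothetical and are not taken from AnkiMedBench.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy taxonomy (child -> parent); None marks a top-level node.
# Illustrative only; this is not the AnkiMedBench hierarchy.
PARENT: Dict[str, Optional[str]] = {
    "Cardiology": None,
    "Pulmonology": None,
    "Myocardial infarction": "Cardiology",
    "STEMI": "Myocardial infarction",
    "NSTEMI": "Myocardial infarction",
    "Pneumonia": "Pulmonology",
}

Pair = Tuple[str, str]  # (true label, predicted label)


def with_ancestors(label: str) -> Set[str]:
    """Return the label together with every ancestor up to the top level."""
    nodes: Set[str] = set()
    node: Optional[str] = label
    while node is not None:
        nodes.add(node)
        node = PARENT[node]
    return nodes


def flat_f1(pairs: List[Pair]) -> float:
    """Micro-averaged F1 for single-label data, i.e. exact-match accuracy."""
    if not pairs:
        return 0.0
    return sum(true == pred for true, pred in pairs) / len(pairs)


def hierarchical_f1(pairs: List[Pair]) -> float:
    """F1 over ancestor-augmented label sets (assumed formulation)."""
    overlap = pred_total = true_total = 0
    for true, pred in pairs:
        true_set = with_ancestors(true)
        pred_set = with_ancestors(pred)
        overlap += len(true_set & pred_set)
        pred_total += len(pred_set)
        true_total += len(true_set)
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / true_total
    return 2 * precision * recall / (precision + recall)


# A near miss (sibling disease) versus a miss in the wrong specialty.
near_miss = [("STEMI", "NSTEMI")]
far_miss = [("STEMI", "Pneumonia")]
print(flat_f1(near_miss), hierarchical_f1(near_miss))
print(flat_f1(far_miss), hierarchical_f1(far_miss))
```

Under these assumptions, both errors score zero on Flat F1, but only the sibling confusion earns partial hierarchical credit because it shares two of three nodes on the path to the true label. A model whose two scores diverge widely is therefore one that lands in the right region of the taxonomy but misses the exact leaf.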

Our best model outperformed the worst by 2.6× at fine-grained classification. High-quality models maintain similar scores on both metrics, while low-quality models show large divergence: they capture broad categories but fail at precise distinctions. Fine-tuning on 26 medical textbooks produced minimal hierarchical improvement, demonstrating that standard benchmark performance does not predict hierarchical capability.

AnkiMedBench exposes model limitations invisible to existing benchmarks, enabling more reliable evaluation of systems for medical education, clinical documentation, and diagnostic support where structured reasoning over taxonomic hierarchies is essential.

Keywords

AnkiMedBench; clinical reasoning; embedding models; hierarchical classification; medical benchmarks; taxonomic evaluation

Disciplines

Artificial Intelligence and Robotics | Computer Engineering | Computer Sciences | Medicine and Health Sciences

File Format

PDF

File Size

16400 KB

Degree Grantor

University of Nevada, Las Vegas

Language

English

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/

