Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

ACL ARR 2024 June Submission 1394 Authors

14 Jun 2024 (modified: 03 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Medical knowledge is context-dependent and requires consistent reasoning across semantically equivalent natural language expressions. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop of 1–10%. Furthermore, we identify a potential source of this fragility: contamination of test data in widely used pre-training datasets.
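The core perturbation the abstract describes is a brand-to-generic name swap applied to benchmark questions. Below is a minimal sketch of that idea; the two-entry mapping and the function name `swap_drug_names` are illustrative assumptions, not the paper's physician-annotated list or actual code.

```python
import re

# Tiny illustrative sample of a brand -> generic mapping (hypothetical;
# the paper uses physician expert annotations for this step).
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
}

def swap_drug_names(text: str, mapping: dict) -> str:
    """Replace each brand name with its generic equivalent (whole words only)."""
    for brand, generic in mapping.items():
        text = re.sub(rf"\b{re.escape(brand)}\b", generic, text, flags=re.IGNORECASE)
    return text

question = "A patient reports taking Advil and Tylenol for pain relief."
print(swap_drug_names(question, BRAND_TO_GENERIC))
# -> "A patient reports taking ibuprofen and acetaminophen for pain relief."
```

Evaluating a model on both the original and swapped versions of each question would then expose the performance gap the paper reports.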
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation of datasets, robustness, clinical NLP
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 1394