OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models

Published: 29 Feb 2024, Last Modified: 01 Mar 2024, AAAI 2024 Spring Symposium Series on Clinical Foundation Models, CC BY 4.0
Track: Non-traditional track
Keywords: LLMs, Medical AI, Prompt Engineering, Generative AI for Medicine
TL;DR: We present a prompting approach for open-source LLMs which achieves state-of-the-art performance on medical benchmarks with prompt engineering alone and without fine-tuning.
Abstract: LLMs have become increasingly capable of accomplishing a range of specialized tasks and can be utilized to expand access to medical knowledge. Many researchers have attempted to leverage LLMs for medical applications, and a range of medical benchmarks has been developed to test the acuity of these models on healthcare-specific tasks. Most LLMs developed for medical applications have involved significant amounts of fine-tuning, requiring specialized medical data and large amounts of computational power. Additionally, many of the top-performing models are proprietary, with access limited to all but a few research groups. However, open-source (OS) models represent a key area of growth for medical LLMs due to significant improvements in performance and an inherent ability to provide the transparency and compliance required in the healthcare space. We present OpenMedLM, a prompting platform that delivers state-of-the-art (SOTA) performance for OS LLMs on medical benchmarks. Through a series of robust prompt engineering techniques, we found that OpenMedLM delivers SOTA results on three common medical LLM benchmarks, surpassing the previous best-performing OS models, which leveraged extensive fine-tuning. These results demonstrate the ability of OS foundation models to offer strong performance while alleviating the challenges associated with fine-tuning. Our results highlight medical-specific emergent properties in OS LLMs that have not yet been documented elsewhere and showcase the need to understand how else prompt engineering can improve the performance of LLMs for medical applications.
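To make the class of techniques the abstract refers to concrete, below is a minimal sketch of few-shot chain-of-thought prompting combined with self-consistency (majority voting over sampled reasoning chains), a common way prompt engineering is applied to multiple-choice medical QA. This is an illustrative assumption, not the authors' released code: the `generate` function, the `FEW_SHOT` exemplar, and the answer format are hypothetical placeholders to be replaced with your own open-source model call and prompt template.

```python
# Hypothetical sketch: few-shot chain-of-thought prompting with
# self-consistency voting for multiple-choice medical QA.
# NOT the OpenMedLM implementation; names and formats are assumptions.
import re
from collections import Counter

# One worked exemplar; in practice several are concatenated.
FEW_SHOT = """Question: Which vitamin deficiency causes scurvy?
Options: (A) Vitamin A (B) Vitamin B12 (C) Vitamin C (D) Vitamin D
Reasoning: Scurvy results from impaired collagen synthesis, which
requires ascorbic acid as a cofactor. Ascorbic acid is vitamin C.
Answer: (C)
"""

def generate(prompt: str, temperature: float) -> str:
    """Placeholder: swap in a completion call to any open-source LLM."""
    raise NotImplementedError

def answer_with_self_consistency(question: str, options: str,
                                 n_samples: int = 5) -> str:
    prompt = (FEW_SHOT
              + f"Question: {question}\nOptions: {options}\nReasoning:")
    votes = []
    for _ in range(n_samples):
        # Nonzero temperature yields diverse reasoning chains to vote over.
        completion = generate(prompt, temperature=0.7)
        match = re.search(r"Answer:\s*\(([A-D])\)", completion)
        if match:
            votes.append(match.group(1))
    # Majority vote across the sampled chains picks the final answer.
    return Counter(votes).most_common(1)[0][0] if votes else ""
```

The key design point is that no model weights change: all of the gains come from the prompt (exemplars plus elicited reasoning) and from aggregating multiple sampled completions.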
Presentation And Attendance Policy: I have read and agree with the symposium's policy on behalf of myself and my co-authors.
Ethics Board Approval: No, our research does not involve datasets that need IRB approval or its equivalent.
Data And Code Availability: No, we will not be making any data and/or code public.
Primary Area: Clinical foundation models
Student First Author: No, the primary author of the manuscript is NOT a student.
Submission Number: 24