A comprehensive evaluation of large language models on benchmark biomedical text processing tasks

Published: 01 Jan 2024, Last Modified: 03 Oct 2024, Comput. Biol. Medicine 2024, CC BY-SA 4.0
Abstract: Highlights
• Evaluated GPT-3.5, PaLM-2, Claude-2, and LLaMA-2 on 6 biomedical tasks (26 datasets).
• Performance of each Large Language Model (LLM) across various tasks may vary.
• Found that not a single LLM can achieve superiority over other LLMs in all tasks.
• Observed that LLMs could be useful in biomedical tasks that lack large annotated data.
• Our findings will help identify the best zero-shot LLM for a particular biomedical task.
