Abstract: Highlights
• Evaluated GPT-3.5, PaLM-2, Claude-2, and LLaMA-2 on 6 biomedical tasks (26 datasets).
• The performance of each Large Language Model (LLM) varies across tasks.
• No single LLM outperforms the others on all tasks.
• LLMs could be useful in biomedical tasks that lack large annotated datasets.
• Our findings will help identify the best zero-shot LLM for a particular biomedical task.