Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving
TL;DR: We construct a multi-cognitive-level medical evaluation framework for LLMs and conduct a systematic evaluation of several well-known LLM families.
Abstract: Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from six prominent families: Llama, Qwen, Gemma, Phi, GPT, and DeepSeek. Our findings reveal a significant performance decline across the evaluated models as cognitive complexity increases, with model size playing a more critical role at higher cognitive levels. Our study highlights the need to enhance LLMs' medical capabilities at higher cognitive levels and provides insights for developing LLMs suited to real-world medical applications.
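To make the framework concrete, below is a minimal sketch of how a three-level evaluation harness of this kind could be wired up. The example items, the prompt-to-answer model interface, and the exact-match scoring are illustrative assumptions, not the paper's actual datasets or code.

```python
# Minimal sketch of a three-level medical evaluation harness in the spirit
# of the paper's framework. The benchmark items, model interface, and
# exact-match metric below are illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalItem:
    prompt: str  # question or clinical vignette
    answer: str  # reference answer, e.g. an option letter

# Hypothetical items grouped by the paper's three cognitive levels.
BENCH: dict[str, list[EvalItem]] = {
    "preliminary_knowledge_grasp": [
        EvalItem("Which vitamin deficiency causes scurvy? (A) C (B) D", "A"),
    ],
    "comprehensive_knowledge_application": [
        EvalItem("A drug inhibits ACE. Which lab value may rise? (A) K+ (B) Na+", "A"),
    ],
    "scenario_based_problem_solving": [
        EvalItem("65M, crushing chest pain, ST elevation. Next step? (A) PCI (B) rest", "A"),
    ],
}

def evaluate(model: Callable[[str], str]) -> dict[str, float]:
    """Return exact-match accuracy per cognitive level."""
    return {
        level: sum(model(it.prompt).strip() == it.answer for it in items) / len(items)
        for level, items in BENCH.items()
    }

if __name__ == "__main__":
    # Trivial stand-in "model" that always answers "A".
    print(evaluate(lambda prompt: "A"))
```

A per-level breakdown like this, rather than a single aggregate score, is what lets the kind of decline the abstract reports (accuracy dropping as cognitive complexity rises) become visible.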
Lay Summary: Large language models (LLMs), like ChatGPT, have shown impressive results on medical tests, but it’s still unclear how well they understand and reason through medical problems of different complexity. In this study, we designed a new way to evaluate LLMs’ medical abilities by testing them at three levels of thinking: basic knowledge grasp, complex knowledge application, and scenario-based problem-solving. We used this method to assess a wide range of popular LLMs, including those developed by Meta, OpenAI, and others. We found that while most models handle simple questions well, they struggle more as the problems become more complex. Larger models tend to perform better on these harder tasks. Our results suggest that future improvements should focus on helping LLMs think more like doctors when faced with complex clinical situations.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Medical Evaluation, Bloom's Taxonomy
Submission Number: 10697