Revisiting the Scaling Effects of LLMs on Medical Reasoning Capabilities

28 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: LLM evaluation, scaling effect, medical reasoning evaluation
TL;DR: We conduct an empirical study of how parameter count and dataset size affect LLMs' medical reasoning capabilities.
Abstract: Recently, LLMs such as the Llama and Qwen families have rapidly improved by significantly scaling their training corpora, with smaller models trained on larger datasets now approaching or surpassing the performance of previous-generation larger models on public benchmarks. In this paper, we revisit the scaling effects of LLMs, using the medical field as a case study, by carefully analyzing how training-corpus size and parameter count affect model performance on problems of varying difficulty. To this end, we present MedResEval, a new benchmark built upon the MedQA dataset. It is designed to demand more complex reasoning and decision-making and to reflect real-world medical scenarios more accurately. Leveraging MedResEval, we investigate the scaling effects of training-corpus and model size in LLMs through a comprehensive analysis of several prominent LLM families on medical reasoning tasks of varying complexity. The results reveal that while smaller models like Llama 3 (8B) approach the performance of older, larger models like Llama 2 (70B) on simple tasks like MedQA, they consistently underperform on complex tasks requiring advanced reasoning. Furthermore, we develop a difficulty-dependent scaling-law formula that characterizes how an LLM's performance varies with training data size at a fixed parameter count. The quantitative study reveals that reasoning error reduction rates are 1.3 times greater for large LLMs ($\approx$ 70B) than for small LLMs ($\leq$ 10B) on simple tasks, and 2 times greater on complex reasoning tasks. Our study highlights that while both data and parameter scale enhance LLM performance, greater emphasis must be placed on parameter scale, particularly for complex reasoning tasks. Only LLMs with sufficiently large parameter counts can effectively tackle the complexities of real-world medical scenarios.
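The abstract does not give the exact form of the difficulty-dependent scaling law, but such laws are commonly expressed as a power-law decay of error with training data size, with the decay exponent depending on task difficulty. The sketch below is purely illustrative (the function, the synthetic data, and the exponents are assumptions, not the paper's results): it fits $E(D) = a \cdot D^{-b}$ in log space and shows how a larger exponent $b$ corresponds to a faster error reduction rate.

```python
# Illustrative sketch (not the paper's actual formula or data): fit a
# power-law scaling curve E(D) = a * D**(-b), where E is error rate and
# D is training data size, via least squares in log-log space.
import numpy as np

def fit_power_law(data_sizes, errors):
    """Fit log(E) = log(a) - b*log(D) by ordinary least squares.

    Returns (a, b); a larger b means error falls faster as data grows.
    """
    slope, intercept = np.polyfit(np.log(data_sizes), np.log(errors), 1)
    return np.exp(intercept), -slope

# Synthetic curves: here the "complex task" is given a larger decay
# exponent purely to illustrate a difficulty-dependent law.
D = np.array([1e9, 1e10, 1e11, 1e12])          # tokens of training data
simple_err = 0.8 * D ** -0.05                   # hypothetical simple task
complex_err = 0.9 * D ** -0.10                  # hypothetical complex task

a_s, b_s = fit_power_law(D, simple_err)
a_c, b_c = fit_power_law(D, complex_err)
print(f"simple task exponent:  b = {b_s:.3f}")  # recovers 0.05
print(f"complex task exponent: b = {b_c:.3f}")  # recovers 0.10
```

Because the synthetic points lie exactly on a power law, the fit recovers the exponents exactly; with real benchmark accuracies, the same log-log regression would yield noisy estimates of $a$ and $b$ per task-difficulty bucket.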
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14061