Track: Scientific Track
Keywords: Large Language Models, Hypothesis Testing, Statistics, Prompting
TL;DR: LLMs are helpful assistants for hypothesis testing, but not yet substitutes for statistical software.
Abstract: Statistical hypothesis testing is a cornerstone of evidence-based medicine and clinical research. Despite its central importance, previous research has consistently shown substantial deficits in statistical literacy among healthcare professionals. At the same time, large language models (LLMs) have demonstrated remarkable capabilities in scientific reasoning and data analysis. This study examines whether LLMs can serve as viable substitutes for conventional statistical software in guiding users through the selection, execution, and interpretation of hypothesis tests. Using a standardized prompt based on real survey data on the association between kick-scooter riding and knee pain in children, we evaluated seven LLMs and compared their outputs with statistical software results. Our findings indicate that none of the evaluated models can currently be considered a viable substitute. Although all models selected the appropriate test, substantial variation was observed in the quality of their explanations and in test execution. Gemini 3.1 Pro Preview, Claude Opus 4.6, and ChatGPT 5.4 Thinking performed strongly in test selection and result interpretation, with Gemini producing the most structured responses. However, none of the models reproduced the statistical software's results in test execution.
Submission Number: 49