Can you trust your experiments? Generalizability of Experimental Studies

TMLR Paper4532 Authors

21 Mar 2025 (modified: 26 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: Experimental studies are a cornerstone of Machine Learning (ML) research. A common and often implicit assumption is that a study's results will generalize beyond the study itself, e.g., to new data: repeating the same study under different conditions should yield similar results. Existing frameworks for measuring generalizability, borrowed from the causal inference literature, cannot capture the complexity of the results and the goals of an ML study. The problem of measuring generalizability in the more general ML setting therefore remains open, in part because experimental studies lack a mathematical formalization. In this paper, we propose such a formalization, use it to develop a framework for quantifying generalizability, and propose an instantiation based on rankings and the Maximum Mean Discrepancy. We show how the latter offers insight into the desirable number of experiments for a study. Finally, we investigate the generalizability of two recently published experimental studies.
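To make the ranking-plus-MMD instantiation mentioned in the abstract more concrete, here is a minimal sketch of how one might estimate the Maximum Mean Discrepancy between two sets of method rankings. The Kendall-tau kernel on permutations, the unbiased MMD² estimator, and the toy rankings are assumptions chosen for illustration; the paper's actual instantiation may differ.

```python
# A minimal sketch of comparing two groups of experiment outcomes
# (e.g., per-dataset rankings of methods) via the Maximum Mean Discrepancy.
import numpy as np
from scipy.stats import kendalltau


def kendall_kernel(r1, r2):
    """Kendall-tau correlation used as a kernel between two rankings."""
    tau, _ = kendalltau(r1, r2)
    return tau


def mmd2_unbiased(X, Y, kernel=kendall_kernel):
    """Unbiased estimate of MMD^2 between two samples of rankings."""
    m, n = len(X), len(Y)
    k_xx = sum(kernel(X[i], X[j]) for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(kernel(Y[i], Y[j]) for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_xy = sum(kernel(x, y) for x in X for y in Y) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


# Hypothetical usage: rankings of 4 methods obtained under two study conditions.
study_a = [np.array([1, 2, 3, 4]), np.array([1, 3, 2, 4]), np.array([2, 1, 3, 4])]
study_b = [np.array([4, 3, 2, 1]), np.array([3, 4, 1, 2]), np.array([4, 2, 3, 1])]
print(f"MMD^2 estimate: {mmd2_unbiased(study_a, study_b):.3f}")  # larger => the two studies disagree more
```

A value near zero would suggest the two conditions produce similar rankings (results that generalize across them), while a large value indicates the study's conclusions are sensitive to the conditions under which it was run.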
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In light of the reviewers' comments, we revised the manuscript to clarify key definitions, improve notation consistency, and better explain the interpretation of experimental results. Specific changes include clarifications to the concept of generalizability, the role of experimental conditions, the estimation of $n^*$, and the interpretation of tied rankings. We also updated figure captions, added examples, and expanded explanations in Sections 3–5 and the appendix for greater transparency and readability.
Assigned Action Editor: ~Junpei_Komiyama1
Submission Number: 4532