Can you trust your experiments? Generalizability of Experimental Studies

TMLR Paper 4532 Authors

21 Mar 2025 (modified: 14 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: Experimental studies are a cornerstone of Machine Learning (ML) research. A common and often implicit assumption is that a study's results will generalize beyond the study itself, e.g., to new data; that is, repeating the same study under different conditions will likely yield similar results. Existing frameworks for measuring generalizability, borrowed from the causal inference literature, cannot capture the complexity of the results and the goals of an ML study. The problem of measuring generalizability in the more general ML setting thus remains open, in part due to the lack of a mathematical formalization of experimental studies. In this paper, we propose such a formalization, use it to develop a framework for quantifying generalizability, and propose an instantiation based on rankings and the Maximum Mean Discrepancy (MMD). We show how the latter offers insights into the desirable number of experiments for a study. Finally, we investigate the generalizability of two recently published experimental studies.
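The abstract does not spell out the construction, but the following minimal Python sketch shows what an MMD between two samples of method rankings could look like, assuming a Mallows (Kendall-distance) kernel. The kernel choice, function names, and parameter lam are illustrative assumptions, not the paper's implementation.

    import numpy as np
    from itertools import combinations

    def kendall_distance(a, b):
        # Number of item pairs that rankings a and b order differently.
        # a[i] is the rank that ranking a assigns to item i.
        disagree = 0
        for i, j in combinations(range(len(a)), 2):
            if (a[i] - a[j]) * (b[i] - b[j]) < 0:
                disagree += 1
        return disagree

    def mallows_kernel(a, b, lam=0.1):
        # Mallows kernel exp(-lam * Kendall distance); positive definite,
        # so it induces a valid MMD. lam is an assumed bandwidth choice.
        return np.exp(-lam * kendall_distance(a, b))

    def mmd2(X, Y, kernel=mallows_kernel):
        # Biased estimate of the squared MMD between two samples of
        # rankings X and Y (each a list of rank arrays).
        kxx = np.mean([[kernel(x, xp) for xp in X] for x in X])
        kyy = np.mean([[kernel(y, yp) for yp in Y] for y in Y])
        kxy = np.mean([[kernel(x, y) for y in Y] for x in X])
        return kxx + kyy - 2 * kxy

    # Usage: rankings of 4 methods from two hypothetical repetitions of a study.
    X = [np.array([0, 1, 2, 3]), np.array([0, 2, 1, 3])]
    Y = [np.array([3, 2, 1, 0]), np.array([2, 3, 0, 1])]
    print(mmd2(X, Y))  # larger values indicate less similar result distributions

Under this reading, a study generalizes well when rankings from repeated runs form distributions with a small MMD; how the paper actually instantiates the kernel and estimator may differ.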
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Junpei_Komiyama1
Submission Number: 4532
