Abstract: Experimental studies are a cornerstone of Machine Learning (ML) research. A common and often implicit assumption is that the study’s results will generalize beyond the study itself, e.g., to new data. That is, repeating the same study under different conditions will likely yield similar results. Existing frameworks to measure generalizability, borrowed from the causal inference literature, cannot capture the complexity of the results and the goals of an ML study. The problem of measuring generalizability in the broader ML setting thus remains open, partly due to the lack of a mathematical formalization of experimental studies. In this paper, we propose such a formalization, use it to develop a framework to quantify generalizability, and propose an instantiation based on rankings and the Maximum Mean Discrepancy. We show how the latter offers insights into the desirable number of experiments for a study. Finally, we investigate the generalizability of two recently published experimental studies.
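To make the abstract's instantiation concrete, below is a minimal sketch of how a Maximum Mean Discrepancy between two samples of rankings could be computed. This is an illustrative assumption, not the paper's exact instantiation: the Mallows-type kernel based on Kendall tau distance, the bandwidth `lam`, and all function names are hypothetical choices.

```python
# Hypothetical sketch: an MMD-style discrepancy between two collections of
# rankings (e.g., rankings of ML methods produced by repetitions of a study).
# Kernel choice and bandwidth are illustrative assumptions, not the paper's.
import itertools
import numpy as np

def kendall_tau_distance(r1, r2):
    """Count discordant item pairs between two rankings of the same items."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    discordant = 0
    for a, b in itertools.combinations(r1, 2):
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0:
            discordant += 1
    return discordant

def mallows_kernel(r1, r2, lam=0.5):
    """Mallows-type kernel: exp(-lam * Kendall tau distance)."""
    return np.exp(-lam * kendall_tau_distance(r1, r2))

def mmd_squared(X, Y, kernel=mallows_kernel):
    """Biased estimate of squared MMD between two samples of rankings X and Y."""
    kxx = np.mean([[kernel(a, b) for b in X] for a in X])
    kyy = np.mean([[kernel(a, b) for b in Y] for a in Y])
    kxy = np.mean([[kernel(a, b) for b in Y] for a in X])
    return kxx + kyy - 2 * kxy

# Example: rankings of four methods from two repetitions of the same study,
# compared against rankings from a second study.
study_A = [("m1", "m2", "m3", "m4"), ("m1", "m3", "m2", "m4")]
study_B = [("m2", "m1", "m4", "m3"), ("m2", "m4", "m1", "m3")]
print(mmd_squared(study_A, study_B))
```

A small estimated MMD would then suggest the two studies produce similar distributions over rankings, which is one way a generalizability score of this kind could be read.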
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Junpei_Komiyama1
Submission Number: 4532