Landmark and Compare: Addressing the Algorithm Selection Problem via Problem Space Similarity

Published: 15 Oct 2025, Last Modified: 31 Oct 2025 | BNAIC/BeNeLearn 2025 Poster | CC BY 4.0
Track: Type E (Late-Breaking Abstracts)
Keywords: Algorithm Selection Problem, Landmarking, Data Generating Distribution, Meta Learning, Dataset Shift
Abstract: \textit{Introduction} In official statistics, researchers from different national statistical offices often work on similar machine learning (ML) tasks, such as imputing non-response items in surveys, but each on its own local dataset. As part of a collaborative effort, would one group benefit from adopting the other group's ML algorithm? Answering this question is non-trivial, as implementing a new ML pipeline from feature extraction to predictions is often a significant effort. This practical challenge of identifying the most effective algorithm for a given computational task is known as the Algorithm Selection Problem (ASP) \cite{ref_rice1}. In this study, we limited ourselves to continuous response and feature variables. \textit{Methods} Our initial investigation explored the creation of a normalized metric for comparing algorithm performance across datasets on a universal scale. However, this approach faces a fundamental theoretical constraint rooted in the Expected Prediction Error (EPE). The EPE contains an irreducible error term ($\sigma^2_{\epsilon}$) that comes from the data generating distribution (DGD) $P_{(X,Y)}$ \cite{ref_wolpert1,ref_hastie1}. This term creates a dataset-specific error floor (i.e., a performance ceiling) for an algorithm $A_i$ under $D \sim P_{(X,Y)}$: \begin{equation} \mathbb{E}_D[\text{error}_D(A_i)] \in [\sigma^2_{\epsilon}, \infty). \end{equation} Because this floor differs between DGDs, raw or normalized scores are not directly comparable across datasets. This insight shifts the perspective from "Are performance scores similar?" to "Are the learning problems functionally similar?", aligning more closely with the No Free Lunch theorem \cite{ref_wolpert1}. We hypothesize that \textit{stable algorithm rankings serve as a robust proxy for the functional similarity of DGDs}. If two DGDs are similar, applying a diverse set of algorithms to both should yield similar rank orders and thus a high rank correlation (Kendall's $\tau \geq 0.4$).
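As an illustration of the rank-correlation check described above, the following is a minimal Python sketch, not the authors' implementation. It computes Kendall's $\tau$ (the tau-a variant, assuming no tied ranks) between the rank orders of the same algorithms on two datasets. The function name `kendall_tau` and the rank values are hypothetical and chosen only for illustration; the 0.4 cutoff is the one stated in the abstract.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same algorithms.

    rank_a, rank_b: dicts mapping algorithm name -> rank (1 = best).
    Assumes both dicts share the same keys and contain no tied ranks.
    """
    algos = sorted(rank_a)
    concordant = discordant = 0
    for x, y in combinations(algos, 2):
        # Positive product: the pair is ordered the same way in both rankings.
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(algos) * (len(algos) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rank orders (made up for illustration, not results from the study).
ranks_a = {"LR": 5, "ENet": 4, "RF": 2, "XGB": 1, "SVM": 3}
ranks_b = {"LR": 4, "ENet": 5, "RF": 1, "XGB": 2, "SVM": 3}

tau = kendall_tau(ranks_a, ranks_b)
rankings_stable = tau >= 0.4  # cutoff quoted in the abstract
```

With these made-up ranks, 8 of the 10 algorithm pairs are concordant and 2 are discordant, so $\tau = 0.6$ and the rankings would count as stable under the stated cutoff.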
However, verifying the rank correlation requires fitting an identical set of candidate algorithms twice (once on each dataset), which is the very task we aim to avoid. We break this paradox with an insight from our initial exploratory simulations: algorithm rank stability is primarily driven by just two factors, (i) the noise proportion and (ii) the curvature (deviation from linearity) of the observable signal. Our "Landmark and Compare" framework \cite{ref_kodd_thesis} estimates these two meta-features via landmarking \cite{ref_pfahringer1}, which is the process of fitting a diverse set of algorithms using default hyperparameters and no additional tuning. We approximate the noise proportion as $1 - \max(\text{NSE})$\footnote{Nash–Sutcliffe Efficiency (NSE): an analog of predictive $R^2$.} across the landmarking algorithms, and the signal curvature as the NRMSE\footnote{Normalized Root Mean Squared Error (NRMSE): RMSE divided by the response standard deviation.} of a simple linear regression. The crux of the framework is its sequential decision rule. First, test whether the estimated noise proportions of the two datasets lie within a predetermined threshold of each other (to allow for sampling variability). Only if the noise proportions are similar can the NRMSE (curvature) values be compared; otherwise, the noise ($\sigma^2_{\epsilon}$) confounds the curvature comparison because the (normalized) EPE scales differ. Second, if the curvature values also lie within their own threshold, the datasets are declared similar. If either test fails, the datasets are declared dissimilar. \textit{Results} The decision thresholds of 0.2 (noise) and 0.15 (curvature) were empirically calibrated via grid search on 45 synthetic DGDs that varied in signal complexity and noise proportion. The framework was then validated on five real-world datasets by choosing one as the reference and treating the other four as non-reference datasets.
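The two meta-features and the sequential decision rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, predictions are assumed to be held-out predictions from already fitted landmarkers, "within a threshold" is read as an absolute difference no larger than the threshold, and the NRMSE uses the population standard deviation of the response. The default thresholds 0.2 and 0.15 are the calibrated values quoted in the abstract.

```python
import math
from statistics import fmean, pstdev


def nse(y_true, y_pred):
    """Nash-Sutcliffe Efficiency: 1 - SSE / SST."""
    mean_y = fmean(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - sse / sst


def nrmse(y_true, y_pred):
    """RMSE divided by the (population) standard deviation of the response."""
    mse = fmean([(y - p) ** 2 for y, p in zip(y_true, y_pred)])
    return math.sqrt(mse) / pstdev(y_true)


def meta_features(y_true, landmark_preds, linear_pred):
    """Return (noise, curvature) estimates for one dataset.

    landmark_preds: dict mapping landmarker name -> list of predictions.
    linear_pred: predictions of a simple linear regression.
    """
    noise = 1.0 - max(nse(y_true, p) for p in landmark_preds.values())
    curvature = nrmse(y_true, linear_pred)
    return noise, curvature


def datasets_similar(mf_a, mf_b, noise_tol=0.2, curv_tol=0.15):
    """Sequential rule: compare noise first, curvature only if noise matches."""
    noise_a, curv_a = mf_a
    noise_b, curv_b = mf_b
    if abs(noise_a - noise_b) > noise_tol:
        # Noise levels differ too much; a curvature comparison would be confounded.
        return False
    return abs(curv_a - curv_b) <= curv_tol
```

For example, `datasets_similar((0.10, 0.30), (0.20, 0.40))` returns `True` because both differences fall within their thresholds, while `datasets_similar((0.10, 0.30), (0.50, 0.30))` returns `False` at the noise step without ever comparing curvature.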
Landmarking sets A (linear regression, elastic net, RF, XGB, SVM) and B (replacing elastic net with a GAM) were used to estimate the meta-features for the decision rule. The ground-truth rank correlations confirmed the resulting similarity predictions (4/4 correct), demonstrating robustness to minor variations between landmarking sets A and B. \textit{Discussion} The proposed framework provides researchers with a new tool to assist them in their collaborative efforts. Instead of focusing on normalized metrics, which are EPE-dependent, the framework focuses on problem space similarity in a computationally inexpensive and easy-to-implement manner. However, the decision thresholds should be viewed only as sensible default hyperparameters and not as universal constants. Additionally, future work should further investigate edge cases of the proposed curvature metric, as well as the robustness of the entire framework when landmarking sets A and B differ more widely, to replicate more realistic collaboration scenarios.
Serve As Reviewer: ~Bernhard_Pfahringer1
Submission Number: 25