Efficient Estimation of Kernel Surrogate Models for Task Attribution

ICLR 2026 Conference Submission 14615 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Model interpretability, Data attribution, Kernel methods
Abstract: Modern AI systems, including LLMs, are trained on diverse tasks (e.g., translation, code generation, math reasoning, text prediction) simultaneously. A key challenge is to quantify the influence of individual training tasks on target task performance --- a problem we term \textit{task attribution}. A natural solution is leave-one-out retraining, where each task is removed and the model is retrained to measure its effect on target performance. However, this approach is computationally prohibitive at scale. We address this challenge using surrogate models that approximate a target task's performance given any subset of training tasks. While prior work has explored linear surrogates, these capture only first-order (linear) effects and do not model nonlinear task interactions such as synergy, antagonism, or XOR-type relationships. We introduce \textit{kernel surrogate models}, which better capture these nonlinear relationships. To make kernel estimation tractable, we develop a gradient-based procedure that leverages a first-order approximation of pretrained models, and we empirically validate its accuracy. Experiments across diverse domains (math reasoning in transformers, in-context learning, and multi-objective reinforcement learning) demonstrate the effectiveness of kernel surrogate models: they achieve a 25\% higher correlation with the leave-one-out ground truth than linear surrogate models and influence functions (among other baselines), establishing a more accurate and scalable solution for task attribution. Using kernel surrogate models for downstream task selection leads to a 40\% improvement in demonstration selection for in-context learning and on multi-objective reinforcement learning benchmarks.
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 14615
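
To make the abstract's recipe concrete, here is a minimal sketch of a kernel surrogate for task attribution: fit a nonlinear regressor on binary task-inclusion vectors and read off leave-one-out attributions from its predictions. This is not the paper's gradient-based estimator; the kernel ridge regressor, the synthetic `target_performance` oracle, and all hyperparameters below are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout): a kernel surrogate maps a 0/1
# task-inclusion vector to target-task performance; attribution is the
# predicted drop from removing each task. NOT the paper's estimator.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n_tasks, n_subsets = 8, 200
linear_weights = rng.normal(size=n_tasks)

def target_performance(mask):
    """Stand-in for measuring target-task performance after (approximate)
    retraining on the task subset encoded by `mask` (one 0/1 entry per task)."""
    synergy = mask[0] * mask[1]  # toy nonlinear interaction between tasks 0 and 1
    return mask @ linear_weights + 0.5 * synergy

# Sample random training-task subsets and evaluate the (toy) target metric.
subsets = rng.integers(0, 2, size=(n_subsets, n_tasks)).astype(float)
y = np.array([target_performance(m) for m in subsets])

# Kernel surrogate: nonlinear in the inclusion vector, unlike a linear surrogate.
surrogate = KernelRidge(kernel="rbf", alpha=1e-2, gamma=0.5).fit(subsets, y)

# Task attribution: predicted performance drop when each task is left out.
full = np.ones((1, n_tasks))
loo = np.ones((n_tasks, n_tasks)) - np.eye(n_tasks)  # row i removes task i
attributions = surrogate.predict(full)[0] - surrogate.predict(loo)
print(np.round(attributions, 3))  # higher score = removing the task hurts more
```

In the paper's setting, the expensive retraining oracle would be replaced by the proposed gradient-based first-order approximation; the surrogate-fitting and leave-one-out scoring steps above are only meant to convey the general shape of the approach.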