SPEX: Scaling Feature Interaction Explanations for LLMs

Published: 05 Mar 2025, Last Modified: 02 Apr 2025 · SLLM · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: Interpretability, Signal Processing, Sparsity, SHAP, Interactions
Abstract:

Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000$). SPEX exploits underlying natural sparsity among interactions—common in real-world data—and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Code at https://github.com/basics-lab/spectral-explain.
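To make the sparsity idea concrete, the following is a minimal, hypothetical sketch rather than the SPEX algorithm itself (which uses a sparse Fourier transform with a channel-decoding algorithm, per the abstract). It queries a toy masked value function and recovers its few important interactions with a Lasso fit over low-order Fourier (parity) features; the function `value_function`, the sampling budget, and the Lasso stand-in are illustrative assumptions.

```python
# Illustrative sketch only: recovering a sparse interaction representation
# of a masked value function f: {0,1}^n -> R. SPEX itself uses a sparse
# Fourier transform with channel decoding; here a simple Lasso over
# low-order parity features is a hypothetical stand-in for the sparsity idea.
import itertools
import numpy as np
from sklearn.linear_model import Lasso

n = 10  # number of input features (tokens); SPEX targets n on the order of 1000

def value_function(mask: np.ndarray) -> float:
    # Toy black-box output; mask[i] = 1 keeps feature i, 0 masks it out.
    # Only a handful of interactions matter, i.e., the spectrum is sparse.
    return 2.0 * mask[0] + 1.5 * mask[1] * mask[2] - 1.0 * mask[3] * mask[4] * mask[5]

# Sample random masks and query the black-box value function.
rng = np.random.default_rng(0)
masks = rng.integers(0, 2, size=(500, n))
y = np.array([value_function(m) for m in masks])

# Fourier (parity) features up to order 3: chi_S(m) = prod_{i in S} (-1)^{m_i}.
subsets = [S for k in range(1, 4) for S in itertools.combinations(range(n), k)]
X = np.array([[np.prod((-1.0) ** m[list(S)]) for S in subsets] for m in masks])

# Sparse recovery: most coefficients are near zero; the few large ones
# mark the important feature interactions.
lasso = Lasso(alpha=0.05).fit(X, y)
top = sorted(zip(subsets, lasso.coef_), key=lambda t: -abs(t[1]))[:5]
for S, c in top:
    print(f"interaction {S}: coefficient {c:+.3f}")
```

The recovered coefficients can then be used to reconstruct the value function on unseen masks, which is the kind of faithfulness measure the abstract refers to when comparing SPEX against marginal attribution methods.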

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 25