LAMP: Extracting Local Decision Surfaces from Large Language Models

Published: 03 Feb 2026, Last Modified: 03 Feb 2026 · AISTATS 2026 Spotlight · CC BY 4.0
Abstract: We introduce \textbf{LAMP} (\textbf{L}ocal \textbf{A}ttribution \textbf{M}apping \textbf{P}robe), a method that probes a black-box language model's decision surface and studies how reliably the model maps its stated reasons to its reported predictions. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links the importance weights the model assigns to those explanations to the model's output. This reveals how strongly the stated factors actually steer the model's decisions. We apply LAMP to three tasks: \textit{sentiment analysis}, \textit{controversial-topic detection}, and \textit{safety-prompt auditing}. Across these tasks, the locally approximated linear decision surfaces produced by LAMP largely agree with human judgments of explanation quality and, on a clinical case-file dataset, align with expert assessments. Because LAMP requires no access to model gradients, logits, or internal activations, it offers a practical, lightweight framework for auditing proprietary language models and for assessing whether a model behaves consistently with the explanations it provides.
Submission Number: 2044
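For intuition, the sketch below is one plausible instantiation of the kind of locally linear surrogate fit the abstract describes, not the paper's actual LAMP procedure. The factor names, the `query_model` stub, the perturbation scheme, and the choice of ridge regression are all illustrative assumptions.

```python
# Hypothetical sketch: fit a locally linear surrogate over a model's
# self-reported explanation factors. Everything below (factor names, the
# simulated black-box call, the perturbation scale, ridge regression) is an
# assumption for illustration, not the published LAMP method.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Suppose the model, asked to explain a sentiment prediction, reports these
# factors with self-assigned importance weights (hypothetical values).
factors = ["positive wording", "sarcasm cues", "product defects mentioned"]
reported_weights = np.array([0.7, 0.1, 0.4])

def query_model(weights: np.ndarray) -> float:
    """Placeholder for a black-box call that re-asks the model for a score
    while emphasizing or de-emphasizing each stated factor (simulated here)."""
    true_surface = np.array([1.2, -0.3, -0.8])  # unknown to the auditor
    return float(weights @ true_surface + rng.normal(scale=0.05))

# Perturb the reported weights in a small neighborhood and record the outputs.
perturbations = reported_weights + rng.normal(scale=0.15, size=(200, len(factors)))
outputs = np.array([query_model(w) for w in perturbations])

# Fit the locally linear surrogate: its coefficients estimate how much each
# stated factor actually steers the black-box prediction near this input.
surrogate = Ridge(alpha=1.0).fit(perturbations, outputs)
for name, coef in zip(factors, surrogate.coef_):
    print(f"{name:>28s}: local slope = {coef:+.2f}")
```

Comparing the fitted local slopes with the model's self-reported weights is one way to quantify whether the stated factors genuinely drive the prediction.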