A Concept-Level Energy-Based Framework for Interpreting Black-Box Large Language Model Responses

ICLR 2026 Conference Submission 24982 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Black-box large language models, Post-hoc interpretation, Energy-based models, Model-agnostic feature attribution
TL;DR: We propose a framework for training a model-agnostic interpreter that identifies influential prompt components for black-box LLM responses by leveraging a global energy-based training objective.
Abstract: The widespread adoption of proprietary Large Language Models (LLMs) accessed strictly through closed-access APIs has created a critical challenge for their reliable deployment: a fundamental lack of interpretability. In this work, we propose a model-agnostic, post-hoc interpretation framework to address this challenge. Our approach defines an energy model that quantifies the conceptual consistency between prompts and the corresponding LLM-generated responses. We use this energy to guide the training of an interpreter network for a set of target sentences. Once trained, our interpreter operates as an efficient, standalone tool, providing sentence-level importance scores without requiring further queries to the original LLM API or the energy model. These scores quantify how much each prompt sentence influences the generation of specific target sentences. A key advantage is that our framework globally trains a local interpreter, which helps mitigate common biases in LLMs. Our experiments demonstrate that the energy network accurately captures the target LLM's generation patterns. Furthermore, we show that our interpreter effectively identifies the most influential prompt sentences for any given output.
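To make the abstract's pipeline concrete, the following is a minimal, purely illustrative sketch of the core idea: an energy function scores the conceptual consistency between a weighted combination of prompt-sentence representations and a target-sentence representation, and per-sentence importance scores are fit by descending that energy. All names (`energy`, `scores`), the toy random embeddings, the cosine-based energy, and the finite-difference gradient step are assumptions standing in for the paper's actual energy network and training objective, which are not specified in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    # cosine similarity with a small constant to avoid division by zero
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def energy(scores, P, t):
    # lower energy = higher conceptual consistency between the
    # softmax-weighted prompt combination and the target sentence
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return -cosine(w @ P, t)

# toy setup: 4 prompt-sentence embeddings; the target embedding t is a
# noisy copy of sentence 2, so sentence 2 should emerge as most influential
d = 8
P = rng.normal(size=(4, d))
t = P[2] + 0.1 * rng.normal(size=d)

# fit importance scores by gradient descent on the energy
# (finite differences keep the sketch dependency-free)
scores = np.zeros(4)
lr, eps = 0.5, 1e-4
for _ in range(200):
    grad = np.zeros(4)
    for i in range(4):
        e = np.zeros(4)
        e[i] = eps
        grad[i] = (energy(scores + e, P, t) - energy(scores - e, P, t)) / (2 * eps)
    scores -= lr * grad

most_influential = int(np.argmax(scores))
print(most_influential)
```

In the paper's framework the interpreter is a trained network that produces such scores for new inputs without further API or energy-model queries; here the scores are optimized directly for a single example only to show how an energy objective can drive sentence-level attribution.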
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 24982