Keywords: Large Language Models, Interpretability, Argumentation
TL;DR: We introduce Latent Debate, a framework for interpreting model predictions through the lens of implicit internal debates.
Abstract: Understanding the internal thinking process of Large Language Models (LLMs) and the causes of hallucinations remains a key challenge.
To this end, we introduce \emph{latent debate}, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike existing work on self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference step.
We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks.
Empirical studies demonstrate that latent debate serves as a faithful surrogate model whose predictions are highly consistent with those of the original LLM.
Further analysis reveals strong correlations between hallucinations and debate patterns.
These findings position latent debate as a promising framework for understanding the internal mechanisms of LLMs, especially in scenarios where internal (dis)agreement arises during inference.
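To make the idea of aggregating hidden supporting and attacking signals concrete, here is a minimal toy sketch, assuming a bipolar-argumentation-style reading of latent debate; the class names, scalar strengths, and the net-support aggregation rule are all illustrative assumptions, not the paper's actual instantiation.

```python
# Illustrative sketch only -- NOT the paper's method. It assumes hidden
# signals within a single inference step can be split into supporters and
# attackers of the answer "True", and that a simple net-support rule
# decides the final True/False prediction.

from dataclasses import dataclass

@dataclass
class LatentArgument:
    name: str        # hypothetical label for a hidden signal
    strength: float  # assumed scalar strength in [0, 1]
    supports: bool   # True = supports answer "True", False = attacks it

def aggregate(arguments: list[LatentArgument]) -> bool:
    """Toy aggregation: the sign of the net support gives the prediction."""
    net = sum(a.strength if a.supports else -a.strength for a in arguments)
    return net > 0.0

if __name__ == "__main__":
    # Hypothetical hidden signals extracted from a single inference step.
    args = [
        LatentArgument("layer-12 feature", 0.8, supports=True),
        LatentArgument("layer-20 feature", 0.5, supports=False),
        LatentArgument("final-layer feature", 0.3, supports=True),
    ]
    print("Predicted answer:", aggregate(args))  # -> True (net support 0.6)
```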
Primary Area: interpretability and explainable AI
Submission Number: 13918