Keywords: language model
Abstract: Chain-of-Thought (CoT) enables large language models (LLMs) to tackle complex reasoning tasks by generating intermediate steps. Although CoT provides opportunities for improved interpretability and facilitates AI safety monitoring, consistency between the generated CoT and the model's actual reasoning process is not guaranteed. Models can output seemingly reasonable CoT that fails to reflect the true computational trajectory leading to the final answer. In this work, we introduce a novel approach called CoT Inversion to evaluate CoT faithfulness. We cast the problem in a probabilistic framework, viewing the genuine reasoning chain as a latent variable that mediates between the input and the answer. Leveraging variational inference with a scoring function, we infer this hidden CoT by effectively reversing the model's answer generation process; under our chosen variational family, the optimization reduces to an instance of the Expectation Maximization (EM) algorithm.
Furthermore, we propose an explicit alignment objective that promotes similarity between the inferred latent CoT and the model's directly generated CoT, treating the explicit CoT as an informative but possibly unfaithful signal. Our approach enables quantitative assessment of the agreement between articulated and inferred reasoning processes, offering a practical metric of CoT faithfulness and strengthening our ability to interpret and trust the reasoning of language models.
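To make the latent-variable view concrete, here is a minimal toy sketch (not the paper's implementation) of EM-style inference over a discrete set of candidate reasoning chains. The candidate names and scores are hypothetical stand-ins for model log-probabilities; the E-step computes a posterior over latent CoTs, and the M-step updates a categorical prior in closed form.

```python
import math

# Hypothetical candidate latent CoTs; in practice these would be sampled chains.
candidates = ["chain_a", "chain_b", "chain_c"]

# p(z | x): categorical prior over latent CoTs, initialized uniform.
log_prior = {z: math.log(1.0 / len(candidates)) for z in candidates}

# p(answer | x, z): fixed toy log-likelihood scores (stand-ins for a scoring function).
log_lik = {"chain_a": -0.5, "chain_b": -2.0, "chain_c": -3.0}

for _ in range(10):
    # E-step: posterior q(z) proportional to p(z | x) * p(answer | x, z).
    logits = {z: log_prior[z] + log_lik[z] for z in candidates}
    m = max(logits.values())  # subtract the max for numerical stability
    weights = {z: math.exp(v - m) for z, v in logits.items()}
    total = sum(weights.values())
    q = {z: w / total for z, w in weights.items()}

    # M-step: for a single categorical latent, the updated prior is the posterior.
    log_prior = {z: math.log(q[z]) for z in candidates}

# The inferred latent CoT is the posterior mode.
best = max(q, key=q.get)
```

This toy collapses onto the chain that best explains the answer; the paper's setting replaces the fixed scores with model-derived quantities and a variational family over full reasoning chains.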
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1216