Eliciting and evaluating generalizable explanations from large reasoning models

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: chain-of-thought, reasoning models, generalizability, interpretability
TL;DR: We explore the generalizability of chain-of-thought (CoT) reasoning by eliciting CoTs through diverse creation methods and evaluating them across models and task settings.
Abstract: Large reasoning models (LRMs) produce a textual chain of thought (CoT) while solving a problem. This CoT is potentially a powerful tool for understanding the problem, surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e., whether they capture general patterns about the underlying problem rather than patterns idiosyncratic to the LRM that produced them. This question is crucial when using LRMs to understand or discover new concepts, e.g., in AI for science. We study it by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations do generalize (i.e., they increase consistency between LRMs) and that this increased generalization correlates with human preference rankings. We further analyze the conditions under which explanations do or do not yield consistent answers and propose a straightforward sentence-level ensembling strategy that improves consistency. These results counsel caution when using LRM explanations to derive new insights and outline a framework for characterizing the generalization of LRM explanations.
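To make the cross-model evaluation concrete, here is a minimal sketch of how the kind of consistency score the abstract describes might be computed. It is not taken from the paper or its supplementary material: the `cross_model_consistency` function, the prompt format, and the stub model callables are all illustrative assumptions.

```python
from collections import Counter
from typing import Callable, Sequence


def cross_model_consistency(
    question: str,
    explanation: str,
    models: Sequence[Callable[[str], str]],
) -> float:
    """Fraction of models agreeing with the majority answer when each
    is prompted with the question plus a fixed explanation (hypothetical
    prompt format; not the paper's actual template)."""
    prompt = f"{question}\n\nExplanation: {explanation}\n\nAnswer:"
    answers = [model(prompt).strip().lower() for model in models]
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)


# Toy stand-ins for LRM calls; a real harness would wrap API clients
# and normalize answers (e.g., extract a final numeric answer) first.
model_a = lambda prompt: "42"
model_b = lambda prompt: "42"
model_c = lambda prompt: "17"

score = cross_model_consistency(
    "What is 6 * 7?",
    "Multiply 6 by 7 to get the product.",
    [model_a, model_b, model_c],
)
print(f"consistency = {score:.2f}")  # 0.67: two of three models agree
```

Under this framing, an explanation "generalizes" to the extent that feeding it to other LRMs pushes the consistency score toward 1.0 relative to a no-explanation baseline.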
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 9618