Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning

Published: 08 Nov 2025, Last Modified: 08 Nov 2025
Venue: ResponsibleFM @ NeurIPS 2025
License: CC BY 4.0
Keywords: Chain of Thought, Interchangeability, Hybrid Reasoning Chains, Model Compatibility, Multi-Step Reasoning, Logical Coherence, Handoff, Continuation, Partial Reasoning Chains
TL;DR: Hybrid reasoning chains in large language models tend to maintain coherence within the same model family but show reduced alignment across families, highlighting constraints on cross-model reasoning interchangeability.
Abstract: Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of large language models (LLMs). While prior work focuses on improving model performance through internal reasoning strategies, little is known about the interchangeability of reasoning across different models. In this work, we explore whether a partially completed reasoning chain from one model can be reliably continued by another model, either within the same model family or across families. We do so by assessing whether intermediate reasoning traces suffice as transferable scaffolds for logical coherence and final-answer accuracy. We interpret this interchangeability as a means of examining inference-time trustworthiness, probing whether reasoning remains both coherent and reliable under model substitution. Using token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models, Gemma-3-4B-IT and LLaMA-3.1-70B-Instruct, we conduct continuation experiments with Gemma-3-1B-IT and LLaMA-3.1-8B-Instruct to test intra-family and cross-family behavior. Our evaluation pipeline combines these truncation thresholds with a Process Reward Model (PRM), providing a reproducible framework for assessing reasoning stability under model interchange. These evaluations reveal that hybrid reasoning chains often preserve, and in some cases even improve, final accuracy and logical structure. Our findings point to interchangeability as an emerging behavioral property of reasoning models, offering insights into new paradigms for reliable modular reasoning in collaborative AI systems.
Submission Number: 124
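To make the handoff procedure concrete, below is a minimal sketch of the truncate-and-continue setup described in the abstract, assuming Hugging Face transformers and greedy decoding. The truncation rule (cut at the first token whose log-probability falls below a fixed threshold) and the threshold value are illustrative placeholders rather than the paper's exact settings, and the PRM scoring step is omitted.

```python
# Minimal sketch of a truncate-and-continue ("handoff") experiment.
# Model names follow the paper's intra-family Gemma pair; the truncation
# rule and threshold below are illustrative assumptions, not the authors'
# exact procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_with_logprobs(model, tokenizer, prompt, max_new_tokens=256):
    """Generate a reasoning chain and return per-token log-probabilities."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
    )
    gen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]
    logprobs = []
    for step, logits in enumerate(out.scores):
        step_logprobs = torch.log_softmax(logits[0], dim=-1)
        logprobs.append(step_logprobs[gen_ids[step]].item())
    return gen_ids, logprobs

def truncate_at_threshold(gen_ids, logprobs, threshold=-2.0):
    """Cut the chain at the first token whose log-prob falls below threshold."""
    for i, lp in enumerate(logprobs):
        if lp < threshold:
            return gen_ids[:i]
    return gen_ids  # no low-confidence token found; keep the full chain

# The baseline model produces the partial chain...
base_tok = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it", device_map="auto"
)
prompt = "Q: If 3x + 5 = 20, what is x? Let's think step by step.\n"
gen_ids, logprobs = generate_with_logprobs(base_model, base_tok, prompt)
partial_chain = base_tok.decode(
    truncate_at_threshold(gen_ids, logprobs), skip_special_tokens=True
)

# ...and a different model continues it from the truncation point.
cont_tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
cont_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-1b-it", device_map="auto"
)
cont_inputs = cont_tok(prompt + partial_chain, return_tensors="pt").to(
    cont_model.device
)
continuation = cont_model.generate(
    **cont_inputs, max_new_tokens=256, do_sample=False
)
print(cont_tok.decode(continuation[0], skip_special_tokens=True))
```

Swapping the continuation model for a cross-family checkpoint (e.g., LLaMA-3.1-8B-Instruct) would exercise the cross-family condition; the completed hybrid chain would then be scored by the PRM.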