Keywords: LLM Collaboration, Efficient Reasoning, Small Language Models, Thinking Insights, Cost-aware Inference
Abstract: Recent advancements in large language models (LLMs) have popularized reasoning-intensive paradigms, which improve response quality but significantly increase computational costs due to extended reasoning chains. We propose Tandem, a collaborative framework between large and small language models designed to achieve high-quality reasoning with low computational overhead. In Tandem, the LLM acts as a mentor by generating four types of critical reasoning insights (Goal, Planning, Retrieval, Action), while a small language model (SLM) executes the reasoning to produce the final answer. A cost-aware judgment mechanism uses perplexity and entropy to adaptively determine when sufficient insights have been accumulated, allowing early termination of LLM generation. Experiments on MATH and GSM8K show that Tandem reduces computational costs by approximately 50% while outperforming the LLM alone in accuracy. Code: https://anonymous.4open.science/r/Ensemble-Hub-0FD8
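A minimal sketch of what a perplexity/entropy stopping rule of this kind could look like (the thresholds, function names, and example values below are illustrative assumptions, not taken from the paper):

```python
# Hedged sketch of a cost-aware early-stopping rule in the spirit of the
# abstract: stop mentor-LLM insight generation once its perplexity and
# predictive entropy suggest the SLM has enough to finish the reasoning.
# All names and thresholds here are illustrative assumptions.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def perplexity(token_logprobs: np.ndarray) -> float:
    # PPL = exp(-mean log p(token)); low PPL => confident generation.
    return float(np.exp(-token_logprobs.mean()))

def mean_entropy(step_logits: np.ndarray) -> float:
    # Average per-step predictive entropy over the vocabulary, in nats.
    probs = softmax(step_logits)
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(ent.mean())

def enough_insights(token_logprobs: np.ndarray, step_logits: np.ndarray,
                    ppl_max: float = 3.0, ent_max: float = 1.5) -> bool:
    """Return True when the accumulated insights look sufficient,
    i.e. both confidence signals fall below assumed thresholds and
    LLM insight generation can be terminated early."""
    return (perplexity(token_logprobs) < ppl_max
            and mean_entropy(step_logits) < ent_max)

# Toy example: statistics gathered while the LLM streams insight tokens.
rng = np.random.default_rng(0)
fake_logprobs = np.log(rng.uniform(0.5, 0.95, size=32))   # confident tokens
fake_logits = rng.normal(0.0, 0.5, size=(32, 1000))
fake_logits[np.arange(32), rng.integers(0, 1000, 32)] += 10.0  # one dominant token per step
print(enough_insights(fake_logprobs, fake_logits))  # True under these assumptions
```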
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: LLM Efficiency, NLP in resource-constrained settings, agent coordination and negotiation
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 3440