Keywords: Large Language Models (LLMs), Test-Time Scaling (TTS), Overthinking Reduction
Abstract: Large Language Models (LLMs) face persistent challenges in domain-specific reasoning tasks, particularly in fields such as mathematics, telecommunications, and scientific problem-solving, where structured, multi-step inference and adherence to formal constraints are essential. Traditional generation strategies often fail to balance accuracy and efficiency in these settings, partly due to the absence of high-quality, curated reasoning datasets. In this work, we introduce Adaptive Budget Forcing (ABF), a simple yet effective test-time inference strategy that dynamically adjusts the reasoning length of LLMs by monitoring real-time certainty signals (such as token-level confidence, entropy, and semantic coherence) within the model's thinking trajectory. ABF enables models to terminate generation once sufficient confidence is reached, or to extend it when further inference is needed, improving both computational efficiency and decision fidelity. To support ABF and enable effective fine-tuning, we construct TCORE (Telecom-Curated Open Reasoning Examples), a domain-specific dataset of multi-step reasoning traces derived from telecom standards and engineering tasks. TCORE is built via a multi-stage filtering process targeting quality, difficulty, and semantic diversity, and serves as both a fine-tuning resource and an evaluation benchmark. Experimental results on telecom and mathematical reasoning tasks demonstrate that ABF consistently improves reasoning accuracy while reducing unnecessary computation. TCORE and code are available at https://anonymous.4open.science/r/ABF-TCORE.
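The core idea of the abstract (stop reasoning early when the model is certain, continue when it is not) can be sketched with an entropy-based stopping rule. This is a minimal illustrative simulation, not the paper's implementation: the threshold `tau`, the `min_steps`/`max_steps` budgets, and the use of Shannon entropy alone (rather than the full set of confidence and coherence signals) are assumptions for the sketch.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_budget_forcing(step_distributions, tau=0.5, min_steps=2, max_steps=None):
    """Illustrative early-stopping loop in the spirit of ABF.

    Consumes per-step next-token distributions; stops once entropy drops
    below the (hypothetical) certainty threshold `tau`, subject to a
    minimum and maximum reasoning budget. Returns the number of
    reasoning steps actually used.
    """
    if max_steps is None:
        max_steps = len(step_distributions)
    used = 0
    for dist in step_distributions[:max_steps]:
        used += 1
        # Low entropy = a peaked distribution = the model is confident.
        if used >= min_steps and entropy(dist) < tau:
            break  # confident enough: terminate generation early
    return used

# Entropy falls as the model becomes certain; generation stops at step 3
# instead of exhausting the full budget of 4 steps.
dists = [
    [0.25, 0.25, 0.25, 0.25],   # uniform: maximally uncertain
    [0.40, 0.30, 0.20, 0.10],   # still uncertain
    [0.90, 0.05, 0.03, 0.02],   # peaked: confident, entropy < tau
    [0.99, 0.01],               # never reached
]
print(adaptive_budget_forcing(dists))  # → 3
```

A persistently flat (high-entropy) trajectory would instead run to `max_steps`, which mirrors the "extend when further inference is needed" half of the strategy.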
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9245