Abstract: Large reasoning models (LRMs) can perform complex reasoning via long chain-of-thought (CoT), involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which uses logit arithmetic to steer a target large LM toward long reasoning using a substantially smaller model as a guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider models, a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve relative pass@1 improvements of 24.5% and 29.1%, respectively, across five mathematical and scientific reasoning datasets, using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Ablation studies confirm that ThinkLogit-DPO succeeds only when it couples a preference-learning objective with training pairs drawn from both the target and guider models. Our work presents a computationally efficient method to elicit long reasoning in large models with minimal or no additional training.
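For intuition, below is a minimal sketch of logit-arithmetic guided decoding in the spirit the abstract describes, assuming a proxy-tuning-style contrast z = z_target + alpha * (z_guider - z_base). The base checkpoint (Qwen2.5-Math-1.5B), the value of alpha, and greedy decoding are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of logit-arithmetic guided decoding. Hedged: the exact
# ThinkLogit arithmetic may differ; this follows a proxy-tuning-style
# contrast z = z_target + alpha * (z_guider - z_base).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET = "Qwen/Qwen2.5-32B"                           # large target LM (from the abstract)
GUIDER = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # small long-reasoning guider
BASE = "Qwen/Qwen2.5-Math-1.5B"                       # ASSUMPTION: base of the guider

# All three models share the Qwen tokenizer, so their logits are aligned
# token-for-token, a prerequisite for adding logits elementwise.
tok = AutoTokenizer.from_pretrained(TARGET)
models = {
    name: AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    for name in (TARGET, GUIDER, BASE)
}

@torch.no_grad()
def last_logits(model, ids):
    """Next-token logits at the last position, moved to CPU in fp32."""
    return model(ids.to(model.device)).logits[:, -1, :].float().cpu()

@torch.no_grad()
def generate(prompt, max_new_tokens=512, alpha=1.0):
    # No KV cache for brevity: each step reruns the full prefix.
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        z = last_logits(models[TARGET], ids) + alpha * (
            last_logits(models[GUIDER], ids) - last_logits(models[BASE], ids)
        )
        nxt = z.argmax(dim=-1, keepdim=True)  # greedy decoding for simplicity
        ids = torch.cat([ids, nxt], dim=-1)
        if nxt.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

print(generate("Question: What is 17 * 24? Think step by step."))
```

The guider/base contrast isolates the "long reasoning" signal learned by the small distilled model, so the large target model supplies knowledge while the small pair supplies the reasoning style, without any training of the target.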
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: reasoning, math QA
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3.1
B2 Discuss The License For Artifacts: No
B2 Elaboration: We use publicly available datasets and open-sourced models that are widely adopted in academic research. All artifacts used in this work are governed by their respective licenses, which permit research use and distribution.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: Our use of these publicly available datasets and open-sourced models is consistent with their intended research use: their respective licenses permit research use and distribution.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The datasets contain problems from mathematical competitions and science literature, with no risk of containing personal info or offensive content.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3.1
B6 Statistics For Data: Yes
B6 Elaboration: Section 3.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 3.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.1, Appendix A.1
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 3.2
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix A.1
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: We used AI assistants to debug code and to polish the writing of the draft.
Author Submission Checklist: Yes
Submission Number: 961