Weak2Wise: An Automated, Lightweight Framework for Weak-LLM-Friendly Reasoning Synthesis

ACL ARR 2025 May Submission 2856 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · License: CC BY 4.0
Abstract: Recent advances in large language model (LLM) fine-tuning have shown that incorporating high-quality reasoning traces into training data can markedly improve downstream performance. However, existing approaches often depend on expensive manual annotations or auxiliary models, and fail to adapt to the unique limitations of smaller “weak” LLMs. To address these gaps, we introduce Weak2Wise, a fully automated, lightweight framework for synthesizing high-quality, weak-LLM-friendly reasoning traces. Starting from a QA dataset, Weak2Wise filters out samples the weak LLM already answers correctly, gathers diverse candidate reasoning traces from multiple strong LLMs, and applies our Step-Mask scoring to rank and truncate the traces that guide the weak LLM most effectively. These traces are then used for fine-tuning, yielding substantial improvements in the weak LLM’s reasoning abilities. The name Weak2Wise carries two meanings: a “weak” LLM selects the “wisest” reasoning traces generated by stronger LLMs, and the same weak LLM is fine-tuned on these traces to become “wiser”. We further use Weak2Wise to build GR-1K, a 1,000-sample math and science QA-reasoning dataset optimized for weak LLMs, and fine-tune Qwen2.5-7B on it to create GR-7B, which achieves superior performance on the AIME2024, MATH-500, and GPQA Diamond benchmarks. Source code, dataset, and pretrained models will be made publicly available upon acceptance.
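The abstract outlines a four-step pipeline: filter, gather, rank/truncate, fine-tune. The Python sketch below restates those steps for illustration only; every name in it (weak_llm, strong_llms, step_mask_score, top_k) is a hypothetical placeholder, since the abstract does not describe the actual implementation of Step-Mask scoring or trace truncation.

```python
# Hypothetical sketch of the Weak2Wise data-synthesis loop described in the
# abstract. Helper names are placeholders, not the authors' actual API.
from typing import Callable, List, Tuple

def weak2wise(
    qa_dataset: List[Tuple[str, str]],                  # (question, gold answer) pairs
    weak_llm: Callable[[str], str],                     # weak LLM: question -> answer
    strong_llms: List[Callable[[str], str]],            # strong LLMs: question -> reasoning trace
    step_mask_score: Callable[[str, str, str], float],  # (question, gold, trace) -> guidance score
    top_k: int = 1,
) -> List[Tuple[str, str]]:
    """Return (question, trace) pairs for fine-tuning the weak LLM."""
    training_pairs: List[Tuple[str, str]] = []
    for question, gold in qa_dataset:
        # Step 1: drop samples the weak LLM already answers correctly.
        if weak_llm(question).strip() == gold.strip():
            continue
        # Step 2: gather diverse candidate traces from multiple strong LLMs.
        candidates = [llm(question) for llm in strong_llms]
        # Step 3: rank candidates by how effectively they guide this weak LLM
        # (the abstract's Step-Mask scoring; truncation of traces is part of
        # that step in the paper and is omitted here).
        ranked = sorted(
            candidates,
            key=lambda trace: step_mask_score(question, gold, trace),
            reverse=True,
        )
        # Step 4: keep the top-k traces as fine-tuning targets.
        training_pairs.extend((question, trace) for trace in ranked[:top_k])
    return training_pairs
```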
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: reasoning, chain-of-thought, fine-tuning, automatic evaluation
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Keywords: reasoning, chain-of-thought, fine-tuning, automatic evaluation
Submission Number: 2856