Keywords: reasoning, domain adaptation, DPO, SFT, dataset
TL;DR: We post-train a small 4B model to state-of-the-art reasoning performance on finance, surpassing larger LRMs via SFT and DPO on (to-be-released) data, enabling practical domain adaptation and deployment.
Abstract: Large reasoning models (LRMs) excel at reasoning tasks but face deployment barriers due to computational constraints, regulatory requirements, and domain-specific knowledge gaps. This work addresses these limitations by developing cost-efficient post-training methods to enhance reasoning capabilities. Using Qwen3-4B as our base model, we investigate variations of efficient Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). For this purpose, we construct a comprehensive financial reasoning dataset with traces of diverse quality derived from FinQA, enabling systematic analysis of how training data characteristics affect model performance under tight computational budgets. Our experiments demonstrate that reasoning data augmentation, combined with efficient training algorithms, can achieve an accuracy of 78.64%, surpassing larger LRMs such as DeepSeek-R1 as well as previously published results on FinQA, despite avoiding GRPO and other costly online RL methods. The work contributes both a multi-trace reasoning dataset for rapid experimentation and empirical insights into optimizing reasoning performance within resource constraints, providing a reusable framework for customizing smaller language models for domain-specific applications.
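As context for the DPO stage described in the abstract, the following is a minimal sketch of the standard DPO loss as commonly formulated in the literature; the function name, beta value, and batch layout are illustrative assumptions, not the paper's implementation or hyperparameters.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss, assuming
# per-sequence log-probabilities have already been computed for the chosen
# (preferred) and rejected reasoning traces under both the policy being
# trained and a frozen reference model. All names and values are illustrative.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: push the policy to prefer chosen over rejected traces
    relative to the reference model, scaled by the temperature beta."""
    # Log-ratio of policy to reference for each trace.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry-style preference margin.
    margin = beta * (chosen_rewards - rejected_rewards)
    # Negative log-sigmoid of the margin, averaged over the batch.
    return -F.logsigmoid(margin).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```

Because this objective needs only offline preference pairs (e.g., higher- vs. lower-quality reasoning traces for the same FinQA question) and a frozen reference model, it avoids the rollout and reward-model cost of online RL methods such as GRPO, which is the efficiency argument the abstract makes.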
Submission Number: 38