Exploiting Hardness and Diversity for Data-Efficient Fine-Tuning

ACL ARR 2026 January Submission 5022 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, License: CC BY 4.0
Keywords: Data-efficient fine-tuning, Mathematical reasoning, Data selection, Semantic diversity, Large language models
Abstract: Fine-tuning large language models for mathematical reasoning is typically performed on large training sets, even though many examples become redundant once a model is already instruction-tuned. Under practical compute and time constraints, it is therefore important to understand which training examples actually matter. We investigate this question on GSM8K by fine-tuning Gemma-2-2B-it with LoRA under a fixed data budget. We compare uniform random sampling with two structured data selection methods. A taxonomy-based approach, Skill-Balanced Sampling (SBS), enforces balanced coverage across predefined skill categories but yields only modest and inconsistent gains. We then propose Hardness-Weighted Diversity (HWD), which explicitly controls the proportion of easy, medium, and hard examples while promoting semantic diversity. Our empirical results show clear performance saturation well before the full dataset is used. Moreover, HWD achieves the best performance using only 9% of the GSM8K training data, outperforming both random sampling and SBS despite training on substantially fewer examples.
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: Reasoning, Data Selection, Efficient Training of Language Models
Contribution Types: NLP engineering experiment, Approaches to low-compute settings / efficiency, Data analysis
Languages Studied: English
Submission Number: 5022
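
Note: The abstract describes HWD only at a high level (a fixed easy/medium/hard mix plus semantic diversity). The Python fragment below is a minimal, hypothetical sketch of one way such a selector could look; the per-example hardness labels, the 20/40/40 hardness mix, the greedy farthest-point diversity step, and the function name select_hwd are illustrative assumptions and are not confirmed by the submission.

    # Minimal, hypothetical sketch of HWD-style selection (not the authors' code).
    import numpy as np

    def select_hwd(embeddings, hardness, budget, proportions=(0.2, 0.4, 0.4), seed=0):
        """Return `budget` training indices with a fixed hardness mix and
        greedy semantic diversity inside each hardness bucket.

        embeddings:  (N, d) array of sentence embeddings for the candidate examples
        hardness:    length-N array with values "easy", "medium", or "hard"
        budget:      total number of examples to select
        proportions: assumed (easy, medium, hard) shares of the budget
        """
        rng = np.random.default_rng(seed)
        hardness = np.asarray(hardness)
        selected = []
        for level, share in zip(("easy", "medium", "hard"), proportions):
            pool = np.flatnonzero(hardness == level)
            k = min(int(round(share * budget)), len(pool))
            if k == 0:
                continue
            # L2-normalise so dot products are cosine similarities.
            emb = embeddings[pool].astype(float)
            emb /= np.linalg.norm(emb, axis=1, keepdims=True)
            # Greedy farthest-point sampling: start from a random example, then
            # repeatedly add the example least similar to anything chosen so far.
            chosen = [int(rng.integers(len(pool)))]
            min_sim = emb @ emb[chosen[0]]
            for _ in range(k - 1):
                min_sim[chosen] = np.inf  # never re-select an already chosen point
                cand = int(np.argmin(min_sim))
                chosen.append(cand)
                min_sim = np.minimum(min_sim, emb @ emb[cand])
            selected.extend(pool[chosen].tolist())
        return selected

    # Example call: roughly 9% of GSM8K's ~7.5k training problems is ~670 examples.
    # subset = select_hwd(embeddings, hardness_labels, budget=670)

The selected count may fall slightly short of the budget when a hardness bucket is smaller than its assumed share; how the paper handles that case (and how hardness is measured) is not stated in the abstract.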