Keywords: Inductive Reasoning, Small Language Models, Supervised Fine-Tuning
TL;DR: An inductive reasoning corpus of (Context, Question, Answer) triplets for fine-tuning small language models toward stronger inductive reasoning.
Abstract: Most reasoning evaluations conflate deduction with induction. We target \emph{inductive} ability, i.e., ampliative inference from noisy evidence, and introduce IR-Triplets, a corpus of (Context, Question, Answer) triplets aligned to ten canonical inductive forms (enumeration, statistical generalization/syllogism, analogy, default rules, abduction, Bayesian/Carnapian updates, Mill-style causal inference). The dataset is fairly balanced across forms and supports auditable supervision for agentic systems. We fine-tune ten small language models (0.5B–9B) with parameter-efficient Supervised Fine-Tuning (SFT) and evaluate in a 2×2 design: in-distribution (ID) held-out data and out-of-distribution (OOD) transfer to a different dataset, DEER. IR-Triplets yields consistent ID gains in ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation, Longest Common Subsequence), with a mean absolute gain of ${\approx}0.07$ (${\sim}60$–$70\%$ relative) and large improvements for several models; OOD transfer is heterogeneous but frequently positive (e.g., Gemma2-2B, Llama-8B). Post-hoc spectral diagnostics show strong compression: spectral tail index and stable rank typically drop by ${\sim}45$–$80\%$ and ${\sim}44$–$55\%$, respectively. Ordinary Least Squares (OLS) analyses indicate that model size strongly predicts spectral compression, while ROUGE-L gains are not a significant predictor once size is controlled; conversely, in the reverse regression, spectral deltas do not significantly explain ROUGE-L gains at this sample size. Overall, IR-Triplets reliably improves text-level fidelity and reorganizes capacity toward lower-rank, heavier-tailed representations, but the magnitude of ROUGE-L improvement does not linearly track the \emph{amount} of global compression, pointing to subspace-level mechanisms as a key direction for OOD robustness.
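For readers unfamiliar with the headline metric, the following is a minimal sketch of a ROUGE-L F1 score computed from the longest common subsequence of tokens. It is an illustration under stated assumptions, not the evaluation code used in the submission: whitespace tokenization, lowercasing, the balanced F1 weighting, and the function names `lcs_length` and `rouge_l_f1` are all choices made here for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programming; prev[j] holds the LCS length of the
    # tokens of `a` seen so far against the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as a balanced F1 over LCS precision and recall."""
    # Assumption: lowercase whitespace tokens; real evaluations may use a
    # different tokenizer, stemming, or a recall-weighted F-measure.
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, an absolute gain is the difference between the fine-tuned and base scores on the same reference answers, and the relative gain divides that difference by the base score.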
Primary Area: datasets and benchmarks
Submission Number: 20492