Keywords: domain-specific LLM
Abstract: Large Language Models (LLMs) are typically trained on diverse, general-purpose datasets, enabling broad generalization but incurring substantial computational costs. However, real-world applications often require efficient models tailored to narrow, domain-specific tasks. In such settings, large model capacity and generality are unnecessary, and traditional fine-tuning pipelines struggle under resource constraints. We introduce FineScope, a framework that addresses this challenge by tightly coupling domain-aware data selection with model pruning and fine-tuning. Starting from a small set of user-provided seed examples, FineScope trains sparse autoencoders (SAEs) on intermediate model activations to automatically extract semantically aligned examples from large unlabeled corpora. The curated dataset then guides structured pruning to preserve domain-relevant substructures and supports self-distillation fine-tuning to recover task-specific performance. Experiments across STEM, humanities, social sciences, math, and coding domains show that FineScope consistently outperforms baseline fine-tuning approaches while enabling up to $35\%$ parameter pruning. On math reasoning tasks, it achieves an average improvement of approximately 11.50 points across pruned models. Code will be available.
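To make the SAE-based data-selection step described in the abstract more concrete, the snippet below is a minimal, illustrative sketch rather than the authors' implementation: it trains a simple ReLU sparse autoencoder on mean-pooled intermediate activations of the seed examples and ranks unlabeled candidates by cosine similarity of their SAE latents to the seed centroid. The helper names (`hidden_states`, `train_sae`, `select_domain_examples`), the choice of layer, pooling, and similarity measure are all assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' code) of SAE-based domain data selection,
# assuming a HuggingFace-style causal LM with output_hidden_states=True.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Single-layer SAE with an L1-sparsified latent code (assumed architecture)."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))   # sparse latent activations
        return self.decoder(z), z


def hidden_states(model, tokenizer, texts, layer: int, device: str = "cuda"):
    """Mean-pooled activations from one intermediate layer (hypothetical helper)."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1)   # [batch, d_model]


def train_sae(sae, acts, l1_coef: float = 1e-3, epochs: int = 50, lr: float = 1e-3):
    """Reconstruction + L1 sparsity objective on the seed activations."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, z = sae(acts)
        loss = F.mse_loss(recon, acts) + l1_coef * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


def select_domain_examples(sae, seed_acts, candidate_acts, top_k: int = 10_000):
    """Rank unlabeled candidates by similarity of SAE latents to the seed centroid."""
    with torch.no_grad():
        _, seed_z = sae(seed_acts)
        _, cand_z = sae(candidate_acts)
        centroid = seed_z.mean(dim=0, keepdim=True)
        scores = F.cosine_similarity(cand_z, centroid, dim=-1)
    return scores.topk(min(top_k, scores.numel())).indices
```

The selected indices would then point to the corpus examples used to guide structured pruning and self-distillation fine-tuning, per the pipeline summarized in the abstract.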
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 18156