Quality Over Quantity: Predictive Data Selection for Edge Language Models

Published: 11 Nov 2025, Last Modified: 16 Jan 2026 · DAI Poster · CC BY 4.0
Keywords: Predictive Data Selection, Edge Language Models, Data Curation, Data Filtering, PEFT, DoRA
Abstract: The performance of edge language models is fundamentally determined by the quality of their training data. To address the challenge of efficient data curation in resource-constrained environments, this study adapts and optimizes the Predictive Data Selection (PreSelect) methodology. Our approach focuses on enhancing two core capabilities crucial for edge AI applications: ChatRAG, as the foundation for knowledge interaction, and Function Calling, as the basis for tool use. By designing an evaluation ensemble that includes specialized models and training a lightweight FastText classifier, we can efficiently filter high-value training samples from massive datasets. Experimental results demonstrate that this strategy yields significant performance improvements, particularly in ChatRAG (+10.5%) and Function Calling (+10.0%). This research validates that an edge-optimized PreSelect is an effective and viable strategy for enhancing targeted capabilities in edge models, ultimately showing that under resource constraints, curated data quality is a more critical driver of performance than sheer data quantity.
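The classifier-based filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: a lightweight quality classifier assigns each candidate sample a "high-value" score, and only samples above a threshold are kept for training. In the paper the scorer is a trained FastText classifier; here a hypothetical keyword-based stub scorer stands in so the pipeline is self-contained, and the keywords, threshold, and function names are all assumptions for illustration.

```python
from typing import Callable, Iterable, List


def filter_high_value(
    samples: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only samples whose quality score meets the threshold."""
    return [s for s in samples if score(s) >= threshold]


def stub_score(text: str) -> float:
    """Stand-in for a FastText classifier's predicted probability.

    Favors samples that mention retrieval or tool use, loosely mimicking
    the ChatRAG / Function Calling focus described in the abstract.
    """
    keywords = ("retrieve", "function", "tool", "call")
    return 1.0 if any(k in text.lower() for k in keywords) else 0.1


if __name__ == "__main__":
    corpus = [
        "Call the weather API with the city name.",
        "Once upon a time in a quiet village...",
        "Retrieve the relevant passage before answering.",
    ]
    kept = filter_high_value(corpus, stub_score, threshold=0.5)
    print(len(kept))  # 2 of the 3 samples pass the filter
```

In a real pipeline, `stub_score` would be replaced by a call to a trained FastText model (e.g. the probability returned by its `predict` method), which keeps per-sample scoring cheap enough to run over massive corpora.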
Submission Number: 31