DSBox: A Data Selection Framework for Efficient Deep Code Learning

Xinyang Liu, Lili Quan, Qiang Hu

Published: 2025, Last Modified: 28 Mar 2026. Venue: ASE 2025. License: CC BY-SA 4.0
Abstract: Deep learning has achieved remarkable advances in a variety of software engineering tasks and has attracted considerable attention in the community. Under the data-centric paradigm, training code models requires high-quality datasets. However, constructing such datasets, especially for software tasks, is costly, mainly because of the data labeling process. To address this challenge, multiple data selection methods have been proposed to identify and label the data samples that are important for training. Despite their potential, few tools support the flexible use of these methods, which hinders both their practical adoption and future research in this domain. To bridge this gap, we introduce DSBox, a lightweight yet extensible framework that unifies 20 published selection methods in three categories: uncertainty-based, representativeness-based, and quality-based. Our evaluation shows that active learning methods outperform recently proposed techniques (designed for large language models) on the code vulnerability detection task. The tool and a demonstration video are available on the project website: https://sites.google.com/view/dsbox2025.
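To make the uncertainty category concrete, the following is a minimal sketch of entropy-based selection, a standard active learning strategy. It is not DSBox's interface, which the abstract does not describe; the function names `entropy` and `select_most_uncertain` and the example probabilities are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(
    pool_probs: Sequence[Sequence[float]], budget: int
) -> List[int]:
    """Return indices of the `budget` unlabeled samples with the highest entropy.

    Hypothetical helper, not part of DSBox: it only illustrates the idea of
    ranking unlabeled samples by model uncertainty before sending them to
    annotators.
    """
    scores = [entropy(p) for p in pool_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Assumed softmax outputs of a binary vulnerability detector on four
# unlabeled code snippets (made-up values for illustration).
pool = [
    [0.98, 0.02],  # confident prediction
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],  # fairly confident
    [0.50, 0.50],  # maximally uncertain
]

picked = select_most_uncertain(pool, budget=2)
print(picked)  # → [3, 1]
```

With a labeling budget of two, the sketch picks the two snippets whose predictions are closest to uniform, so annotation effort goes to the samples the model is least sure about.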