Keywords: Code Generation, Large Language Models, Software Engineering
Abstract: Repository-level benchmarks such as SWE-Bench have highlighted the challenges of scaling language models to complex software engineering tasks. However, open-source real-world data with verifiable environments, from which trajectories can be collected for rejection fine-tuning, is scarce. We therefore propose training on augmented data as a mid-training stage. Prior augmented data primarily focuses on monolingual issue resolution and feature implementation. In this work, we introduce SWE-Ext, an effort to scale and extend augmented data for repository-level coding tasks. SWE-Ext broadens existing data along two key dimensions: multilingual coverage (spanning 10 languages) and an auxiliary code completion task. We uncover distinct transfer mechanisms: data from other programming languages provides transferable signals that generally enhance localization and editing capabilities in single-language (Python) settings, while code completion data strengthens code editing, particularly for feature implementation tasks that require substantial new code generation. Augmented data and these extensions yield consistent improvements on Python repository-level benchmarks such as SWE-Bench and FEA-Bench and improve post-training performance on them. Our method offers a simple yet effective way to leverage more open-source data for advancing repository-level code models.
Paper Type: Long
Research Area: Code Models
Research Area Keywords: code language models, software engineering automation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 10139