Keywords: Code Generation, Large Language Models, Software Engineering
Abstract: Repository-level benchmarks such as SWE-Bench have highlighted the challenges of scaling language models to complex software engineering tasks. However, open-source real-world data with verifiable environments, from which trajectories can be collected for rejection fine-tuning, is scarce. We therefore propose training on augmented data as a mid-training stage. Prior augmented data primarily focuses on monolingual issue resolution and feature implementation. In this work, we introduce SWE-Ext, an effort to scale and extend augmented data for repository-level coding tasks. SWE-Ext broadens existing data along two key dimensions: multilingual coverage (spanning 10 languages) and an auxiliary code completion task. We uncover distinct transfer mechanisms: data from other programming languages provides transferable signals that generally enhance localization and editing capabilities in single-language (Python) settings, while code completion data strengthens code editing, particularly for feature implementation tasks that require substantial new code generation. Augmented data and these extensions yield consistent improvements on Python repository-level benchmarks such as SWE-Bench and FEA-Bench and improve post-training performance on them. Our method offers a simple yet effective way to leverage more open-source data for advancing repository-level code models.
Paper Type: Long
Research Area: Code Models
Research Area Keywords: code language models, software engineering automation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 10139