Keywords: Emergent, LLMs, SWE-bench, Qwen, LLaMA, Finetune, Capabilities, Scaling Law, Software Engineering, Emergence, Prediction, Multi-file bug-fixing, Progression, Progression loss, Predictive framework, Safety risk
TL;DR: Fine-tuning mid-sized LLMs on SWE-bench reveals nonlinear emergent capabilities, with smaller fine-tuned models matching larger ones, enabling scaling-law forecasts of coding performance while raising safety concerns around capability amplification.
Abstract: Large Language Models exhibit unpredictable performance jumps on downstream tasks, and understanding when these emergent abilities arise remains challenging. While emergence has been observed across a variety of tasks, how much of a problem it poses depends on the task at hand. This work extends emergence prediction to SWE-bench by fine-tuning LLaMA-3.1-8B and Qwen3-14B, demonstrating that task-specific fine-tuning of smaller models accurately anticipates higher capabilities and thus suggests how larger models will behave. We fit an empirical emergence law by varying the amount of fine-tuning data, showing that tracking the performance of smaller models can predict the performance of larger models on SWE-bench using only a fraction of the computational resources. Validation on SWE-bench shows that fine-tuned models achieve substantially higher success rates (up to 44% vs. a 5% untuned baseline), and the fitted emergence law accurately anticipates performance thresholds (LLaMA: RMSE = 2.22, R² = 0.95; Qwen: RMSE = 1.02, R² = 0.99).
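The abstract describes fitting an empirical emergence law to the performance of smaller fine-tuned models and extrapolating it to larger ones. The sketch below illustrates one way such a fit and its error metrics (RMSE, R²) could be computed; the sigmoid-in-log-data functional form, the data points, and all variable names are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch: fit a hypothetical emergence law to SWE-bench success rates
# of a small fine-tuned model and extrapolate. Functional form and data are
# assumptions for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def emergence_law(log_data, L, k, x0):
    """Sigmoid in log10(fine-tuning examples); success rate saturates at L."""
    return L / (1.0 + np.exp(-k * (log_data - x0)))

# Hypothetical measurements: success rate (%) vs. amount of fine-tuning data.
n_examples = np.array([100, 300, 1000, 3000, 10000, 30000])
success_pct = np.array([5.0, 7.5, 14.0, 26.0, 38.0, 43.0])

x = np.log10(n_examples)
params, _ = curve_fit(emergence_law, x, success_pct, p0=[45.0, 2.0, 3.0])

# Goodness of fit, reported as in the abstract (RMSE and R^2).
pred = emergence_law(x, *params)
rmse = np.sqrt(np.mean((success_pct - pred) ** 2))
ss_res = np.sum((success_pct - pred) ** 2)
ss_tot = np.sum((success_pct - success_pct.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"RMSE = {rmse:.2f}, R^2 = {r2:.2f}")

# Extrapolate the fitted law to a larger budget to forecast performance
# beyond the models actually trained.
forecast = emergence_law(np.log10(1e5), *params)
print(f"Predicted success rate at 100k examples: {forecast:.1f}%")
```

In the paper's setting the same curve-fitting step would be applied per model family (LLaMA, Qwen), with the independent variable being whatever quantity the authors vary (fine-tuning data here, by assumption).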
Submission Number: 162