CodeSync: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper introduces CodeSync, a data engine for identifying outdated code patterns and collecting real-time API knowledge updates from Python third-party libraries.
Abstract: Large Language Models (LLMs) have exhibited exceptional performance in software engineering, yet they struggle to adapt to continually evolving code knowledge, particularly the frequent updates of third-party library APIs. This limitation, rooted in static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To address it, we introduce CodeSync, a data engine that identifies outdated code patterns and collects real-time code knowledge updates from Python third-party libraries. Building upon CodeSync, we develop CodeSyncBench, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, covering real-world updates for 220 APIs across six Python libraries. The benchmark offers 3,300 test cases spanning three evaluation tasks, along with an update-aware instruction tuning dataset of 2,200 training samples. Extensive experiments on 14 LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge-updating methods (e.g., DPO, ORPO, and SimPO). CodeSync lays a strong foundation for developing more effective and robust methods for real-time code knowledge updating. The experimental code is available at: https://github.com/CGCL-codes/naturalcc/tree/main/examples/codesync.
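To make the notion of an "outdated code pattern" concrete, here is a minimal, hypothetical sketch (not necessarily one of the 220 APIs covered by CodeSyncBench), using the well-documented removal of pandas' DataFrame.append in favor of pd.concat:

```python
# Illustrative only: the kind of API evolution CodeSync is designed to track.
# pandas deprecated DataFrame.append in 1.4 and removed it in 2.0; pd.concat is
# the current replacement. (Hypothetical example, not taken from CodeSyncBench.)
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})
row = pd.DataFrame({"x": [3]})

# Outdated pattern -- raises AttributeError on pandas >= 2.0:
# df = df.append(row, ignore_index=True)

# Updated pattern -- executable on current pandas releases:
df = pd.concat([df, row], ignore_index=True)
print(df)
```

An LLM whose training data predates such a change will keep emitting the deprecated call, which is exactly the kind of synchronization failure the benchmark's evaluation tasks probe.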
Lay Summary: AI tools that write software code, known as Large Language Models (LLMs), are incredibly powerful. However, they are often trained on outdated information: they do not know when programming languages and their toolkits (called libraries) change, so they may generate code that is broken, inefficient, or insecure. To tackle this, we developed CodeSync, an automated system that detects recent updates in popular Python libraries and uses them to build datasets and benchmarks. Using CodeSync, we created CodeSyncBench, a comprehensive test of how well LLMs adapt to this evolving code knowledge. We evaluated 14 major LLMs, including the latest models from OpenAI, Google, and Anthropic. Our results reveal a significant weakness: all current LLMs struggle to keep up with these updates, even when trained with advanced learning techniques. Our work provides a crucial tool for developers to measure this "knowledge decay" and lays the groundwork for next-generation AI coding assistants that stay synchronized with the fast-paced world of software development, making them far more reliable.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Everything Else
Keywords: Code Generation, Large Language Models, Knowledge Updating, API Evolution
Submission Number: 6486