SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose SyncMind, a framework for evaluating AI agents' out-of-sync recovery in collaborative software engineering, together with our benchmark SyncBench, revealing promising capabilities and key limitations of current LLM agents.
Abstract: Software engineering (SE) is increasingly collaborative, with developers working together on shared complex codebases. Effective collaboration in shared environments requires participants---whether humans or AI agents---to stay on the same page as their environment evolves. When a collaborator's understanding diverges from the current state---what we term the *out-of-sync* challenge---the collaborator's actions may fail, leading to integration issues. In this work, we introduce **SyncMind**, a framework that systematically defines the *out-of-sync* problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on ***SyncMind***, we create **SyncBench**, a benchmark featuring 24,332 instances of agent *out-of-sync* scenarios in real-world CSE, derived from 21 popular *GitHub* repositories with executable verification tests. Experiments on ***SyncBench*** uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from *Llama-3.1* agents $\leq 3.33\%$ to *Claude-3.5-Sonnet* $\geq 28.18\%$), their consistently low collaboration willingness ($\le 4.86\%$) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with *out-of-sync* recovery success. Minimal performance differences in agents' resource-aware *out-of-sync* recovery further reveal their significant lack of resource awareness and adaptability, shedding light on the future development of resource-efficient collaborative systems. Our code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.
Lay Summary: What happens when your teammates contribute new updates to your collaborative project while you're away, leaving you out of the loop? In collaborative scenarios, team members can fall "out-of-sync" with the current state of the project—missing crucial changes and potentially introducing bugs or causing integration issues. This problem becomes increasingly significant as more collaborative environments incorporate AI assistants that need to stay aligned with human collaborators. Our research tackles this challenge through the lens of collaborative software engineering, where we introduce SyncMind, a framework that systematically measures how well large language model (LLM) agents recover when they lose synchronization with the coding environment. We constructed SyncBench, a scalable benchmark with an open-source construction method, to evaluate agents' ability to detect, localize, and fix synchronization issues. Our experiments not only revealed striking differences in performance and ability across LLM agents, but also provided insights into resource-efficient collaboration in future collaborative systems. As agents can fall out-of-sync in various collaborative scenarios beyond software engineering, our findings carry implications for broader collaborative systems—whether among humans, AIs, or mixed teams—highlighting the need for better collaborative mechanisms, improved resource awareness in AI assistants, and the strategic reasoning, planning, and decision-making capabilities that help agents efficiently and effectively recover from state misalignments.
Link To Code: https://xhguo7.github.io/SyncMind/
Primary Area: General Machine Learning->Evaluation
Keywords: Collaborative Software Engineering, Collaborative Systems, Human-AI Interaction, Agent Evaluation, Benchmark Construction, Large Language Models
Submission Number: 7148