Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development
Keywords: AI, Agents, Designer, LLM, Programmer, Software Engineering
Abstract: Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, because naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this gap, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations drawn from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our analysis covers three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment, measured by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was the only one to converge to the correct solution from the very first iteration and sustain it through the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result. These findings advance the understanding of agentic programming and highlight the need for further research on understanding and calibrating convergence and stop conditions, which are essential for future autonomous SE.
Revision Summary: Repeated runs of the experiment are discussed in Section 3.1.
The Discussion section has been revised to focus more on interaction-level observations and their implications. We have removed the strong claims regarding the reasoning model and "100" iterations.
The correctness rubrics are now stated explicitly under Section 3.3, 1).
Table 2 has been modified to present a cross-level analysis that connects alignment metrics with functional correctness and compilation success rate.
Figure 8 has been modified to consider only C code. All cases of compilation success and failure refer to C code. We report "no code found" when a conversation turn contained only text and no code in any language.
The Threats to Validity section has been modified to address the open-ended nature of the Fibonacci game task and the risk of subjective bias in the manual analysis of the data.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length papers (i.e. case studies, theoretical, applied research papers). 8 pages
Reroute: false
Submission Number: 19