Compositional Skill Execution in LLM Multi-Agent Systems: A Comparative Study of Collaboration Architectures for Long-Horizon Tasks

Published: 27 May 2026, Last Modified: 09 Jun 2026 · CompLearn 2026 Poster · Everyone · CC BY 4.0
Keywords: multi-agent systems, LLM, compositional tasks, penetration testing, collaboration architecture
TL;DR: Solo agents outperform multi-agent teams on compositional tasks, but full context retention closes the gap entirely (50%→100%)
Abstract: Large language model (LLM) agents are increasingly deployed on compositional tasks requiring sequential sub-goal decomposition, yet systematic comparisons of collaboration architectures remain scarce. This study evaluates eight agent configurations—four solo models and four multi-agent architectures—across 132 trials on a three-stage compositional skill pipeline under full-context and zero-context conditions. Three findings emerge: (1) architecture choice dominates model choice—a mid-tier team matches a premium solo model in rounds (but at $5\times$ the solo token cost), while a poorly chosen architecture wastes $55\times$ more tokens than the best baseline; (2) the optimal architecture reverses between information conditions—role pipelines lead under certainty, peer-review loops under uncertainty; (3) expanding inter-agent context from truncated to full history raises multi-agent success from 50\% to 100\% ($p{<}0.001$), pinpointing compositional skill handoff as the primary bottleneck.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 15