Compositional Skill Execution in LLM Multi-Agent Systems: A Comparative Study of Collaboration Architectures for Long-Horizon Tasks
Keywords: multi-agent systems, LLM, compositional tasks, penetration testing, collaboration architecture
TL;DR: Solo agents outperform multi-agent teams on compositional tasks, but giving teams full inter-agent context closes the gap entirely (multi-agent success rises from 50% to 100%).
Abstract: Large language model (LLM) agents are increasingly deployed on compositional tasks requiring sequential sub-goal decomposition, yet systematic comparisons of collaboration architectures remain scarce.
This study evaluates eight agent configurations—four solo models and four multi-agent architectures—across 132 trials on a three-stage compositional skill pipeline under full-context and zero-context conditions.
Three findings emerge:
(1) architecture choice dominates model choice—a mid-tier team matches a premium solo model in rounds (but at $5\times$ the solo token cost), while a poorly chosen architecture wastes $55\times$ more tokens than the best baseline;
(2) the optimal architecture reverses between information conditions—role pipelines lead under certainty, peer-review loops under uncertainty;
(3) expanding inter-agent context from truncated to full history raises multi-agent success from 50\% to 100\% ($p{<}0.001$), pinpointing compositional skill handoff as the primary bottleneck.
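The difference between truncated and full inter-agent context in finding (3) can be illustrated with a minimal sketch. It is not the study's implementation: the `Handoff` class, the `max_messages` parameter, and the placeholder stage strings are all hypothetical names chosen for illustration, and the study's actual truncation rule is not stated here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Handoff:
    """Hypothetical message log passed from one agent to the next."""
    history: List[str] = field(default_factory=list)

    def context_for_next_agent(self, max_messages: Optional[int] = None) -> List[str]:
        # max_messages=None models the full-history condition.
        if max_messages is None:
            return list(self.history)
        # Guard against history[-0:], which would return the whole list.
        if max_messages <= 0:
            return []
        # An integer models truncation to the most recent messages (assumed rule).
        return self.history[-max_messages:]


# Placeholder outputs for a three-stage pipeline (illustrative only).
log = Handoff(history=["stage-1 result", "stage-2 result", "stage-3 result"])

truncated = log.context_for_next_agent(max_messages=1)
full = log.context_for_next_agent()
```

Under truncation the receiving agent sees only the latest stage's output, so anything it needs from earlier stages must be rediscovered; under full history nothing from the earlier stages is dropped at the handoff.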
Submission Number: 15