CREW-Wildfire: Benchmarking Agentic Multi-Agent Collaborations at Scale

Published: 11 Dec 2025 · Last Modified: 11 Dec 2025 · Accepted by TMLR · License: CC BY 4.0
Abstract: Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop CREW, a human-AI teaming simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interaction through modular Perception and Execution modules. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. By providing more realistic complexity, a scalable architecture, and behavioral evaluation metrics, CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.
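To make the abstract's modular Perception and Execution design more concrete, the sketch below shows one plausible shape such an interface could take. Every name here (`Observation`, `PerceptionModule`, `ExecutionModule`, `summarize`, `to_low_level`, and the primitive action strings) is a hypothetical assumption for illustration only; the benchmark's actual interfaces live in the linked repository.

```python
# Hypothetical sketch of the Perception/Execution split described in the abstract.
# All class, method, and primitive names are illustrative assumptions; the
# benchmark's real API is in the linked CREW-Wildfire repository.

from dataclasses import dataclass, field


@dataclass
class Observation:
    ascii_map: str                                  # partial, egocentric view of the map
    teammates: dict = field(default_factory=dict)   # visible teammate states


class PerceptionModule:
    """Turns a raw partial observation into text an LLM agent can reason over."""

    def summarize(self, obs: Observation) -> str:
        return f"Visible map:\n{obs.ascii_map}\nTeammates: {obs.teammates}"


class ExecutionModule:
    """Grounds a high-level natural-language command into low-level actions."""

    def to_low_level(self, command: str) -> list[str]:
        if "firebreak" in command.lower():
            return ["move_to_target", "clear_vegetation"]  # hypothetical primitives
        return ["idle"]
```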
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

# Revisions Summary

## 1. New Experimental Results

* Added a VLM vs. Perception Module ablation study in the appendix, comparing ASCII encoding against GPT-4o vision on 100 observations using the JudgeLLM evaluation protocol.
* Extended experimental runs to **20 seeds** for six small-scale levels (data collected; will be integrated into **Table 3**).
* Added explicit **cost information**: experiments range from **$0.50–$20** depending on baseline and level complexity.
* Clarified that **Figure 7** reports only algorithm-side API costs; the environment itself has **zero** API cost.

## 2. Implementation Details

* Added complete model specifications: **GPT-4o (gpt-4o-2024-08-06)** with a **128K** context window.
* Added baseline-specific hyperparameters:
  * *Embodied*: 2 rounds/timestep, 3-timestep lifespan
  * *COELA*: 30 max messages
* Explicitly stated that **all baselines are zero-shot**, with no learning during inference.

## 3. Table Improvements

* Clarified that the *12 distinct levels, 4 of which have size variants, yield 16 total configurations*, both in the text and in the **Table 2** caption.
* Added a **Max Score** column to **Table 3** giving the theoretical maximum for each level.

## 4. Technical Corrections

* Fixed the fire-spread equation: replaced `θ_slope` with the complete piecewise slope-factor function **f(slope)** capturing uphill vs. downhill dynamics (an illustrative sketch follows this summary).
* Added environmental parameter details: the moisture ratio is continuous, ignition is binary, and propagation speed is user-configurable.
* Added three concrete examples of **heterogeneous agent cooperation** requirements.
* Corrected the minimap coordinate formula to show explicit centering.

## 5. Typo Fixes

* Corrected a reference error: *"Fig. 7"* → *"Table 2"*.
* Fixed typos: *"PRIMATIVES"* → *"primitives"*, *"compentency"* → *"competency"*, *"wildfires creates"* → *"wildfire fighting creates"*.
* Added a forward reference from *"conceptually challenged"* to Section 5.3.
* Verified citation formatting throughout.

## 6. Future Work Changes

* Rewrote future work into three areas:
  1. **Scalable Architectures and Efficient Algorithms**
  2. **Adaptive Planning and Reasoning**
  3. **Evaluation and Human-AI Teaming**
* Added commitments to open-source trajectory data and to explore human baselines.

## 7. Appendix Additions

* Added **BCS interpretation guidance** explaining the purpose and limitations of the normalization.
* Added a **coordinate system clarification** in the perception module prompt.

---

# Summary of Changes by Section

| Section | Changes |
| --- | --- |
| **3.2 Environment Design** | Fire-spread equation fix; environmental parameter clarifications |
| **3.3 Agent Design** | Added heterogeneous cooperation examples |
| **3.4 Pillars** | Terminology correction ("wildfire fighting") |
| **4 Benchmarking Suite** | Level-count clarification; "conceptually challenged" reference |
| **4.1 Experiment Setup** | Zero-shot learning statement |
| **5.3 Results** | Figure 7 caption updated with cost clarification |
| **5.3 Outlook** | Rewrote future work |
| **Table 2** | Enhanced caption |
| **Table 3** | Added Max Score column |
| **Appendix A.2** | Coordinate system clarification |
| **Appendix A.4** | Typo fix |
| **New: Before A.15** | VLM ablation study |
| **Appendix A.16** | BCS interpretation guidance |
| **Appendix A.17** | Title typo fix |
| **Appendix A.18** | Complete hyperparameter specifications |
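As a companion to the technical corrections above, here is a minimal runnable sketch of the two corrected formulas. The constants, function names, and exact functional forms are illustrative assumptions, not the paper's definitions; the actual piecewise **f(slope)** and minimap mapping appear in Section 3.2 and Appendix A.2 of the revised paper.

```python
# Illustrative sketch only: the constants and functional forms below are
# assumptions, not the paper's definitions (see Section 3.2 and Appendix A.2).

def slope_factor(slope: float, k_up: float = 1.0, k_down: float = 0.5) -> float:
    """Piecewise slope factor f(slope): fire tends to spread faster uphill
    (slope > 0) and slower downhill (slope < 0); k_up/k_down are hypothetical gains."""
    if slope >= 0:
        return 1.0 + k_up * slope                # uphill: amplified spread
    return 1.0 / (1.0 + k_down * abs(slope))     # downhill: damped spread


def minimap_coords(x: int, y: int, agent_x: int, agent_y: int,
                   size: int = 21) -> tuple[int, int]:
    """Map world cell (x, y) into a size x size minimap explicitly centered on
    the agent, so the agent always sits at (size // 2, size // 2)."""
    half = size // 2
    return x - agent_x + half, y - agent_y + half


if __name__ == "__main__":
    print(slope_factor(0.3), slope_factor(-0.3))   # 1.3 uphill vs. ~0.87 downhill
    print(minimap_coords(50, 42, 48, 40))          # agent at (48, 40) -> (12, 12)
```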
Video: https://www.youtube.com/watch?v=IspKVw3mfFg&feature=youtu.be
Code: https://github.com/generalroboticslab/CREW/tree/main/crew-algorithms/crew_algorithms/wildfire_alg
Assigned Action Editor: ~Marlos_C._Machado1
Submission Number: 5974