Abstract: Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution modules. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. By providing more realistic complexity, scalable architecture, and behavioral evaluation metrics, CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: # **Revisions Summary**
## **1. New Experimental Results**
* Added VLM vs. Perception Module ablation study in the appendix comparing ASCII encoding against GPT-4o vision on 100 observations, using the JudgeLLM evaluation protocol.
* Extended experimental runs to **20 seeds** for six small-scale levels (data collected; will be integrated into **Table 3**).
* Added explicit **cost information**: experiments range from **$0.50–$20** per run depending on baseline and level complexity.
* Clarified that **Figure 7** reports only algorithm-side API costs; the environment itself has **zero** API cost.
## **2. Implementation Details**
* Added complete model specifications: **GPT-4o (gpt-4o-2024-08-06)** with a **128K** context window.
* Added baseline-specific hyperparameters:
* *Embodied*: 2 rounds/timestep, 3-timestep lifespan
* *COELA*: 30 max messages
* Explicitly stated that **all baselines are zero-shot**, with no learning during inference.
## **3. Table Improvements**
* Clarified: *“12 distinct levels with 4 having size variants = 16 total configurations”* in text and in the **Table 2** caption.
* Added a **Max Score** column to **Table 3** describing the theoretical maximum for each level.
## **4. Technical Corrections**
* Fixed fire-spread equation: replaced `θ_slope` with the complete piecewise slope-factor function **f(slope)** showing uphill vs. downhill dynamics.
* Added environmental parameter details: moisture ratio is continuous; ignition is binary; propagation speed is user-configurable.
* Added three concrete examples of **heterogeneous agent cooperation** requirements.
* Corrected minimap coordinate formula to show explicit centering.
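The corrected formulas themselves are not reproduced in this summary; as a hedged illustration of their general shape only, the sketch below shows a piecewise uphill/downhill slope factor and an explicitly agent-centered minimap mapping. The coefficients, minimap size, and exact functional forms are illustrative assumptions, not the paper's actual equations.

```python
import math

def slope_factor(slope_deg: float, a: float = 0.05) -> float:
    """Illustrative piecewise slope factor f(slope): fire spreads
    faster uphill, slower downhill. Coefficient `a` is an assumption."""
    if slope_deg >= 0:                      # uphill: accelerated spread
        return math.exp(a * slope_deg)
    return 1.0 / (1.0 + a * abs(slope_deg))  # downhill: damped spread

def to_minimap(world_x: int, world_y: int,
               agent_x: int, agent_y: int,
               minimap_size: int = 9) -> tuple[int, int]:
    """Illustrative minimap mapping with explicit centering:
    the agent's own cell always lands at the minimap's middle."""
    half = minimap_size // 2
    return world_x - agent_x + half, world_y - agent_y + half

# Flat terrain leaves the base spread rate unchanged,
# and the agent maps to the center cell of a 9x9 minimap.
assert slope_factor(0.0) == 1.0
assert to_minimap(5, 5, 5, 5) == (4, 4)
```

The key property both sketches share with the described corrections is asymmetry around zero slope and an explicit `+ half` centering offset rather than an implicit origin.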
## **5. Typo Fixes**
* Corrected reference error: *“Fig. 7” → “Table 2”*.
* Fixed typos: *“PRIMATIVES” → “primitives”*, *“compentency” → “competency”*, *“wildfires creates” → “wildfire fighting creates”*.
* Added forward reference for *“conceptually challenged”* to Section 5.3.
* Verified citation formatting throughout.
## **6. Future Work Changes**
* Rewrote future work into three areas:
1. **Scalable Architectures and Efficient Algorithms**
2. **Adaptive Planning and Reasoning**
3. **Evaluation and Human-AI Teaming**
* Added commitments to open-source trajectory data and exploration of human baselines.
## **7. Appendix Additions**
* Added **BCS interpretation guidance** explaining normalization purpose and limitations.
* Added **coordinate system clarification** in the perception module prompt.
---
# **Summary of Changes by Section**
| Section | Changes |
| -------------------------- | ---------------------------------------------------------------- |
| **3.2 Environment Design** | Fire-spread equation fix; environmental parameter clarifications |
| **3.3 Agent Design** | Added heterogeneous cooperation examples |
| **3.4 Pillars** | Terminology correction (“wildfire fighting”) |
| **4 Benchmarking Suite** | Level-count clarification; “conceptually challenged” reference |
| **4.1 Experiment Setup** | Zero-shot learning statement |
| **5.3 Results** | Figure 7 caption updated with cost clarification |
| **5.3 Outlook** | Rewrote future work |
| **Table 2** | Enhanced caption |
| **Table 3** | Added Max Score column |
| **Appendix A.2** | Coordinate system clarification |
| **Appendix A.4** | Typo fix |
| **New: Before A.15** | VLM ablation study |
| **Appendix A.16** | BCS interpretation guidance |
| **Appendix A.17** | Title typo fix |
| **Appendix A.18** | Complete hyperparameter specifications |
Video: https://www.youtube.com/watch?v=IspKVw3mfFg&feature=youtu.be
Code: https://github.com/generalroboticslab/CREW/tree/main/crew-algorithms/crew_algorithms/wildfire_alg
Assigned Action Editor: ~Marlos_C._Machado1
Submission Number: 5974