Keywords: MicroRTS, Reinforcement Learning, Decision Transformers, Return Conditioned Supervised Learning
TL;DR: Decision Transformers with a Critic and Online Fine-Tuning are applied to the Gym-MicroRTS environment to match the performance of an Implicit Q-Learning agent
Abstract: Decision Transformers (DT) are a Return-Conditioned Supervised Learning (RCSL) technique. A DT policy predicts actions by attending to a limited history of tokens that encodes states, actions, and returns-to-go. Two existing extensions of DT, namely Online Decision Transformers (ODT) and Critic-Guided Decision Transformers (CGDT), are re-implemented and applied to the Gym-$\mu$RTS environment. In CGDT, a critic learns to predict reward distributions conditioned on sequences of interwoven states and actions to overcome DT's issues with stochasticity. In ODT, an additional entropy term and hindsight reward relabeling enable online fine-tuning. A dataset is generated from 3000 games between CoacAI and Mayari, two previous Gym-$\mu$RTS competition winners, on procedurally generated 8x8 maps. We further explore the combination of both CGDT and ODT methods to create a novel model called the Online Critic-Guided Decision Transformer (OCGDT). Training proceeds in three phases: (1) supervised learning of the critic using a fixed dataset of 3000 trajectories, (2) supervised learning of the policy from the same dataset, and (3) online fine-tuning of the policy, using the dataset as a starting point for a replay buffer. The critic and offline networks are validated against 500 held-out trajectories, while the final policy's performance is measured by its win rate against four benchmark scripted bots (CoacAI, Mayari, lightRushAI, and workerRushAI). The agent obtains a win rate of $26.2\%\pm4.3\%$ against CoacAI and a win rate of $40.1\%\pm4.8\%$ against Mayari over 4 seeds per match-up and 100 games per seed, for a total of 400 games per match-up, on held-out procedurally generated 8x8 maps. This matches the performance of Implicit Q-Learning (IQL). The agent also obtains a win rate of $51.6\%\pm4.9\%$ when matched up directly against IQL.
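As a minimal illustrative sketch (not the authors' code), the return-conditioned input described above can be built by computing hindsight returns-to-go from a trajectory's rewards and interleaving them with states and actions into the limited context window the DT policy attends to; the function names, shapes, and context length below are assumptions for illustration only.

```python
# Hypothetical sketch of Decision Transformer input construction:
# hindsight returns-to-go interleaved with states and actions.
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at step t: R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}."""
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def build_context(states, actions, rewards, context_len=20):
    """Last `context_len` (return-to-go, state, action) triples, i.e. the
    limited token history a DT policy conditions on when predicting actions."""
    rtg = returns_to_go(rewards)
    start = max(0, len(states) - context_len)
    return [(rtg[t], states[t], actions[t]) for t in range(start, len(states))]
```

At evaluation time, the same interleaving is typically used, except the return-to-go is initialized to a target return and decremented by observed rewards rather than computed in hindsight.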
Primary Area: reinforcement learning
Submission Number: 14384