{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to JaxMARL!","text":"<p>MARL but really really fast!</p> <p>JaxMARL combines ease-of-use with GPU-enabled efficiency, and supports a wide range of commonly used MARL environments as well as popular baseline algorithms. Our aim is for one library that enables thorough evaluation of MARL methods across a wide range of tasks and against relevant baselines. We also introduce SMAX, a vectorised, simplified version of the popular StarCraft Multi-Agent Challenge, which removes the need to run the StarCraft II game engine. </p>"},{"location":"#what-we-provide","title":"What we provide:","text":"<ul> <li>9 MARL environments fully implemented in JAX - these span cooperative, competitive, and mixed games; discrete and continuous state and action spaces; and zero-shot and CTDE settings.</li> <li>8 MARL algorithms, also fully implemented in JAX - these include both Q-Learning and PPO based approaches.</li> </ul>"},{"location":"#who-is-jaxmarl-for","title":"Who is JaxMARL for?","text":"<p>Anyone doing research on or looking to use multi-agent reinforcement learning!</p>"},{"location":"#what-is-jax","title":"What is JAX?","text":"<p>JAX is a Python library that enables programmers to use a simple numpy-like interface to easily run programs on accelerators. Recently, doing end-to-end single-agent RL on the accelerator using JAX has shown incredible benefits. To understand the reasons for such massive speed-ups in depth, we recommend reading the PureJaxRL blog post and repository.</p>"},{"location":"#performance-examples","title":"Performance Examples","text":"<p>coming soon</p>"},{"location":"#related-works","title":"Related Works","text":"<p>This work is heavily related to and builds on many other works. We would like to highlight some of the works that we believe would be relevant to readers:</p> <ul> <li>Jumanji. A suite of JAX-based RL environments. 
It includes some multi-agent ones such as RobotWarehouse.</li> <li>VectorizedMultiAgentSimulator (VMAS). It performs similar vectorization for some MARL environments, but is done in PyTorch.</li> <li>More to be added soon :)</li> </ul> <p>More documentation to follow soon!</p>"},{"location":"#citing-jaxmarl","title":"Citing JaxMARL","text":"<p>If you use JaxMARL in your work, please cite us as follows: <pre><code>@article{flair2023jaxmarl,\n    title={JaxMARL: Multi-Agent RL Environments in JAX},\n    author={Alexander Rutherford and Benjamin Ellis and Matteo Gallici and Jonathan Cook and Andrei Lupu and Gardar Ingvarsson and Timon Willi and Akbir Khan and Christian Schroeder de Witt and Alexandra Souly and Saptarashmi Bandyopadhyay and Mikayel Samvelyan and Minqi Jiang and Robert Tjarko Lange and Shimon Whiteson and Bruno Lacerda and Nick Hawes and Tim Rocktaschel and Chris Lu and Jakob Nicolaus Foerster},\n    journal={arXiv preprint arXiv:2311.10090},\n    year={2023}\n}\n</code></pre></p>"},{"location":"API/multi_agent_env/","title":"Multi agent env","text":"<p>Jittable abstract base class for all jaxmarl Environments.</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.agent_classes","title":"<code>agent_classes: dict</code>  <code>property</code>","text":"<p>Returns a dictionary with agent classes, used in environments with heterogeneous agents.</p> Format <p>agent_base_name: [agent_base_name_1, agent_base_name_2, ...]</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.name","title":"<code>name: str</code>  <code>property</code>","text":"<p>Environment name.</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.__init__","title":"<code>__init__(num_agents)</code>","text":"<p>num_agents (int): maximum number of agents within the environment, used to set array 
dimensions</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.action_space","title":"<code>action_space(agent)</code>","text":"<p>Action space for a given agent.</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.get_avail_actions","title":"<code>get_avail_actions(state)</code>","text":"<p>Returns the available actions for each agent.</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.get_obs","title":"<code>get_obs(state)</code>","text":"<p>Applies observation function to state.</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.observation_space","title":"<code>observation_space(agent)</code>","text":"<p>Observation space for a given agent.</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.reset","title":"<code>reset(key)</code>","text":"<p>Performs resetting of the environment.</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.step","title":"<code>step(key, state, actions, reset_state=None)</code>","text":"<p>Performs step transitions in the environment. Resets the environment if done. To control the reset state, pass <code>reset_state</code>. 
Otherwise, the environment will reset randomly.</p>"},{"location":"API/multi_agent_env/#jaxmarl.environments.multi_agent_env.MultiAgentEnv.step_env","title":"<code>step_env(key, state, actions)</code>","text":"<p>Environment-specific step transition.</p>"},{"location":"Algorithms/PPO/","title":"IPPO &amp; MAPPO","text":""},{"location":"Algorithms/PPO/#ippo-baseline","title":"IPPO Baseline","text":"<p>Pure JAX IPPO implementation, based on the PureJaxRL PPO implementation.</p>"},{"location":"Algorithms/PPO/#implementation-details","title":"\ud83d\udd0e Implementation Details","text":"<p>General features:</p> <ul> <li>Agents are controlled by a single network architecture (either FF or RNN).</li> <li>Parameters are shared between agents.</li> </ul>"},{"location":"Algorithms/PPO/#usage","title":"\ud83d\ude80 Usage","text":"<p>If you have cloned JaxMARL and are in the repository root, you can run the algorithms as scripts, e.g. <pre><code>python baselines/IPPO/ippo_rnn_smax.py\n</code></pre> Each file has a distinct config file which resides within <code>config</code>. 
The config file contains the IPPO hyperparameters, the environment's parameters and for some config files the <code>wandb</code> details (<code>wandb</code> is disabled by default).</p>"},{"location":"Algorithms/PPO/#mappo-baseline","title":"MAPPO Baseline","text":"<p>Pure JAX MAPPO implementation, based on the PureJaxRL PPO implementation.</p>"},{"location":"Algorithms/PPO/#implementation-details_1","title":"\ud83d\udd0e Implementation Details","text":"<p>General features:</p> <ul> <li>Agents are controlled by a single network architecture (either FF or RNN).</li> <li>Parameters are shared between agents.</li> <li>Each script has a <code>WorldStateWrapper</code> which provides a global <code>\"world_state\"</code> observation.</li> </ul>"},{"location":"Algorithms/PPO/#usage_1","title":"\ud83d\ude80 Usage","text":"<p>If you have cloned JaxMARL and are in the repository root, you can run the algorithms as scripts, e.g. <pre><code>python baselines/MAPPO/mappo_rnn_smax.py\n</code></pre> Each file has a distinct config file which resides within <code>config</code>. The config file contains the MAPPO hyperparameters, the environment's parameters and the <code>wandb</code> details (<code>wandb</code> is disabled by default).</p>"},{"location":"Algorithms/QLearning/","title":"QLearning","text":"<p>Pure JAX implementations of:</p> <ul> <li>PQN-VDN (Parallelised Q-Network)</li> <li>IQL (Independent Q-Learners)</li> <li>VDN (Value Decomposition Network)</li> <li>QMIX</li> <li>TransfQMix (Transformers for Leveraging the Graph Structure of MARL Problems)</li> <li>SHAQ (Incorporating Shapley Value Theory into Multi-Agent Q-Learning)</li> </ul> <p>PQN implementation follows purejaxql. IQL, VDN and QMix follow the original Pymarl codebase while SHAQ follows the paper code. 
</p> <p>Standard algorithms (iql, vdn, qmix) support:</p> <ul> <li>MPE</li> <li>SMAX</li> <li>Overcooked (qmix not supported)</li> </ul> <p>PQN-VDN supports:</p> <ul> <li>MPE</li> <li>SMAX</li> <li>Hanabi</li> <li>Overcooked</li> </ul> <p>At the moment, PQN-VDN should be the most performant baseline for Q-Learning in terms of returns and training speed.</p> <p>\u2757 TransfQMix and Shaq still use an old implementation of the scripts and need refactoring to match the new format. </p>"},{"location":"Algorithms/QLearning/#implementation-details","title":"\u2699\ufe0f Implementation Details","text":"<p>All the algorithms take advantage of the <code>CTRolloutManager</code> environment wrapper (found in <code>jaxmarl.wrappers.baselines</code>), which is used to:</p> <ul> <li>Batchify the step and reset functions to run parallel environments.</li> <li>Add a global observation (<code>obs[\"__all__\"]</code>) and a global reward (<code>rewards[\"__all__\"]</code>) to the returns of <code>env.step</code> for centralized training.</li> <li>Preprocess and uniform the observation vectors (flatten, pad, add additional features like id one-hot encoding, etc.).</li> </ul> <p>You might want to modify this wrapper for your needs.</p>"},{"location":"Algorithms/QLearning/#usage","title":"\ud83d\ude80 Usage","text":"<p>If you have cloned JaxMARL and you are in the repository root, you can run the algorithms as scripts. You will need to specify which parameter configurations will be loaded by Hydra by choosing them (or adding yours) in the config folder. 
Below are some examples:</p> <pre><code># vdn rnn in mpe spread\npython baselines/QLearning/vdn_rnn.py +alg=ql_rnn_mpe\n# independent IQL rnn in competitive simple_tag (predator-prey)\npython baselines/QLearning/iql_rnn.py +alg=ql_rnn_mpe alg.ENV_NAME=MPE_simple_tag_v3\n# QMix with SMAX\npython baselines/QLearning/qmix_rnn.py +alg=ql_rnn_smax\n# VDN overcooked\npython baselines/QLearning/vdn_cnn_overcooked.py +alg=ql_cnn_overcooked alg.ENV_KWARGS.LAYOUT=counter_circuit\n# TransfQMix\npython baselines/QLearning/transf_qmix.py +alg=transf_qmix_smax\n\n# pqn feed-forward in MPE\npython baselines/QLearning/pqn_vdn_ff.py +alg=pqn_vdn_ff_mpe\n# pqn feed-forward in hanabi\npython baselines/QLearning/pqn_vdn_ff.py +alg=pqn_vdn_ff_hanabi\n# pqn CNN in overcooked\npython baselines/QLearning/pqn_vdn_cnn_overcooked.py +alg=pqn_vdn_cnn_overcooked\n# pqn with RNN in SMAX\npython baselines/QLearning/pqn_vdn_rnn.py +alg=pqn_vdn_rnn_smax\n</code></pre> <p>Notice that with Hydra, you can modify parameters on the go in this way:</p> <pre><code># change learning rate\npython baselines/QLearning/iql_rnn.py +alg=ql_rnn_mpe alg.LR=0.001\n# change overcooked layout\npython baselines/QLearning/pqn_vdn_cnn_overcooked.py +alg=pqn_vdn_cnn_overcooked alg.ENV_KWARGS.LAYOUT=counter_circuit\n# change smax map\npython baselines/QLearning/pqn_vdn_rnn.py +alg=pqn_vdn_rnn_smax alg.MAP_NAME=5m_vs_6m\n</code></pre> <p>Take a look at <code>config.yaml</code> for the default configuration when running these scripts. There you can choose how many seeds to vmap and you can set up WANDB. </p> <p>\u2757Note on Transformers: TransfQMix currently supports only MPE_Spread and SMAX. You will need to wrap the observation vectors into matrices to use transformers in other environments. 
See: <code>jaxmarl.wrappers.transformers</code></p>"},{"location":"Algorithms/QLearning/#hyperparameter-tuning","title":"\ud83c\udfaf Hyperparameter tuning","text":"<p>All the scripts include a tune function to perform hyperparameter tuning. To use it, set <code>\"HYP_TUNE\": True</code> in the <code>config.yaml</code> and set the hyperparameters spaces in the tune function. For more information, check wandb documentation.</p>"},{"location":"Environments/coin_game/","title":"Coin","text":"<p>JaxMARL contains an implementation of the Coin Game environment presented in Model-Free Opponent Shaping (Lu et al.). The original description and usage of the environment is from Maintaining cooperation in complex social dilemmas using deep reinforcement learning (Lerer et al.), and Learning with Opponent-Learning Awareness (Foerster et al.) is the first to popularize its use for opponent shaping. A description from Model-Free Opponent Shaping:</p> <pre><code>The Coin Game is a multi-agent grid-world environment that simulates social dilemmas like the IPD but with high dimensional dynamic states. First proposed by Lerer &amp; Peysakhovich (2017), the game consists of two players, labeled red and blue respectively, who are tasked with picking up coins, also labeled red and blue respectively, in a 3x3 grid. If a player picks up any coin by moving into the same position as the coin, they receive a reward of +1.  However, if they pick up a coin of the other player\u2019s color, the other player receives a reward of \u22122. Thus, if both agents play greedily and pick up every coin, the expected reward for both agents is 0.\n</code></pre>"},{"location":"Environments/hanabi/","title":"Hanabi","text":"<p>This directory contains a MARL environment for the cooperative card game, Hanabi, implemented in JAX. It is inspired by the popular Hanabi Learning Environment (HLE), but intended to be simpler to integrate and run with the growing ecosystem of JAX implemented RL research pipelines. 
</p>"},{"location":"Environments/hanabi/#action-space","title":"Action Space","text":"<p>Hanabi is a turn-based game. The current player can choose to discard or play any of the cards in their hand, or hint a colour or rank to any one of their teammates.</p>"},{"location":"Environments/hanabi/#observation-space","title":"Observation Space","text":"<p>The observations closely follow the featurization in the HLE. Each observation is comprised of 658 features:</p> <ul> <li>Hands (127): information about the visible hands.</li> <li>other player hand: 125 <ul> <li>card 0: 25,</li> <li>card 1: 25</li> <li>card 2: 25</li> <li>card 3: 25</li> <li>card 4: 25</li> </ul> </li> <li> <p>Hands missing card: 2 (one-hot)</p> </li> <li> <p>Board (76): encoding of the public information visible in the board.</p> </li> <li>Deck: 40, thermometer </li> <li>Fireworks: 25, one-hot</li> <li>Info Tokens: 8, thermometer</li> <li> <p>Life Tokens: 3, thermometer</p> </li> <li> <p>Discards (50): encoding of the cards in the discard pile.</p> </li> <li>Colour R: 10 bits for each card</li> <li>Colour Y: 10 bits for each card</li> <li>Colour G: 10 bits for each card</li> <li>Colour W: 10 bits for each card</li> <li> <p>Colour B: 10 bits for each card</p> </li> <li> <p>Last Action (55): encoding of the last move of the previous player.</p> </li> <li>Acting player index, relative to yourself: 2, one-hot</li> <li>MoveType: 4, one-hot</li> <li>Target player index, relative to acting player: 2, one-hot</li> <li>Color revealed: 5, one-hot</li> <li>Rank revealed: 5, one-hot</li> <li>Reveal outcome 5 bits, each bit is 1 if the card was hinted at</li> <li>Position played/discarded: 5, one-hot</li> <li>Card played/discarded 25, one-hot</li> <li>Card played scored: 1</li> <li> <p>Card played added info token: 1</p> </li> <li> <p>V0 belief (350): trivially-computed probability of being a specific card (given the played-discarded cards and the hints given), for each card of each player.</p> </li> <li>Possible 
Card (for each card): 25 (* 10)</li> <li>Colour hinted (for each card): 5 (* 10)</li> <li>Rank hinted (for each card): 5 (* 10)</li> </ul>"},{"location":"Environments/hanabi/#pretrained-models","title":"Pretrained Models","text":"<p>We make some pretrained models available. For example, you can use a JAX conversion of the original R2D2 OBL model in this way:</p> <ol> <li>Download the models from Hugging Face: <code>git clone XXXX</code> (ensure you have git lfs installed). You can also use the script: <code>bash jaxmarl/environments/hanabi/models/download_r2d2_obl.sh</code></li> <li>Load the parameters, import the agent wrapper and use it with JaxMARL Hanabi:</li> </ol> <pre><code>!git clone XXXX\nimport jax\nfrom jax import numpy as jnp\nfrom jaxmarl import make\nfrom jaxmarl.wrappers.baselines import load_params\nfrom jaxmarl.environments.hanabi.pretrained import OBLAgentR2D2\n\nweight_file = \"jaxmarl/environments/hanabi/pretrained/obl-r2d2-flax/icml_OBL1/OFF_BELIEF1_SHUFFLE_COLOR0_BZA0_BELIEF_a.safetensors\"\nparams = load_params(weight_file)\n\nagent = OBLAgentR2D2()\nagent_carry = agent.initialize_carry(jax.random.PRNGKey(0), batch_dims=(2,))\n\nrng = jax.random.PRNGKey(0)\nenv = make('hanabi')\nobs, env_state = env.reset(rng)\nenv.render(env_state)\n\nbatchify = lambda x: jnp.stack([x[agent] for agent in env.agents])\nunbatchify = lambda x: {agent:x[i] for i, agent in enumerate(env.agents)}\n\nagent_input = (\n    batchify(obs),\n    batchify(env.get_legal_moves(env_state))\n)\nagent_carry, actions = agent.greedy_act(params, agent_carry, agent_input)\nactions = unbatchify(actions)\n\nobs, env_state, rewards, done, info = env.step(rng, env_state, actions)\n\nprint('actions:', {agent:env.action_encoding[int(a)] for agent, a in actions.items()})\nenv.render(env_state)\n</code></pre>"},{"location":"Environments/hanabi/#rendering","title":"Rendering","text":"<p>You can render the full environment state:</p> <pre><code>obs, env_state = 
env.reset(rng)\nenv.render(env_state)\n\nTurn: 0\n\nScore: 0\nInformation: 8\nLives: 3\nDeck: 40\nDiscards:                                                  \nFireworks:     \nActor 0 Hand:&lt;-- current player\n0 W3 || XX|RYGWB12345\n1 G5 || XX|RYGWB12345\n2 G4 || XX|RYGWB12345\n3 G1 || XX|RYGWB12345\n4 Y2 || XX|RYGWB12345\nActor 1 Hand:\n0 R3 || XX|RYGWB12345\n1 B1 || XX|RYGWB12345\n2 G1 || XX|RYGWB12345\n3 R4 || XX|RYGWB12345\n4 W4 || XX|RYGWB12345\n</code></pre> <p>Or you can render the partial observation of the current agent:</p> <pre><code>obs, new_env_state, rewards, dones, infos  = env.step_env(rng, env_state, actions)\nobs_s = env.get_obs_str(new_env_state, env_state, a, include_belief=True, best_belief=5)\nprint(obs_s)\n\nTurn: 1\n\nScore: 0\nInformation available: 7\nLives available: 3\nDeck remaining cards: 40\nDiscards:                                                  \nFireworks:     \nOther Hand:\n0 Card: W3, Hints: , Possible: RYGWB12345, Belief: [R1: 0.060 Y1: 0.060 G1: 0.060 W1: 0.060 B1: 0.060]\n1 Card: G5, Hints: , Possible: RYGWB12345, Belief: [R1: 0.060 Y1: 0.060 G1: 0.060 W1: 0.060 B1: 0.060]\n2 Card: G4, Hints: , Possible: RYGWB12345, Belief: [R1: 0.060 Y1: 0.060 G1: 0.060 W1: 0.060 B1: 0.060]\n3 Card: G1, Hints: , Possible: RYGWB12345, Belief: [R1: 0.060 Y1: 0.060 G1: 0.060 W1: 0.060 B1: 0.060]\n4 Card: Y2, Hints: , Possible: RYGWB12345, Belief: [R1: 0.060 Y1: 0.060 G1: 0.060 W1: 0.060 B1: 0.060]\nYour Hand:\n0 Hints: ,  Possible: RYGWB2345, Belief: [R2: 0.057 R3: 0.057 R4: 0.057 Y2: 0.057 Y3: 0.057]\n1 Hints: 1, Possible: RYGWB1,    Belief: [R1: 0.200 Y1: 0.200 G1: 0.200 W1: 0.200 B1: 0.200]\n2 Hints: 1, Possible: RYGWB1,    Belief: [R1: 0.200 Y1: 0.200 G1: 0.200 W1: 0.200 B1: 0.200]\n3 Hints: ,  Possible: RYGWB2345, Belief: [R2: 0.057 R3: 0.057 R4: 0.057 Y2: 0.057 Y3: 0.057]\n4 Hints: ,  Possible: RYGWB2345, Belief: [R2: 0.057 R3: 0.057 R4: 0.057 Y2: 0.057 Y3: 0.057]\nLast action: H1\nCards afected: [1 2]\nLegal Actions: ['D0', 'D1', 
'D2', 'D3', 'D4', 'P0', 'P1', 'P2', 'P3', 'P4', 'HY', 'HG', 'HW', 'H1', 'H2', 'H3', 'H4', 'H5']\n</code></pre>"},{"location":"Environments/hanabi/#manual-game","title":"Manual Game","text":"<p>You can test the environment and your models by using the <code>manual_game.py</code> script in this folder. It lets you control one or two agents with the keyboard and one or two agents with a pretrained model (an obl model by default). For example, to play with an obl pretrained model:</p> <pre><code>python manual_game.py \\\n  --player0 \"manual\" \\\n  --player1 \"obl\" \\\n  --weight1 \"./pretrained/obl-r2d2-flax/icml_OBL1/OFF_BELIEF1_SHUFFLE_COLOR0_BZA0_BELIEF_a.safetensors\"\n</code></pre> <p>Or to watch an obl model playing with itself:</p> <pre><code>python manual_game.py \\\n  --player0 \"obl\" \\\n  --player1 \"obl\" \\\n  --weight0 \"./pretrained/obl-r2d2-flax/icml_OBL1/OFF_BELIEF1_SHUFFLE_COLOR0_BZA0_BELIEF_a.safetensors\" \\\n  --weight1 \"./pretrained/obl-r2d2-flax/icml_OBL1/OFF_BELIEF1_SHUFFLE_COLOR0_BZA0_BELIEF_a.safetensors\"\n</code></pre>"},{"location":"Environments/hanabi/#citation","title":"Citation","text":"<p>The environment was originally described in the following work: <pre><code>@article{bard2019hanabi,\n  title={The Hanabi Challenge: A New Frontier for AI Research},\n  author={Bard, Nolan and Foerster, Jakob N. and Chandar, Sarath and Burch, Neil and Lanctot, Marc and Song, H. Francis and Parisotto, Emilio and Dumoulin, Vincent and Moitra, Subhodeep and Hughes, Edward and Dunning, Ian and Mourad, Shibl and Larochelle, Hugo and Bellemare, Marc G. and Bowling, Michael},\n  journal={Artificial Intelligence Journal},\n  year={2019}\n}\n</code></pre></p>"},{"location":"Environments/jaxnav/","title":"JaxNav","text":"<p>2D geometric navigation for differential drive robots. 
Using distance readings to nearby obstacles (mimicking LiDAR readings), the direction to their goal and their current velocity, robots must navigate to their goal without colliding with obstacles. </p>"},{"location":"Environments/jaxnav/#environment-details","title":"Environment Details","text":"<p>JaxNav was first introduced in \"No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery\" with a detailed specification given in the Appendix.</p>"},{"location":"Environments/jaxnav/#map-types","title":"Map Types","text":"<p>By default, square robots of width 0.5m move within a world with grid-based obstacles, with cells of size 1m x 1m. Map cell size can be varied to produce obstacles of higher fidelity, and the robot structure can be changed to any polygon or a circle.</p> <p>We also include a map which uses polygon obstacles, but note we have not used this code in a while so there may well be issues with it.</p>"},{"location":"Environments/jaxnav/#observation-space","title":"Observation space","text":"<p>By default, each robot receives 200 range readings from a 360-degree arc centered on their forward axis. These range readings have a max range of 6m but no minimum range and are discretised with a resolution of 0.05 m. Alongside these range readings, each robot receives their current linear and angular velocities along with the direction to their goal. Their goal direction is given by a vector in polar form where the distance is either the max lidar range if the goal is beyond their \"line of sight\" or the actual distance if the goal is within their lidar range. There is no communication between agents.</p>"},{"location":"Environments/jaxnav/#action-space","title":"Action Space","text":"<p>The environment's default action space is a 2D continuous action, where the first dimension is the desired linear velocity and the second the desired angular velocity. 
Discrete actions are also supported, where the possible combinations of linear and angular velocities are discretised into 15 options.</p>"},{"location":"Environments/jaxnav/#reward-function","title":"Reward function","text":"<p>By default, the reward function contains a sparse outcome-based reward alongside a dense shaping term.</p>"},{"location":"Environments/jaxnav/#visulisation","title":"Visualisation","text":"<p>The visualiser is contained within <code>jaxnav_viz.py</code>, with an example below:</p> <pre><code>from jaxmarl.environments.jaxnav.jaxnav_env import JaxNav\nfrom jaxmarl.environments.jaxnav.jaxnav_viz import JaxNavVisualizer\nimport jax \n\nenv = JaxNav(num_agents=4)\n\nrng = jax.random.PRNGKey(0)\nrng, _rng = jax.random.split(rng)\n\nobs, env_state = env.reset(_rng)\n\nobs_list = [obs]\nenv_state_list = [env_state]\n\nfor _ in range(10):\n    rng, act_rng, step_rng = jax.random.split(rng, 3)\n    act_rngs = jax.random.split(act_rng, env.num_agents)\n    actions = {a: env.action_space(a).sample(act_rngs[i]) for i, a in enumerate(env.action_spaces.keys())}\n    obs, env_state, _, _, _ = env.step(step_rng, env_state, actions)\n    obs_list.append(obs)\n    env_state_list.append(env_state)\n\nviz = JaxNavVisualizer(env, obs_list, env_state_list)\nviz.animate(\"test.gif\")\n</code></pre>"},{"location":"Environments/jaxnav/#todos","title":"TODOs:","text":"<ul> <li>remove <code>self.rad</code> dependence for non circular agents</li> <li>more unit tests</li> <li>add tests for non-square agents</li> </ul>"},{"location":"Environments/jaxnav/#citation","title":"Citation","text":"<p>JaxNav was introduced in the following paper. If you use JaxNav in your work, please cite it as:</p> <pre><code>@misc{rutherford2024noregrets,\n      title={No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery}, \n      author={Alexander Rutherford and Michael Beukman and Timon Willi and Bruno Lacerda and Nick Hawes and Jakob Foerster},\n      year={2024},\n 
     eprint={2408.15099},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={XXXX}, \n}\n</code></pre>"},{"location":"Environments/mabrax/","title":"Multi-Agent Brax","text":"<p>This directory contains a subset of the multi-agent environments as described in the paper FACMAC: Factored Multi-Agent Centralised Policy Gradients. The task descriptions are the same as in the implementation from Gymnasium-Robotics. These are multi-agent factorisations of MuJoCo tasks such that each agent controls a subset of the joints and only observes the local state. </p> <p>Specifically, we include the following environments:</p> Environment Description <code>ant_4x2</code> 4 agents, 2 joints each. One agent controls each leg. <code>halfcheetah_6x1</code> 6 agents, 1 joint each. One agent controls each joint. <code>hopper_3x1</code> 3 agents, 1 joint each. One agent controls each joint. <code>humanoid_9\\|8</code> 2 agents, 9 and 8 joints. One agent controls the upper body, the other the lower body. <code>walker2d_2x3</code> 2 agents, 3 joints each. Factored into right and left leg."},{"location":"Environments/mabrax/#observation-space","title":"Observation Space","text":"<p>Each agent's observation vector is composed of the local state of the joints it controls, as well as the state of joints at distance 1 away in the body graph, and the state of the root body. State here refers to the position and velocity of the joint or body. All observations are continuous numbers in the range [-inf, inf].</p>"},{"location":"Environments/mabrax/#action-spaces","title":"Action Spaces","text":"<p>Each agent's action space is the input torques to the joints it controls. All environments have continuous actions in the range [-1.0, 1.0], except for <code>humanoid_9|8</code> where the range is [-0.4, 0.4].</p> <p>Note: the two agents in <code>humanoid_9|8</code> have different action space sizes. 
To pad the action spaces to be the same size pass <code>\"homogenisation_method\":\"max\"</code> to the environment. If using our config files, this would be done as: <pre><code>\"ENV_NAME\": \"humanoid_9|8\" \n\"ENV_KWARGS\": {\"homogenisation_method\":\"max\"}\n</code></pre></p>"},{"location":"Environments/mabrax/#visualisation","title":"Visualisation","text":"<p>To visualise a trajectory in a Jupyter notebook, given a list of states, you can use the following code snippet: <pre><code>from IPython.display import HTML\nfrom brax.io import html\n\nHTML(html.render(env.sys, [s.qp for s in state_history]))\n</code></pre></p>"},{"location":"Environments/mabrax/#differences-to-gymnasium-robotics-mamujoco","title":"Differences to Gymnasium-Robotics MaMuJoCo","text":"<p>A notable difference to Gymnasium-Robotics is that this JAX implementation currently fixes the observation distance to 1, whereas in the original implementation, it is a configurable parameter. This means that each agent has access to the observations of joints at distance 1 away from it in the body graph. We plan to make this a configurable parameter in a future update.</p>"},{"location":"Environments/mpe/","title":"MPE","text":"<p>Multi Particle Environments (MPE) are a set of communication-oriented environments where particle agents can (sometimes) move, communicate, see each other, push each other around, and interact with fixed landmarks. 
We implement all of the PettingZoo MPE Environments.</p> Environment JaxMARL Registry Name Simple <code>MPE_simple_v3</code> Simple Push <code>MPE_simple_push_v3</code> Simple Spread <code>MPE_simple_spread_v3</code> Simple Crypto <code>MPE_simple_crypto_v3</code> Simple Speaker Listener <code>MPE_simple_speaker_listener_v4</code> Simple Tag <code>MPE_simple_tag_v3</code> Simple World Comm <code>MPE_simple_world_comm_v3</code> Simple Reference <code>MPE_simple_reference_v3</code> Simple Adversary <code>MPE_simple_adversary_v3</code> <p>The implementations follow the PettingZoo code as closely as possible, including sharing variable names and version numbers. There are occasional discrepancies between the PettingZoo code and docs; where this occurs, we have followed the code. As our implementation closely follows the PettingZoo code, please refer to their documentation for further information on the environments.</p> <p>We additionally include a fully cooperative variant of Simple Tag, first used to evaluate FACMAC. In this environment, a number of agents attempt to tag a number of prey, where the prey are controlled by a heuristic AI.</p> Environment JaxMARL Registry 3 agents, 1 prey <code>MPE_simple_facmac_3a_v1</code> 6 agents, 2 prey <code>MPE_simple_facmac_6a_v1</code> 9 agents, 3 prey <code>MPE_simple_facmac_9a_v1</code>"},{"location":"Environments/mpe/#action-space","title":"Action Space","text":"<p>Following the PettingZoo implementation, we allow for both discrete and continuous action spaces in all MPE environments. The environments use discrete actions by default.</p> <p>Discrete (default)</p> <p>Represents the combination of movement and communication actions. Agents that can move select a value 0-4 corresponding to <code>[do nothing, down, up, left, right]</code>, while agents that can communicate choose between a number of communication options. 
The agents' abilities and the number of communication options vary between environments.</p> <p>Continuous (<code>action_type=\"Continuous\"</code>)</p> <p>Agents that can move choose continuous values over <code>[do nothing, up, down, right, left]</code> with actions summed along their axis (i.e. vertical force = up - down). Agents that can communicate output values over the dimension of their communication space. These two vectors are concatenated for agents that can move and communicate.</p>"},{"location":"Environments/mpe/#observation-space","title":"Observation Space","text":"<p>The exact observation varies for each environment, but in general it is a vector of agent/landmark positions and velocities along with any communication values.</p>"},{"location":"Environments/mpe/#visualisation","title":"Visualisation","text":"<p>Check the example <code>mpe_introduction.py</code> file in the tutorials folder for an introduction to our implementation of the MPE environments, including an example visualisation. 
We animate the environment after the state transitions have been collected as follows:</p> <pre><code>import jax \nfrom jaxmarl import make\nfrom jaxmarl.environments.mpe import MPEVisualizer\n\n# Parameters + random keys\nmax_steps = 25\nkey = jax.random.PRNGKey(0)\nkey, key_r, key_a = jax.random.split(key, 3)\n\n# Instantiate environment\nenv = make(\"MPE_simple_v3\")\nobs, state = env.reset(key_r)\n\n# Sample random actions\nkey_a = jax.random.split(key_a, env.num_agents)\nactions = {agent: env.action_space(agent).sample(key_a[i]) for i, agent in enumerate(env.agents)}\n\nstate_seq = []\nfor _ in range(max_steps):\n    state_seq.append(state)\n    # Iterate random keys and sample actions\n    key, key_s, key_a = jax.random.split(key, 3)\n    key_a = jax.random.split(key_a, env.num_agents)\n    actions = {agent: env.action_space(agent).sample(key_a[i]) for i, agent in enumerate(env.agents)}\n\n    # Step environment\n    obs, state, rewards, dones, infos = env.step(key_s, state, actions)\n\n# state_seq is a list of the jax env states passed to the step function\n# i.e. 
[state_t0, state_t1, ...]\nviz = MPEVisualizer(env, state_seq)\nviz.animate(view=True)  # can also save the animation as a .gif file with save_fname=\"mpe.gif\"\n</code></pre>"},{"location":"Environments/mpe/#citations","title":"Citations","text":"<p>MPE was originally described in the following work: <pre><code>@article{mordatch2017emergence,\n  title={Emergence of Grounded Compositional Language in Multi-Agent Populations},\n  author={Mordatch, Igor and Abbeel, Pieter},\n  journal={arXiv preprint arXiv:1703.04908},\n  year={2017}\n}\n</code></pre> The fully cooperative Simple Tag variant was proposed by: <pre><code>@article{peng2021facmac,\n  title={Facmac: Factored multi-agent centralised policy gradients},\n  author={Peng, Bei and Rashid, Tabish and Schroeder de Witt, Christian and Kamienny, Pierre-Alexandre and Torr, Philip and B{\\\"o}hmer, Wendelin and Whiteson, Shimon},\n  journal={Advances in Neural Information Processing Systems},\n  volume={34},\n  pages={12208--12221},\n  year={2021}\n}\n</code></pre></p>"},{"location":"Environments/overcooked/","title":"Overcooked","text":"<p>Overcooked-AI is a fully observable cooperative environment where two cooks (agents) must cooperate to prepare and serve onion soups. 
It is inspired by the popular videogame Overcooked.</p> <p>We implement all of the original Overcooked-AI layouts:</p> <ul> <li>Cramped Room</li> <li>Asymmetric Advantages</li> <li>Coordination Ring</li> <li>Forced Coordination</li> <li>Counter Circuit</li> </ul> <p>We also provide a simple method for creating new layouts: <pre><code>custom_layout_grid = \"\"\"\nWWPWW\nOA AO\nW   W\nWBWXW\n\"\"\"\nlayout = layout_grid_to_dict(custom_layout_grid)\n</code></pre> </p> <p>The implementation aims to be as close as possible to the original Overcooked-AI environment, including dynamics, collision logic, and action and observation spaces.</p>"},{"location":"Environments/overcooked/#a-note-on-dynamics","title":"A note on dynamics","text":"<p>In the original Overcooked-AI environment and in this JAX implementation, the pot starts cooking as soon as 3 onions are placed in the pot. An update to Overcooked-AI has since changed the dynamics to require an additional pot interaction to start cooking. Updating Overcooked-JAX to implement the new pot dynamics is on the roadmap and should be done by the end of 2023.</p>"},{"location":"Environments/overcooked/#action-space","title":"Action Space","text":"<p>There are 6 possible actions: 4 movement actions (right, down, left, up), interact and no-op.</p>"},{"location":"Environments/overcooked/#observation-space","title":"Observation Space","text":"<p>The observations follow the featurization in the original Overcooked-AI environment and are meant to be passed to a ConvNet.</p> <p>Each observation is a sparse, (mostly) binary encoding of size <code>layout_height x layout_width x n_channels</code>, where <code>n_channels = 26</code>. For a detailed description of each channel, refer to the <code>get_obs(...)</code> method in <code>overcooked.py</code>.</p>"},{"location":"Environments/overcooked/#reward","title":"Reward","text":"<p>JaxMARL's Overcooked reward is the same as in the original environment and corresponds to the score of the game. 
Specifically, a +1 reward is given to all agents when a recipe is correctly completed and delivered.</p> <p>Additionally, we include a shaped reward as per the original Overcooked environment. The shaped reward is as follows:</p> <pre><code>BASE_REW_SHAPING_PARAMS = {\n    \"PLACEMENT_IN_POT_REW\": 3, # reward for putting ingredients \n    \"PLATE_PICKUP_REWARD\": 3, # reward for picking up a plate\n    \"SOUP_PICKUP_REWARD\": 5, # reward for picking up a ready soup\n    \"DISH_DISP_DISTANCE_REW\": 0,\n    \"POT_DISTANCE_REW\": 0,\n    \"SOUP_DISTANCE_REW\": 0,\n}\n</code></pre> <p>The shaped reward is accessible in the <code>infos</code> returned by the step function of the environment.</p>"},{"location":"Environments/overcooked/#get-started","title":"Get started","text":"<p>We provide an introduction on how to initialize, visualize and unroll a policy in the environment in <code>../../tutorials/overcooked_introduction.py</code>.</p> <p>You can also try the environment yourself by running <code>python interactive.py</code>. Use the arrows to move both agents and the spacebar to interact.</p>"},{"location":"Environments/overcooked/#visualization","title":"Visualization","text":"<p>We animate a collected set of state sequences. <pre><code>from jaxmarl.viz.overcooked_visualizer import OvercookedVisualizer\n\nstate_seq = [state_t0, state_t1, ...]  # collected state sequences\n\nviz =  OvercookedVisualizer()\nviz.animate(state_seq, env.agent_view_size, filename='animation.gif')\n</code></pre></p>"},{"location":"Environments/overcooked/#limitations","title":"Limitations","text":"<p>Overcooked is an approachable and popular environment to study coordination, but has limitations, notably due to being fully observable. 
For analysis on this topic, read more here.</p>"},{"location":"Environments/overcooked/#citation","title":"Citation","text":"<p>The environment was originally described in the following work: <pre><code>@article{carroll2019utility,\n  title={On the utility of learning about humans for human-ai coordination},\n  author={Carroll, Micah and Shah, Rohin and Ho, Mark K and Griffiths, Tom and Seshia, Sanjit and Abbeel, Pieter and Dragan, Anca},\n  journal={Advances in neural information processing systems},\n  volume={32},\n  year={2019}\n}\n</code></pre></p>"},{"location":"Environments/overcooked/#to-do","title":"To Do","text":"<p>[] Clean up unused code (Randomised starts)</p> <p>[] Update dynamics to match latest version of Overcooked-AI</p>"},{"location":"Environments/smax/","title":"SMAX","text":""},{"location":"Environments/smax/#description","title":"Description","text":"<p>SMAX is a pure-JAX, SMAC-like environment. Like SMAC, it focuses on decentralised unit micromanagement across a range of scenarios. Each scenario features fixed teams.</p>"},{"location":"Environments/smax/#scenarios","title":"Scenarios","text":"<table> <thead> <tr> <th>Name</th> <th>Ally Units</th> <th>Enemy Units</th> </tr> </thead> <tbody> <tr> <td>2s3z</td> <td>2 stalkers &amp; 3 zealots</td> <td>2 stalkers &amp; 3 zealots</td> </tr> <tr> <td>3s5z</td> <td>3 stalkers &amp; 5 zealots</td> <td>3 stalkers &amp; 5 zealots</td> </tr> <tr> <td>5m_vs_6m</td> <td>5 marines</td> <td>6 marines</td> </tr> <tr> <td>10m_vs_11m</td> <td>10 marines</td> <td>11 marines</td> </tr> <tr> <td>27m_vs_30m</td> <td>27 marines</td> <td>30 marines</td> </tr> <tr> <td>3s5z_vs_3s6z</td> <td>3 stalkers &amp; 5 zealots</td> <td>3 stalkers &amp; 6 zealots</td> </tr> <tr> <td>3s_vs_5z</td> <td>3 stalkers</td> <td>5 zealots</td> </tr> <tr> <td>6h_vs_8z</td> <td>6 hydralisks</td> <td>8 zealots</td> </tr> <tr> <td>smacv2_5_units</td> <td>5 randomly chosen</td> <td>5 randomly chosen</td> </tr> <tr> <td>smacv2_10_units</td> <td>10 randomly chosen</td> <td>10 randomly chosen</td> </tr> <tr> <td>smacv2_20_units</td> <td>20 randomly chosen</td> <td>20 randomly chosen</td> </tr> </tbody> </table>"},{"location":"Environments/smax/#visualisation","title":"Visualisation","text":"<p>You can see the example <code>smax_introduction.py</code> in the tutorials folder for an introduction to SMAX, including an example visualisation. SMAX environments tick 8 times for each agent step. 
This means that when visualising, we have to expand the state sequence to encompass all ticks. This is why the <code>state_seq</code> for SMAX consists of a sequence of <code>(key, state, actions)</code> tuples: we need not only the state and actions, but also the exact key passed to the step function, in order to interpolate between the different states correctly. As a result, visualisation can be time-consuming for a large number of steps.</p> <pre><code>from jaxmarl import make\nfrom jaxmarl.environments.smax import map_name_to_scenario\nfrom jaxmarl.viz.visualizer import SMAXVisualizer\n\nscenario = map_name_to_scenario(\"3m\")\nenv = make(\n    \"HeuristicEnemySMAX\",\n    enemy_shoots=True,\n    scenario=scenario,\n    num_agents_per_team=3,\n    use_self_play_reward=False,\n    walls_cause_death=True,\n    see_enemy_actions=False,\n)\n\n# state_seq is a list of (key_s, state, actions) tuples\n# where key_s is the RNG key passed into the step function,\n# state is the jax env state and actions is the actions passed\n# into the step function.\nviz = SMAXVisualizer(env, state_seq)\n\nviz.animate(view=False, save_fname=\"output.gif\")\n</code></pre>"},{"location":"Environments/storm/","title":"STORM","text":"<p>Spatial-Temporal Representations of Matrix Games (STORM) is inspired by the \"in the Matrix\" games in Melting Pot 2.0. The STORM environment expands on matrix games by representing them as grid-world scenarios. Agents collect resources which define their strategy during interactions and are rewarded based on a pre-specified payoff matrix. This allows for the embedding of fully cooperative, competitive or general-sum games, such as the prisoner's dilemma. </p> <p>Thus, STORM can be used for studying paradigms such as opponent shaping, where agents act with the intent to change other agents' learning dynamics. 
Compared to the Coin Game or matrix games, the grid-world setting presents a variety of new challenges such as partial observability, multi-step agent interactions, temporally-extended actions, and longer time horizons. Unlike the \"in the Matrix\" games from Melting Pot, STORM features stochasticity, increasing the difficulty.</p>"},{"location":"Environments/storm/#visualisation","title":"Visualisation","text":"<p>We render each timestep and then create a gif from the collection of images. Further examples are provided here.</p> <pre><code>import jax\nimport jax.numpy as jnp\nfrom PIL import Image\nfrom jaxmarl import make\n\n# load environment\nnum_agents = 2\nrng = jax.random.PRNGKey(18)\nenv = make('storm', \n        num_inner_steps=512, \n        num_outer_steps=1, \n        num_agents=num_agents, \n        fixed_coin_location=True,\n        payoff_matrix=jnp.array([[[3, 0], [5, 1]], [[3, 5], [0, 1]]]),\n        freeze_penalty=5,)\nrng, _rng = jax.random.split(rng)\nobs, old_state = env.reset(_rng)\n\n\n# render each timestep\npics = []\nfor t in range(512):\n    rng, *rngs = jax.random.split(rng, num_agents+1)\n    actions = [jax.random.choice(\n        rngs[a], a=env.action_space(0).n, p=jnp.array([0.1, 0.1, 0.5, 0.1, 0.2])\n    ) for a in range(num_agents)]\n\n    obs, state, reward, done, info = env.step_env(\n        rng, old_state, [a for a in actions]\n    )\n\n    img = env.render(state)\n    pics.append(img)\n\n    old_state = state\n\n# create and save gif\npics = [Image.fromarray(img) for img in pics]        \npics[0].save(\n    \"state.gif\",\n    format=\"GIF\",\n    save_all=True,\n    optimize=False,\n    append_images=pics[1:],\n    duration=1000,\n    loop=0,\n)\n</code></pre>"},{"location":"Environments/switch_riddle/","title":"Switch Riddle","text":"<p>This directory contains an implementation of the Switch Riddle game presented in Learning to communicate with deep multi-agent reinforcement learning. 
The following is a prose description of the game:</p> <p>There are n prisoners in a prison and a warden. The warden decides to free the prisoners if they can solve the following problem. Every day the warden selects one of the prisoners at random and sends him to an interrogation room that contains a light bulb with a switch. If the prisoner in the room can tell that all prisoners, including himself, have been to the room at least once, the warden frees all of them; otherwise all of them are killed. Apart from the prisoner in the room, the prisoners do not know who was selected to go to the interrogation room on a given day. The prisoner in the interrogation room can switch the bulb on or off to send a signal to the next prisoner. He can also tell the warden that everyone has been to the room at least once, or decide to say nothing. If his claim is correct, all prisoners are set free; otherwise they are all killed. </p> <p>A more detailed description of the original game can be found here. The original implementation of the game is here.</p>"},{"location":"Environments/switch_riddle/#observation-action-space-and-reward","title":"Observation, action space and reward","text":"<p>In this implementation, each agent receives a 2-dimensional observation vector:</p> <table> <thead> <tr> <th>Feature Name</th> <th>Value</th> </tr> </thead> <tbody> <tr> <td>In Room</td> <td>1 if the agent is in the room, 0 otherwise</td> </tr> <tr> <td>Bulb State</td> <td>1 if the agent is in the room and the light is on, 0 otherwise</td> </tr> </tbody> </table> <p>The action space is different from the original one. In the original implementation each agent can either <em>tell the warden</em> that every agent has been in the room or <em>do nothing</em>. In addition, it can <em>pass a message</em> to the next agent (by switching the light on or off). 
In this implementation, the message is part of the action space and the light state is embedded in the observation space.</p> <table> <thead> <tr> <th>Action Key</th> <th>Action</th> </tr> </thead> <tbody> <tr> <td>0</td> <td>Do nothing</td> </tr> <tr> <td>1</td> <td>Switch the light (communicate)</td> </tr> <tr> <td>2</td> <td>Tell the warden</td> </tr> </tbody> </table> <p>The game ends when an agent tells the warden or the maximum number of time steps is reached. The reward is the same as in the original implementation:</p> <ul> <li>+1 if the agent in the room tells the warden and all the agents have been to the room.</li> <li>-1 if the agent in the room tells the warden before every agent has been to the room at least once.</li> <li>0 otherwise (also if the maximum number of time steps is reached).</li> </ul>"},{"location":"Environments/switch_riddle/#usage","title":"Usage","text":"<p>A verbose snippet that steps through the environment and prints what happens at each step:</p> <pre><code>import jax\nimport jax.numpy as jnp\nfrom jaxmarl import make\n\nkey = jax.random.PRNGKey(0)\nenv = make('switch_riddle', num_agents=5)\n\nobs, state = env.reset(key)\n\nfor _ in range(20):\n    key, key_reset, key_act, key_step = jax.random.split(key, 4)\n\n    env.render(state)\n    print(\"obs:\", obs)\n\n    # Sample random actions\n    key_act = jax.random.split(key_act, env.num_agents)\n    actions = {\n        agent: env.action_space(agent).sample(key_act[i])\n        for i, agent in enumerate(env.agents)\n    }\n\n    print(\n        \"action:\",\n        env.game_actions_idx[actions[env.agents[state.agent_in_room]].item()],\n    )\n\n    # Perform the step transition.\n    obs, state, reward, done, infos = env.step(key_step, state, actions)\n\n    print(\"reward:\", reward[\"agent_0\"])\n</code></pre> <p>The environment contains a rendering function that prints the current state of the environment:</p> <pre><code>&gt;&gt; env.render(state)\nCurrent step: 0\nBulb state: Off\nAgent in room: 0\nAgents been in room: [0 0 0 0 0]\nDone: False\n</code></pre>"},{"location":"Environments/switch_riddle/#citation","title":"Citation","text":"<p>If you use this environment, please cite:</p> 
<pre><code>@inproceedings{foerster2016learning,\n    title={Learning to communicate with deep multi-agent reinforcement learning},\n    author={Foerster, Jakob and Assael, Yannis M and de Freitas, Nando and Whiteson, Shimon},\n    booktitle={Advances in Neural Information Processing Systems},\n    pages={2137--2145},\n    year={2016} \n}\n</code></pre>"}]}