# Distributed Multi-Agent Constrained Reinforcement Learning (CMARL)

This repository contains the reference implementation of our **distributed multi-agent constrained reinforcement learning (CMARL)** framework. Our approach addresses global and local constraints by leveraging **primal-dual methods**, **state augmentation** with Lagrange multipliers, and **consensus** among agents to ensure coordinated constraint satisfaction.

---

## 1. Overview

Many multi-agent systems must jointly optimize local objectives while respecting global constraints; examples include energy management in smart grids, traffic flow optimization, and resource allocation in communication networks. Traditional decentralized RL approaches struggle to guarantee constraint satisfaction across distributed agents.

To overcome this challenge, our framework introduces two key ideas:

1. **State Augmentation**  
   Each agent’s policy uses not only local states but also **dual variables** (Lagrange multipliers) as part of its observation.

2. **Primal-Dual Coordination**  
   Agents learn policies that maximize local rewards while a *consensus mechanism* updates and aligns their local multipliers, ensuring global constraints are collectively met.
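
As a minimal sketch of state augmentation, assuming a flat numeric observation and two multipliers per agent, the policy input is the local state with the current dual variables appended. The function name `augment_observation` and the example values are illustrative and not taken from this repository.

```python
from typing import Sequence


def augment_observation(local_state: Sequence[float],
                        multipliers: Sequence[float]) -> list[float]:
    """Return the policy input: local state followed by the dual variables."""
    return [*local_state, *multipliers]


# Hypothetical local state: demand, battery charge, energy price.
local_state = [3.2, 0.5, 0.18]
# Hypothetical multipliers for the global and local constraints.
multipliers = [0.7, 0.1]

obs = augment_observation(local_state, multipliers)
print(obs)  # [3.2, 0.5, 0.18, 0.7, 0.1]
```

Because the multipliers are part of the input, the same trained policy can react to different multiplier values at execution time without retraining.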

Our code provides:

- Algorithms for training and evaluating **constrained MARL** agents.
- Scripts to **replicate experiments** related to scalability, consensus dynamics, and comparison against standard baselines.
- Tools to **visualize** performance metrics and analyze parameter sensitivities.

---

## 2. Algorithmic Highlights

Our approach is summarized below (see Algorithm 1 in the associated paper):

1. **Augmented Policy Training**  
   Each agent \( i \) learns a policy \(\pi^{i,*}(s_t^i, \lambda_t^i)\) using any standard RL algorithm (e.g., PPO) but augmented with the **dual variables** \(\lambda_t^i\) in the state space. This ensures that the policy is responsive to evolving constraint multipliers.

2. **Dual Variable Updates**  
   - **Global Constraint Update**: Each agent’s local multiplier \(\lambda^i\) is updated via gradient descent on the Lagrangian, comparing the desired constraint threshold with observed performance.  
   - **Consensus Step**: Agents exchange their dual variables with neighbors to maintain consistency. Each agent’s \(\lambda^i\) moves toward the average value in its neighborhood, aligning local constraints with the global requirement.

3. **Execution**  
   - **Single Policy per MDP Type**: Agents of the same “type” (e.g., same building model) share policy parameters, reducing training complexity.  
   - **Distributed Consensus**: During execution, each agent continues updating its local multiplier \(\lambda^i\), ensuring constraints remain satisfied over time.

This primal-dual procedure converges to feasible solutions under mild assumptions, as the consensus mechanism coordinates agents to fulfill global constraints without requiring centralized control.
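
The dual update and consensus step can be sketched as follows. This is a simplified illustration, not the repository's implementation: the function name `dual_round`, the step sizes `eta` and `eps`, the bound `lam_max`, and the ring topology are all assumptions. Each multiplier first takes a local step that increases it when observed usage exceeds the threshold, then moves toward the average of its neighbors, and is finally projected onto `[0, lam_max]`.

```python
def dual_round(lams, usage, thresholds, neighbors,
               eta=0.1, eps=0.3, lam_max=10.0):
    """One round: local multiplier step, consensus step, then projection."""
    # Local step: a multiplier grows when usage exceeds its threshold.
    stepped = [lam + eta * (g - c)
               for lam, g, c in zip(lams, usage, thresholds)]
    # Consensus step: move each multiplier toward its neighborhood average.
    mixed = []
    for i, lam in enumerate(stepped):
        nbrs = neighbors[i]
        avg = sum(stepped[j] for j in nbrs) / len(nbrs) if nbrs else lam
        mixed.append(lam + eps * (avg - lam))
    # Projection onto [0, lam_max] keeps multipliers non-negative and bounded.
    return [min(max(lam, 0.0), lam_max) for lam in mixed]


# Ring of four agents (assumed topology), constraints exactly met,
# so only the consensus step changes the multipliers.
ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
lams = [0.0, 1.0, 2.0, 3.0]
for _ in range(50):
    lams = dual_round(lams, usage=[1.0] * 4, thresholds=[1.0] * 4, neighbors=ring)
print([round(lam, 3) for lam in lams])  # [1.5, 1.5, 1.5, 1.5]
```

In this example the multipliers agree on the initial mean, because the averaging weights on a ring are symmetric. With nonzero constraint slack, the local step keeps pushing individual multipliers while the consensus step pulls them back together.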

---

## 3. Use Case: Smart Grid Demand Response

We showcase the algorithm in a **smart grid management** problem, where multiple buildings (agents) independently manage their energy consumption:

- **Objective**  
  Each building aims to minimize electricity costs (negative reward) by optimally using grid energy versus battery (or solar) energy, with the ability to **shift** unmet demand to later time steps.

- **Global Constraint**  
  The sum of all agents’ grid consumption is constrained to be below a specified threshold \( c \) (e.g., a percentage of maximum peak demand), preserving grid stability and avoiding critical energy peaks.

- **Local Constraint**  
  Each building must eventually meet its own demand: any shifted load cannot remain unserved indefinitely.

- **Implementation Details**  
  - **Data Source**: We use energy demand, price, and solar generation data from the open-source Farama Foundation Gymnasium environment CityLearn[^1].
  - **Primal-Dual Augmentation**: During training, each agent’s state includes local variables (current demand, battery charge, energy price) and multipliers \(\lambda^i,\nu^i\).
  - **Policy**: Trained with Proximal Policy Optimization (PPO) in a *single-agent* manner per building type, but *multi-agent* execution ensures joint feasibility.

This scenario demonstrates how constraints (peak demand limits, no perpetual unmet demand) can be integrated into the RL loop via Lagrange multipliers and local or global updates.
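
One way to read this is that the multipliers price constraint violations into each agent's per-step reward. The sketch below assumes a Lagrangian-style shaped reward in which \(\lambda\) weights the global peak constraint, \(\nu\) weights unmet demand, and the global threshold is split equally across agents. The function name, that mapping of multipliers to constraints, the equal split, and the numbers are assumptions for illustration; the reward used in this repository may differ.

```python
def lagrangian_reward(cost, grid_use, unmet_demand, lam, nu, threshold_share):
    """Per-step reward with both constraints priced in by the multipliers.

    cost            -- electricity cost paid this step (lower is better)
    grid_use        -- energy drawn from the grid this step
    unmet_demand    -- demand shifted to later steps
    lam, nu         -- multipliers for the global and local constraints
    threshold_share -- this agent's share of the global threshold (assumed c / N)
    """
    peak_violation = grid_use - threshold_share  # positive when over the share
    return -cost - lam * peak_violation - nu * unmet_demand


# Hypothetical step: the building draws 5.0 units against a share of 4.0
# and defers 1.0 unit of demand.
r = lagrangian_reward(cost=2.0, grid_use=5.0, unmet_demand=1.0,
                      lam=0.5, nu=0.25, threshold_share=4.0)
print(r)  # -2.75
```

With both multipliers at zero the reward reduces to the negative electricity cost, which is the unconstrained objective.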

---

## 4. Experimental Results

We evaluate performance on several network configurations with different connectivity patterns (circular, linear, clustered). In all cases:

1. **Consensus vs. No Consensus**  
   - With consensus, agents align their multipliers, ensuring global constraint satisfaction without *unbounded* postponement of demand.  
   - Without consensus, multipliers of high-demand agents often diverge or hit a predefined maximum bound, leading to suboptimal solutions or unbounded load postponement.

2. **Scalability**  
   - **Single Policy per Agent Type** avoids exponential blow-up in training as the number of agents increases.  
   - During *execution*, the decentralized consensus mechanism scales linearly with the number of agents, as each agent only communicates with its neighbors.

3. **Comparison to Fixed-Penalty Baselines**  
   - A grid of **Independent PPO (IPPO)** models with fixed penalties \(\lambda, \nu\) can sometimes solve *one* constraint level but generally fails to adapt to different constraints without retraining.  
   - In contrast, our primal-dual approach automatically adjusts multipliers to varying constraints in *one* training run.
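
To get a feel for the cost of the fixed-penalty baseline, the following back-of-the-envelope sketch counts the models in a penalty grid. The grid values, the number of building types, and the number of constraint levels are hypothetical, and it assumes the whole grid is retrained for each constraint level.

```python
from itertools import product

# Hypothetical penalty grids; the values used in the experiments may differ.
lambda_grid = [0.0, 0.5, 1.0, 2.0]
nu_grid = [0.0, 0.5, 1.0]
building_types = 3      # assumed number of building types
constraint_levels = 5   # assumed number of constraint thresholds to cover

# One model per (lambda, nu) pair and building type.
penalty_pairs = list(product(lambda_grid, nu_grid))
models_per_level = len(penalty_pairs) * building_types
print(models_per_level)                      # 36
print(models_per_level * constraint_levels)  # 180
```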

These results affirm that our method robustly enforces constraints, adapts to different feasible regions, and scales to larger networks with minimal overhead.
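
The communication cost behind the scalability result can be illustrated with a short sketch. It builds circular and linear neighbor lists and counts the messages exchanged in one consensus round, assuming each agent sends its multiplier once to every neighbor. The helper names are illustrative, and the clustered topology is omitted because its exact structure is not specified here.

```python
def ring(n):
    """Circular topology: agent i talks to i-1 and i+1 (mod n)."""
    return [sorted({(i - 1) % n, (i + 1) % n}) for i in range(n)]


def line(n):
    """Linear topology: like the ring, but the two end agents are not linked."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]


def messages_per_round(neighbors):
    """Each agent sends its multiplier once to every neighbor."""
    return sum(len(nbrs) for nbrs in neighbors)


for n in (4, 8, 16):
    print(n, messages_per_round(ring(n)), messages_per_round(line(n)))
# 4 8 6
# 8 16 14
# 16 32 30
```

For these bounded-degree topologies the message count per round grows linearly with the number of agents.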

---

## 5. Requirements

- **Python** 3.10+
- **Dependencies**:  
  - `numpy`, `pandas`, `matplotlib`, `seaborn`, `networkx`, `tqdm`  
  - `gymnasium`, `stable-baselines3`, `torch`  
  - `marlAcrl` (custom module included in this repo)

**Installation**:

- **Using `environment.yml` (conda)**:
  ```bash
  conda env create -f environment.yml
  conda activate icml_cmarl
  ```
- **Using `requirements.txt` (pip)**:
  ```bash
  conda create -n my-environment python=3.10
  conda activate my-environment
  pip install -r requirements.txt
  ```

---

## 6. Usage & Reproduction of Results

1. **Train CMARL**  
   ```bash
   python train_acrl.py --filename ppo_cmarl \
       --building 1 \
       --total_timesteps 1000000 \
       --dual --verbose 1
   ```
   - Trains a primal-dual augmented agent for a single building configuration.
   - Adjust parameters such as number of buildings, environment seeds, or hyperparameters in the script/config.

2. **Train IPPO Baseline**  
   ```bash
   python train_ippo.py --config ./experiments/ippo_train_config.json
   ```
   - Performs grid search over \(\lambda, \nu\) for each building type.
   - Results stored in a specified folder for later analysis.
   - Because training the full set of models takes considerable time, we supply some trained models in the `ippo_models` folder (a subset, due to space constraints in the submission) for use with the notebooks.

3. **Evaluate CMARL and IPPO**  
   - **CMARL**:
     ```bash
     python run_cmarl.py --config ./experiments/config_test_cmarl.json
     ```
   - **IPPO**:
     ```bash
     python ippo_test.py --config ./experiments/ippo_config.json
     ```
   These scripts produce performance metrics (cost, unmet demand, constraint violations) and can generate plots or CSV logs for deeper analysis.
   **Note**: The supplied IPPO models have a hardcoded constraint value for visualization purposes. If you wish to use a different constraint level, you will need to train a new set of models.

4. **Scalability Experiments**  
   ```bash
   python scaling_experiment.py --config ./experiments/scaling_config.json
   ```
   - Varies the size of the network (number of agents) and measures execution time and consensus convergence.
   - Demonstrates near-linear scalability in execution.

---

## 7. Citation

If you use this code or adapt these methods for academic research, please cite our work.



---

## 8. License

This project is licensed under the terms of the **MIT License**. See [LICENSE](LICENSE) for details.

---

### Acknowledgments

- Farama Foundation CityLearn[^1] for open-source data on energy demand and pricing.
- Contributors to `stable-baselines3` and `gymnasium` for providing user-friendly RL tooling.

[^1]: [CityLearn on GitHub](https://github.com/farama-foundation/CityLearn)
