# Mean Payoff Policy Iteration

This repository contains a Python implementation of mean payoff policy iteration, applied to a specially constructed deterministic Markov decision process (DMDP). The script builds the DMDP from an integer parameter `n` and iterates until it reaches a policy that is optimal under the mean payoff criterion.

## Overview

The main components of the code are:

1. **Graph Construction**:
   The `quadr(n)` function builds a graph with a set of vertices and weighted edges. The vertices are labeled as top (`t`) and bottom (`b`) nodes, and edge weights are assigned based on the provided parameter `n`.

2. **Policy Iteration Algorithm**:
   The `MeanPayoffPolicyIteration` class encapsulates the iterative process:
   - **Initialization**: An initial policy is generated by choosing, for each vertex, the neighbor with the minimum assigned index.
   - **Iteration**: The algorithm computes appraisal values for each edge (using cycle weights and potential functions) and updates the policy. The iteration continues until the policy converges.
   - **Output**: The final optimal policy is returned along with information about the iterations taken.

3. **Execution and Output**:
   The main block of the code handles:
   - Parsing command-line arguments.
   - Timing the algorithm’s execution.
   - Optionally printing detailed intermediate outputs (graph structure, policies, appraisal values) when the `--print` flag is provided.
   - Displaying the final results, including the optimal policy, execution time, and iteration counts.
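
To make the initialization and evaluation steps from item 2 concrete, here is a minimal sketch on a hypothetical three-vertex graph (not the one built by `quadr(n)`). The function names `initial_policy` and `cycle_mean` are assumptions for illustration and may not match the script. The sketch covers only the initial policy and the mean weight of the cycle each vertex reaches; the potentials and edge appraisals used by the full algorithm are omitted.

```python
from fractions import Fraction

# Hypothetical toy graph for illustration only.
vertices = {"a": 0, "b": 1, "c": 2}   # vertex -> unique index
adjacency = {                         # vertex -> {neighbor: edge weight}
    "a": {"b": 4, "c": 1},
    "b": {"a": 2},
    "c": {"c": 3, "a": 0},
}

def initial_policy(vertices, adjacency):
    """For each vertex, choose the neighbor with the smallest index."""
    return {u: min(nbrs, key=vertices.__getitem__) for u, nbrs in adjacency.items()}

def cycle_mean(policy, adjacency, start):
    """Mean edge weight of the cycle reached from `start` when following `policy`."""
    position = {}  # vertex -> position on the walk
    path = []
    u = start
    while u not in position:
        position[u] = len(path)
        path.append(u)
        u = policy[u]
    cycle = path[position[u]:]  # the walk from the first repeated vertex onward
    total = sum(adjacency[v][policy[v]] for v in cycle)
    return Fraction(total, len(cycle))

policy = initial_policy(vertices, adjacency)
values = {u: cycle_mean(policy, adjacency, u) for u in vertices}
print(policy)
print(values)
```

Because a policy picks exactly one outgoing edge per vertex, every walk must eventually enter a cycle, so the loop in `cycle_mean` always terminates. Exact fractions avoid rounding issues when cycle means are compared.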

## Dependencies

- **Python 3.9 or later**

## Usage

Run the script from the command line with the required integer parameter `n` and, optionally, the `--print` flag for verbose output. In the commands below, `your_script.py` stands for the script's actual file name.

```bash
python your_script.py n [--print]
```

### Example

```bash
python your_script.py 3 --print
```

This command will:
- Build a graph based on `n=3`.
- Run the mean payoff policy iteration algorithm.
- Print the detailed graph structure, initial policy, intermediate steps, and the final optimal policy.
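
The command-line handling and timing described above could look roughly like the following sketch. It uses only the standard `argparse` and `time` modules; the helper names `parse_args` and `timed` are assumptions for illustration, and `sum` stands in for the actual policy iteration call.

```python
import argparse
import time

def parse_args(argv=None):
    """Parse the positional integer `n` and the optional `--print` flag."""
    parser = argparse.ArgumentParser(description="Mean payoff policy iteration")
    parser.add_argument("n", type=int, help="size parameter of the constructed graph")
    parser.add_argument("--print", action="store_true",
                        help="print detailed intermediate output")
    return parser.parse_args(argv)

def timed(func, *args):
    """Call `func(*args)` and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

# Equivalent to running: python your_script.py 3 --print
args = parse_args(["3", "--print"])
result, elapsed = timed(sum, range(args.n))  # placeholder for the real computation
print(f"n={args.n}, verbose={args.print}, elapsed={elapsed:.6f}s")
```

Passing an explicit argument list to `parse_args` keeps the sketch testable; calling it with no arguments reads from `sys.argv` as usual.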

## Code Structure

- **`Config`**:
  A dataclass used to manage configuration settings. Currently, it holds a boolean `print` flag that controls whether detailed output is shown.

- **`Graph`**:
  A class that represents the graph:
  - **Attributes**:
    - `vertices`: A dictionary mapping each vertex to a unique index.
    - `adjacency`: A dictionary of dictionaries representing the edges and their weights.
  - **Methods**:
    - `num_vertices()`: Returns the number of vertices.
    - `num_edges()`: Returns the total number of edges.

- **`MeanPayoffPolicyIteration`**:
  A class implementing the policy iteration algorithm:
  - **Initialization**: Sets the initial policy using `get_initial_policy()`.
  - **Methods**:
    - `get_initial_policy()`: Generates the starting policy.
    - `get_new_policy()`: Computes a new policy based on current appraisal values.
    - `get_optimal_policy()`: Iterates until the policy converges to an optimal solution.
  - Intermediate states (such as appraisal values and policy updates) are printed when `Config.print` is `True`.

- **`quadr(n)`**:
  A function that constructs and returns a `Graph` object. The function:
  - Creates the vertex list in the order `t1`, `b1`, ..., `bn`, followed by the remaining top vertices `t2`, `t3`, ..., `tn`.
  - Builds the `edges` dictionary where:
    - Bottom vertices have edges to earlier bottom vertices with high weights and edges to all top vertices with zero weight.
    - Top vertices have a mix of high-weight edges (to bottom vertices), zero-weight edges (to earlier top vertices), and a special self-loop with a weight depending on `n` and the vertex index.
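
As a rough illustration of the `Config` and `Graph` shapes described above, the following sketch models both as dataclasses. The field names follow this README, but the actual class definitions in the script may differ, and the two-vertex graph with its weights is hypothetical rather than the output of `quadr(n)`.

```python
from dataclasses import dataclass, field

@dataclass
class Config:
    # Controls whether detailed intermediate output is shown.
    print: bool = False

@dataclass
class Graph:
    # Maps each vertex label to a unique index.
    vertices: dict[str, int] = field(default_factory=dict)
    # Maps each vertex to {neighbor: edge weight}.
    adjacency: dict[str, dict[str, int]] = field(default_factory=dict)

    def num_vertices(self) -> int:
        return len(self.vertices)

    def num_edges(self) -> int:
        # Each entry in an inner dictionary is one directed edge.
        return sum(len(neighbors) for neighbors in self.adjacency.values())

# Hypothetical example: one top vertex with a self-loop, one bottom vertex.
graph = Graph(
    vertices={"t1": 0, "b1": 1},
    adjacency={"t1": {"t1": 5, "b1": 0}, "b1": {"t1": 0}},
)
config = Config(print=True)
print(graph.num_vertices(), graph.num_edges(), config.print)
```

Storing the adjacency as a dictionary of dictionaries gives constant-time lookup of an edge weight via `graph.adjacency[u][v]`, which is the access pattern a policy evaluation step needs.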

## Performance Metrics

At the end of execution, the script outputs:
- **Execution Time**: The total time taken to compute the optimal policy.
- **Iteration Count**: The actual number of iterations performed, alongside a conjectured count computed as `(n**2 + 7*n - 6) // 2`.
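
For reference, the formula quoted above can be evaluated for small `n` as follows. The function name `predicted_iterations` is an assumption for illustration and is not taken from the script.

```python
def predicted_iterations(n: int) -> int:
    """Evaluate the README's iteration-count formula for a given n."""
    return (n**2 + 7 * n - 6) // 2

table = [predicted_iterations(n) for n in range(1, 6)]
for n, count in enumerate(table, start=1):
    print(n, count)
```

Since `n` and `n + 7` always have opposite parity, `n**2 + 7*n` is even, so the integer division is exact for every integer `n`.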

