# Multi-Objective Sequential Decision Making Framework

A framework for intelligent question recommendation in educational contexts, using LLM orchestrators to balance multiple learning objectives without retraining policies.

## What This Framework Does

This framework solves the problem of **multi-objective optimization in education** by:
1. **Stage 1:** Training individual policies for single objectives (performance, gap, aptitude)
2. **Stage 2:** Using LLM orchestrators to select and combine these policies without retraining
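
The two stages above can be sketched as follows. This is a minimal illustration, not the code in `main.py`: the names `make_fixed_policy`, `choose_objective`, `recommend`, and `fake_llm` are hypothetical, the "policies" are stand-ins that always return one question index, and the LLM call is replaced by a simple rule so the example runs offline.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: a state is a list of per-skill mastery values,
# a policy maps a state to a question index.
State = Sequence[float]
Policy = Callable[[State], int]
LLM = Callable[[State, List[str]], str]


def make_fixed_policy(question: int) -> Policy:
    """Stand-in for a trained single-objective policy (Stage 1)."""
    return lambda state: question


def choose_objective(state: State, objectives: List[str], ask_llm: LLM) -> str:
    """Ask the orchestrator which objective to prioritize (Stage 2).

    Falls back to the first objective if the answer is not a known name.
    """
    answer = ask_llm(state, objectives).strip().lower()
    return answer if answer in objectives else objectives[0]


def recommend(state: State, policies: Dict[str, Policy], ask_llm: LLM) -> Tuple[str, int]:
    """Pick an objective, then delegate to its already-trained policy."""
    objective = choose_objective(state, list(policies), ask_llm)
    return objective, policies[objective](state)


def fake_llm(state: State, objectives: List[str]) -> str:
    """Placeholder for a real LLM call (assumption): uneven mastery favors 'gap'."""
    return "gap" if max(state) - min(state) > 0.3 else "performance"


policies = {
    "performance": make_fixed_policy(0),
    "gap": make_fixed_policy(1),
    "aptitude": make_fixed_policy(2),
}
print(recommend([0.9, 0.2], policies, fake_llm))
```

The point of the design is that the policies are never updated in Stage 2; only the choice of which one to consult changes from step to step.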

## Quick Start

### 1. Setup
```bash
git clone [repository-url]
cd ai_planning
pip install -r requirements.txt
```

### 2. Train Individual Policies
```bash
# Train a policy for performance objective
python3 train_evaluate_policy.py --agent ppo --objectives "performance" --train_episodes 10000

# Train a policy for gap objective
python3 train_evaluate_policy.py --agent a2c --objectives "gap" --train_episodes 10000

# Train a single policy on multiple objectives (performance and gap)
python3 train_evaluate_policy.py --agent sarsa --objectives "performance" "gap"
```

### 3. Run LLM Orchestrator (Core)
```bash
python3 main.py --orchestrator_model mistral-24b-instruct \
               --policy_dir results/ \
               --objectives "performance" "gap" "aptitude" \
               --episodes 50
```

### 4. Visualize Results
```bash
# Generate mastery progression, radar plots, and per-objective reward
python3 viz.py --config_file policy_comparison_base.json

# Generate a Pearson correlation plot between aptitude and gap for a policy trained on performance
python3 viz.py --config_file policy_comparison_base.json --correlation_scatter_plot --optimized_obj performance

# Generate flow alignment plot
python3 viz.py --config_file policy_comparison_base.json --flow_zone_plot

# Generate inference time bar chart, policy switching bar chart, and policy usage bar chart
python3 viz.py --config_file policy_comparison_all_orchestrators.json --all_orchestrator_analysis
```
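
For reference, the Pearson coefficient behind the correlation scatter plot can be computed as shown here. This is a minimal standard-library sketch: `pearson_r` and the toy input values are illustrative and not taken from `viz.py`, which may compute the statistic differently (for example with NumPy or SciPy).

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy per-episode rewards (made-up numbers, for illustration only).
aptitude_rewards = [1.0, 2.0, 3.0]
gap_rewards = [3.0, 2.0, 1.0]
print(pearson_r(aptitude_rewards, gap_rewards))
```

A value near -1 would suggest the two objectives trade off against each other under the evaluated policy, while a value near +1 would suggest they improve together.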

### Complete Workflow
```bash
# 1. Train multiple policies
python3 train_evaluate_policy.py --agent ppo --objectives "performance" --train_episodes 100
python3 train_evaluate_policy.py --agent a2c --objectives "gap" --train_episodes 100
python3 train_evaluate_policy.py --agent ppo --objectives "aptitude" --train_episodes 100

# 2. Run orchestrator with trained policies
python3 main.py --orchestrator_model mistral-24b-instruct --episodes 10

# 3. Analyze results
python3 viz.py --config_file policy_comparison_base.json
```

## Project Structure

```
├── agents/              # RL agents (SARSA, A2C, PPO)
├── envs/                # Educational environment with student simulation
├── orchestrator/        # LLM orchestrator components
├── reward_handlers/     # Reward processing (scalarized, reward machine, pareto)
├── utils/               # Evaluation and training utilities
├── train_evaluate_policy.py  # Train individual policies
├── main.py              # LLM orchestrator (core)
├── viz.py               # Analysis and visualization
└── requirements.txt     # Dependencies
```
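
Two of the reward-processing styles named under `reward_handlers/` can be illustrated generically. The sketch below shows textbook weighted-sum scalarization and a Pareto-dominance check; the function names, the dict-based reward format, and the weights are assumptions for illustration, and the handlers in this repository may be implemented differently.

```python
from typing import Dict

Rewards = Dict[str, float]  # hypothetical format: objective name -> reward


def scalarize(rewards: Rewards, weights: Dict[str, float]) -> float:
    """Collapse a reward vector into one number with a weighted sum."""
    return sum(weight * rewards[name] for name, weight in weights.items())


def dominates(a: Rewards, b: Rewards) -> bool:
    """True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (Pareto dominance)."""
    at_least_as_good = all(a[name] >= b[name] for name in a)
    strictly_better = any(a[name] > b[name] for name in a)
    return at_least_as_good and strictly_better


step_reward = {"performance": 1.0, "gap": 0.5}
print(scalarize(step_reward, {"performance": 0.5, "gap": 0.5}))
print(dominates({"performance": 1.0, "gap": 0.5}, {"performance": 0.8, "gap": 0.5}))
```

Scalarization fixes the trade-off in advance through the weights, whereas a Pareto comparison keeps the objectives separate and only ranks outcomes when one is no worse on every objective.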

## Dependencies

- Python 3.8+
- PyTorch
- OpenAI/Anthropic API access
- Gymnasium
- NumPy, Pandas, Matplotlib, Seaborn
- LangChain
