# Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

## Abstract

Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows.
We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision–language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.

## Key Contributions

- Modular expert interface with strict input validation (text prompt, image, etc.) and lazy pipelines
- RL environment that decomposes prompts into sub‑tasks, scores outputs, and logs episodes with artifacts
- Practical expert set spanning T2I and I2I models, both open and hosted APIs
- End‑to‑end training, inference, and evaluation scripts with Weights & Biases logging and checkpoints

## Repository Structure

```
.
├── data/                      # JSONL data and sample images
├── rl/                        # DQN subclass and policy
├── utils/                     # config, logging, prompts, API helpers
├── train.py                   # Train DQN over experts
├── inference.py               # Run a trained policy on a dataset
├── eval_t2i.py | eval_i2i.py  # Evaluators for generation/editing
├── experts.py                 # Expert interface + concrete experts
├── env.py                     # Gymnasium environment
├── requirements.txt           # Pinned dependencies
└── README.md
```

## Installation

Prereqs: Python 3.11, CUDA 12.x GPU (tested on NVIDIA T4), ffmpeg/libgl for OpenCV.

Create and activate a virtual environment, then install dependencies:

```bash
python3 -m venv envs/image-poser-env
source envs/image-poser-env/bin/activate
pip install -r requirements.txt
```

## Credentials and Environment Variables

Do not commit secrets. Export required keys in your shell or job scripts:

```bash
export HUGGING_FACE_HUB_TOKEN="<your_hf_token>"
export WANDB_API_KEY="<your_wandb_key>"
export OPENAI_API_KEY="<your_openai_key>"
export GOOGLE_API_KEY="<your_google_genai_key>"
```

Note: `utils/api_keys.py` is currently a temporary helper for the OpenAI and Google GenAI API keys. You can add your keys there, but do not commit the file with keys in it.

## Data Format

Training and evaluation data are JSONL files with two fields per line:

```json
{"prompt": "<text instruction>", "img_path": "<path-or-empty-string>"}
```

- For text‑to‑image (T2I), set `img_path` to "".
- For image‑to‑image (I2I/editing), set `img_path` to an existing image file.

Provided samples:
- `data/train.jsonl`: mixed prompts for training
- `data/gen_test.jsonl`: prompts for T2I inference
- `data/edit_test.jsonl`: prompts + images for editing
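
The format above can be checked before a run with a small loader. This is a minimal sketch, not the repository's own loader: `load_records` and the added `task` field are hypothetical names, and treating a missing image file as an error is an assumption.

```python
import json
from pathlib import Path


def load_records(jsonl_path):
    """Read a JSONL dataset and tag each record as T2I or I2I (hypothetical helper)."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            missing = {"prompt", "img_path"} - set(rec)
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            # An empty img_path marks a text-to-image record.
            rec["task"] = "i2i" if rec["img_path"] else "t2i"
            if rec["task"] == "i2i" and not Path(rec["img_path"]).is_file():
                raise FileNotFoundError(f"line {line_no}: {rec['img_path']}")
            records.append(rec)
    return records
```

Calling `load_records("data/train.jsonl")` returns a list of dicts, each with the two original fields plus the derived `task` tag.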

## Method Overview

- Observation: 1536‑dim text embedding of the current sub‑task, computed with OpenAI's `text-embedding-3-small` model.
- Action space: discrete index over experts registered in `experts.py`.
- Transition: selected expert runs with validated inputs; output image is saved and becomes the next state’s context.
- Reward: structured evaluator returns a score in [0, 10]; environment uses `reward = score/10 - 0.05 * step`.
- Episode ends when the task is completed (no remaining sub‑tasks) or `max_steps` is reached.
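
The reward rule above can be written as a small function. This is a sketch of the stated formula only: `step_reward` is a hypothetical name, and whether `step` is counted from 0 or 1 in `env.py` is not stated here, so the function takes it as given.

```python
def step_reward(score, step, penalty=0.05):
    """Map a critic score in [0, 10] to reward = score/10 - penalty * step."""
    if not 0 <= score <= 10:
        raise ValueError("score must lie in [0, 10]")
    return score / 10 - penalty * step


# The same critic score is worth less the later it is obtained,
# which nudges the policy toward shorter expert pipelines.
early = step_reward(8, step=1)  # 0.8 - 0.05
late = step_reward(8, step=5)   # 0.8 - 0.25
assert early > late
```

With the default `max_steps` of 6, the per-step penalty stays small relative to the score term, so output quality still dominates the reward.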

### Implemented Experts (selection)

- T2I: SDXL, PixArt‑α, Stable Diffusion 3.5 Large, FLUX.1‑dev, DALL·E 3, GPT‑Image‑1, Gemini 2.5 Image Preview
- I2I: Instruct‑Pix2Pix, MagicBrush, GPT‑Image‑1 Edit, Gemini 2.5 Image Preview, FLUX.1‑Kontext

Each expert declares required/optional inputs and whether it’s T2I or I2I. See `experts.py`.
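
To illustrate how declared inputs and a discrete action index could fit together, here is a minimal sketch. It is not the interface in `experts.py`: `ExpertSpec`, `validate`, `REGISTRY`, and the input names `prompt` and `image` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertSpec:
    """Hypothetical description of one expert and the inputs it accepts."""
    name: str
    kind: str  # "t2i" or "i2i"
    required: frozenset
    optional: frozenset = frozenset()

    def validate(self, inputs):
        """Return the provided inputs, or raise if the call is malformed."""
        provided = {k for k, v in inputs.items() if v is not None}
        missing = self.required - provided
        unknown = provided - self.required - self.optional
        if missing:
            raise ValueError(f"{self.name}: missing inputs {sorted(missing)}")
        if unknown:
            raise ValueError(f"{self.name}: unexpected inputs {sorted(unknown)}")
        return {k: inputs[k] for k in provided}


# The policy's discrete action is an index into this list.
REGISTRY = [
    ExpertSpec("sdxl", "t2i", frozenset({"prompt"})),
    ExpertSpec("instruct-pix2pix", "i2i", frozenset({"prompt", "image"})),
]

action = 0
checked = REGISTRY[action].validate({"prompt": "a red bicycle", "image": None})
```

Rejecting malformed calls before the expert runs means, for example, that an I2I expert selected without an input image fails fast instead of loading a model first.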

## Configuration

All defaults are defined in `utils/config.py`. Key parameters:

- `dataset_path` (str): JSONL input path (default `./data/train.jsonl`)
- DQN: `learning_rate`, `gamma`, `exploration_fraction`, `exploration_initial_eps`, `exploration_final_eps`, `learning_starts`, `buffer_size`, `batch_size`
- Experiment: `random_seed` (42), `text_embedding_dim` (1536), `max_steps` (6), `log_interval` (20), `eval_mode` (False)
- Outputs (auto‑created): `results_dir` (`/datasets/uig/results/$USER/`), `experiment_dir`, `checkpoint_dir`, `tensorboard_log`, `wandb_dir`, `stats_dir`

Example override in code:

```python
from utils.config import Config
cfg = Config(dataset_path="./data/train.jsonl", max_steps=8, learning_rate=1e-4)
```
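
The three exploration parameters follow the Stable-Baselines3 DQN convention: epsilon decays linearly from `exploration_initial_eps` to `exploration_final_eps` over the first `exploration_fraction` of training, then stays constant. The sketch below reproduces that schedule for intuition, assuming `ImagePoserDQN` does not override it; `epsilon_at` is a hypothetical helper and the default values shown are placeholders, not necessarily those in `utils/config.py`.

```python
def epsilon_at(step, total_steps, initial_eps=1.0, final_eps=0.05, fraction=0.1):
    """Linear epsilon decay over the first `fraction` of training, then constant."""
    progress = step / total_steps
    if progress >= fraction:
        return final_eps
    return initial_eps + progress * (final_eps - initial_eps) / fraction


# Epsilon at a few points of a hypothetical 1000-step run.
schedule = [epsilon_at(s, 1000) for s in (0, 50, 100, 500)]
```

A larger `exploration_fraction` keeps the policy sampling experts at random for longer, which matters when each environment step involves a costly model call.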

## Training

Local:
```bash
python train.py
```

Training does the following:
- Instantiates `ImageEnv` with `Config`
- Builds expert registry and observation space
- Trains `ImagePoserDQN` with SB3‑DQN and logs to TensorBoard and W&B
- Saves checkpoints every `log_interval` steps to `checkpoints/` as `dqn_model_<N>_steps.zip`
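
Because checkpoints follow the `dqn_model_<N>_steps.zip` naming, the available step counts can be listed with a few lines of Python. This is a convenience sketch, not a script shipped with the repository; `checkpoint_steps` is a hypothetical name.

```python
import re
from pathlib import Path

CKPT_RE = re.compile(r"dqn_model_(\d+)_steps\.zip$")


def checkpoint_steps(checkpoint_dir):
    """Return the sorted step counts of all checkpoints in a directory."""
    steps = []
    for path in Path(checkpoint_dir).glob("dqn_model_*_steps.zip"):
        match = CKPT_RE.match(path.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

The last element of the returned list is the most recent checkpoint, which is a natural value for `STEPS` when running inference.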

Resume training (edit `train_from_checkpoint.py` first so that it points to your experiment):
```bash
python train_from_checkpoint.py
```

## Inference

Update `inference.py` at the top to select the run and step:

```python
STEPS = <checkpoint_steps>
EXPERIMENT_NAME = "run_YYYYMMDD-HHMMSS"
```

Then run:
```bash
python inference.py
```

Outputs (per episode) are saved under:
```
/datasets/uig/results/$USER/<EXPERIMENT_NAME>/episode_*/step_*/*.png
```
To write elsewhere, change `results_dir` in `utils/config.py`.

## Evaluation

- T2I quality on generated images:
```bash
python eval_t2i.py
```

- I2I/edit preservation and alignment:
```bash
python eval_i2i.py
```

The evaluators produce JSON with per‑image metrics in the corresponding output directories.

## Reproducibility

- Randomness: `Config.random_seed` seeds NumPy, Python, and PyTorch; cuDNN deterministic flags are set in `env.py`.
- Versions: `requirements.txt` is pinned; we recommend using the provided versions for artifact evaluation.
- Hardware: experiments were run on 1× NVIDIA T4 (16GB). Other GPUs should work with an adjusted batch size and step count.
- Checkpoints: saved in `checkpoints/`; `inference.py` shows how to select a run and step.
- Logs: W&B project `image-poser`; TensorBoard logs under `tensorboard/` in the experiment directory.

## Outputs and Directory Layout

For each run `run_YYYYMMDD-HHMMSS/`:
- `wandb/`, `tensorboard/`, `checkpoints/`
- `stats/`: periodic JSON dumps of model usage and scores
- `episode_*/step_*/`: generated images and `info.json` with metadata
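
Given this layout, the per-step artifacts of a run can be gathered with a short script. This is a sketch under assumptions: `collect_steps` is a hypothetical helper, and since the fields of `info.json` are not documented here, the metadata is returned as an opaque dict.

```python
import json
from pathlib import Path


def collect_steps(run_dir):
    """Yield (episode, step, metadata, image paths) for each step of a run."""
    # Note: sorting is lexicographic, so episode_10 comes before episode_2.
    for info_path in sorted(Path(run_dir).glob("episode_*/step_*/info.json")):
        step_dir = info_path.parent
        with open(info_path, encoding="utf-8") as f:
            info = json.load(f)
        images = sorted(step_dir.glob("*.png"))
        yield step_dir.parent.name, step_dir.name, info, images
```

Iterating over `collect_steps(run_dir)` gives one tuple per step directory that contains an `info.json`, which is convenient for building tables of expert usage or for spot-checking images.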


## Ethical & Responsible Use

This project integrates third‑party generative models. Ensure your usage complies with model, dataset, and API licenses. Avoid uploading sensitive images to hosted APIs. Never commit API keys to source control.


---

Quick start:
```bash
pip install -r requirements.txt
export OPENAI_API_KEY=... GOOGLE_API_KEY=... HUGGING_FACE_HUB_TOKEN=... WANDB_API_KEY=...
python train.py
python inference.py  # after setting EXPERIMENT_NAME and STEPS in the file
```