# 🔍 FocusDiff: Fine-Grained Text-Image Alignment for AR-Based Visual Generation

> **Anonymous repository** accompanying our submission to ICLR 2026.  
> This repo contains the code for **FocusDiff**, a method designed to improve fine-grained text-to-image alignment in autoregressive (AR) generation models.

## 📁 Project Structure

The repository is organized as follows:

```text
PairComp/
├── internvl/                # Scripts related to InternVL
├── prompt/                  # Prompt datasets used for evaluation
├── evaluate_images.py       # Main script to run evaluation for PairComp
├── README.md                # Detailed instructions for running PairComp evaluations
├── requirements.txt         # Python package dependencies
└── summary_scores.py        # Script to summarize evaluation metrics
src/
└── open_r1/                 # Source code for Pair-GRPO
    ├── internvl/            # InternVL-related scripts
    ├── models/              # Model architecture and scripts
    ├── configs.py           # Configuration settings for Pair-GRPO
    ├── eval_clip.py         # CLIP-based evaluation utilities
    ├── grpo_trainer.py      # Implementation of Pair-GRPO
    ├── grpo.py              # Entry point for running Pair-GRPO
    ├── intern_img.py        # Script for the reward model
    └── llama.py             # Inference script for Janus-Pro
```

## 🔥 Overview

**FocusDiff** addresses the challenge of fine-grained semantic control in AR-based image generation. While AR models excel at capturing global semantics, they often struggle to reflect subtle differences between similar prompts.

![FocusDiff introduction figure](assets/intro.png)

FocusDiff enhances alignment through two main innovations:

1. **FocusDiff-Data**:  
   A curated dataset of paired prompts and images with subtle semantic variations.

   ![Benchmark figure](assets/benchmark.png)

2. **Pair-GRPO**:  
   A novel reinforcement learning (RL) algorithm that extends Group Relative Policy Optimization (GRPO) to emphasize fine-grained semantic differences during training.

   ![Pair-GRPO figure](assets/grpo.png)
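
For readers unfamiliar with the base algorithm, the following is a minimal sketch of the group-relative advantage used in standard GRPO, which Pair-GRPO extends. It is not taken from `grpo_trainer.py`: the function name `group_relative_advantages`, the `eps` value, and the example rewards are illustrative assumptions, and the pair-specific extension is not shown because this README does not specify it.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by the
    group standard deviation (plus eps to avoid division by zero).
    """
    mean = fmean(rewards)
    # Population std is an assumption here; some implementations use the sample std.
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward-model scores for four images sampled from one prompt.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.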

Our method achieves strong performance on multiple benchmarks, including **GenEval**, **T2I-CompBench**, **DPG-Bench**, and our newly proposed **PairComp** benchmark.

## ✨ Main Results

### Text-to-Image Generation

![Text-to-image generation case study 1](assets/casestudy1.png)

![Text-to-image generation case study 2](assets/casestudy2.png)

### Counterfactual Generation

![Counterfactual generation examples](assets/cf.jpg)

## 🔒 Anonymity Note

This anonymous repository contains a preliminary release of the code. All identifying information has been removed to preserve anonymity during peer review.
