# Revisiting Block-wise Interactions of MMDiT for Training-free Improved Synthesis

![image](./asserts/teaser.png)

## 🛠️ Method Overview
This repository provides code and resources for analyzing and improving Multimodal Diffusion Transformers (MMDiT) in text-to-image generation and editing. We introduce a systematic pipeline to investigate the roles of different blocks and their interactions with textual conditions in MMDiT-based models such as FLUX and Qwen Image.

Key features:
- Block-wise analysis: Remove, disable, or enhance textual hidden-states at specific blocks to study their impact.
- Insights: Early blocks capture semantic information, later blocks render finer details, and selective enhancement of textual conditions improves semantic attributes.
- Training-free strategies: Methods for better text alignment, precise image editing, and faster inference.
- Performance: Our approach improves T2I-CompBench and GenEval scores without sacrificing synthesis quality.

Refer to the documentation and examples to get started with block analysis, editing, and acceleration for your own diffusion models.
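
To make the block-wise operations listed above concrete, here is a toy sketch in plain Python. It is not the repository's implementation: `toy_block` and `run_blocks` are hypothetical names, the "block" is a stand-in that simply adds the mean text signal to the image state, and "strengthen" is assumed to be a plain multiplicative scaling of the text hidden states.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def toy_block(image: Vector, text: Vector) -> Vector:
    """Toy stand-in for one block: the image state absorbs the mean text signal."""
    text_mean = sum(text) / len(text)
    return [x + text_mean for x in image]


def run_blocks(
    num_blocks: int,
    image: Vector,
    text: Vector,
    removed_layers: Optional[Sequence[int]] = None,
    modulated_layers: Optional[Sequence[int]] = None,
    scale: float = 1.5,
    way: str = "strengthen",
) -> Vector:
    """Run the toy blocks in order, skipping or modulating the selected ones."""
    if way not in ("strengthen", "empty"):
        raise ValueError(f"unknown modulation way: {way!r}")
    removed = set(removed_layers or [])
    modulated = set(modulated_layers or [])
    for index in range(num_blocks):
        if index in removed:
            continue  # remove: the block is skipped entirely
        block_text = text
        if index in modulated:
            if way == "strengthen":
                # Assumption: strengthening is a simple scaling of the text states.
                block_text = [scale * t for t in text]
            else:
                # "empty": zero the text states to probe what the block needs from them.
                block_text = [0.0] * len(text)
        image = toy_block(image, block_text)
    return image
```

Comparing the output of `run_blocks` with and without a given index in `removed_layers` or `modulated_layers` mirrors the probing procedure: the difference shows how much that block's text interaction contributes to the final state.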

## 🚀 Getting Started
### Environment Requirement 🌍

We recommend Python 3.8+ and PyTorch 1.12+ with CUDA support. The code is compatible with `diffusers==0.35.0`; alternatively, you can install the local version of diffusers included in this repo.

```shell
pip install -r requirements.txt

cd diffusers
pip install -e .
```

### Minimal Example for Inference 🐍
We provide a minimal example for inference using the FLUX model. The only things you need to modify are the inference parameters `modulated_layers` and `modulated_scales`, which enhance text alignment. `modulated_phases` can also be set for token-level enhancement.


```python

import torch
from diffusers import FluxPipeline 
# Make sure the local version of diffusers from this repo is the one being imported.


pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # Save VRAM by offloading the model to CPU; remove this if you have enough GPU memory.

prompt = "A cat and dog playing together in the park, photorealistic, high quality, 4k"

image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    removed_layers=None,
    generator=torch.Generator("cpu").manual_seed(0),
    modulated_layers=[2, 7, 12, 17, 22],  # improve non-spatial text alignment
    modulated_scales=1.5,   # Optional[Union[float, List[float]]], default 1.5
    modulated_phases=None,  # Optional[List[str]]; phrases to enhance, None means sentence-level enhancement
    # modulated_ways=modulated_ways  # use "empty" for probing analysis
).images[0]
image.save("flux-dev-enhance.png")

```

### Image Editing ✂️

A minimal example for editing is not provided yet. For editing, we apply our proposed enhancement technique to selected blocks on top of [StableFlow](https://github.com/snap-research/stable-flow).

### Acceleration ⚡
We provide a script, `flux_teacache_ours.py`, that combines our methods with TeaCache to accelerate the FLUX model. Run it as follows for acceleration evaluation. To remove or modulate different layers, change the `layers` and `layers2` variables.

```shell
export layers="5 10 15 20 25 30 35 40 45 50 55"
export layers2="30 40 50"
python flux_teacache_ours.py \
    --prompt_file ./T2I_CompBench_sampled_160prompts.txt \
    --save_dir ./teacache_results/flux_noteacache0.4_160_wocfg_${layers// /_}_remove${layers2// /_} \
    --removed_cfg_layers ${layers} --removed_layers ${layers2} \
    --enable_teacache --teacache_strength 0.4
```
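
The `${layers// /_}` expansions in the command above replace every space in the layer lists with an underscore, so the chosen layers end up in the output directory name. The short sketch below reproduces that naming in Python, which can help when locating results from an evaluation script; `build_save_dir` is a hypothetical helper, not part of the repository.

```python
def build_save_dir(
    removed_cfg_layers: str,
    removed_layers: str,
    root: str = "./teacache_results",
) -> str:
    """Mirror the shell expansion used for --save_dir in the command above."""
    cfg_tag = removed_cfg_layers.replace(" ", "_")   # same as ${layers// /_}
    removed_tag = removed_layers.replace(" ", "_")   # same as ${layers2// /_}
    return f"{root}/flux_noteacache0.4_160_wocfg_{cfg_tag}_remove{removed_tag}"


layers = "5 10 15 20 25 30 35 40 45 50 55"
layers2 = "30 40 50"
print(build_save_dir(layers, layers2))
# ./teacache_results/flux_noteacache0.4_160_wocfg_5_10_15_20_25_30_35_40_45_50_55_remove30_40_50
```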


## Parameters that can be modified for different applications

The main changes are the following parameters added to the `__call__` function of the SD3, FLUX, and Qwen Image pipelines:

| Parameter | Type / Values | Default | Description |
|----|----|----|----|
| `removed_layers`  | Optional[Union[int, List[int]]] | `None`     | Layers to **remove**. If set, the blocks in this list are skipped during inference. Default is `None`, meaning no layers are removed. |
| `modulated_layers`| Optional[Union[int, List[int]]] | `None`     | Layers where **text conditions are modulated**. Default is `None`, meaning no layers are modulated.   |
| `modulated_scales`| Optional[Union[float, List[float]]] | `1.5`    | **Scaling factor(s)** controlling the strength of modulation. Only effective if `modulated_layers` is set. |
| `modulated_phases`| Optional[Union[str, List[str]]] | `None`     | **Target phrases** for token-level enhancement. Default is `None`, meaning sentence-level enhancement. Only effective if `modulated_layers` is set. |
| `modulated_ways`  | {"strengthen", "empty"}       | `"strengthen"`| Defines **how** to modulate: `"strengthen"` amplifies the text conditions, `"empty"` zeros the text conditions for probing analysis. |
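
Because `modulated_layers` and `modulated_scales` each accept either a single value or a list, it can help to check how they pair up before launching a long run. The sketch below shows one way to do that. `resolve_layer_scales` is a hypothetical helper, and the one-to-one pairing of list-valued scales with layers is an assumption, not confirmed behavior of the pipelines.

```python
from typing import Dict, List, Optional, Union


def resolve_layer_scales(
    modulated_layers: Optional[Union[int, List[int]]],
    modulated_scales: Union[float, List[float]] = 1.5,
) -> Dict[int, float]:
    """Return a {layer_index: scale} mapping for the given arguments."""
    if modulated_layers is None:
        return {}  # nothing is modulated, so the scales have no effect
    if isinstance(modulated_layers, int):
        layers = [modulated_layers]
    else:
        layers = list(modulated_layers)
    if isinstance(modulated_scales, (int, float)):
        # A single scale is shared by every modulated layer.
        scales = [float(modulated_scales)] * len(layers)
    else:
        # Assumption: a list of scales is matched to the layers one-to-one.
        scales = [float(s) for s in modulated_scales]
        if len(scales) != len(layers):
            raise ValueError(
                f"got {len(scales)} scales for {len(layers)} modulated layers"
            )
    return dict(zip(layers, scales))
```

For instance, `resolve_layer_scales([2, 7, 12, 17, 22], 1.5)` assigns the scale 1.5 to each of the five layers used in the minimal inference example.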

## Datasets 📂 and Evaluation 🥇
For the probing analysis, we use the datasets in the `./prompts` folder, which were filtered through manual human checking. The datasets include:
- `prompts_number_gpt5_filterd.txt`: A set of prompts focusing on numerical attributes, used to evaluate the model's ability to accurately render numbers in generated images.
- `prompts_object_color_gpt5_filtered.txt`: A collection of prompts emphasizing object colors, designed to assess the model's performance in capturing and reproducing color details in images.
- `prompts_object_position_gpt5_filtered.txt`: A dataset of prompts centered around spatial relationships, aimed at evaluating how well the model understands and represents spatial arrangements in generated images.


We also provide the QwenVL-2.5 VQA code in `./QwenVL_VQA` for probing analysis. For evaluation, we use T2I-CompBench++ and GenEval.
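
A minimal sketch for loading one of the prompt files into a list is shown below. It assumes the files store one prompt per line, which this README does not state, and `load_prompts` is a hypothetical helper rather than part of the repository.

```python
from pathlib import Path
from typing import List, Union


def load_prompts(path: Union[str, Path]) -> List[str]:
    """Read a prompt file, assuming one prompt per line, and drop blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Each returned prompt can then be passed as `prompt` to the pipeline call in the minimal inference example, for instance after `load_prompts("./prompts/prompts_object_color_gpt5_filtered.txt")`.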

## Acknowledgements

We would like to thank the following open-source projects and their contributors for providing benchmarks and tools that facilitated our research:

- [T2I-CompBench++](https://github.com/Karine-Huang/T2I-CompBench)
- [GenEval](https://github.com/djghosh13/geneval)
- [CountGD](https://github.com/niki-amini-naieni/CountGD)
- [Stable Flow](https://github.com/snap-research/stable-flow)
- [Diffusers](https://github.com/huggingface/diffusers)
