Teaching Diffusion Models to Ground Alpha Matte

Tianyi Xiang; Weiying Zheng; Yutao Jiang; Tingrui Shen; Hewei Yu; Yangyang Xu; Shengfeng He

Published: 11 Oct 2025, Last Modified: 15 Oct 2025. Accepted by TMLR. License: CC BY 4.0

Abstract: The power of visual language models is showcased in visual understanding tasks, where language-guided models achieve impressive flexibility and precision. In this paper, we extend this capability to the challenging domain of image matting by framing it as a soft grounding problem, enabling a single diffusion model to handle diverse objects, textures, and transparencies, all directed by descriptive text prompts. Our method teaches the diffusion model to ground alpha mattes by guiding it through a process of instance-level localization and transparency estimation. First, we introduce an intermediate objective that trains the model to accurately localize semantic components of the matte based on natural language cues, establishing a robust spatial foundation. Building on this, the model progressively refines its transparency estimation abilities, using the learned semantic structure as a prior to enhance the precision of alpha matte predictions. By treating spatial localization and transparency estimation as distinct learning objectives, our approach allows the model to fully leverage the semantic depth of diffusion models, removing the need for rigid visual priors. Extensive experiments highlight our model’s adaptability, precision, and computational efficiency, setting a new benchmark for flexible, text-driven image matting solutions.

Submission Length: Regular submission (no more than 12 pages of main content)

Supplementary Material: zip

Video: https://youtu.be/7IOBf-pzk1w

Code: https://github.com/xty435768/TeachDiffusionMatting

Assigned Action Editor: ~Lu_Jiang1

Submission Number: 5212
