Teaching Diffusion Models to Ground Alpha Matte

TMLR Paper5212 Authors

26 Jun 2025 (modified: 05 Jul 2025)Under review for TMLREveryoneRevisionsBibTeXCC BY 4.0
Abstract: The power of visual language models is showcased in visual understanding tasks, where language-guided models achieve impressive flexibility and precision. In this paper, we extend this capability to the challenging domain of image matting by framing it as a soft grounding problem, enabling a single diffusion model to handle diverse objects, textures, and transparencies, all directed by descriptive text prompts. Our method teaches the diffusion model to ground alpha mattes by guiding it through a process of instance-level localization and transparency estimation. First, we introduce an intermediate objective that trains the model to accurately localize semantic components of the matte based on natural language cues, establishing a robust spatial foundation. Building on this, the model progressively refines its transparency estimation abilities, using the learned semantic structure as a prior to enhance the precision of alpha matte predictions. By treating spatial localization and transparency estimation as distinct learning objectives, our approach allows the model to fully leverage the semantic depth of diffusion models, removing the need for rigid visual priors. Extensive experiments highlight our model’s adaptability, precision, and computational efficiency, setting a new benchmark for flexible, text-driven image matting solutions.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 5212
Loading