Learning to Remove, Not Repeat: Robust Object Removal in Cluttered Scenes using Diffusion Models

ICLR 2026 Conference Submission 19433 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Computer Vision, Diffusion Models, Image Editing
Abstract: Object removal, a key image inpainting task, aims to erase specified objects and plausibly fill the resulting region. Although recent diffusion models excel at generating realistic content, when employed for the removal task, they often fail in cluttered scenes by replicating nearby objects or hallucinating semantically similar ones, an artifact of their powerful, yet context-agnostic, generative priors. To address this, we introduce a robust framework that Learns to Remove, Not Repeat (LRNR). Our approach has three key components. First, we propose the Scatter-Tile Object Removal (STORe) dataset, a large-scale synthetic dataset with unique scatter and tile configurations designed to make models robust to object replication. Second, we employ an efficient fine-tuning strategy that combines Low-Rank Adaptation (LoRA) with a learnable task prompt, which internalizes the concept of removal, thereby eliminating the need for manual text guidance. Third, we introduce Mask-Aware Scheduled Guidance (MASG), a training-free inference technique that spatially and temporally modulates classifier-free guidance to enhance inpainting quality and preserve background integrity. Our evaluations demonstrate that LRNR outperforms state-of-the-art approaches, particularly in terms of removal success rate in challenging scenes prone to object replication, leading to more reliable and semantically correct results. Our dataset, source code, and trained models will be publicly available.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19433