EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal

ICLR 2026 Conference Submission22426 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: inpainting, object removal, test-time adaptation, diffusion models, multimodal large-language model

Abstract: Object removal requires more than erasing a target: it must reconstruct the missing region with high structural fidelity while preserving diverse background context. Existing diffusion-based, dataset-free approaches attempt to redirect self-attention away from the masked target but fail in two critical ways: (1) non-target foregrounds are often misinterpreted as background, causing unintended object regeneration, and (2) disruption of short-range activations degrades fine details and prevents coherent integration of multiple background cues. We introduce EraseLoRA, a dataset-free object-removal framework that leverages the visual reasoning of multimodal large language models (MLLMs) to exclude foreground distractions and assemble rich background content. The first stage, Background Reconstruction with Foreground Exclusion (BRF), isolates and removes non-target objects through MLLM-guided reasoning on a single image–mask pair, producing clean background candidates without ground-truth supervision. The second stage, Background Subtype Aggregation (BSA), restores the masked region by treating each inferred background subtype as a puzzle piece and enforcing their consistent integration to preserve both local detail and global context. EraseLoRA achieves state-of-the-art object-removal performance across diverse diffusion backbones without any additional training data or ground-truth background. This demonstrates that MLLM reasoning, applied here for structural reconstruction rather than object generation, can directly guide diffusion models to rebuild complex scenes from a single image with unprecedented structural and contextual coherence.

Supplementary Material: pdf

Primary Area: generative models

Submission Number: 22426
