Diffusion Fine-Tuning: Iterative Refinement for Advanced Grounding with Diffusion Large Language Models

12 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Diffusion Fine-Tuning, Large Vision-Language Models, Visual Grounding
TL;DR: We employ Diffusion Fine-Tuning (DFT) to train masked Diffusion Large Language Models, enabling them to perform visual grounding tasks through a "coarse-to-fine" parallel optimization process.
Abstract: While Large Vision-Language Models (LVLMs) excel at simple bounding box grounding, they reveal fundamental limitations in tasks requiring precise spatial localization, most notably polygon grounding. We identify this bottleneck as stemming from two fundamental flaws of the autoregressive (AR) paradigm: 1) irreversible error accumulation, where early vertex errors propagate uncorrected through the sequence; and 2) a lack of global planning, which leads to a suboptimal allocation of the finite vertex budget (16 points) along the object's contour. We propose Diffusion Fine-Tuning (DFT) to reframe visual grounding as a robust, parallel global optimization. Its core is a "sculpture-like", coarse-to-fine generation process, in which coordinate digits are predicted hierarchically (e.g., hundreds, then tens, then units) to progressively refine the shape from a coarse outline to precise details. We introduce a novel Hierarchical Curriculum Learning strategy that progressively sharpens the loss supervision, guiding the model from a rough outline to a precise delineation. Extensive experiments show that DFT achieves state-of-the-art performance on both 2D bounding box and 16-point polygon grounding, and demonstrates strong results on the complex 9-DoF 3D bounding box grounding task.
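The hierarchical digit prediction described in the abstract can be illustrated with a toy sketch (not the authors' implementation): at each refinement step, one more digit place of every coordinate is fixed in parallel, so all vertices sharpen together from a coarse outline to exact positions. The `refine` helper and the example coordinates below are hypothetical.

```python
def refine(coords, step):
    """Quantize each coordinate to the digit precision revealed at `step`.

    step 0 -> only hundreds digits known (granularity 100)
    step 1 -> hundreds + tens digits known (granularity 10)
    step 2 -> all digits known (granularity 1)
    """
    granularity = 10 ** (2 - step)
    return [(c // granularity) * granularity for c in coords]

# Hypothetical x-coordinates of polygon vertices on a 0-999 grid.
xs = [123, 456, 789, 42]
print(refine(xs, 0))  # [100, 400, 700, 0]   coarse outline
print(refine(xs, 1))  # [120, 450, 780, 40]  refined
print(refine(xs, 2))  # [123, 456, 789, 42]  exact
```

All coordinates advance one digit place per step, which mirrors the parallel, "sculpture-like" refinement the abstract contrasts with sequential AR decoding.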
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4335