In-Token Learning for High-Fidelity Image Restoration via Diffusion Transformers

Xingfu Yi; Xiaoxue Yu

02 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Diffusion Models, Flow Matching, Ultra-High Resolution, Super-Resolution, Automatic Colorization, Image Restoration

TL;DR: Our work establishes a new paradigm that unifies diffusion models and flow matching for high-fidelity image restoration, including ultra-high-resolution super-resolution and denoising, and further generalizes to automatic colorization.

Abstract: Diffusion-based image restoration has advanced rapidly, yet existing methods remain fragile under severe degradations, exhibiting geometric drift, identity loss, or texture hallucination. We present In-Token Learning, a token-aligned framework that redefines restoration as learning a conditional velocity field via rectified flow matching (RFM), directly transporting pure noise to clean images under intra-token alignment within a Multimodal Diffusion Transformer (MMDiT). This design enables robust, high-fidelity restoration and avoids propagating misleading details from degraded inputs. To further stabilize conditioning, we introduce Direct Low-Quality Guidance (DLG), a lightweight mechanism that injects degraded-image and prompt embeddings into the model's native text-conditioning pathway, without relying on external prompts, side branches, or sequence-level concatenation. Our framework (i) improves robustness under severe degradations, (ii) improves fidelity by narrowing the long-standing perception-distortion gap, and (iii) supports QHD ($2560{\times}1440$) inference and seamless scaling to ultra-high resolutions through fixed-length attention. We further demonstrate the first $12$K restoration of the classical scroll painting Along the River During the Qingming Festival using an unmodified backbone. Across five benchmarks (DIV2K, LSDIR, FFHQ, RealLQ250, RealPhoto60), our method achieves state-of-the-art performance on both full-reference and no-reference metrics, and it generalizes to colorization, where it also achieves state-of-the-art perceptual quality. These results position In-Token Learning as a unified and scalable paradigm across diverse tasks, degradations, and resolutions.
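The abstract frames restoration as learning a conditional velocity field with rectified flow matching. As a rough illustration of that general idea only, the sketch below uses the standard straight-line path between a noise sample and a clean sample, with the time convention t = 0 for noise and t = 1 for the clean image (an assumption, not stated in the abstract). The function names, the flat-vector "image", and the `oracle_velocity` stand-in are all hypothetical; this is not the authors' MMDiT model or their DLG conditioning.

```python
import random

def rfm_training_pair(x_clean, rng):
    """Build one rectified-flow training example from a clean sample (flat list)."""
    x_noise = [rng.gauss(0.0, 1.0) for _ in x_clean]      # source: pure Gaussian noise
    t = rng.random()                                      # time in [0, 1)
    # Point on the straight path from noise (t = 0) to clean (t = 1).
    x_t = [(1.0 - t) * n + t * c for n, c in zip(x_noise, x_clean)]
    # The velocity is constant along a straight path: clean minus noise.
    v_target = [c - n for n, c in zip(x_noise, x_clean)]
    return x_t, t, v_target

def rfm_loss(velocity_fn, x_clean, cond, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    x_t, t, v_target = rfm_training_pair(x_clean, rng)
    v_pred = velocity_fn(x_t, t, cond)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x_clean)

def euler_sample(velocity_fn, cond, dim, steps, rng):
    """Integrate dx/dt = v(x, t, cond) from noise at t = 0 to t = 1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def oracle_velocity(x, t, cond):
    """Toy stand-in for a trained network: here cond IS the clean target,
    so the exact velocity toward it is (cond - x) / (1 - t)."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]

if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, -1.0, 2.0]
    print(euler_sample(oracle_velocity, target, dim=3, steps=8, rng=rng))
```

In a real system `velocity_fn` would be a learned network conditioned on the degraded input rather than on the clean target; the oracle is only there to check that the sampler follows the velocity field to the intended endpoint.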

Primary Area: generative models

Submission Number: 987
