Keywords: Diffusion Models, Flow Matching, Ultra-High Resolution, Super-Resolution, Automatic Colorization, Image Restoration
TL;DR: Our work establishes a new paradigm that unifies diffusion models and flow matching for high-fidelity image restoration, including ultra-high-resolution super-resolution and denoising, and further generalizes to automatic colorization.
Abstract: Diffusion-based image restoration has advanced rapidly, yet existing methods remain fragile under severe degradations, exhibiting geometric drift, identity loss, or texture hallucination. We present In-Token Learning, a token-aligned framework that redefines restoration as learning a conditional velocity field via rectified flow matching (RFM), directly transporting pure noise to clean images under intra-token alignment within a Multimodal Diffusion Transformer (MMDiT). This design enables robust and high-fidelity restoration, avoiding misleading details from degraded inputs.
To further stabilize conditioning, we introduce Direct Low-Quality Guidance (DLG), a lightweight mechanism that injects degraded-image and prompt embeddings into the model's native text-conditioning pathway, without relying on external prompts, side branches, or sequence-level concatenation.
Our framework (i) improves robustness under severe degradations, (ii) improves fidelity by narrowing the long-standing perception-distortion gap, and (iii) supports QHD ($2560{\times}1440$) inference and seamless scaling to ultra-high resolutions through fixed-length attention.
We further demonstrate the first $12$K restoration of the classical scroll painting Along the River During the Qingming Festival using an unmodified backbone.
Across five benchmarks (DIV2K, LSDIR, FFHQ, RealLQ250, RealPhoto60), our method achieves state-of-the-art performance on both full-reference and no-reference metrics; it also generalizes to colorization, where it achieves state-of-the-art perceptual quality.
These results position In-Token Learning as a unified and scalable paradigm across diverse tasks, degradations, and resolutions.
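For readers unfamiliar with rectified flow matching, the following is a minimal sketch of the generic objective the abstract refers to: a straight-line path between a noise sample and a clean image, a velocity target equal to their difference, and Euler integration at inference. It is not the paper's implementation. The function names, the time convention (t=0 is noise, t=1 is the clean image), the flattened-list image representation, and the `predict_velocity(x, t, degraded)` signature are all assumptions made for illustration; the token-aligned MMDiT conditioning and DLG are not modeled here.

```python
import random

def rfm_training_pair(clean, noise, t):
    """Build one rectified-flow training example (illustrative sketch).

    clean, noise: equal-length lists of floats (a flattened image and a noise sample).
    t: time in [0, 1]; t=0 is pure noise and t=1 is the clean image (assumed convention).
    Returns (x_t, target_velocity).
    """
    # Straight-line interpolation between noise and the clean image.
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    # Along a straight path the velocity is constant: clean minus noise.
    target = [c - n for n, c in zip(noise, clean)]
    return x_t, target

def rfm_loss(predict_velocity, clean, degraded, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity is a hypothetical model callable: (x_t, t, degraded) -> list of floats.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    t = rng.random()
    x_t, target = rfm_training_pair(clean, noise, t)
    pred = predict_velocity(x_t, t, degraded)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(clean)

def euler_sample(predict_velocity, degraded, noise, steps=10):
    """Transport a noise sample toward an image by Euler-integrating the velocity field."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = predict_velocity(x, t, degraded)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch the degraded input only enters as a conditioning argument to the velocity predictor; sampling always starts from noise rather than from the degraded image, which matches the abstract's description of transporting pure noise to clean images.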
Primary Area: generative models
Submission Number: 987