HEAR: High-frequency Enhanced Autoregressive Modeling for Identity-Preserving Image Generation

20 Sept 2025 (modified: 04 Oct 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: generative model
Abstract: Recent autoregressive models such as LlamaGen, VAR, and Infinity have demonstrated remarkable advances in image generation, even surpassing popular diffusion models in several respects. However, diffusion models still dominate controllable image generation, particularly identity-preserving (IP) text-to-image generation, where autoregressive approaches remain underexplored. To bridge this gap, we propose ***HEAR***, a high-frequency enhanced autoregressive identity-preserving text-to-image framework built on a coarse-to-fine next-scale prediction paradigm, which leverages a key property of VAR that we identify: the separation of high- and low-frequency features across scales during image generation. Our method introduces: (1) a comprehensive identity data curation pipeline that integrates powerful open-source vision-language models (VLMs) for image filtering and recaptioning, along with diffusion models for generating high-quality synthetic training data; (2) a high-frequency identity feature tokenizer, fine-tuned with compound losses and face-specific masking, to enhance the high-frequency features essential for identity preservation; (3) a dual-control strategy in the autoregressive backbone that injects global information into the cross-attention blocks and adds a decoupled adapter operating in parallel to preserve high-frequency details. Extensive experiments demonstrate that HEAR surpasses most existing diffusion-based methods in identity-preserving image generation. This work presents a general and scalable autoregressive framework for controllable image generation.
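The abstract describes fine-tuning the tokenizer with "compound losses and face-specific masking." The paper's exact loss terms are not given here, so the sketch below is only a minimal illustration of the general idea, assuming a global L1 reconstruction term plus a face-mask-weighted term (the function name, the `hf_weight` parameter, and the specific L1 formulation are all hypothetical, not taken from the paper):

```python
import numpy as np

def face_masked_compound_loss(recon, target, face_mask, hf_weight=2.0):
    """Illustrative compound reconstruction loss (not the paper's exact loss):
    a global L1 term plus an extra L1 term restricted to the face region,
    so errors on identity-critical pixels are penalized more heavily.

    recon, target: float arrays of the same shape (e.g. H x W x C images)
    face_mask: array broadcastable to recon, 1.0 inside the face region, 0.0 elsewhere
    hf_weight: hypothetical scalar weighting the face-region term
    """
    abs_err = np.abs(recon - target)
    global_l1 = abs_err.mean()
    # Average the error only over face-region pixels (guard against empty masks).
    face_l1 = (face_mask * abs_err).sum() / max(face_mask.sum(), 1.0)
    return global_l1 + hf_weight * face_l1
```

A perfect reconstruction yields zero loss, while the same pixel error placed inside the face mask costs more than outside it, which is the intended effect of face-specific masking.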
Primary Area: generative models
Submission Number: 24306