$\mathbf{R^3}$-Adapter: Progressive Residual Refinement and Representational Alignment for Personalized Image Generation
Keywords: Personalized Image Generation, Representation Alignment, Diffusion Model Post-Training and Adaptation
Abstract: Personalized image generation with diffusion models has achieved remarkable success in single-subject scenarios, yet extending it to multiple subjects remains challenging. We identify two critical limitations: the multi-granularity bottleneck, where single-level representations fail to capture semantic information ranging from coarse categorical structure to fine-grained details, and the semantic alignment gap, where pixel-level optimization provides insufficient guarantees for maintaining subject identity during multi-subject composition. We propose $\mathbf{R^3}\textbf{-Adapter}$, a novel framework that addresses these challenges through Progressive Residual Refinement and Representation Alignment (REPA). Our method decomposes subject representations across four semantic levels via bounded residual corrections with timestep-adaptive routing, while REPA grounds the diffusion model's internal features in DINOv3 representations through cross-attention and self-attention alignment. Comprehensive experiments demonstrate state-of-the-art performance: 6.24\% improvement in CLIP-I and 12.24\% in DINO on single-subject tasks, with even larger gains of 21.36\% in DINO on challenging multi-subject scenarios. Ablations confirm that progressive refinement and semantic alignment operate synergistically, with their combination outperforming either component alone.
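To make the abstract's mechanism concrete, here is a minimal sketch of "bounded residual corrections with timestep-adaptive routing." Everything in it is an assumption: the class name `ProgressiveResidualRefiner`, the per-level linear correctors, the scalar-timestep router, and the tanh bound are illustrative choices, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class ProgressiveResidualRefiner(nn.Module):
    """Hypothetical sketch (not the paper's implementation): apply a small,
    bounded residual correction at each of several semantic levels, with
    per-level weights routed from the diffusion timestep."""

    def __init__(self, dim: int, num_levels: int = 4, bound: float = 0.5):
        super().__init__()
        self.bound = bound
        # One corrector per semantic level (coarse -> fine).
        self.correctors = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_levels)
        )
        # Router: scalar timestep -> softmax weights over the levels,
        # so different levels dominate at different noise scales.
        self.router = nn.Sequential(
            nn.Linear(1, 64), nn.SiLU(), nn.Linear(64, num_levels)
        )

    def forward(self, feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # feats: (batch, dim) subject features; t: (batch,) timestep in [0, 1].
        weights = torch.softmax(self.router(t.unsqueeze(-1)), dim=-1)
        out = feats
        for i, corrector in enumerate(self.correctors):
            # tanh keeps each residual in [-bound, bound]; since the routing
            # weights sum to 1, the total per-element change is also <= bound.
            residual = self.bound * torch.tanh(corrector(out))
            out = out + weights[:, i : i + 1] * residual
        return out
```

The bound matters: because each correction is clamped and the routing weights form a convex combination, the refined features can never drift more than `bound` from the input, which is one plausible way to keep identity-preserving representations stable across timesteps.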
Submission Number: 153