A Hybrid Paradigm for Vision Autoencoders: Unifying CNNs and Transformers for Learning Efficiency and Scalability

01 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Vision Transformer, Convolutional Neural Network, Variational AutoEncoder, Generative Models, Vision Representation
TL;DR: A CNN-ViT hybrid architecture for visual autoencoders that offers faster convergence, superior performance, scalability, and downstream-task friendliness.
Abstract: Architectures for visual Variational Autoencoders (VAEs) have been dominated by Convolutional Neural Networks (CNNs), which inherently struggle to model long-range dependencies efficiently. While Vision Transformers (ViTs) offer a promising global receptive field, their direct application to VAEs has been hampered by a critical weakness in modeling fine-grained local details, leading to significant learning inefficiencies. This has left the field at an architectural impasse, limiting progress in high-fidelity representation learning. To break this impasse, we propose TransVAE, a hybrid paradigm that unifies a shallow CNN front-end for robust local feature extraction with a deep Transformer backbone for powerful global context modeling. TransVAE demonstrates superior learning efficiency, converging faster than CNN baselines while achieving state-of-the-art results. Critically, the comprehensive visual representation from our hybrid architecture unlocks three properties. First, scalability: performance consistently improves as parameters scale from 44M to 2.3B, a feat not effectively achieved by pure ViT VAEs. Second, enhanced extrapolation: models trained on low resolutions can perform inference on arbitrary higher resolutions with superior global coherence. Third, a better harmonization of pixel-level and semantic-level representations facilitates both reconstruction and generation. TransVAE thus provides a new, effective blueprint for the next generation of visual VAEs. Code and weights will be available upon acceptance.
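The abstract's hybrid design can be illustrated at the level of tensor shapes: a shallow stride-2 CNN front-end downsamples the image, the resulting feature map is flattened into tokens for a deep Transformer backbone, and a VAE head maps tokens to a per-position latent. The sketch below is purely illustrative; all layer counts, strides, and dimensions are assumptions, not the paper's actual configuration. Note how the token count grows with input resolution, which is what allows inference at resolutions above the training size.

```python
# Shape-level sketch of a TransVAE-style hybrid encoder data flow.
# Stage counts, embed_dim, and latent_dim are illustrative assumptions.

def hybrid_encoder_shapes(h, w, cnn_stages=3, embed_dim=512, latent_dim=16):
    """Trace shapes: shallow CNN front-end -> Transformer backbone -> latent."""
    # Shallow CNN front-end: each stage halves spatial resolution
    # (e.g. a stride-2 convolution) while extracting local features.
    for _ in range(cnn_stages):
        h, w = h // 2, w // 2
    # Flatten the CNN feature map into a token sequence; global
    # self-attention then gives every token a full receptive field.
    tokens_shape = (h * w, embed_dim)
    # VAE head predicts a per-position latent (mean / log-variance).
    latent_shape = (h, w, latent_dim)
    return tokens_shape, latent_shape

# 256x256 training resolution -> 32x32 = 1024 tokens.
print(hybrid_encoder_shapes(256, 256))   # ((1024, 512), (32, 32, 16))
# 512x512 at inference -> 4096 tokens: the same weights apply,
# only the sequence length changes.
print(hybrid_encoder_shapes(512, 512))   # ((4096, 512), (64, 64, 16))
```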
Primary Area: generative models
Submission Number: 493