A Hybrid Paradigm for Vision Autoencoders: Unifying CNNs and Transformers for Learning Efficiency and Scalability

01 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Vision Transformer, Convolutional Neural Network, Variational AutoEncoder, Generative Models, Vision Representation
TL;DR: A CNN-ViT hybrid architecture for visual autoencoders that offers faster convergence, superior performance, scalability, and downstream-task friendliness.
Abstract: Architectures for visual Variational Autoencoders (VAEs) have been dominated by Convolutional Neural Networks (CNNs), which inherently struggle to model long-range dependencies efficiently. While Vision Transformers (ViTs) offer a promising global receptive field, their direct application to VAEs has been hampered by a critical weakness in modeling fine-grained local details, leading to significant learning inefficiencies. This has left the field at an architectural impasse, limiting progress in high-fidelity representation learning. To break this impasse, we propose TransVAE, a hybrid paradigm that unifies a shallow CNN front-end for robust local feature extraction with a deep Transformer backbone for powerful global context modeling. TransVAE demonstrates superior learning efficiency, converging faster than CNN baselines while achieving state-of-the-art results. Critically, the comprehensive visual representation from our hybrid architecture unlocks three properties. First, scalability: performance consistently improves as parameters scale from 44M to 2.3B, a feat not effectively achieved by pure ViT VAEs. Second, enhanced extrapolation: models trained on low resolutions can perform inference on arbitrary higher resolutions with superior global coherence. Third, a better harmonization of pixel-level and semantic-level representations facilitates both reconstruction and generation. TransVAE thus provides a new, effective blueprint for the next generation of visual VAEs. Code and weights will be available upon acceptance.
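The abstract's hybrid design can be illustrated at the level of tensor shapes: a shallow stride-2 CNN front-end downsamples the image, the resulting feature map is flattened into tokens for a deep Transformer backbone, and a VAE head maps tokens to a per-position latent. The sketch below is purely illustrative; all layer counts, strides, and dimensions are assumptions, not the paper's actual configuration. Note how the token count grows with input resolution, which is what allows inference at resolutions above the training size.

```python
# Shape-level sketch of a TransVAE-style hybrid encoder data flow.
# Stage counts, embed_dim, and latent_dim are illustrative assumptions.

def hybrid_encoder_shapes(h, w, cnn_stages=3, embed_dim=512, latent_dim=16):
    """Trace shapes: shallow CNN front-end -> Transformer backbone -> latent."""
    # Shallow CNN front-end: each stage halves spatial resolution
    # (e.g. a stride-2 convolution) while extracting local features.
    for _ in range(cnn_stages):
        h, w = h // 2, w // 2
    # Flatten the CNN feature map into a token sequence; global
    # self-attention then gives every token a full receptive field.
    tokens_shape = (h * w, embed_dim)
    # VAE head predicts a per-position latent (mean / log-variance).
    latent_shape = (h, w, latent_dim)
    return tokens_shape, latent_shape

# 256x256 training resolution -> 32x32 = 1024 tokens.
print(hybrid_encoder_shapes(256, 256))   # ((1024, 512), (32, 32, 16))
# 512x512 at inference -> 4096 tokens: the same weights apply,
# only the sequence length changes.
print(hybrid_encoder_shapes(512, 512))   # ((4096, 512), (64, 64, 16))
```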
Primary Area: generative models
Submission Number: 493