Keywords: Decoder-only; Vision Transformer; LLaMA
TL;DR: A successful adaptation of the decoder-only architecture to the vision Transformer.
Abstract: Using the same architecture for text and images is important for AI standardization. Recent multimodal models use a decoder-only Transformer to generate text and an encoder-only Transformer to extract image features. Can images use exactly the same architecture as language? To answer this question, we aim to use a LLaMA decoder as a vision Transformer (ViT) classifier in this paper. Specifically, we start our trajectory by "LLaMAfying" a standard ViT step by step, i.e., the feed-forward network, normalization layer, causal self-attention, and positional embedding, and point out a key issue, attention collapse, which causes network training to fail. Motivated by this observation, we propose a post-sequence class token, enabling causal self-attention to efficiently capture the entire image's information. To improve optimization behavior and enhance performance, we then introduce a soft mask strategy that gradually transforms the attention from bi-directional to causal mode. The tailored model, dubbed image LLaMA (iLLaMA), maintains high consistency with the LLaMA architecture while matching up well against ViT, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ∼310M parameters and pre-training on ImageNet-21K further improve the accuracy to 86.0%. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. Extensive experiments demonstrate iLLaMA's reliable properties: shape-texture bias, calibration, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual architectures in the era of LLMs and contribute to standardized AI models.
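Below is a minimal, illustrative sketch (not the authors' code, whose details are not given in the abstract) of the two mechanisms the abstract names: a "post-sequence" class token appended after the patch tokens so that, under a causal mask, it can attend to every patch, and a soft mask that blends a bi-directional mask into a causal one via a schedule parameter. The module name, the blending scheme (post-softmax multiplication with renormalization), and all hyperparameters are assumptions made for illustration.

```python
# Hypothetical sketch of a causal attention block with a post-sequence class
# token and a soft bi-directional-to-causal mask, based only on the abstract.
import torch
import torch.nn as nn


def soft_causal_mask(seq_len: int, alpha: float) -> torch.Tensor:
    """Blend an all-ones (bi-directional) mask into a lower-triangular (causal) mask.

    alpha = 0.0 -> fully bi-directional; alpha = 1.0 -> fully causal.
    The exact blending schedule in iLLaMA may differ; this is one plausible choice.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len))
    bidirectional = torch.ones(seq_len, seq_len)
    return (1.0 - alpha) * bidirectional + alpha * causal


class CausalAttentionWithPostClsToken(nn.Module):
    """Self-attention whose class token is placed AFTER the patch tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patch_tokens: torch.Tensor, alpha: float) -> torch.Tensor:
        B, N, C = patch_tokens.shape
        # Post-sequence class token: appended last so it can see all patches
        # even when the mask is fully causal.
        x = torch.cat([patch_tokens, self.cls_token.expand(B, -1, -1)], dim=1)
        L = N + 1

        qkv = self.qkv(x).reshape(B, L, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, L, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        # Soft masking: multiply attention weights by the blended mask and renormalize.
        attn = attn * soft_causal_mask(L, alpha).to(attn.device)
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

        out = (attn @ v).transpose(1, 2).reshape(B, L, C)
        return self.proj(out)[:, -1]  # the class token is the last position
```

In such a setup, alpha would be scheduled from 0 to 1 over the early part of training so that optimization starts from the familiar bi-directional ViT regime and ends in the fully causal, LLaMA-style regime.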
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5995