Abstract: In this paper, we introduce TinyVLM, a compact and efficient Vision Language Model (VLM) designed for edge devices that can be trained end-to-end in 106 A100 GPU hours, or $159. We make several adaptations to the classic ViT-LLM style of VLM: a convolution token pooler that reduces the number of visual tokens passed into the LLM by 4x, a cross-attention mechanism that fuses spatial features from a masked-autoencoder CNN to improve spatial understanding in tasks such as OCR, a patch-zooming technique that captures fine-grained image details, and a carefully curated fine-tuning dataset. Our final model has 0.6B parameters and achieves a throughput of 18 tokens/sec on an 8-core CPU machine, making it well suited to resource-constrained environments. TinyVLM strikes a strong balance between performance and resource demands, advancing the capabilities of VLMs on the edge. We open-source our complete training data, code, and intermediate checkpoints for the open-source community.
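To illustrate the convolution token pooler mentioned in the abstract, the sketch below shows one way a stride-2 convolution can merge a ViT's patch tokens on their 2D grid, cutting the token count by 4x before the tokens reach the LLM. The class name, hidden sizes, and kernel/stride choices are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a convolution-based token pooler (assumed configuration,
# not TinyVLM's exact design): a stride-2 conv merges each 2x2 neighborhood
# of ViT patch tokens, reducing the visual token count by 4x.
import torch
import torch.nn as nn


class ConvTokenPooler(nn.Module):
    """Pools a (B, N, D) sequence of ViT patch tokens over its 2D grid,
    halving each spatial side so the token count drops by 4x."""

    def __init__(self, vit_dim: int = 768, llm_dim: int = 1024):
        super().__init__()
        # Merge 2x2 token neighborhoods and project to the LLM hidden size.
        self.pool = nn.Conv2d(vit_dim, llm_dim, kernel_size=2, stride=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape
        side = int(n ** 0.5)  # assumes a square patch grid, e.g. 24x24 = 576
        grid = tokens.transpose(1, 2).reshape(b, d, side, side)
        pooled = self.pool(grid)                  # (B, llm_dim, side/2, side/2)
        return pooled.flatten(2).transpose(1, 2)  # (B, N/4, llm_dim)


if __name__ == "__main__":
    vit_tokens = torch.randn(1, 576, 768)        # e.g. 24x24 patches from a ViT
    print(ConvTokenPooler()(vit_tokens).shape)   # torch.Size([1, 144, 1024])
```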
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal pretraining, vision language navigation
Contribution Types: Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7596