Abstract: In this paper, we introduce TinyVLM, a compact and efficient Vision Language Model (VLM) designed for edge devices that can be trained end-to-end in 106 A100 GPU hours, or $159. We make several adaptations to the classic ViT-LLM style of VLM: a convolution token pooler that reduces the number of visual tokens passed into the LLM by 4x, a cross-attention mechanism that fuses spatial features from a masked-autoencoder CNN to improve spatial understanding in tasks such as OCR, a patch-zooming technique that captures fine-grained image details, and a carefully curated fine-tuning dataset. Our final model has 0.6B parameters and achieves a throughput of 18 tokens/sec on an 8-core CPU machine, making it well suited to resource-constrained environments. TinyVLM strikes a strong balance between performance and resource demands, advancing the capabilities of VLMs on the edge. We open-source our complete training data, code, and intermediate checkpoints for the open-source community.
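To illustrate the convolution token pooler mentioned in the abstract, the sketch below shows one way a stride-2 convolution can merge a ViT's patch tokens on their 2D grid, cutting the token count by 4x before the tokens reach the LLM. The class name, hidden sizes, and kernel/stride choices are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a convolution-based token pooler (assumed configuration,
# not TinyVLM's exact design): a stride-2 conv merges each 2x2 neighborhood
# of ViT patch tokens, reducing the visual token count by 4x.
import torch
import torch.nn as nn


class ConvTokenPooler(nn.Module):
    """Pools a (B, N, D) sequence of ViT patch tokens over its 2D grid,
    halving each spatial side so the token count drops by 4x."""

    def __init__(self, vit_dim: int = 768, llm_dim: int = 1024):
        super().__init__()
        # Merge 2x2 token neighborhoods and project to the LLM hidden size.
        self.pool = nn.Conv2d(vit_dim, llm_dim, kernel_size=2, stride=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape
        side = int(n ** 0.5)  # assumes a square patch grid, e.g. 24x24 = 576
        grid = tokens.transpose(1, 2).reshape(b, d, side, side)
        pooled = self.pool(grid)                  # (B, llm_dim, side/2, side/2)
        return pooled.flatten(2).transpose(1, 2)  # (B, N/4, llm_dim)


if __name__ == "__main__":
    vit_tokens = torch.randn(1, 576, 768)        # e.g. 24x24 patches from a ViT
    print(ConvTokenPooler()(vit_tokens).shape)   # torch.Size([1, 144, 1024])
```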
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal pretraining, vision language navigation
Contribution Types: Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7596