Instruction-ViT: Multi-modal prompts for instruction learning in vision transformer

Published: 01 Jan 2024, Last Modified: 14 May 2025 | Inf. Fusion 2024 | CC BY-SA 4.0
Abstract: Highlights
• An Instruction-ViT model is proposed that designs prompts for ViT based on instruction tuning.
• Multi-modal (text and image) prompts are fused to fine-tune the model.
• Model performance and adaptability are improved on several image understanding tasks.
• A novel strategy for fusing multi-modal prompts in visual models is offered.