Abstract: Highlights
• The Instruction-ViT model designs prompts for ViT based on instruction tuning.
• Multi-modal (text and image) prompts are fused to fine-tune the model.
• Model performance and adaptability are improved in several image understanding tasks.
• A novel strategy for fusing multi-modal prompts in visual models is offered.