Abstract: Vision-language models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage the potential of VLMs in adapting to downstream tasks, context optimization methods such as prompt tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions in downstream tasks. To address this challenge, we propose context optimization with multi-knowledge representation (CoKnow), a framework that enhances prompt learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we train lightweight semantic knowledge mappers, which are capable of generating multi-knowledge representations for an input image without requiring additional priors. Extensive experiments on 11 publicly available datasets demonstrate that CoKnow outperforms a series of previous methods.
External IDs: doi:10.1109/tmm.2025.3599096
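The abstract does not specify how the semantic knowledge mappers or the learned prompts are implemented. The following is a minimal, hypothetical PyTorch sketch of the general idea only: a CoOp-style learnable context combined with a small image-conditioned mapper that injects extra "knowledge" tokens into the prompt. All names and shapes (KnowledgeMapper, PromptLearner, n_ctx, ctx_dim) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: prompt tuning with an image-conditioned "knowledge mapper".
# Module names, dimensions, and the concatenation scheme are assumptions for
# illustration; they are not taken from the CoKnow paper.
import torch
import torch.nn as nn


class KnowledgeMapper(nn.Module):
    """Small MLP that maps a frozen image feature to extra context tokens."""

    def __init__(self, feat_dim: int, ctx_dim: int, n_tokens: int = 2):
        super().__init__()
        self.n_tokens = n_tokens
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 2, ctx_dim * n_tokens),
        )

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        out = self.net(img_feat)                               # (B, ctx_dim * n_tokens)
        return out.view(-1, self.n_tokens, out.shape[-1] // self.n_tokens)


class PromptLearner(nn.Module):
    """CoOp-style shared learnable context, concatenated with mapper output."""

    def __init__(self, n_cls: int, n_ctx: int = 4, ctx_dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)          # shared context
        self.cls_emb = nn.Parameter(torch.randn(n_cls, 1, ctx_dim) * 0.02)   # stand-in for class-name token embeddings

    def forward(self, knowledge_tokens: torch.Tensor) -> torch.Tensor:
        # knowledge_tokens: (B, k, ctx_dim) produced by the mapper
        B = knowledge_tokens.shape[0]
        n_cls = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).unsqueeze(0).expand(B, n_cls, -1, -1)
        know = knowledge_tokens.unsqueeze(1).expand(-1, n_cls, -1, -1)
        cls = self.cls_emb.unsqueeze(0).expand(B, -1, -1, -1)
        # Prompt = [learned context | image-conditioned knowledge | class token]
        return torch.cat([ctx, know, cls], dim=2)              # (B, n_cls, L, ctx_dim)


if __name__ == "__main__":
    feat_dim, ctx_dim, n_cls = 512, 512, 10
    mapper = KnowledgeMapper(feat_dim, ctx_dim)
    prompts = PromptLearner(n_cls, ctx_dim=ctx_dim)
    img_feat = torch.randn(8, feat_dim)                        # stand-in for frozen CLIP image features
    out = prompts(mapper(img_feat))
    print(out.shape)                                           # torch.Size([8, 10, 7, 512])
```

In a full pipeline, the assembled prompt embeddings would be passed through the frozen CLIP text encoder and matched against the image features, with only the mapper and the learnable context being trained.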