Abstract: The paper presents CTVTL-CLIP, a conditional visuo-textual prompt tuning method designed to adapt large multimodal models to highly specialized applications, where the required semantic representations are not initially encoded in the models' latent space. CTVTL-CLIP jointly optimizes CLIP's text and visual encoders using sets of soft prompts that condition the representations on domain-specific knowledge and features, supported by a lightweight pretrained network that conditions the learnable textual tokens. This approach enhances the model's ability to accurately interpret medical images, aligning textual descriptions with visual representations even when annotated data is limited. CTVTL-CLIP was tested on an endoscopy image dataset containing multiple gastric lesions, significantly outperforming traditional classifiers such as CNNs and vision transformers while using fewer learnable parameters. It was also evaluated on two additional medical image analysis tasks, skin lesion classification and stenosis classification in angiographies, showing improved performance compared to state-of-the-art methods that train more parameters. The combination of efficiency and superior performance underscores CTVTL-CLIP's practical potential for real-world medical applications, where data scarcity and model efficiency are critical challenges.
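To make the conditioning mechanism described above concrete, the following is a minimal PyTorch sketch of conditional visuo-textual prompt tuning: learnable soft prompt tokens for both the text and visual branches, with a lightweight meta-network that shifts the textual tokens based on image features. All class names, dimensions, and the specific conditioning form (an additive, CoCoOp-style shift) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the paper does not specify these values.
EMBED_DIM = 512      # width of the encoder token/patch embeddings (assumed)
N_TXT_PROMPTS = 4    # number of learnable textual context tokens (assumed)
N_VIS_PROMPTS = 4    # number of learnable visual prompt tokens (assumed)
IMG_FEAT_DIM = 512   # size of the image feature fed to the meta-network (assumed)


class ConditionalPromptLearner(nn.Module):
    """Learnable text and visual soft prompts; the textual prompts are
    shifted by a lightweight meta-network conditioned on image features
    (a CoCoOp-style conditioning, assumed here for illustration)."""

    def __init__(self) -> None:
        super().__init__()
        # Soft prompts: free parameters prepended to each encoder's input.
        self.txt_prompts = nn.Parameter(torch.randn(N_TXT_PROMPTS, EMBED_DIM) * 0.02)
        self.vis_prompts = nn.Parameter(torch.randn(N_VIS_PROMPTS, EMBED_DIM) * 0.02)
        # Lightweight meta-network: maps an image feature to a per-image
        # shift applied to every textual prompt token.
        self.meta_net = nn.Sequential(
            nn.Linear(IMG_FEAT_DIM, IMG_FEAT_DIM // 16),
            nn.ReLU(inplace=True),
            nn.Linear(IMG_FEAT_DIM // 16, EMBED_DIM),
        )

    def forward(self, image_features: torch.Tensor):
        # image_features: (batch, IMG_FEAT_DIM), e.g. frozen-encoder embeddings.
        bias = self.meta_net(image_features)                       # (batch, dim)
        txt = self.txt_prompts.unsqueeze(0) + bias.unsqueeze(1)    # (batch, n_txt, dim)
        vis = self.vis_prompts.unsqueeze(0).expand(image_features.size(0), -1, -1)
        return txt, vis


if __name__ == "__main__":
    learner = ConditionalPromptLearner()
    feats = torch.randn(8, IMG_FEAT_DIM)  # stand-in for CLIP image features
    txt_prompts, vis_prompts = learner(feats)
    print(txt_prompts.shape, vis_prompts.shape)  # (8, 4, 512), (8, 4, 512)
```

In a full pipeline, the returned prompt tokens would be concatenated with the (frozen or partially tuned) encoders' input embeddings; only the prompt parameters and the small meta-network are trained, which is what keeps the learnable parameter count low.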