Abstract: Thyroidectomy often leaves highly visible scars that affect patient quality of life; however, automated scar classification remains impractical in many small clinics because state-of-the-art models trained on large datasets, along with their weights, are not publicly available due to privacy and security constraints. In this work, we present a novel framework, XTag-CLIP, that extends the CLIP model with an attribute tagging module and a cross-attention alignment module to classify thyroidectomy scars into keloid, hypertrophic, or other categories using very limited data. XTag-CLIP employs a two-stage training strategy: first, the tagging module is pre-trained on a general scar dataset to learn robust attribute extraction; then, during fine-tuning on a small, institution-specific thyroidectomy dataset, cross-attention fusion integrates CLIP visual embeddings with tag-derived text embeddings for precise classification. In experiments on a thyroid scar dataset, XTag-CLIP improves accuracy by more than 14 percentage points and raises the F1 score by 0.0670 compared with a CLIP image-only baseline. An enhanced variant, XTag-CLIP-DualFT, which adds an intermediate tagging-module fine-tuning step, further raises performance to 78.2 % accuracy and an F1 score of 0.5997. Ablation studies confirm the necessity of both the tagging and cross-modal fusion components, and qualitative analyses demonstrate that the model's attention maps focus on clinically relevant features, enhancing interpretability and clinician trust. Consequently, XTag-CLIP provides a practical and data-efficient framework for applying multimodal vision-language models to specialized medical imaging tasks.
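The abstract only outlines the cross-attention fusion step. As a rough illustration of the general idea, and not the authors' implementation, the PyTorch sketch below fuses a CLIP image embedding (as the query) with a set of tag-derived text embeddings (as keys and values) through multi-head cross-attention before a small classification head. All module names, dimensions, the residual layer-norm, and the three-class head are assumptions made for this example.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion of a CLIP image embedding (query) with
    tag-derived text embeddings (keys/values) via cross-attention.
    Layer choices and dimensions are assumptions, not the paper's."""

    def __init__(self, embed_dim: int = 512, num_heads: int = 8, num_classes: int = 3):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)  # keloid / hypertrophic / other

    def forward(self, image_emb: torch.Tensor, tag_embs: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, D) one CLIP visual embedding per image
        # tag_embs:  (B, T, D) text embeddings of the predicted attribute tags
        query = image_emb.unsqueeze(1)                     # (B, 1, D)
        fused, _ = self.cross_attn(query, tag_embs, tag_embs)
        fused = self.norm(fused + query).squeeze(1)        # residual + norm -> (B, D)
        return self.classifier(fused)                      # (B, num_classes) logits

# Minimal usage with random tensors standing in for CLIP outputs.
if __name__ == "__main__":
    model = CrossAttentionFusion()
    image_emb = torch.randn(4, 512)    # batch of 4 CLIP image embeddings
    tag_embs = torch.randn(4, 6, 512)  # 6 tag embeddings per image
    logits = model(image_emb, tag_embs)
    print(logits.shape)                # torch.Size([4, 3])
```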
DOI: 10.1007/978-3-032-09569-5_15