Abstract: Recent advances underscore the potential of deep learning-based Computer-Assisted Diagnosis (CAD) systems for cervical cytology image analysis. However, traditional methods that focus solely on a single view of cells fall short in performance due to the lack of contextual information. Moreover, the unclear reasoning behind a model's classifications hinders interpretability. To overcome these issues, we present CerviCAT, a context-aware, text-assisted multimodal framework for cervical cytology cell classification. CerviCAT captures visual cell representations from both global and local perspectives and then generates textual descriptions based on these representations. A multimodal transformer subsequently integrates the descriptions with the visual features for interpretable and accurate cell classification. Additionally, we introduce Cyto-Vicuna, a cytology-specific large language model fine-tuned from Vicuna-7B on collected cytology-specific data. When integrated into CerviCAT, it produces more detailed diagnostic reports while fostering interaction between the model and cytologists, promoting collaborative diagnosis. Our results demonstrate that CerviCAT not only surpasses traditional CAD methods in performance but also provides interpretable diagnoses.