\PassOptionsToPackage{table}{xcolor}
\documentclass{midl} % Include author names
% \documentclass[anon]{midl} % Anonymized submission

\usepackage{midl_custom}

\begin{document}

\maketitle

\begin{abstract}

Medical image segmentation allows quantifying target structure size and shape, aiding in disease diagnosis, prognosis, surgery planning, and comprehension.
Building upon recent advancements in foundation Vision-Language Models (VLMs) from natural image-text pairs, several studies have proposed adapting them to Vision-Language Segmentation Models (VLSMs) that allow using language text as an additional input to segmentation models. 
Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data.

Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint representation of vision-language in segmentation problems remains underexplored.
This work presents the first systematic study on transferring VLSMs to 2D medical images, using $11$ carefully curated datasets encompassing diverse modalities, together with insightful language prompts and experiments.
Our findings demonstrate that although VLSMs show competitive performance compared to image-only models for segmentation after finetuning on limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role.
While VLSMs exhibit enhanced performance in handling pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts.
The code and datasets are available at \url{https://github.com/naamiinepal/medvlsm}.

\end{abstract}

\section{Introduction}
\label{sec:introduction}

Medical image segmentation is crucial for various clinical applications such as diagnosis, prognosis, and surgery planning.
The latest supervised segmentation models exhibit promising outcomes across diverse imaging modalities, anatomies, and diseases \citep{milletari2016v, havaei2017brain, zhou2018unet++, chen2021transunet, isensee2021nnu, hatamizadeh2022unetr, oktay2022attention, wazir2022histoseg}.
Despite their success, these models are constrained to predefined foreground classes on specific modalities and anatomies, lacking adaptability to auxiliary information and hindering their application outside extensive population-based studies.

The integration of VLMs \citep{huang2020pixel, jia2021scaling, li2021supervision, radford2021learning, furst2022cloob, singh2022flava, zhai2022lit} into VLSMs \citep{luddecke2022image, rao2022denseclip, wang2022cris} presents a paradigm shift in medical image segmentation.
Models like CLIP \citep{radford2021learning} and BiomedCLIP \citep{zhang2023large}, capable of joint text-image representation, allow for auxiliary information incorporation through language prompts during segmentation. 
This approach can enhance interpretability and robustness against domain shift and out-of-distribution data.

% While transfer learning from natural to medical images for image-only representation learning has been extensively explored \citep{ghafoorian2017transfer, cheplygina2019not, amin2019new}, such studies for joint vision-language representation remain unexplored. 
% \citet{qin2022medical} demonstrated promising zero-shot results for object detection.
While transfer learning from natural to medical images for image-only representation learning has been extensively explored \citep{ghafoorian2017transfer, cheplygina2019not, amin2019new}, only a few such studies have been done for joint vision-language representation \citep{qin2022medical}.
Yet, two critical questions persist: (\textbf{i}) the generalizability of this approach across multiple VLSMs for segmentation tasks, and (\textbf{ii}) the nuanced role of language prompts vs. images during finetuning and the VLSMs' capacity to handle pooled dataset training and out-of-distribution data.

This work presents the first systematic study on VLSM transfer learning to medical images, using four models based on the two most popular contrastive VLMs: CLIP, pretrained on natural image-text pairs, and BiomedCLIP, pretrained in the medical domain.
%This paper includes the segmentation results of VLSMs obtained by attaching decoders -- pretrained and finetuned -- to these two VLMs (\sectionref{sec:clip_for_medical_image_segmentation}).
% In addition to CLIPSeg \citep{luddecke2022image} and CRIS \citep{wang2022cris}) (the two VLSMs proposed in the literature with different architecture designs and trained on a large dataset of natural image segmentation data to learn pixel-token level representation), we introduce additional VLSMs, including BiomedCLIPSeg-D (with a pretrained CLIPSeg decoder added to BiomedCLIP) and BiomedCLIPSeg (with the decoder weights initialized randomly).

Key contributions include meticulous dataset selection ($11$ datasets) across four 2D medical image modalities, diverse anatomical structures, and pathology.
We also enrich existing datasets with diverse language prompts generated through automated methods utilizing image metadata, VQA models, and segmentation masks.
Our extensive experiments with four VLSMs, diverse datasets, and carefully designed prompts explore intricate relationships between language and image during joint representation adaptation for medical images.
We evaluate robustness against domain shift and the ability to handle pooled datasets with diverse modalities, attributes, and targets.
Finally, we open-source our framework, source code, and prompts, promoting transparency and reproducibility in the scientific community.

% \section{Image Segmentation using Foundation Vision Language Models}

% \subsection{Vision Language Pretraining and Foundation Models}
% \label{sec:vlp_and_foundation_models}

% Recent foundation VLMs like CLIP jointly train a transformer-based text encoder and an image encoder on large-scale text-image pairs.
% CLIP employs a contrastive loss to maximize the similarity of correct pairs' embeddings while minimizing incorrect pairings.
% FLAVA \citep{singh2022flava} introduces an additional multimodal encoder and employs various loss functions for different multimodal tasks and standalone vision and language tasks.
% Other similar VLMs address challenges such as the need for a large number of image-language pairs \citep{li2021supervision}, noisy pairs \citep{jia2021scaling, li2022blip}, and computational complexity \citep{zhai2022lit}.

% Foundation VLMs in medical imaging typically follow two approaches: (\textbf{1}) finetuning a pretrained VLM with medical image-text pairs \citep{seibold2022breaking, eslami2023pubmedclip}, or (\textbf{2}) pretraining the VLM from scratch with medical image-text pairs \citep{wang2022medclip, zhang2022contrastive, wu2023medklip, zhang2023large}.
% While these models are versatile, their global embedding approach, aligning the entire image with the input text, may not be optimal for dense prediction tasks like segmentation.

% \subsection{Vision Language Segmentation Models (VLSMs) with Pixel-Token alignment}
% \label{sec:vlsms_with_pixel_token_alignment}

% For segmentation tasks, explicitly aligning images and text descriptions is crucial.
% State-of-the-art VLSMs extend CLIP to segmentation by incorporating a decoder trained to generate segmentation maps from CLIP's vision and language embedding.
% DenseCLIP \citep{rao2022denseclip} introduces vision language decoders on CLIP encoders, using pixel-text score maps for limited class prompts.
% CLIPSeg and CRIS enforce zero-shot segmentations by providing pixel-level activations for text or image prompts. ZegCLIP \citep{zhou2023zegclip} introduces tune prompts and associates image information with text encodings before patch-text contrasting, reducing overfitting to seen classes.
% While specific architectures exist for joint embeddings in particular datasets, such as TGANet \citep{tomar2022tganet} for endoscopy images of polyps, there is a lack of well-studied VLSMs tailored for medical images.

\section{Method}
\label{sec:method}

\subsection{CLIP- and BiomedCLIP-based Medical VLSMs}
\label{sec:clip_for_medical_image_segmentation}

We create four medical VLSMs using CLIP and BiomedCLIP: (\textbf{i}) Finetuning CLIP-based VLSMs, \textbf{CLIPSeg} \citep{luddecke2022image} and \textbf{CRIS\footnote{We used unofficial weights from \href{https://github.com/DerrickWang005/CRIS.pytorch/issues/3}{a GitHub issue} since the authors haven't released the model weights yet.}} \citep{wang2022cris}, pretrained on natural image-text pairs, and (\textbf{ii}) Building two new VLSMs for the medical domain by adding a decoder to BiomedCLIP, pretrained on medical image-text pairs.
The proposed new models are \textbf{BiomedCLIPSeg-D} (with a pretrained CLIPSeg decoder) and \textbf{BiomedCLIPSeg} (with a randomly initialized CLIPSeg decoder).
A sample from the datasets in our experiments is a triplet of a medical image, a segmentation mask, and a text prompt.
%BiomedCLIP is chosen for its diverse medical imaging modalities and CLIP architecture.
\figureref{fig:architecture} displays the overall VLSM architecture.

\begin{figure}[t]
     % Caption and label go in the first argument, and the figure contents
     % go in the second argument
    \floatconts
      {fig:architecture}
      {\caption{
            CRIS and CLIPSeg-variants include Text and Image encoders, an Aggregator, and a Vision-Language Decoder.
            %The Aggregator generates representations for the Vision-Language Decoder.
      }}
      {\includegraphics[width=0.8\linewidth]{architecture}}
\end{figure}


% where pretrained text and image encoder embeddings (from CLIP or BiomedCLIP) are fed to the Vision-Language Decoder for binary segmentation masks.


% CLIPSeg and CRIS\footnote{We used \href{https://github.com/DerrickWang005/CRIS.pytorch/issues/3}{unofficial weights} since the authors haven't released the model weights yet.} were trained on PhraseCut \citep{wu2020phrasecut} and RefCOCO \citep{kazemzadeh2014referitgame} with $340,000$ and $142,210$ text-image pairs, respectively.
CLIPSeg accommodates both CNN and ViT \citep{dosovitskiy2020image} backbones, whereas CRIS only supports a CNN-based CLIP backbone.
BiomedCLIPSeg-based models include transformer-based backbones for both the encoders.
We study CLIPSeg and CRIS in both the zero-shot and finetuned settings, but the BiomedCLIPSeg-based models only in the finetuned setting, as they lack an end-to-end pretrained encoder-decoder.

\subsection{Datasets}
\label{sec:datasets}

We collected $11$ 2D medical imaging datasets of diverse modalities, organs, and pathologies covering both radiology and non-radiology images for binary and multi-class segmentation tasks (see \tableref{tab:dataset_info}).
All the datasets are used for finetuning separately or combined (as a single pooled dataset) except the last three endoscopy datasets (ETIS, ColonDB, and CVC300), which are used only as the test split to study domain shift robustness.

\begin{table}[h]
    \floatconts
      {tab:dataset_info}%
      {\caption{Datasets overview for single and multi-class segmentation tasks.}}%
      {\resizebox{0.95\linewidth}{!}{%
        \begin{tabular}{llllp{0.61\linewidth}l}
            \textbf{Category} & \textbf{Modality} & \textbf{Organ} & \textbf{Name} & \textbf{Foreground Class(es)} & \textbf{\# train/val/test} \\
            \hline
            \multirow{8}{*}{Non-Radiology} & \multirow{6}{*}{Endoscopy} & \multirow{6}{*}{Colon} & Kvasir-SEG & \multirow{6}{*}{Polyp} & 800/100/100 \\
             &  &  & ClinicDB &  & 490/61/61 \\ 
             &  &  & BKAI &  & 800/100/100 \\
             &  &  & ETIS &  & 0/0/196 \\
             &  &  & ColonDB &  & 0/0/380 \\
             &  &  & CVC300 &  & 0/0/60 \\
             \cline{2-6}
             & \multirow{2}{*}{Photography} & Skin & ISIC 2016 & Skin Lesion & 810/90/379 \\
             &  & Foot & DFU 2022 & Foot Ulcer & 1600/200/200 \\
             \hline
            \multirow{3}{*}{Radiology} & \multirow{2}{*}{Ultrasound} & Heart & CAMUS &  Myocardium, Left ventricular, and Left atrium cavity & 4800/600/600 \\
             &  & Breast & BUSI & Benign and Malignant Tumors & 624/78/78 \\
             \cline{2-6}
             & X-Ray & Chest & CheXlocalize & Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Lung Lesion, Lung Opacity, Pleural Effusion, Pneumothorax, and Support Devices & 1279/446/452
        \end{tabular}%
        }
    }
\end{table}
\subsection{Generating Language Prompts}
\label{sec:prompt_engineering}

Although language prompts enable injecting rich information into VLSMs, manually crafting individual image-specific prompts becomes impractical for large-scale evaluations.
Thus, we implement an automated prompt generation system for extensive assessments of medical VLSMs.
This involves incorporating semantic concepts such as size, position, color, and specific medical attributes like gender, age, and pathology.

In addition to automated prompts, we introduce manual prompts that provide general class-level information applicable to all samples within a given dataset.
% This dual approach leverages diverse sources, including VQA, medical image metadata, and automated image processing.
The generated language prompts cover the following attributes. (\textbf{i}) Inspired by \citet{tomar2022tganet}, we derive \textit{number}, \textit{size}, and \textit{relative location} through image processing on segmentation masks.
(\textbf{ii}) Motivated by \citet{qin2022medical}, we use \textit{shape} and \textit{color} information from VQA queries.
% In radiology datasets, only \textit{shape} is considered, given the irrelevance of \textit{color} in grayscale radiology images.
(\textbf{iii}) \textit{General class information}, extracted for photographic images from online medical journals, provides overarching details applicable across different datasets.
Notably, \citet{qin2022medical} used PubMedBERT \citep{gu2021domain} for this purpose; however, our experiments revealed its unreliability, leading us to manually gather this information from online medical journals (see \tableref{tab:PubMedBERT_analysis}).
(\textbf{iv}) Attributes like \textit{age}, \textit{gender of patients}, \textit{image quality}, \textit{cardiac cycle}, and \textit{tumor type} are extracted whenever available, contributing valuable context to the language prompts.
There are $14$ such attributes (\textbf{a1} to \textbf{a14}), which we combined in various ways to build nine distinct prompt types (\textbf{P1} to \textbf{P9}) for each dataset (\tableref{tab:dataset-prompts-combination}; \appendixref{sec:appendix-prompts}).
Each prompt type caters to specific attribute combinations, prioritizing the class name as the foundational attribute and enhancing the versatility of the generated prompts.
%This meticulous approach ensures broad coverage of attributes in prompts, aligning with the diverse nature of medical images.
% The systematic ordering of attributes in the prompts, prioritizing the class name as the foundational attribute, contributes to the effective communication of target structures and aids in comprehensive evaluations of VLSMs.

\subsection{Implementation Details}
\label{sec:implementation_details}

We finetuned VLSMs with minimal hyperparameter changes from the original pretraining settings.
We used the AdamW \citep{loshchilov2017decoupled} optimizer with a weight decay of $10^{-3}$ and initial learning rates of $2\times10^{-3}$ (CLIPSeg) and $2\times10^{-5}$ (CRIS).
The training loss combined Dice loss with Binary Cross Entropy loss, the latter scaled by $0.2$.
The learning rate was reduced by a factor of $10$ if the validation loss did not decrease for $5$ consecutive epochs.
Batch sizes of $128$ and $32$ were used for CLIPSeg and CRIS, respectively, due to the difference in model sizes\footnote{Further details are in \appendixref{sup:sec:experiments}.}.
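
As a minimal sketch of the training objective described above, assuming a soft Dice formulation and that the factor $0.2$ multiplies only the Binary Cross Entropy term, the loss can be written as follows (the symbols $p_i$, $g_i$, $N$, and $\epsilon$ are introduced here for illustration and are not taken from the released code):
\[
  \mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + 0.2\,\mathcal{L}_{\mathrm{BCE}}, \qquad
  \mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} p_i g_i + \epsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i + \epsilon},
\]
\[
  \mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, g_i \log p_i + (1 - g_i)\log(1 - p_i) \right],
\]
where $p_i \in [0,1]$ is the predicted foreground probability at pixel $i$, $g_i \in \{0,1\}$ is the corresponding ground-truth label, $N$ is the number of pixels, and $\epsilon$ is a small smoothing constant (an assumed detail) that keeps the Dice term defined for empty masks.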

\section{Results}
\label{sec:results}

% This section first presents experimental results in zero-shot settings (for CRIS and CLIPSeg) and finetuned settings (for all four VLSMs) using a maximum of nine prompts on all the datasets.
% This is followed by a more subtle look into how well the VLSMs capture concepts represented by different attributes and the influence on segmentation output when wrong information is provided.
% Finally, the robustness of the VLSMs in handling diverse datasets and comparison against standard segmentation models are reported.

\paragraph{VLSMs adapt better to non-radiology images in Zero-Shot Setting (ZSS).}
%\label{sec:vlsms_adapter_better_to_non_radiology_images_in_zss}
Both CRIS and CLIPSeg barely work in the ZSS for radiology images, except for CRIS on the BUSI dataset, but achieve Dice scores in the range of $20\%$--$70\%$ on non-radiology datasets, with $67.98\%$ being the highest Dice score for ISIC (\figureref{fig:combined_data_plots}).
Adding more attributes to the prompt generally improves performance, but the gain is inconsistent across prompts and datasets.

\begin{figure}[t]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
      {fig:combined_data_plots}
      {\caption{
        Zero-shot and finetuning performance of the CRIS, CLIPSeg, BiomedCLIPSeg, and BiomedCLIPSeg-D models on non-radiology (first two rows) and radiology datasets (last row).
        Finetuning using the prompts improves performance compared to the empty prompt, particularly in multi-class settings.
        %However, using the label name and adding additional prompts does not significantly affect the model performance.
        }}
       {%
            \subfigure[BKAI][b]{%
            \label{fig:bkai_dice}% label for this sub-figure
            \includegraphics[width=0.2\linewidth]{bkai_dice_BiomedCLIPSeg_BiomedCLIPSeg-D_CLIPSeg_CRIS}
            } \hfill % space out the images a bit
            \subfigure[ClinicDB][b]{%
            \label{fig:clinicdb_dice}% label for this sub-figure
            \includegraphics[width=0.2\linewidth]{clinic_db_dice_BiomedCLIPSeg_BiomedCLIPSeg-D_CLIPSeg_CRIS}
            }\hfill
            \subfigure[Kvasir-SEG][b]{%
             \label{fig:kvasir_dice}
             \includegraphics[width=0.2\linewidth]{kvasir_dice_BiomedCLIPSeg_BiomedCLIPSeg-D_CLIPSeg_CRIS}
            } \hfill
            \subfigure[CVCColonDB][b]{%
             \label{fig:cvc_colon_dice}
             \includegraphics[width=0.2\linewidth]{cvc_colon_dice_CLIPSeg_CRIS}
            } \subfigure[CVC300][b]{%
             \label{fig:cvc300_dice}
             \includegraphics[width=0.2\linewidth]{cvc300_dice_CLIPSeg_CRIS}
            } \hfill
            \subfigure[ETIS][b]{%
             \label{fig:etis_dice}
             \includegraphics[width=0.2\linewidth]{etis_dice_CLIPSeg_CRIS}
            } \hfill
            \subfigure[DFU 2022][b]{%
             \label{fig:dfu_dice}
             \includegraphics[width=0.2\linewidth]{dfu_dice_BiomedCLIPSeg_BiomedCLIPSeg-D_CLIPSeg_CRIS}
            } \hfill
            \subfigure[ISIC 2016][b]{%
             \label{fig:isic_dice}
             \includegraphics[width=0.2\linewidth]{isic_dice_BiomedCLIPSeg_BiomedCLIPSeg-D_CLIPSeg_CRIS}
            } \subfigure[CAMUS][b]{%
             \label{fig:camus_dice}
             \includegraphics[width=0.2\linewidth]{camus_testing_dice_BiomedCLIPSeg_BiomedCLIPSeg-D_CLIPSeg_CRIS}
            }\hspace{1.5mm}%
            \subfigure[BUSI][b]{%
            \label{fig:busi_dice}
            \includegraphics[width=0.2\linewidth]{busi_dice_BiomedCLIPSeg_BiomedCLIPSeg-D_CLIPSeg_CRIS}
            }\hspace{1.5mm}%
            \subfigure[CheXlocalize][b]{%
             \label{fig:chexlocalize_dice}
             \includegraphics[width=0.2\linewidth]{chexlocalize_dice_BiomedCLIPSeg_BiomedCLIPSeg-D_CLIPSeg_CRIS}
             } \hfill
            \subfigure[Legend][b]{%
             \raisebox{0.1\height}{\includegraphics[width=0.145\linewidth]{legend}}
            }
     }
\end{figure}

% \paragraph{Image-specific-attributes or general descriptions?}
% In the zero-shot setting, CRIS has better performance in almost all endoscopy datasets when the prompt contains multiple image-specific attributes (\textit{size}, \textit{number}, and \textit{location} with the \textit{class name}; \textbf{P4}, \textbf{P5}, and \textbf{P6}; see \figureref{fig:combined_data_plots}).
% However, the non-image-specific attributes together with \textit{class name} degrades this performance (\textbf{P7}, \textbf{P8}, \textbf{P9}).
% Interestingly, prompts \textbf{P8} and \textbf{P9} achieve the highest performance for the DFU 2022 dataset, possibly due to the pretrained models' greater familiarity with general descriptions of feet and skin compared to colon and endoscopy.
% % Conversely, using just the \textit{class name} (\textbf{P1}) or adding a non-image-specific general description (\textbf{P7}) seems to perform better than using multiple image-specific attributes in CLIPSeg.
% This shows that pretraining data and VLSM architecture have a complex relationship with the target medical segmentation task.

\paragraph{Image-specific-attributes or general descriptions?}
In the ZSS, CRIS performs better on endoscopy datasets when prompts contain image-specific attributes (\textit{size}, \textit{number}, and \textit{location}; \textbf{P4}, \textbf{P5}, and \textbf{P6}; \figureref{fig:combined_data_plots}), but its performance degrades when non-image-specific attributes are added (\textbf{P7}, \textbf{P8}, \textbf{P9}).
Interestingly, prompts with general descriptions (\textbf{P8} and \textbf{P9}) achieve the highest performance on the DFU 2022 dataset, possibly due to the pretrained models' greater familiarity with feet and skin than with the colon.
This highlights the complex relationship between pretraining data, VLSM architecture, and the medical segmentation task.

\paragraph{Making prompts richer does not always help during finetuning.}
\label{sec:making_prompts_richer_does_not_always_help}

\figureref{fig:combined_data_plots} shows that the variation in Dice score (DSC) across prompt types is minimal in the finetuned setting for all the models.
A prompt with only the \textit{class name} (\textbf{P1}) improves segmentation performance over the empty prompt on radiology datasets for all four VLSMs.
While CRIS's performance almost saturates after adding the \textit{class name} and \textit{mask shape} (\textbf{P2}), the rest of the models have similar performance for all the prompts except \textbf{P0} with multi-class segmentation (CAMUS and CheXlocalize).

BiomedCLIPSeg and BiomedCLIPSeg-D, despite being based on a VLM pretrained on medical data, consistently perform poorly across all prompts compared to CRIS and CLIPSeg.
This is likely because they have not been further pretrained for segmentation tasks on a large-scale dataset.
Subsequent experiments use the better-performing CLIPSeg and CRIS to study the impact of individual attributes and the robustness of VLSMs\footnote{We also trained both models with their encoders frozen; the results are shown in \appendixref{sec:supp_ablation}.}.

\paragraph{When finetuned, CRIS captures some language semantics better than CLIPSeg.}

\begin{figure}[t]
     % Caption and label go in the first argument, and the figure contents
     % go in the second argument
    \floatconts
      {fig:changed_attr_plot}
      {\caption{Relative change (in percent) in Dice score when attribute values in prompt \textit{P6} are replaced by a random uncommon English word (left of vertical lines) or by a semantically opposite value, such as replacing `large' with `small' (right of vertical lines).}}
      {%
        \subfigure[CRIS][h]{%
        \includegraphics[width=0.42\textwidth]{cris_change_attr}
        }\hfill%
        \subfigure[CLIPSeg][h]{%
            \includegraphics[width=0.42\textwidth]{clipseg_change_attr}
        }
      }
\end{figure}

% To check how well the VLSMs captured the semantics of the prompt, we replace attribute values in input prompts during inference with:
% (\textbf{i}) a random and uncommon English word, and (\textbf{ii}) semantically wrong or opposite value sampled from the set of possible values of the same attribute (e.g., ``small" by ``large" for the \textit{size} attribute).
% The former aims to see how strongly the semantics of words familiar to the model drive output, and the latter allows witnessing the importance of the attribute's presence vs. unfamiliar words' presence.
% \figureref{fig:changed_attr_plot} shows results on five datasets with five altered attributes, where CLIPSeg has almost no change in results.
% To further confirm that there is virtually no role of different prompts in CLIPSeg, we found that when testing the CLIPSeg model trained on prompt \textbf{P6} by sending only the \textit{class name} (\textbf{P1}), the performance was almost the same compared to using \textbf{P6} for all non-radiology datasets.
We replaced attribute values of the input prompts during inference with random uncommon English words and semantically wrong or opposite values to assess whether VLSMs leverage the language semantics.
%This helps evaluate the influence of familiar words versus attribute presence. 
\figureref{fig:changed_attr_plot} shows that altering attributes minimally impacts CLIPSeg's performance but notably deteriorates CRIS's.
To further investigate CLIPSeg's indifference to attribute values, we provided only the \textit{class name} (\textbf{P1}) as input during inference to the model trained on the rich prompt \textbf{P6}; the results were very similar to those obtained with the rich prompt, reinforcing the minimal impact of attributes in CLIPSeg.

CRIS's performance decreases notably for attributes like size and location.
The decline is larger when semantically opposite values are provided than when random uncommon English words are used, suggesting that CRIS has learned the semantics of these attributes.
A qualitative examination of predicted segmentation masks confirms this trend (\figureref{fig:changed_attr}).
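
For reference, one plausible way to compute the relative change plotted in \figureref{fig:changed_attr_plot}, assuming it is measured against the Dice score obtained with the unaltered prompt (the symbols below are illustrative and are not taken from the original experiments), is
\[
  \Delta_{\mathrm{rel}} = 100 \times \frac{\mathrm{DSC}_{\mathrm{altered}} - \mathrm{DSC}_{\mathrm{original}}}{\mathrm{DSC}_{\mathrm{original}}},
\]
where $\mathrm{DSC}_{\mathrm{original}}$ and $\mathrm{DSC}_{\mathrm{altered}}$ denote the Dice scores with the correct and the replaced attribute value, respectively. Under this convention, a negative value means that segmentation quality degrades when the attribute value is replaced.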
% \textcolor{red}{TODO: footnote if referring to appendix} 
% \appendixref{sec:supp_vis} displays examples of images with the highest drops in dice score for two datasets when values for sensitive attributes are replaced.

% In \figureref{fig:changed_attr_plot}, CRIS's performance drops considerably, with the most significant drop for \textit{size} and \textit{location}.
% The decline is much more pronounced when giving semantically opposite values compared to random uncommon English words, suggesting that it learned the semantics very well.
% This is further verified when we look qualitatively into the predicted segmentation masks of CRIS with correct vs. incorrect prompts.
% \figureref{fig:changed_attr} shows examples of the images having the highest drops in DSC for two datasets when replacing values for the most sensitive attributes\footnote{More examples provided in \appendixref{sec:supp_vis}.}.

% \footnote{More such visualizations in \ref{sec:supp_vis}}}

\begin{figure}[h]
     % Caption and label go in the first argument, and the figure contents
     % go in the second argument
    \floatconts
      {fig:changed_attr}
      {\caption{Examples of images with the highest drops in Dice score for two datasets when the values of sensitive attributes in \textit{P6} are replaced with another value from the set of values that the attribute takes in the dataset.}}
      {%
        \subfigure[Kvasir-SEG for attribute \textit{location}]{%
             \label{fig:kvasir_swapped_dice}
             \includegraphics[width=0.7\linewidth]{kvasir_polyp_swapped_pos_2}
         }
         \subfigure[ClinicDB for attribute \textit{size}]{%
             \label{fig:clinicdb_swapped_dice}
             \includegraphics[width=0.7\linewidth]{clinicdb_polyp_swapped_size_2}
         }
     }
\end{figure}

\paragraph{Finetuned VLSMs comparable to SOTA segmentation models.} 

\begin{table}[h]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
    {tab:finetuning_combined}
    {\caption{Dice scores (\%) of VLSMs and CNN models when finetuned on different combinations of datasets.
    For each column, \textbf{Bold} and \underline{\textbf{Bold with underline}} represent the best result among all models for the specific dataset combination and all combinations, respectively.
   }}
    {\tiny \resizebox{\linewidth}{!}{%
    \begin{tabular}{llp{0.09\linewidth}p{0.14\linewidth}p{0.09\linewidth}cccp{0.09\linewidth}p{0.08\linewidth}p{0.08\linewidth}p{0.09\linewidth}c}
         \multicolumn{2}{r}{\textbf{Tested Dataset} $\rightarrow$} & \boldcenterrotatebox{Kvasir-SEG} & \boldcenterrotatebox{ClinicDB} & \boldcenterrotatebox{BKAI} & \boldcenterrotatebox{CVC-300} & \boldcenterrotatebox{CVC-ColonDB} & \boldcenterrotatebox{ETIS} & \boldcenterrotatebox{ISIC} & \boldcenterrotatebox{DFU} & \boldcenterrotatebox{CAMUS} & \boldcenterrotatebox{BUSI} & \boldcenterrotatebox{CheXlocalize} \\
         \multicolumn{1}{p{0.08\linewidth}}{\textbf{Finetuned Dataset}} & \multicolumn{1}{c}{\textbf{Model}} \\
         \hline
         \multirow{5}{*}{\textbf{Individual}} & CRIS & $\underline{\mathbf{91.39}}$ & $\mathbf{91.69}$ & $\underline{\mathbf{92.40}}$ & - & - & - & $91.94$ & $\underline{\mathbf{76.13}}$ & $\underline{\mathbf{91.09}}$ & $69.31$ & $\underline{\mathbf{62.57}}$ \\
         & CLIPSeg & $89.51$ & $88.74$ & $86.47$ & - & - & - & $\underline{\mathbf{92.12}}$ & $73.24$ & $88.85$ & $64.32$ & $59.56$ \\
         \cline{2-13}
         & UNet & $84.77$ & $85.65$ & $83.79$ & - & - & - & $90.40$ & $67.87$ & $90.19$ & $\textbf{75.21}$ & $50.29$ \\
         & UNet++ & $84.70$ & $84.16$ & $84.61$ & - & - & - & $90.12$ & $69.95$ & $89.95$ & $72.55$ & $49.53$ \\
         & DeepLabv3+ & $84.11$ & $89.11$ & $84.95$ & - & - & - & $90.66$ & $67.89$ & $90.43$ & $70.57$ & $49.95$ \\
         \cline{2-13}
         & \textit{SOTA*} & $95.02$ & $95.73$ & $90.23$ & - & - & - & $92.00$ & $72.87$ & $94.10$ & $89.80$ & - \\
         \hline
         \multirow{5}{*}{\textbf{Pooled}} & CRIS & $\mathbf{90.23}$ & $\mathbf{91.88}$ & $\mathbf{90.21}$ & $\mathbf{88.99}$ & $\mathbf{78.07}$ & $\mathbf{75.93}$ & $\mathbf{91.99}$ & $\mathbf{75.55}$ & $\mathbf{91.00}$ & $67.89$ & $\mathbf{61.01}$ \\
         & CLIPSeg & $87.25$ & $87.49$ & $87.30$ & $87.24$ & $71.32$ & $69.64$ & $91.34$ & $71.94$ & $88.76$ & $66.02$ & ${56.60}$ \\
         \cline{2-13}
         & UNet & $36.60$ & $26.10$ & $37.70$ & $4.94$ & $8.55$ & $12.00$ & $64.90$ & $38.60$ & $76.82$ & $44.60$ & $38.00$\\
         & UNet++ & $80.52$ & $78.21$ & $77.87$ & $87.80$ & $51.92$ & $48.16$ & $88.41$ & $65.78$ & $89.99$ & $75.59$ & $53.88$ \\
         & DeepLabv3+ & $82.40$ & $82.70$ & $77.60$ & $84.40$ & $59.30$ & $48.30$ & $89.60$ & $67.70$ & $90.17$ & $\underline{\textbf{77.80}}$ & $54.56$ \\
         \hline
         \multirow{5}{0.08\linewidth}{\textbf{Endoscopy Pooled}} & CRIS & $\mathbf{91.25}$ & $\underline{\mathbf{92.94}}$ & $\textbf{92.35}$ & $\underline{\textbf{90.42}}$ & $\underline{\mathbf{81.00}}$ & $\underline{\mathbf{79.67}}$ & - & - & - & - & - \\
         & CLIPSeg & $89.62$ & $88.96$ & $86.98$ & $88.98$ & $75.23$ & $71.18$ & - & - & - & - & - \\
         \cline{2-13}
         & UNet & $85.45$ & $88.17$ & $84.70$ & $90.27$ & $67.87$ & $61.84$ & - & - & - & - & - \\
         & UNet++ & $83.99$ & $85.44$ & $82.27$ & $89.40$ & $66.61$ & $55.62$ & - & - & - & - & - \\
         & DeepLabv3+ & $87.87$ & $87.60$ & $84.38$ & $87.54$ & $69.95$ & $65.24$ & - & - & - & - & - \\
         \hline
         \multicolumn{2}{r}{\textit{*SOTA Sources}} & \citet{dumitru2023using} & \citet{fitzgerald2023fcb} & \citet{tomar2022tganet} & - & - & - & \citet{hasan2022dermoexpert} & \citet{liao2022hardnet} & \citet{ling2022reaching} & \citet{zhang2023fully} & -
    \end{tabular}%
    }}
\end{table}

\tableref{tab:finetuning_combined} compares VLSMs with traditional CNN-based models \citep{ronneberger2015u, chen2018encoder, zhou2018unet++} on their ability to learn in two scenarios: when trained on (\textbf{i}) individual specialized datasets or (\textbf{ii}) a pooled dataset that combines diverse datasets into a single training set.
While the segmentation models (CNNs and VLSMs) perform better when trained on the pooled endoscopy datasets than on individual endoscopy datasets, performance mostly drops when they are trained on a pooled set comprising all the datasets.
VLSMs outperform image-only off-the-shelf CNN-based methods in most cases.
We also compare with the best method reported in the literature for each dataset.\footnote{To ensure a thorough comparison across datasets with diverse modalities, we report the SOTA for each dataset from the literature, in addition to implementing a few commonly used CNN baselines.}
The state-of-the-art results\footnote{Except for CAMUS and ISIC, the SOTA methods may use different training, validation, and test splits because standard splits are unavailable in the literature.} are better, although VLSMs remain competitive.

\paragraph{VLSMs adapt better to distribution shifts.}

To assess the ability of the segmentation models to transfer knowledge learned from one dataset to another similar one, we train the models on each large endoscopy dataset (Kvasir-SEG, ClinicDB, and BKAI) and evaluate them on all endoscopy datasets.
\tableref{tab:cross_dataset} shows that, on the endoscopy datasets, the best VLSM outperforms every conventional model in all cases.
VLSMs also show smaller performance drops than conventional models when the test set comes from a different distribution than the training set.
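As a worked illustration, one simple way to quantify such a drop is the difference between the in-distribution and out-of-distribution Dice scores of a model finetuned on a single dataset, $\Delta = D_{\mathrm{in}} - D_{\mathrm{out}}$ (this notation is introduced here only for illustration and is not a metric used elsewhere in the paper).
For models finetuned on Kvasir-SEG and tested on ClinicDB, the values in \tableref{tab:cross_dataset} give
\[
    \Delta_{\mathrm{CRIS}} = 91.39 - 82.99 = 8.40,
    \qquad
    \Delta_{\mathrm{UNet}} = 84.77 - 64.84 = 19.93,
\]
so for this pair of datasets, the drop of the conventional model is more than twice that of the VLSM.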

\begin{table}[t]
    \floatconts
    {tab:cross_dataset}
    {\caption{Segmentation performance (Dice (\%)) on out-of-distribution endoscopy datasets.
    For each column, \textbf{bold} marks the best result among the models for a given finetuning dataset, and \underline{\textbf{bold with underline}} marks the best result across all finetuning datasets.
    The \colorbox{gray!25}{shaded} cells are results on test sets from the same distribution as the finetuning set, while the rest are on out-of-distribution test sets.
   }}
    {\resizebox{0.75\linewidth}{!}{%
    \begin{tabular}{l|l|cccccc}
         \textbf{Tested on $\rightarrow$} &  & \textbf{Kvasir-SEG} & \textbf{ClinicDB} & \textbf{BKAI} & \textbf{CVC-300} & \textbf{CVC-ColonDB} & \textbf{ETIS} \\
         \textbf{Finetuned on $\downarrow$} & \textbf{Model $\downarrow$} \\
         \hline
         \multirow{5}{*}{\textbf{Kvasir-SEG}} & CRIS & \cellcolor{gray!25}$\underline{\mathbf{91.39}}$ & $\mathbf{82.99}$ & $\mathbf{83.26}$ & $86.15$ & $\underline{\textbf{76.87}}$ & $\textbf{62.99}$ \\
         & CLIPSeg & \cellcolor{gray!25}$89.51$ & $80.21$ & $77.89$ & $\mathbf{86.49}$ & ${70.46}$ & ${62.83}$ \\
         \cline{2-8}
         & UNet & \cellcolor{gray!25}$84.77$ & $64.84$  &  $66.22$ & $77.16$ & $50.81$ & $34.98$ \\
         & UNet++ & \cellcolor{gray!25}$84.70$ & $68.15$ & $61.76$  & $79.35$ & $52.3$  & $32.81$  \\
         & DeepLabv3+ & \cellcolor{gray!25}$84.11$ & $68.0$ & $63.57$  & $76.93$ & $58.41$ & $33.81$ \\
         \hline
         \multirow{5}{*}{\textbf{ClinicDB}} & CRIS & $82.66$ & \cellcolor{gray!25}$\underline{\mathbf{91.69}}$ & $\mathbf{76.21}$ & $\underline{\textbf{87.47}}$ & $\textbf{76.14}$ & $\textbf{64.62}$ \\
         & CLIPSeg & $\mathbf{84.02}$ & \cellcolor{gray!25}$88.74$ & ${72.04}$ & ${87.07}$ & ${67.91}$ & $60.09$ \\
         \cline{2-8}
         & UNet & $65.80$ & \cellcolor{gray!25}$85.65$ & $35.26$ & $73.91$ & $55.01$ & $29.66$ \\
         & UNet++ & $61.93$ & \cellcolor{gray!25}$84.16$ & $38.81$ & $71.15$ & $55.05$ & $23.16$ \\
         & DeepLabv3+ & $66.63$ & \cellcolor{gray!25}$89.11$ & $40.89$ & $82.05$ & $61.79$ & $39.53$ \\
         \hline
         \multirow{5}{*}{\textbf{BKAI}} & CRIS & $\textbf{83.74}$ & $\textbf{78.18}$ & \cellcolor{gray!25}$\underline{\mathbf{92.40}}$ & ${79.48}$ & $\mathbf{65.30}$ & $66.72$\\
         & CLIPSeg & ${83.70}$ & ${76.07}$ & \cellcolor{gray!25}$86.47$ & $\mathbf{86.06}$ & ${63.59}$ & $\underline{\mathbf{66.97}}$ \\
         \cline{2-8}
         & UNet & $68.42$ & $62.20$ & \cellcolor{gray!25}$83.79$ & $60.13$  & $44.52$ & $42.91$\\
         & UNet++ & $70.64$ & $62.66$ & \cellcolor{gray!25}$84.61$ & $82.44$ & $55.60$ & $46.84$ \\
         & DeepLabv3+ & $69.02$ & $61.99$ & \cellcolor{gray!25}$84.95$ & $77.47$ & $53.15$ & $49.61$ \\
         \hline
    \end{tabular}%
    }}
\end{table}

\section{Discussion, Limitations, and Conclusion}
\label{sec:discussion_and_conclusion}


VLSMs pretrained on natural images show zero-shot accuracy on medical images that is too low for practical use, but they provide a foundation for joint text-image representation.
Our study provides insights into prompt design, the roles of individual attributes, and model performance when finetuning across diverse datasets.
Zero-shot segmentation performance was higher on all non-radiology datasets than on the radiology datasets.
This could be because non-radiology imaging modalities are closer to open-domain images, and because the models may have become familiar with anatomy such as skin and feet (relevant to the ISIC and DFU datasets) during pretraining.
The best-performing prompts vary across datasets but often include attributes that the models encountered during pretraining.
For instance, CRIS, which was trained on RefCOCO \citep{kazemzadeh2014referitgame} for referring image segmentation, captures size, location, and number well.
 
%The optimum translation of the pretrained model's abilities to leverage such attributes from zero-shot to finetuning depends on factors like quality and consistency of attributes in prompts, the pretrained encoder's knowledge, and, very significantly, image saliency and diversity corresponding to those attributes. 
%CLIPSeg seems less adept at capturing fine-grained textual information than CRIS, indicating the predominant utilization of image encoder representations during finetuning. 
CRIS may leverage language semantics better than CLIPSeg because (\textbf{i}) its architecture intervenes at the token level, whereas CLIPSeg uses a sentence-level embedding, and (\textbf{ii}) it was trained end-to-end as a VLSM, whereas CLIPSeg was trained for the segmentation task with a frozen CLIP encoder.
Interestingly, models based on CLIP (pretrained on $400$ million natural-domain image-text pairs) perform better than those based on BiomedCLIP (pretrained on $15$ million medical-domain pairs), which suggests that large-scale pretraining data offers a benefit that is hard to achieve with smaller-scale domain-specific data.
%This suggests CRIS and CLIPSeg's familiarity with segmentation, albeit in the natural domain, was more valuable than domain knowledge in BiomedCLIP.

% A limited number of VLSMs are trained on large-scale image segmentation data with language prompts.
% When adapting it to our case, some could not be covered due to a lack of source code, pretrained weights, or reproducibility issues.
% For example, ZegCLIP gave constant zero scores in zero-shot settings, was trained without background class, and had many channels that were not amenable to our setup.
% SAM does not support text prompts\footnote{Though the paper mentions that text prompts can be added to SAM, its open-sourced implementation does not support text prompts, and there are no pretrained models for text prompts in SAM at the time of this study. Refer to this GitHub issue: \url{https://github.com/facebookresearch/segment-anything/issues/93}}.
% Nevertheless, as discussed in \sectionref{sec:method}, the four VLSMs we considered cover fascinating diversity -- architectural variation in leveraging global level and token level information in prompts, trained end-to-end for referring image segmentation vs. finetuned only decoder during with segmentation data, based on VLM pretrained on natural vs. medical domain, etc.



% Our study focused on 2D medical images and did not include common 3D imaging modalities like MRI or CT scans.
% While the results seen in ultrasound and X-ray are likely to extend to 2D slices extracted from MRI or CT scans, it is not clear how the results or models extend to 3D volumes.
% Investigating the adaptation of VLSMs to 3D medical images could open up exciting avenues for future research.

Our study aims to build insights into how well VLSMs leverage textual information and perform transfer learning in the medical domain.
It proposes pragmatic prompt settings and systematic experiments rather than implementing an exhaustive list of VLSMs and comparing their performance only coarsely.
The four CLIP-based VLSMs cover significant variations: architectures that capture global versus token-level information in prompts, end-to-end training for referring image segmentation versus finetuning only the decoder for segmentation, and VLMs pretrained on the natural versus the medical domain.
We focus only on 2D medical images and exclude 3D modalities such as MRI and CT scans because most existing VLSMs are suitable only for 2D images, and building 3D VLSMs requires further research.
% Some VLSMs we considered for large-scale image segmentation with language prompts were unsuitable due to a lack of source code, pretrained weights, or reproducibility problems. 
% For instance, DenseCLIP produced poor predictions in zero-shot settings. 
% While mentioned to support text prompts, SAM \citep{kirillov2023segment} lacked implementation and pretrained models for text prompts. 
% Despite these limitations, the four VLSMs we focused on offer diverse architectural variations and training approaches. 
% However, initial observations of the VQA model's output suggest it may not always be reliable. 

% While the results seen in ultrasound and X-ray are likely to extend to 2D slices extracted from MRI or CT scans, it is not clear how the results or models extend to 3D volumes.
%Investigating the adaptation of Vision-Language foundation models and VLSMs to 3D could open up exciting avenues for future research.

% Also, quick glimpses of the VQA model's outputs suggest they are unreliable.
% However, we could not manually conduct a comprehensive quality check to assess the impact of good versus bad prompts generated by VQA.
% Enhancing VQA and Masked Language Models to generate highly reliable automated prompts from specific medical images could enable a more scalable analysis of the kind presented in \figureref{fig:changed_attr_plot} at a large scale.

% VLSMs with language prompts for segmentation show immense potential, yet major works remain for practical realization.
% Compared to CLIPSeg's image feature reliance, CRIS's semantic utilization in finetuning underscores the importance of joint image-text representation focusing on pixel-token alignment.
% BiomedCLIPSeg's underperformance suggests the superiority of further pretraining on large-scale natural image-text pairs over training on moderately sized medical datasets.

% Future research should explore generating large-scale medical image-mask-text triplets, efficiently teaching network concepts specific to medical images.
% A well-versed VLSM foundation, learning joint representation across different problems and datasets, holds promise for addressing challenges in applying deep learning to clinical applications.
% The study provides a benchmarking framework, datasets with prompts, and insights for future investigation.
While the VLSMs' performance seems on par with image-only architectures and some VLSMs use the information injected via text prompts, our results show that novel approaches are needed to better leverage the rich information provided via prompts.
Moreover, future work could explore how these prompts can help build models that are more explainable and more robust to out-of-distribution data.
%to in the language Although VLSMs with language prompts show great potential for robust segmentation, substantial work remains before the ML community can fully leverage this potential. 
%CRIS's ability to utilize semantics in a finetuned model compared to CLIPSeg's dominant reliance on image features shows that careful designs to learn image-text representation focusing on pixel-token alignment jointly can provide better segmentation models.
%A critical line of research seems to be finding ways to generate (potentially synthetic but realistic) large-scale medical image-mask-text triplets and finding efficient ways to teach the network concepts represented by language that help identify target structure semantics in medical images. 
%A solid foundation VLSM is well versed with concepts specific to medical images and classes, and learning joint representation across different problems and datasets could address numerous challenges in applying deep learning to clinical applications.
Our work serves as a first step in this direction, offering an evaluation framework, datasets enriched with prompts, and insights for future investigation.

\midlacknowledgments{
    % We thank Kathmandu University for providing us access to their GPUs to run some of the experiments done in this paper.
    We thank Kathmandu University for their invaluable support in granting us access to their supercomputer infrastructure.
    This enabled the successful execution of the experiments crucial to this paper.
    % We are grateful for its willingness to collaborate and commitment to advancing scientific knowledge.
}

\bibliography{mildl24_134}

\appendix
\input{supplementary_core_revised}

\end{document}