\section{Impact on Society}

The incorporation of language prompts in medical image segmentation has the potential to impact society significantly, particularly in clinical settings.
By enabling radiologists to quickly and accurately segment complex shapes using just a few words, language prompts offer a more interpretable and explainable approach compared to traditional visual prompts such as points or boxes.

One significant advantage of language prompts is their ability to convey detailed information about the texture, shape, and spatial relationships of normal and abnormal structures.
This allows for a more comprehensive understanding of medical images, facilitating more accurate segmentation results.
Additionally, language prompts can be easily adapted to new classes, making them highly versatile and adaptable in various medical scenarios.

Using language prompts in medical image segmentation can improve the efficiency and effectiveness of radiologists' work, potentially leading to faster diagnoses and treatment decisions.
Moreover, the interpretability of language prompts can aid in building trust and confidence among healthcare professionals and patients as the reasoning behind the segmentation process becomes more transparent.

Overall, the integration of language prompts in medical image segmentation has the potential to revolutionize clinical practices, providing radiologists with a powerful tool to enhance their segmentation capabilities and ultimately improve patient care outcomes.

We strongly encourage and invite other researchers to contribute to this field of study.
We do not foresee any negative impact of this work on society or on further research in medical imaging: we have adhered to the ethical considerations of the field and do not disparage any previous studies.

\section{Dataset and Code Access}
\label{sup:sec:dataset}

The GitHub repository\footnote{\url{https://github.com/naamiinepal/medvlsm}} contains the source code with detailed documentation, the generated prompts for all the datasets, and thorough instructions along with the relevant links to access the individual image-mask pair datasets used in this work.

\section{Experiments}
\label{sup:sec:experiments}

\subsection{VLSM Finetuning Experiments}
\label{sup:sec:vlsm_experiments}
CLIPSeg and CRIS internally resize the three-channeled input images to $352 \times 352$ and  $416 \times 416$, respectively.
The Dice scores reported in the paper are computed after resizing the model outputs back to the original image size (i.e., the size before the model-specific resizing).
We normalize the resized images with the means and standard deviations provided by the respective models and apply no other preprocessing or post-processing, in order to assess the models' raw performance.
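
The evaluation step described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the helper names \texttt{resize\_nearest} and \texttt{dice\_score} are hypothetical, nearest-neighbour interpolation is an assumption (the interpolation mode is not stated here), and plain Python lists stand in for tensors.

```python
def resize_nearest(mask, out_h, out_w):
    """Resize a 2D binary mask (list of rows) with nearest-neighbour sampling.

    Assumption: the interpolation mode is illustrative, not taken from the paper.
    """
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def dice_score(pred, target, eps=1e-7):
    """Dice = 2 * |P and G| / (|P| + |G|) for two binary masks of equal size."""
    inter = sum(p * g for pr, gr in zip(pred, target) for p, g in zip(pr, gr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    return (2.0 * inter + eps) / (total + eps)


# A 2x2 model-resolution prediction mapped back to a 4x4 original image.
pred_small = [[1, 0],
              [0, 0]]
ground_truth = [[1, 1, 1, 0],
                [1, 1, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]]
pred_full = resize_nearest(pred_small, 4, 4)
print(round(dice_score(pred_full, ground_truth), 4))
```

Computing the score at the original resolution keeps results comparable across models that use different internal input sizes.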

For the five non-radiology datasets (Kvasir-SEG, ClinicDB, BKAI, ISIC, and DFU), we finetune VLSMs with ten prompts for an individual dataset, resulting in $50$ experiments for each VLSM.
Similarly, in the case of radiology datasets (CAMUS, BUSI, and CheXlocalize), we have a total of $22$ finetuning experiments for each VLSM. 
We also finetune CRIS and CLIPSeg on two pooled datasets: one comprising only the endoscopy datasets and one comprising all datasets.
Thus, including all varieties with the VLSMs and the different prompting mechanisms, we have $442$ finetuning experiments.
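
The per-VLSM counts can be checked against the number of prompts available for each dataset (\tableref{tab:dataset-prompts-combination}, plus the empty prompt P0): each non-radiology dataset has ten prompts (P0--P9), CAMUS has eight (P0--P7), and BUSI and CheXlocalize have seven each (P0--P6), so that
\[
  5 \times 10 = 50 \quad \mbox{(non-radiology)},
  \qquad
  8 + 7 + 7 = 22 \quad \mbox{(radiology)}.
\]
Each VLSM is therefore finetuned in $50 + 22 = 72$ single-dataset settings, before the pooled-dataset runs are counted.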

The average time to finetune CRIS on one dataset with one prompt is approximately $60$ minutes in our training setup, running for $45$ epochs on average.
For CLIPSeg, the average training time is $40$ minutes, running for $90$ epochs on average.
BiomedCLIPSeg's and BiomedCLIPSeg-D's average training times are $20$ minutes and $30$ minutes, running for $80$ epochs and $50$ epochs, respectively.
We monitored the segmentation metric on the held-out validation sets for early stopping, with patience of $50$ epochs for CLIPSeg variants and $10$ epochs for CRIS.
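
The patience-based early stopping described above can be sketched as follows. This is an illustrative sketch, not the training code: the class \texttt{EarlyStopping}, the helper \texttt{epochs\_run}, and the validation curve are hypothetical, and we assume that a higher metric is better and that only a strict improvement resets the counter.

```python
class EarlyStopping:
    """Stop training when the monitored validation metric stops improving."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True if training should stop."""
        if metric > self.best:  # assumption: higher is better (e.g. Dice)
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_scores, patience):
    """Return how many epochs are executed before early stopping triggers."""
    stopper = EarlyStopping(patience)
    for epoch, score in enumerate(val_scores, start=1):
        if stopper.step(score):
            return epoch
    return len(val_scores)


# Hypothetical validation curve that peaks at epoch 3 and then plateaus.
scores = [0.60, 0.70, 0.75] + [0.74] * 60
print(epochs_run(scores, patience=10))  # patience used for CRIS
print(epochs_run(scores, patience=50))  # patience used for CLIPSeg variants
```

With the same validation curve, the larger patience simply lets training continue for more non-improving epochs before it stops.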

\subsection{Hyperparameter Tuning}

We experiment with multiple sets of hyperparameters including learning rates, optimizers, batch sizes, and schedulers.
We select the hyperparameter setting (reported in the main paper) that performed best on most datasets; the search space is listed in \tableref{tab:hyperparameter_search}.

\begin{table}[h!]
    \centering
    \begin{tabular}{r|l}
        Optimizers & \{Adam, AdamW\} \\
        \hline
         Learning Rates (LRs) & $[10^{-5}, 10^{-2}]$  \\
        \hline
         LR Schedulers & \{CosineAnnealingLR, ConstantLR, ReduceLROnPlateau\} \\
        \hline
         Batch sizes & $\{16, 32, 64, 128 \}$ \\
    \end{tabular}
    \caption{Different settings of hyperparameters that have been experimented with to select the optimal one.}
    \label{tab:hyperparameter_search}
\end{table}
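
For illustration, the search space in \tableref{tab:hyperparameter_search} can be enumerated as below. This is a sketch under stated assumptions: the table reports only an interval for the learning rate, so the four per-decade values are hypothetical, and an exhaustive grid is assumed although the text does not state how combinations were chosen. The helper \texttt{iter\_configs} is also hypothetical.

```python
import itertools

# Search space from the table. The discrete learning-rate values are an
# assumption (one per decade), since only the interval is reported.
search_space = {
    "optimizer": ["Adam", "AdamW"],
    "lr": [1e-5, 1e-4, 1e-3, 1e-2],
    "scheduler": ["CosineAnnealingLR", "ConstantLR", "ReduceLROnPlateau"],
    "batch_size": [16, 32, 64, 128],
}


def iter_configs(space):
    """Yield every combination of the search space as a dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 2 * 4 * 3 * 4 combinations
print(configs[0])
```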

\subsection{CNN-based Experiments}
\label{sup:sec:cnn_based_experiments}

For comparative analysis, we consider three conventional CNN-based segmentation models: UNet \citep{ronneberger2015u}, UNet++ \citep{zhou2018unet++}, and DeepLabV3+ \citep{chen2018encoder}.
For all of the models, we use a pretrained ResNet-50 \citep{he2016deep} as the backbone and keep the default model hyperparameters of the \textit{Segmentation Models PyTorch} framework\footnote{\url{https://github.com/qubvel/segmentation\_models.pytorch}}.
We train the models with Dice loss using the Adam optimizer \citep{kingma2014adam} with a learning rate of $10^{-3}$ and zero weight decay.


\section{PubMedBERT's failure to give reliable output}

\tableref{tab:PubMedBERT_analysis} contains the predictions of PubMedBERT for the masked language modeling in different datasets.

\begin{table}[h!]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
      {tab:PubMedBERT_analysis}
      {\caption{PubMedBERT's top five predictions for the masked language modeling inference. The predictions are listed in descending order of the probability assigned by the model.
      The model is highly uncertain, as the maximum probability is about $0.1$.
      The predictions are nearly identical across sentences and uninformative, which is most prominent in the radiology datasets.}}
      {\resizebox{\linewidth}{!}{%
        \begin{tabular}{lll}
            \textbf{Dataset} & \textbf{Masked sentence} & \textbf{Top-5 Predictions} \\
            \hline
            \multirow{6}{*}{All Endoscopy*} & \multirow{1}{*}{The location of the polyp is [MASK].} & \multirow{1}{*}{[variable, unknown, varied, unpredictable, uncertain]} \\
            & \multirow{1}{*}{Polyp is located at [MASK].} & \multirow{1}{*}{[bifurcation, apex, rectum, midline, right]} \\
            & \multirow{1}{*}{The shape of polyp is [MASK].} & \multirow{1}{*}{[irregular, variable, oval, round, different]} \\
            & \multirow{1}{*}{Polyp is [MASK] in shape.} & \multirow{1}{*}{[oval, irregular, round, spherical, cylindrical]} \\
            & The color of the polyp is [MASK]. & [yellow, red, blue, brown, pink] \\
            & Polyp is [MASK] in color. & [yellow, white, red, black, green] \\
            \hline
            \multirow{5}{*}{ISIC}& The location of skin melanoma is [MASK]. & [unknown, variable, unusual, unpredictable, rare] \\
            & The color of skin melanoma is [MASK]. & [red, yellow, brown, black, blue]\\
            & Skin melanoma is [MASK] in texture. & [heterogeneous, variable, soft, irregular, fibrous]\\
            & Skin cancer is located at [MASK]. & [extremities, birth, puberty, adolescence, skin]\\
            & Skin cancer is [MASK] in texture. & [heterogeneous, unique, variable, diverse, distinctive]\\
            \hline
            \multirow{4}{*}{DFU} & The location of a diabetic foot ulcer is at [MASK]. & [first, rest, ankle, home, foot] \\
            & Diabetic foot ulcer is located at [MASK]. & [ankle, heel, foot, extremities, feet] \\
            & The location of the foot ulcer is [MASK]. & [ankle, knee, first, heel, night] \\
            & Foot ulcer is located at [MASK]. & [ankle, heel, foot, knee, night] \\
            \hline
            \multirow{6}{*}{CAMUS} & The left ventricular cavity is [MASK] in shape. & [spherical, triangular, normal, oval, round] \\
            & The myocardium is [MASK] in shape. & [spherical, cylindrical, circular, round, triangular] \\
            & The left atrium cavity is [MASK] in shape. & [oval, round, triangular, spherical, irregular] \\
            & The left ventricular cavity is located at [MASK]. & [diastole, apex, rest, $90^\circ$, $45^\circ$] \\
            & The myocardium is located at [MASK]. & [rest, apex, risk, diastole, birth] \\
            & The left atrium cavity is located at [MASK]. & [diastole, right, left, $90^\circ$, apex] \\
            \hline
            \multirow{2}{*}{BUSI} & The malignant breast tumor is [MASK] in shape. & [round, irregular, oval, solid, spherical] \\
            & The benign breast tumor is [MASK] in shape. & [oval, round, irregular, solid, spherical] \\
            \hline
            \multirow{7}{*}{CheXlocalize} & Airspace Opacity is [MASK] in shape. & [irregular, oval, round, triangular, globular] \\
            & Enlarged Cardiomediastinum is [MASK] in shape. & [oval, triangular, irregular, round, rounded] \\
            & Cardiomegaly is [MASK] in shape. & [irregular, triangular, normal, oval, round] \\
            & Lung Opacity is [MASK] in shape. & [irregular, round, oval, nodular, reticular] \\
            & Consolidation is [MASK] in shape. & [spherical, circular, triangular, irregular, round] \\
            & Atelectasis is [MASK] in shape. & [irregular, oval, triangular, spherical, round] \\
            & Pleural Effusion is [MASK] in shape. & [irregular, round, oval, spherical, solid] \\
            \hline
            \multicolumn{3}{l}{*This includes six datasets of endoscopy: Kvasir-SEG, ClinicDB, BKAI, CVC-300, CVC-ColonDB, ETIS}
        \end{tabular}%
    }}
\end{table}

% \newpage
\section{Some visualizations and qualitative analysis}
\label{sec:supp_vis}

Some visualizations and qualitative analysis are shown in \figureref{fig:changed_attr_plot_sup,fig:plot_fig}.

\begin{figure}[hp]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
      {fig:changed_attr_plot_sup}
      {\caption{Visualization of CRIS's performance when a prompt attribute is replaced with an incorrect value. For each medical image, three corresponding masks are displayed: the ground truth mask, the output mask for the original prompt, and the output mask after altering one attribute value of the prompt.}}
      {%
        \subfigure[ISIC for attribute \textit{size}]{
        \includegraphics[width=0.49\textwidth]{isic_swapped_size_12}
        }%
        \subfigure[Kvasir-SEG for attribute \textit{location}]{
            \includegraphics[width=0.49\textwidth]{kvasir_polyp_swapped_pos_12}
        }
        \subfigure[ClinicDB for attribute \textit{size}]{
            \includegraphics[width=0.49\textwidth]{clinicdb_polyp_swapped_size_12}
        }% 
        \subfigure[DFU for attribute \textit{size}]{
            \includegraphics[width=0.49\textwidth]{dfu_swapped_size_12}
        }
        \subfigure[BKAI for attribute \textit{size}]{
            \includegraphics[width=0.49\textwidth]{bkai_polyp_swapped_size_12}
        }
      }
\end{figure}

\begin{figure}[hp]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
      {fig:plot_fig}
      {\caption{Sample input, ground truth, and models' predictions}}
      {\includegraphics[height=0.96\textheight]{plot_fig}}
\end{figure}

\section{Results}

\subsection{Finetuning only the Decoders for CLIP-based VLSMs }
\label{sec:supp_ablation}
% We have also ablated by freezing the CLIP encoders for CLIPSeg and CRIS to see if we can reuse the learned representation of text and image encoders of CLIP, finetuning only the decoder block.
\tableref{tab:finetune_cris_frozen_sup,tab:clip_seg_dec_fine_tune_frozen_sup} show the results of the VLSMs when only the decoder is finetuned while the encoders are kept frozen.

\begin{table}[h!]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
      {tab:finetune_cris_frozen_sup}
      {\caption{Finetuned segmentation Dice score (\%) of CRIS across datasets and prompts with the CLIP encoders frozen.}}
      {\resizebox{\linewidth}{!}{%
        \begin{tabular}{l|cccccccccc}
             \diagbox[width=2.8cm]{\textbf{Dataset $\downarrow$}}{\textbf{Prompt $\rightarrow$}} & \textbf{P0} & \textbf{P1} & \textbf{P2} & \textbf{P3} & \textbf{P4} & \textbf{P5} & \textbf{P6} & \textbf{P7} & \textbf{P8} & \textbf{P9} \\
             \hline
             \textbf{Kvasir-SEG} & $ 75.49 \smallStd{ 27.22 }$ & $ 76.03 \smallStd{ 26.35 }$ & $ 82.18 \smallStd{ 22.40 }$ & $ 81.89 \smallStd{ 21.78 }$ & $ 84.26 \smallStd{ 20.39 }$ & $\mathbf{86.39 \smallStd{ 17.01 }}$ & $ 85.37 \smallStd{ 17.29 }$ & $ 82.43 \smallStd{ 22.11 }$ & $ 85.06 \smallStd{ 18.89 }$ & $ 85.02 \smallStd{ 19.10 }$ \\
             \hline
             \textbf{ClinicDB} & $ 49.48 \smallStd{ 33.67 }$ & $ 46.98 \smallStd{ 34.30 }$ & $ 81.07 \smallStd{ 24.37 }$ & $ 82.72 \smallStd{ 23.96 }$ & $ 84.88 \smallStd{ 24.00 }$ & $ 85.01 \smallStd{ 21.84 }$ & $ 83.31 \smallStd{ 22.74 }$ & $ 81.66 \smallStd{ 26.10 }$ & $\mathbf{87.13 \smallStd{ 21.38 }}$ & $ 84.65 \smallStd{ 22.25 }$ \\
             \hline
             \textbf{BKAI} & $ 77.98 \smallStd{ 28.73 }$ & $ 75.01 \smallStd{ 29.97 }$ & $ 81.93 \smallStd{ 24.66 }$ & $ 82.49 \smallStd{ 24.7 }$ & $ 82.39 \smallStd{ 23.65 }$ & $ 84.65 \smallStd{ 21.75 }$ & $ 85.75 \smallStd{ 21.48 }$ & $ 84.91 \smallStd{ 23.06 }$ & $\mathbf{86.40 \smallStd{ 20.49 }}$ & $ 85.07 \smallStd{ 22.01 }$ \\
             \hline
             \textbf{ISIC} & $ 87.64 \smallStd{ 14.37 }$ & $ 85.77 \smallStd{ 18.29 }$ & $ 90.25 \smallStd{ 10.37 }$ & $ 90.32 \smallStd{ 10.93 }$ & $ 91.28 \smallStd{ 7.45 }$ & $ 91.23 \smallStd{ 8.56 }$ & $\mathbf{91.29 \smallStd{ 8.10 }}$ & $ 90.46 \smallStd{ 10.90 }$ & $ 91.29 \smallStd{ 8.11 }$ & $ 91.28 \smallStd{ 7.65 }$ \\
             \hline
             \textbf{DFU} & $ 66.30 \smallStd{ 29.57 }$ & $ 66.14 \smallStd{ 29.81 }$ & $\mathbf{70.28 \smallStd{ 27.11 }}$ & $ 67.24 \smallStd{ 30.22 }$ & $ 69.19 \smallStd{ 28.98 }$ & $ 68.55 \smallStd{ 29.56 }$ & $ 68.93 \smallStd{ 29.41 }$ & $ 69.35 \smallStd{ 28.75 }$ & $ 68.36 \smallStd{ 29.62 }$ & $ 70.15 \smallStd{ 28.59 }$ \\
             \hline
             \textbf{CAMUS} & $ 46.15 \smallStd{ 9.69 }$ & $ 88.87 \smallStd{ 8.49 }$ & $ \mathbf{89.18 \smallStd{ 6.79 }}$ & $ 88.94 \smallStd{ 7.05 }$ & $ 88.92 \smallStd{ 6.69 }$ & $ 88.02 \smallStd{ 7.37 }$ & $ 88.96 \smallStd{ 6.84 }$ & $ 89.04 \smallStd{ 6.85 }$ & N/A & N/A \\
             \hline
             \textbf{BUSI} & $ 47.11 \smallStd{ 39.12 }$ & $ 61.49 \smallStd{ 36.03 }$ & $ 63.18 \smallStd{ 36.89 }$ & $ 62.87 \smallStd{ 37.60 }$ & $ 65.10 \smallStd{ 36.60 }$ & $ 66.69 \smallStd{ 35.68 }$ & $\mathbf{66.76 \smallStd{ 35.77 }}$ & N/A & N/A & N/A \\
             \hline
             \textbf{CheXlocalize} & $ 41.03 \smallStd{ 24.96 }$ & $ 54.18 \smallStd{ 25.77 }$ & $ 54.57 \smallStd{ 25.06 }$ & $ 53.30 \smallStd{ 25.16 }$ & $\mathbf{56.17 \smallStd{ 24.73 }}$ & $ 56.03 \smallStd{ 24.49 }$ & $ 52.48 \smallStd{ 25.89 }$ & N/A & N/A & N/A\\
             \hline
        \end{tabular}%
    }}
\end{table}

\begin{table}[h!]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
      {tab:clip_seg_dec_fine_tune_frozen_sup}
      {\caption{Finetuned segmentation Dice score (\%) of CLIPSeg across datasets and prompts with the CLIP encoders frozen.}}
      {\resizebox{\linewidth}{!}{%
        \begin{tabular}{l|cccccccccc}
            \diagbox[width=2.8cm]{\textbf{Dataset $\downarrow$}}{\textbf{Prompt $\rightarrow$}} & \textbf{P0} & \textbf{P1} & \textbf{P2} & \textbf{P3} & \textbf{P4} & \textbf{P5} & \textbf{P6} & \textbf{P7} & \textbf{P8} & \textbf{P9} \\
            \hline
            \textbf{Kvasir-SEG} & $ 86.38 \smallStd{ 17.8 }$ & $ 87.50 \smallStd{ 15.35 }$ & $ 87.49 \smallStd{ 14.29 }$ & $ 87.68 \smallStd{ 14.60 }$ & $ 88.33 \smallStd{ 10.95 }$ & $ 88.25 \smallStd{ 12.11 }$ & $ \mathbf{88.98 \smallStd{ 11.98 }}$ & $87.97 \smallStd{ 13.93 }$ & $ 88.39 \smallStd{ 14.72 }$ & $ 88.71 \smallStd{ 11.4 }$ \\
            \hline
            \textbf{ClinicDB} & $ 87.23 \smallStd{ 14.93 }$ & $ 87.07 \smallStd{ 14.43 }$ & $ \mathbf{88.41 \smallStd{ 11.01 }}$ & $ 87.17 \smallStd{ 14.73 }$ & $ 87.25 \smallStd{ 15.09 }$ & $ 87.73 \smallStd{ 13.52 }$ & $ 87.76 \smallStd{ 13.56 }$ & $ 87.57 \smallStd{ 13.98 }$ & $ 87.05 \smallStd{ 14.79 }$ & $ 87.46 \smallStd{ 14.39 }$ \\
            \hline
            \textbf{BKAI} & $ 83.64 \smallStd{ 18.59 }$ & $ 85.26 \smallStd{ 15.40 }$ & $ 85.47 \smallStd{ 15.15 }$ & $ 84.7 \smallStd{ 16.94 }$ & $ 85.93 \smallStd{ 14.66 }$ & $ \mathbf{86.01 \smallStd{ 14.84 }}$ & $ 85.02 \smallStd{ 17.23 }$ & $ 85.45 \smallStd{ 14.76 }$ & $ 85.50 \smallStd{ 15.68 }$ & $ 84.99 \smallStd{ 17.11 }$ \\
            \hline
            \textbf{ISIC} & $ 91.71 \smallStd{ 8.68 }$ & $ 91.45 \smallStd{ 8.47 }$ & $ 91.66 \smallStd{ 8.29 }$ & $ 91.85 \smallStd{ 8.36 }$ & $ \mathbf{92.11 \smallStd{ 6.87 }}$ & $ 92.02 \smallStd{ 6.88 }$ & $ 92.09 \smallStd{ 7.00 }$ & $ 91.77 \smallStd{ 7.73 }$ & $ 91.89 \smallStd{ 7.70 }$ & $ 91.90 \smallStd{ 7.21 }$ \\
            \hline
            \textbf{DFU} & $ 72.35 \smallStd{ 25.04 }$ & $ 72.19 \smallStd{ 25.69 }$ & $ 71.79 \smallStd{ 25.05 }$ & $ 71.88 \smallStd{ 24.83 }$ & $ 72.5 \smallStd{ 24.43 }$ & $ 72.31 \smallStd{ 25.27 }$ & $ \mathbf{73.53 \smallStd{ 23.68 }}$ & $ 72.1 \smallStd{ 25.48 }$ & $ 73.11 \smallStd{ 23.98 }$ & $ 73.31 \smallStd{ 23.81 }$ \\
            \hline
            \textbf{CAMUS} & $ 46.48 \smallStd{ 9.07 }$ &$ 88.67 \smallStd{ 6.25 }$  &$ 88.70 \smallStd{ 5.93 }$  & $ \mathbf{88.81 \smallStd{ 6.15 }}$  & $ 88.77 \smallStd{ 6.22 }$ & $ 88.47 \smallStd{ 6.55 }$ & $ 88.53 \smallStd{ 6.29 }$ & $ 87.82 \smallStd{ 7.01 }$ & N/A & N/A \\
            \hline
            \textbf{BUSI} & $ 62.03 \smallStd{ 38.3 }$ & $ 62.79 \smallStd{ 37.55 }$ & $ 62.97 \smallStd{ 37.27 }$ & $ 62.85 \smallStd{ 36.66 }$ & $ \mathbf{64.47 \smallStd{ 37.54 }}$ & $ 62.83 \smallStd{ 38.19 }$ & $ 62.33 \smallStd{ 38.68 }$ & N/A & N/A & N/A\\
            \hline
            \textbf{CheXlocalize} & $ 45.35 \smallStd{ 25.18 }$ & $ 58.10 \smallStd{ 25.03 }$ & $ 58.37 \smallStd{ 24.50 }$ & $ 58.95 \smallStd{ 24.48 }$ & $ 59.49 \smallStd{ 25.11 }$ & $ \mathbf{59.56 \smallStd{ 24.70 }}$ & $ 58.06 \smallStd{ 25.34 }$ & N/A & N/A & N/A \\
            \hline
        \end{tabular}%
    }}
\end{table}

\subsection{Using radiology reports for lung segmentation}
To examine the use of free-text radiology reports of chest X-rays for segmentation, we utilize 1,141 frontal-view CXRs randomly selected from the MIMIC-CXR database \citep{johnson2019mimic, johnson2019mimicjpg, chen2022chest}.
This dataset contains manually verified lung segmentation masks.
We use the free-text radiology reports provided in the MIMIC-CXR Database \citep{johnson2019mimic} as the only prompt (P1), and the results are reported in \tableref{tab:manual-cxr-experiments}.

% \tableref{tab:manual-cxr-experiments} shows that \st{in both CRIS and CLIPSeg models, adding reports as additional prompts negatively affects zero-shot segmentation.
% However,} when finetuning, CRIS performs significantly better with reports than with empty prompts.
% This indicates that adding only free-text radiology reports of the chest X-ray might benefit lung segmentation tasks.

\begin{table}[h!]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
      {tab:manual-cxr-experiments}
      {\caption{Zero-shot and finetuning Dice scores (\%) of CRIS and CLIPSeg on the manually labeled chest X-ray segmentation dataset. We use the actual radiology reports as \textbf{P1}; P0 indicates an empty prompt.}}
      {\begin{tabular}{cc|ll}
        \textbf{Models $\downarrow$} & \diagbox{\textbf{Experiment $\downarrow$}}{\textbf{Prompt $\rightarrow$}} & \textbf{P0} & \textbf{P1}  \\
        \hline
        \multirow{2}{*}{\textbf{CRIS}} &\textbf{Zero-shot} & $44.8 \smallStd{18.97}$ & $40.73 \smallStd{18.95}$ \\
        &\textbf{Finetuning} & $81.66 \smallStd{5.65}$ & $90.99 \smallStd{1.41}$ \\
        \hline
        \multirow{2}{*}{\textbf{CLIPSeg}}&\textbf{Zero-shot} & $0.26 \smallStd{2.35}$ & $0.09 \smallStd{0.88}$ \\
        &\textbf{Finetuning} & $91.39 \smallStd{1.09}$ & $91.22 \smallStd{1.26}$ \\
        \hline
    \end{tabular}%
    }
    
\end{table}

\section{Prompt Composition}
\label{sec:appendix-prompts}

The prompts used during training for the various datasets are shown below.
If a dataset has multiple templates for the same prompt, one is chosen at random during training, which acts as a form of regularization for the models.

\begin{table}[h!]
     % The first argument is the label.
     % The caption goes in the second argument, and the table contents
     % go in the third argument.
    \floatconts
      {tab:dataset-prompts-combination}%
      {\caption{
        Different prompts are formed for each dataset using combinations of $14$ potential attributes.
        Although some attributes, like \textit{Pathology}, are specific to some particular datasets, others, like \textit{Class Keywords}, are common to all the datasets.
      }}%
      {\resizebox{\linewidth}{!}{%
        \begin{tabular}{c|ccccccccc}
        \multicolumn{10}{l}{\begin{tabular}[c]{@{}l@{}}\textbf{Attributes $\rightarrow$} \textbf{a1:} Class Keyword; \textbf{a2:} Shape; \textbf{a3:} Color; \textbf{a4:} Size; \textbf{a5:} Number; \textbf{a6:} Location; \textbf{a7:} General Class Info; \textbf{a8:} View; \textbf{a9:} Pathology; \textbf{a10:} Cardiac Cycle; \\ \qquad \qquad
        \textbf{a11:} Gender; \textbf{a12:} Age; \textbf{a13:} Image Quality; \textbf{a14:} Tumor Type\end{tabular}} \\
        \hline
        \textbf{Prompts $\rightarrow$} & \multirow{2}{*}{P1} & \multirow{2}{*}{P2} & \multirow{2}{*}{P3} & \multirow{2}{*}{P4} & \multirow{2}{*}{P5} & \multirow{2}{*}{P6} & \multirow{2}{*}{P7} & \multirow{2}{*}{P8} & \multirow{2}{*}{P9} \\
        \cline{1-1}
        \textbf{Datasets $\downarrow$} &  \\
        \hline
        \textbf{Non-Radiology} & a1 & a1a2 & a1a2a3 & a1a2a3a4 & a1a2a3a4a5 & a1a2a3a4a5a6 & a1a7 & a1a2a3a4a5a7 & a1a2a3a4a5a6a7 \\\\
        Example Prompt & \multicolumn{9}{l}{\textbf{P9} $\rightarrow$ \textbf{one small pink round polyp} which is \textbf{often a bumpy flesh in rectum} located in \textbf{center} of the image} \\\\
        \hline
        \textbf{CheXlocalize} & a1 & a1a8 & a1a2a8 & a1a2a6a8 & a1a2a6a8a9 & a1a9 & N/A & N/A & N/A \\\\
        Example Prompt & \multicolumn{9}{p {1.45\linewidth}}{\textbf{P5} $\rightarrow$ \textbf{Airspace Opacity} of shape \textbf{rectangle}, and located in \textbf{right} of the \textbf{frontal} view of a Chest Xray. \textbf{Enlarged Cardiomediastinum, Cardiomegaly, Lung Opacity, Consolidation, Atelectasis, Pleural Effusion} are present.} \\
        \hline
        \textbf{CAMUS} & a1 & a1a8 & a1a8a10 & a1a8a10a11 & a1a8a10a11a12 & a1a8a10a11a12a13 & a1a8a10a11a12a13a2 & N/A & N/A \\\\
        Example Prompt & \multicolumn{9}{p {1.45\linewidth}}{\textbf{P7 $\rightarrow$ Left ventricular cavity} of \textbf{triangular shape} in \textbf{two-chamber view} in the cardiac ultrasound at the end of the \textbf{diastole cycle} of a \textbf{40-year-old female} with \textbf{poor image quality}.} \\
        \hline
        \textbf{BUSI} & a1 & a1a14 & a1a14a5 & a1a14a5a4 & a1a14a5a4a6 & a1a14a5a4a6a2 & N/A & N/A & N/A \\\\
        Example Prompt & \multicolumn{9}{l}{\textbf{P6 $\rightarrow$ Two medium square-shaped benign tumors} at the \textbf{center, left} in the breast ultrasound image.} \\
        \hline
        \end{tabular}%
        }
    }
\end{table}

\subsection{Non-radiology images}
\label{sec:non_radiology_images}

\subsubsection{Endoscopy Datasets}

A total of six endoscopy datasets (polyp segmentation image-mask pairs) have been used for finetuning and evaluating our proposed models: Kvasir-SEG \citep{jha2020kvasir}, ClinicDB \citep{bernal2015wm}, BKAI \citep{ngoc2021neounet, an2022blazeneo}, CVC-300 \citep{vazquez2017benchmark}, CVC-ColonDB \citep{tajbakhsh2015automated}, and ETIS \citep{silva2014toward}.
The last three datasets have a small number of image-mask pairs, so they are used only for testing and evaluating the trained models.

\begin{enumerate}
    \item \textbf{P0}: ``" (No prompt)
    
    \item \textbf{P1}: ``\textit{class name}"
        \begin{itemize}
            \item \textit{polyp}
        \end{itemize}
        
    \item \textbf{P2}: ``\textit{shape} \textit{class name}"
        \begin{itemize}
            \item \textit{round} \textit{polyp}
        \end{itemize}
    
    \item \textbf{P3}: ``\textit{color} \textit{shape} \textit{class name}"
        \begin{itemize}
            \item \textit{pink} \textit{round} \textit{polyp}
        \end{itemize}
    
    \item \textbf{P4}: ``\textit{size} \textit{color} \textit{shape} \textit{class name}"
        \begin{itemize}
            \item \textit{medium} \textit{pink} \textit{round} \textit{polyp}
        \end{itemize}
    
    \item \textbf{P5}: ``\textit{number} \textit{size} \textit{color} \textit{shape} \textit{class name}"
        \begin{itemize}
            \item \textit{one} \textit{medium} \textit{pink} \textit{round} \textit{polyp}
        \end{itemize}
    
    \item \textbf{P6}: ``\textit{number} \textit{size} \textit{color} \textit{shape} \textit{class name}, located in the \textit{location} of the image"
        \begin{itemize}
            \item \textit{one} \textit{medium} \textit{pink} \textit{round} \textit{polyp}, located in the \textit{top left} of the image
        \end{itemize}
        
    \item \textbf{P7}: ``\textit{class name}, which is a \textit{general description of the class}"
        \begin{itemize}
  \item\textit{polyp}, which is a \textit{small lump in the lining of colon} 
        \end{itemize}

    \item \textbf{P8}: ``\textit{number} \textit{size} \textit{color} \textit{shape} \textit{class name}, which is a \textit{general description of the class}"
        \begin{itemize}
  \item\textit{one} \textit{medium} \textit{pink} \textit{round} 
 \textit{polyp}, which is a \textit{small lump in the lining of colon} 
        \end{itemize}

    \item \textbf{P9}: ``\textit{number} \textit{size} \textit{color} \textit{shape} \textit{class name}, which is a \textit{general description of the class} located in the \textit{location} of the image "
    \begin{itemize}
\item\textit{one} \textit{medium} \textit{pink} \textit{round} 
\textit{polyp}, which is a \textit{small lump in the lining of colon} located in the \textit{top left} of the image
    \end{itemize}
\end{enumerate}

For the \textit{general description of the class} attribute, descriptions were written using information about the subject available on the internet.
Five such descriptions were designed for each dataset, and one was sampled at random as the \textit{general description of the class} attribute whenever the prompts \textbf{P7}, \textbf{P8}, and \textbf{P9} were used.
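
The composition of the endoscopy prompts P0 to P9 listed above can be sketched as follows. This is a minimal illustration and not the released prompt generator: the function \texttt{build\_prompt} and the attribute dictionary keys are hypothetical, while the templates and the five general descriptions are copied from this section and \tableref{tab:general_descriptions}.

```python
import random

# General descriptions for the endoscopy datasets, copied verbatim from the
# table of general descriptions in this supplement.
GENERAL_DESCRIPTIONS = [
    "a projecting growth of tissue",
    "often a bumpy flesh in rectum",
    "a small lump in the lining of colon",
    "a tissue growth that often resemble mushroom-like stalks",
    "an abnormal growth of tissues projecting from a mucous membrane",
]


def build_prompt(level, attrs, rng=random):
    """Compose prompt P<level> (0-9) from a dict of attribute values."""
    name = attrs["class_name"]
    full = "{number} {size} {color} {shape} {class_name}".format(**attrs)
    where = "located in the {location} of the image".format(**attrs)
    # Drawn on every call for simplicity; only P7-P9 actually use it.
    desc = rng.choice(GENERAL_DESCRIPTIONS)
    templates = {
        0: "",
        1: name,
        2: "{shape} {class_name}".format(**attrs),
        3: "{color} {shape} {class_name}".format(**attrs),
        4: "{size} {color} {shape} {class_name}".format(**attrs),
        5: full,
        6: f"{full}, {where}",
        7: f"{name}, which is {desc}",
        8: f"{full}, which is {desc}",
        9: f"{full}, which is {desc} {where}",
    }
    return templates[level]


attrs = {
    "class_name": "polyp", "shape": "round", "color": "pink",
    "size": "medium", "number": "one", "location": "top left",
}
for level in (1, 4, 6):
    print(f"P{level}: {build_prompt(level, attrs)}")
```

Passing a seeded \texttt{random.Random} instance as \texttt{rng} makes the sampled description reproducible.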

\subsubsection{ISIC and DFU-2022}


The prompt templates used for the ISIC \citep{gutman2016skin} and DFU-2022 \citep{kendrick2022translating} datasets were the same as the above examples for endoscopy images, with only the \textit{class name} and \textit{general description of the class} being different. We used the class names \textbf{skin melanoma} and \textbf{foot ulcer} for the two datasets, respectively.

The five \textit{general descriptions of the class} for each of the three types of photographic datasets are listed in \tableref{tab:general_descriptions}.

\begin{table}[h!]
     % Caption and label go in the first argument and the figure contents
     % go in the second argument
    \floatconts
      {tab:general_descriptions}
      {\caption{General Descriptions selected for each of the photographic datasets.}}
      {\small%
        \begin{tabular}{p{0.3\linewidth}p{0.3\linewidth}p{0.3\linewidth}}
            \textbf{Endoscopy Datasets} & \textbf{ISIC} & \textbf{DFU-2022} \\
            \hline
             $\rightarrow$ a projecting growth of tissue & $\rightarrow$ a spot with dark speckles & $\rightarrow$ a wound in foot and toes \\
             $\rightarrow$ often a bumpy flesh in rectum & $\rightarrow$ a spot with irregular texture & $\rightarrow$ a sore in foot and toes \\
             $\rightarrow$ a small lump in the lining of colon & $\rightarrow$ a dark sore with irregular texture & $\rightarrow$ a sore in skin of foot and toe \\
            $\rightarrow$ a tissue growth that often resemble mushroom-like stalks & $\rightarrow$ an irregular sore with speckles & $\rightarrow$ an abnormality in foot and toes \\
            $\rightarrow$ an abnormal growth of tissues projecting from a mucous membrane & $\rightarrow$ a rough wound on skin & $\rightarrow$ an open sore or lesion in foot and toes
        \end{tabular}%
      }
\end{table}

\subsection{Radiology Images}

\subsubsection{CheXlocalize}

The prompts for the CheXlocalize \citep{saporta2022benchmarking} dataset are listed below.

\begin{enumerate}
    \item \textbf{P0}: ``" (No prompt)
    
    \item \textbf{P1}: ``\textit{labels} in a chest Xray."
        \begin{itemize}
            \item \textit{Airspace Opacity} in a chest Xray.
        \end{itemize}
        
    \item \textbf{P2}: ``\textit{labels} in the \textit{xray\_view} view of a Chest Xray."
        \begin{itemize}
            \item Airspace Opacity in the \textit{frontal} view of a Chest Xray.
        \end{itemize}
    
    \item \textbf{P3}: ``\textit{labels} of shape \textit{shape} in the \textit{xray\_view} view of a Chest Xray."
        \begin{itemize}
            \item Airspace Opacity of shape \textit{rectangle} in the frontal view of a Chest Xray.
        \end{itemize}
    
    \item \textbf{P4}: ``\textit{labels} of shape \textit{shape}, and located in \textit{location} of the \textit{xray\_view} view of a Chest Xray."
        \begin{itemize}
            \item Airspace Opacity of shape rectangle, and located in \textit{right} of the frontal view of a Chest Xray.
        \end{itemize}
    
    \item \textbf{P5}: ``\textit{labels} of shape \textit{shape}, and located in \textit{location} of the \textit{xray\_view} view of a Chest Xray. \textit{pathology} are present."
        \begin{itemize}
            \item Airspace Opacity of shape rectangle, and located in right of the frontal view of a Chest Xray. \textit{Enlarged Cardiomediastinum, Cardiomegaly, Lung Opacity, Consolidation, Atelectasis, Pleural Effusion} are present.
        \end{itemize}
    
    \item \textbf{P6}: ``\textit{labels} in a Chest Xray. \textit{pathology} are present."
        \begin{itemize}
            \item Airspace Opacity in a Chest Xray. Enlarged Cardiomediastinum, Cardiomegaly, Lung Opacity, Consolidation, Atelectasis, Pleural Effusion are present.
        \end{itemize}
\end{enumerate}

\subsubsection{CAMUS}

The prompts for the CAMUS \citep{leclerc2019deep} dataset are listed below.

\begin{enumerate}

\item Class of Current Image
\begin{itemize}
    \item \textit{Left ventricular cavity}, \textit{Myocardium}, or \textit{Left atrium cavity} of the heart
    \item $[$\textit{class}$]$ in the cardiac ultrasound
\end{itemize}

\item Include the chamber information
\begin{itemize}
    \item Left ventricular cavity in \textit{two-chamber view} of the heart.
    \item Left ventricular cavity in \textit{two-chamber view} in the cardiac ultrasound.
\end{itemize}

\item Include the cycle

\begin{itemize}
    \item Left ventricular cavity in two-chamber view of the heart at the \textit{end of the diastole cycle}.
    \item Left ventricular cavity in two-chamber view in the cardiac ultrasound at the \textit{end of the diastole cycle}.
\end{itemize}

\item Include the gender

\begin{itemize}
    \item Left ventricular cavity in two-chamber view of the heart at the end of the diastole cycle of \textit{a female}.

    \item Left ventricular cavity in two-chamber view in the cardiac ultrasound at the end of the diastole cycle of \textit{a female}.
\end{itemize}

\item Include the age

\begin{itemize}
    \item Left ventricular cavity in two-chamber view of the heart at the end of the diastole cycle of a \textit{forty-six-year-old} female.

    \item Left ventricular cavity in two-chamber view in the cardiac ultrasound at the end of the diastole cycle of a \textit{forty-six-year-old} female.
\end{itemize}

\item Include the image quality

\begin{itemize}
    \item Left ventricular cavity in two-chamber view of the heart at the end of the diastole cycle of a 40-year-old female with \textit{poor image quality}.

    \item Left ventricular cavity in two-chamber view in the cardiac ultrasound at the end of the diastole cycle of a 40-year-old female with \textit{poor image quality}.
\end{itemize}

\item Include the mask shape

\begin{itemize}
    \item Left ventricular cavity of \textit{triangular shape} in two-chamber view of the heart at the end of the diastole cycle of a 40-year-old female with \textit{poor image quality}.

    \item Left ventricular cavity of \textit{triangular shape} in two-chamber view in the cardiac ultrasound at the end of the diastole cycle of a 40-year-old female with \textit{poor image quality}.
\end{itemize}

\end{enumerate}
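
The CAMUS prompts above grow cumulatively, and each level has two phrasings (``of the heart'' and ``in the cardiac ultrasound''), one of which is drawn at random during training. A minimal sketch of this scheme is given below; the function \texttt{camus\_prompt} and the attribute keys are hypothetical names, and the code only reproduces the example sentences listed above.

```python
import random

# The two context phrasings that appear in the CAMUS templates above.
CONTEXTS = ["of the heart", "in the cardiac ultrasound"]


def camus_prompt(level, a, rng=random):
    """Build CAMUS prompt P<level> (1-7); one of two templates is drawn at random."""
    subject = a["class_name"]
    if level >= 7:
        subject += f" of {a['shape']} shape"
    parts = [subject]
    if level >= 2:
        parts.append(f"in {a['view']} view")
    parts.append(rng.choice(CONTEXTS))
    if level >= 3:
        parts.append(f"at the end of the {a['cycle']} cycle")
    if level == 4:
        parts.append(f"of a {a['gender']}")
    if level >= 5:
        parts.append(f"of a {a['age']}-year-old {a['gender']}")
    if level >= 6:
        parts.append(f"with {a['quality']} image quality")
    # In the listed examples, only prompts from P2 onwards end with a period.
    return " ".join(parts) + ("." if level >= 2 else "")


camus_attrs = {
    "class_name": "Left ventricular cavity", "view": "two-chamber",
    "cycle": "diastole", "gender": "female", "age": 40,
    "quality": "poor", "shape": "triangular",
}
for level in range(1, 8):
    print(f"P{level}: {camus_prompt(level, camus_attrs)}")
```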

\subsubsection{Breast Ultrasound Images Dataset}

The prompts for the Breast Ultrasound Images (BUSI) \citep{al2020dataset} dataset are listed below. 

\begin{enumerate}
    \item Presence of tumor
        \begin{itemize}
            \item \textit{[No] tumor} in the breast ultrasound image
        \end{itemize}
    \item Tumor Type
        \begin{itemize}
            \item \textit{Benign} tumor in the breast ultrasound image
            \item \textit{Regular-shaped} tumor in the breast ultrasound image
        \end{itemize}
    \item Tumor Number
        \begin{itemize}
            \item \textit{Two} benign tumors in the breast ultrasound image
            \item \textit{Two} regular-shaped tumors in the breast ultrasound image
        \end{itemize}
    \item Tumor Coverage
        \begin{itemize}
            \item Two \textit{medium} benign tumors in the breast ultrasound image
            \item Two \textit{medium} regular-shaped tumors in the breast ultrasound image
        \end{itemize}
    \item Tumor Location
        \begin{itemize}
            \item Two medium benign tumors \textit{at the center, left} in the breast ultrasound image
            \item Two medium regular-shaped tumors \textit{at the center, left} in the breast ultrasound image
        \end{itemize}
    \item Tumor Shape
        \begin{itemize}
            \item Two medium \textit{square-shaped} benign tumors at the center, left in the breast ultrasound image
            \item Two medium \textit{square-shaped} regular tumors at the center, left in the breast ultrasound image
        \end{itemize}
\end{enumerate}