% \documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

% Added \usepackage
\usepackage[colorlinks,
            linkcolor=red,       % change this to the color you want
            anchorcolor=red,  % change this to the color you want
            citecolor=green,        % change this to the color you want, e.g., change blue to red
            ]{hyperref}

\usepackage{url}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}


% For theorems and such
\usepackage{tikz} 
\usepackage{pgfplots}
\usepackage{subcaption}
\usepackage{titling}
\usepackage{wrapfig} % wrap text around figures
\usepackage{graphicx}
\usepackage{subcaption}


\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{algorithm} % algorithm float
\usepackage{algorithmic} % algorithm pseudocode
\usepackage{enumerate} % customizable item lists
\usepackage{multirow} % merge table cells across rows
\usepackage{amsmath}

\usepackage{colortbl} 
\usepackage{xcolor}
\usepackage{array}
\usepackage{bbding}
\usepackage{enumerate}
\usepackage{bm}
% \usepackage{subfigure}

\usepackage{helvet} %Required
\usepackage{courier} %Required
\usepackage{color}
% \usepackage{cite}
\usepackage[capitalize,noabbrev]{cleveref}
\usepackage[textsize=tiny]{todonotes}


%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\newcommand{\Amat}[0]{\ensuremath{{\bf A}}}
\newcommand{\Bmat}[0]{\ensuremath{{\bf B}} }
\newcommand{\Cmat}[0]{\ensuremath{{\bf C}} }
\newcommand{\Dmat}[0]{\ensuremath{{\bf D}} }
\newcommand{\Emat}[0]{\ensuremath{{\bf E}} }
\newcommand{\Fmat}[0]{\ensuremath{{\bf F}} }
\newcommand{\Gmat}[0]{\ensuremath{{\bf G}} }
\newcommand{\Hmat}[0]{\ensuremath{{\bf H}} }
\newcommand{\Imat}[0]{\ensuremath{{\bf I}} }
\newcommand{\Jmat}[0]{\ensuremath{{\bf J}} }
\newcommand{\Kmat}[0]{\ensuremath{{\bf K}} }
\newcommand{\Lmat}[0]{\ensuremath{{\bf L}} }
\newcommand{\Mmat}[0]{\ensuremath{{\bf M}} }
\newcommand{\Nmat}[0]{\ensuremath{{\bf N}} }
\newcommand{\Omat}[0]{\ensuremath{{\bf O}} }
\newcommand{\Pmat}[0]{\ensuremath{{\bf P}} }
\newcommand{\Qmat}[0]{\ensuremath{{\bf Q}} }
\newcommand{\Rmat}[0]{\ensuremath{{\bf R}} }
\newcommand{\Smat}[0]{\ensuremath{{\bf S}} }
\newcommand{\Tmat}[0]{\ensuremath{{\bf T}} }
\newcommand{\Umat}[0]{\ensuremath{{\bf U}} }
\newcommand{\Vmat}[0]{\ensuremath{{\bf V}} }
\newcommand{\Wmat}[0]{\ensuremath{{\bf W}} }
\newcommand{\Xmat}[0]{\ensuremath{{\bf X}} }
\newcommand{\Ymat}[0]{\ensuremath{{\bf Y}} }
\newcommand{\Zmat}[0]{\ensuremath{{\bf Z}} }
\newcommand{\bds}[1]{\boldsymbol{#1}}
\newcommand{\zerov}[0]{\ensuremath{{\bf 0}} }
\newcommand{\onev}[0]{\ensuremath{{\bf 1}} }
\newcommand{\av}[0]{\ensuremath{\boldsymbol{a}} }
\newcommand{\bv}[0]{\ensuremath{\boldsymbol{b}} }
\newcommand{\cv}[0]{\ensuremath{\boldsymbol{c}} }
\newcommand{\dv}[0]{\ensuremath{\boldsymbol{d}} }
\newcommand{\ev}[0]{\ensuremath{\boldsymbol{e}} }
\newcommand{\fv}[0]{\ensuremath{\boldsymbol{f}} }
\newcommand{\gv}[0]{\ensuremath{\boldsymbol{g}} }
\newcommand{\hv}[0]{\ensuremath{\boldsymbol{h}} }
\newcommand{\iv}[0]{\ensuremath{\boldsymbol{i}} }
\newcommand{\jv}[0]{\ensuremath{\boldsymbol{j}} }
\newcommand{\kv}[0]{\ensuremath{\boldsymbol{k}} }
\newcommand{\lv}[0]{\ensuremath{\boldsymbol{l}} }
\newcommand{\mv}[0]{\ensuremath{\boldsymbol{m}} }
\newcommand{\nv}[0]{\ensuremath{\boldsymbol{n}} }
\newcommand{\ov}[0]{\ensuremath{\boldsymbol{o}} }
\newcommand{\pv}[0]{\ensuremath{\boldsymbol{p}} }
\newcommand{\qv}[0]{\ensuremath{\boldsymbol{q}} }
\newcommand{\rv}[0]{\ensuremath{\boldsymbol{r}} }
\newcommand{\sv}[0]{\ensuremath{\boldsymbol{s}} }
\newcommand{\tv}[0]{\ensuremath{\boldsymbol{t}} }
\newcommand{\uv}[0]{\ensuremath{\boldsymbol{u}} }
\newcommand{\vv}[0]{\ensuremath{\boldsymbol{v}} }
\newcommand{\wv}[0]{\ensuremath{\boldsymbol{w}} }
\newcommand{\xv}[0]{\ensuremath{\boldsymbol{x}} }
\newcommand{\yv}[0]{\ensuremath{\boldsymbol{y}} }
\newcommand{\zv}[0]{\ensuremath{\boldsymbol{z}} }
\newcommand{\Av}[0]{\ensuremath{\boldsymbol{A}} }
\newcommand{\Bv}[0]{\ensuremath{\boldsymbol{B}} }
\newcommand{\Cv}[0]{\ensuremath{\boldsymbol{C}} }
\newcommand{\Dv}[0]{\ensuremath{\boldsymbol{D}} }
\newcommand{\Ev}[0]{\ensuremath{\boldsymbol{E}} }
\newcommand{\Fv}[0]{\ensuremath{\boldsymbol{F}} }
\newcommand{\Gv}[0]{\ensuremath{\boldsymbol{G}} }
\newcommand{\Hv}[0]{\ensuremath{\boldsymbol{H}} }
\newcommand{\Iv}[0]{\ensuremath{\boldsymbol{I}} }
\newcommand{\Jv}[0]{\ensuremath{\boldsymbol{J}} }
\newcommand{\Kv}[0]{\ensuremath{\boldsymbol{K}} }
\newcommand{\Lv}[0]{\ensuremath{\boldsymbol{L}} }
\newcommand{\Mv}[0]{\ensuremath{\boldsymbol{M}} }
\newcommand{\Nv}[0]{\ensuremath{\boldsymbol{N}} }
\newcommand{\Ov}[0]{\ensuremath{\boldsymbol{O}} }
\newcommand{\Pv}[0]{\ensuremath{\boldsymbol{P}} }
\newcommand{\Qv}[0]{\ensuremath{\boldsymbol{Q}} }
\newcommand{\Rv}[0]{\ensuremath{\boldsymbol{R}} }
\newcommand{\Sv}[0]{\ensuremath{\boldsymbol{S}} }
\newcommand{\Tv}[0]{\ensuremath{\boldsymbol{T}} }
\newcommand{\Uv}[0]{\ensuremath{\boldsymbol{U}} }
\newcommand{\Vv}[0]{\ensuremath{\boldsymbol{V}} }
\newcommand{\Wv}[0]{\ensuremath{\boldsymbol{W}} }
\newcommand{\Xv}[0]{\ensuremath{\boldsymbol{X}} }
\newcommand{\Yv}[0]{\ensuremath{\boldsymbol{Y}} }
\newcommand{\Zv}[0]{\ensuremath{\boldsymbol{Z}} }
\newcommand{\Gammamat}[0]{\ensuremath{\boldsymbol{\Gamma}} }
\newcommand{\Deltamat}[0]{\ensuremath{\boldsymbol{\Delta}} }
\newcommand{\Thetamat}[0]{\ensuremath{\boldsymbol{\Theta}} }
\newcommand{\Lambdamat}[0]{\ensuremath{\boldsymbol{\Lambda}} }
\newcommand{\Ximat}[0]{\ensuremath{\boldsymbol{\Xi}} }
\newcommand{\Pimat}[0]{\ensuremath{\boldsymbol{\Pi}} }
\newcommand{\Sigmamat}[0]{\ensuremath{\boldsymbol{\Sigma}} }
\newcommand{\Upsilonmat}[0]{\ensuremath{\boldsymbol{\Upsilon}} }
\newcommand{\Phimat}[0]{\ensuremath{\boldsymbol{\Phi}}}
\newcommand{\Psimat}[0]{\ensuremath{\boldsymbol{\Psi}} }
\newcommand{\Omegamat}[0]{\ensuremath{\boldsymbol{\Omega}}}
\newcommand{\alphav}[0]{\ensuremath{\boldsymbol{\alpha}} }
\newcommand{\betav}[0]{\ensuremath{\boldsymbol{\beta}} }
\newcommand{\gammav}[0]{\ensuremath{\boldsymbol{\gamma}} }
\newcommand{\deltav}[0]{\ensuremath{\boldsymbol{\delta}} }
\newcommand{\epsilonv}[0]{\ensuremath{\boldsymbol{\epsilon}} }
\newcommand{\zetav}[0]{\ensuremath{\boldsymbol{\zeta}} }
\newcommand{\etav}[0]{\ensuremath{\boldsymbol{\eta}} }
\newcommand{\thetav}[0]{\ensuremath{\boldsymbol{\theta}} }
\newcommand{\iotav}[0]{\ensuremath{\boldsymbol{\iota}} }
\newcommand{\kappav}[0]{\ensuremath{\boldsymbol{\kappa}} }
\newcommand{\lambdav}[0]{\ensuremath{\boldsymbol{\lambda}} }
\newcommand{\muv}[0]{\ensuremath{\boldsymbol{\mu}} }
\newcommand{\nuv}[0]{\ensuremath{\boldsymbol{\nu}} }
\newcommand{\xiv}[0]{\ensuremath{\boldsymbol{\xi}} }
\newcommand{\omicronv}[0]{\ensuremath{\boldsymbol{\omicron}} }
\newcommand{\piv}[0]{\ensuremath{\boldsymbol{\pi}} }
\newcommand{\rhov}[0]{\ensuremath{\boldsymbol{\rho}} }
\newcommand{\sigmav}[0]{\ensuremath{\boldsymbol{\sigma}} }
\newcommand{\tauv}[0]{\ensuremath{\boldsymbol{\tau}} }
\newcommand{\upsilonv}[0]{\ensuremath{\boldsymbol{\upsilon}} }
\newcommand{\phiv}[0]{\ensuremath{\boldsymbol{\phi}} }
\newcommand{\chiv}[0]{\ensuremath{\boldsymbol{\chi}} }
\newcommand{\psiv}[0]{\ensuremath{\boldsymbol{\psi}} }
\newcommand{\omegav}[0]{\ensuremath{\boldsymbol{\omega}} }
\newcommand{\varepsilonv}[0]{\ensuremath{\boldsymbol{\varepsilon}} }
\newcommand{\varthetav}[0]{\ensuremath{\boldsymbol{\vartheta}} }
\newcommand{\varpiv}[0]{\ensuremath{\boldsymbol{\varpi}} }
\newcommand{\varrhov}[0]{\ensuremath{\boldsymbol{\varrho}} }
\newcommand{\varsigmav}[0]{\ensuremath{\boldsymbol{\varsigma}} }
\newcommand{\varphiv}[0]{\ensuremath{\boldsymbol{\varphi}} }
\newcommand{\cdotv}[0]{\ensuremath{\boldsymbol{\cdot}}}
\newcommand{\E}[0]{\ensuremath{\mathbb{E}}}

\newcommand{\mc}{\multicolumn}
\newcommand{\mr}{\multirow}
\newcommand{\nts}{\negthickspace}

\newcommand{\given}{\,|\,}
\def\rr{\textcolor{red}}
\def\bb{\textcolor{blue}}
\def\re{\textcolor{black}}
\def\mz#1{{\color{orange}{\bf [mz:} {\it{#1}}{\bf ]}}}


\title{Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Xinyang Liu\thanks{Equal contribution: \{xinyangatk, wds\}@stu.xidian.edu.cn}}
\author[1]{Dongsheng Wang$^*$}
\author[1]{Bowei Fang}
\author[1]{Miaoge Li}
\author[1]{Yishi Xu}
\author[1]{Zhibin Duan}
\author[1]{Bo Chen\thanks{Corresponding author: bchen@mail.xidian.edu.cn}}
\author[2]{Mingyuan Zhou}
% Add affiliations after the authors
\affil[1]{%
National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, 710071, China.
}
\affil[2]{%
McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA
}

  \begin{document}
\maketitle

\begin{abstract}
  For downstream applications of vision-language pre-trained models, there has been significant interest in constructing effective prompts. Existing works on prompt engineering, which either require laborious manual designs or cast prompt tuning as a point estimation problem, may fail to describe diverse characteristics of categories, which limits their applicability. We introduce a Bayesian probabilistic solution to prompt tuning, where the label-specific stochastic prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model. Importantly, we semantically regularize the tuning process by minimizing the statistical distance between the visual patches and linguistic prompts, which pushes the stochastic label representations to faithfully capture diverse visual concepts, instead of overfitting to the training categories. We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts. Extensive results over 15 datasets show promising transferability and generalization performance of our proposed model, both quantitatively and qualitatively.
\end{abstract}

\section{Introduction}\label{sec:intro}
Large-scale vision-language pre-trained models (VLPs) have recently demonstrated impressive achievements on various computer vision tasks~\citep{wang2021simvlm,jia2021scaling,cho2021unifying,radford2021learning,li2022blip}. {Pre-trained on web-scale image-text pairs, such VLPs capture semantic knowledge about which visual concepts correspond to which textual sequences and vice versa, which has proven beneficial for visual understanding \citep{radford2021learning,mei2022guest,du2022learning}.} This has motivated the rapid rise of \textit{prompt tuning}, which adapts VLPs by formalizing the downstream tasks as language modeling problems and optimizing only the text inputs (prompts)~\citep{radford2021learning,zhou2022conditional,zhou2022learning}, such as ``\textit X \textit X \textit X \textit X \textit{\{class\}}.'', where ``\textit X'' and ``\textit{\{class\}}'' denote the prefix tokens and real class names, respectively. In contrast to supervised learning with discrete labels from a closed set of categories, prompt tuning draws on knowledge from pre-trained language models and supports open-set visual concepts, often producing better performance, especially on few/zero-shot tasks~\citep{zhou2022conditional,ppt}.

To specify the optimal prefix tokens ``\textit X'' that provide rich context for pre-trained language models, prompt tuning methods often optimize them as learnable embedding vectors with a task-specific loss. For example, CoOp~\citep{zhou2022learning} employs the cross-entropy loss to learn 16 prefix tokens that are shared across all categories and finds that such a data-driven paradigm achieves significant improvement over hand-crafted prompts. However, recent studies report that overfitting occurs during training and often leads to poor generalizability and transferability~\citep{zhu2022prompt,ma2022understanding,lu2022prompt}. To this end, various techniques have been introduced under different assumptions, including conventional anti-overfitting tricks, instance-specific prompt generation, and gradient flow~\citep{gao2021clip,zhou2022conditional,ma2022understanding,zhu2022prompt}. Another concern stems from deterministic prompt learning, where the prompts are learned as point estimates and only a single sentence is searched to represent a given class. Intuitively, one class can be characterized by multiple intrinsic attributes (see Fig.~\ref{bpl} for an example). Thus, it is critical to learn multiple prompts that focus on different concepts. Motivated by this, several previous works attempt to learn multiple prompts~\citep{chen2022prompt} or introduce distributed prompt embeddings~\citep{derakhshani2022variational,lu2022prompt,wang2023improving}, showing large improvements over the baseline method. However, those models either require pre-defined prompts or focus on sample-dependent prompt generation, failing to discover label-specific prompts efficiently.


\begin{figure}[!t]
\centering
\includegraphics[width=1\linewidth]{fig/motivation1.pdf}
\vspace{-4mm}
\caption{\small{The motivation of the proposed model. Multiple prompts are generated from the label-specific distributions.}}
\vspace{-6mm}
\label{bpl}
\end{figure}


To address the above shortcomings, we propose Bayesian prompt tuning, where label-specific stochastic prompts are generated hierarchically under the Bayesian framework.
As illustrated in Fig.~\ref{bpl}, one of the core ideas is to generate multiple prompts for the given categories, with each learned prompt capturing different visual attributes, resulting in diverse and generalizable prompt discovery. Specifically, we first introduce uncertainty in the latent embedding space and model each category as a variational distribution~\citep{vae}.
Compared to previous point estimation methods, this approach enables us to infer a posterior distribution that contains meta-information about the corresponding category, offering advantages in modeling uncertainty and highly structured data~\citep{fan2020bayesian}.
To complete the prompt sentence, a sequence generation module is employed to generate the prefix sequence according to the meta-vector sampled from the underlying distribution. Note that various language models can be chosen as the generator, e.g., the LSTM~\citep{hochreiter1997long} and transformers~\citep{al2019character}. Although the generator itself is a deterministic mapping, the output prompts can be viewed as an implicit distribution in the embedding space due to its stochastic inputs. This property allows our proposed model to naturally handle diverse visual concepts, resulting in robust prompt tuning.

Furthermore, to tackle the issue of overfitting in prompt tuning, we propose a novel semantic regularization approach that leverages the conditional transport (CT) framework~\citep{zheng2021exploiting} to establish a relationship between visual patches and textual prompts. Specifically, we use the modality-specific outputs of CLIP to construct a visual patch set as well as a textual prompt set for each target image. The former is obtained by collecting the image patch embeddings and the latter is constructed from all label embeddings. Because CLIP maps both modalities into a shared embedding space, these two sets can be regarded as two discrete distributions over the same semantic space. They carry similar meanings about the target image but come from different modalities. Therefore, prompt tuning can be viewed as the process of learning the distribution of textual prompts to be as close to the distribution of visual patches as possible. Fortunately, recent developments in CT provide an efficient tool to quantify the difference between two discrete distributions~\citep{tanwisuth2021prototype,wangwete,tanwisuth2023pouf}. Importantly, the distance function in CT specifies the similarities between the prompt embeddings and visual patches in the embedding space, which makes it possible to regularize the learning of prompts with visual guidance. As a result, the aligned prompts are encouraged to capture the true label-specific visual concepts, rather than overfitting to the training set.

The main contributions of this paper are summarized as follows:
\begin{itemize}
    \item We propose Bayesian prompt tuning that generates label-specific stochastic prompts hierarchically, models each label as a distribution over the embedding space and successfully handles diverse visual concepts.
    \item To avoid over-fitting to the training set, we introduce the CT distance as a regularization that guides the learning of prompts with visual knowledge by aligning the patches and prompt embeddings semantically. 
    \item We formulate the proposed model as a variational inference problem, and a combined loss function is derived to optimize all parameters efficiently. Extensive experiments show that our models outperform the baselines.
\end{itemize}

\section{The Proposed Method}
An overview of our proposed \textbf{P}atch-prompt aligned \textbf{B}ayesian prompt tuning (PBPrompt)
is shown in Fig.~\ref{model_overview}. Below, we first briefly review CoOp, the baseline on which this paper builds. Then, we introduce the details of our model, which aims to improve the diversity and generalizability of CoOp.
\begin{figure*}[!t]
\centering
\includegraphics[width=0.87\linewidth]{fig/PBPrompt_AC.pdf}
\caption{\small{Overview of the proposed PBPrompt. PBPrompt generates the stochastic prompts by first sampling a label-specific vector $\rv_c$ and then employing a single-layer self-attention generator. The CT distance is computed between the textual prompts and image patches to regularize the prompts with visual knowledge.}}
\vspace{-5mm}
\label{model_overview}
\end{figure*}

\subsection{Review of CoOp}
Context Optimization (CoOp)~\citep{zhou2022learning} is built on CLIP-like VLPs and is a pioneering method for continuous prompt tuning.
A VLP often consists of an image encoder $f$ and a text encoder $g$, each taking a modality-specific sequence as input and outputting a $d$-dimensional vector in the shared embedding space. Prompt tuning methods usually design a template to construct the category descriptions and then use the outputs of $g$ as the class weights for the classification task.
To address the limitations of handcrafted templates and facilitate the learning of optimal prompts for adapting VLPs to downstream tasks, CoOp models each prompt token as a continuous vector that can be learned from data. \textit{E.g.}, the prompt for the $c$-th class can be denoted as $\tv_c = [\vv_1, \vv_2, \ldots, \vv_{b}, \ev_c]$, where $\ev_c$ is the label embedding of class $c$ and $\vv = \{ \vv_i \in \mathbb{R}^{d} \}_{i=1}^{b}$ are $b$ learnable context vectors. Given a set of category descriptions $\{\tv_c\}_{c=1}^C$ and an image $\xv \in \mathbb{R}^{3\times H \times W}$, CoOp models the image label $p(y \given \xv)$ as a categorical distribution according to the similarity between the image and label features:
\begin{equation} \label{likelihood}
    p(y=c \given \xv) = \frac{\exp (\text{sim} (f(\xv), g(\tv_c)) / \tau)}{\sum_{c'=1}^{C} \exp (\text{sim} (f(\xv), g(\tv_{c'})) / \tau)},
\end{equation}
where $\text{sim}(\cdot,\cdot)$ denotes the similarity function, \textit{e.g.}, the cosine similarity, and $\tau$ is the temperature parameter. One can then optimize the prefix embeddings $\vv$ by back-propagating the following loss through the frozen VLP with a few training samples $\mathcal{D}^{\text{tr}}=\{(\xv_i, y_i)\}_{i=1}^{N_{\text{tr}}}$:
\begin{equation}
    \mathcal{L}(\vv) = \mathbb{E}_{\xv_i,y_i}[- \log p(y_i \given \xv_i; \vv)].
\end{equation}
After tuning, $\tv_c$ can be used to define the target classifier for open-set image classification.
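
As a worked illustration of Eq.~\ref{likelihood}, consider $C=3$ classes whose prompt features have cosine similarities $0.30$, $0.25$, and $0.20$ with the image feature $f(\xv)$. These similarities and the two temperatures below are illustrative assumptions rather than measured quantities. With $\tau = 0.01$, the scaled logits are $30$, $25$, and $20$, so that
\begin{equation*}
\begin{aligned}
    p(y=1 \given \xv) &= \frac{1}{1 + e^{-5} + e^{-10}} \approx 0.9933, \\
    p(y=2 \given \xv) &= \frac{e^{-5}}{1 + e^{-5} + e^{-10}} \approx 0.0067, \\
    p(y=3 \given \xv) &= \frac{e^{-10}}{1 + e^{-5} + e^{-10}} \approx 4.5 \times 10^{-5},
\end{aligned}
\end{equation*}
whereas $\tau = 1$ yields the nearly uniform prediction $(0.350, 0.333, 0.317)$. A small temperature thus turns modest similarity gaps into confident predictions, which is why small changes in the prompt features $g(\tv_c)$ can noticeably change the classifier.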

\subsection{Patch-Prompt Aligned Bayesian Prompt Tuning}
The core idea behind the proposed PBPrompt is to learn distributed label-specific prompts under the Bayesian framework, as well as align the image patches and textual prompts by minimizing the CT distance. Below, we introduce the details of PBPrompt, which consists of stochastic prompt generation, patch-prompt alignment, and the training algorithm.

\paragraph{Stochastic Prompts Generation (SPG)}\label{spg}
Generally, representing a class with a single deterministic point is limiting, since it may fail to cover diverse visual concepts, \textit{e.g.}, the object type, size, color, and so on. This issue becomes particularly acute under distribution shifts. For instance, a model may see an image of a dog playing on green grass during training but fail to make a correct prediction for another image of a dog on the beach. To this end, one of the goals of PBPrompt is to introduce uncertainty into prompt generation. For a target label, we assume there are various prompts that can achieve similar performance. These prompts originate from the same target class but depict its representative attributes from different perspectives, resulting in a robust representation. An intuitive approach is to model the prompts as a distribution $p(\rv)$. Unfortunately, directly learning such a distribution over a sequence of $b$ vectors of feature dimension $d$ is not simple~\citep{brown2020language,lu2022prompt}, especially under the few-shot setting.
Instead, we move the uncertainty forward to the generator's inputs and develop a hierarchical generative module to produce the stochastic prompts:
\begin{equation} \label{prefix}
    \tv_c = [\phi(\vv_c \given \rv_c), \ev_c], \quad \rv_c \sim p(\rv_c),
\end{equation}
where $p(\rv_c)$ denotes the label-specific distribution that handles the conceptual diversity of class $c$, and $\phi(\vv_c \given \rv_c)$ denotes the deterministic generative model that takes the sampled $\rv_c$ as input and outputs the prefix token sequence $\vv_c=\{ \vv_{c,l} \in \mathbb{R}^{d} \}_{l=1}^{b}$. Like previous works~\citep{zhou2022learning,zhou2022conditional}, the final prompt input $\tv_c$ is obtained by appending the label embedding $\ev_c$ to the end of the prefix tokens. Unlike previous models that treat $\tv_c$ as learnable embedding vectors, we generate $\tv_c$ via a hierarchical path, where a stochastic vector $\rv_c$ is first sampled from the label-specific distribution and the prefix sequence $\vv_c$ is then generated according to $\rv_c$. Although the generative model $\phi$ is a deterministic network, $\tv_c$ follows an implicit distribution induced by the randomness of $\rv_c$. In this way, multiple prompts can be generated by sampling various $\rv_c$.

Note that $\phi(\vv_c \given \rv_c)$ can be implemented with various language models~\citep{lstm,bert}, and we empirically find that a single-layer self-attention network~\citep{vaswani2017attention} works well in most cases:
\begin{equation} \label{atten}
\begin{aligned} 
    &\sv_c = [\rv_c + \text{PE}_{1}, \wv_1 +
     \text{PE}_2, ..., \wv_{b}+\text{PE}_{b+1}],  \\
     [&\hat{\rv}_c, \vv_{c,1},  ..., \vv_{c,b}] = \phi(\vv_c|\rv_c) := \text{Self-Attn}(\sv_c),
\end{aligned}
\end{equation}
where $\wv=[\wv_1, \ldots, \wv_{b}]$ denotes the initialized prefix embeddings, and $\text{PE}$ is the learnable position embedding matrix that captures the sequential relations of the prefix tokens. The Self-Attn decoder takes $\sv_c$ as input, where the sampled $\rv_c$ in Eq.~\ref{prefix} is viewed as a special label token placed at the beginning of the initialized prefix sequence. It then outputs the class-specific prefix sequence $\vv_c=[\hat{\rv}_c, \vv_{c,1}, \ldots, \vv_{c,b}]$. This process allows the output tokens to encompass both contextual information and class-specific guidance, resulting in the generation of meaningful prompts.
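
For concreteness, one minimal instantiation of the Self-Attn layer in Eq.~\ref{atten} is single-head scaled dot-product attention~\citep{vaswani2017attention}. The sketch below assumes no residual connection, layer normalization, or feed-forward sublayer, and the projection matrices $\Wmat_Q, \Wmat_K, \Wmat_V \in \mathbb{R}^{d \times d}$ are hypothetical names introduced only for illustration. Stacking the $b+1$ input tokens of $\sv_c$ as the rows of $\Smat_c \in \mathbb{R}^{(b+1) \times d}$, we have
\begin{equation*}
    \text{Self-Attn}(\sv_c) = \text{softmax}\left( \frac{\Smat_c \Wmat_Q (\Smat_c \Wmat_K)^T}{\sqrt{d}} \right) \Smat_c \Wmat_V,
\end{equation*}
where the softmax is applied row-wise. The first output row is $\hat{\rv}_c$ and the remaining $b$ rows are $\vv_{c,1}, \ldots, \vv_{c,b}$. Each output row is a convex combination of the value vectors, including the one computed from $\rv_c + \text{PE}_1$, which is how the sampled class information can reach every prefix token.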

\paragraph{Regularization Between Textual Prompts and Visual Patches}
Notably, the core motivation behind SPG is to learn diverse prompts that cover multiple visual concepts. However, directly optimizing SPG with the classification loss may suffer from the mode-collapse problem, where the sampled $\rv_c$ tend to be close to each other, leading to single-mode prompt tuning. For example, the learned prompt pattern may overfit the training set while failing to provide the true context. To address this issue, we introduce a regularization between the prompt outputs and image patches. This regularization encourages the sampled prompts to be close to a variety of patch embeddings, preventing them from overfitting to the training mode.

Recall that a VLP describes target labels from both the image and text domains. The image encoder divides an image $\xv$ into $M$ patches with embeddings $\uv=\{ \uv_m \}_{m=1}^M \in \mathbb{R}^{d \times M}$, which provide the local visual features. We view the output embeddings of the text encoder as the class-specific features, which provide the linguistic descriptions of the classes. Mathematically, given $\xv$ and its prediction probability $\pv=p(\yv|\xv)$, we formulate those two sets as discrete distributions:
\begin{equation} \label{pq}
    P = \sum_{m=1}^M \frac{1}{M} \delta_{\uv_m}, \quad 
    Q = \sum_{c=1}^C p_c \delta_{\gv_c}
\end{equation}
where $\delta$ is the Dirac delta function and $\gv_c=g(\tv_c)$ is the textual output of label $c$. Eq.~\ref{pq} represents $\xv$ as a mixture of patch embeddings $P$ and a mixture of prompt embeddings $Q$, {both sharing the same semantics but originating from different domains. }Naturally, we aim to regularize the learning of $Q$ by aligning it to $P$. A common choice is to minimize the optimal transport (OT) distance between $P$ and $Q$~\citep{cuturi2013sinkhorn,chen2022prompt}. However, computing OT requires a two-stage iteration, first solving for the transport plan and then updating the network, leading to unstable training. Fortunately, the recently developed conditional transport (CT)~\citep{zheng2021exploiting} offers an efficient tool to align two distributions over different supports~\citep{wangwete,tanwisuth2021prototype}. The CT distance between the textual prompts and visual patches is defined from two directions:
\begin{equation} \label{ct}
    \mathcal{L}_{CT}(P,Q) = \mathcal{L}_{\uv \rightarrow \gv} + \mathcal{L}_{\gv \rightarrow \uv},
\end{equation}
where $\mathcal{L}_{\uv \rightarrow \gv}$ denotes {the transport distance} from patch embeddings to prompts, while $\mathcal{L}_{\gv \rightarrow \uv}$ denotes {the transport distance} in the reverse direction. {The transport distance from patch embeddings to prompts} can be calculated as:
\begin{equation} \label{ug}
    \mathcal{L}_{\uv \rightarrow \gv} = \frac{1}{M} \sum_{m=1}^M \sum_{c=1}^C  \mathcal{C}(\uv_m, \gv_c) \pi(\gv_c | \uv_m),
\end{equation}
where $\mathcal{C}(\uv_m, \gv_c)$ is the cost function that measures the point-wise transport cost from the $m$-th patch to the $c$-th prompt embedding, \textit{e.g.}, $\mathcal{C}(\uv_m, \gv_c) = 1 - \text{cosine}(\uv_m, \gv_c)$, and $\pi(\gv_c | \uv_m) = \frac{p_c \text{exp}(\uv_m^T \gv_c)}{\sum_{c'=1}^C p_{c'} \text{exp}(\uv_m^T \gv_{c'})}$ is the transport plan. The core idea of Eq.~\ref{ug} is to assign the $M$ patches to their expected prompts. This can be viewed as a clustering process that learns a semantic center for each class-specific prompt. Unfortunately, with $\mathcal{L}_{\uv \rightarrow \gv}$ alone, many less-related patches within an image may be assigned to the target prompt. This may push the stochastic prompt to an average point, leading to mode collapse. To address this issue, CT introduces $\mathcal{L}_{\gv \rightarrow \uv}$ from the opposite direction:
\begin{equation} \label{gu}
    \mathcal{L}_{\gv \rightarrow \uv} = {\sum_{c=1}^C p_c} \sum_{m=1}^M  \mathcal{C}(\gv_c, \uv_m) \pi(\uv_m | \gv_c),
\end{equation}
where $\pi(\uv_m | \gv_c)=\frac{ \text{exp}(\gv_c^T \uv_m )}{\sum_{m'=1}^M \text{exp}(\gv_{c}^T \uv_{m'})}$. Unlike $\mathcal{L}_{\uv \rightarrow \gv}$, which has a patch-clustering effect, $\mathcal{L}_{\gv \rightarrow \uv}$ aims to push the expected prompt towards the patches that are semantically close to it, creating a prompt-covering effect. The CT distance in Eq.~\ref{ct} provides us with a novel regularization, enabling the learning of stochastic prompts with visual knowledge from both directions. The \textit{patch-to-prompt} transportation explores meaningful prompt outputs, and the \textit{prompt-to-patch} transportation improves the uncertainty of the prompt outputs.
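
Two quick sanity checks help interpret Eq.~\ref{ct}; both follow directly from the definitions of the two transport plans. First, the weights $\frac{1}{M}\pi(\gv_c | \uv_m)$ and $p_c\, \pi(\uv_m | \gv_c)$ each sum to one over all pairs $(m, c)$, so both directional terms are convex combinations of point-wise costs. With the cosine cost, $\mathcal{C} \in [0, 2]$ and hence $\mathcal{L}_{CT}(P,Q) \in [0, 4]$. Second, in the limiting case of a one-hot prediction $p_k = 1$ (an idealized assumption used only for illustration), the plan $\pi(\gv_c | \uv_m)$ puts all of its mass on $c = k$ and
\begin{equation*}
\begin{aligned}
    \mathcal{L}_{\uv \rightarrow \gv} &= \frac{1}{M} \sum_{m=1}^M \mathcal{C}(\uv_m, \gv_k), \\
    \mathcal{L}_{\gv \rightarrow \uv} &= \sum_{m=1}^M \mathcal{C}(\gv_k, \uv_m) \frac{\text{exp}(\gv_k^T \uv_m)}{\sum_{m'=1}^M \text{exp}(\gv_k^T \uv_{m'})}.
\end{aligned}
\end{equation*}
The first term averages the cost uniformly over all patches, including background patches, whereas the second weights each patch by its softmax affinity to the prompt and therefore concentrates on the most related patches.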

\subsection{Training With Combined ELBO} \label{elbo}
Given the VLPs and labeled images $\mathcal{D}^{\text{tr}}$, we would like to distill the pre-trained knowledge and learn the posterior of the label-specific representation $p(\rv_c |\mathcal{D}^{\text{tr}})$ as well as the deterministic generative model $\phi(\vv_c | \rv_c)$. Unfortunately, the exact posterior for $\rv_c$ is intractable and needs to be approximated. 
To this end, we define the variational distribution $q(\rv_c|c)$ and employ variational inference to optimize the proposed method by minimizing the following combined objective, which adds the CT regularizer to the negative Evidence Lower BOund (ELBO)~\citep{vae}:
\begin{equation} \label{loss}
\begin{aligned}
    \mathcal{L} = -\mathbb{E}_{\tv_c=[\phi(\vv_c|\rv_c),\ev_c], \rv_c \sim q(\rv_c|c)} \log p(y|\xv, \tv_c) \\
    + \text{D}_{\text{KL}}[q(\rv_c|c) || p(\rv_c)] 
    + \eta \mathcal{L}_{CT}(P, Q),
\end{aligned}
\end{equation}
where we follow previous practices~\citep{gordon2018metalearning,derakhshani2022variational} and define the variational distribution $q$ as a Gaussian distribution conditioned on the label embedding $\ev_c$: $q(\rv_c | c) = \mathcal{N}(\mu(\ev_c), \Sigma(\ev_c))$, with $\mu$ and $\Sigma$ parameterized by two fully-connected layers. The first term in Eq.~\ref{loss} is the expected log-likelihood defined in Eq.~\ref{likelihood}, the second term is the KL-divergence {that encourages the variational posterior to stay close to its prior}, and the last term is the CT distance that aligns the class-specific prompt with image patches. $\eta$ denotes the trade-off hyperparameter that controls the regularization weight. Unlike most previous works {that learn prompts solely from a task-specific loss~\citep{zhou2022learning,lu2022prompt}, we optimize the proposed PBPrompt with a combined ELBO that introduces the CT distance as a regularizer, guiding} the label embeddings to focus on meaningful visual concepts rather than over-fitting to the base sets. We summarize the training procedure in Algorithm~\ref{alg} in the appendix.
\vspace{-2mm}
\paragraph{Contextual Prior $p(\rv_c)$} Instead of treating the prior as a fixed distribution independent of the label $c$, we define label-specific priors to further explore label semantics via the label embeddings, \textit{e.g.}, $p(\rv_c) = \mathcal{N}(\ev_c, I)$. Thus, compared to a fixed prior, the proposed label-specific prior introduces additional label semantics and provides better prior guidance.
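
For reference, the KL term in Eq.~\ref{loss} admits a standard closed form under this choice of prior. The expression below is a sketch under an additional assumption that is not stated above, namely that the covariance is diagonal, $\Sigma(\ev_c) = \text{diag}(\sigma_{c,1}^2, \dots, \sigma_{c,D}^2)$; here $D$ denotes the embedding dimension, and $\mu_{c,d}$ and $e_{c,d}$ are our shorthand for the $d$-th entries of $\mu(\ev_c)$ and $\ev_c$:
\begin{equation*}
\begin{aligned}
    &\text{D}_{\text{KL}}[q(\rv_c|c) || p(\rv_c)] \\
    &= \frac{1}{2} \sum_{d=1}^{D} \left( \sigma_{c,d}^2 + (\mu_{c,d} - e_{c,d})^2 - 1 - \log \sigma_{c,d}^2 \right).
\end{aligned}
\end{equation*}
The term vanishes exactly when $\mu(\ev_c) = \ev_c$ and $\sigma_{c,d}^2 = 1$ for all $d$, so it pulls the posterior mean toward the label embedding rather than toward the origin, as a fixed standard-normal prior would.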

\section{Related Work}
\vspace{-4mm}
The technique of prompt tuning, which originates from the natural language processing (NLP) domain and aims at making the best use of pre-trained language models~\citep{brown2020language,autoprompt,liu2023pre}, has gained increasing research attention in VLPs due to its impressive results~\citep{ge2022domain,sun2022dualcoop,feng2022promptdet}. For example, CLIP~\citep{radford2021learning}
manually designs templates based on human knowledge and shows great potential in few/zero-shot tasks. Context Optimization (CoOp)~\citep{zhou2022learning} first introduces the continuous prompt into VLPs and views the prompt tokens as a set of learnable vectors that can be optimized by minimizing the cross-entropy loss. Instead of learning static prompts, Conditional CoOp (CoCoOp)~\citep{zhou2022conditional} learns an input-specific prompt by incorporating image features via a lightweight network and shows better generalization on unseen categories. The works most related to ours are distribution-based prompt tuning methods, which learn stochastic prompts. For instance, Prompt Distribution leArning (ProDA)~\citep{lu2022prompt} first designs multiple handcrafted templates and then employs a Gaussian distribution to model the latent representation. Variational prompt tuning (VPT)~\citep{derakhshani2022variational} constructs prompt tokens by directly adding Gaussian samples to the prompt vectors. \revision{SyntHesIzed Prompt (SHIP)~\citep{wang2023improving} samples an image-dependent prompt by training a VAE on the image features.}
Prompt learning with optimal transport (PLOT)~\citep{chen2022prompt} applies optimal transport theory to learn multiple local prompts. {While all of the above methods (ProDA, VPT, SHIP, and PLOT) and ours involve learning stochastic prompts, they are fundamentally distinct}. We model each target label as a Gaussian distribution and then generate stochastic prompts based on label-specific samples, resulting in better label representations.
% For a detailed discussion on the differences between these baselines and our proposed method, please refer to the appendix.

\section{Experiments}
We follow the exact experimental setup of previous works~\citep{zhou2022learning,zhou2022conditional} and validate the performance of PBPrompt against the recent state-of-the-art prompt learning models on widely-used benchmarks under various settings, including few-shot learning, base-to-new generalization, cross-dataset transferability, and domain generalization.


\subsection{Experimental Setup}
\textbf{Datasets.}
For the first two tasks, we rely on 11 classification datasets, \textit{i.e.}, ImageNet~\citep{deng2009imagenet} and Caltech101~\citep{fei2004learning} for generic object classification, OxfordPets~\citep{parkhi2012cats}, StanfordCars~\citep{krause20133d}, Flowers102~\citep{nilsback2008automated}, Food101~\citep{bossard2014food} and FGVCAircraft~\citep{maji2013fine} for fine-grained image recognition, EuroSAT~\citep{helber2019eurosat} for satellite image classification, UCF101~\citep{soomro2012ucf101} for action classification, DTD~\citep{cimpoi2014describing} for texture classification, and SUN397~\citep{xiao2010sun} for scene recognition. For the domain generalization task, we use ImageNet as the source domain dataset and evaluate performance on ImageNetV2~\citep{recht2019imagenet}, ImageNet-Sketch~\citep{wang2019learning}, ImageNet-A~\citep{hendrycks2021natural}, and ImageNet-R~\citep{hendrycks2021many}. The details of each dataset are provided in Table~\ref{tab: statistics}.

\textbf{Baselines.}
We compare our proposed approach with the following state-of-the-art (SOTA) models: zero-shot CLIP~\citep{radford2021learning} with the fixed handcrafted prompt \textit{``A photo of a \{class\}.''}, \revision{CoOp~\citep{zhou2022learning}, CoCoOp~\citep{zhou2022conditional}, PLOT~\citep{chen2022prompt}, and stochastic prompt tuning methods, including ProDA~\citep{lu2022prompt}, VPT~\citep{derakhshani2022variational} and SHIP~\citep{wang2023improving}}.

\begin{figure*}
    \centering
    \includegraphics[width=0.85\linewidth]{fig/fsl.png}
    \caption{\small{The few-shot learning results on 11 datasets. We compare our PBPrompt with CoOp, CoCoOp, and PLOT. Overall, our proposed model outperforms the baselines in most cases. More numerical results can be found in Table~\ref{tab: vit_fsl} and Table~\ref{tab: rn50_fsl}.}}
    \label{fig:fsl}
    \vspace{-3mm}
\end{figure*}


\textbf{Implementation Details.}
Similar to previous works~\citep{zhou2022learning, zhou2022conditional},
PBPrompt adopts ViT-B/16~\citep{dosovitskiy2020image} as the vision encoder and a Transformer~\citep{vaswani2017attention} as the language encoder.
We consistently perform prompt tuning with 16 shots and fix the prompt length to 4 for the four primary image classification tasks across all datasets.
We set the trade-off hyperparameter $\eta$ to 0.01 and run each experiment for 10 epochs on base-to-new generalization.
The label embedding $\ev_c$ is obtained by averaging the CLIP embeddings of the class names, and we initialize the learnable prompt embedding vectors from $\mathcal{N}(0, 0.02)$. For the self-attention network in \eqref{atten}, we employ 8 heads for deeper interactions between prompt tokens. We summarize the training details in the appendix.
The results for CoOp and CoCoOp are adopted from the published papers, except for the few-shot learning experiments, which we re-ran under the same settings, with the maximum number of epochs set to 200 for 16/8 shots, 100 for 4/2 shots, and 50 for 1 shot across all datasets. For a fair comparison, we also re-ran PLOT with ViT-B/16 on all experiments under the settings above. All results are reported as the mean over three seeds.
\vspace{-3mm}
\subsection{Experiment Results}
\paragraph{Few-shot Learning} evaluates a model's capability to learn from a limited number of labeled samples. The complete results are summarized in Fig.~\ref{fig:fsl}, where we find that 1) our method consistently outperforms the baseline models across various scenarios, and 2) PBPrompt outperforms other methods when trained with 1, 2, and 4 shots, showing a substantial performance margin on the DTD, EuroSAT, Flowers102, and Food101 datasets. Furthermore, as the number of training samples increases, the performance gap between models diminishes, which is particularly evident with 8/16 shots. This highlights the advantage of our model when labeled data is scarce. Notably, PBPrompt surpasses CoOp with average accuracy gains of 3.14\%, 2.32\%, 6.33\%, 1.24\%, and 0.32\% at 1, 2, 4, 8, and 16 shots, respectively.

\begin{table*}[!th]
\centering
    \scalebox{0.70}{
    \begin{tabular}{lcccccccccccc}
    \toprule[1.5pt]
    \textbf{} &\multicolumn{1}{c}{Source} &\multicolumn{11}{c}{Target} \\
    \cmidrule(lr){2-2}\cmidrule(lr){3-13}
    \textbf{Method}
    &\rotatebox{90}{\textbf{ImageNet}} 
    &\rotatebox{90}{\textbf{Caltech}}
    &\rotatebox{90}{\textbf{Pets}}
    &\rotatebox{90}{\textbf{Cars}}
    &\rotatebox{90}{\textbf{Flowers}}
    &\rotatebox{90}{\textbf{Food}}
    &\rotatebox{90}{\textbf{Aircraft}}
    &\rotatebox{90}{\textbf{SUN}}
    &\rotatebox{90}{\textbf{DTD}}
    &\rotatebox{90}{\textbf{EuroSAT}}
    &\rotatebox{90}{\textbf{UCF}}
    &\rotatebox{90}{\textbf{Average}}\\
    \midrule
    CoOp         & 71.51 & 93.70 & 89.14 & 65.41 & 68.71 & 85.30 & 18.47 & 64.15 & 41.92 & 46.39 & 66.55 & 63.81 \\
    CoCoOp       & 71.02 & 94.43 & 90.14 & 65.32 & 71.88 & 86.06 & 22.94 & 67.36 & \textbf{45.73} & 45.37 & 68.21 & 65.74 \\
    % PLOT \\
    % CoOp + VPT   & 69.73 & 93.67 & 89.27 & 65.50 & 70.20 & 86.27 & 22.13 & 66.57 & 46.93 & 47.43 & 67.21 & 65.51 \\
    % \rowcolor{gray!25}
    % PBPrompt     & 71.80 & 94.43 & 90.62 & 64.81 & 70.40 & 86.13 & 23.95 & 67.41 & 45.62 & 46.20 & 67.47 & 65.70 \\
    PBPrompt     & \textbf{71.71} & \textbf{94.87} & \textbf{90.62} & \textbf{66.00} & \textbf{72.44} & \textbf{86.34} & \textbf{24.82} & \textbf{67.69} & 45.62 & \textbf{47.13} & \textbf{68.83} & \textbf{66.40} \\
    \midrule
    % \rowcolor{gray!25}
    % $\Delta$    & \color{teal}{\bm{$-0.61$}} 
    %             & \color{orange}{\bm{$+0.73$}} & \color{orange}{\bm{$+1.48$}} & \color{teal}{\bm{$-0.60$}} & \color{orange}{\bm{$+1.69$}} & \color{orange}{\bm{$+0.83$}} & \color{orange}{\bm{$+5.48$}} & \color{orange}{\bm{$+3.26$}} & \color{orange}{\bm{$+3.70$}} & \color{teal}{\bm{$-0.19$}} & \color{orange}{\bm{$+0.92$}} & \color{orange}{\bm{$+1.29$}} \\
    % \midrule
    % $\Delta$    & \color{teal}{\bm{$-0.61$}} 
    %         & \color{orange}{\bm{$+0.73$}} & \color{orange}{\bm{$+1.48$}} & \color{teal}{\bm{$-0.60$}} & \color{orange}{\bm{$+1.69$}} & \color{orange}{\bm{$+0.83$}} & \color{orange}{\bm{$+5.48$}} & \color{orange}{\bm{$+3.26$}} & \color{orange}{\bm{$+3.70$}} & \color{teal}{\bm{$-0.19$}} & \color{orange}{\bm{$+0.92$}} & \color{orange}{\bm{$+1.29$}} \\
    % \midrule
    % CoCoOp + VPT & 70.70 & 93.67 & 90.63 & 65.00 & 70.90 & 86.30 & 24.93 & 67.47 & 46.10 & 45.87 & 68.67 & 65.95 \\
    % BPL          & \bb{70.90} & \bb{94.40} & \bb{90.90} & \bb{} & \bb{70.40} & \bb{} & \bb{23.90} & - & \bb{45.60} & - & \bb{68.40} & - \\
    % \rowcolor{gray!25}
    % CPBPrompt   & 70.94 & 94.92 & 90.83 & 65.34 & 72.37 & 86.41 & 24.58 & 67.75 & 45.23 & 45.10 & 68.78 & 66.13 \\
    % \midrule
    $\Delta$     & \color{orange}{\bm{$+0.69$}} & \color{orange}{\bm{$+0.44$}} & \color{orange}{\bm{$+0.48$}} & \color{orange}{\bm{$+0.68$}} & \color{orange}{\bm{$+0.56$}} & \color{orange}{\bm{$+0.28$}} & \color{orange}{\bm{$+2.90$}} & \color{orange}{\bm{$+0.33$}} & \color{teal}{\bm{$-0.11$}} & \color{orange}{\bm{$+1.76$}} & \color{orange}{\bm{$+0.62$}} & \color{orange}{\bm{$+0.66$}} \\
    \bottomrule[1.5pt]
    \end{tabular}}
    \caption{\small{Cross-dataset transfer learning accuracy results of various baselines on source and target datasets.
$\Delta$: The improvements of the proposed model compared to CoCoOp.}}
\label{tab: cross_dataset}
\end{table*}

\paragraph{Base-to-New Generalization} \revision{assesses a model's generalizability in a zero-shot setting. We report the base-to-new results in Fig.~\ref{fig: base2new} (the detailed accuracy on the base and new sets can be found in Table~\ref{tab: vit_b2n}). Note that the H score is the harmonic mean of the base and new accuracies, $\text{H} = (2\times \text{Base} \times \text{New}) / (\text{Base} + \text{New})$, which serves as a trade-off metric between the base and new sets. We find that PBPrompt surpasses other stochastic baselines in terms of H score across all datasets. This demonstrates the effectiveness of the introduced label-specific SPG. Besides, owing to the CT regularization, our approach mitigates the overfitting issue, showing a robust ability to balance the base and new performance.}
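
As a worked instance of the H score, take the PBPrompt entries for Caltech101 reported in Table~\ref{tab: base2new} (Base $=97.98$, New $=95.54$):
\begin{equation*}
    \text{H} = \frac{2 \times 97.98 \times 95.54}{97.98 + 95.54} \approx \frac{18722.02}{193.52} \approx 96.74,
\end{equation*}
which matches the tabulated value. Because H is a harmonic mean, it never exceeds the arithmetic mean of Base and New (here $96.76$) and is pulled toward the smaller of the two, so a method cannot obtain a high H by improving the base accuracy alone.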
\begin{figure}[!h]
		\centering
		\vspace{-0.3cm}
		\setlength{\abovecaptionskip}{0.28cm}
		\includegraphics[width=0.65\linewidth]{fig/base2new.png}
		\caption{\small{Performance comparison on base-to-new generalization evaluated by the harmonic mean. More results can be found in Tables~\ref{tab: vit_b2n} and~\ref{tab: rn50_b2n}.}}
		\label{fig: base2new}
\end{figure}
% assesses model's generalizability in a zero-shot setting. The dataset is split into the base and new sets with no shared classes~\cite{zhou2022conditional}.
% Models are trained on the base set and subsequently tested on the new set.
% We report the harmonic mean (H) of various methods on 11 datasets at Fig ~\ref{fig: base2new} (The detailed accuracy on base and new set can be found in the Appendix). Note that, the H score is calculated as $\text{H} = (2\times \text{Base} \times \text{New}) / (\text{Base} + \text{New})$, which is a trade-off metric between the base and new sets. We find that PBPrompt surpasses other baselines across all datasets, enhanced with conditional transport theory, and is indeed capable of learning more efficient prompts for downstream tasks.


% \begin{table}[htbp]
%     \centering
%     \scalebox{0.60}{
%         \begin{tabular}{@{} lc|c|c|c @{}}
%     \toprule[1.5pt]
%                         \multicolumn{2}{l}{\textbf{Dataset}} & Base & New & H \\
%     \midrule
%     \multirow{7}{*}{\textbf{Caltech101}}    & CoCoOp &97.96 &93.81 &95.84 \\
%                                             % & CoPLOT &96.53 &82.86 &89.17 \\
%                                             & \revision{ProDA} &\textbf{98.27} &93.23 &95.68 \\
%                                             & \revision{VPT} &95.47 &93.80 &94.62 \\
%                                             &\revision{SHIP} &97.55 &95.20 &96.36 \\
%                                             & B-Prompt &97.95 &93.12 &95.47\\
%                                             & P-Prompt &97.35 &95.00 &96.16\\
%                                             & PBPrompt &{97.98} &\textbf{95.54} &\textbf{96.74}\\

%     \midrule
%     \multirow{7}{*}{\textbf{Flowers102}}    & CoCoOp & 94.87 & 71.75 &81.71 \\
%                                             % & CoPLOT & 95.36 & 56.73 &71.14 \\
%                                             & \revision{ProDA} &{97.70} &68.68 &80.66 \\
%                                             & \revision{VPT} &92.97 &75.90 &74.40 \\
%                                             & \revision{SHIP} &94.02 &\textbf{74.40} &83.06 \\
%                                             & B-Prompt &\textbf{97.35} &69.57 &81.15\\
%                                             & P-Prompt &95.21 &72.35 &82.22\\
%                                             & PBPrompt &95.47 &{73.60} &\textbf{83.12}\\
%     \midrule
%     \multirow{7}{*}{\textbf{DTD}}           & CoCoOp &77.01 &56.00 &64.85\\
%                                             % & CoPLOT &69.87 &53.63 &60.68 \\
%                                             & \revision{ProDA} &{80.67} &56.48 &66.44 \\
%                                             & \revision{VPT} &57.67 &\textbf{58.70} &58.18 \\
%                                             &\revision{SHIP} &74.88 &56.88 &64.65 \\
%                                             & B-Prompt &\textbf{79.97} &47.67 &59.73\\
%                                             & P-Prompt &77.20 &57.00 &65.58\\
%                                             & PBPrompt &78.03&{57.81} &\textbf{66.42} \\
%     \midrule
%     \multirow{7}{*}{\textbf{EuroSAT}}       & CoCoOp &87.49 &60.04 &71.21 \\
%                                             % & CoPLOT &87.39 &64.63 &74.31 \\
%                                             & \revision{ProDA} &{83.90} &66.00 &73.88 \\
%                                             & \revision{VPT} &67.97 &71.63 &69.75 \\
%                                             &\revision{SHIP} &88.62 &66.87 &76.22 \\
%                                             & B-Prompt &\textbf{92.46} &62.58 &74.64\\
%                                             & P-Prompt &87.21 &72.33 &79.08\\
%                                             & PBPrompt &89.53&\textbf{72.87} &\textbf{80.35}\\                
%     \bottomrule[1.5pt]
%     \end{tabular}}
%         \captionof{table}{\small{\revision{Base-to-New generalization results of various baselines. B-Prompt: Bayesian prompt tuning. P-Prompt: Patch-Prompt CT alignment. More resutls can be found at Table ~\ref{tab: vit_b2n}.}}}
%         \label{tab: base2new}
% \end{table}



\paragraph{Cross-Dataset Transfer Learning} measures how well a model trained on one source dataset transfers to other datasets, where we train our model on ImageNet (source dataset) and then test it on 10 distinct target datasets. As shown in Table~\ref{tab: cross_dataset}, PBPrompt improves over CoCoOp on 9 out of 10 target datasets.
This demonstrates that the proposed PBPrompt has the potential to transfer from a single dataset. Moreover, we also find that PBPrompt exhibits large gains on fine-grained datasets (FGVCAircraft, OxfordPets, and Flowers102), suggesting the capacity to handle the discriminative features of each category.
\vspace{-3mm}


\paragraph{Domain Generalization} concerns robustness to distribution shift, where we assess the proposed model on ImageNetV2, ImageNet-Sketch, ImageNet-A, and ImageNet-R after training it on the source dataset (ImageNet). We report the results in Table~\ref{tab: domain_generalization} and find that PBPrompt achieves the highest accuracy on all target domains compared to the other baselines. This indicates that the learnable stochastic prompts are less sensitive to distribution shifts and can generalize well across domains.
\vspace{-3mm}
\subsection{Further Analysis}
\begin{table}[!t]
\centering
\vspace{-0.2cm}
    \scalebox{0.53}{
    \begin{tabular}{lcccccc}
    \toprule[1.5pt]
    \textbf{} &\textbf{} &\multicolumn{1}{c}{Source} &\multicolumn{4}{c}{Target} \\
    \cmidrule(lr){3-3}\cmidrule(lr){4-7}
    \textbf{Method} &\textbf{Learnable} 
    &\textbf{ImageNet} 
    &\textbf{ImageNetV2} 
    &\textbf{ImageNet-Sketch} 
    &\textbf{ImageNet-A} 
    &\textbf{ImageNet-R}\\

    \midrule
    CLIP         & \XSolidBrush & \large66.73 & \large60.83 & \large46.15 & \large47.77 & \large73.96  \\
    CoOp         & \Checkmark & \large71.51 & \large64.20 & \large47.99 & \large49.71 & \large75.21  \\
    CoCoOp       & \Checkmark & \large71.02 & \large64.07 & \large48.75 & \large50.63 & \large76.18  \\
    % PLOT \\
    % CoOp + VPT   & \Checkmark & 69.73 & 63.17 & 48.87 & 50.77 & 77.40  \\
    % \rowcolor{gray!25}
    % PBPrompt     & \Checkmark & 70.90 & \textbf{64.40} & \textbf{49.10} & \textbf{51.00} & \textbf{76.40}  \\
    \rowcolor{gray!25}
    PBPrompt     & \Checkmark & \large\textbf{71.71} & \large\textbf{64.53} & \large\textbf{49.32} & \large\textbf{51.64} & \large\textbf{76.71}  \\
    % \midrule
    % CoCoOp + VPT & \Checkmark & 70.70 & 64.23 & 49.20 & 51.33 & 77.00  \\
    % \rowcolor{gray!25}
    % CPBPrompt    & \Checkmark & 70.97 & \textbf{64.54} & \textbf{49.47} & \textbf{51.39} & \textbf{76.92}  \\
    \bottomrule[1.5pt]
    \end{tabular}}
    \caption{\small{Cross-domain generalization accuracy results of various baselines.}}
\label{tab: domain_generalization}
\end{table}

% \begin{table}[]
%     \centering
%     \vspace{-0.5cm}
%     \scalebox{0.56}{
%     \begin{tabular}{ll|c|c|c|c|c|c|c}
%     \toprule[1.5pt]
%     \multicolumn{2}{l}{\textbf{Dataset}} & CoCoOp & ProDA & VPT &SHIP &B-Prompt &P-Prompt &PBPrompt \\
%     \midrule
%     \multirow{3}{*}{\textbf{Caltech101}} & Base & 97.96& 98.27& 95.47& 97.55& 97.95& 97.35& 97.98 \\
%     & New & 93.81& 93.23& 93.80& 95.20& 93.12& 95.00& 95.54\\
%     & H & 95.84& 95.68& 94.62& 96.36& 95.47& 96.16& 96.74\\
%         \midrule
%     \multirow{3}{*}{\textbf{Flowers102}} & Base &94.87 &97.70 &92.97 &94.02 &97.35 &95.21 &95.47  \\
%     & New &71.75 &68.68 &75.90 &74.40 &69.57 &72.35 &73.60 \\
%     & H &81.87 &80.66 &74.40 &83.06 &81.15 &82.22 &83.12 \\
%         \midrule
%     \multirow{3}{*}{\textbf{DTD}} & Base &77.01 &80.67 &57.67 &74.88 &79.97 &77.20 &78.03  \\
%     & New &56.00 &56.48 &58.70 &56.88 &47.67 &57.00 &57.81 \\
%     & H &64.85 &66.44 &58.18 &64.65 &59.73 &65.58 &66.42 \\
%         \midrule
%     \multirow{3}{*}{\textbf{EuroSAT}} & Base & 87.49& 83.90&67.97 &88.62 &92.46 &87.21 &89.53  \\
%     & New &60.04 &66.00 &71.63 &66.87 &62.58 &72.33 &72.87 \\
%     & H &71.21 &73.88 &69.75 &76.22 &74.64 &79.08 &80.35 \\
%     \bottomrule[1.5pt]
%     \end{tabular}}
%     \captionof{table}{\small{\revision{Base-to-New generalization results of various baselines. B-Prompt: Bayesian prompt tuning. P-Prompt: Patch-Prompt CT alignment. More resutls can be found at Table ~\ref{tab: vit_b2n}.}}}
%         \label{tab: base2new}
% \end{table}

\begin{table}[!h]
  \centering
  \vspace{-0.4cm}
  \scalebox{0.62}{
    \begin{tabular}{l|c|c|c|c|c|c|c}
    \toprule[1.5pt]
    \multicolumn{2}{l}{\textbf{Backbones}} & \multicolumn{3}{c}{ViT-B/16} &\multicolumn{3}{c}
    {RN50}\\
    \hline
    \multicolumn{2}{l}{\textbf{Dataset}} & 1 shot & 2 shots & 4 shots  & 1 shot & 2 shots & 4 shots\\
    \midrule
    \multirow{5}{*}{\textbf{Caltech101}} & CoOp     &93.19 &92.97 &94.50 &87.51 &87.84 &89.52 \\
                                         & PLOT     &87.90 &89.53 &91.87 &89.83 &\underline{90.67} &\underline{90.80}\\
                                         & B-Prompt &\underline{93.57} &\underline{94.10} &\underline{94.75} &\underline{90.10} &89.70 &90.56\\
                                         & P-Prompt &93.34 &93.95 &94.60 &88.54 &89.45 &90.70\\
                                         & PBPrompt &\textbf{93.92} &\textbf{94.40} &\textbf{94.83} &\textbf{90.21} &\textbf{90.86} &\textbf{90.92}\\
    \midrule
    \multirow{5}{*}{\textbf{DTD}}        & CoOp     &50.03 &53.93 &59.23 &43.62 &45.35 &53.94\\
                                         & PLOT     &\textbf{52.20} &\underline{56.03} &58.37 &46.55 &51.24 &56.03 \\
                                         & B-Prompt &51.87 &55.85 &\underline{59.53} &46.00 &\underline{51.67} &\underline{56.17}\\
                                         & P-Prompt &50.95 &55.10 &59.02 &\underline{46.95} &48.35 &55.89\\
                                         & PBPrompt &\underline{52.03} &\textbf{56.20} &\textbf{59.63} &\textbf{47.21} &\textbf{52.08} &\textbf{56.97}\\
    \midrule
    \multirow{5}{*}{\textbf{FOOD101}}    & CoOp     &82.70 &82.77 &83.63 &74.25 &72.61 &74.49\\
                                         & PLOT     &69.33 &72.73 &75.17 &\textbf{77.74} &\underline{77.70} &77.21\\
                                         & B-Prompt &84.97 &\underline{86.03} &\underline{86.21} &\underline{77.02} &76.45 &\underline{77.58}\\
                                         & P-Prompt &\underline{85.00} &83.67 &84.39 &76.20 &75.39 &76.45\\
                                         & PBPrompt &\textbf{85.55} &\textbf{86.25} &\textbf{86.30} &77.35 &\textbf{77.83}  &\textbf{78.09}\\
    \midrule
    \multirow{5}{*}{\textbf{SUN397}}     & CoOp     &67.32 &67.67 &70.14 &60.12 &59.60 &63.24\\
                                         & PLOT     &55.17 &59.40 &62.73 &\underline{62.47} &61.71 &\textbf{65.09}\\
                                         & B-Prompt &\underline{67.98} &\underline{69.00} &\underline{70.20} &62.42 &\underline{63.03} &64.83\\
                                         & P-Prompt &67.45 &68.25 &70.10 &62.10 &61.54 &64.12 \\
                                         & PBPrompt &\textbf{68.10} &\textbf{69.35} &\textbf{70.21} &\textbf{62.51} &\textbf{63.45} &\underline{64.77}\\                          
    \bottomrule[1.5pt]
    \end{tabular}}
    \caption{Ablation studies of backbones on few-shot learning.}
        \label{tab: backbone}
\end{table}

\paragraph{Robustness and Synergistic Effect} 
Our previous experiments use the ViT-B/16 backbone. Here, we additionally employ the RN50 backbone to assess the robustness of our model across different backbones. The few-shot learning accuracy results are presented in Table~\ref{tab: backbone}. PBPrompt provides more consistent results than the prior state-of-the-art methods on both backbones, especially with the ViT-B/16 backbone, where PLOT suffers a significant performance drop in comparison.
Additionally, we compare two variants of PBPrompt, namely B-Prompt and P-Prompt, on the few-shot learning and base-to-new tasks. B-Prompt contains only the SPG module, while P-Prompt only utilizes the conditional transport framework; both are built on CoOp. We report the accuracy scores in Table~\ref{tab: backbone} and Table~\ref{tab: base2new}, respectively. We observe that both variants exhibit significant improvements compared to CoOp, especially B-Prompt, which outperforms the previous methods in most of the test cases. Furthermore, PBPrompt achieves the highest performance on the majority of test cases among all methods by combining both components, demonstrating the synergistic effect of our approach.
\vspace{-2mm}

\begin{table}[]
    \centering
    \vspace{-0.1cm}
    \scalebox{0.56}{
    \begin{tabular}{ll|c|c|c|c|c|c|c}
    \toprule[1.5pt]
    \multicolumn{2}{l}{\textbf{Dataset}} & CoCoOp & ProDA & VPT &SHIP &B-Prompt &P-Prompt &PBPrompt \\
    \midrule
    \multirow{3}{*}{\textbf{Caltech101}} & Base & 97.96& \textbf{98.27}& 95.47& 97.55& 97.35& 97.95& 97.98 \\
    & New & 93.81& 93.23& 93.80& 95.20& 95.00& 93.12& \textbf{95.54}\\
    & H & 95.84& 95.68& 94.62& 96.36& 96.16& 95.47& \textbf{96.74}\\
        \midrule
    \multirow{3}{*}{\textbf{Flowers102}} & Base &94.87 &\textbf{97.70} &92.97 &94.02 &95.21 &97.35 &95.47  \\
    & New &71.75 &68.68 &\textbf{75.90} &74.40 &72.35 &69.57 &73.60 \\
    & H &81.87 &80.66 &74.40 &83.06 &82.22 &81.15 &\textbf{83.12} \\
        \midrule
    \multirow{3}{*}{\textbf{DTD}} & Base &77.01 &\textbf{80.67} &57.67 &74.88 &77.20 &79.97 &78.03  \\
    & New &56.00 &56.48 &\textbf{58.70} &56.88 &57.00 &47.67 &57.81 \\
    & H &64.85 &66.44 &58.18 &64.65 &65.58 &59.73 &\textbf{66.42} \\
        \midrule
    \multirow{3}{*}{\textbf{EuroSAT}} & Base & 87.49& 83.90&67.97 &88.62 &87.21 &\textbf{92.46} &89.53  \\
    & New &60.04 &66.00 &71.63 &66.87 &72.33 &62.58 &\textbf{72.87} \\
    & H &71.21 &73.88 &69.75 &76.22 &79.08 &74.64 &\textbf{80.35} \\
    \bottomrule[1.5pt]
    \end{tabular}}
    \captionof{table}{\small{\revision{Base-to-New generalization results of various baselines. B-Prompt: Bayesian prompt tuning. P-Prompt: Patch-Prompt CT alignment. More results can be found in Table~\ref{tab: vit_b2n}.}}}
        \label{tab: base2new}
\end{table}
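
For reference, the H rows in Table~\ref{tab: base2new} can be reproduced from the Base and New rows, assuming that H denotes the harmonic mean of the two accuracies, as is common in base-to-new evaluation. As a worked example, the PBPrompt entry on Caltech101 gives
\begin{equation*}
    \mathrm{H} = \frac{2 \cdot \mathrm{Base} \cdot \mathrm{New}}{\mathrm{Base} + \mathrm{New}} = \frac{2 \times 97.98 \times 95.54}{97.98 + 95.54} \approx 96.74 .
\end{equation*}
Because the harmonic mean is pulled toward the smaller of the two values, a method must perform well on both base and new classes to obtain a high H score.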

% \begin{table}[!h]
%   \centering   
%   \scalebox{0.62}{
%     \begin{tabular}{l|c|c|c|c|c|c|c}
%     \toprule[1.5pt]
%     \multicolumn{2}{l}{\textbf{Backbones}} & \multicolumn{3}{c}{ViT-B/16} &\multicolumn{3}{c}
%     {RN50}\\
%     \hline
%     \multicolumn{2}{l}{\textbf{Dataset}} & 1 shot & 2 shots & 4 shots  & 1 shot & 2 shots & 4 shots\\
%     \midrule
%     \multirow{5}{*}{\textbf{Caltech101}} & CoOp     &93.19 &92.97 &94.50 &87.51 &87.84 &89.52 \\
%                                          & PLOT     &87.90 &89.53 &91.87 &89.83 &\underline{90.67} &\underline{90.80}\\
%                                          & B-Prompt &\underline{93.57} &\underline{94.10} &\underline{94.75} &\underline{90.10} &89.70 &90.56\\
%                                          & P-Prompt &93.34 &93.95 &94.60 &88.54 &89.45 &90.70\\
%                                          & PBPrompt &\textbf{93.92} &\textbf{94.40} &\textbf{94.83} &\textbf{90.21} &\textbf{90.86} &\textbf{90.92}\\
%     \midrule
%     \multirow{5}{*}{\textbf{DTD}}        & CoOp     &50.03 &53.93 &59.23 &43.62 &45.35 &53.94\\
%                                          & PLOT     &\textbf{52.20} &\underline{56.03} &58.37 &46.55 &51.24 &56.03 \\
%                                          & B-Prompt &51.87 &55.85 &\underline{59.53} &46.00 &\underline{51.67} &\underline{56.17}\\
%                                          & P-Prompt &50.95 &55.10 &59.02 &\underline{46.95} &48.35 &55.89\\
%                                          & PBPrompt &\underline{52.03} &\textbf{56.20} &\textbf{59.63} &\textbf{47.21} &\textbf{52.08} &\textbf{56.97}\\
%     \midrule
%     \multirow{5}{*}{\textbf{FOOD101}}    & CoOp     &82.70 &82.77 &83.63 &74.25 &72.61 &74.49\\
%                                          & PLOT     &69.33 &72.73 &75.17 &\textbf{77.74} &\underline{77.70} &77.21\\
%                                          & B-Prompt &84.97 &\underline{86.03} &\underline{86.21} &\underline{77.02} &76.45 &\underline{77.58}\\
%                                          & P-Prompt &\underline{85.00} &83.67 &84.39 &76.20 &75.39 &76.45\\
%                                          & PBPrompt &\textbf{85.55} &\textbf{86.25} &\textbf{86.30} &77.35 &\textbf{77.83}  &\textbf{78.09}\\
%     \midrule
%     \multirow{5}{*}{\textbf{SUN397}}     & CoOp     &67.32 &67.67 &70.14 &60.12 &59.60 &63.24\\
%                                          & PLOT     &55.17 &59.40 &62.73 &\underline{62.47} &61.71 &\textbf{65.09}\\
%                                          & B-Prompt &\underline{67.98} &\underline{69.00} &\underline{70.20} &62.42 &\underline{63.03} &64.83\\
%                                          & P-Prompt &67.45 &68.25 &70.10 &62.10 &61.54 &64.12 \\
%                                          & PBPrompt &\textbf{68.10} &\textbf{69.35} &\textbf{70.21} &\textbf{62.51} &\textbf{63.45} &\underline{64.77}\\                          
%     \bottomrule[1.5pt]
%     \end{tabular}}
%         \label{fig: backbone}
%         \caption{}
% \end{table}
    
\begin{figure*}[!ht]
\centering%
\begin{minipage}[b]{0.49\textwidth}
\centering%
\includegraphics[width=\linewidth]{fig/samples.png}
\caption{\small{Monte Carlo sampling numbers}}
\label{fig: sample}
\end{minipage}%
\hspace{3mm}%
\begin{minipage}[b]{0.49\textwidth}
\centering%
\includegraphics[width=\linewidth]{fig/coef.png}
\caption{\small{Regularization coefficient $\eta$}}
\label{fig: coef}
\end{minipage}
\vspace{-2mm}
\end{figure*}

% \begin{figure*}[t!]
%   \centering
%     \begin{subfigure}{0.48\textwidth}
%       \centering   
%       \scalebox{0.62}{
%     \begin{tabular}{l|c|c|c|c|c|c|c}
%     \toprule[1.5pt]
%     \multicolumn{2}{l}{\textbf{Backbones}} & \multicolumn{3}{c}{ViT-B/16} &\multicolumn{3}{c}
%     {RN50}\\
%     \hline
%     \multicolumn{2}{l}{\textbf{Dataset}} & 1 shot & 2 shots & 4 shots  & 1 shot & 2 shots & 4 shots\\
%     \midrule
%     \multirow{5}{*}{\textbf{Caltech101}} & CoOp     &93.19 &92.97 &94.50 &87.51 &87.84 &89.52 \\
%                                          & PLOT     &87.90 &89.53 &91.87 &89.83 &\underline{90.67} &\underline{90.80}\\
%                                          & B-Prompt &\underline{93.57} &\underline{94.10} &\underline{94.75} &\underline{90.10} &89.70 &90.56\\
%                                          & P-Prompt &93.34 &93.95 &94.60 &88.54 &89.45 &90.70\\
%                                          & PBPrompt &\textbf{93.92} &\textbf{94.40} &\textbf{94.83} &\textbf{90.21} &\textbf{90.86} &\textbf{90.92}\\
%     \midrule
%     \multirow{5}{*}{\textbf{DTD}}        & CoOp     &50.03 &53.93 &59.23 &43.62 &45.35 &53.94\\
%                                          & PLOT     &\textbf{52.20} &\underline{56.03} &58.37 &46.55 &51.24 &56.03 \\
%                                          & B-Prompt &51.87 &55.85 &\underline{59.53} &46.00 &\underline{51.67} &\underline{56.17}\\
%                                          & P-Prompt &50.95 &55.10 &59.02 &\underline{46.95} &48.35 &55.89\\
%                                          & PBPrompt &\underline{52.03} &\textbf{56.20} &\textbf{59.63} &\textbf{47.21} &\textbf{52.08} &\textbf{56.97}\\
%     \midrule
%     \multirow{5}{*}{\textbf{FOOD101}}    & CoOp     &82.70 &82.77 &83.63 &74.25 &72.61 &74.49\\
%                                          & PLOT     &69.33 &72.73 &75.17 &\textbf{77.74} &\underline{77.70} &77.21\\
%                                          & B-Prompt &84.97 &\underline{86.03} &\underline{86.21} &\underline{77.02} &76.45 &\underline{77.58}\\
%                                          & P-Prompt &\underline{85.00} &83.67 &84.39 &76.20 &75.39 &76.45\\
%                                          & PBPrompt &\textbf{85.55} &\textbf{86.25} &\textbf{86.30} &77.35 &\textbf{77.83}  &\textbf{78.09}\\
%     \midrule
%     \multirow{5}{*}{\textbf{SUN397}}     & CoOp     &67.32 &67.67 &70.14 &60.12 &59.60 &63.24\\
%                                          & PLOT     &55.17 &59.40 &62.73 &\underline{62.47} &61.71 &\textbf{65.09}\\
%                                          & B-Prompt &\underline{67.98} &\underline{69.00} &\underline{70.20} &62.42 &\underline{63.03} &64.83\\
%                                          & P-Prompt &67.45 &68.25 &70.10 &62.10 &61.54 &64.12 \\
%                                          & PBPrompt &\textbf{68.10} &\textbf{69.35} &\textbf{70.21} &\textbf{62.51} &\textbf{63.45} &\underline{64.77}\\                          
%     \bottomrule[1.5pt]
%     \end{tabular}}
%         \label{fig: backbone}
%         \caption{}
%     \end{subfigure}  
%     \hfill
%     \begin{subfigure}{0.4\textwidth}
%       \centering   
%       \includegraphics[width=0.97\linewidth]{fig/vis.pdf}
%         \label{fig: vis}
%         \caption{}
%     \end{subfigure}
%     \caption{
% \label{fig: total}
% \small{(a) Ablation studies of backbones on few-shot learning. (b) Visualization of the learned prompts.}}
% \end{figure*}

\paragraph{The effect of Monte Carlo sampling and $\eta$}
Generally, increasing the number of Monte Carlo samples leads to more stable results, while an appropriate number of samples retains a moderate level of uncertainty, which can enhance the model's generalization and representation capabilities.

Meanwhile, the hyperparameter $\eta$, which balances the regularization term, plays a crucial role in connecting the stochastically generated prompts with various visual concepts. We ablate these two hyperparameters on few-shot learning with 1/2/4 shots in Fig.~\ref{fig: sample} and Fig.~\ref{fig: coef}. In Fig.~\ref{fig: sample}, $\mu$ denotes simply using the mean of multiple prompt embeddings. We observe that using fewer samples leads to increased uncertainty and a significant drop in performance, which indicates that a sufficiently large number of samples is essential for reliable results. Fig.~\ref{fig: coef} shows that large coefficients can hurt the results by overemphasizing image relationships, potentially overshadowing CLIP's alignment properties. By default, we set the number of samples to 20 and $\eta=0.01$.
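
To make the role of the sampling number concrete, the following sketch assumes that the prediction is obtained by averaging the class probabilities over $S$ prompt samples drawn from the inference network; the symbols $S$ and $\rv_c^{(s)}$ are illustrative notation introduced here and are not taken from the derivation of the model:
\begin{equation*}
    p(y \mid \xv) \approx \frac{1}{S} \sum_{s=1}^{S} p\big(y \mid \xv, \{\rv_c^{(s)}\}_{c=1}^{C}\big), \qquad \rv_c^{(s)} \sim q(\rv_c \mid c).
\end{equation*}
Under this assumption, the variance of the estimate shrinks at the rate $1/S$ for independent samples, which is consistent with the instability observed for small sampling numbers.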

% \paragraph{Computation Cost Evaluation}
% We summarize the comparison of the parameters and inference speed of the baseline methods CoOp~\cite{zhou2022learning}, CoCoOp~\cite{zhou2022conditional}, PLOT~\cite{chen2022prompt} with 4 prompts and our PBPrompt with 10 samples. We report the number of learnable parameters and the number of images processed by the model in 1 second during inference on the Food101~\cite{bossard2014food} dataset. As shown in Table ~\ref{tab: computation_cost}, despite the introduction of additional learnable parameters in our model, we were able to achieve comparable inference speed.

% \begin{table}[h]
%     \centering
%     \scalebox{0.7}{
%     \begin{tabular}{l|cccc}
%     \toprule[1.5pt]
%          Settings &\textbf{CoOp} &\textbf{CoCoOp} &\textbf{PLOT(N=4)} &\textbf{PBPrompt} \\ 
%          \midrule
%          \# Params &2048 &35360 &8192 &1577984 \\
%          Inference Speed \small(images/s) &645 &37 &583 &541\\
%     \bottomrule[1.5pt]
%     \end{tabular}}
%         \caption{The parameters and inference time comparison.}
%     \label{tab: computation_cost}
% \end{table}

\paragraph{Further ablation study}
Due to space constraints, we defer the details of further ablation studies to the Appendix and summarize them briefly here.
First, we explore the impact of the two terms, patch-to-prompt and prompt-to-patch, in the proposed CT regularization.
We find that neither of the two terms can be omitted, and we explore different weighting coefficients in Sec.~\ref{sec: balance}.
Second, on Base-to-New generalization, a trade-off between performance on base and new classes is inevitable. We therefore ablate the number of training epochs on various datasets and find that the harmonic mean of our method remains stable across these settings; more details can be found in Table~\ref{tab: trade_off}. Empirically, we also validate that the stochastic generation module, rather than the additional parameters in the inference network, is the crucial factor behind the performance of our method. We further compare against results under the OT framework to demonstrate the effectiveness of our approach, as shown in Sec.~\ref{sec: more}. Finally, we evaluate the computation cost against other baseline methods in Sec.~\ref{sec: cost}.


\vspace{-2mm}
\paragraph{Visualization}
We find that the transport plans $\pi$ in Eq.~\ref{ug} serve as an effective visualization tool, showing how the stochastically generated prompts for a specific class concentrate on the visual concepts of the corresponding images.
We provide visualization examples in Fig.~\ref{fig: vis} to illustrate this.
In addition, as shown in Fig.~\ref{fig: app_prompt}, we attempt to explain the learned prompts in the text domain via a multimodal model.
More analysis and visualization can be found in Sec.~\ref{app_vis}.

\begin{figure}
      \centering   
      \includegraphics[width=0.9\linewidth]{fig/vis.pdf}
        \caption{Visualization of the learned prompts.}
        \label{fig: vis}
\end{figure}


\section{Conclusion}
In this paper, we propose Patch-Prompt aligned Bayesian prompt tuning (PBPrompt) for pre-trained vision-language models. PBPrompt is a Bayesian prompt tuning method that generates label-specific stochastic prompts hierarchically under the variational inference framework, which comprises a stochastic sampling network and a deterministic generative model. Moreover, we introduce a CT regularization that aligns the textual prompts with the image patches under the conditional transport framework. PBPrompt is optimized with the derived combined ELBO via a stochastic gradient algorithm. Extensive experiments on 15 datasets across various tasks are conducted to evaluate the effectiveness of our model. We hope PBPrompt will provide a simple tool for prompt tuning and inspire future work.


\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
This work was supported in part by the National Natural Science Foundation of China under Grant U21B2006; in part by the Shaanxi Youth Innovation Team Project; in part by the Fundamental Research Funds for the Central Universities QTZX24003 and QTZX22160; and in part by the 111 Project under Grant B18039.
\end{acknowledgements}

% References
\bibliography{uai2024-template}

\newpage

\onecolumn

% \title{Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models\\(Supplementary Material)}
% \maketitle



% This Supplementary Material should be submitted together with the main paper.

\appendix
% \begin{center}
%         {\LARGE Patch-Prompt Aligned Bayesian Prompt Tuning \\for Vision-Language Models (Supplementary Material)}
% \end{center}
% \vspace{5mm}
    
\section{Discussions}
\textbf{The main purpose of the introduced Bayesian prompt generation and Patch-Prompt CT alignments.}

One of the main contributions of the proposed model is the stochastic prompt generation, which introduces uncertainty into the prompt embeddings. E.g., for each category, we can generate different prompts that capture diverse visual concepts, resulting in better class-specific representations. Unfortunately, due to the mode-collapse problem that commonly appears in Bayesian generative models, we find that optimizing the stochastic module with the classification loss alone can lead to suboptimal results. Motivated by PLOT~\citep{chen2022prompt}, we employ the CT regularization to align the generated prompts and the image patches. Intuitively, we view an image as two discrete distributions, one over the prompt embeddings and one over the patch embeddings. They share similar semantics but lie in different domains. Ideally, the two distributions should be semantically close. By minimizing the CT distance, the learned prompt embeddings tend to capture the true label-specific visual concepts, improving the quality of the learned prompts. That is, the CT regularization improves the performance of the method by aligning the textual prompt domain and the visual patch domain, which is usually ignored by previous works.

\textbf{The improvement is marginal when compared to CoCoOp in some cases.}

We highlight the advantages of the proposed model below.
First, the paper provides a novel Bayesian prompt-generation strategy for the prompt-tuning community. This enables the learned prompts to capture diverse visual concepts and offers subsequent studies a stochastic view rather than a focus on deterministic paradigms only.
Second, the improvement is consistent in most cases. We note that achieving consistent improvement over 4 tasks on 15 datasets is a nontrivial contribution. Regarding the marginal improvement on several datasets, previous models (e.g., CoCoOp) already achieve high results there, so the remaining room for improvement is small. We find that the proposed PBPrompt usually yields a significant improvement with 1/2/4 shots, which highlights the performance of our method with fewer training samples (see Table~\ref{tab: vit_fsl} and Table~\ref{tab: rn50_fsl} for detailed results). Besides, our method balances the seen and unseen sets well according to Table~\ref{tab: base2new}. E.g., PBPrompt achieves 0.9\%--9.14\% improvements over CoCoOp in terms of H score.
Third, the proposed model is interpretable. The visualization in Fig.~\ref{fig: vis} shows the interpretability of the learned prompts, while CoCoOp only reports numerical results.

\revision{\textbf{Differences from SHIP.}}

\revision{Both SHIP and PBPrompt introduce uncertainty into the prompt generation process. However, the latent variable $\zv$ ($\rv$ in PBPrompt) models different levels of uncertainty and rests on different assumptions. SHIP introduces stochastic prompts for each image and infers a sample-dependent posterior:}
\begin{equation}
    q(\zv_i) = \mathcal{N}(\mu(\xv_i), \Sigma(\xv_i)),
\end{equation}
\revision{where $\xv_i$ denotes the feature of the $i$-th image. In contrast, PBPrompt assumes that each category has an underlying distribution and infers a label-specific posterior:}
\begin{equation}
    q(\zv_c) = \mathcal{N}(\mu(\ev_c), \Sigma(\ev_c)),
\end{equation}
\revision{where $\ev_c$ denotes the embedding of the $c$-th category.}

\revision{\textbf{Prior $p(\zv)$}. SHIP simply adopts the standard Gaussian as the prior of $\zv$, \textit{i.e.}, $p(\zv)=\mathcal{N}(0, \Imat)$, while PBPrompt uses a contextual prior to capture label-specific features: $p(\zv_c)=\mathcal{N}(\ev_c, \Imat)$. This difference gives PBPrompt access to additional label semantics and thus better prior guidance.}
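
To illustrate how the two priors act on the posterior, the following sketch assumes a diagonal covariance $\Sigma(\ev_c)=\mathrm{diag}(\sigma_{c,1}^2,\dots,\sigma_{c,D}^2)$ over $D$ latent dimensions and writes $\mu_{c,d}$ and $e_{c,d}$ for the $d$-th entries of $\mu(\ev_c)$ and $\ev_c$; this notation is introduced here for illustration only. The KL term between the label-specific posterior and the contextual prior then has the standard closed form
\begin{equation*}
    \mathrm{KL}\big(q(\zv_c)\,\|\,p(\zv_c)\big) = \frac{1}{2}\sum_{d=1}^{D}\Big(\sigma_{c,d}^2 + (\mu_{c,d}-e_{c,d})^2 - 1 - \log \sigma_{c,d}^2\Big).
\end{equation*}
With a standard Gaussian prior, the quadratic term becomes $\mu_{c,d}^2$, which pulls the posterior mean toward zero for every class, whereas the contextual prior pulls it toward the label embedding $\ev_c$.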

\revision{\textbf{Training pipelines}. SHIP introduces an additional feature reconstruction loss to pre-train the VAE and then fine-tunes the prompt via the task-specific loss. Our PBPrompt naturally integrates the stochastic prompts into the CLIP framework and directly optimizes the prompts via the combined ELBO.}



\section{Method Details}\label{algo}
Given the labeled training dataset $\mathcal{D}=\{(\xv_j, y_j)\}_{j=1}^{N_{tr}}$, our proposed PBPrompt aims to learn stochastic prompts for each class. Note that all parameters in PBPrompt are optimized end-to-end by minimizing the combined ELBO. We summarize the training procedure in Algorithm~\ref{alg}.
\begin{algorithm}[H]
\footnotesize
\caption{Training algorithm for our proposed PBPrompt.}
\label{alg}
\begin{algorithmic}
\STATE \textbf{Input}: Training set $\mathcal{D}=\{(\xv_j, y_j)\}_{j=1}^{N_{tr}}$, a VLP, class names, and hyperparameter $\eta$.
\STATE \textbf{Output}: The trained PBPrompt, which can generate stochastic label-specific prompts for downstream tasks.
\STATE \textbf{Initialize}: The prefix token embeddings, the parameters in the inference network $q(\rv_c|c)$ and the generative model $\phi(\vv_c|\rv_c)$.
\FOR{ iter = 1,2,3,...}
 \STATE Sample a batch of $B$ image-label pairs and get the image features and patch embeddings by feeding the images into the image encoder $f(\xv)$.
 \STATE \textcolor{gray}{\# Learning of PBPrompt}
 \STATE Generate $C$ stochastic prompts hierarchically with Eq.(2) for all classes.
 \STATE Get the label embeddings by feeding the prompts into the text encoder $g(\tv)$.
 \STATE Compute the CT distance between patches and the class-specific prompts with Eq.(5).
 \STATE Compute the combined ELBO $\mathcal{L}$ with Eq.(8) and update all learnable parameters by minimizing the $\mathcal{L}$ with the stochastic gradient descent algorithm.
\ENDFOR\\
\end{algorithmic}\label{algo}
\end{algorithm}


\section{Experiment Details}
\renewcommand\thetable{\Alph{section}.\arabic{table}}
\setcounter{table}{0}

\subsection{Data Statistics}
Our experiments are conducted on 15 widely-used vision datasets. Specifically, we use ImageNet~\cite{deng2009imagenet} and Caltech101~\cite{fei2004learning} for generic object classification, OxfordPets~\cite{parkhi2012cats}, StanfordCars~\cite{krause20133d}, Flowers102~\cite{nilsback2008automated}, Food101~\cite{bossard2014food} and FGVCAircraft~\cite{maji2013fine} for fine-grained image recognition, EuroSAT~\cite{helber2019eurosat} for satellite image classification, UCF101~\cite{soomro2012ucf101} for action classification, DTD~\cite{cimpoi2014describing} for texture classification, and SUN397~\cite{xiao2010sun} for scene recognition. For the domain generalization task, we use ImageNet as the source domain dataset and evaluate performance on ImageNetV2~\cite{recht2019imagenet}, ImageNet-Sketch~\cite{wang2019learning}, ImageNet-A~\cite{hendrycks2021natural}, and ImageNet-R~\cite{hendrycks2021many}.
We summarize the data statistics in Table~\ref{tab: statistics}.
\begin{table}[h]
\centering
\caption{Statistics of the datasets.}
\label{tab: statistics}
\begin{tabular}{ccccc}
\toprule[2pt]
\textbf{Dataset} & \textbf{Classes} & \textbf{Train} & \textbf{Val} & \textbf{Test}\\
\midrule
ImageNet        & 1000 & 1.28M & N/A & 50,000\\
Caltech101      & 100 & 4,128 & 1,649 & 2,465\\
OxfordPets      & 37 & 2,944 & 736 & 3,669\\
StanfordCars    & 196 & 6,509 & 1,635 & 8,041\\
Flowers102      & 102 & 4,093 & 1,633 & 2,463\\
Food101         & 101 & 50,500 & 20,200 & 30,300\\
FGVCAircraft    & 100 & 3,334 & 3,333 & 3,333\\
SUN397          & 397 & 15,880 & 3,970 & 19,850\\
DTD             & 47 & 2,820 & 1,128 & 1,692\\
EuroSAT         & 10 & 13,500 & 5,400 & 8,100\\
UCF101          & 101 & 7,639 & 1,808 & 3,783\\
\midrule
ImageNetV2      & 1000 & N/A & N/A & 10,000\\
ImageNet-Sketch & 1000 & N/A & N/A & 50,889\\
ImageNet-A      & 200 & N/A & N/A & 7,500\\
ImageNet-R      & 200 & N/A & N/A & 30,000\\
\bottomrule[2pt]
\end{tabular}
\end{table}

\begin{table}[h]
\centering
\caption{All results in the main paper were generated using shared hyperparameters when employing the ViT-B/16 backbone.}
\label{tab: hyperparameter_setting}
\begin{tabular}{ll}
\toprule[2pt]
\textbf{Hyperparameters} & \textbf{Values}\\
\midrule
Batch Size                  & 1\\
Input Size                  & $224\times224$\\
Input Interpolation         & "Bicubic"\\
Input Pixel Mean            & $[0.48145466, 0.4578275, 0.40821073]$\\
Input Pixel STD             & $[0.26862954, 0.26130258, 0.27577711]$\\
Transforms                  & ["random resized crop", "random flip", "normalize"]\\
Optimizer                   & SGD\\
Learning Rate               & 2$e$-3\\
LR Scheduler                & "cosine"\\
Warmup Epoch                & 1\\
Warmup Type                 & "constant"\\
Warmup LR                   & 1$e$-5\\
Backbone                    & ViT-B/16\\
Prompt Length               & 4\\
Prompt Initialization       & ""\\
Precision                   & "fp16"\\
Number of shots             & 16\\
\bottomrule[2pt]
\end{tabular}
\end{table}

\subsection{Hyperparameter Setting}
We follow \citet{zhou2022conditional} for the training hyper-parameters, the training pipeline, and the definitions of the few-shot tasks, and use ViT-B/16 in the manuscript. For the RN50 backbone, we replace ViT-B/16 with RN50 and set the number of shots to 4 to maintain consistency with other works using RN50. We list these settings in Table~\ref{tab: hyperparameter_setting}.
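
As a sketch of the learning-rate schedule implied by the settings in Table~\ref{tab: hyperparameter_setting}, assume one epoch of constant warmup followed by standard cosine annealing over $T$ training epochs; the symbols $\alpha_t$, $\alpha_0$, and $T$ are illustrative notation, and the exact epoch offset depends on the implementation:
\begin{equation*}
    \alpha_t =
    \begin{cases}
        10^{-5}, & t < 1,\\
        \dfrac{\alpha_0}{2}\Big(1 + \cos\dfrac{\pi t}{T}\Big), & 1 \le t \le T,
    \end{cases}
    \qquad \alpha_0 = 2\times 10^{-3},
\end{equation*}
where $t$ denotes the epoch index, so that the learning rate decays smoothly from approximately $\alpha_0$ after warmup to zero at the end of training.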

% \section{Additional Results}
% \renewcommand\thetable{\Alph{section}. \arabic{table}}    
% \setcounter{table}{0}

\subsection{Impact of the patch-to-prompt and prompt-to-patch transport}\label{sec: balance}
In the previous experiments, we weight the patch-to-prompt and prompt-to-patch transport terms in Eq.~\ref{ct} equally. To study the impact of the two terms, we rewrite Eq.~\ref{ct} as:
\begin{equation} \label{ct_app}
    \mathcal{L}_{CT}(P,Q) = \lambda \mathcal{L}_{\uv \rightarrow \gv} + (1-\lambda) \mathcal{L}_{\gv \rightarrow \uv},
\end{equation}
where $\lambda$ controls the weight of the patch-to-prompt term. We report the few-shot results with various $\lambda$ in Table~\ref{lambda}. We find that 1) using only $\mathcal{L}_{\uv \rightarrow \gv}$ or only $\mathcal{L}_{\gv \rightarrow \uv}$ (\textit{i.e.}, $\lambda=1$ or $\lambda=0$) does not give satisfactory results, and 2) promising results can be obtained by carefully choosing $\lambda$. We set this hyperparameter to 0.5 for ease of parameter tuning.
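
For concreteness, the endpoints and the midpoint of $\lambda$ in Eq.~\ref{ct_app} reduce to
\begin{equation*}
    \mathcal{L}_{CT}(P,Q) =
    \begin{cases}
        \mathcal{L}_{\gv \rightarrow \uv}, & \lambda = 0,\\
        \tfrac{1}{2}\big(\mathcal{L}_{\uv \rightarrow \gv} + \mathcal{L}_{\gv \rightarrow \uv}\big), & \lambda = 0.5,\\
        \mathcal{L}_{\uv \rightarrow \gv}, & \lambda = 1.
    \end{cases}
\end{equation*}
Hence $\lambda=0$ and $\lambda=1$ each keep a single transport direction, while $\lambda=0.5$ weights both directions equally. If Eq.~\ref{ct} is written as an unweighted sum of the two terms (an assumption here), the factor $\tfrac{1}{2}$ only rescales the regularizer and can be absorbed into $\eta$.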


\begin{figure}[htbp]
    \centering
    \scalebox{0.9}{
        \begin{tabular}{@{} lc|c|c|c|c|c|c|c @{}}
    \toprule[1.5pt]
                        \multicolumn{2}{l}{\textbf{Dataset}} & 0.0 & 0.2 & 0.4 &0.5 &0.6 &0.8 &1.0 \\
    \midrule
    \multirow{3}{*}{\textbf{DTD}}    & 1-shot &51.36 &51.54 &51.77 &\textbf{52.03} &51.83 &51.95 &51.37 \\
                                     & 2-shots &54.43 &55.67 &56.20 &\textbf{56.34} &55.85 &55.20 &55.67 \\
                                     & 4-shots &58.16 &58.75 &\textbf{59.66} &59.63 &59.53 &59.42 &58.87\\
    \midrule
    \multirow{3}{*}{\textbf{EuroSAT}}    & 1-shot &60.78 &61.21 &\textbf{61.93} &60.92 &61.02 &61.61 &61.20 \\
                                     & 2-shots &68.12 &68.76 &68.34 &\textbf{68.77} &68.05 &67.43 &67.98 \\
                                     & 4-shots &70.63 &71.01 &71.1 &\textbf{72.84} &72.71 &72.14 &71.96\\
    \midrule
    \multirow{3}{*}{\textbf{Caltech101}}           & 1-shot &93.21 &93.90 &\textbf{93.94} &93.92 &93.93 &93.32 &93.4\\
                                            & 2-shots &93.98 &94.20 &94.41 &94.40 &\textbf{94.45} &94.39 &94.23 \\
                                            & 4-shots &94.78 &\textbf{94.85} &94.83 &94.83 &94.83 &94.80 &94.51 \\
   \midrule
    \multirow{3}{*}{\textbf{StanfordCars}}           & 1-shot &66.21 &66.54 &67.10 &\textbf{67.30} &66.70 &66.98 &66.49\\
                                            & 2-shots &69.52 &70.14 &\textbf{70.48} &70.20 &70.36 &70.44 &70.23 \\
                                            & 4-shots &72.94 &73.57 &73.42 &\textbf{73.60} &73.61 &73.84 &73.60 \\
    \bottomrule[1.5pt]
    \end{tabular}}
        \captionof{table}{\small{Few-shot results with various values of the weight $\lambda$ in Eq.~\ref{ct_app}.}}
        \label{lambda}
\end{figure}



\subsection{Additional comparison to ProDA}

We compared PBPrompt to PLOT in the manuscript, and extensive results show the superiority of the proposed Bayesian framework. Note that ProDA~\citep{lu2022prompt} is also a stochastic prompt tuning method. We summarize the differences below. First, ProDA focuses on the output embeddings of prompts and employs a Gaussian distribution to model the latent representation by pre-defining $K$ label-specific templates. In contrast, ours is a Bayesian prompt generation method based on input embeddings, which aims to generate label-specific stochastic prompts in a data-driven manner rather than relying on handcrafted prompts. Second, we introduce the CT regularization to align the textual prompt domain and the visual patch domain and develop a combined loss to optimize the proposed model end-to-end, whereas ProDA employs an EM algorithm to train its parameters. Last, the learned transport plan provides an interpretable tool to visualize the learned prompts, which ProDA does not offer.

Empirically, we report the Base-to-New comparisons (H score) in Table~\ref{proda}. Because the code of ProDA has not been released, we can only compare with the results reported in previous work~\citep{derakhshani2022variational} under the same setting on the Base-to-New task. From Table~\ref{proda}, we find that our proposed method outperforms ProDA on 9/11 datasets and achieves the best average H score.



\begin{table}[!th]
\caption{\small{H score of CoCoOp, ProDA, and PBPrompt on Base-to-New task.}}
\label{proda}
\centering
    \scalebox{0.83}{
    \begin{tabular}{lcccccccccccc}
    \toprule[1.5pt]
    % \cmidrule(lr){2-2}\cmidrule(lr){3-13}
    \textbf{Method}
    &\rotatebox{70}{\textbf{ImageNet}} 
    &\rotatebox{70}{\textbf{Caltech}}
    &\rotatebox{70}{\textbf{Pets}}
    &\rotatebox{70}{\textbf{Cars}}
    &\rotatebox{70}{\textbf{Flowers}}
    &\rotatebox{70}{\textbf{Food}}
    &\rotatebox{70}{\textbf{Aircraft}}
    &\rotatebox{70}{\textbf{SUN}}
    &\rotatebox{70}{\textbf{DTD}}
    &\rotatebox{70}{\textbf{EuroSAT}}
    &\rotatebox{70}{\textbf{UCF}}
    &\rotatebox{70}{\textbf{Average}}\\
    \midrule
    CoCoOp         & 73.10 & 95.84 & 96.43 &72.01 &81.71 &90.99 &27.74 &78.27 &64.85 &71.21 &77.64 &75.83 \\
    ProDA       & 72.72 &95.68 & 96.62 & 72.91 & 80.66 & 89.43 & \textbf{35.46} & 77.79 & \textbf{66.44} & 73.88 &78.04 &76.65 \\
    PBPrompt     & \textbf{73.76} & \textbf{96.66} & \textbf{96.92} & \textbf{73.02} & \textbf{83.12} & \textbf{91.22} &34.64 & \textbf{78.35} & 66.41 & \textbf{80.34} & \textbf{79.51} & \textbf{77.86} \\
    \bottomrule[1.5pt]
    \end{tabular}}
\end{table}


\subsection{Few-shot Learning Details}

In this section, we provide the complete results on the few-shot learning task using ViT-B/16 and RN50, respectively. Because our model introduces additional learnable parameters, we train for more epochs: the maximum number of epochs is set to 400 for 16/8 shots, 200 for 4/2 shots, and 100 for 1 shot on all datasets.
Table~\ref{tab: vit_fsl} reports the detailed accuracy underlying Fig.~\ref{fig:fsl} in the manuscript. In addition, we ablate the backbone using RN50 with CoOp~\cite{zhou2022learning}, PLOT~\cite{chen2022prompt}, and our PBPrompt, and report the results in Table~\ref{tab: rn50_fsl}. PBPrompt remains competitive with the other baselines, especially on 1/2/4 shots. The results in the two tables are stable across backbones, which demonstrates the robustness of our model.
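
As a reading aid for the Average rows, the following worked check assumes that each Average entry is the unweighted mean over the 11 datasets. This is an assumption on our part that matches the reported entries, not a definition stated in the manuscript. Under it, the PBPrompt 1-shot entry of Table~\ref{tab: vit_fsl} is recovered as
\begin{align*}
\text{Avg}_{\text{1-shot}}
  &= \tfrac{1}{11}\,(69.55 + 93.92 + 52.03 + 60.92 \\
  &\qquad + 27.41 + 75.43 + 85.55 + 91.20 \\
  &\qquad + 67.30 + 68.10 + 71.45) \\
  &= 762.86 / 11 \approx 69.35 .
\end{align*}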


\begin{table}[t]
    \centering
    \caption{The few-shot learning results of various methods on 11 datasets using \textbf{ViT-B/16}. We report the average value over three different seeds.}
    \label{tab: vit_fsl}
    \begin{tabular}{ccccccc}
    \toprule
         Dataset &Methods &1 shot & 2 shots & 4 shots & 8 shots & 16 shots  \\
         \hline
         \multirow{4}{*}{ImageNet} &CoOp &68.10 &69.25 &69.53 &70.40 &71.51\\
         &CoCoOp &68.40 &69.13 &69.30 &70.45 &71.60 \\
         &PLOT &67.40 &68.80 &69.90 &70.15 &71.37\\
         % \rowcolor{gray!15}
         &PBPrompt &69.55 &69.90 &70.50 &71.62 &71.86\\
         \hline
         \multirow{4}{*}{Caltech101} 
         &CoOp &93.13 &92.97 &94.50 &94.73 &95.50\\
         &CoCoOp &92.27 &93.47 &94.27 &94.73 &95.21 \\
         &PLOT &87.90 &89.53 &91.87 &92.90 &93.80 \\
         &PBPrompt &93.92 &94.40 &94.83 &95.13 &95.37\\
         \hline
         \multirow{4}{*}{DTD} 
         &CoOp &50.03 &53.93 &59.23 &64.37 &68.40\\
         &CoCoOp &50.80 &54.10 &58.37 &63.07 &67.67 \\
         &PLOT &52.20 &56.03 &58.37 &65.57 &70.17\\
         &PBPrompt &52.03 &56.20 &59.63 &64.17 &68.50\\
         \hline
         \multirow{4}{*}{EuroSAT} 
         &CoOp &51.80 &66.33 &65.87 &74.77 &83.07\\
         &CoCoOp &51.93 &64.17 &67.20 &75.07 &82.87 \\
         &PLOT &59.77 &69.03 &73.50 &80.03 &83.47 \\
         &PBPrompt &60.92 &68.77 &72.84 &80.14 &84.21\\
         \hline
         \multirow{4}{*}{FGVCAircraft} 
         &CoOp &26.20 &27.90 &30.03 &36.00 &39.73\\
         &CoCoOp &16.83 &26.47 &29.27 &36.17 &38.60 \\
         &PLOT &20.20 &21.87 &23.90 &27.13 &30.57\\
         &PBPrompt &27.41 &29.03 &31.89 &36.10 &39.54\\
         \hline
         \multirow{4}{*}{Flowers102} 
         &CoOp &73.00 &81.90 &86.50 &94.13 &96.20\\
         &CoCoOp &76.80 &86.40 &91.80 &93.98 &96.30 \\
         &PLOT &70.50 &80.57 &88.70 &93.77 &95.70\\
         &PBPrompt &75.43 &83.37 &88.90 &94.00 &96.32\\
         \hline
         \multirow{4}{*}{FOOD101} 
         &CoOp &82.70 &82.77 &83.63 &84.00 &85.33\\
         &CoCoOp &83.35 &82.85 &82.75 &84.20 &85.46 \\
         &PLOT &69.33 &72.73 &75.17 &76.70 &77.87\\
         &PBPrompt &85.55 &86.25 &86.30 &87.00 &87.10\\
         \hline
         \multirow{4}{*}{OxfordPets} 
         &CoOp &90.27 &89.93 &92.20 &92.47 &92.47\\
         &CoCoOp &90.20 &88.87 &91.77 &91.73 &92.10 \\
         &PLOT &82.93 &85.40 &85.97 &87.40 &88.10\\
         &PBPrompt &91.20 &91.73 &92.63 &93.00 &93.40\\
         \hline
         \multirow{4}{*}{StanfordCars} 
         &CoOp &67.03 &70.13 &73.27 &76.90 &79.13\\
         &CoCoOp &67.13 &68.83 &72.03 &76.10 &77.45 \\
         &PLOT &45.97 &51.43 &53.97 &59.62 &64.51\\
         &PBPrompt &67.30 &70.20 &73.60 &77.23 &79.47\\
         \hline
         \multirow{4}{*}{SUN397} &CoOp &67.32 &67.67 &70.14 &72.37 &74.57\\
         &CoCoOp &65.60 &66.13 &69.85 &70.35 &73.13 \\
         &PLOT &55.17 &59.40 &62.73 &65.80 &67.00\\
         &PBPrompt &68.10 &69.35 &70.21 &72.20 &74.15\\
         \hline
         \multirow{4}{*}{UCF101} 
         &CoOp &70.07 &73.30 &77.87 &80.10 &82.40\\
         &CoCoOp &70.80 &73.50 &76.15 &79.23 &82.30 \\
         &PLOT &49.63 &53.20 &60.80 &67.23 &70.50\\
         &PBPrompt &71.45 &74.90 &77.60 &79.77 &80.93\\
         \hline
         \multirow{4}{*}{Average} &CoOp &67.24 &70.55 &70.02 &76.36 &78.92 \\
         &CoCoOp &66.74 &70.36 &72.98 &75.92 &78.43 \\
         &PLOT &60.09 &64.36 &67.69 &71.48 &73.91 \\
         &PBPrompt &69.35 &72.19 &74.45 &77.31 &79.17 \\
         \bottomrule
    \end{tabular}
\end{table}
\begin{table}[H]
    \centering
    \caption{The few-shot learning results of various methods on 11 datasets using \textbf{RN50}. We report the average value over three different seeds.}
    \label{tab: rn50_fsl}
    \scalebox{0.85}{
    \begin{tabular}{ccccccc}
    \toprule
         Dataset &Methods &1 shot & 2 shots & 4 shots & 8 shots & 16 shots  \\
         \hline
         \multirow{3}{*}{Caltech101} 
         &CoOp &87.51 $\pm$ 1.02 &87.84 $\pm$ 1.10 &89.52 $\pm$ 0.80 &90.28 $\pm$ 0.42 &91.99 $\pm$ 0.31 \\
         &PLOT &89.83 $\pm$ 0.33 &90.67 $\pm$ 0.21 &90.80 $\pm$ 0.20 &91.54 $\pm$ 0.33 &92.24 $\pm$ 0.38 \\
         &PBPrompt &90.21 $\pm$ 0.45 &90.86 $\pm$ 0.24 &90.92 $\pm$ 0.10 &91.37 $\pm$ 0.21 &92.03 $\pm$ 0.17\\
         \hline
         \multirow{3}{*}{DTD} 
         &CoOp &43.62 $\pm$ 1.96 &45.35 $\pm$ 0.31 &53.94 $\pm$ 1.37 &59.69 $\pm$ 0.13 &62.51 $\pm$ 0.25\\
         &PLOT &46.55 $\pm$ 2.62 &51.24 $\pm$ 1.95 &56.03 $\pm$ 0.43 &61.70 $\pm$ 0.35 &65.60 $\pm$ 0.82\\
         &PBPrompt &47.21 $\pm$ 1.22 &52.08 $\pm$ 0.78 &56.97 $\pm$ 0.55 &61.84 $\pm$ 0.21 &65.58 $\pm$ 0.33\\
         \hline
         \multirow{3}{*}{EuroSAT} 
         &CoOp &52.12 $\pm$ 5.46 &59.00 $\pm$ 3.48 &68.61 $\pm$ 3.54 &77.08 $\pm$ 2.42 &83.69 $\pm$ 0.47\\
         &PLOT &54.05 $\pm$ 5.95 &64.21 $\pm$ 1.90 &72.36 $\pm$ 2.29 &78.15 $\pm$ 2.65 &82.23 $\pm$ 0.91 \\
         &PBPrompt &57.34 $\pm$ 3.12 &64.67 $\pm$ 1.21 &73.10 $\pm$ 1.34 &78.39 $\pm$ 1.72 &82.20 $\pm$ 0.32\\
         \hline
         \multirow{3}{*}{FGVCAircraft} 
         &CoOp &8.59 $\pm$ 5.79 &16.52 $\pm$ 2.38 &20.63 $\pm$ 2.46 &26.63 $\pm$ 0.86 &31.43 $\pm$ 0.96\\
         &PLOT &17.90 $\pm$ 0.09 &18.94 $\pm$ 0.44 &22.36 $\pm$ 0.42 &26.17 $\pm$ 0.29 &31.49 $\pm$ 0.89\\
         &PBPrompt &17.49 $\pm$ 1.24 &18.72 $\pm$ 0.45 &22.55 $\pm$ 0.44 &26.71 $\pm$ 0.31 &31.44 $\pm$ 0.64\\
         \hline
         \multirow{3}{*}{Flowers102} 
         &CoOp &67.98 $\pm$ 1.98 &77.58 $\pm$ 1.46 &86.10 $\pm$ 1.05 &91.27 $\pm$ 0.83 &94.49 $\pm$ 0.40\\
         &PLOT &71.72 $\pm$ 0.97 &81.19 $\pm$ 0.79 &87.82 $\pm$ 0.20 &92.43 $\pm$ 0.25 &94.76 $\pm$ 0.34\\
         &PBPrompt &70.84 $\pm$ 1.23 &81.35 $\pm$ 0.87 &87.57 $\pm$ 0.34 &92.44 $\pm$ 0.31 &94.60 $\pm$ 0.24\\
         \hline
         \multirow{3}{*}{FOOD101} 
         &CoOp &74.25 $\pm$ 1.52 &72.61 $\pm$ 1.33 &73.49 $\pm$ 2.03 &71.58 $\pm$ 0.79 &74.48 $\pm$ 0.15\\
         &PLOT &77.74 $\pm$ 0.47 &77.70 $\pm$ 0.02 &77.21 $\pm$ 0.43 &75.31 $\pm$ 0.30 &77.09 $\pm$ 0.18\\
         &PBPrompt &77.35 $\pm$ 0.33 &77.93 $\pm$ 0.12 &78.09 $\pm$ 0.21 &77.79 $\pm$ 0.20 &77.75 $\pm$ 0.12\\
         \hline
         \multirow{3}{*}{ImageNet} 
         &CoOp &56.99 $\pm$ 1.03 &56.40 $\pm$ 0.87 &58.48 $\pm$ 0.47 &60.39 $\pm$ 0.57 &61.91 $\pm$ 0.17\\
         &PLOT &59.54 $\pm$ 0.16 &60.64 $\pm$ 0.06 &61.49 $\pm$ 0.23 &61.92 $\pm$ 0.09 &63.01 $\pm$ 0.13\\
         &PBPrompt &60.54 $\pm$ 0.12 &60.72 $\pm$ 0.09 &61.68 $\pm$ 0.13 &62.00 $\pm$ 0.09 &62.95 $\pm$ 0.11\\
         \hline
         \multirow{3}{*}{OxfordPets} 
         &CoOp &85.99 $\pm$ 0.28 &82.22 $\pm$ 2.15 &86.65 $\pm$ 0.97 &85.36 $\pm$ 1.00 &87.02 $\pm$ 0.89\\
         &PLOT &87.49 $\pm$ 0.16 &86.64 $\pm$ 0.06 &88.63 $\pm$ 0.23 &87.39 $\pm$ 0.09 &87.21 $\pm$ 0.13\\
         &PBPrompt &87.75 $\pm$ 0.25 &86.32 $\pm$ 0.75 &89.08 $\pm$ 0.23 &88.34 $\pm$ 0.14 &88.45 $\pm$ 0.21\\
         \hline
         \multirow{3}{*}{StanfordCars} 
         &CoOp &55.81 $\pm$ 1.67 &58.41 $\pm$ 0.43 &62.74 $\pm$ 0.16 &67.64 $\pm$ 0.06 &73.60 $\pm$ 0.19\\
         &PLOT &56.60 $\pm$ 0.36 &57.52 $\pm$ 0.71 &63.41 $\pm$ 0.29 &67.03 $\pm$ 0.50 &72.80 $\pm$ 0.75\\
         &PBPrompt &57.14 $\pm$ 0.21 &57.76 $\pm$ 0.34 &63.53 $\pm$ 0.20 &67.64 $\pm$ 0.12 &73.75 $\pm$ 0.34\\
         \hline
         \multirow{3}{*}{SUN397} 
         &CoOp &60.12 $\pm$ 0.82 &59.60 $\pm$ 0.76 &63.24 $\pm$ 0.63 &65.77 $\pm$ 0.02 &68.36 $\pm$ 0.66\\
         &PLOT &62.47 $\pm$ 0.43 &61.71 $\pm$ 0.65 &65.09 $\pm$ 0.43 &67.48 $\pm$ 0.04 &69.96 $\pm$ 0.24\\
         &PBPrompt &62.51 $\pm$ 0.49 &63.45 $\pm$ 0.66 &64.77 $\pm$ 0.51 &67.35 $\pm$ 0.08 &69.93 $\pm$ 0.17\\
         \hline
         \multirow{3}{*}{UCF101} 
         &CoOp &62.13 $\pm$ 1.14 &64.05 $\pm$ 0.99 &67.79 $\pm$ 0.71 &72.71 $\pm$ 0.50 &76.90 $\pm$ 0.50\\
         &PLOT &64.53 $\pm$ 0.70 &66.83 $\pm$ 0.43 &69.60 $\pm$ 0.67 &74.45 $\pm$ 0.50 &77.26 $\pm$ 0.64\\
         &PBPrompt &64.29 $\pm$ 0.84 &66.88 $\pm$ 0.32 &69.95 $\pm$ 0.55 &74.86 $\pm$ 0.47 &77.35 $\pm$ 0.52\\
         \hline
         \multirow{3}{*}{Average} 
         &CoOp &59.56 $\pm$ 2.06 &61.51 $\pm$ 1.39 &66.47 $\pm$ 1.29 &69.85 $\pm$ 0.69 &73.31 $\pm$ 0.42\\
         &PLOT &62.58 $\pm$ 1.13 &65.21 $\pm$ 0.72 &68.62 $\pm$ 0.52 &71.23 $\pm$ 0.51 &73.97 $\pm$ 0.54\\
         &PBPrompt &62.97 $\pm$ 0.86 &65.52 $\pm$ 0.52 &68.93 $\pm$ 0.42 &71.70 $\pm$ 0.35 &74.18 $\pm$ 0.29\\
         \bottomrule
    \end{tabular}}
\end{table}

In addition, for a fair comparison, we re-run ProGrad~\cite{zhu2022prompt} with ViT-B/16 and a prompt length of 4 on 1/2/4 shots, as shown in Table~\ref{tab: prograd_few}. Whereas ProGrad only updates the prompt along gradients that are aligned with the CLIP knowledge, our approach aims to exploit CLIP knowledge by finding stochastic prompts for each class, showing greater potential for capturing diverse visual attributes and improving generalizability.

\begin{table}
    \centering
    \caption{Comparison with ProGrad on few-shot learning using \textbf{ViT-B/16}. We report the average value over three different seeds.}
    \label{tab: prograd_few}
    \scalebox{0.85}{
    \begin{tabular}{ccccc}
    \toprule
         Dataset &Methods &1 shot & 2 shots & 4 shots  \\
         \hline
        % \multirow{3}{*}{Caltech101} 
        % &CoOp &87.51 &87.84 &89.52 \\
        % &ProGrad &89.83 &90.67 &90.80\\
        % &PBPrompt &90.21 &90.86 &90.92 \\
        %          \hline
        \multirow{3}{*}{Caltech101} 
         &CoOp &93.13 &92.97 &94.50 \\
         &ProGrad &93.67 &94.33 &94.60\\
         &PBPrompt &\textbf{93.92} &\textbf{94.40} &\textbf{94.83} \\
                  \hline
         \multirow{3}{*}{DTD} 
         &CoOp &50.03 &53.94 &59.23 \\
         &ProGrad &51.12 &52.30 &56.00 \\
         &PBPrompt &\textbf{52.03} &\textbf{56.20} &\textbf{59.63} \\
                  \hline
         \multirow{3}{*}{EuroSAT} 
         &CoOp &51.80 &66.33 &65.87 \\
         &ProGrad &56.65 &60.65 &68.70 \\
         &PBPrompt &\textbf{60.92} &\textbf{68.77} &\textbf{72.84} \\
                  \hline
         \multirow{3}{*}{FOOD101} 
         &CoOp &82.70 &82.77 &83.63 \\
         &ProGrad &85.55 &85.75 &86.17 \\
         &PBPrompt &\textbf{85.55} &\textbf{86.25} &\textbf{86.30} \\
                  \hline
         \multirow{3}{*}{SUN397} 
         &CoOp &67.32 &67.67 &70.14 \\
         &ProGrad &67.92 &68.95 &70.17 \\
         &PBPrompt &\textbf{68.10} &\textbf{69.35} &\textbf{70.21} \\
                  \hline
         \multirow{3}{*}{UCF101} 
         &CoOp &70.07 &73.30 &77.87 \\
         &ProGrad &\textbf{72.65} &73.60 &77.40 \\
         &PBPrompt &71.45 &\textbf{74.90} &\textbf{77.60} \\
         \bottomrule
    \end{tabular}}
\end{table}



\vspace{4cm}
\subsection{Base-to-New Generalization Details}

In this section, we report the complete results on base-to-new generalization using ViT-B/16 and RN50, respectively. Table~\ref{tab: vit_b2n} reports the detailed accuracy underlying Fig.~\ref{fig: base2new} in the manuscript. We also provide comprehensive results using RN50 with CoCoOp~\cite{zhou2022conditional}, CoPLOT~\cite{chen2022prompt}, and our PBPrompt in Table~\ref{tab: rn50_b2n}.
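
For reference, the H column is the harmonic mean of base-class and new-class accuracy. The symbols $\mathrm{Acc}_{\text{base}}$ and $\mathrm{Acc}_{\text{new}}$ below are notation introduced here for illustration, not the manuscript's. As a worked check, the PBPrompt ImageNet entry of Table~\ref{tab: vit_b2n} follows from
\begin{align*}
H &= \frac{2\,\mathrm{Acc}_{\text{base}}\,\mathrm{Acc}_{\text{new}}}{\mathrm{Acc}_{\text{base}} + \mathrm{Acc}_{\text{new}}}, \\
H_{\text{ImageNet}} &= \frac{2 \times 76.90 \times 70.87}{76.90 + 70.87} \approx 73.76 .
\end{align*}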

\begin{table}[h]
\caption{The base-to-new generalization accuracy of various baselines on 11 datasets using \textbf{ViT-B/16}. We report the average over three different seeds; models are trained on a 16-shot base set and then evaluated on the held-out new classes. The best and the runner-up results are \textbf{highlighted} and \underline{underlined}, respectively. H: the harmonic mean.}
\label{tab: vit_b2n}
\centering
    \scalebox{0.85}{
   \begin{tabular}{l|ccc|ccc|ccc|ccc}
   \toprule[1.5pt]
   \textbf{} &\multicolumn{3}{c|}{\textbf{Average}} &\multicolumn{3}{c|}{ImageNet} &\multicolumn{3}{c|}{Caltech 101} &\multicolumn{3}{c}{Oxford Pets} \\
   &Base &New &H &Base &New &H &Base &New &H &Base &New &H \\
   \midrule
   CLIP         & 69.34 & \underline{74.22} & 71.69 
                & 72.34 & 68.14 & 70.18 
                & 96.84 & \underline{94.00} & 95.39
                & 91.17 & 97.26 & 94.11\\
   CoOp         & \textbf{82.66} & 63.22 & 71.65 
                & \underline{76.14} & 67.88 & 71.77 
                & \textbf{98.00} & 89.81 & 93.72
                & 93.67 & 95.29 & 94.47\\
    % VPT         & 80.10 & 74.94 & 77.43 
    %             & 76.00 & 70.93 & 73.37 
    %             & 98.00 & 94.93 & 96.44
    %             & 95.67 & 98.00 & 96.8\\
   CoCoOp       & 80.47 & 71.69 & \underline{75.83} 
                & 75.98 & \underline{70.43} & \underline{73.10} 
                & 97.96 & 93.81 & \underline{95.84}
                & \underline{95.20} & \underline{97.69} & \underline{96.43}\\
   % CoCoOp + VPT & 80.10 & 74.94 & 77.43 
   %              & 76.00 & 70.93 & 73.37 
   %              & 98.00 & 94.93 & 96.44
   %              & 95.67 & 98.00 & 96.82\\
   % ProDA        & \underline{81.56} & 72.30 & 76.65
   %              & 75.40 & 70.23 & 72.72 
   %              & \textbf{98.27} & 93.23 & 95.68
                % & 95.43 & 97.83 & 96.62\\
   CoPLOT         &77.20 &60.38 &67.76
                &75.97 &69.23 &72.44
                &96.53 &82.86 &89.17
                &93.45 &79.76 &86.06\\
   % \rowcolor{gray!25}
   % PBPrompt†    & 80.74 & 72.51 & 76.40 
   %              & 75.73 & 69.24 & 72.33
   %              & 97.98 & 94.56 & 96.23
   %              & 95.24 & 96.83 & 96.03\\
   
   CoOp+VPT     & 71.98 & 74.76  & 73.34  
                & 74.73 & 70.60 & 72.60
                & 95.47 & 93.80 & 94.62
                & 90.77 & 97.83 & 96.61 \\
   CoOp+SHIP    & 80.03 & 73.69 & 76.73 
                & 75.87 & 69.95 & 72.79 
                & 97.55 & 95.20 & 96.36 
                & 92.19 & 93.85 & 93.01 \\
   % \bb{CoOp+VPT}     & \bb{71.98}  & \bb{74.76}  & \bb{73.34}  
   %              & \bb{74.73} & \bb{70.60} & \bb{72.60}
   %              & \bb{95.47} & \bb{93.80} & \bb{94.62}
   %              & \bb{90.77} & \bb{97.83} & \bb{96.61} \\
   % \bb{CoOp+SHIP}    & \bb{80.03} & \bb{73.69} & \bb{76.73} 
   %              & \bb{75.87} & \bb{69.95} & \bb{72.79} 
   %              & \bb{97.55} & \bb{95.20} & \bb{96.36} 
   %              & \bb{92.19} & \bb{93.85} & \bb{93.01} \\
                
   \rowcolor{gray!25}
   PBPrompt     & \underline{81.36} & \textbf{74.65} & \textbf{77.86}
                & \textbf{76.90} & \textbf{70.87} & \textbf{73.76}
                & \underline{97.98} & \textbf{95.37}& \textbf{96.66}
                & \textbf{95.83} & \textbf{98.03} & \textbf{96.92} \\
   % \rowcolor{gray!25}
   % CPBPrompt    & 80.88 & \textbf{74.74} & \textbf{77.69} 
   %              & \underline{76.02} & \textbf{70.96} & \textbf{73.40} 
   %              & \underline{98.10} & \textbf{95.54} & \textbf{96.80}
   %              & \textbf{95.97} & \underline{97.98} & \textbf{96.96}\\
   \midrule
   \textbf{} &\multicolumn{3}{c|}{Stanford Cars} &\multicolumn{3}{c|}{Flowers 102} &\multicolumn{3}{c|}{Food 101} &\multicolumn{3}{c}{FGVC Aircraft}\\
   &Base &New &H &Base &New &H &Base &New &H &Base &New &H \\
   CLIP         & 63.37 & \textbf{74.89} & \underline{68.65} 
                & 72.08 & \textbf{77.80} & 74.83
                & 90.10 & 91.22 & 90.66 
                & 27.19 & \textbf{36.29} & \underline{31.09}\\
   CoOp         & \textbf{78.12} & 60.40 & 68.13 
                & \textbf{97.60} & 59.67 & 74.06
                & 88.33 & 82.26 & 85.19 
                & \textbf{40.44} & 22.30 & 28.75\\
   % CoOp + VPT   & 65.27 & \textbf{75.97} & 70.21 
   %              & 72.97 & 75.90 & 74.40
   %              & 90.70 & 91.29 & 90.99 
   %              & 33.41 & 23.71 & 27.74\\
   CoCoOp       & 70.49 & \underline{73.59} & 72.01 
                & 94.87 & 71.75 & \underline{81.71}
                & \underline{90.70} & \underline{91.29} & \underline{90.99} 
                & 33.41 & 23.71 & 27.74\\
   % CoCoOp + VPT & 72.93 & 73.23 & 73.07 
   %              & 95.70 & 70.40 & 81.12
   %              & \textbf{91.03} & \textbf{92.13} & \textbf{91.57} 
   %              & 34.40 & 35.00 & \textbf{34.69}\\
   % ProDA        & \underline{74.70} & 71.20 & \underline{72.91} 
   %              & \textbf{97.70} & 68.68 & 80.66
   %              & 90.30 & 88.57 & 89.43 
   %              & \underline{36.90} & 34.13 & \textbf{35.46}\\
   CoPLOT         &61.41 &42.69 &50.37
                &95.26 &56.03 &70.56
                &88.45 &85.28 &86.84
                &29.63 &16.17 &20.92\\
   % \rowcolor{gray!25}
   % PBPrompt†    & 72.21 & 70.32 & 71.25 
   %              & 94.77 & 70.96 & 81.15 
   %              & 90.32 & 90.55 & 90.43
   %              & 34.17 & 32.84 & 33.49\\
   CoOp+VPT     & 65.27 & 75.97 & 70.21
                & 72.97 & 75.90 & 74.40 
                & 90.37 & 91.67 & 91.01  
                & 29.57 & 33.80 & 31.54   \\
   CoOp+SHIP    & 68.57 & 73.90 & 71.14 
                & 94.02 & 74.40 & 83.06 
                & 90.54 & 91.03 & 90.87  
                & 34.27 & 32.33 & 33.28   \\
                
   \rowcolor{gray!25}
   PBPrompt     & \underline{72.93} & 73.12 & \textbf{73.02} 
                & \underline{95.47} & \underline{73.60} & \textbf{83.12}
                & \textbf{90.87} & \textbf{91.57} & \textbf{91.22} 
                & \underline{35.47} & \underline{33.84} & \textbf{34.64} \\
   % \rowcolor{gray!25}
   % CPBPrompt    & 73.13 & 73.07 & \textbf{73.10} 
   %              & 95.63 & 72.76 & \textbf{82.64}
   %              & \textbf{90.87} & \textbf{91.62} & \textbf{91.24}
   %              & 33.83 & \underline{34.37} & \underline{34.10}\\
   \midrule
   \textbf{} &\multicolumn{3}{c|}{SUN 397} &\multicolumn{3}{c|}{DTD} &\multicolumn{3}{c|}{EuroSAT} &\multicolumn{3}{c}{UCF 101} \\
   &Base &New &H &Base &New &H &Base &New &H &Base &New &H \\
   CLIP         & 69.36 & 75.35 & 72.23
                & 53.24 & \textbf{59.90} & 56.37 
                & 56.48 & 64.05 & 60.02 
                & 70.53 & \textbf{77.50} & 73.85\\
   CoOp         & \textbf{80.60} & 65.89 & 72.51
                & \textbf{79.44} & 41.18 & 54.24 
                & \textbf{92.19} & 54.74 & \underline{68.69} 
                & \textbf{84.69} & 56.05 & 67.45\\
   % CoOp + VPT   & 73.77 & \textbf{77.90} & 75.77
   %              & 57.67 & 58.70 & 58.18 
   %              & 67.97 & 71.63 & 69.75 
   %              & 73.23 & 74.63 & 73.92\\
   CoCoOp       & \underline{79.74} & \underline{76.86} & \underline{78.27}
                & 77.01 & 56.00 & \underline{64.85} 
                & 87.49 & 60.04 & 71.21 
                & 82.33 & 73.45 & \underline{77.64}\\
   % CoCoOp + VPT & 79.17 & 77.87 & 78.51
   %              & 75.30 & \textbf{60.80} & \textbf{67.27} 
   %              & 80.30 & \textbf{75.30} & 77.71
   %              & 82.53 & 75.77 & 79.00\\
   % ProDA        & 78.67 & 76.93 & 77.79
   %              & \textbf{80.67} & 56.48 & \underline{66.44} 
   %              & 83.90 & 66.00 & 73.88 
   %              & \textbf{85.23} & 72.97 & 78.04\\
   CoPLOT         &78.56 &72.34 &75.32
                &69.87 &53.63 &60.68
                &87.39 &\underline{64.63} &74.30
                &72.71 &41.51 &52.84\\
   % \rowcolor{gray!25}
   % PBPrompt†    & 79.25 & 76.44 & 77.81 
   %              & 76.32 & 54.73 & 63.75
   %              & \underline{89.46} & 67.13 & 76.70
   %              & 82.69 & 74.06 & 78.13\\
   CoOp+VPT     & 73.77 & 77.90 & 75.77  
                & 57.67 & 58.70 & 58.18 
                & 67.97 & 71.63 & 69.75  
                & 73.23 & 74.63 & 73.92   \\
   CoOp+SHIP    & 79.54 & 75.27 & 77.35 
                & 74.88 & 56.88 & 64.65
                & 88.63 & 66.87 & 76.22  
                & 81.08 & 76.85 & 78.91   \\
                
   \rowcolor{gray!25}
   PBPrompt     & 79.30 & \textbf{77.43} & \textbf{78.35}
                & \underline{78.03} & \underline{57.81} & \textbf{66.41}
                & \underline{89.53} & \textbf{72.87} & \textbf{80.34}
                & \underline{82.66} & \underline{76.59} & \textbf{79.51} \\
   % \rowcolor{gray!25}
   % CPBPrompt    & 79.47 & \textbf{77.70} & \textbf{78.57}
   %              & 78.13 & \underline{57.84} & \textbf{66.47}
   %              & 85.90 & \textbf{73.56} & \textbf{79.26}
   %              & 82.63 & \underline{76.73} & \textbf{79.57}\\
    \bottomrule[1.5pt]
   \end{tabular}}
\end{table}
\begin{table}[H]
\caption{The base-to-new generalization accuracy of various baselines on 11 datasets using \textbf{RN50}. We report the average over three different seeds; models are trained on a 16-shot base set and then evaluated on the held-out new classes. The best results are \textbf{highlighted}. H: the harmonic mean.}
\label{tab: rn50_b2n}
\centering
    \scalebox{0.85}{
   \begin{tabular}{l|ccc|ccc|ccc|ccc}
   \toprule[1.5pt]
   \textbf{} &\multicolumn{3}{c|}{\textbf{Average}} &\multicolumn{3}{c|}{ImageNet} &\multicolumn{3}{c|}{Caltech 101} &\multicolumn{3}{c}{Oxford Pets} \\
   &Base &New &H &Base &New &H &Base &New &H &Base &New &H \\
   \midrule
    CoCoOp      &75.7 &64.6 &69.71 
                &\textbf{68.3} &63.1 &65.60
                &95.0 &90.0 &92.43 
                &92.3 &94.6 &92.44\\
   CoPLOT       &\textbf{75.9} &67.6 &71.51
                &68.2 &63.1 &65.55
                &\textbf{95.4} &90.9 &93.09
                &92.1 &95.9 &93.96\\
   \rowcolor{gray!25}
   PBPrompt     &75.3 &\textbf{69.4} &\textbf{72.23}
                &68.2 &\textbf{63.3} &\textbf{65.66}
                &94.5 &\textbf{92.3} &\textbf{93.39}
                &\textbf{92.4} &\textbf{95.9} &\textbf{94.12}\\
   \midrule
   \textbf{} &\multicolumn{3}{c|}{Stanford Cars} &\multicolumn{3}{c|}{Flowers 102} &\multicolumn{3}{c|}{Food 101} &\multicolumn{3}{c}{FGVC Aircraft}\\
   &Base &New &H &Base &New &H &Base &New &H &Base &New &H \\
    CoCoOp      &61.8 &65.3 &63.50
                &\textbf{91.2} &67.5 &77.58
                &85.0 &86.0 &85.50
                &25.5 &25.7 &25.60\\
    CoPLOT      &63.2 &\textbf{66.5} &64.80
                &89.6 &69.2 &78.09
                &\textbf{85.0} &85.2 &85.10
                &\textbf{25.6} &26.6 &\textbf{26.09}\\
   \rowcolor{gray!25}
   PBPrompt     &\textbf{64.6} &65.5 &\textbf{65.05}
                &89.8 &\textbf{71.0} &\textbf{79.30}
                &84.6 &\textbf{86.5} &\textbf{85.54}
                &23.2 &\textbf{27.8} &25.29\\
   \midrule
   \textbf{} &\multicolumn{3}{c|}{SUN 397} &\multicolumn{3}{c|}{DTD} &\multicolumn{3}{c|}{EuroSAT} &\multicolumn{3}{c}{UCF 101} \\
   &Base &New &H &Base &New &H &Base &New &H &Base &New &H \\
   CoCoOp       &75.1 &73.6 &74.34
                &\textbf{73.1} &50.0 &59.38
                &88.9 &33.5 &48.66
                &76.5 &61.6 &68.25\\
   CoPLOT       &\textbf{75.2} &73.2 &74.17
                &72.6 &51.4 &60.19
                &\textbf{91.0} &55.3 &68.79
                &\textbf{77.4} &66.2 &\textbf{71.36}\\
   \rowcolor{gray!25}
   PBPrompt     &75.1 &\textbf{73.7} &\textbf{74.40}
                &70.3 &\textbf{56.2} &\textbf{62.46}
                &89.7 &\textbf{66.2} &\textbf{76.18}
                &76.1 &\textbf{67.1} &71.32\\
    \bottomrule[1.5pt]
   \end{tabular}}
\end{table}

\subsection{Domain Generalization Details}
In this section, we compare our method PBPrompt with PLOT on domain generalization using RN50.
As shown in Table~\ref{tab: domain_generalization_rn50}, our method improves clearly on 3 out of 4 target datasets with the RN50 backbone. We also compare our proposed method with VPT and SHIP on domain generalization using \textbf{ViT-B/16}.

\begin{table}[!th]
\centering
\caption{\small{Cross-domain generalization accuracy of various baselines using \textbf{RN50}. $\Delta$: the improvement of the proposed model over PLOT.}}
\label{tab: domain_generalization_rn50}
    \scalebox{0.8}{
    \begin{tabular}{lcccccc}
    \toprule[1.5pt]
    \textbf{} &\textbf{} &\multicolumn{1}{c}{Source} &\multicolumn{4}{c}{Target} \\
    \cmidrule(lr){3-3}\cmidrule(lr){4-7}
    \textbf{Method} &\textbf{Learnable} 
    &\textbf{ImageNet} 
    &\textbf{ImageNetV2} 
    &\textbf{ImageNet-Sketch} 
    &\textbf{ImageNet-A} 
    &\textbf{ImageNet-R}\\

    \midrule
    CoOp         & \Checkmark & 61.91 & 54.26 & 32.47 & 21.78 & 54.21  \\
    PLOT       & \Checkmark & \textbf{63.01} & \textbf{55.11} & 33.00 & 21.86 & 55.61  \\
    % PLOT \\
    % CoOp + VPT   & \Checkmark & 69.73 & 63.17 & 48.87 & 50.77 & 77.40  \\
    % \rowcolor{gray!25}
    % PBPrompt     & \Checkmark & 70.90 & \textbf{64.40} & \textbf{49.10} & \textbf{51.00} & \textbf{76.40}  \\
    \rowcolor{gray!25}
    PBPrompt     & \Checkmark & 62.95 & 54.77 & \textbf{34.10} & \textbf{24.85} & \textbf{59.89}  \\
    $\Delta$     & - & \bm{$-0.06$} &\bm{$-0.34$}
    & \bm{$+1.10$}
    & \bm{$+2.99$}
    & \bm{$+4.28$}\\
    
    \bottomrule[1.5pt]
    \end{tabular}}
\end{table}


\begin{table}
\centering
\caption{\small{Cross-domain generalization accuracy of various baselines using \textbf{ViT-B/16}. $\Delta$: the improvement of the proposed model over CoCoOp + SHIP.}}
\label{tab: domain_generalization_vit}
    \scalebox{0.8}{
    \begin{tabular}{lcccccc}
    \toprule[1.5pt]
    \textbf{} &\textbf{} &\multicolumn{1}{c}{Source} &\multicolumn{4}{c}{Target} \\
    \cmidrule(lr){3-3}\cmidrule(lr){4-7}
    \textbf{Method} &\textbf{Learnable} 
    &\textbf{ImageNet} 
    &\textbf{ImageNetV2} 
    &\textbf{ImageNet-Sketch} 
    &\textbf{ImageNet-A} 
    &\textbf{ImageNet-R}\\

    \midrule
    CoOp         & \Checkmark &71.51  &64.20  &47.99  &49.71  &75.21   \\
    CoOp + VPT       & \Checkmark &69.73  &63.17  &48.87  &50.95  &76.24   \\
    CoOp + SHIP       & \Checkmark &70.12  &63.23  &48.65  &50.77  &77.40   \\
    CoCoOp       & \Checkmark &71.02  &64.07  &48.75 &50.63  &76.18   \\
    CoCoOp + VPT       & \Checkmark &70.70  &64.23  &49.20  &51.33  &\textbf{77.00}   \\
    CoCoOp + SHIP       & \Checkmark &70.81  &64.34  &49.25  &51.28  &76.50   \\

    \rowcolor{gray!25}
    PBPrompt     & \Checkmark &\textbf{71.71}  &\textbf{64.53}  &\textbf{49.32}  &\textbf{51.64}  &76.71   \\
    $\Delta$     & - & \bm{$+0.90$} &\bm{$+0.19$}
    & \bm{$+0.07$}
    & \bm{$+0.36$}
    & \bm{$+0.21$}\\
    
    \bottomrule[1.5pt]
    \end{tabular}}
\end{table}

\subsection{Cross-Dataset Transfer Learning Details}
In this section, we compare our method PBPrompt with other CoOp-based methods on cross-dataset transfer learning using ViT-B/16.
As shown in Table~\ref{tab: cross_dataset_SHIP}, the proposed method achieves the best accuracy on 7 of the 10 target datasets and shows only a slight drop on the others.


\begin{table}[!th]
\caption{\small{Cross-dataset transfer learning accuracy of CoOp-based methods on source and target datasets using ViT-B/16.
$\Delta$: the improvement of the proposed model over CoOp + SHIP.}}
\label{tab: cross_dataset_SHIP}
\centering
    \scalebox{0.70}{
    \begin{tabular}{lcccccccccccc}
    \toprule[1.5pt]
    \textbf{} &\multicolumn{1}{c}{Source} &\multicolumn{11}{c}{Target} \\
    \cmidrule(lr){2-2}\cmidrule(lr){3-13}
    \textbf{Method}
    &\rotatebox{90}{\textbf{ImageNet}} 
    &\rotatebox{90}{\textbf{Caltech}}
    &\rotatebox{90}{\textbf{Pets}}
    &\rotatebox{90}{\textbf{Cars}}
    &\rotatebox{90}{\textbf{Flowers}}
    &\rotatebox{90}{\textbf{Food}}
    &\rotatebox{90}{\textbf{Aircraft}}
    &\rotatebox{90}{\textbf{SUN}}
    &\rotatebox{90}{\textbf{DTD}}
    &\rotatebox{90}{\textbf{EuroSAT}}
    &\rotatebox{90}{\textbf{UCF}}
    &\rotatebox{90}{\textbf{Average}}\\
    \midrule
    % CoOp         & 71.51 & 93.70 & 89.14 & 65.41 & 68.71 & 85.30 & 18.47 & 64.15 & 41.92 & 46.39 & 66.55 & 63.81 \\
    % CoCoOp       & 71.02 & 94.43 & 90.14 & 65.32 & 71.88 & 86.06 & 22.94 & 67.36 & \textbf{45.73} & 45.37 & 68.21 & 65.74 \\
    % PLOT \\
    ProGrad &71.50  &94.43  &90.14  &65.32  &71.88  &86.06  &22.94  &67.36  &45.73  &45.37  &68.21  &65.74 \\
    CoOp + VPT   & 69.73 & 93.67 & 89.27 & 65.50 & 70.20 & 86.27 & 22.13 & 66.57 & \textbf{46.93} & 47.43 & 67.21 & 65.51 \\
    % \rowcolor{gray!25}
    CoOp + SHIP     & - & 94.04 & 90.38 & 65.55 & 69.67 & \textbf{86.40} & 21.90 & 66.26 & 45.69 & \textbf{48.17} & 68.52 & 65.69 \\
    \rowcolor{gray!25}
    PBPrompt     & \textbf{71.71} & \textbf{94.87} & \textbf{90.62} & \textbf{66.00} & \textbf{72.44} & 86.34 & \textbf{24.82} & \textbf{67.69} & 45.62 & 47.13 & \textbf{68.83} & \textbf{66.40} \\
    % \midrule
    % \rowcolor{gray!25}
    % $\Delta$    & \color{teal}{\bm{$-0.61$}} 
    %             & \color{orange}{\bm{$+0.73$}} & \color{orange}{\bm{$+1.48$}} & \color{teal}{\bm{$-0.60$}} & \color{orange}{\bm{$+1.69$}} & \color{orange}{\bm{$+0.83$}} & \color{orange}{\bm{$+5.48$}} & \color{orange}{\bm{$+3.26$}} & \color{orange}{\bm{$+3.70$}} & \color{teal}{\bm{$-0.19$}} & \color{orange}{\bm{$+0.92$}} & \color{orange}{\bm{$+1.29$}} \\
    % \midrule
    % $\Delta$    & \color{teal}{\bm{$-0.61$}} 
    %         & \color{orange}{\bm{$+0.73$}} & \color{orange}{\bm{$+1.48$}} & \color{teal}{\bm{$-0.60$}} & \color{orange}{\bm{$+1.69$}} & \color{orange}{\bm{$+0.83$}} & \color{orange}{\bm{$+5.48$}} & \color{orange}{\bm{$+3.26$}} & \color{orange}{\bm{$+3.70$}} & \color{teal}{\bm{$-0.19$}} & \color{orange}{\bm{$+0.92$}} & \color{orange}{\bm{$+1.29$}} \\
    % \midrule
    % CoCoOp + VPT & 70.70 & 93.67 & 90.63 & 65.00 & 70.90 & 86.30 & 24.93 & 67.47 & 46.10 & 45.87 & 68.67 & 65.95 \\
    % BPL          & \bb{70.90} & \bb{94.40} & \bb{90.90} & \bb{} & \bb{70.40} & \bb{} & \bb{23.90} & - & \bb{45.60} & - & \bb{68.40} & - \\
    % \rowcolor{gray!25}
    % CPBPrompt   & 70.94 & 94.92 & 90.83 & 65.34 & 72.37 & 86.41 & 24.58 & 67.75 & 45.23 & 45.10 & 68.78 & 66.13 \\
    % \midrule
    $\Delta$     & - & \bm{$+0.83$} & \bm{$+0.24$} & \bm{$+0.45$} & \bm{$+2.77$} & \bm{$-0.06$} & \bm{$+2.92$} & \bm{$+1.43$} & \bm{$-0.07$} & \bm{$-1.04$} & \bm{$+0.31$} & \bm{$+0.71$} \\
    \bottomrule[1.5pt]
    \end{tabular}}
\end{table}


\subsection{Trade-off on Base-to-New Generalization}
The number of training epochs induces a trade-off between performance on base and new classes. Specifically, more training epochs lead to higher accuracy on base classes and lower accuracy on new classes. To illustrate this, we train on ImageNet, Caltech101, DTD, EuroSAT and Flowers102 for 50 more epochs on the base-to-new task. As shown in Table~\ref{tab: trade_off}, increasing the number of training epochs enhances performance on base classes while causing a decline on new classes, and the harmonic mean is only marginally affected.
For example, with more training epochs on Flowers102, our proposed method raises the accuracy on base classes by $1.21$ points and lowers it on new classes by $2.44$ points. This reduces the harmonic mean by $1.14$ points, which is still $0.27$ points higher than CoCoOp.
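
One way to see why the harmonic mean can drop even though base accuracy rises is a first-order sensitivity sketch. Writing $B$ and $N$ for base and new accuracy (notation introduced here for illustration) and $H = 2BN/(B+N)$, we have
\begin{equation*}
\frac{\partial H}{\partial B} = \frac{2N^2}{(B+N)^2}, \qquad
\frac{\partial H}{\partial N} = \frac{2B^2}{(B+N)^2}.
\end{equation*}
Since $B > N$ on all five datasets in Table~\ref{tab: trade_off}, $H$ is more sensitive to $N$ than to $B$, so a drop in new-class accuracy outweighs an equal gain in base-class accuracy.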

\begin{table}[!th]
\centering
\caption{\small{Base-to-new generalization accuracy of our proposed method PBPrompt with 50 additional training epochs on ImageNet, Caltech101, DTD, EuroSAT and Flowers102 using ViT-B/16. $(\cdot)$ denotes the difference from the original results in Table~\ref{tab: vit_b2n}. $\Delta$: the improvement in harmonic mean over CoCoOp (without additional training epochs).}}
\label{tab: trade_off}
    \begin{tabular}{lccccc} 
    \toprule[1.5pt]
    &\textbf{ImageNet}
    &\textbf{Caltech101} 
    &\textbf{Flowers102} 
    &\textbf{DTD} 
    &\textbf{EuroSAT} \\

    \midrule
    Base  & 76.97 \small{(+0.07)} & 98.01 \small{(+0.03)} & 96.68 \small{(+1.21)} & 80.44 \small{(+2.41)}& 91.86 \small{(+2.32)}  \\
    New    & 70.12 \small{(-0.75)} & 94.43 \small{(-0.94)} & 71.16 \small{(-2.44)} & 52.15 \small{(-5.66)} & 68.08 \small{(-4.79)}  \\
    H  & 73.36 \small{(-0.40)} & 96.19 \small{(-0.47)}& 81.98 \small{(-1.14)} & 63.28 \small{(-1.57)} & 78.20 \small{(-2.14)} \\
    $\Delta$     &\bm{$+0.26$}
    & \bm{$+0.35$}
    & \bm{$+0.27$}
    & \bm{$-1.57$}
    & \bm{$+6.99$}\\
    
    \bottomrule[1.5pt]
    \end{tabular}
\end{table}

\subsection{More Ablation Study Details}\label{sec: more}
In this section, we validate that the stochastic prompt generation module, rather than the additional parameters in the inference network, is the crucial factor behind the performance of our proposed method. We also compare against a variant of our method built on the Optimal Transport (OT) framework to test the effectiveness of the adopted CT module.
We build two models for comparison, denoted by $\text{PBPrompt}_{\text{w/o-S}}$ and $\text{PBPrompt}_\text{OT}$. $\text{PBPrompt}_{\text{w/o-S}}$ removes the stochastic prompt generation process and preserves only the inference network.
$\text{PBPrompt}_\text{OT}$ replaces the CT framework with the OT framework.
We then conduct the ablation study on the few-shot task (1/2/4 shots) with ImageNet, Caltech101, Flowers102, DTD and EuroSAT; the results are reported in Table~\ref{tab: ablation1}.

\begin{table}[h]
    \centering
    \caption{Ablation results on five datasets using ViT-B/16. We report the average over three different seeds. The best results are shown in bold.}
    \label{tab: ablation1}
    \begin{tabular}{llccc}
    \toprule[1.5pt]
         Dataset &Methods &1 shot & 2 shots & 4 shots \\
         \hline
         \multirow{4}{*}{ImageNet} 
         &CoOp & 68.10 & 69.25 & 69.53\\
         % \cmidrule(lr){2-5}
         &$\text{PBPrompt}_{\text{w/o-S}}$ & 68.27& 69.30& 69.92\\
         &$\text{PBPrompt}_\text{OT}$ & 69.03& 69.79& 70.23\\ 
         &PBPrompt & \textbf{69.55}& \textbf{69.90}& \textbf{70.50}\\
        \hline
         \multirow{4}{*}{Caltech101} 
         &CoOp & 93.13& 92.97& 94.50\\
         &$\text{PBPrompt}_{\text{w/o-S}}$ & 92.86 & 93.91& 94.51\\
         &$\text{PBPrompt}_\text{OT}$ & 93.39& 93.76& 94.62\\ 
         &PBPrompt & \textbf{93.92}& \textbf{94.40}& \textbf{94.83} \\

        \hline
         \multirow{4}{*}{Flowers102} 
         &CoOp & 73.00& 81.90& 86.56\\
         &$\text{PBPrompt}_{\text{w/o-S}}$ & 73.56& 82.04& 87.00\\
         &$\text{PBPrompt}_\text{OT}$ & 74.16& 82.66& 87.92\\ 
         &PBPrompt & \textbf{75.43}& \textbf{83.37}& \textbf{88.90}\\
        \hline
         \multirow{4}{*}{DTD} 
         &CoOp & 50.03& 53.93& 59.23\\
         &$\text{PBPrompt}_{\text{w/o-S}}$ & 50.65& 54.55& 59.40\\
         &$\text{PBPrompt}_\text{OT}$ & 51.95& 55.66& 59.50\\ 
         &PBPrompt & \textbf{52.03}& \textbf{56.20}& \textbf{59.63}\\
        \hline
         \multirow{4}{*}{EuroSAT} 
         &CoOp & 51.80& 66.33& 65.87\\
         &$\text{PBPrompt}_{\text{w/o-S}}$ & 52.15& 66.97& 68.19\\
         &$\text{PBPrompt}_\text{OT}$ & \textbf{61.10}& 67.21& 71.77\\ 
         &PBPrompt & 60.92& \textbf{68.77}& \textbf{72.84}\\
         \bottomrule[1.5pt]
    \end{tabular}
\end{table}
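
As a reading aid for Table~\ref{tab: ablation1}, the 1-shot ImageNet entries can be decomposed by simple subtraction of the reported accuracies (no additional experiments are involved):
\begin{align*}
\text{PBPrompt}_{\text{w/o-S}} - \text{CoOp} &= 68.27 - 68.10 = 0.17, \\
\text{PBPrompt} - \text{PBPrompt}_{\text{w/o-S}} &= 69.55 - 68.27 = 1.28, \\
\text{PBPrompt} - \text{PBPrompt}_{\text{OT}} &= 69.55 - 69.03 = 0.52 .
\end{align*}
In this setting, most of the gain over CoOp appears once stochastic prompt generation is enabled, whereas the inference network alone contributes little. The other datasets and shot counts can be read in the same way.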


\subsection{Computation Cost Evaluation}\label{sec: cost}
\begin{table}[h]
    \centering
    \caption{Comparison of the number of learnable parameters and inference speed.}
    \label{tab: computation_cost}
    \begin{tabular}{l|cccc}
    \toprule[1.5pt]
         Settings &\textbf{CoOp} &\textbf{CoCoOp} &\textbf{PLOT ($N=4$)} &\textbf{PBPrompt} \\ 
         \midrule
         \# Params &2048 &35360 &8192 &1577984 \\
         Inference speed (images/s) &645 &37 &583 &541\\
    \bottomrule[1.5pt]
    \end{tabular}
\end{table}

In this section, we compare the number of parameters and the inference speed of the baseline methods CoOp~\cite{zhou2022learning}, CoCoOp~\cite{zhou2022conditional}, and PLOT~\cite{chen2022prompt} with 4 prompts against our PBPrompt with 10 samples. We report the number of learnable parameters and the number of images processed per second during inference on the Food101~\cite{bossard2014food} dataset. As shown in Table~\ref{tab: computation_cost}, despite introducing additional learnable parameters, our model achieves an inference speed comparable to CoOp and PLOT and much higher than CoCoOp.
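
To make the comparison concrete, the ratios below are computed directly from the entries of Table~\ref{tab: computation_cost}; they are simple arithmetic on the reported numbers rather than additional measurements:
\begin{align*}
\frac{541}{645} &\approx 0.84 && \text{(PBPrompt throughput relative to CoOp)}, \\
\frac{541}{583} &\approx 0.93 && \text{(PBPrompt throughput relative to PLOT)}, \\
\frac{541}{37} &\approx 14.6 && \text{(PBPrompt throughput relative to CoCoOp)}, \\
\frac{1577984}{2048} &= 770.5 && \text{(PBPrompt parameters relative to CoOp)}.
\end{align*}
That is, a roughly $770\times$ increase in learnable parameters over CoOp corresponds to a throughput reduction of about $16\%$ in this measurement.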

\section{Visualization Details} \label{app_vis}
\renewcommand\thetable{\Alph{section}. \arabic{table}}    
\setcounter{table}{0}
\renewcommand\thefigure{\Alph{section}\arabic{figure}}
\setcounter{figure}{0}

\subsection{Analysis For Visualization}
\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{fig/app_vis.pdf}
    \caption{Visualization of the learned prompts for classes unrelated to the image.}
    \label{fig: app_vis}
\end{figure}

\begin{figure}[H]
    \centering
    \includegraphics[width=\textwidth]{fig/prompt_vis.pdf}
    \caption{Prompt-caption retrieval results.}
    \label{fig: app_prompt}
\end{figure}

To show how the stochastically generated prompts for a given class focus on the visual concepts of images from that class, we provide visualization examples in Fig.~\ref{fig: vis} of the manuscript, employing the transport plans $\pi$ to match textual prompts with visual patches. In the first two rows, we present two images belonging to ``Abyssinian'' and ``Keeshond'', respectively, in OxfordPets. The heatmaps show that the prompts generated for the corresponding class tend to focus on the ears, nose, eyes, and other body parts with category-specific characteristics. In the third row, we select an image belonging to ``Hibiscus'' in OxfordFlowers, where the stochastically generated prompts pay more attention to the stems, stamens, and petals. Similarly, in the fourth row we take an image belonging to ``Bentley Continental Supersports Conv.\ Convertible 2012'' in StanfordCars, and the corresponding prompts concentrate on the car's body, wheels, and roof.

For prompts generated for classes unrelated to the image, we also provide examples showing what they focus on. As shown in Fig.~\ref{fig: app_vis}, most heatmaps concentrate on the environment around the object, while others attend to certain areas of the object that lack a significant correlation with the category-specific attributes.

To explain the learned prompts in the text domain, one direct approach is to visualize the words semantically closest to the generated prompts. Unfortunately, previous work finds that most of the retrieved words fail to explain the prompts~\citep{zhou2022learning}. We therefore adopt MiniGPT-4 to generate diverse captions and report the top-2 captions for each learned prompt according to cosine similarity (computed on CLIP features) in Fig.~\ref{fig: app_prompt}. From the results, we find that 1) the learned prompts indeed capture diverse label-specific concepts; and 2) the retrieved captions of each prompt share close semantics, which demonstrates the coherence of the learned prompts.
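
The caption retrieval step can be sketched as follows. The symbols are illustrative notation introduced here and are not taken from the main text: $\mathbf{f}_p$ is the CLIP feature of a learned prompt $p$, $\mathbf{f}_c$ is the CLIP feature of a candidate caption $c$, and $\mathcal{C}$ is the set of generated captions. Each caption is scored by
\begin{equation*}
s(p, c) = \frac{\mathbf{f}_p^{\top} \mathbf{f}_c}{\lVert \mathbf{f}_p \rVert_2 \, \lVert \mathbf{f}_c \rVert_2}, \qquad c \in \mathcal{C},
\end{equation*}
and the two captions with the largest $s(p, c)$ are reported for each prompt $p$.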


\end{document}
