\section{Background}
\label{sec:background}

Test-time domain generalization~\cite{ambekar2023learning,xiao2022learning,iwasawa2021test} aims to generalize a model $\btheta_{s}$ trained on the source domains $\mathcal{S}$ to an unseen target domain $\mathcal{T}$. Here, $\mathcal{S}$ usually consists of several source domains $\{ D_s \}_{s=1}^{S}$, and $\mathcal{T}$ may likewise consist of several target domains $\{ D_t \}_{t=1}^{T}$. We denote by $(\x_s, \y_s)$ and $(\x_t, \y_t)$ the image and corresponding label pairs from the source domains $\{ D_s \}_{s=1}^{S}$ and the target domains $\{ D_t \}_{t=1}^{T}$, respectively. The objective of test-time domain generalization is to maximize the likelihood of the target data under the source model, $p(D_t|\btheta_s)$, i.e., $p(\y_t|\x_t, \btheta_s)$. \\ 


\begin{figure}[t]
\centering

\includegraphics[width=0.99\linewidth]{figures/PM_Fig1_2.png}


\caption{{\textbf{TNN at test-time. }We do not change the source training setup. }We first project the target data into a lower-dimensional embedding space with the frozen source model, where classes remain separable, as the Johnson-Lindenstrauss lemma (Section~\ref{sec:prop_method}) indicates. We then obtain prototypes for the classes as in~\cite{iwasawa2021test}. Next, in the embedding space, TNN performs a non-parametric neighborhood search $\textbf{h}(\bar{x}_{t})$, followed by prediction of the classification label $\hat{\y_{t}}$.}

\label{fig:method}

\end{figure}

\noindent \textbf{Formulation of parametric methods.} Due to distribution shifts between source and target domains, the source-trained model $\btheta_s$ is likely to fail on unseen target domains $D_t$, producing unreliable predictions with high confidence~\cite{ambekar2023learning,yi2023source}. To prevent this, the source model must be adapted to the target domain at test time by transforming $\btheta_s$ into $\btheta_t$. The most common parametric methods fine-tune the model with norm-based losses \cite{jang2022test,liang2020we,wang2021tent}. The likelihood of the target data is given by:
\begin{equation}
\label{tta} %
\begin{aligned}
    p(\y_t|\x_t, \btheta_{s}) & = \int p(\y_t|\x_t, \btheta_{t}) p(\btheta_{t}|\x_t, \btheta_{s}) d \btheta_{t} \\
    & \approx p(\y_t|\x_t, \btheta^*_{t}),
\end{aligned}
\end{equation}


\noindent with the integral over the distribution $p(\btheta_{t}|\x_t, \btheta_{s})$ usually approximated by maximum a posteriori (MAP) estimation~\cite{ambekar2023learning}. The final generalized MAP model $\btheta^*_{t}$ is obtained by fine-tuning the parameters with one or multiple rounds of backpropagation using a norm-based unsupervised loss function, such as entropy minimization \cite{wang2021tent}, pseudo labeling \cite{liang2020we}, or task-specific losses \cite{liu2021ttt++,sun2020test}.
However, fine-tuning the model parameters through multiple rounds of gradient-based optimization makes parametric methods time-consuming and computationally expensive, and also sensitive to hyperparameter settings. \\
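
\noindent As an illustrative reading of Eq.~(\ref{tta}), the MAP approximation can be viewed as replacing the posterior over $\btheta_{t}$ by a point mass at its mode. The Dirac delta $\delta$, the step size $\eta$, and the class index $c$ below are notation introduced here for illustration and are not taken from the cited works:
\[
\begin{aligned}
p(\btheta_{t}|\x_t, \btheta_{s}) & \approx \delta(\btheta_{t} - \btheta^*_{t}), \qquad \btheta^*_{t} = \operatorname*{arg\,max}_{\btheta_{t}} \, p(\btheta_{t}|\x_t, \btheta_{s}), \\
\btheta_{t} & \leftarrow \btheta_{t} - \eta \, \nabla_{\btheta_{t}} \Big( - \sum_{c} p(\y_t = c|\x_t, \btheta_{t}) \log p(\y_t = c|\x_t, \btheta_{t}) \Big).
\end{aligned}
\]
The second line sketches a single entropy-minimization step in the spirit of~\cite{wang2021tent}, started from $\btheta_{t} = \btheta_{s}$; in practice, such methods may update only a subset of the parameters.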


\noindent {\textbf{Non-parametric methods. }}
{
To counter the above limitations,
recent non-parametric methods such as \cite{iwasawa2021test} obtain class representations as prototypes from the weights of the source-trained linear classifier, i.e., without the need for
MAP approximation or gradient-based optimization. Next, they obtain pseudo labels for the incoming target data based on the distance to those prototypes.
After each incoming batch of target data, confident samples are selected by applying an entropy threshold and are used to update the prototypes via simple adjustments to the classifier.
}
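
\noindent To make the prototype construction concrete, the following sketch summarizes our reading of the procedure in~\cite{iwasawa2021test}. The symbols $\mathbf{w}_k$ (classifier weight vector of class $k$), $\mathbf{z}_t$ (feature of $\x_t$ under the frozen source backbone), $\hat{y}_t$ (pseudo label of $\x_t$), $\mathbb{S}_k$ (support set of class $k$), $H$ (prediction entropy), and $\alpha_k$ (entropy threshold) are notation assumed here for illustration:
\[
\begin{aligned}
\mathbb{S}_k & \leftarrow \Big\{ \tfrac{\mathbf{w}_k}{\|\mathbf{w}_k\|} \Big\} && \text{(initialization)}, \\
\mathbb{S}_k & \leftarrow \mathbb{S}_k \cup \Big\{ \tfrac{\mathbf{z}_t}{\|\mathbf{z}_t\|} \Big\} \ \ \text{if } \hat{y}_t = k && \text{(pseudo-label update)}, \\
\mathbb{S}_k & \leftarrow \{ \mathbf{z} \in \mathbb{S}_k : H(\mathbf{z}) \leq \alpha_k \} && \text{(entropy filtering)}, \\
\mathbf{c}_k & = \frac{1}{|\mathbb{S}_k|} \sum_{\mathbf{z} \in \mathbb{S}_k} \mathbf{z} && \text{(prototype)}.
\end{aligned}
\]
In this reading, the prediction for $\x_t$ is the class whose prototype $\mathbf{c}_k$ has the largest inner product with $\mathbf{z}_t$, so no gradient step is involved.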





\section{Method}
\label{sec:prop_method}
\textbf{Source training.} Recent studies have shown that empirical risk minimization (ERM) \cite{vapnik1991principles,gulrajani2020search} during source training enables models to generalize well under distribution shifts. Other methods, such as \cite{xiao2022learning,zhang2023adanpc,ambekar2023learning,xiao2023energy}, include additional objectives to be minimized during source training.
However, the need to interfere with the training procedure limits the applicability of such approaches.
Therefore, we aim to develop a method that does not modify the source training procedure, making it applicable to any pretrained model without additional requirements. Specifically, as in \cite{vapnik1991principles,gulrajani2020search}, given multiple source domains $\{ D_s \}_{s=1}^{S}$, a source model $\btheta_{s}$, such as ResNet-18, and a loss function $\mathcal{L}$, such as cross-entropy, we minimize the total risk $\mathbb{E}_{(\x_{s}, \y_{s}) \sim D_s }[\mathcal{L}(\btheta_{s}(\x_{s}), \y_{s})]$. \\
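
\noindent Written out over finite samples, this ERM objective can be sketched as follows, where $N_s$ (the number of training pairs in domain $D_s$), the sample index $i$, and the equal weighting of domains are assumptions made here for illustration:
\[
\btheta_{s} = \operatorname*{arg\,min}_{\btheta} \; \frac{1}{S} \sum_{s=1}^{S} \frac{1}{N_s} \sum_{i=1}^{N_s} \mathcal{L}\big(\btheta(\x_{s}^{(i)}), \y_{s}^{(i)}\big).
\]
No auxiliary objective enters this expression, so any model pretrained in this standard way can serve as $\btheta_{s}$.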

\noindent \textbf{Test-time generalization via nearest neighbors.} Our approach can be summarized as follows. At test time, in a non-parametric way, we first compute the source prototypes following~\cite{iwasawa2021test}. Next, given a batch of target data, we obtain the nearest neighbors for classification and adjust the classifier weights {as described below.}\\

\noindent \textbf{Existence of nearest neighbors. } We posit that, for a sufficiently trained model that separates classes reasonably well in the source domain, cases that are similar in the high-dimensional image space lie close to each other in the lower-dimensional embedding space learned on the source data. {This is motivated by the Johnson-Lindenstrauss (JL) lemma~\cite{johnson1984extensions}, stated below:}

Given a set of points \( \{x_i : i = 1, \dots, M\} \) in \( \mathbb{R}^m \), the JL lemma~\cite{johnson1984extensions} states that if \( n \geq c \, \epsilon^{-2} \log M \) for a suitable absolute constant $c$, with $0 < \epsilon < 1$, then there exists a linear map \( A : \mathbb{R}^m \rightarrow \mathbb{R}^n \) such that for all \( i \neq j \):
\[ 1 - \epsilon \leq \frac{\|A(x_i) - A(x_j)\|}{\|x_i - x_j\|} \leq 1 + \epsilon. \]
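
\noindent To give a sense of scale, consider one commonly quoted explicit form of the bound, $n \geq 4 \ln M / (\epsilon^2/2 - \epsilon^3/3)$, which is stated for squared distances; this specific constant is an assumption for illustration and is not taken from~\cite{johnson1984extensions}. For $M = 10^4$ points:
\[
\begin{aligned}
\epsilon = 0.5: \quad & n \geq \frac{4 \ln 10^4}{0.125 - 0.0417} \approx \frac{36.84}{0.0833} \approx 442, \\
\epsilon = 0.1: \quad & n \geq \frac{4 \ln 10^4}{0.005 - 0.00033} \approx \frac{36.84}{0.00467} \approx 7.9 \times 10^{3}.
\end{aligned}
\]
In both cases the required dimension $n$ depends only on $M$ and $\epsilon$, not on the input dimension $m$. Since $1 - \epsilon \leq \sqrt{1 - \epsilon}$ and $\sqrt{1 + \epsilon} \leq 1 + \epsilon$ for $0 < \epsilon < 1$, a guarantee on squared distances implies the unsquared form above.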




\noindent At test time, building on the above lemma, our approach first computes the source model prototypes for each class in the lower-dimensional space. To do so, as in \cite{iwasawa2021test}, we initialize the class-specific prototypes by aggregating the weights of the source-trained linear classifier layer. When receiving new test-time data $\x_{t}$, we project it into the lower-dimensional space using the source-trained model, which approximately preserves distances in the embedding space, as suggested by the JL lemma. In this embedding space, we assign $\x_{t}$ the label of its nearest class prototype based on a distance measure (see below). Finally, we update the class-specific prototypes with the new sample to reflect the new target characteristics. This allows us to classify new samples without extensive computation or any optimization scheme for finding the points nearest to the prototypes. \\
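
\noindent The assignment and update steps can be sketched as follows. Here $\mathbf{z}_t$ (the embedding of $\x_t$), $\mathbf{c}_k$ (the prototype of class $k$), $n_k$ (the number of samples aggregated into $\mathbf{c}_k$ so far), and the running-mean update are illustrative assumptions, with the cosine distance chosen to match the distance used later in this section:
\[
\begin{aligned}
d(\mathbf{z}_t, \mathbf{c}_k) & = 1 - \frac{\mathbf{z}_t^{\top} \mathbf{c}_k}{\|\mathbf{z}_t\| \, \|\mathbf{c}_k\|}, \qquad \hat{k} = \operatorname*{arg\,min}_{k} \, d(\mathbf{z}_t, \mathbf{c}_k), \\
\mathbf{c}_{\hat{k}} & \leftarrow \frac{n_{\hat{k}} \, \mathbf{c}_{\hat{k}} + \mathbf{z}_t}{n_{\hat{k}} + 1}, \qquad n_{\hat{k}} \leftarrow n_{\hat{k}} + 1.
\end{aligned}
\]
Both steps require only inner products and averages, so no backpropagation is involved.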











\noindent  \textbf{Selecting neighbors with dynamic voting.} Since not all neighbors provide accurate information about the target data, i.e., some of them are noisy~\cite{dubey2021adaptive}, we select valid neighbors by calculating the distance between the initial prototypes and the new classifier weights obtained from incoming neighbor samples at test time.
We use the cosine distance here, although in principle any other distance metric can be used.
Next, we use dynamic voting $h(\bar{\x}_{t})$ to obtain the most useful neighbors, i.e., we aggregate each neighbor's prediction and calculate the mean of the new weights obtained to determine the final weights of the classifier. When a new batch of samples arrives, the pseudo labels are predicted based on these classifier weights. This process is repeated iteratively for each new incoming batch of data.
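
\noindent As a toy numerical illustration of this selection step, consider a single class in a two-dimensional embedding space. The prototype $\mathbf{c}$, the neighbors $\mathbf{z}_1, \mathbf{z}_2$, and the threshold $\tau = 0.5$ are hypothetical values, and the thresholding rule is only one possible way to realize the voting $h$:
\[
\begin{aligned}
\mathbf{c} = (1, 0), \quad \mathbf{z}_1 = (0.8, 0.6): \quad & d(\mathbf{z}_1, \mathbf{c}) = 1 - \tfrac{0.8}{1 \cdot 1} = 0.2 \leq \tau \quad \text{(kept)}, \\
\mathbf{c} = (1, 0), \quad \mathbf{z}_2 = (0, 1): \quad & d(\mathbf{z}_2, \mathbf{c}) = 1 - \tfrac{0}{1 \cdot 1} = 1 > \tau \quad \text{(discarded)}, \\
& \mathbf{w} = \tfrac{1}{2} (\mathbf{c} + \mathbf{z}_1) = (0.9, 0.3).
\end{aligned}
\]
Here $d$ denotes the cosine distance, i.e., one minus the cosine similarity, and the averaged vector $\mathbf{w}$ plays the role of the updated classifier weight for that class.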


\begin{algorithm}[b!]
\label{alg:1}
\small
\caption{TNN for medical images.\\
{\textbf{Input:}} $\mathcal{T}$: target domain; learned and frozen $\btheta_s$;\\
{\textbf{Output:}} $\btheta_t^{*}$ with adjusted weights
}
\label{alg:2}
\begin{algorithmic}[1]
\FOR{\textit{iter} in $N_{iter}$}
\STATE Draw a random batch of samples $\x_{t}$ from $\mathcal{T}$
\STATE Obtain source prototypes ${p}(\bar{x}_{t})$ from the source-trained model ($\btheta_s$) with T3A~\cite{iwasawa2021test}
\STATE Forward pass $\x_{t}$ through $\btheta_s$ to obtain the points in the lower-dimensional space
\STATE Calculate the distance between the source prototypes and the new samples
\STATE Select subsets of samples and use dynamic voting to obtain the classifier weights
\STATE Obtain $\hat{\y_{t}}$ for the batch $\x_{t}$
\ENDFOR
\end{algorithmic}
\end{algorithm}














