\section{Experiments}
\label{sec:experiments}
\subsection{Datasets and implementation details}
\textbf{Datasets.} We validate the effectiveness of TNN on two publicly available datasets. Messidor \cite{decenciere2014feedback} is a retinal image database collected from three independent medical centers. It contains 1200 images of diabetic retinopathy, each labeled with a severity score ranging from 0 to 3.
It thus comprises 3 domains, 4 classes, and 1200 images in total.
Fetal-8 \cite{burgos2020evaluation} is a maternal-fetal ultrasound dataset with eight classes representing different anatomical planes, collected from imaging scanners of two different vendors.
It thus comprises 2 domains, 8 classes, and 12,058 images in total.
We treat the problem as test-time generalization rather than domain adaptation.
Therefore, we train our model on multiple source domains of the Messidor dataset instead of only one.
Furthermore, we follow the \textit{leave-one-out} evaluation protocol of \cite{iwasawa2021test,li2017deeper} and select the best source model using the training-domain validation split of \cite{iwasawa2021test}.
For all experiments, we report accuracy on the target dataset with test-time adaptation~\cite{ambekar2023learning,iwasawa2021test,wang2021tent}. \\
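
\noindent For concreteness, the reported metric can be written as follows. The symbols $D$, $\mathcal{D}_t$, and $f_t$ are notation introduced here for illustration only, and the unweighted mean over held-out domains is an assumption. With $D$ domains, let $f_t$ denote the model trained on all domains except $t$ and adapted at test time, and let $\mathcal{D}_t$ be the held-out target domain. Then
\begin{equation}
\mathrm{Acc}_t = \frac{1}{|\mathcal{D}_t|} \sum_{(x, y) \in \mathcal{D}_t} \mathbf{1}\left[ f_t(x) = y \right],
\qquad
\overline{\mathrm{Acc}} = \frac{1}{D} \sum_{t=1}^{D} \mathrm{Acc}_t .
\end{equation}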

\noindent \textbf{Implementation details.}
On both datasets, Messidor and Fetal-8, we evaluate our approach with three backbones: DenseNet-121, ResNet-18, and ResNet-50. All in-depth ablation experiments use the ResNet-18 model. As in common test-time methods~\cite{iwasawa2021test,gulrajani2020search}, the backbones are pretrained on ImageNet.
Baselines and state-of-the-art methods are also implemented for both datasets and all three backbones, using the DomainBed library~\cite{gulrajani2020search} with all hyperparameters set to their defaults.
The training-validation split strategy of~\cite{iwasawa2021test,gulrajani2020search} is used to select the best source model.
At test time, we only perform forward passes to classify the online target data. We evaluate all methods following the standard test-time generalization evaluation~\cite{iwasawa2021test}.
We use a small batch size of 32 for test-time generalization, reflecting practical real-world scenarios.

\subsection{Comparisons}
We compare our model against several state-of-the-art (SOTA) approaches and against
a baseline that trains on the source domains with empirical risk minimization (ERM) and performs no adaptation to the target domain. \\

\noindent  \textbf{State-of-the-art comparisons.} We compare our approach to existing parametric and non-parametric state-of-the-art methods by re-implementing them on the two datasets for all three backbones. Parametric methods, as shown in Figure~\ref{fig1:intro}, are techniques that fine-tune the weights of the source-trained model using gradient-based optimization. Non-parametric methods are techniques that perform only feedforward computation at test time, without fine-tuning, optimization, an external memory bank, or an additional model.\\

Table~\ref{table:all_datasets} shows the performance comparisons. TNN achieves the best results on both datasets. 
Parametric methods perform worse than the ERM baseline in many cases. One likely reason is that the medical imaging datasets at hand contain only a fraction of the number of samples that the parametric methods were designed for. Furthermore, differences between images of distinct classes are far more subtle in the medical domain than in the general computer vision domain; in the retinopathy images of Messidor, the classes even correspond to a severity score that changes only gradually.
It is therefore likely that non-parametric methods perform better because they do not fine-tune the complete source model weights but instead act upon the source-trained embedding space, which should be able to separate classes reasonably well. T3A \cite{iwasawa2021test} uses the entropy of the samples as a threshold to classify new cases. In contrast, we use detailed neighborhood information for classification.
For the Fetal-8 dataset with the ResNet-18 backbone, the performance of all generalization methods drops below the ERM baseline. The most likely reason is overfitting due to the small size of the model, combined with the small dataset size at test time. In all remaining settings, TNN performs better than the other approaches.
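
\noindent As a reminder of the quantity involved, the entropy criterion mentioned above for T3A is, to our understanding, the Shannon entropy of the softmax prediction. Writing $p_k(x)$ for the predicted probability of class $k$ among $K$ classes (notation introduced here for illustration only),
\begin{equation}
H(x) = - \sum_{k=1}^{K} p_k(x) \log p_k(x),
\end{equation}
so that a low $H(x)$ corresponds to a confident prediction, and the maximum $\log K$ is attained for a uniform prediction.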




\input{sections/main_table}

\subsection{Additional Experiments}
\label{sec:ablation}


\textbf{Addressing uncertain scenarios.} Ensuring alignment between model output probabilities and the actual likelihood of events is crucial in uncertain scenarios~\cite{kumar2019verified}. To quantify this alignment, Table~\ref{tab:ece} presents the expected calibration error (ECE)~\cite{naeini2015obtaining}, where lower values indicate better calibration, for our approach compared to Tent~\cite{wang2021tent} and T3A~\cite{iwasawa2021test}, using a ResNet-18 backbone. We compute the ECE between predicted and ground-truth labels using a public library\footnote[1]{\url{https://torchmetrics.readthedocs.io/en/v0.8.0/classification/calibration_error.html}}. TNN and T3A consistently achieve considerably better calibration scores than Tent across all domains of the Messidor dataset. Moreover, TNN also takes neighborhood information into account while addressing uncertainty. By exploiting its feedforward nature for classifier adjustments, TNN achieves a calibrated model. \\
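
\noindent For reference, a sketch of the standard binned estimator of the ECE is given below. It assumes equal-width confidence bins; the number of bins $M$ is not specified here, and the symbols are introduced for illustration only. Partitioning the $n$ test predictions into bins $B_1, \dots, B_M$ according to their predicted confidence,
\begin{equation}
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,
\end{equation}
where $\mathrm{acc}(B_m)$ is the fraction of correct predictions in $B_m$ and $\mathrm{conf}(B_m)$ is the mean predicted confidence in $B_m$.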





\begin{table}[ht!]
\centering
\begin{minipage}{0.48\textwidth}
\centering
\caption{\textbf{Addressing uncertain scenarios.} ECE on the three domains (0--2) of the Messidor dataset. The proposed method consistently reduces the ECE relative to Tent across all domains and matches T3A.}
\label{tab:ece}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lcccc}
\toprule
& \multicolumn{4}{c}{\textbf{ECE $\downarrow$}} \\
\cmidrule(lr){2-5}
& \textbf{0} & \textbf{1} & \textbf{2} & \textbf{Mean $\downarrow$} \\
\midrule
{Tent} & 0.101 & 0.336 & 0.130 & 0.189 \\
{T3A} & \textbf{0.001} & \textbf{0.005} & \textbf{0.003} & \textbf{0.003} \\
\textbf{{{TNN (Ours)}}} & \textbf{0.001} & \textbf{0.005} & \textbf{0.003} & \textbf{0.003} \\
\bottomrule
\end{tabular}%
}
\end{minipage}
\hfill
\begin{minipage}{0.48\textwidth}
\centering
\caption{\textbf{Computational cost.} The number of parameters trained at test time and the TFlops consumed on the GPU. TNN and T3A both consume fewer resources than Tent and are thus useful for practical scenarios.}
\label{tab:tta_bn}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lcc}
\toprule
& \textbf{Parameters} $\downarrow$ & \textbf{Model TFlops} $\downarrow$ \\
\midrule
Tent & 600,000 & 212,992 \\
T3A & 0 & - \\
\textbf{TNN (Ours)} & 0 & - \\
\bottomrule
\end{tabular}%
}
\end{minipage}
\end{table}


\noindent \textbf{Computational cost.} In Table~\ref{tab:tta_bn}, we compare the number of parameters that must be trained at test time and the number of floating-point operations (FLOPs, reported as TFlops) consumed on the GPU for Tent \cite{wang2021tent}, T3A \cite{iwasawa2021test}, and TNN. All methods, including ours, have the same memory requirement of 1.4\,GB for the ResNet-18 model. However, because Tent optimizes the batch normalization layers of the target model at test time, it must train more parameters than our approach. %
TNN and T3A are both non-parametric. They therefore perform only a limited number of operations and need no additional computation for gradients on the GPU. Their TFlops are thus negligible compared with the vast amount of computation that weight optimization requires in parametric approaches. This is especially useful in resource-limited settings.\\
