\section{Experiment and Discussion}
\subsubsection{Dataset}
We used a public dataset of nailfold videocapillaroscopy images from \cite{zhao2024comprehensive}. The dataset contains \datasetImgCount{} high-quality capillaroscopy recordings from 68 participants, acquired with a microscope at approximately $200\times$ magnification. After submitting the Biometrics Dataset Release Agreement, we obtained both the raw image data and the corresponding Labelme-formatted~\cite{wada_labelme_2021} annotations, and converted the annotations to COCO format as described in the previous section.

\subsubsection{Implementation}
Model training was conducted on an NVIDIA GeForce RTX 4090 GPU (24\,GB) under Ubuntu 24.04, using Python 3.11.11, PyTorch 2.6.0, and Torchvision 0.21.0. Input images were uniformly resized to $1024 \times 1024$ and augmented via flipping, cropping, and resizing. In addition to these geometric augmentations, we applied photometric transformations (random brightness, contrast, and hue shifts) to mimic the highly variable lighting conditions found in outpatient clinics. We initialized the model from an MViTv2-T checkpoint, chosen for its strong performance and lightweight architecture. Optimization used AdamW with an initial learning rate of $1.6 \times 10^{-4}$, linear warm-up, and scheduled decay at 52,500, 62,500, and 67,500 iterations. Due to GPU memory constraints, the batch size was fixed at 4. An 80:20 train-test split was applied at the subject level, so that no participant's images appear in both sets. Although it has been suggested that Detectron2~\cite{wu2019detectron2} may not be ideal for certain experiments~\cite{gracia2022challenge}, we successfully built the core codebase on this framework.

\subsubsection{Evaluation Metrics}

\input{tables/table-evaluation-result}
We use task-specific metrics, summarized in \autoref{tab:model_metrics}. For each task, we conduct experiments using a five-fold setup and report the results aggregated across the folds.

\subsubsection{Results}
For segmentation, we evaluate pixel-level sensitivity and obtain an average of 0.827, outperforming ANFC~\cite{zhao2024comprehensive}. CAPI~\cite{gracia2022challenge} is excluded from this comparison because the original method does not address the segmentation task.
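For reference, pixel-level sensitivity can be written in its standard form. The symbols $\mathrm{TP}$ and $\mathrm{FN}$ are notation introduced here for illustration: they count the annotated capillary pixels that the predicted mask recovers and misses, respectively.
\[
  \mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
\]
Under this definition, an average value of 0.827 indicates that roughly 83\% of the annotated capillary pixels are covered by the predicted masks.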

For the classification task, we assess results at the \textit{whole-image level}. We assume that an NFC image is abnormal if it contains at least one abnormal capillary. Accordingly, during evaluation, the predicted classes of the individual capillaries are aggregated to determine the overall image class. Under this assumption, we achieve 88.5\% accuracy, 89.7\% precision, 86.8\% recall, and an F1 score of 88.2\%.
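The image-level aggregation rule can be sketched as follows. The notation is introduced here for illustration only: $\hat{y}_i \in \{0, 1\}$ is the predicted label of the $i$-th of $n$ detected capillaries (1 for abnormal), and $\hat{Y}$ is the resulting image label.
\[
  \hat{Y} = \max_{1 \le i \le n} \hat{y}_i
\]
Assuming the standard definitions with \textit{abnormal} as the positive class, precision $P$, recall $R$, and the F1 score are related by
\[
  P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad
  R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad
  F_1 = \frac{2PR}{P + R}.
\]
As a consistency check, substituting $P = 0.897$ and $R = 0.868$ gives $F_1 = 2 \times 0.897 \times 0.868 / (0.897 + 0.868) \approx 0.882$, which agrees with the reported F1 score.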

For the keypoint detection task, we assess model performance through a downstream task: capillary parameter estimation. Specifically, the venous diameter is computed as the Euclidean distance between keypoints $LL$ and $LR$, the arterial diameter as the distance between $RL$ and $RR$, and the apical diameter as the distance between $U$ and $D$. The mean absolute error (MAE) is the average absolute difference between the predicted and ground-truth diameters over all capillaries in an image. As shown in~\autoref{tab:model_metrics}, the model performs well, particularly in estimating arterial and apical diameters.
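These quantities can be written compactly as follows. The notation is introduced here for illustration: $\mathbf{p}_{K}$ is the image coordinate of keypoint $K$, and $\hat{d}_j$ and $d_j$ are the predicted and ground-truth values of a given diameter for the $j$-th of $N$ capillaries in an image.
\[
  d_{\mathrm{ven}} = \left\| \mathbf{p}_{LL} - \mathbf{p}_{LR} \right\|_2, \qquad
  d_{\mathrm{art}} = \left\| \mathbf{p}_{RL} - \mathbf{p}_{RR} \right\|_2, \qquad
  d_{\mathrm{api}} = \left\| \mathbf{p}_{U} - \mathbf{p}_{D} \right\|_2
\]
\[
  \mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} \left| \hat{d}_j - d_j \right|
\]
This sketch assumes that each predicted capillary has already been matched to a ground-truth capillary, and that the diameters carry the same units as the keypoint coordinates.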

\subsubsection{Ablation Study}
\autoref{tab:ablation_study} shows the impact of different task combinations on segmentation, classification, and keypoint performance. The final combined configuration performs well across the main tasks, suggesting that unified optimization works as intended.

\input{tables/table-ablation-study}
