\section{Method}
\subsection{Dataset, Annotations, and Keypoint Definitions} \label{method-dataset}
\begin{figure}[ht]
    \centering
    \includegraphics[width=\textwidth]{images/keypoints-map.pdf}
    \caption{Different types of nailfold capillaries examined in this research. \textbf{(1)} Normal capillary displaying the classic inverted-U shape. \textbf{(2)} Abnormal capillary where the afferent limb (left portion) is shorter than the efferent limb (right portion). \textbf{(3)} Abnormal capillary characterized by a non-linear efferent limb. \textbf{(4)} Abnormal capillary exhibiting a conjunction or anastomosis. \textbf{(5)} Abnormal capillary with both afferent and efferent limbs shorter than typical length. \textbf{(6)} Abnormal capillary with both afferent and efferent limbs longer than typical length.}
    \label{fig:capillary}
\end{figure}

The dataset comprises $N$ clinician-selected RGB capillaroscopy images of size $W \times H \times 3$, collected from multiple participants. Each image $\mathcal{M}_j$, where $j \in \{1, 2, \dots, N\}$, contains several visible capillaries and, optionally, hemorrhages. The capillaries in $\mathcal{M}_j$ are represented as a set of entities $\{ \mathcal{E}_1^{(j)}, \mathcal{E}_2^{(j)}, \dots, \mathcal{E}_w^{(j)} \}$, where $w$ is the number of annotated capillaries in $\mathcal{M}_j$.

A capillary entity $\mathcal{E}_i^{(j)}$ is written as a tuple of four components:

\begin{equation} \label{eq:entity}
    \mathcal{E}_i^{(j)} = \left(\mathcal{S}_i^{(j)}, \mathcal{B}_i^{(j)}, \mathcal{Q}_i^{(j)}, \mathcal{P}_i^{(j)} \right)
\end{equation}

In \autoref{eq:entity}, $\mathcal{S}_i^{(j)}$ denotes the segmentation polygon, $\mathcal{B}_i^{(j)}$ the bounding box, $\mathcal{Q}_i^{(j)}$ the classification label, and $\mathcal{P}_i^{(j)}$ the set of keypoints. In this study, we define a fixed set of 9 keypoints: up ($U$), down ($D$), left-left ($LL$), left-right ($LR$), right-left ($RL$), right-right ($RR$), left-bottom ($LB$), right-bottom ($RB$), and the optional conjunction ($X$). The $U$ and $D$ points are chosen from the capillary apex, $LL$ and $LR$ from the arterial limb, and $RL$ and $RR$ from the venous limb. Each keypoint of a capillary entity is represented as a one-hot binary mask of size $m \times m$ that encodes a specific anatomical landmark, and a spatial softmax over the $m^2$ grid locations is applied to predict the most probable keypoint location during inference, as described in~\cite{he2017mask}. Every capillary in this study must contain the 8 essential keypoints; only the conjunction point is optional. The hemorrhage class does not contain any keypoints. To represent this information, we adopt the MS COCO format~\cite{lin2014microsoft} for annotations, assigning a visibility flag to each keypoint. \autoref{fig:capillary} illustrates the keypoint annotations used in this research.
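
As a sketch of how one annotation is laid out, the COCO keypoint convention stores each keypoint as a coordinate pair followed by a visibility flag. The symbols $(x_k, y_k, v_k)$ and the ordering of the nine keypoints below are illustrative notation and are not prescribed by the format itself:
\[
    \mathcal{P}_i^{(j)} = \left( x_1, y_1, v_1,\; x_2, y_2, v_2,\; \dots,\; x_9, y_9, v_9 \right), \qquad v_k \in \{0, 1, 2\},
\]
where $v_k = 0$ marks a keypoint that is not labeled, $v_k = 1$ a keypoint that is labeled but not visible, and $v_k = 2$ a keypoint that is labeled and visible. Under this convention, a capillary without a conjunction would presumably carry $v = 0$ for the $X$ entry.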

\subsection{Model Architecture}
\subsubsection{Overview}
\begin{figure*}[htbp]
    \centering
    \includegraphics[width=\textwidth]{images/MViT_FPN.pdf}
    \caption{Overview of the proposed architecture. The model integrates a Multiscale Vision Transformer backbone with a Feature Pyramid Network to capture both fine-grained details and high-level structural context of the capillaries. The multi-scale features are fed into task-specific heads built on Mask R-CNN and a Region Proposal Network to predict segmentation masks, class labels, and keypoint heatmaps.}
    \label{fig:network}
\end{figure*}

The architecture of the proposed model is illustrated in~\autoref{fig:network}. Although the final tasks are closely related, they require distinct prediction strategies because nailfold capillaries are small, have subtle visual features, are often occluded, and have irregular shapes. Consequently, the model must capture both low-level visual details and high-level structural relationships. To address these challenges, we adopt the Multiscale Vision Transformer (MViT)~\cite{Fan_2021_ICCV,Li_2022_CVPR}, which encodes each input into multi-scale square patches through adaptive upsampling and downsampling operations. These multi-scale patches align naturally with the four-stage Feature Pyramid Network (FPN), whose outputs feed into dedicated task-specific heads based on Mask R-CNN~\cite{he2017mask} and the Region Proposal Network (RPN)~\cite{NIPS2015_14bfa6bb}. A custom total loss function is applied to optimize all tasks jointly, yielding a comprehensive prediction of capillary representations.

\subsubsection{Backbone}
First, all input nailfold capillary images are padded to a square shape. We then split each image into patches of size $P \times P$. For ViT-based encoding, the model encodes the patches into a feature map $\mathbf{F} \in \mathbf{R}^{\frac{H}{P} \times \frac{W}{P} \times D}$. To create feature maps at the different scales used in the FPN, we perform recursive down-sampling and up-sampling using convolutional layers with strides $s \in \{4, 8, 16, 32\}$. Later, in the FPN module, we return the feature map to its original resolution by recursive upsampling.
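
To make the pyramid resolutions concrete, the spatial size of each feature map follows directly from its stride. The symbols $L$ (side length of the padded square input) and $\mathbf{F}_s$, $D_s$ (feature map and channel width at stride $s$) are introduced here for illustration only:
\[
    \mathbf{F}_s \in \mathbf{R}^{\frac{L}{s} \times \frac{L}{s} \times D_s}, \qquad s \in \{4, 8, 16, 32\}.
\]
For a hypothetical input with $L = 1024$, the four maps would have spatial sizes $256 \times 256$, $128 \times 128$, $64 \times 64$, and $32 \times 32$.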

\subsubsection{Attention Module}
As in other ViT models, the attention score between each pair of patches is computed as the dot product of the corresponding query and key vectors, and the resulting weights are applied to the value vectors. Let $X = [x_1, x_2, \dots, x_N]$ be the sequence of input patch representations. The query, key, and value matrices are obtained with learnable weight matrices: $Q = X W_Q$, $K = X W_K$, and $V = X W_V$.

The resulting scores are scaled by $\frac{1}{\sqrt{d}}$, where $d$ is the dimensionality of the key vectors, to stabilize gradients during training. The scaled scores are then passed through a softmax to obtain the attention weights.
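
The computation described above is conventionally summarized in the scaled dot-product form below. This is the standard formulation, with $d$ taken as the key dimensionality and the softmax applied row-wise; MViT additionally pools $Q$, $K$, and $V$ before this step, which is omitted from the sketch:
\[
    \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{\top}}{\sqrt{d}} \right) V .
\]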

\subsubsection{ROI Heads}
The downstream ROI heads refine the proposals generated from the FPN features and make the final predictions for the capillary tasks, following the structure of Mask R-CNN and RPN. The model uses ROI Align from the Mask R-CNN framework to extract features for each proposal. It then performs bounding box regression to adjust the box coordinates and classifies capillaries using a softmax function. To predict the keypoint heatmaps, we use a top-down method that combines the predicted bounding boxes with the FPN output features. Capillary semantic segmentation is achieved by passing the FPN output features into Mask R-CNN, which then uses five $256 \times 256$ Conv2d heads to predict five distinct pixel-level classes.

\subsection{Task-specific Loss and Multitask Loss}
\subsubsection{Classification Loss}
Let $\hat{y}$ be the predicted probability distribution (from the final softmax layer) and $y$ the true class label (a one-hot encoded vector). The classification loss is the cross-entropy $\mathcal{L}_{\text{class}} = - \sum_{c=1}^{C} y_c \log(\hat{y}_c)$, where \( C \) is the number of classes, \( y_c \) is the true label for class \( c \), and \( \hat{y}_c \) is the predicted probability for class \( c \).
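
Because $y$ is one-hot, the sum collapses to a single term. As a small worked example with made-up numbers, take $C = 3$, a true class $c^{*} = 1$, and a hypothetical prediction $\hat{y} = (0.7, 0.2, 0.1)$:
\[
    \mathcal{L}_{\text{class}} = -\log(\hat{y}_{c^{*}}) = -\log(0.7) \approx 0.357 ,
\]
using the natural logarithm.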

\subsubsection{Segmentation Loss}

We use the \textit{Dice Loss} for segmentation: $\mathcal{L}_{\text{segm}} = 1 - \frac{2 \sum_{i} \mathcal{S}_i g_i}{\sum_{i} \mathcal{S}_i + \sum_{i} g_i}$. Here, \( \mathcal{S}_i \) is the predicted mask value at pixel \( i \), and \( g_i \) is the ground truth mask value at pixel \( i \).
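
As a small worked example with made-up values, consider four pixels with predicted mask values $(0.8, 0.6, 0.1, 0.0)$ and ground truth values $(1, 1, 0, 0)$:
\[
    \mathcal{L}_{\text{segm}} = 1 - \frac{2\,(0.8 + 0.6)}{1.5 + 2} = 1 - \frac{2.8}{3.5} = 0.2 .
\]
Implementations often add a small smoothing constant to the numerator and denominator to avoid division by zero on empty masks; it is omitted in this illustration.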

\subsubsection{Bounding Box Loss}
For the bounding box regression task, we use the \textit{Smooth L1 Loss}. Let \( \mathcal{B}_i^{(j)} = [x_{\text{min}}, y_{\text{min}}, x_{\text{max}}, y_{\text{max}}] \) be the predicted bounding box for the \( i \)-th capillary, and \( \mathcal{G}_i^{(j)} = [g_{\text{min}}, h_{\text{min}}, g_{\text{max}}, h_{\text{max}}] \) be the ground truth bounding box. The bounding box loss averages the Smooth L1 penalty over the four coordinates: $\mathcal{L}_{\text{bbox}} = \frac{1}{4} \sum_{k=1}^{4} \text{smooth}_{L_1}\left( \mathcal{B}_{i,k}^{(j)} - \mathcal{G}_{i,k}^{(j)} \right)$, where the subscript $k$ indexes the four box coordinates.
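
For reference, a commonly used form of the Smooth L1 function is given below. The unit threshold between the quadratic and linear regimes is an assumption made for illustration; implementations typically expose it as a tunable parameter:
\[
    \text{smooth}_{L_1}(x) = \left\{ \begin{array}{ll} 0.5\,x^2 & \text{if } |x| < 1, \\ |x| - 0.5 & \text{otherwise.} \end{array} \right.
\]
With hypothetical coordinate differences $(0.5, -2, 0, 1)$ between a predicted and a ground truth box, the per-coordinate values are $(0.125, 1.5, 0, 0.5)$, and their mean is $2.125 / 4 \approx 0.531$.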

\subsubsection{Keypoint Loss}
For the keypoint detection task, we use the \textit{cross-entropy loss}: $\mathcal{L}_{\text{kp}} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{i=1}^{S} \sum_{j=1}^{S} g_{n,k,i,j} \log(\hat{g}_{n,k,i,j})$. The predicted keypoints are represented as a tensor of logits of shape \( (N, K, S, S) \), where \( N \) is the batch size, \( K \) is the number of keypoints, and \( S \) is the side length of the keypoint heatmap. Here, \( g_{n,k,i,j} \) is the ground truth heatmap value and \( \hat{g}_{n,k,i,j} \) is the predicted probability at location \( (i, j) \) for keypoint \( k \) of sample \( n \). The ground truth keypoints are converted into heatmaps of the same size.
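
The probabilities $\hat{g}$ entering this loss can be obtained from the logits with a spatial softmax over each $S \times S$ heatmap, in the manner of~\cite{he2017mask}. Writing the logits as $z_{n,k,i,j}$ (a symbol introduced here for illustration), a sketch of this normalization is:
\[
    \hat{g}_{n,k,i,j} = \frac{\exp(z_{n,k,i,j})}{\sum_{i'=1}^{S} \sum_{j'=1}^{S} \exp(z_{n,k,i',j'})} ,
\]
so that each keypoint heatmap sums to one over its $S^2$ locations.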

\subsubsection{Total Loss}
Since all tasks are jointly optimized, the weighting strategy is vital for the final optimization. Following~\cite{Kendall_2018_CVPR}, we use uncertainty weighting to balance the tasks: $\mathcal{L}_{\text{total}} = \mathcal{U}\mathcal{W} \otimes \{ \mathcal{L}_{\text{class}}, \mathcal{L}_{\text{segm}}, \mathcal{L}_{\text{bbox}}, \mathcal{L}_{\text{kp}}\}$, where $\mathcal{U}\mathcal{W}$ denotes the uncertainty-weighting operator applied to the set of task losses.
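
One common form of the uncertainty weighting in~\cite{Kendall_2018_CVPR} assigns each task $t$ a learnable noise parameter $\sigma_t$. The sketch below uses that form; the symbols $\sigma_t$ and $\mathcal{T}$ are introduced for illustration, and the exact constant factors differ between regression and classification tasks in the original derivation:
\[
    \mathcal{L}_{\text{total}} \approx \sum_{t \in \mathcal{T}} \left( \frac{1}{2\sigma_t^{2}}\, \mathcal{L}_t + \log \sigma_t \right), \qquad \mathcal{T} = \{\text{class}, \text{segm}, \text{bbox}, \text{kp}\}.
\]
Tasks with a larger learned $\sigma_t$ are down-weighted, while the $\log \sigma_t$ term discourages the trivial solution of inflating every $\sigma_t$.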
