\section{Methods}\label{sec:methods}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{esquema_ex.jpg}
    \caption{DualU-Net architecture. The encoder (green) extracts features and feeds two parallel decoders (blue): segmentation and classification (top) and detection (bottom). Each block outputs \( m \) feature maps, with final heads (yellow) producing a multi-class segmentation mask (\( m = C \)) and a single-channel density map (\( m = 1 \)). Insets (a) and (b) detail the residual and decoder block structures.
}
    \label{fig:dualunet}
\end{figure}

In contrast to our earlier approach, which used two independent U-Net models, we now unify both tasks, semantic segmentation and cell center detection, within a single network. The proposed DualU-Net architecture (Fig.~\ref{fig:dualunet}) pairs a shared encoder, which captures multiscale features from the input images, with two specialized decoder branches: one for semantic segmentation and one for cell center detection.

\paragraph{Two Heads Are Enough}
With our DualU-Net design, we aim to simplify the widely adopted three-decoder architecture while addressing the same task (cell nuclei classification and instance segmentation) and maintaining high performance, using only two decoder heads. First, we weight the background class in the loss function through a tunable parameter to ensure balanced training despite the large proportion of background pixels. This emphasizes the binary classification task of distinguishing cells from the background, enhancing segmentation accuracy. As a result, a dedicated Nuclei Pixel (NP) branch, as used in HoVer-Net, becomes redundant. Second, we estimate cell centroids using Gaussian-based density maps, predicting the center of mass of each cell as a Gaussian distribution. This approach provides a computationally efficient and interpretable method for cell detection, and thus a faster and more intuitive alternative to the HV vector branch in HoVer-Net.

\paragraph{Encoder}
The encoder in the DualU-Net architecture is designed to extract multiscale feature representations from input images, leveraging the hierarchical structures of modern convolutional backbones. We tested two state-of-the-art architectures: ResNeXt-50 32$\times$4d~\cite{xie2017aggregated} and ConvNeXt-Base~\cite{Liu_2022_ConvNext}.

\paragraph{Decoders}
The semantic segmentation decoder generates pixel-wise classification masks by progressively refining the feature maps across five hierarchical levels. The second decoder predicts a Gaussian-based density map of cell centers. Ground-truth density maps are created by applying a Gaussian kernel to point annotations placed at each cell's centroid. During inference, local maxima of the predicted density map are taken as cell centers~\cite{anglada2024dualunet}. Both decoders share the same main architecture, shown in Fig.~\ref{fig:dualunet}. They differ only in the final head: the head of the semantic segmentation decoder maps the output to the required number of semantic classes, whereas the head of the detection decoder produces a single-channel density map representing the likelihood of cell centers.
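
As an illustrative sketch of the density-map construction described above, one common formulation places an isotropic Gaussian at each annotated centroid \( \mathbf{c}_i \), \( i = 1, \dots, N \). The standard deviation \( \sigma \) and the unnormalized form are assumptions made here for illustration, since the kernel width and normalization are not specified in this section:
\[
D(\mathbf{x}) = \sum_{i=1}^{N} \exp\!\left( -\frac{\lVert \mathbf{x} - \mathbf{c}_i \rVert^{2}}{2\sigma^{2}} \right),
\]
where \( \mathbf{x} \) denotes a pixel location. Under the same assumptions, the inference step can be read as keeping the pixels that are local maxima of the predicted map \( \hat{D} \) and exceed a hypothetical threshold \( \tau \):
\[
\hat{\mathcal{C}} = \left\{ \mathbf{x} \;:\; \hat{D}(\mathbf{x}) = \max_{\mathbf{y} \in \mathcal{N}(\mathbf{x})} \hat{D}(\mathbf{y}), \;\; \hat{D}(\mathbf{x}) > \tau \right\},
\]
with \( \mathcal{N}(\mathbf{x}) \) a small neighborhood around \( \mathbf{x} \). The symbols \( \sigma \), \( \tau \), and \( \mathcal{N} \) are introduced only for this sketch.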

\paragraph{Merging and Final Cell Instances}
In the final stage, the outputs from both decoder heads are merged to achieve instance-level segmentation. A watershed algorithm is applied to the semantic segmentation mask, using the predicted cell centers from the detection decoder as markers. Cells without an associated predicted center are discarded. This process effectively separates clustered cells, forming distinct connected components that correspond to individual cells. Each connected component is then assigned a semantic class through a majority vote based on the segmented pixels within it.
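
The majority vote can be written compactly. As a sketch, with notation introduced here for illustration only, let \( R_k \) denote the set of pixels in the \( k \)-th connected component returned by the watershed step, \( \hat{s}(\mathbf{x}) \) the class predicted by the segmentation decoder at pixel \( \mathbf{x} \), and \( \mathcal{K} \) the set of candidate classes. The class assigned to instance \( k \) is then
\[
\hat{y}_k = \arg\max_{c \in \mathcal{K}} \sum_{\mathbf{x} \in R_k} \mathbf{1}\!\left[ \hat{s}(\mathbf{x}) = c \right],
\]
where \( \mathbf{1}[\cdot] \) is the indicator function. How ties are resolved, and whether background pixels take part in the vote, is not specified in this section and is left open in this sketch.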

\paragraph{Loss Function}
To train DualU-Net for its dual objectives, we employ a composite loss function that optimizes both tasks simultaneously. The total loss \( \mathcal{L}_{\text{total}} \) is defined as:

\begin{equation}
\mathcal{L}_{\text{total}} = \lambda_{\text{dice}} \mathcal{L}_{\text{dice}} + \lambda_{\text{ce}} \mathcal{L}_{\text{ce}} + \lambda_{\text{mse}} \mathcal{L}_{\text{mse}},
\end{equation}

\noindent where \( \lambda_{\text{dice}} \), \( \lambda_{\text{ce}} \), and \( \lambda_{\text{mse}} \) are weighting factors that control the contributions of the Dice loss \( \mathcal{L}_{\text{dice}} \), the Cross-Entropy (CE) loss \( \mathcal{L}_{\text{ce}} \), and the Mean Squared Error (MSE) loss \( \mathcal{L}_{\text{mse}} \), respectively. The \( \mathcal{L}_{\text{dice}} \) and \( \mathcal{L}_{\text{ce}} \) terms primarily drive the segmentation task by encouraging accurate pixel-wise classification and mitigating class imbalance. Our experiments indicate that \( \mathcal{L}_{\text{dice}} \) has a stronger influence on segmentation quality, whereas \( \mathcal{L}_{\text{ce}} \) matters more for classification performance. Relying on only one of these losses tends to focus the model on a single task and degrades performance on the other. Consequently, we adopt a combination of both, as also done in \cite{graham2019hover, tommasino2024nulite, cellvit}. Furthermore, we conducted a hyperparameter search over the weighting factors and observed that equal contributions (\( \lambda_{\text{dice}} : \lambda_{\text{ce}} : \lambda_{\text{mse}} = 1 : 1 : 1\)) yield the best overall performance. To ensure that underrepresented classes receive greater importance during training, we apply a class-weighting strategy in which the loss contribution of each class, including the background, is weighted by the inverse of its frequency in the dataset.

The \( \mathcal{L}_{\text{mse}} \) term directly supervises the centroid estimation task by minimizing the error between the predicted density map and the ground-truth Gaussian density map built from the center annotations. By enforcing a smooth and accurate density representation of cell centroids, this loss helps refine cell localization without requiring an explicit boundary prediction.
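
For concreteness, the individual terms can be sketched in their commonly used forms. These are assumptions made for illustration; the exact variants (for example, smoothing constants, normalization of the class weights, or whether the weights also enter the Dice term) are not specified in this section. Let \( \Omega \) be the set of pixels, \( p_c(\mathbf{x}) \) the predicted probability of class \( c \) at pixel \( \mathbf{x} \), \( g_c(\mathbf{x}) \in \{0, 1\} \) the one-hot ground truth, \( f_c \) the fraction of training pixels belonging to class \( c \), and \( D \), \( \hat{D} \) the ground-truth and predicted density maps. With inverse-frequency weights \( w_c = 1 / f_c \), one possible instantiation is
\[
\mathcal{L}_{\text{ce}} = -\frac{1}{|\Omega|} \sum_{\mathbf{x} \in \Omega} \sum_{c=1}^{C} w_c \, g_c(\mathbf{x}) \log p_c(\mathbf{x}),
\]
\[
\mathcal{L}_{\text{dice}} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{2 \sum_{\mathbf{x} \in \Omega} p_c(\mathbf{x}) \, g_c(\mathbf{x})}{\sum_{\mathbf{x} \in \Omega} p_c(\mathbf{x}) + \sum_{\mathbf{x} \in \Omega} g_c(\mathbf{x})},
\]
\[
\mathcal{L}_{\text{mse}} = \frac{1}{|\Omega|} \sum_{\mathbf{x} \in \Omega} \left( \hat{D}(\mathbf{x}) - D(\mathbf{x}) \right)^{2}.
\]
In this reading, a rare class with small \( f_c \) receives a large weight \( w_c \), so errors on its pixels contribute more to \( \mathcal{L}_{\text{ce}} \) than errors on the abundant background.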



