\section{State-Of-The-Art}\label{sec:sota}


\textbf{Semantic Segmentation} performs pixel-level classification, being CNNs the standard approach when annotated data is available. The widely adopted \textit{U-Net} follows an encoder-decoder structure, gradually reducing and recovering spatial resolution. While other architectures~\cite{zhao2017pyramid, chen2017deeplab, huang2019ccnet, salpea2022medical} have shown strong results, U-Net remains dominant in biomedical imaging due to its simplicity and effectiveness \cite{nnunet, unetpp}. Meanwhile, specialized methods such as StarDist \cite{stardist} and Cellpose \cite{cellpose} are particularly popular in fluorescence and immunofluorescence contexts, where they focus on capturing precise cell boundaries in often high-contrast images.

\textbf{Cell Counting} methods offer an alternative to cell segmentation by framing the task as density estimation. \cite{lempitsky2010learning} introduced a supervised framework that estimates object counts from dot annotations, bypassing explicit segmentation. \cite{xie2018microscopy} extended this with CNN-based fully convolutional regression networks for microscopy cell counting.\\
\textbf{Multi-Task Approaches} are a common strategy for instance segmentation in cell analysis, where semantic segmentation is combined with auxiliary tasks. HoVer-Net employs a \textit{three-headed} decoder and this design enables robust instance-level segmentation and classification. More recently, in our previous work ~\cite{anglada2024dualunet}, we introduced two independent U-Net models following a similar multi-task learning principle. Instead of HoVer maps, this model combines semantic segmentation with a cell counting task that estimates Gaussian-based cell centroids, effectively enabling cell separation even in highly overlapping regions.\\ 
\textbf{Transformer-based approaches} have recently gained traction in biomedical image analysis, with many adopting the three-headed scheme introduced by HoVer-Net~\cite{graham2019hover}. Examples of this strategy are CellViT~\cite{cellvit}, which exemplifies a state-of-the-art Transformer architecture for cell segmentation; or NuLite~\cite{tommasino2024nulite}, which prioritizes computational efficiency. Although Transformers capture long-range dependencies, they usually require more computational resources than CNN models. In contrast, CellDETR~\cite{celldetr} uses the DETR framework to detect cells via bounding boxes rather than producing segmentation masks, making direct comparisons with HoVer-Net, CellViT, NuLite, or our method less applicable.\\ 
\textbf{Advanced Convolutional architectures}
have been proposed integrating Transformer-inspired design principles. ConvNeXt~\cite{Liu_2022_ConvNext} is one such architecture that rethinks standard ResNet-like backbones using modern components. ConvNeXt can serve as a drop-in replacement for earlier CNN backbones in tasks like U-Net. Its design also better aligns with multi-head decoder strategies by providing robust hierarchical feature representations.
