Fine-Grained Classification for Depth Estimation From Monocular Microscopy for Robotic Micromanipulation of Motile Cells

Published: 2026, Last Modified: 18 Feb 2026. Venue: IEEE Robotics Autom. Lett. 2026. License: CC BY-SA 4.0
Abstract: Manipulation of motile cells is crucial for biological research and clinical applications. However, obtaining Z-axis visual feedback under monocular microscopy remains a challenge for robotic micromanipulation. Traditional depth-from-focus and depth-from-defocus methods fail to handle motile cells due to time-consuming focus search or inaccurate defocus modeling. This letter addresses these limitations by reformulating depth estimation as a fine-grained multi-class depth classification problem that exploits the shallow depth-of-field characteristic of microscopy. We propose a Fine-Grained Attention Fusion Module (FGAF-Module) that combines multi-scale grouped convolution for extracting subtle depth-related features with attention mechanisms to focus on discriminative regions in cell images. Additionally, channel-based feature augmentation methods, including CrossNorm and SelfNorm, enhance fine-grained feature discrimination while improving model generalization to handle morphological variations during cell movement. A weighted loss function further guides the model to distinguish between adjacent depth categories by penalizing errors proportionally to depth differences. In evaluation, the FGAF-Module-enhanced network achieved 83.52% top-1 and 96.88% top-3 classification accuracy while maintaining real-time performance at 90 frames per second. To demonstrate the capability of our approach in providing visual feedback for robotic manipulation of motile cells, the trained depth estimation model was integrated into a robotic sperm aspiration system. The model provided real-time visual depth feedback to guide 3D pipette localization during sperm aspiration procedures, achieving a 92% success rate for live motile sperm aspiration. These results validate the effectiveness of fine-grained classification for monocular depth estimation in micromanipulation applications.
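The abstract does not give the form of the weighted loss, so the following is only an illustrative sketch of one way a classification loss can penalize errors in proportion to depth difference. It assumes depth classes are ordered integer indices, and it adds to the standard cross-entropy a penalty equal to the expected class-index distance from the target. The function names, the `lam` weight, and the absolute-distance penalty are all assumptions made for illustration, not the authors' formulation. A small helper for top-k correctness, the metric reported above, is included.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distance_weighted_loss(logits, target, lam=1.0):
    """Cross-entropy plus an expected-distance penalty (hypothetical form).

    logits: raw scores, one per ordered depth class
    target: index of the true depth class
    lam:    assumed weight of the distance penalty
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    # Probability mass placed on class k is penalized by |k - target|,
    # so a prediction far from the true depth costs more than a near miss.
    penalty = sum(p * abs(k - target) for k, p in enumerate(probs))
    return cross_entropy + lam * penalty


def top_k_correct(logits, target, k):
    """True if the target class is among the k highest-scoring classes."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return target in ranked[:k]


if __name__ == "__main__":
    target = 2
    near_miss = [0.0, 0.0, 0.0, 5.0, 0.0]  # peak one class away from target
    far_miss = [5.0, 0.0, 0.0, 0.0, 0.0]   # peak two classes away from target
    print(distance_weighted_loss(near_miss, target))
    print(distance_weighted_loss(far_miss, target))
```

In this sketch both wrong predictions assign the same probability to the true class, so plain cross-entropy cannot tell them apart; only the distance term makes the far miss more costly than the near miss.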