Training-free fish species population monitoring in unconstrained underwater videos

Isaak Kavasidis, Amelia Sorrenti, Orazio Tomarchio, Daniela Giordano, Marco Milazzo, Gabriele Turco, Carlo Cattano, Concetto Spampinato

Published: 01 Jan 2025, Last Modified: 20 Jul 2025, ANT/EDI40 2025, CC BY-SA 4.0

Abstract: Monitoring fish populations in underwater environments is essential for understanding marine biodiversity, evaluating ecosystem health, and guiding conservation efforts. The complex nature of underwater habitats, with shifting lighting, turbidity, and background clutter, creates significant challenges for traditional monitoring methods. These challenges are exacerbated by the limited availability of annotated datasets, which hampers the effectiveness of automated approaches. This study addresses these issues with foundation models: versatile, pre-trained systems capable of generalizing across diverse scenarios. By relying on these models, the proposed training-free computational pipeline minimizes the need for extensive dataset-specific training, offering a streamlined and efficient solution for underwater fish population monitoring. The pipeline uses the Segment Anything Model 2 (SAM2) for precise detection and segmentation of fish in unconstrained and cluttered underwater videos recorded by Baited Remote Underwater Video systems (BRUVs), a standard technique that marine ecologists use to assess fish abundance. By applying SAM2 with tailored prompts, the system accurately isolates fish from complex scenes while eliminating noise such as bait cages. Subsequently, segmented fish are processed by a diffusion model to generate high-resolution visual representations, which compensate for image quality issues caused by occlusions, poor lighting, or motion blur. These enhanced visuals enable more reliable identification of fish species.
In the final stage, the Contrastive Language–Image Pre-training (CLIP) model classifies fish species by drawing on its multimodal learning capabilities, associating visual features with textual labels for robust species-level recognition. Extensive evaluation on a large fish dataset from underwater videos demonstrates the pipeline's effectiveness in handling the visual complexities of marine environments and achieving high levels of accuracy in fish detection and classification. The system's training-free design reduces dependency on labor-intensive annotation processes, making it scalable and adaptable to diverse underwater ecosystems.
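
The final classification stage can be illustrated with a minimal sketch of CLIP-style zero-shot matching, in which an image embedding is compared with one text embedding per candidate label and the similarities are turned into probabilities. This is not the authors' implementation: the function names, the three-dimensional toy embeddings, the placeholder species labels, and the logit scale of 100 are all assumptions made for illustration. In a real pipeline the embeddings would come from CLIP's image and text encoders, applied to the segmented and enhanced fish crops.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def zero_shot_classify(
    image_emb: Sequence[float],
    label_embs: Dict[str, Sequence[float]],
    logit_scale: float = 100.0,  # assumed temperature, not taken from the paper
) -> Dict[str, float]:
    """Return a probability per label from scaled cosine similarities."""
    img = l2_normalize(image_emb)
    logits = {}
    for label, emb in label_embs.items():
        txt = l2_normalize(emb)
        logits[label] = logit_scale * sum(a * b for a, b in zip(img, txt))
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical text prompts with toy embeddings (stand-ins for text-encoder output).
label_embs = {
    "a photo of species_a": [0.9, 0.1, 0.0],
    "a photo of species_b": [0.1, 0.9, 0.1],
    "a photo of species_c": [0.0, 0.2, 0.9],
}
# Toy embedding of one segmented fish crop (stand-in for image-encoder output).
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best)
```

Because the labels are supplied as text at inference time, adding a species to the candidate set only requires a new prompt rather than new annotated training images, which is the property a training-free design relies on.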