Fine-Grained Temporal Site Monitoring in EGD Streams via Visual Time-Aware Embedding and Vision-Text Asymmetric Coworking

Published: 01 Jan 2025, Last Modified: 17 Apr 2025. Venue: IEEE J. Biomed. Health Informatics 2025. License: CC BY-SA 4.0
Abstract: Esophagogastroduodenoscopy (EGD) requires complete inspection of many upper gastrointestinal (UGI) sites for accurate cancer screening. Automated temporal site monitoring for EGD assistance is therefore in high demand, yet existing online action detection methods often fail when applied to it directly. The key challenges are two-fold: 1) global camera motion dominates, invalidating the temporal patterns derived from object optical flow, and 2) the UGI sites are fine-grained, yielding highly homogeneous appearances. In this paper, we propose an EGD-customized model powered by two novel designs, Visual Time-aware Embedding and Vision-text Asymmetric Coworking (VTE+VAC), for accurate real-time fine-grained UGI site monitoring. Concretely, VTE learns visual embeddings by differentiating frames via classification losses while also reordering sampled time-agnostic frames into a temporally coherent sequence via a ranking loss. This joint objective encourages VTE to capture sequential relations without resorting to the inapplicable object optical flow, and thus to provide time-aware frame-wise embeddings. In the subsequent analysis, VAC uses a temporal sliding window and extracts vision-text multimodal knowledge from each frame and its corresponding textualized prediction via the learned VTE and a frozen BERT. The text embeddings provide more representative cues, but may also mislead the model because of prediction errors. VAC therefore randomly drops or replaces historical predictions to increase error tolerance and avoid collapsing onto the last few predictions. Qualitative and quantitative experiments demonstrate that the proposed method outperforms other state-of-the-art methods, with an average F1-score improvement of at least 7.66%.
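The random drop-or-replace step that VAC applies to historical predictions can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `corrupt_history`, the probabilities `p_drop` and `p_replace`, the `[MASK]` placeholder used for a dropped prediction, and the example site labels are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def corrupt_history(history, labels, p_drop=0.1, p_replace=0.1,
                    mask="[MASK]", rng=None):
    """Randomly drop or replace textualized predictions in a sliding window.

    Hypothetical helper: the probabilities, the mask token, and the choice
    to mask (rather than delete) a dropped entry are illustrative
    assumptions, not values taken from the paper.
    """
    rng = rng if rng is not None else random
    corrupted = []
    for pred in history:
        r = rng.random()
        if r < p_drop:
            # "Drop": hide the prediction but keep the window length fixed.
            corrupted.append(mask)
        elif r < p_drop + p_replace:
            # "Replace": swap in a different label to simulate a past error.
            others = [label for label in labels if label != pred]
            corrupted.append(rng.choice(others) if others else pred)
        else:
            # Keep the original prediction unchanged.
            corrupted.append(pred)
    return corrupted


# Illustrative site labels (assumed, not the paper's label set).
SITES = ["esophagus", "gastric body", "antrum", "duodenal bulb"]
window = ["esophagus", "esophagus", "gastric body", "antrum"]
noisy_window = corrupt_history(window, SITES, rng=random.Random(0))
```

In a training loop, a corrupted window like `noisy_window` would be passed to the text encoder in place of the clean history, so that the model sees imperfect past predictions during training and learns not to rely solely on the most recent ones.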