\documentclass{midl} % Include author names
% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\jmlrvolume{-- Under Review}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026 submission}
\editors{Under Review for MIDL 2026}

\usepackage{array}         
\usepackage[table]{xcolor}

\usepackage{amsmath}

\usepackage{booktabs}
\usepackage{multirow}
\usepackage{adjustbox}

\usepackage{xcolor}
\newcommand{\CYcomment}[1]{\textcolor{red}{[CY: #1]}}

\usepackage{algorithm}
\usepackage{algpseudocode}


\usepackage[font=small]{caption}
\setlength{\abovecaptionskip}{4pt}
\setlength{\belowcaptionskip}{4pt}

\title[Closed-Loop Memory Rectification]{Detector-in-the-Loop Tracking: Active Memory Rectification for Stable Glottic Opening Localization}


% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Huayu Wang\nametag{$^{1}$}} \Email{huayu@uw.edu}\\
\Name{Bahaa Alattar\nametag{$^{1}$}} \Email{balattar@uw.edu}\\ 
\Name{Cheng-Yen Yang\nametag{$^{1}$}} \Email{cycyang@uw.edu}\\
\Name{Hsiang-Wei Huang\nametag{$^{1}$}} \Email{hwhuang@uw.edu}\\
\Name{Jung Heon Kim\nametag{$^{2}$}} \Email{medjh@daum.net}\\
\Name{Linda Shapiro\nametag{$^{1}$}} \Email{shapiro@cs.washington.edu}\\
\Name{Nathan White\nametag{$^{1, 3}$}} \Email{whiten4@uw.edu}\\
\Name{Jenq-Neng Hwang\nametag{$^{1}$}} \Email{hwang@uw.edu}\\
\addr $^{1}$ University of Washington, Seattle, WA \\
\addr $^{2}$ Ajou University School of Medicine, Suwon, Republic of Korea \\
\addr $^{3}$ Harborview Medical Center, Seattle, WA \\
}

\begin{document}

\maketitle

\begin{abstract}

Temporal stability in glottic opening localization remains challenging due to the complementary weaknesses of single-frame detectors and foundation-model trackers: the former lack temporal context, while the latter suffer from memory drift. Specifically, in video laryngoscopy, rapid tissue deformation, occlusions, and visual ambiguities in emergency settings require a robust, temporally aware solution that can prevent progressive tracking errors. We propose Closed-Loop Memory Correction (CL-MC), a detector-in-the-loop framework that supervises SAM2 through confidence-aligned state decisions and active memory rectification. High-confidence detections trigger semantic resets that overwrite corrupted tracker memory, enabling fully training-free, drift-free tracking in complex endoscopic scenes. On emergency intubation videos, CL-MC achieves state-of-the-art performance, significantly reducing drift and missing rate compared with SAM2 variants and open-loop methods. Our results establish memory correction as a crucial component for reliable clinical video tracking.


\end{abstract}

\begin{keywords}
Video Object Detection, Video Laryngoscopy, YOLO, SAM2
\end{keywords}

\section{Introduction}

Video laryngoscopy (VL) is a preferred method for endotracheal intubation to maintain airway patency or to stabilize oxygenation or ventilation during critical illness~\cite{prekker2023video}. Automated localization and tracking of the glottic opening is a critical prerequisite for downstream tasks such as glottal area segmentation and instrument size selection~\cite{carlson2016novel, masumori2024glottis, matava2020convolutional, cui2025improved}. A popular approach in this domain is to train a single-frame detector~\cite{wang2024yolov10realtimeendtoendobject, tian2025yolov12} and apply it frame by frame during inference. In video laryngoscopy, real-time YOLO-based detectors~\citep{kim2023development} provide strong semantic discriminability and can reliably identify the vocal cords from learned appearance cues. However, purely frame-wise detection is inherently unstable because it lacks temporal context, often resulting in jitter and false negatives under brief occlusions.

Traditional hybrid tracking strategies, including Kalman filter-based association~\cite{bewley2016sort, wojke2017deepsort} and IoU-driven smoothing~\cite{bochinski2018extending}, operate purely at the output level by linking or refining detector predictions across frames. Consequently, when the tracker drifts toward semantically incorrect structures, output-level smoothing can suppress visible jitter but cannot restore the underlying memory state that caused the drift. On the other hand, video object segmentation foundation models like SAM2~\citep{ravi2024sam} deliver strong temporal continuity by using memory banks to track objects across frames. Yet SAM2 is class-agnostic and highly dependent on its initial prompt; in long endoscopic sequences, it is prone to semantic drift, where the tracker gradually shifts to distracting structures and cannot self-correct without external guidance, as illustrated in Figure~\ref{fig:closed_loop}.

% Standard fusion methods typically employ Kalman Filters or simple Intersection-over-Union (IoU) logic to smooth trajectories post-hoc. While these methods can filter noise, they function as an "open-loop" system: the refined result is never communicated back to the tracker. Consequently, if the tracker drifts, the fusion module can only suppress the error temporarily, but cannot rectify the tracker's internal state. We show a comparison of these two in Figure ~\ref{fig:closed_loop}.

\begin{figure}[t]
\floatconts
  {fig:closed_loop}
  {\caption{Comparison between open-loop and closed-loop tracking.}}
  {\includegraphics[width=0.7\linewidth]{figures/closedloop.pdf}}
  \vspace{-2em}
\end{figure}

To overcome this limitation, we introduce a \textbf{Closed-Loop Memory Correction (CL-MC)} framework. Instead of treating detection and tracking as isolated components, CL-MC establishes a bidirectional pathway between a single-frame semantic detector and the SAM2 tracker. This transforms the detector’s role from passive refinement to active semantic supervision, enabling training-free correction of drift and stable glottic localization under challenging clinical conditions. In summary, our contributions are: \textbf{a closed-loop memory rectification mechanism} that leverages fused high-confidence detections to dynamically re-initialize SAM2’s memory, providing a training-free pathway to substantially enhance long-term video stability; \textbf{a heterogeneous confidence alignment module} that normalizes the single-frame detector and SAM2 prediction scores into a unified confidence space for adaptive decision-making; and \textbf{a state-machine-driven control strategy}, designed for the challenges of endoscopic video (including occlusion, rapid deformation, and specular noise), that dynamically selects the appropriate prediction source and triggers memory rectification when drift is detected.


\section{Related Work}

\subsection{Glottic Localization in Video Laryngoscopy} Accurate localization of the glottic opening~\cite{pedersen2023localization, kruse2023glottisnetv2} is a prerequisite for autonomous endotracheal intubation. While deep learning-based detectors like YOLO have established a strong baseline, their deployment in emergency settings is hindered by a significant data-application gap. Most public benchmarks (e.g., the Laryngoscope8 \cite{yin2021laryngoscope8} dataset) are derived from transnasal laryngoscopy for diagnostic purposes. These images differ markedly from emergency intubation scenarios in terms of anatomical morphology, viewing angles, and illumination conditions. This substantial \textit{domain shift} renders standard single-frame detectors fragile; without temporal context, they struggle to generalize to the dynamic, visually degraded environment of emergency intubation, leading to inconsistent localization and tracking instability.

\subsection{Segment Anything Model 2 and Memory Contamination} SAM2~\cite{ravi2024sam} extends promptable segmentation to video via a sophisticated memory attention mechanism. While effective for generic tracking, SAM2 is fundamentally class-agnostic and lacks intrinsic semantic understanding of anatomical targets. In endoscopic scenarios, this reliance on low-level visual coherence makes it susceptible to \textit{semantic drift}. Critically, SAM2 maintains temporal consistency using a First-In-First-Out (FIFO) memory bank. This passive update strategy creates a vulnerability to memory contamination: erroneous features generated during moments of blur or occlusion are indiscriminately stored and retrieved, iteratively degrading tracking performance. Although recent variants like SAMURAI \cite{yang2024samurai} and MA-SAM2 \cite{yin2025memory} propose refined update rules, they remain self-contained systems without external semantic grounding. Once the memory is corrupted, these models possess no mechanism to recover, motivating our proposed active memory rectification strategy driven by high-confidence detection priors.

\subsection{Tracking-by-Detection and Representation-Level Correction}

State-of-the-art tracking frameworks such as ByteTrack~\cite{zhang2022bytetrack} and 
BoT-SORT~\cite{botTracker} associate detector outputs across frames using motion constraints or IoU-based heuristics. These methods are highly effective for multi-object tracking in natural scenes, where appearance cues are relatively stable and motion is approximately linear. However, they operate exclusively at the \emph{bounding-box level}: detections are linked or smoothed, but the underlying representation of the tracked object remains unchanged. As a result, when the tracker deviates from the anatomical target due to occlusion, specular highlights, or rapid tissue deformation, output-level association cannot correct the internal features that caused the drift. This makes recovery particularly difficult in endoscopic video, where appearance can change abruptly and visual ambiguities are common.

In contrast, our CL-MC method introduces a representation-level correction mechanism. High-confidence detections are not only used for frame-wise prediction but also serve as \emph{semantic supervisory signals} that directly update the tracker’s memory. This allows the system to actively overwrite contaminated features and restore the tracker to the correct anatomical structure.

% Such a memory-intervention strategy is absent from conventional tracking-by-detection pipelines, which motivates our closed-loop formulation for robust glottic localization under challenging clinical conditions.

% \subsection{Tracking-by-Detection and Closed-Loop Fusion} State-of-the-art tracking paradigms, such as ByteTrack \cite{zhang2022bytetrack} and BoT-SORT \cite{botTracker}, rely on Kalman Filters to associate detections across frames. While it mathematically operate as a closed-loop estimator, their feedback is strictly kinematic—updating position and velocity vectors based on measurement residuals. This design assumes that tracking errors stem from Gaussian noise in coordinates, not from fundamental model corruption. Consequently, if the visual tracker begins to drift due to confusing background textures, the Kalman Filter merely smooths the divergent trajectory rather than correcting the underlying cause.

% To address this, we propose a semantic closed-loop system. Unlike standard approaches that treat detections merely as inputs for association, our framework uses high-confidence detections as a supervisory signal to actively intervene. By triggering a "hard reset" of the tracker's internal memory upon detecting drift, we provide a recovery mechanism that kinematic filtering alone cannot achieve.


\section{Methods}


\begin{figure}[h]
\floatconts
  {fig:pipeline}
  {\caption{
  % \textbf{Architecture of the Trend-Aware Closed-Loop Memory Correction Module.} The pipeline operates in a parallel structure involving a semantic detector ($D_{yolo}$) and a visual tracker ($\mathcal{T}_{sam}$). 
  Upon high-confidence initialization ($\tau_{init}$), both branches process incoming frames $I_t$ to generate candidate bounding boxes and scores. The core component is the Memory Correction Module, which not only integrates predictions but also drives a memory rectification loop. Unlike passive FIFO updates, this module actively uses high-confidence fusion results to reset or refresh the SAM2 memory bank, thereby preventing semantic drift and memory contamination in long sequences.}}
  {\includegraphics[width=0.9\linewidth]{figures/pipeline.pdf}}
\end{figure}

We aim to extend a pre-trained single-frame glottic opening detector to the video domain without retraining. To achieve robust temporal performance under severe domain shift and endoscopic artifacts, we formulate a \textbf{Closed-Loop Memory Correction (CL-MC)} framework that integrates semantic detection with SAM2’s temporal propagation. Instead of fusing predictions only at the output level, CL-MC establishes an explicit control mechanism that governs when to rely on the detector, when to rely on the tracker, and when to intervene in the tracker’s internal memory. The architecture consists of three key components: \textbf{(1)} a single-frame semantic detector $\mathcal{D}_{yolo}$ that provides high-confidence appearance cues; \textbf{(2)} a SAM2-based temporal tracker $\mathcal{T}_{sam2}$ responsible for visual continuity; and \textbf{(3)} a state-machine controller that aligns heterogeneous confidence signals, selects the appropriate prediction source, and triggers \emph{memory rectification} when drift is detected, as shown in Figure~\ref{fig:pipeline}. This closed-loop design enables the system to actively overwrite corrupted memory representations, allowing stable and drift-resistant tracking of the glottic opening. The complete inference procedure is summarized in Algorithm~\ref{alg:fusion_logic}.


\subsection{Heterogeneous Confidence Alignment}

A key challenge in combining predictions from the single-frame detector and SAM2 is the mismatch in their confidence semantics. The detector outputs a confidence score $s_t^{y} \in [0,1]$, whereas SAM2 produces a predicted IoU score $s_t^{s}$ reflecting mask quality. Because these signals differ in distribution and dynamic range, direct comparison is unreliable.

To obtain a unified confidence space, we apply a \textbf{trend-aware normalization} strategy. Let $\mathcal{H}_t$ denote a sliding window containing the past $K$ tracker scores. The normalized confidence is defined as:

\begin{equation}
s_t^{s'} = 
\frac{s_t^{s} - \min(\mathcal{H}_t)}
{\max(\mathcal{H}_t) - \min(\mathcal{H}_t) + \epsilon}.
\end{equation}

Here $\epsilon$ is a small constant that prevents division by zero. This adaptive scaling calibrates the tracker’s scores to local quality variations, allowing the state machine to compare $s_t^{y}$ and $s_t^{s'}$ on a consistent scale. The spatial consistency between the detector box $b_t^{y}$ and the tracker box $b_t^{s}$ is evaluated using their IoU, $v_t^{iou} = \text{IoU}(b_t^{y}, b_t^{s})$.
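
As a purely illustrative calculation (the numbers below are hypothetical and are not taken from our experiments), suppose $K=4$, the window holds $\mathcal{H}_t = \{0.60, 0.70, 0.80, 0.90\}$, and the current tracker score is $s_t^{s} = 0.75$. Neglecting $\epsilon$,
\begin{equation*}
s_t^{s'} \approx \frac{0.75 - 0.60}{0.90 - 0.60} = 0.5,
\end{equation*}
so a raw score of $0.75$ is mapped to the middle of the range the tracker has recently produced. If the window is assumed to contain only past scores, a current score outside $[\min(\mathcal{H}_t), \max(\mathcal{H}_t)]$ would yield a value outside $[0,1]$; clipping to $[0,1]$ would be one way to handle this case, which we mention here only as an assumption.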

% \begin{equation}
% v_t^{iou} = \text{IoU}(b_t^{y}, b_t^{s}).
% \end{equation}

% --------------------------------------------------------------

\subsection{State-Machine Driven Prediction Selection}

To ensure stable tracking despite occlusions, motion blur, and rapid deformation, we employ a state-machine controller that selects the appropriate source of prediction at each frame. States are determined using the aligned confidence values $(s_t^{y}, s_t^{s'})$ and the spatial similarity $v_t^{iou}$.

\begin{itemize}

\item \textbf{State 1 — Agreement:}  
When both predictions spatially agree
\begin{equation}
v_t^{iou} > \tau_{iou},
\end{equation}
we refine the output using confidence-based interpolation:
\begin{equation}
\alpha_t = \frac{s_t^{y}}{s_t^{y} + s_t^{s'}},
\qquad
b_t^{final} = \alpha_t b_t^{y} + (1 - \alpha_t) b_t^{s}.
\end{equation}

\item \textbf{State 2 — Detector Uncertain (YOLO Lost):}  
If the detector confidence falls below the loss threshold
\begin{equation}
s_t^{y} < \tau_{lost},
\end{equation}
the system relies on SAM2’s temporal continuity:
\begin{equation}
b_t^{final} = b_t^{s}.
\end{equation}


\item \textbf{State 3 — Drift Detected (Detector Wins):}  
Drift is identified when the detector is confident but disagrees strongly with SAM2:
\begin{equation}
s_t^{y} > \tau_{drift} \quad \land \quad v_t^{iou} < \tau_{iou}.
\end{equation}
In this case,
\begin{equation}
b_t^{final} = b_t^{y},
\end{equation}
and a memory rectification operation is triggered to correct SAM2’s representation.

\item \textbf{State 4 — Tracker-Preserved Conflict:}  
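In all remaining cases, which arise when the predictions disagree spatially while the detector is neither lost nor confident enough to signal drift ($\tau_{lost} \le s_t^{y} \le \tau_{drift}$), the detection is treated as unreliable and SAM2’s temporally consistent prediction is preserved: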
\begin{equation}
b_t^{final} = b_t^{s}.
\end{equation}

\end{itemize}

Rather than merely merging detection and tracking outputs, this controller also determines \emph{when} and \emph{how} the tracker’s internal memory should be corrected, forming the basis of our closed-loop paradigm.
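
To make the decision rules concrete, consider a hypothetical frame; the box values, the scores, and the corner format $(x_1, y_1, x_2, y_2)$ are assumptions chosen for illustration. Let $b_t^{y} = (100, 80, 200, 180)$, $b_t^{s} = (110, 90, 210, 190)$, $s_t^{y} = 0.6$, and $s_t^{s'} = 0.4$. Both boxes have area $100 \times 100$ and overlap in a $90 \times 90$ region, so
\begin{equation*}
v_t^{iou} = \frac{8100}{10000 + 10000 - 8100} = \frac{8100}{11900} \approx 0.68 > \tau_{iou} = 0.5,
\end{equation*}
and State 1 applies:
\begin{equation*}
\alpha_t = \frac{0.6}{0.6 + 0.4} = 0.6,
\qquad
b_t^{final} = 0.6\, b_t^{y} + 0.4\, b_t^{s} = (104, 84, 204, 184).
\end{equation*}
If the tracker box were instead disjoint from $b_t^{y}$ (so that $v_t^{iou} = 0$) and the detector score were $s_t^{y} = 0.9 > \tau_{drift} = 0.8$, the State 3 condition would hold, giving $b_t^{final} = b_t^{y}$ together with a memory rectification.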

% --------------------------------------------------------------

\subsection{Active Memory Rectification}

Most tracking-by-detection pipelines treat the tracker as a fixed black box whose internal representation cannot be altered. In contrast, we introduce \textbf{Active Memory Rectification}, which directly intervenes in SAM2’s memory bank.

Let SAM2 maintain a memory set $\mathcal{M}_t = \{m_1, \dots, m_L\}$ at time $t$.  
When drift is detected (State 3), we execute a \emph{hard reset}:

\begin{equation}
\mathcal{M}_{t} \leftarrow \text{Encode}(I_t, b_t^{final}),
\end{equation}
thereby overwriting corrupted features with detector-guided semantics.  
For stable frames (States 1, 2, and 4), we apply a \emph{soft update}:

\begin{equation}
\mathcal{M}_{t} \leftarrow 
\text{Update}(\mathcal{M}_{t-1}, 
\text{Encode}(I_t, b_t^{final})).
\end{equation}

Importantly, the selected bounding box does not merely serve as the model output—it becomes an explicit supervisory signal used to correct or refine SAM2’s representation. This closed-loop feedback mechanism enables SAM2 to recover from drift, a capability absent in conventional tracking pipelines.
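
For reference, the two update rules above can be written as a single case distinction over the controller state:
\begin{equation*}
\mathcal{M}_{t} \leftarrow
\begin{cases}
\text{Encode}(I_t, b_t^{final}), & \text{State 3 (hard reset)},\\[2pt]
\text{Update}\bigl(\mathcal{M}_{t-1}, \text{Encode}(I_t, b_t^{final})\bigr), & \text{States 1, 2, and 4 (soft update)}.
\end{cases}
\end{equation*}
Under the assumption that $\text{Encode}$ produces a single memory entry, the memory after a hard reset contains only that entry, so earlier and possibly contaminated entries can no longer be attended to in subsequent frames.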

% \begin{algorithm}[t]
% \small
% \caption{Closed-Loop Memory Correction with State-Machine Prediction Selection}
% \label{alg:fusion_logic}
% \begin{algorithmic}[1]
% \Require 
%     Input video frames $\mathcal{V} = \{I_1, \dots, I_T\}$; 
%     YOLO detector $\mathcal{D}_{yolo}$; 
%     SAM2 tracker $\mathcal{T}_{sam}$.
% \Require 
%     Parameters: $\tau_{init}$ (0.75), $\tau_{lost}$ (0.1), $\tau_{drift}$ (0.8), $\tau_{iou}$ (0.5).
% \Ensure 
%     Target trajectory $\mathcal{B} = \{b_1, \dots, b_T\}$.

% \State \textbf{Initialization:} Find first frame $t_0$ where $\mathcal{D}_{yolo}(I_{t_0}) > \tau_{init}$.
% \State Initialize $\mathcal{T}_{sam}$ with $b_{t_0}^{yolo}$.
% \State Initialize score history buffer $\mathcal{H}$ for normalization.

% \For{$t = t_0 + 1$ to $T$}
%     \State \textbf{Step 1: Dual Inference}
%     \State $b_t^{y}, s_t^{y} \leftarrow \mathcal{D}_{yolo}(I_t)$ \Comment{Detector prediction}
%     \State $b_t^{s}, s_t^{s} \leftarrow \mathcal{T}_{sam}(I_t)$ \Comment{Tracker prediction}
    
%     \State \textbf{Step 2: Confidence Alignment}
%     \State $s_t^{s'} \leftarrow \text{Normalize}(s_t^{s}, \mathcal{H})$ \Comment{History-aware score alignment}
%     \State $v_{iou} \leftarrow \text{IoU}(b_t^{y}, b_t^{s})$
    
%     \State \textbf{Step 3: State-Machine Decision}
%     \If{$s_t^{y} < \tau_{lost}$} 
%         \Comment{State 1: Detector Uncertain}
%         \State $b_t^{final} \leftarrow b_t^{s}$ \Comment{Rely on temporal continuity}
        
%     \ElsIf{$v_{iou} > \tau_{iou}$} 
%         \Comment{State 2: Semantic Agreement}
%         \State $b_t^{final} \leftarrow \alpha \cdot b_t^{y} + (1-\alpha) \cdot b_t^{s}$ 
%         \Comment{Confidence-weighted refinement}
        
%     \ElsIf{$s_t^{y} > \tau_{drift}$ \textbf{and} $v_{iou} < \tau_{iou}$} 
%         \Comment{State 3: Drift Detected}
%         \State $b_t^{final} \leftarrow b_t^{y}$ 
%         \State \textbf{Trigger:} $\text{ResetMemory}(\mathcal{T}_{sam}, I_t, b_t^{final})$ 
%         \Comment{Detector-guided memory correction}
        
%     \Else 
        
%         \State $b_t^{final} \leftarrow b_t^{s}$ \Comment{State 4: Tracker-Preserved Conflict}
%         \Comment{Reject unreliable detection}
%     \EndIf

%     \State \textbf{Step 4: Closed-Loop Feedback}
%     \State $\text{UpdateMemory}(\mathcal{T}_{sam}, I_t, b_t^{final})$ 
%     \Comment{Inject corrected prediction into SAM2 memory}
%     \State Update history $\mathcal{H}$ with $s_t^{s}$.
%     \State Add $b_t^{final}$ to $\mathcal{B}$.
% \EndFor

% \State \Return $\mathcal{B}$
% \end{algorithmic}
% \end{algorithm}

% \begin{algorithm}[t]
% \SetKwComment{Comment}{$\triangleright$ }{}
% \caption{Closed-Loop Memory Correction (CL-MC)}
% \label{alg:fusion_logic}
% \KwIn{Video frames $\mathcal{V} = \{I_1, \dots, I_T\}$; Detector $\mathcal{D}_{yolo}$; Tracker $\mathcal{T}_{sam2}$}
% \KwIn{Thresholds: $\tau_{init}$, $\tau_{lost}$, $\tau_{drift}$, $\tau_{iou}$; Window size $K$}
% \KwOut{Trajectory $\mathcal{B} = \{b_1, \dots, b_T\}$}

% \textbf{Initialization:} Find first frame $t_0$ where $s_{t_0}^{y} > \tau_{init}$\;
% Initialize $\mathcal{T}_{sam2}$ with $(I_{t_0}, b_{t_0}^{y})$\;
% Initialize score history buffer $\mathcal{H} \leftarrow \emptyset$\;

% \For{$t = t_0 + 1$ \KwTo $T$}{
%     $\triangleright$ \textbf{Step 1: Dual-Branch Inference}\;
%     $b_t^{y}, s_t^{y} \leftarrow \mathcal{D}_{yolo}(I_t)$\;
%     $b_t^{s}, s_t^{s} \leftarrow \mathcal{T}_{sam2}(I_t)$\;
    
%     $\triangleright$ \textbf{Step 2: Heterogeneous Confidence Alignment}\;
%     $s_t^{s'} \leftarrow (s_t^{s} - \min(\mathcal{H})) / (\max(\mathcal{H}) - \min(\mathcal{H}) + \epsilon)$\;
%     $v_t^{iou} \leftarrow \text{IoU}(b_t^{y}, b_t^{s})$\;
    
%     $\triangleright$ \textbf{Step 3: State-Machine Prediction Selection}\;
%     \uIf{$v_t^{iou} > \tau_{iou}$}{ 
%         $\alpha_t \leftarrow s_t^{y} / (s_t^{y} + s_t^{s'})$ \hfill $\triangleright$ \textit{State 1: Agreement}\;
%         $b_t^{final} \leftarrow \alpha_t \cdot b_t^{y} + (1-\alpha_t) \cdot b_t^{s}$\;
%         $\mathcal{M}_{t} \leftarrow \text{Update}(\mathcal{M}_{t-1}, \text{Encode}(I_t, b_t^{final}))$ \hfill $\triangleright$ \textit{Soft Update}\;
%     }
%     \uElseIf{$s_t^{y} < \tau_{lost}$}{ 
%         $b_t^{final} \leftarrow b_t^{s}$ \hfill $\triangleright$ \textit{State 2: Detector Uncertain}\;
%         $\mathcal{M}_{t} \leftarrow \text{Update}(\mathcal{M}_{t-1}, \text{Encode}(I_t, b_t^{final}))$ \hfill $\triangleright$ \textit{Soft Update}\;
%     }
%     \uElseIf{$s_t^{y} > \tau_{drift}$ \textbf{and} $v_t^{iou} < \tau_{iou}$}{ 
%         $b_t^{final} \leftarrow b_t^{y}$ \hfill $\triangleright$ \textit{State 3: Drift Detected}\;
%         $\mathcal{M}_{t} \leftarrow \text{Encode}(I_t, b_t^{final})$ \hfill $\triangleright$ \textit{Hard Reset}\;
%     }
%     \Else{ 
%         $b_t^{final} \leftarrow b_t^{s}$ \hfill $\triangleright$ \textit{State 4: Tracker-Preserved}\;
%         $\mathcal{M}_{t} \leftarrow \text{Update}(\mathcal{M}_{t-1}, \text{Encode}(I_t, b_t^{final}))$ \hfill $\triangleright$ \textit{Soft Update}\;
%     }
    
%     $\triangleright$ \textbf{Step 4: Update History and Output}\;
%     Update $\mathcal{H}$ with $s_t^{s}$ (keep last $K$ entries)\;
%     $\mathcal{B} \leftarrow \mathcal{B} \cup \{b_t^{final}\}$\;
% }
% \Return{$\mathcal{B}$}
% \end{algorithm}

\begin{algorithm}[t]
\small
\caption{Closed-Loop Memory Correction with State-Machine Prediction Selection}
\label{alg:fusion_logic}
\KwIn{Video frames $\mathcal{V} = \{I_1, \dots, I_T\}$; Detector $\mathcal{D}_{yolo}$; Tracker $\mathcal{T}_{sam2}$; thresholds $\tau_{init}$, $\tau_{lost}$, $\tau_{drift}$, $\tau_{iou}$; window size $K$}
\KwOut{Target trajectory $\mathcal{B} = \{b_1, \dots, b_T\}$}

\textbf{Initialization:} Find first frame $t_0$ where $s_{t_0}^{y} > \tau_{init}$\;
Initialize $\mathcal{T}_{sam2}$ with $(I_{t_0}, b_{t_0}^{y})$\;
Initialize score history buffer $\mathcal{H}$ for normalization and trajectory $\mathcal{B} \leftarrow \emptyset$\;

\For{$t = t_0 + 1$ \KwTo $T$}{
    $\triangleright$ \textbf{Step 1: Inference}\;

    $b_t^{y}, s_t^{y} \leftarrow \mathcal{D}_{yolo}(I_t)$ \hfill $\triangleright$ \textit{Detector prediction}\;

    $b_t^{s}, s_t^{s} \leftarrow \mathcal{T}_{sam2}(I_t)$ \hfill $\triangleright$ \textit{Tracker prediction}\;

    $\triangleright$ \textbf{Step 2: Confidence Alignment}\;

    $s_t^{s'} \leftarrow \text{Normalize}(s_t^{s}, \mathcal{H})$ \hfill $\triangleright$ \textit{History-aware score alignment}\;

    $v_t^{iou} \leftarrow \text{IoU}(b_t^{y}, b_t^{s})$\;

    $\triangleright$ \textbf{Step 3: State-Machine Decision}\;

    \uIf{$v_t^{iou} > \tau_{iou}$}{
        $\alpha_t \leftarrow s_t^{y} / (s_t^{y} + s_t^{s'})$\;
        $b_t^{final} \leftarrow \alpha_t \cdot b_t^{y} + (1-\alpha_t) \cdot b_t^{s}$ \hfill $\triangleright$ \textit{State 1: Agreement}\;
    }
    \uElseIf{$s_t^{y} < \tau_{lost}$}{
        $b_t^{final} \leftarrow b_t^{s}$ \hfill $\triangleright$ \textit{State 2: Detector Uncertain}\;
    }
    \uElseIf{$s_t^{y} > \tau_{drift}$ \textbf{and} $v_t^{iou} < \tau_{iou}$}{
        $b_t^{final} \leftarrow b_t^{y}$ \hfill $\triangleright$ \textit{State 3: Drift Detected}\;
    }
    \Else{
        $b_t^{final} \leftarrow b_t^{s}$ \hfill $\triangleright$ \textit{State 4: Tracker-Preserved}\;
    }

    $\triangleright$ \textbf{Step 4: Closed-Loop Feedback}\;

    \eIf{State 3}{
        $\mathcal{M}_{t} \leftarrow \text{Encode}(I_t, b_t^{final})$ \hfill $\triangleright$ \textit{Hard reset of SAM2 memory}\;
    }{
        $\mathcal{M}_{t} \leftarrow \text{Update}(\mathcal{M}_{t-1}, \text{Encode}(I_t, b_t^{final}))$ \hfill $\triangleright$ \textit{Soft update}\;
    }
    Update $\mathcal{H}$ with $s_t^{s}$ (keep last $K$ entries)\;
    $\mathcal{B} \leftarrow \mathcal{B} \cup \{b_t^{final}\}$\;
}
\Return{$\mathcal{B}$}
\end{algorithm}

\section{Experiments}
\subsection{Dataset and Evaluation Protocol} 
Our semantic detector $\mathcal{D}_{yolo}$ was developed using non-emergency laryngeal images, comprising the Laryngoscope8 dataset \cite{yin2021laryngoscope8} (N=2,497) and 583 clinician-annotated images curated from YouTube. For video evaluation, we utilized 24 emergency intubation sequences collected from Harborview Medical Center during prehospital air medical transports. As shown in Figure~\ref{fig:dataset}, the video dataset contains 8,931 frames (297 seconds) with frame-level annotations provided by an experienced clinician.
\begin{figure}[h]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:dataset}
  {\caption{Detector training used the Laryngoscope8 and YouTube image datasets, while tracking performance is evaluated on the private 26-video Harborview dataset.}}
  {\includegraphics[width=0.98\linewidth]{figures/dataset.pdf}}
\vspace{-10pt}
\end{figure}


We validate our method on densely annotated laryngoscopic videos to assess robustness against aggressive corruptions, including occlusion, motion blur, and specular reflections. Following standard video detection protocols, we report mAP$_{50}$, mAP$_{50:95}$, PR-AUC, and Miss Rate.



% \textbf{Baselines:} We benchmark our proposed fusion tracker against two pivotal baselines: (1) the Single-Frame Detector, representing the appearance-based upper bound without temporal context; and (2) SAM2, which ensures temporal continuity but lacks semantic discrimination. This comparative design isolates our contributions: improvements over the single-frame detector highlight our framework's ability to recover missed detections via temporal propagation, while gains over SAM2 demonstrate the efficacy of our semantic-driven drift correction and memory rectification.

\subsection{Implementation Details} 

All experiments use the official SAM2.1-Large weights. Unless stated otherwise, hyperparameters are set as follows: detector initialization threshold $\tau_{init}=0.75$, agreement IoU threshold $\tau_{iou}=0.5$, and drift-confidence threshold $\tau_{drift}=0.8$. We trained the YOLO12-m detector $\mathcal{D}_{yolo}$ on the single-frame image dataset described in the previous section. The model was initialized with COCO~\cite{COCO} pre-trained weights and fine-tuned for 200 epochs.

\subsection{Baseline Models} 

We benchmark against two distinct tracking paradigms: (1) \textbf{Kinematic Filtering:} This category includes multiple SOTA object tracking methods such as BoT-SORT~\cite{botTracker} and ByteTrack~\cite{zhang2022bytetrack}. These methods utilize detections solely for association via motion priors (e.g., Kalman filters), lacking mechanisms to rectify the underlying model representation when visual features degrade. (2) \textbf{Foundation Model Trackers:} We evaluate SAM2~\cite{ravi2024sam} and SAMURAI~\cite{yang2024samurai}. While ensuring temporal continuity, these closed systems rely entirely on internal memory propagation without external semantic grounding, rendering them susceptible to cumulative drift in texture-homogeneous endoscopic environments.


\section{Results}
\subsection{Comparison with State-of-the-Art Methods}
Table~\ref{tab:sota_comparison} shows that neither single-frame detection nor kinematic association methods provide sufficient temporal robustness for endoscopic video. SAM-based trackers improve short-term continuity but suffer from drift due to memory contamination, resulting in high missing rates. Our Closed-Loop Memory Correction (CL-MC) achieves the highest AUC and lowest missing rate by actively intervening in SAM2’s memory state, demonstrating the importance of representation-level correction for reliable long-sequence tracking.





\begin{table*}[h]
\centering
\caption{\textbf{Quantitative Results.} Comparison on the Harborview video dataset. All methods utilize the same YOLO detector backbone. Baselines are grouped into (1) Kinematic Association methods and (2) Foundation Model-based trackers. Our proposed closed-loop memory correction achieves the lowest missing rate and highest AUC, demonstrating its robustness under challenging conditions.}
\label{tab:sota_comparison}
\begin{adjustbox}{width=\textwidth}
\begin{tabular}{lcccc}
\toprule
\textbf{Method} & \textbf{mAP$_{50}$ $\uparrow$} & \textbf{mAP$_{50:95}\uparrow$} & \textbf{AUC$\uparrow$} & \textbf{Missing$\downarrow$} \\ 
\midrule
\multicolumn{5}{l}{\textit{Baseline \& Kinematic Association}}  \\
YOLO12 & 82.41\% & 51.05\% & 74.57\% & 8.81\% \\
BoT-SORT & 82.53\% & 50.20\% & 74.78\% & 8.33\% \\
ByteTrack & 82.08\% & 46.97\% & 73.35\% & 8.72\% \\
\midrule
\multicolumn{5}{l}{\textit{Foundation Model Trackers}} \\
SAM2 & 70.47\% & 43.18\% & 66.79\% & 22.24\% \\
SAMURAI & 75.73\% & \textbf{52.73\%} & 70.89\% & 20.25\% \\
\midrule
\multicolumn{5}{l}{\textit{Proposed Method}} \\
\textbf{Closed-Loop Memory Correction~(Ours)} & \textbf{84.32\%} & 50.95\% & \textbf{76.52\%} & \textbf{6.85\%} \\
\bottomrule
\end{tabular}
\end{adjustbox}
\end{table*}
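
To put the missing-rate column of Table~\ref{tab:sota_comparison} in relative terms, the reduction achieved by CL-MC with respect to each baseline can be computed directly from the reported values. This is plain arithmetic on the table entries rather than an additional experiment:
\begin{align*}
\text{YOLO12:} &\quad \frac{8.81 - 6.85}{8.81} \approx 22.2\%, \\
\text{BoT-SORT:} &\quad \frac{8.33 - 6.85}{8.33} \approx 17.8\%, \\
\text{ByteTrack:} &\quad \frac{8.72 - 6.85}{8.72} \approx 21.4\%, \\
\text{SAM2:} &\quad \frac{22.24 - 6.85}{22.24} \approx 69.2\%, \\
\text{SAMURAI:} &\quad \frac{20.25 - 6.85}{20.25} \approx 66.2\%.
\end{align*}
The relative reduction is therefore largest against the foundation-model trackers, whose memory is never corrected by an external detector.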

\subsection{Ablation Studies}

We conduct a component-wise ablation study to evaluate the contribution of confidence normalization and memory rectification. The tested variants are summarized in Table~\ref{tab:ablation}:
(1) \textit{Open Loop Update}, which simply overwrites SAM2’s memory whenever YOLO is confident;
(2) \textit{Fixed Averaging}, which fuses the two outputs with static 0.5/0.5 weights, without considering model confidence; and
(3) \textit{Ours w/o Norm}, which removes the proposed trend-aware normalization and uses raw SAM2 confidence.

\textbf{Effect of Memory Rectification.}
The Open Loop Update and Fixed Averaging variants both lack the proposed memory rectification and show higher missing rates (8.63\% and 7.01\%, respectively) than the full model (6.85\%). This suggests that output-level fusion alone cannot prevent progressive memory drift, and that explicit representation-level correction provides a measurable benefit.
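
As a minimal sketch of the Fixed Averaging variant, and assuming for illustration that fusion acts on bounding-box coordinates (the symbols $b^{\mathrm{det}}_t$, $b^{\mathrm{trk}}_t$, and $\hat{b}_t$ are introduced only for this example), the fused output at frame $t$ would be
\begin{equation*}
\hat{b}_t = 0.5\, b^{\mathrm{det}}_t + 0.5\, b^{\mathrm{trk}}_t ,
\end{equation*}
where $b^{\mathrm{det}}_t$ denotes the YOLO box and $b^{\mathrm{trk}}_t$ the SAM2 box. Because the weights are constant, a drifting tracker box contributes as much as a confident detection, and SAM2's memory is left unchanged.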

\textbf{Effect of Confidence Normalization.}
Removing the confidence normalization (w/o Norm) reduces mAP$_{50}$ from 84.32\% to 83.85\% and raises the missing rate from 6.85\% to 7.02\%, as SAM2’s raw predicted IoU fluctuates significantly across sequences. Aligning the detector and tracker confidence spaces is therefore important for triggering drift correction reliably.
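
The idea behind history-based normalization can be illustrated with a simple sliding-window standardization. This is only a sketch under stated assumptions and not necessarily the exact rule used in our method: $s_t$ (SAM2's raw confidence at frame $t$), the window length $W$, and the small constant $\epsilon$ are notation chosen for this example. For $t \geq W$,
\begin{equation*}
\mu_t = \frac{1}{W} \sum_{k=t-W+1}^{t} s_k, \qquad
\sigma_t^2 = \frac{1}{W} \sum_{k=t-W+1}^{t} \left(s_k - \mu_t\right)^2, \qquad
\tilde{s}_t = \frac{s_t - \mu_t}{\sigma_t + \epsilon}.
\end{equation*}
Under this form, $\tilde{s}_t$ measures how far the current confidence departs from its recent history, so a sudden drop is flagged even in sequences where the absolute level of $s_t$ is high or low throughout.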

\begin{table*}[t]
\centering
\caption{\textbf{Effect of Proposed Components.} 
Component-wise analysis of the proposed method. 
\textit{History Norm}: sliding-window normalization of SAM2's confidence. 
\textit{Rectification}: detector-guided memory correction. 
Open Loop Update: YOLO directly overwrites SAM2 memory when confident. 
Fixed Averaging: static 0.5/0.5 fusion without confidence reasoning.}
\label{tab:ablation}
\begin{adjustbox}{width=\columnwidth}
\begin{tabular}{lccccc}
\toprule
\multirow{2}{*}{\textbf{Variant}} & \multicolumn{2}{c}{\textbf{Components}} & \multicolumn{3}{c}{\textbf{Performance}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-6}
 & \small{\textit{History Norm}} & \small{\textit{Rectification}} & \textbf{mAP$_{50}$} & \textbf{AUC} & \textbf{Missing} $\downarrow$ \\ 
\midrule
Open Loop Update 
& $\times$ & $\times$ 
& 82.88\% & 75.46\% & 8.63\% \\

Fixed Averaging (0.5/0.5) 
& $\checkmark$ & $\times$ 
& 84.15\% & 76.30\% & 7.01\% \\

Ours (w/o Norm) 
& $\times$ & \checkmark 
& 83.85\% & 76.30\% & 7.02\% \\
\midrule

\textbf{Ours (Full Model)} 
& \checkmark & \checkmark 
& \textbf{84.32\%} & \textbf{76.52\%} & \textbf{6.85\%} \\
\bottomrule
\end{tabular}
\end{adjustbox}
\end{table*}

\textbf{Effect of Data Scale.}
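Table~\ref{tab:compact_compare} compares our framework with the baseline YOLO detector under different training-data scales (30\%, 70\%, and 100\%). Both models benefit from more training data, and our method outperforms the standalone detector on every reported metric at the 70\% and 100\% scales. At 30\%, it still improves mAP$_{50}$ and mAP$_{50:95}$ but has a higher missing rate (23.55\% vs.\ 21.08\%), which suggests that the closed-loop mechanism may depend on a reasonably reliable detector. With sufficient data, the closed-loop tracking mechanism compensates for detection failures: at the 70\% scale our method already reaches a lower missing rate (8.31\%) than the detector trained on 100\% of the data (8.81\%).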

Overall, the full model achieves the best performance on all three ablation metrics in Table~\ref{tab:ablation}, indicating that both components, trend-aware normalization and closed-loop memory rectification, contribute to stable long-term tracking.










% Define colors
\definecolor{gaincolor}{RGB}{0, 150, 0} % Dark Green
\definecolor{losscolor}{RGB}{180, 0, 0} % Dark Red
\definecolor{bgray}{gray}{0.95} % Light gray background

% Macros for formatting
\newcommand{\better}[1]{{\scriptsize \textcolor{gaincolor}{(+#1)}}}
\newcommand{\betterdown}[1]{{\scriptsize \textcolor{gaincolor}{(-#1)}}} % For missing rate, where lower is better
\newcommand{\worse}[1]{{\scriptsize \textcolor{losscolor}{(+#1)}}}     % For missing rate, where higher is worse


\begin{table*}[t]
\centering
\caption{\textbf{Effect of Data Scale.} Our method is compared against the YOLO baseline at different training-data scales. All values are percentages. The difference relative to the baseline is shown in parentheses (green: improvement; red: degradation).}
\label{tab:compact_compare}
\setlength{\tabcolsep}{8pt}
\begin{tabular}{l cc cc cc}
\toprule
\multirow{2}{*}{\textbf{Data}} & \multicolumn{2}{c}{\textbf{mAP$_{50}$} ($\uparrow$)} & \multicolumn{2}{c}{\textbf{mAP$_{50:95}$} ($\uparrow$)} & \multicolumn{2}{c}{\textbf{Missing Rate} ($\downarrow$)} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
 & YOLO & \textbf{Ours} & YOLO & \textbf{Ours} & YOLO & \textbf{Ours} \\
\midrule
30\%  & 64.16 & \textbf{64.79} \better{0.6} & 35.09 & \textbf{36.21} \better{1.1} & \textbf{21.08} & 23.55 \worse{2.5} \\
70\%  & 81.28 & \textbf{82.29} \better{1.0} & 45.98 & \textbf{46.61} \better{0.6} & 11.21 & \textbf{8.31} \betterdown{2.9} \\
100\% & 82.41 & \textbf{84.32} \better{1.9} & 46.97 & \textbf{50.95} \better{4.0} & 8.81  & \textbf{6.85} \betterdown{2.0} \\
\bottomrule
\end{tabular}
\end{table*}
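
As a worked example of how the parenthesized differences in Table~\ref{tab:compact_compare} are obtained, the entries of the 100\% row follow from subtracting the YOLO value from ours, in percentage points, and rounding to one decimal place:
\begin{align*}
\Delta\,\mathrm{mAP}_{50} &= 84.32 - 82.41 = 1.91 \approx 1.9, \\
\Delta\,\mathrm{mAP}_{50:95} &= 50.95 - 46.97 = 3.98 \approx 4.0, \\
\Delta\,\mathrm{Missing} &= 6.85 - 8.81 = -1.96 \approx -2.0.
\end{align*}
A negative difference in the missing rate is an improvement, whereas the positive difference in the 30\% row ($23.55 - 21.08 = 2.47 \approx 2.5$) is a degradation.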

\subsection{Qualitative Results}

\begin{figure}[h]
\floatconts
  {fig:visulaztion_result}
  {\caption{\textbf{Visualization of Qualitative Results.} Blue and green boxes denote the outputs of our method and YOLO, respectively.}}
  {\includegraphics[width=0.85\linewidth]{figures/vis.pdf}}
\end{figure}

Qualitative results in Figure~\ref{fig:visulaztion_result} illustrate the robustness of our method. While the YOLO baseline (green) is precise in simple cases (Panel A), our method (blue) remains accurate in more complex scenarios. Panel (B) demonstrates the efficacy of our Drift Detection module, which rectifies tracker expansion caused by color similarity. Panel (C) shows that our method maintains accurate localization even under severe motion blur and lighting changes, conditions that are critical for clinical deployment and under which it outperforms the baseline.

\section{Conclusion}
We presented a closed-loop memory correction framework for reliable glottic opening localization in video laryngoscopy. Unlike conventional tracking-by-detection methods and foundation-model trackers, which operate solely at the output level, our approach introduces a representation-level feedback mechanism that allows high-confidence detections to actively supervise and correct the internal memory of SAM2. Through a state-machine controller, heterogeneous confidence alignment, and targeted memory rectification, the proposed method enables stable and drift-resistant tracking under severe domain shift, rapid deformation, and occlusion, conditions under which existing methods often fail.

Comprehensive experiments on real clinical intubation videos demonstrate that our framework improves temporal robustness, achieving the highest AUC and the lowest missing rate among all compared methods. Ablation studies further support the importance of dynamic confidence normalization and memory correction, highlighting the value of active rather than passive temporal modeling in endoscopic video analysis.

Overall, our results indicate that closed-loop semantic feedback is a promising strategy for controlling foundation model trackers in medical video applications, and one that may generalize beyond glottic opening localization. Future work will explore extending this paradigm to multi-object anatomical tracking, leveraging richer supervisory signals, and integrating language-conditioned priors to further enhance robustness across diverse clinical environments.

\clearpage

\bibliography{vocalDetect}


\end{document}

