Distill, Suppress, and Fuse: Cross-Modal Knowledge Integration for Optical Flow-Free Temporal Action Segmentation

Published: 01 Jun 2026, Last Modified: 10 Jun 2026
Venue: AdaptFM Poster
License: CC BY 4.0
Keywords: Cross-modal knowledge distillation, Temporal action segmentation, Multi-modal learning
TL;DR: RELATE enables efficient RGB-based temporal action segmentation by selectively integrating useful motion cues distilled from optical flow while suppressing cues that are misaligned with the action structure.
Abstract: Cross-modal knowledge distillation (CMKD) enables efficient inference by transferring knowledge from a teacher model trained on a computationally heavy modality (e.g., optical flow) to a student model operating on a lightweight modality (e.g., RGB). However, we find that most current CMKD methods are hindered by a key limitation when applied to temporal action segmentation: motion cues transferred from optical flow often lead the student to produce frame representations that are misaligned with the underlying action structure. To address this, we propose RELATE, an optical flow-free framework that selectively integrates transferred cues while suppressing misaligned ones. We further introduce a prediction refinement strategy that resolves ambiguous segments using multiple predictions. Experiments on three benchmarks with multiple segmenters show that RELATE consistently outperforms RGB-only baselines, approaches two-stream performance, and achieves up to 175× faster inference.
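For readers unfamiliar with CMKD, the generic training objective the abstract builds on can be sketched as a frame-wise task loss on the RGB student plus a term that pulls student features toward the frozen optical-flow teacher's features. This is a minimal illustration of plain feature-matching distillation only, not RELATE's selective integration or suppression mechanism, which the abstract does not specify. The function names, the choice of mean squared error as the matching term, and the weight `alpha` are all assumptions made for illustration.

```python
import math

def mse(a, b):
    """Mean squared error between two equally shaped T x D feature sequences."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count

def cross_entropy(logits, labels):
    """Mean frame-wise cross-entropy from raw logits (T x C) and integer labels (T)."""
    loss = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        loss += log_z - row[y]
    return loss / len(labels)

def cmkd_loss(student_feats, teacher_feats, student_logits, labels, alpha=0.5):
    """Hypothetical objective: task loss plus alpha-weighted feature matching.

    student_feats: features from the RGB student (T x D)
    teacher_feats: features from the frozen optical-flow teacher (T x D)
    alpha: assumed weight on the distillation term
    """
    task = cross_entropy(student_logits, labels)
    distill = mse(student_feats, teacher_feats)
    return task + alpha * distill
```

Because the teacher appears only in the training loss, the student needs RGB input alone at inference time, which is what makes the pipeline optical flow-free.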
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 83