Caseg: CLIP-Based Action Segmentation With Learnable Text Prompt

Published: 01 Jan 2024 · Last Modified: 03 Oct 2025 · ICIP 2024 · CC BY-SA 4.0
Abstract: Video action segmentation aims to identify and temporally localize actions in videos. Existing models achieve impressive performance with pre-extracted frame-level features, but this reliance can limit zero-shot learning and cross-dataset inference, especially for new actions or scenes. To overcome this problem, we propose a novel end-to-end network designed for robust performance in both familiar and novel action segmentation scenarios. Our approach combines a plug-and-play visual prompt module that enhances the temporal understanding of CLIP features with a learnable text prompt that enriches label semantics and refines the model's focus, significantly boosting performance. Our results demonstrate that CLIP features can assist action segmentation and that prompts improve task effectiveness. Furthermore, our findings show that CLIP features contain information that I3D features do not. We evaluate the proposed method on several video datasets, including Georgia Tech Egocentric Activities (GTEA), 50Salads, and Breakfast, and the results show that the proposed model outperforms existing state-of-the-art (SOTA) models.
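To make the abstract's two ingredients concrete, the sketch below illustrates one plausible way a learnable text prompt can be combined with frozen frame-level CLIP features for per-frame action classification. It is a minimal illustration, not the authors' implementation: the class names `LearnableTextPrompt` and `ClipActionSegmenter`, the context length, the 1-D convolution standing in for the visual prompt module, and the dummy label embeddings are all assumptions made for this example.

```python
# Minimal sketch (not the paper's code) of a learnable text prompt fused with
# frozen per-frame CLIP features for action segmentation. All module names and
# hyperparameters here are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableTextPrompt(nn.Module):
    """Fuses shared learnable context vectors with frozen label embeddings."""

    def __init__(self, class_name_embeds, ctx_len=8, embed_dim=512):
        super().__init__()
        # class_name_embeds: (num_classes, embed_dim), e.g. from a frozen CLIP text encoder.
        self.register_buffer("class_name_embeds", class_name_embeds)
        self.ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self):
        # Average the learnable context and add it to every class embedding.
        ctx = self.ctx.mean(dim=0, keepdim=True)               # (1, D)
        text_feats = self.proj(self.class_name_embeds + ctx)   # (C, D)
        return F.normalize(text_feats, dim=-1)


class ClipActionSegmenter(nn.Module):
    """Scores every frame against the prompt-enriched label embeddings."""

    def __init__(self, class_name_embeds, embed_dim=512):
        super().__init__()
        self.text_prompt = LearnableTextPrompt(class_name_embeds, embed_dim=embed_dim)
        # A lightweight temporal convolution over frozen frame-level CLIP features,
        # standing in for the paper's plug-and-play visual prompt module.
        self.temporal = nn.Conv1d(embed_dim, embed_dim, kernel_size=9, padding=4)
        self.logit_scale = nn.Parameter(torch.tensor(4.6))

    def forward(self, frame_feats):
        # frame_feats: (batch, time, embed_dim) pre-extracted CLIP image features.
        x = self.temporal(frame_feats.transpose(1, 2)).transpose(1, 2)
        x = F.normalize(x, dim=-1)                              # (B, T, D)
        text = self.text_prompt()                               # (C, D)
        # Per-frame class logits via scaled cosine similarity.
        return self.logit_scale.exp() * x @ text.t()            # (B, T, C)


if __name__ == "__main__":
    # Stand-in tensors: random label embeddings and random frame features.
    num_classes, dim, T = 11, 512, 300
    dummy_label_embeds = torch.randn(num_classes, dim)
    model = ClipActionSegmenter(dummy_label_embeds, embed_dim=dim)
    frame_feats = torch.randn(2, T, dim)
    logits = model(frame_feats)                                 # (2, T, num_classes)
    labels = torch.randint(0, num_classes, (2, T))
    loss = F.cross_entropy(logits.reshape(-1, num_classes), labels.reshape(-1))
    print(logits.shape, loss.item())
```

In this toy setup, only the context vectors, the projection, and the temporal layer are trained with a per-frame cross-entropy loss, while the CLIP-derived frame and label embeddings stay frozen; this mirrors, in simplified form, the idea of enriching label semantics and temporal context without fine-tuning the backbone.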