CASR: Refining Action Segmentation via marginalizing frame-level causal relationships

16 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: Causal Representation; Temporal Action Segmentation; Refiner; Causal Consistent Mapping
TL;DR: We propose a method for marginalizing noisy frame-level relationships so that the mapping between different causal models is consistent, and we introduce a Causal Abstraction Segmentation Refiner (CASR) to improve temporal action segmentation.
Abstract: The integration of deep learning and causal discovery has increased the need for causal relationships between frames as evidence for explainability in Temporal Action Segmentation (TAS). However, frame-level causal relationships exhibit noise outside the segment, making it infeasible to infer macro-level action relationships from frame relationships. To address this gap, we propose a method for marginalizing noisy frame-level relationships and introduce a Causal Abstraction Segmentation Refiner (CASR) to improve segmentation. Specifically, we retain all cross-segment relationships while discarding all inter-segment relationships in the frame-level model, which satisfies a consistent mapping of causal abstraction, in terms of action semantics, from frames to segments. Given the pre-segmentation produced by the backbone, we treat the whitening frame relationships within the same segment and across different segments of a video as positive and negative cases, respectively. Through contrastive learning, we identify whether each frame belongs to its corresponding segment, thereby improving segmentation performance. In addition, we propose a loss function, independent of the action segmentation engineer, to evaluate the causal interpretability of segmentation results. Extensive experiments on mainstream datasets indicate that our method not only significantly surpasses existing methods in action segmentation performance but also performs better in evaluating causal models. CASR can be plugged into various action segmentation engineers (MS-TCN++, ASRF, C2F-TCN, CETNet) with different backbones. This generality makes CASR an effective tool for boosting existing approaches to temporal action segmentation.
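As a rough illustration of the bookkeeping the abstract describes, the following sketch shows one way a backbone's pre-segmentation could be turned into same-segment (positive) and different-segment (negative) frame pairs, and how a frame-level relation matrix could be averaged into a segment-level one. It is not the authors' implementation. The function names (segment_ids, pair_masks, aggregate_to_segments), the use of NumPy, and block averaging as the aggregation rule are all assumptions made for illustration. The sketch also does not decide which relationships CASR retains or discards.

```python
import numpy as np


def segment_ids(frame_labels):
    """Give each frame a running segment index.

    A new segment starts whenever the predicted label changes, so two
    separate runs of the same action count as different segments.
    (Hypothetical helper, not taken from the paper.)
    """
    labels = np.asarray(frame_labels)
    if labels.size == 0:
        return np.zeros(0, dtype=int)
    change = np.concatenate(([0], (labels[1:] != labels[:-1]).astype(int)))
    return np.cumsum(change)


def pair_masks(frame_labels):
    """Return boolean T x T masks for positive and negative frame pairs."""
    seg = segment_ids(frame_labels)
    same = seg[:, None] == seg[None, :]
    np.fill_diagonal(same, False)  # a frame is not paired with itself
    diff = seg[:, None] != seg[None, :]
    return same, diff


def aggregate_to_segments(frame_rel, frame_labels):
    """Average a T x T frame-level relation matrix over segment blocks.

    Block averaging is an assumed aggregation rule; the result is S x S,
    where S is the number of segments in the pre-segmentation.
    """
    seg = segment_ids(frame_labels)
    n_seg = int(seg.max()) + 1
    onehot = np.eye(n_seg)[seg]                    # T x S membership matrix
    sums = onehot.T @ np.asarray(frame_rel, dtype=float) @ onehot
    counts = onehot.sum(axis=0)                    # frames per segment
    return sums / np.outer(counts, counts)


# Toy pre-segmentation: action 0, then action 1, then action 0 again.
labels = [0, 0, 1, 1, 1, 0]
positive, negative = pair_masks(labels)
segment_rel = aggregate_to_segments(np.ones((6, 6)), labels)
```

In a contrastive setup, the positive and negative masks would select which pairwise frame relationships are pulled together or pushed apart. The choice of loss is left open here because the abstract does not specify it.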
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 619