Keywords: 4D Point Cloud; Spatio-Temporal Decoupling; Content-Aware Processing; Dynamic Adapter; Video Understanding; Action Recognition; Temporal-Biased Convolution; Spatio-Temporal Attention; Adaptive Feature Fusion; Point Cloud Video
TL;DR: We propose a content-aware 4D point cloud modeling framework that combines pre-embedding spatio-temporal decoupling with a dynamic adapter to intelligently allocate computation, achieving significant gains on MSR-Action3D and Synthia4D benchmarks.
Abstract: Understanding 4D point cloud videos is crucial for intelligent agents to perceive the dynamic changes in their external environment. However, due to the inter-frame temporal inconsistency and spatial disorder inherent in long-sequence point clouds, designing a unified 4D global model faces significant challenges. Existing methods primarily rely on static, monolithic network architectures that apply a uniform computational pipeline to all input data. This approach neglects the differences in spatio-temporal complexity across videos, resulting in inefficient resource allocation and limiting the model's performance. To address these issues, we present a novel content-aware 4D point cloud processing approach, termed DSTA4D, which leverages dynamic spatio-temporal decoupling via adaptive modules. We first propose decoupling temporal and spatial features within the embedding layer, which avoids the complexity of full-process long-term modeling. Second, we introduce an innovative lightweight module, the Dynamic Spatio-Temporal Adapter (DST-Adapter). This module dynamically generates gating weights based on the global spatio-temporal features of the input sequence and adaptively fuses features from three parallel streams: an identity path, a spatial enhancement path, and a temporal enhancement path. This content-aware mechanism allows the model to intelligently allocate its computational focus to the most critical feature dimensions. Our experiments on the mainstream benchmarks MSR-Action3D (\textbf{+5.23\%} accuracy), NTU RGB+D (\textbf{+1\%} accuracy), and Synthia4D (\textbf{+1.36\%} mIoU) show significant performance gains, offering a more efficient and intelligent adaptive modeling paradigm for point cloud video understanding.
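The gated three-stream fusion described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the general idea (gates predicted from a pooled global descriptor, weighting an identity path, a spatial enhancement path, and a temporal enhancement path); the pooling scheme, projection shapes, and nonlinearities here are illustrative assumptions, not the paper's actual DST-Adapter design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dst_adapter_sketch(x, w_gate, w_spatial, w_temporal):
    """Hypothetical sketch of a dynamic spatio-temporal adapter.

    x: (T, N, C) point cloud video features (T frames, N points, C channels).
    Gates are predicted from a global spatio-temporal descriptor and used to
    fuse three parallel streams. All weight shapes and operations are
    assumptions for illustration only.
    """
    # Global descriptor: average-pool over frames and points.
    g = x.mean(axis=(0, 1))                 # (C,)
    gates = softmax(g @ w_gate)             # (3,) one weight per stream

    identity = x
    # Spatial enhancement: per-point channel mixing within each frame.
    spatial = np.tanh(x @ w_spatial)        # (T, N, C)
    # Temporal enhancement: previous-frame difference, then projection.
    delta = x - np.roll(x, 1, axis=0)
    delta[0] = 0.0                          # no predecessor for frame 0
    temporal = np.tanh(delta @ w_temporal)  # (T, N, C)

    return gates[0] * identity + gates[1] * spatial + gates[2] * temporal

T, N, C = 4, 128, 16
x = rng.standard_normal((T, N, C))
out = dst_adapter_sketch(
    x,
    w_gate=rng.standard_normal((C, 3)),
    w_spatial=rng.standard_normal((C, C)) * 0.1,
    w_temporal=rng.standard_normal((C, C)) * 0.1,
)
print(out.shape)  # (4, 128, 16)
```

Because the gates depend on the pooled content of each input sequence, different videos receive different mixtures of the three streams, which is the content-aware behavior the abstract emphasizes.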
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1266