Abstract: Robustness to spatio-temporal sampling is important for point cloud video understanding. Previous works overlook this issue and usually suffer notable performance drops when point densities or frame rates change. To remedy this, we propose a point spatio-temporal pyramid (PoST-Py) that improves the sampling robustness of point cloud video modeling. Specifically, PoST-Py is a pluggable module that collects multi-scale feature maps from different layers of the backbone and integrates them into a unified representation, allowing the model to capture multi-scale spatio-temporal information simultaneously. In addition, we employ the temporal cardinality difference to enhance the features so that they capture motion information. Extensive experiments show that PoST-Py achieves state-of-the-art performance, with a notable improvement of over 2% under varying point sampling, demonstrating the improved robustness of our method.