Keywords: Spatiotemporal modeling, Video modeling, Physical system modeling, Tridiagonal Toeplitz tensor, Long-range sequence modeling
TL;DR: We propose ConvT3, a ConvSSM with extended state kernels structured by tridiagonal Toeplitz tensors, enabling efficient linear-time training, stable parameterization, and state-of-the-art performance in video and PDE modeling.
Abstract: Modeling long spatiotemporal sequences requires capturing both complex spatial correlations and temporal dependencies.
Convolutional State Space Models (ConvSSMs) have been proposed to incorporate spatial modeling in State Space Models (SSMs) using the convolution of tensor-valued states and kernels.
However, existing implementations are restricted to $1\times 1$ state kernels for computational feasibility, which limits the modeling capacity of ConvSSMs.
We introduce a novel spatiotemporal model, ConvT3 (ConvSSM using Tridiagonal Toeplitz Tensors), designed to equivalently realize ConvSSMs with extended $3\times 3$ state kernels.
ConvT3 structures the state kernel so that its corresponding tensor is composed of a structured SSM matrix over the hidden state dimensions and a constrained tridiagonal Toeplitz tensor over the spatial dimensions.
We show that the structured tensor can be diagonalized, which enables efficient parallel training while leveraging $3\times 3$ state convolutions.
We demonstrate that ConvT3 effectively embeds rich spatial and temporal information into the dynamics of tensor-valued states, achieving state-of-the-art performance on most metrics in long-range video generation and physical system modeling.
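The diagonalization claim in the abstract can be illustrated with a one-dimensional analogue. The sketch below is not the ConvT3 implementation. It assumes a symmetric 3-tap kernel `[b, a, b]` with zero padding, for which the convolution is a tridiagonal Toeplitz matrix whose eigenvectors form a fixed sine basis and whose eigenvalues are known in closed form. All function names and parameter values are hypothetical, and the constraint ConvT3 actually uses may differ.

```python
import numpy as np

def tridiag_toeplitz(n, a, b):
    """Symmetric tridiagonal Toeplitz matrix: a on the diagonal, b on both off-diagonals."""
    return a * np.eye(n) + b * (np.eye(n, k=1) + np.eye(n, k=-1))

def sine_basis(n):
    """Orthonormal eigenvectors shared by every symmetric tridiagonal Toeplitz matrix of size n."""
    j = np.arange(1, n + 1)
    return np.sqrt(2.0 / (n + 1)) * np.sin(np.outer(j, j) * np.pi / (n + 1))

def eigenvalues(n, a, b):
    """Closed-form eigenvalues: a + 2 b cos(k pi / (n + 1)) for k = 1..n."""
    k = np.arange(1, n + 1)
    return a + 2.0 * b * np.cos(k * np.pi / (n + 1))

def apply_steps_conv(x, a, b, steps):
    """Sequential reference: apply the zero-padded 3-tap convolution `steps` times."""
    for _ in range(steps):
        x = np.convolve(x, [b, a, b], mode="same")
    return x

def apply_steps_diag(x, a, b, steps):
    """Same result in the eigenbasis: the matrix power becomes an elementwise power."""
    n = x.shape[0]
    V = sine_basis(n)
    lam = eigenvalues(n, a, b)
    return V @ (lam ** steps * (V.T @ x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16)
    y_conv = apply_steps_conv(x, 0.5, 0.2, 10)
    y_diag = apply_steps_diag(x, 0.5, 0.2, 10)
    print(np.allclose(y_conv, y_diag))
```

Because the eigenbasis does not depend on `a` or `b`, the recurrence decouples into independent scalar recurrences once the state is projected onto it. That decoupling is what makes parallel-scan style training possible for diagonal SSMs, and this toy example is meant to show only that mechanism in its simplest form.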
Primary Area: learning on time series and dynamical systems
Submission Number: 7602