Keywords: Contactless respiratory monitoring, video time series, respiratory waveform reconstruction, latent diffusion, uncertainty estimation, health sensing
TL;DR: ViTaL-Diff reconstructs respiratory waveforms from upper-body video time series using video-conditioned latent diffusion and provides uncertainty estimates that flag unreliable clips.
Abstract: Contactless respiratory monitoring from video is a challenging inverse problem: breathing is observed only indirectly through subtle body and clothing motion, and that signal can be corrupted by illumination changes, occlusion, pose variation, and non-respiratory movement. We propose ViTaL-Diff, a video-token latent diffusion framework that reconstructs respiratory waveforms from short upper-body videos and requires no contact sensors at inference. ViTaL-Diff learns a compact respiratory latent space from belt-derived waveforms and trains a video-conditioned diffusion transformer to generate respiratory latent tokens from spatiotemporal video evidence. Because it models a distribution over plausible respiratory waveforms rather than a single point estimate, the method supports both respiratory-rate estimation and uncertainty quantification. Across three datasets (an in-house RGB dataset, AIR-125, and a low-light Sleep dataset), ViTaL-Diff achieves the lowest error among classical, deterministic, and diffusion baselines, reducing mean absolute error (MAE) by up to 28.9% relative to deterministic deep baselines, and its uncertainty estimates identify visually ambiguous clips.
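As an illustration of how a distribution over waveforms can yield both a respiratory rate and an uncertainty score, the following is a minimal sketch and not the ViTaL-Diff implementation. It assumes that several candidate waveforms have already been sampled for one clip, estimates a rate for each from its spectral peak, and reports the median rate together with the spread across samples. The function name `rate_and_uncertainty`, the 0.1 to 0.7 Hz breathing band, and the use of the standard deviation as the uncertainty measure are all assumptions made for this example.

```python
import numpy as np


def rate_and_uncertainty(samples, fs, band=(0.1, 0.7)):
    """Summarize sampled waveforms as a rate and a spread (hypothetical helper).

    samples: array of shape (n_samples, n_timesteps), one waveform per row.
    fs:      sampling frequency of the waveforms in Hz.
    band:    assumed plausible breathing band in Hz (6 to 42 breaths/min).
    Returns (median rate in breaths/min, std of per-sample rates).
    """
    samples = np.asarray(samples, dtype=float)
    # Remove each waveform's mean so the DC component does not dominate.
    centered = samples - samples.mean(axis=1, keepdims=True)

    freqs = np.fft.rfftfreq(samples.shape[1], d=1.0 / fs)
    power = np.abs(np.fft.rfft(centered, axis=1)) ** 2

    in_band = (freqs >= band[0]) & (freqs <= band[1])
    if not in_band.any():
        raise ValueError("no frequency bins fall inside the breathing band")

    # Per-sample rate: the in-band frequency with the highest spectral power.
    peak_idx = power[:, in_band].argmax(axis=1)
    rates_bpm = 60.0 * freqs[in_band][peak_idx]

    # Disagreement between samples serves as a simple uncertainty proxy.
    return float(np.median(rates_bpm)), float(rates_bpm.std())
```

Under this sketch, a clip whose sampled waveforms agree on a single rate gets a spread near zero, while a clip whose samples disagree gets a large spread and could be flagged for review. Any threshold on that spread would need to be chosen on validation data.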
Submission Number: 161