Keywords: video models, world models, uncertainty quantification
TL;DR: This paper introduces an uncertainty quantification method for video world models, utilizing latent modeling to decompose total uncertainty into its aleatoric and epistemic components.
Abstract: Generative video models demonstrate impressive text-to-video capabilities,
spurring widespread adoption in many real-world applications. However, like
large language models (LLMs), video generation models tend to hallucinate, producing
plausible videos even when they are factually wrong. Although uncertainty
quantification (UQ) of LLMs has been extensively studied in prior work, no UQ
method for video models exists, raising critical safety concerns. To our knowledge,
this paper represents the first work towards quantifying the uncertainty of
video models. We present a framework for uncertainty quantification of generative
video models, consisting of: (i) a metric for evaluating the calibration of video
models based on robust rank correlation estimation with no stringent modeling
assumptions; (ii) a black-box UQ method for video models (termed S-QUBED),
which leverages latent modeling to rigorously decompose predictive uncertainty
into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate
benchmarking calibration in video models, which will be released after the review
process. By conditioning the generation task in the latent space, we disentangle
uncertainty arising due to vague task specifications from that arising from lack
of knowledge. Through extensive experiments on benchmark video datasets, we
demonstrate that S-QUBED computes calibrated total uncertainty estimates that are
negatively correlated with task accuracy, and that it effectively computes the aleatoric
and epistemic constituents.
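As an illustrative sketch of the decomposition described in the abstract, one standard latent-variable identity, written with hypothetical notation (prompt x, latent task specification z, generated video y) that is assumed here and may differ from S-QUBED's exact formulation, is

\[
\underbrace{\mathcal{H}\big[p(y \mid x)\big]}_{\text{total uncertainty}}
\;=\;
\underbrace{\mathcal{I}\big(y;\, z \mid x\big)}_{\text{aleatoric: vague task specification}}
\;+\;
\underbrace{\mathbb{E}_{z \sim p(z \mid x)}\Big[\mathcal{H}\big[p(y \mid x, z)\big]\Big]}_{\text{epistemic: lack of knowledge}}
\]

where conditioning on the latent task z removes the uncertainty attributable to an under-specified prompt, leaving the residual conditional entropy as the model's lack of knowledge; the assignment of the two terms to the aleatoric and epistemic components follows the abstract's description and is an assumption, not the paper's verified definition.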
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18612