On the Analysis of the One-to-Many Mapping in Cross-Modality Text-to-Video Generation with Semantic Spaces

22 Apr 2026 (modified: 01 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: Despite recent advances in text-to-video generation, the role of the text and video latent spaces in learning a semantically shared representation remains underexplored. In this cross-modality generation task, most methods condition the video generation process by injecting the text representation into it, rather than exploiting the implicit shared knowledge between the modalities. However, feature-based alignment of the two modalities is not straightforward, especially in the one-to-many setting, where one text can be mapped to several valid, semantically aligned videos; this challenge commonly causes representation collapse during the alignment phase. In this work, we investigate and give insights into how both modalities behave in a shared semantic space in which each modality's representation is first learned in an unsupervised way. We approach this from a latent-space-learning perspective with a plug-and-play framework built on autoencoder-based models, which can also be used with other representations. We show that the one-to-many case requires alignment strategies different from those commonly used in the literature, which struggle to align both modalities in a semantically shared space.
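As a concrete, hypothetical illustration of the collapse described above (this is not code from the paper, and all tensor names are invented), the minimal PyTorch sketch below aligns a single learnable text latent to several equally valid video latents using a plain one-to-one MSE objective. Because the objective treats every video as the unique target for the same caption, its optimum is the centroid of the video latents, a point equally far from all valid targets.

```python
# Hypothetical sketch: why a one-to-one alignment objective collapses
# under a one-to-many text-to-video mapping. Latents are assumed to come
# from pretrained, frozen per-modality encoders (not modeled here).
import torch

# K = 3 valid videos for the SAME caption, each with a distinct 4-d latent.
video_latents = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0, 0.0],
                              [0.0, 0.0, 1.0, 0.0]])

# A single learnable text latent, aligned to all three videos with plain MSE.
text_latent = torch.zeros(4, requires_grad=True)
opt = torch.optim.SGD([text_latent], lr=0.5)

for _ in range(200):
    opt.zero_grad()
    # One-to-one style loss applied to a one-to-many pairing.
    loss = ((text_latent.unsqueeze(0) - video_latents) ** 2).mean()
    loss.backward()
    opt.step()

# The optimum is the centroid of the video latents: equally distant from
# every valid target, i.e. the representation collapse discussed above.
print(text_latent.detach())       # ~[0.333, 0.333, 0.333, 0.0]
print(video_latents.mean(dim=0))  # the same centroid
```

Running the sketch shows the text latent converging to the mean of the video latents rather than to any semantically meaningful one, which is why one-to-many alignment calls for strategies beyond the standard one-to-one objectives.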
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=6tigEAsbFw
Changes Since Last Submission: In our last submission, we addressed some of the reviewers' questions and suggestions. However, two points raised as particularly impactful for the paper's final decision could not be addressed within the available rebuttal time:
- the bucket loss proposed for the baseline method used in the analysis, which reviewers considered marginal compared to the other models evaluated;
- the concern that the problem addressed in the manuscript is only relevant to small-scale datasets.

With this in mind, we thoroughly revised the paper, demonstrating that the problem also exists in large-scale datasets and discussing the main reasons for this. We further identified that our proposed baseline loss had inadvertently been used in the training of the state-of-the-art (SOTA) models, which was one of the reasons the results were similar across models. We have isolated our loss from the SOTA experiments, ensuring that only the respective SOTA losses are used, as reflected in the newly updated experiments section. This ensures a proper and fair comparison with those models. Lastly, following one of reviewer CBUd's suggestions, we restructured our research questions into sections representing case studies, each directly demonstrating the problem it addresses along with the corresponding findings. We also added a visual example of the problem to facilitate understanding, as suggested by reviewer CBUd. Minor textual and typographical corrections have also been made since the last version.
Assigned Action Editor: ~Candace_Ross1
Submission Number: 8553