M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing

Yan Liu, Li-Fang Wei, Xinyuan Qian, Tian-Hao Zhang, Song-Lu Chen, Xu-Cheng Yin

Published: 2024, Last Modified: 22 Jan 2026, Pattern Recognit. Lett. 2024, CC BY-SA 4.0

Abstract: Highlights
• Adding a video to guide the model in generating audio-visual synchronized speech.
• Introducing a key–value memory into the TTS model to connect the video and speech.
• Extracting style from video, M3TTS can generate high-quality expressive speech.
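
The key-value memory highlight can be illustrated with a generic attention-style memory read. This is a minimal sketch and not the M3TTS architecture: the abstract does not specify how the memory is built or addressed, so the function name `memory_read`, the scaled dot-product addressing, and all dimensions are assumptions made for illustration. Video-frame features act as queries, memory keys are matched against them, and the stored values (standing in for speech-side features) are returned as a weighted sum.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(video_feats, keys, values):
    """Hypothetical key-value memory read (illustrative only).

    video_feats: (T, d)   one query per video frame
    keys:        (S, d)   memory keys
    values:      (S, d_v) memory values, e.g. speech-side features
    Returns the recalled features (T, d_v) and the weights (T, S).
    """
    d = keys.shape[-1]
    scores = video_feats @ keys.T / np.sqrt(d)  # (T, S) similarity scores
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ values, weights

# Assumed toy sizes: 5 video frames, 8 memory slots.
rng = np.random.default_rng(0)
T, S, d, d_v = 5, 8, 16, 32
video_feats = rng.standard_normal((T, d))
keys = rng.standard_normal((S, d))
values = rng.standard_normal((S, d_v))

recalled, weights = memory_read(video_feats, keys, values)
print(recalled.shape, weights.shape)
```

In a design of this kind, the memory lets speech-side features be recalled from video alone at inference time, which is one plausible reading of "connect the video and speech"; the paper itself should be consulted for the actual mechanism.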