M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing

Published: 2024, Last Modified: 22 Jan 2026. Pattern Recognit. Lett. 2024. License: CC BY-SA 4.0
Abstract: Highlights
• Adding a video to guide the model in generating audio-visual synchronized speech.
• Introducing a key–value memory into the TTS model to connect the video and speech.
• Extracting style from video, M3TTS can generate high-quality expressive speech.
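
The key–value memory named in the highlights can be pictured with a generic attention-style memory read. The sketch below is only an illustration under stated assumptions, not the M3TTS implementation: the highlights do not specify how the memory is built or addressed, so the function names, the scaled dot-product scoring, and the idea that keys live in the video feature space while values live in the speech feature space are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query, keys, values):
    """Read from a key-value memory with scaled dot-product addressing.

    Assumption (not from the source): `query` is a video-side feature,
    `keys` are memory slots in the same space as the query, and `values`
    are the paired speech-side features stored in those slots.
    """
    scale = math.sqrt(len(query))
    # Similarity between the query and every memory key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Addressing weights: one non-negative weight per slot, summing to 1.
    weights = softmax(scores)
    # The read vector is the weighted average of the stored values.
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return read, weights


# Toy memory with two slots (hypothetical numbers).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
read, weights = memory_read([5.0, 0.0], keys, values)
```

In this toy call the query aligns with the first key, so most of the addressing weight falls on the first slot and the read vector lies close to the first stored value.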