Abstract: Highlights
• Adding a video input to guide the model in generating audio-visually synchronized speech.
• Introducing a key–value memory into the TTS model to connect the video and speech.
• By extracting style from the video, M3TTS can generate high-quality expressive speech.