3.2 Mean Opinion Score Comparison

Flowtron achieves Mean Opinion Scores (MOS) comparable to state-of-the-art text-to-speech models. Here we provide a sample from Flowtron and a sample from Tacotron 2, both trained on the LJSpeech dataset.
LJSpeech Ground Truth | Flowtron | Tacotron 2


3.3.1 Sampling the Prior (Speech Variation)

With Flowtron we can control the amount of prosodic variation in speech by adjusting σ². Despite the variability added by increasing σ², all samples synthesized with Flowtron remain high-quality speech.
The three columns contain three separate samples, so that you can compare variation for each value of σ², and also compare with Tacotron 2 variation.
With Flowtron, we can create samples with highly varying prosody, which can make the voice much less monotonous.
Flowtron σ²=0
Flowtron σ²=0.5
Flowtron σ²=1
Tacotron 2 p=0.5
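
As a rough illustration of what adjusting σ² means, the sketch below draws z from a zero-mean spherical Gaussian with variance σ². It is not Flowtron's inference code: the name sample_prior and the shape (400 frames by 80 mel channels) are assumptions made for this example, and the step that maps z to a mel-spectrogram is omitted.

```python
import numpy as np

def sample_prior(n_frames, n_mel_channels, sigma_sq, seed=0):
    """Draw z ~ N(0, sigma_sq * I) with shape (n_frames, n_mel_channels)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_frames, n_mel_channels))
    # Scaling unit-variance noise by sqrt(sigma_sq) yields variance sigma_sq.
    return np.sqrt(sigma_sq) * noise

# Hypothetical shapes: 400 frames, 80 mel channels.
z_flat = sample_prior(400, 80, sigma_sq=0.0)  # all zeros, no sampled variation
z_mid = sample_prior(400, 80, sigma_sq=0.5)
z_wide = sample_prior(400, 80, sigma_sq=1.0)
```

With σ²=0 every draw collapses to the same z, which matches the intuition that those samples vary the least, while larger σ² spreads the draws further from the mean.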


3.3.2 Sampling the Prior (Interpolation between samples)

Flowtron model with speaker embeddings. We interpolate between two random z-vectors with the speaker Sally and the phrase "It is well known that deep generative models have a rich latent space".
1/100 | 30/100 | 60/100 | 100/100
Flowtron same speaker

Flowtron model without speaker embeddings. We interpolate between z-vectors producing speech from Sally and Helen with the phrase "We are testing this model".
100% Helen | 66% Helen, 33% Sally | 33% Helen, 66% Sally | 100% Sally
Flowtron different speakers
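
The interpolations in this section can be pictured with the following sketch, which assumes plain linear interpolation between two z-vectors. The helper interpolate_z and the array shapes are hypothetical, and random arrays stand in for the actual z values.

```python
import numpy as np

def interpolate_z(z_start, z_end, n_steps):
    """Return n_steps points on the straight line from z_start to z_end."""
    weights = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - w) * z_start + w * z_end for w in weights]

rng = np.random.default_rng(0)
z_a = rng.standard_normal((400, 80))  # stand-in for the first z-vector
z_b = rng.standard_normal((400, 80))  # stand-in for the second z-vector

# 100 steps, as in the 1/100 ... 100/100 same-speaker samples.
same_speaker_path = interpolate_z(z_a, z_b, n_steps=100)

# 4 steps give weights 0, 1/3, 2/3, 1, as in the Helen-to-Sally samples.
speaker_mix_path = interpolate_z(z_a, z_b, n_steps=4)
```

Each element of the returned list would then be decoded to audio separately, so neighbouring steps differ only slightly in z.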


3.4.1 Sampling the Posterior (Seen speaker)

We showcase Flowtron's ability to modify a speaker's style over time by gradually making a monotone speaker more expressive. This is done by interpolating between a standard Gaussian prior and a region of Flowtron's z-space associated with more expressivity.
Flowtron Style Transfer over time


We showcase Flowtron's ability to transfer acoustic characteristics over time by gradually making our baseline speaker sound more like the target style. This is done by interpolating between a standard Gaussian prior and a posterior associated with the target style.
Flowtron Style Transfer over time


Audio samples accompanying the Expressive style transfer experiments.
Flowtron Posterior
Flowtron Baseline
Tacotron GST


Audio samples accompanying the High Pitch style transfer experiments.
Flowtron Posterior
Flowtron Baseline
Tacotron GST


3.4.2 Sampling the Posterior (Unseen speaker style)

We modify a speaker's style by using data from the same speaker but from a style not seen during training. Flowtron succeeds in transferring the somber style and the long pauses associated with the narrative style.
Flowtron baseline
Style
Flowtron Style Transfer
Tacotron GST Style Transfer


3.4.3 Sampling the Posterior (Unseen speaker)

We transfer the style from speaker ID 03 from RAVDESS and the label "surprised" to Sally. Flowtron is able to make Sally sound surprised, which is drastically different from the monotonous baseline.
Style
Flowtron Posterior
Flowtron Baseline
Tacotron GST

3.5 Interpolation between styles (Prior and Posterior)

Flowtron model with speaker embeddings trained on LibriTTS. We interpolate between a spherical Gaussian prior and a posterior computed on evidence from Sally's Born of Darkness. We evaluate different values of λ.
We call the listener's attention to the interpolation of non-textual characteristics that are hard to compute but easy to perceive, and to the gradual transition from one speaking style to another.
Flowtron Baseline | Flowtron λ = 2 | Flowtron λ = 1 | Flowtron λ = 0.666 | Flowtron λ = 0.1 | Target Style
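
One way to read the role of λ is as the variance of the likelihood in a conjugate Gaussian update against a unit-variance prior. The sketch below works under that assumption only; the function gaussian_posterior and the toy evidence array are hypothetical and are not taken from the released code.

```python
import numpy as np

def gaussian_posterior(evidence, lam):
    """Posterior mean and variance for a N(0, I) prior.

    evidence: array of shape (n, dim) holding n observed z-vectors.
    lam: assumed likelihood variance; larger values trust the prior more.
    """
    n = evidence.shape[0]
    mean_evidence = evidence.mean(axis=0)
    precision = n / lam + 1.0  # likelihood precision plus prior precision
    mu = (n / lam) * mean_evidence / precision
    var = np.full(mean_evidence.shape, 1.0 / precision)
    return mu, var

# Toy evidence: 4 z-vectors of dimension 3, all equal to 1.
evidence = np.ones((4, 3))
mu_far, var_far = gaussian_posterior(evidence, lam=2.0)    # mean 2/3
mu_near, var_near = gaussian_posterior(evidence, lam=0.1)  # mean 40/41
```

Under this reading, decreasing λ pulls the posterior mean away from the prior mean (zero) and toward the mean of the evidence, which agrees with the ordering of the samples from the baseline to the target style.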

3.6.2 Sampling the Gaussian Mixture (Translating dimensions)

We select a single component from the Gaussian mixture and translate a dimension associated with pitch. Although the samples have different pitch contours, they have similar durations.
μ (a-flat) | μ - 2σ (c) | μ - 4σ (e-flat)


We select a single component from the Gaussian mixture and translate a dimension associated with speech rate. Although the samples have different speech rates, they have similar pitch contours.
μ | μ - 2σ | μ - 4σ
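
Translating a dimension can be sketched as shifting one coordinate of the selected component's mean by a multiple of that coordinate's standard deviation, leaving every other coordinate untouched. The helper translate_dimension, the 80-dimensional component and the index chosen for the pitch-related dimension are all hypothetical.

```python
import numpy as np

def translate_dimension(mu, sigma, dim, n_sigmas):
    """Return a copy of mu with coordinate dim shifted by n_sigmas * sigma[dim]."""
    shifted = np.array(mu, dtype=float, copy=True)
    shifted[dim] += n_sigmas * sigma[dim]
    return shifted

# Toy mixture component: zero mean, unit standard deviation in every dimension.
mu = np.zeros(80)
sigma = np.ones(80)
pitch_dim = 3  # hypothetical index of a dimension associated with pitch

# The three settings used in the samples: mu, mu - 2 sigma, mu - 4 sigma.
variants = [translate_dimension(mu, sigma, pitch_dim, k) for k in (0.0, -2.0, -4.0)]
```

Because only one coordinate moves, attributes tied to the other dimensions are expected to stay roughly constant across the variants.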


Extra Flowtron samples

To reverb or not to reverb
Queen's accent
Last Halloween... This Corona year...