Flowtron has Mean Opinion Scores (MOS) comparable to state-of-the-art text-to-speech models. Here we provide samples from Flowtron and Tacotron 2, both trained on the LJSpeech dataset, alongside the ground truth.
LJSpeech Ground Truth
Flowtron
Tacotron 2
3.3.1 Sampling the Prior (Speech Variation)
With Flowtron we can control the amount of prosodic variation in speech by adjusting σ². Despite the variability added by increasing σ², all the samples synthesized with Flowtron remain high-quality speech.
The three columns contain three separate samples, so that you can compare variation for each value of σ², and also compare with Tacotron 2 variation.
With Flowtron, we can create samples with highly varying prosody, which can make the voice much less monotonous.
Flowtron σ²=0
Flowtron σ²=0.5
Flowtron σ²=1
Tacotron 2 p=0.5
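The effect of σ² described above can be sketched with a zero-mean Gaussian prior whose variance is σ². This is a minimal illustration, not Flowtron's code: the function name `sample_prior`, the shape (frames by 80 dimensions), and the use of plain Python lists in place of tensors are all assumptions made for the example.

```python
import math
import random


def sample_prior(n_frames, n_dims, sigma_squared, seed=None):
    """Draw z ~ N(0, sigma_squared * I) as n_frames rows of n_dims values."""
    if sigma_squared < 0:
        raise ValueError("sigma_squared must be non-negative")
    rng = random.Random(seed)
    sigma = math.sqrt(sigma_squared)
    # Scale standard normal noise by sigma; sigma = 0 gives an all-zero z.
    return [[sigma * rng.gauss(0.0, 1.0) for _ in range(n_dims)]
            for _ in range(n_frames)]


if __name__ == "__main__":
    # Hypothetical sizes: 200 frames, 80 dimensions per frame.
    for s2 in (0.0, 0.5, 1.0):
        z = sample_prior(n_frames=200, n_dims=80, sigma_squared=s2, seed=0)
        flat = [v for row in z for v in row]
        print(s2, sum(v * v for v in flat) / len(flat))  # empirical variance
```

In this sketch, σ²=0 collapses every draw to the same all-zero z, so repeated samples would not differ, while larger σ² spreads the draws further from the mean.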
3.3.2 Sampling the Prior (Interpolation between samples)
Flowtron model with speaker embeddings. We interpolate between two random z-vectors with the speaker Sally and the phrase "It is well known that deep generative models have a rich latent space".
1/100
30/100
60/100
100/100
Flowtron same speaker
Flowtron model without speaker embeddings. We interpolate between z-vectors producing speech from Sally and Helen with the phrase "We are testing this model".
100% Helen
66% Helen 33% Sally
33% Helen 66% Sally
100% Sally
Flowtron different speakers
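The interpolations in this section can be illustrated with a simple linear blend between two z-vectors. The page does not say which interpolation scheme was used, so the linear form below is an assumption, and `lerp` and `interpolation_path` are hypothetical helper names operating on flat Python lists.

```python
def lerp(z_a, z_b, t):
    """Linear blend of two equal-length vectors: (1 - t) * z_a + t * z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("z_a and z_b must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, n_steps):
    """Return n_steps vectors moving evenly from z_a (first) to z_b (last)."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    return [lerp(z_a, z_b, i / (n_steps - 1)) for i in range(n_steps)]


if __name__ == "__main__":
    z_helen = [0.0, 2.0]   # toy stand-ins for real z-vectors
    z_sally = [1.0, -2.0]
    for z in interpolation_path(z_helen, z_sally, n_steps=4):
        print(z)
```

With four steps, the path corresponds to weights of 100%, 66%, 33%, and 0% on the first vector, which mirrors the Helen-to-Sally columns above.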
3.4.1 Sampling the Posterior (Seen speaker)
We showcase Flowtron's ability to modify a speaker's style over time by gradually making a monotone speaker more expressive. This is done by interpolating between a standard Gaussian prior and a region of Flowtron's z-space associated with more expressivity.
Flowtron Style Transfer over time
We showcase Flowtron's ability to transfer acoustic characteristics over time by gradually making our baseline speaker sound more like the target style. This is done by interpolating between a standard Gaussian prior and a posterior associated with the target style.
Flowtron Style Transfer over time
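One way to picture a style change over time is to let the sampling distribution drift frame by frame from the standard Gaussian prior toward a target Gaussian. The sketch below is only an illustration: the linear ramp, the diagonal target distribution, and the names `ramp_parameters` and `sample_over_time` are assumptions, since the page does not describe the schedule that was actually used.

```python
import random


def ramp_parameters(mu_style, sigma_style, n_frames):
    """Per-frame (mean, std) moving linearly from N(0, 1) to the target."""
    params = []
    for t in range(n_frames):
        # Weight on the target style: 0 at the first frame, 1 at the last.
        w = t / (n_frames - 1) if n_frames > 1 else 1.0
        mean = [w * m for m in mu_style]
        std = [(1.0 - w) * 1.0 + w * s for s in sigma_style]
        params.append((mean, std))
    return params


def sample_over_time(mu_style, sigma_style, n_frames, seed=None):
    """Draw one z-vector per frame from the ramped distributions."""
    rng = random.Random(seed)
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
            for mean, std in ramp_parameters(mu_style, sigma_style, n_frames)]


if __name__ == "__main__":
    # Toy two-dimensional target; real z-vectors are much larger.
    for z in sample_over_time([2.0, -1.0], [0.5, 0.5], n_frames=5, seed=0):
        print(z)
```

Early frames are drawn close to the prior and late frames close to the target, so the synthesized style would shift gradually rather than abruptly.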
Audio samples accompanying the Expressive style transfer experiments.
Flowtron Posterior
posterior expressive ratio4 seq0 sample0
posterior expressive ratio4 seq0 sample1
posterior expressive ratio4 seq0 sample2
posterior expressive ratio4 seq0 sample3
posterior expressive ratio4 seq0 sample4
posterior expressive ratio4 seq0 sample5
posterior expressive ratio4 seq0 sample6
posterior expressive ratio4 seq0 sample7
posterior expressive ratio4 seq0 sample8
posterior expressive ratio4 seq0 sample9
posterior expressive ratio4 seq1 sample0
posterior expressive ratio4 seq1 sample1
posterior expressive ratio4 seq1 sample2
posterior expressive ratio4 seq1 sample3
posterior expressive ratio4 seq1 sample4
posterior expressive ratio4 seq1 sample5
posterior expressive ratio4 seq1 sample6
posterior expressive ratio4 seq1 sample7
posterior expressive ratio4 seq1 sample8
posterior expressive ratio4 seq1 sample9
posterior expressive ratio4 seq2 sample0
posterior expressive ratio4 seq2 sample1
posterior expressive ratio4 seq2 sample2
posterior expressive ratio4 seq2 sample3
posterior expressive ratio4 seq2 sample4
posterior expressive ratio4 seq2 sample5
posterior expressive ratio4 seq2 sample6
posterior expressive ratio4 seq2 sample7
posterior expressive ratio4 seq2 sample8
posterior expressive ratio4 seq2 sample9
posterior expressive ratio4 seq3 sample0
posterior expressive ratio4 seq3 sample1
posterior expressive ratio4 seq3 sample2
posterior expressive ratio4 seq3 sample3
posterior expressive ratio4 seq3 sample4
posterior expressive ratio4 seq3 sample5
posterior expressive ratio4 seq3 sample6
posterior expressive ratio4 seq3 sample7
posterior expressive ratio4 seq3 sample8
posterior expressive ratio4 seq3 sample9
posterior expressive ratio4 seq4 sample0
posterior expressive ratio4 seq4 sample1
posterior expressive ratio4 seq4 sample2
posterior expressive ratio4 seq4 sample3
posterior expressive ratio4 seq4 sample4
posterior expressive ratio4 seq4 sample5
posterior expressive ratio4 seq4 sample6
posterior expressive ratio4 seq4 sample7
posterior expressive ratio4 seq4 sample8
posterior expressive ratio4 seq4 sample9
Flowtron Baseline
baseline expressive ratio4 seq0 sample0
baseline expressive ratio4 seq0 sample1
baseline expressive ratio4 seq0 sample2
baseline expressive ratio4 seq0 sample3
baseline expressive ratio4 seq0 sample4
baseline expressive ratio4 seq0 sample5
baseline expressive ratio4 seq0 sample6
baseline expressive ratio4 seq0 sample7
baseline expressive ratio4 seq0 sample8
baseline expressive ratio4 seq0 sample9
baseline expressive ratio4 seq1 sample0
baseline expressive ratio4 seq1 sample1
baseline expressive ratio4 seq1 sample2
baseline expressive ratio4 seq1 sample3
baseline expressive ratio4 seq1 sample4
baseline expressive ratio4 seq1 sample5
baseline expressive ratio4 seq1 sample6
baseline expressive ratio4 seq1 sample7
baseline expressive ratio4 seq1 sample8
baseline expressive ratio4 seq1 sample9
baseline expressive ratio4 seq2 sample0
baseline expressive ratio4 seq2 sample1
baseline expressive ratio4 seq2 sample2
baseline expressive ratio4 seq2 sample3
baseline expressive ratio4 seq2 sample4
baseline expressive ratio4 seq2 sample5
baseline expressive ratio4 seq2 sample6
baseline expressive ratio4 seq2 sample7
baseline expressive ratio4 seq2 sample8
baseline expressive ratio4 seq2 sample9
baseline expressive ratio4 seq3 sample0
baseline expressive ratio4 seq3 sample1
baseline expressive ratio4 seq3 sample2
baseline expressive ratio4 seq3 sample3
baseline expressive ratio4 seq3 sample4
baseline expressive ratio4 seq3 sample5
baseline expressive ratio4 seq3 sample6
baseline expressive ratio4 seq3 sample7
baseline expressive ratio4 seq3 sample8
baseline expressive ratio4 seq3 sample9
baseline expressive ratio4 seq4 sample0
baseline expressive ratio4 seq4 sample1
baseline expressive ratio4 seq4 sample2
baseline expressive ratio4 seq4 sample3
baseline expressive ratio4 seq4 sample4
baseline expressive ratio4 seq4 sample5
baseline expressive ratio4 seq4 sample6
baseline expressive ratio4 seq4 sample7
baseline expressive ratio4 seq4 sample8
baseline expressive ratio4 seq4 sample9
Tacotron GST
gst expressive seq0 sample0
gst expressive seq0 sample1
gst expressive seq0 sample2
gst expressive seq0 sample3
gst expressive seq0 sample4
gst expressive seq0 sample5
gst expressive seq0 sample6
gst expressive seq0 sample7
gst expressive seq0 sample8
gst expressive seq0 sample9
gst expressive seq1 sample0
gst expressive seq1 sample1
gst expressive seq1 sample2
gst expressive seq1 sample3
gst expressive seq1 sample4
gst expressive seq1 sample5
gst expressive seq1 sample6
gst expressive seq1 sample7
gst expressive seq1 sample8
gst expressive seq1 sample9
gst expressive seq2 sample0
gst expressive seq2 sample1
gst expressive seq2 sample2
gst expressive seq2 sample3
gst expressive seq2 sample4
gst expressive seq2 sample5
gst expressive seq2 sample6
gst expressive seq2 sample7
gst expressive seq2 sample8
gst expressive seq2 sample9
gst expressive seq3 sample0
gst expressive seq3 sample1
gst expressive seq3 sample2
gst expressive seq3 sample3
gst expressive seq3 sample4
gst expressive seq3 sample5
gst expressive seq3 sample6
gst expressive seq3 sample7
gst expressive seq3 sample8
gst expressive seq3 sample9
gst expressive seq4 sample0
gst expressive seq4 sample1
gst expressive seq4 sample2
gst expressive seq4 sample3
gst expressive seq4 sample4
gst expressive seq4 sample5
gst expressive seq4 sample6
gst expressive seq4 sample7
gst expressive seq4 sample8
gst expressive seq4 sample9
Audio samples accompanying the High Pitch style transfer experiments.
Flowtron Posterior
posterior highf0 ratio1 seq0 sample0
posterior highf0 ratio1 seq0 sample1
posterior highf0 ratio1 seq0 sample2
posterior highf0 ratio1 seq0 sample3
posterior highf0 ratio1 seq0 sample4
posterior highf0 ratio1 seq0 sample5
posterior highf0 ratio1 seq0 sample6
posterior highf0 ratio1 seq0 sample7
posterior highf0 ratio1 seq0 sample8
posterior highf0 ratio1 seq0 sample9
posterior highf0 ratio1 seq1 sample0
posterior highf0 ratio1 seq1 sample1
posterior highf0 ratio1 seq1 sample2
posterior highf0 ratio1 seq1 sample3
posterior highf0 ratio1 seq1 sample4
posterior highf0 ratio1 seq1 sample5
posterior highf0 ratio1 seq1 sample6
posterior highf0 ratio1 seq1 sample7
posterior highf0 ratio1 seq1 sample8
posterior highf0 ratio1 seq1 sample9
posterior highf0 ratio1 seq2 sample0
posterior highf0 ratio1 seq2 sample1
posterior highf0 ratio1 seq2 sample2
posterior highf0 ratio1 seq2 sample3
posterior highf0 ratio1 seq2 sample4
posterior highf0 ratio1 seq2 sample5
posterior highf0 ratio1 seq2 sample6
posterior highf0 ratio1 seq2 sample7
posterior highf0 ratio1 seq2 sample8
posterior highf0 ratio1 seq2 sample9
posterior highf0 ratio1 seq3 sample0
posterior highf0 ratio1 seq3 sample1
posterior highf0 ratio1 seq3 sample2
posterior highf0 ratio1 seq3 sample3
posterior highf0 ratio1 seq3 sample4
posterior highf0 ratio1 seq3 sample5
posterior highf0 ratio1 seq3 sample6
posterior highf0 ratio1 seq3 sample7
posterior highf0 ratio1 seq3 sample8
posterior highf0 ratio1 seq3 sample9
posterior highf0 ratio1 seq4 sample0
posterior highf0 ratio1 seq4 sample1
posterior highf0 ratio1 seq4 sample2
posterior highf0 ratio1 seq4 sample3
posterior highf0 ratio1 seq4 sample4
posterior highf0 ratio1 seq4 sample5
posterior highf0 ratio1 seq4 sample6
posterior highf0 ratio1 seq4 sample7
posterior highf0 ratio1 seq4 sample8
posterior highf0 ratio1 seq4 sample9
Flowtron Baseline
baseline highf0 ratio1 seq0 sample0
baseline highf0 ratio1 seq0 sample1
baseline highf0 ratio1 seq0 sample2
baseline highf0 ratio1 seq0 sample3
baseline highf0 ratio1 seq0 sample4
baseline highf0 ratio1 seq0 sample5
baseline highf0 ratio1 seq0 sample6
baseline highf0 ratio1 seq0 sample7
baseline highf0 ratio1 seq0 sample8
baseline highf0 ratio1 seq0 sample9
baseline highf0 ratio1 seq1 sample0
baseline highf0 ratio1 seq1 sample1
baseline highf0 ratio1 seq1 sample2
baseline highf0 ratio1 seq1 sample3
baseline highf0 ratio1 seq1 sample4
baseline highf0 ratio1 seq1 sample5
baseline highf0 ratio1 seq1 sample6
baseline highf0 ratio1 seq1 sample7
baseline highf0 ratio1 seq1 sample8
baseline highf0 ratio1 seq1 sample9
baseline highf0 ratio1 seq2 sample0
baseline highf0 ratio1 seq2 sample1
baseline highf0 ratio1 seq2 sample2
baseline highf0 ratio1 seq2 sample3
baseline highf0 ratio1 seq2 sample4
baseline highf0 ratio1 seq2 sample5
baseline highf0 ratio1 seq2 sample6
baseline highf0 ratio1 seq2 sample7
baseline highf0 ratio1 seq2 sample8
baseline highf0 ratio1 seq2 sample9
baseline highf0 ratio1 seq3 sample0
baseline highf0 ratio1 seq3 sample1
baseline highf0 ratio1 seq3 sample2
baseline highf0 ratio1 seq3 sample3
baseline highf0 ratio1 seq3 sample4
baseline highf0 ratio1 seq3 sample5
baseline highf0 ratio1 seq3 sample6
baseline highf0 ratio1 seq3 sample7
baseline highf0 ratio1 seq3 sample8
baseline highf0 ratio1 seq3 sample9
baseline highf0 ratio1 seq4 sample0
baseline highf0 ratio1 seq4 sample1
baseline highf0 ratio1 seq4 sample2
baseline highf0 ratio1 seq4 sample3
baseline highf0 ratio1 seq4 sample4
baseline highf0 ratio1 seq4 sample5
baseline highf0 ratio1 seq4 sample6
baseline highf0 ratio1 seq4 sample7
baseline highf0 ratio1 seq4 sample8
baseline highf0 ratio1 seq4 sample9
Tacotron GST
gst highf0 seq0 sample0
gst highf0 seq0 sample1
gst highf0 seq0 sample2
gst highf0 seq0 sample3
gst highf0 seq0 sample4
gst highf0 seq0 sample5
gst highf0 seq0 sample6
gst highf0 seq0 sample7
gst highf0 seq0 sample8
gst highf0 seq0 sample9
gst highf0 seq1 sample0
gst highf0 seq1 sample1
gst highf0 seq1 sample2
gst highf0 seq1 sample3
gst highf0 seq1 sample4
gst highf0 seq1 sample5
gst highf0 seq1 sample6
gst highf0 seq1 sample7
gst highf0 seq1 sample8
gst highf0 seq1 sample9
gst highf0 seq2 sample0
gst highf0 seq2 sample1
gst highf0 seq2 sample2
gst highf0 seq2 sample3
gst highf0 seq2 sample4
gst highf0 seq2 sample5
gst highf0 seq2 sample6
gst highf0 seq2 sample7
gst highf0 seq2 sample8
gst highf0 seq2 sample9
gst highf0 seq3 sample0
gst highf0 seq3 sample1
gst highf0 seq3 sample2
gst highf0 seq3 sample3
gst highf0 seq3 sample4
gst highf0 seq3 sample5
gst highf0 seq3 sample6
gst highf0 seq3 sample7
gst highf0 seq3 sample8
gst highf0 seq3 sample9
gst highf0 seq4 sample0
gst highf0 seq4 sample1
gst highf0 seq4 sample2
gst highf0 seq4 sample3
gst highf0 seq4 sample4
gst highf0 seq4 sample5
gst highf0 seq4 sample6
gst highf0 seq4 sample7
gst highf0 seq4 sample8
gst highf0 seq4 sample9
3.4.2 Sampling the Posterior (Unseen speaker style)
We modify a speaker's style by using data from the same speaker but from a style not seen during training. Flowtron succeeds in transferring the somber style and the long pauses associated with the narrative style.
Flowtron baseline
Style
Flowtron Style Transfer
Tacotron GST Style Transfer
3.4.3 Sampling the Posterior (Unseen speaker)
We transfer the style from speaker ID 03 from RAVDESS and the label "surprised" to Sally. Flowtron is able to make Sally sound surprised, which is drastically different from the monotonous baseline.
Style
Flowtron Posterior
posterior ravdess ratio1 seq0 sample0
posterior ravdess ratio1 seq0 sample1
posterior ravdess ratio1 seq0 sample2
posterior ravdess ratio1 seq0 sample3
posterior ravdess ratio1 seq0 sample4
posterior ravdess ratio1 seq0 sample5
posterior ravdess ratio1 seq0 sample6
posterior ravdess ratio1 seq0 sample7
posterior ravdess ratio1 seq0 sample8
posterior ravdess ratio1 seq0 sample9
posterior ravdess ratio1 seq1 sample0
posterior ravdess ratio1 seq1 sample1
posterior ravdess ratio1 seq1 sample2
posterior ravdess ratio1 seq1 sample3
posterior ravdess ratio1 seq1 sample4
posterior ravdess ratio1 seq1 sample5
posterior ravdess ratio1 seq1 sample6
posterior ravdess ratio1 seq1 sample7
posterior ravdess ratio1 seq1 sample8
posterior ravdess ratio1 seq1 sample9
posterior ravdess ratio1 seq2 sample0
posterior ravdess ratio1 seq2 sample1
posterior ravdess ratio1 seq2 sample2
posterior ravdess ratio1 seq2 sample3
posterior ravdess ratio1 seq2 sample4
posterior ravdess ratio1 seq2 sample5
posterior ravdess ratio1 seq2 sample6
posterior ravdess ratio1 seq2 sample7
posterior ravdess ratio1 seq2 sample8
posterior ravdess ratio1 seq2 sample9
posterior ravdess ratio1 seq3 sample0
posterior ravdess ratio1 seq3 sample1
posterior ravdess ratio1 seq3 sample2
posterior ravdess ratio1 seq3 sample3
posterior ravdess ratio1 seq3 sample4
posterior ravdess ratio1 seq3 sample5
posterior ravdess ratio1 seq3 sample6
posterior ravdess ratio1 seq3 sample7
posterior ravdess ratio1 seq3 sample8
posterior ravdess ratio1 seq3 sample9
posterior ravdess ratio1 seq4 sample0
posterior ravdess ratio1 seq4 sample1
posterior ravdess ratio1 seq4 sample2
posterior ravdess ratio1 seq4 sample3
posterior ravdess ratio1 seq4 sample4
posterior ravdess ratio1 seq4 sample5
posterior ravdess ratio1 seq4 sample6
posterior ravdess ratio1 seq4 sample7
posterior ravdess ratio1 seq4 sample8
posterior ravdess ratio1 seq4 sample9
Flowtron Baseline
baseline ravdess ratio1 seq0 sample0
baseline ravdess ratio1 seq0 sample1
baseline ravdess ratio1 seq0 sample2
baseline ravdess ratio1 seq0 sample3
baseline ravdess ratio1 seq0 sample4
baseline ravdess ratio1 seq0 sample5
baseline ravdess ratio1 seq0 sample6
baseline ravdess ratio1 seq0 sample7
baseline ravdess ratio1 seq0 sample8
baseline ravdess ratio1 seq0 sample9
baseline ravdess ratio1 seq1 sample0
baseline ravdess ratio1 seq1 sample1
baseline ravdess ratio1 seq1 sample2
baseline ravdess ratio1 seq1 sample3
baseline ravdess ratio1 seq1 sample4
baseline ravdess ratio1 seq1 sample5
baseline ravdess ratio1 seq1 sample6
baseline ravdess ratio1 seq1 sample7
baseline ravdess ratio1 seq1 sample8
baseline ravdess ratio1 seq1 sample9
baseline ravdess ratio1 seq2 sample0
baseline ravdess ratio1 seq2 sample1
baseline ravdess ratio1 seq2 sample2
baseline ravdess ratio1 seq2 sample3
baseline ravdess ratio1 seq2 sample4
baseline ravdess ratio1 seq2 sample5
baseline ravdess ratio1 seq2 sample6
baseline ravdess ratio1 seq2 sample7
baseline ravdess ratio1 seq2 sample8
baseline ravdess ratio1 seq2 sample9
baseline ravdess ratio1 seq3 sample0
baseline ravdess ratio1 seq3 sample1
baseline ravdess ratio1 seq3 sample2
baseline ravdess ratio1 seq3 sample3
baseline ravdess ratio1 seq3 sample4
baseline ravdess ratio1 seq3 sample5
baseline ravdess ratio1 seq3 sample6
baseline ravdess ratio1 seq3 sample7
baseline ravdess ratio1 seq3 sample8
baseline ravdess ratio1 seq3 sample9
baseline ravdess ratio1 seq4 sample0
baseline ravdess ratio1 seq4 sample1
baseline ravdess ratio1 seq4 sample2
baseline ravdess ratio1 seq4 sample3
baseline ravdess ratio1 seq4 sample4
baseline ravdess ratio1 seq4 sample5
baseline ravdess ratio1 seq4 sample6
baseline ravdess ratio1 seq4 sample7
baseline ravdess ratio1 seq4 sample8
baseline ravdess ratio1 seq4 sample9
Tacotron GST
gst ravdess seq0 sample0
gst ravdess seq0 sample1
gst ravdess seq0 sample2
gst ravdess seq0 sample3
gst ravdess seq0 sample4
gst ravdess seq0 sample5
gst ravdess seq0 sample6
gst ravdess seq0 sample7
gst ravdess seq0 sample8
gst ravdess seq0 sample9
gst ravdess seq1 sample0
gst ravdess seq1 sample1
gst ravdess seq1 sample2
gst ravdess seq1 sample3
gst ravdess seq1 sample4
gst ravdess seq1 sample5
gst ravdess seq1 sample6
gst ravdess seq1 sample7
gst ravdess seq1 sample8
gst ravdess seq1 sample9
gst ravdess seq2 sample0
gst ravdess seq2 sample1
gst ravdess seq2 sample2
gst ravdess seq2 sample3
gst ravdess seq2 sample4
gst ravdess seq2 sample5
gst ravdess seq2 sample6
gst ravdess seq2 sample7
gst ravdess seq2 sample8
gst ravdess seq2 sample9
gst ravdess seq3 sample0
gst ravdess seq3 sample1
gst ravdess seq3 sample2
gst ravdess seq3 sample3
gst ravdess seq3 sample4
gst ravdess seq3 sample5
gst ravdess seq3 sample6
gst ravdess seq3 sample7
gst ravdess seq3 sample8
gst ravdess seq3 sample9
gst ravdess seq4 sample0
gst ravdess seq4 sample1
gst ravdess seq4 sample2
gst ravdess seq4 sample3
gst ravdess seq4 sample4
gst ravdess seq4 sample5
gst ravdess seq4 sample6
gst ravdess seq4 sample7
gst ravdess seq4 sample8
gst ravdess seq4 sample9
3.5 Interpolation between styles (Prior and Posterior)
Flowtron model with speaker embeddings trained on LibriTTS. We interpolate between a spherical Gaussian prior and a posterior computed on evidence from Sally's Born of Darkness. We evaluate different values of λ.
We call the listener's attention to the interpolation of non-textual characteristics that are hard to compute but easy to perceive, and to the gradual transition from one speaking style to another.
Flowtron Baseline
Flowtron λ = 2
Flowtron λ = 1
Flowtron λ = 0.666
Flowtron λ = 0.1
Target Style
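This page does not state how λ enters the computation. One reading that matches the ordering of the samples above (smaller λ sounds closer to the target style) is the standard conjugate Gaussian update, with a standard normal prior and λ treated as the assumed variance of the evidence. The sketch below follows that assumption; `gaussian_posterior` is a hypothetical name and the evidence vectors are toy values.

```python
def gaussian_posterior(evidence, lam):
    """Posterior mean and shared variance for a N(0, I) prior.

    evidence: list of equal-length z-vectors observed for the target style.
    lam: assumed variance of each evidence vector (must be positive).
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    n = len(evidence)
    n_dims = len(evidence[0])
    z_bar = [sum(z[d] for z in evidence) / n for d in range(n_dims)]
    weight = n / lam              # total precision contributed by the evidence
    precision = weight + 1.0      # plus the unit precision of the prior
    mean = [weight * zb / precision for zb in z_bar]
    variance = 1.0 / precision
    return mean, variance


if __name__ == "__main__":
    evidence = [[1.0, 0.0], [3.0, -8.0]]  # toy evidence, mean is [2, -4]
    for lam in (2.0, 1.0, 0.666, 0.1):
        print(lam, gaussian_posterior(evidence, lam))
```

Under this assumption, a large λ keeps the posterior near the zero-mean prior, and a small λ pulls its mean toward the average of the evidence while shrinking its variance.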
3.6.2 Sampling the Gaussian Mixture (Translating dimensions)
We select a single component from the Gaussian mixture and translate a dimension associated with pitch. Although the samples have different pitch contours, they have similar durations.
μ (a-flat)
μ - 2σ (c)
μ - 4σ (e-flat)
We select a single component from the Gaussian mixture and translate a dimension associated with speech rate. Although the samples have different speech rates, they have similar pitch contours.
μ
μ - 2σ
μ - 4σ
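The column labels μ, μ - 2σ, and μ - 4σ suggest shifting one coordinate of a component's mean by a multiple of that coordinate's standard deviation while leaving the rest untouched. The following sketch illustrates that operation under the assumption of a diagonal-covariance component; `translate_component` and the toy values are hypothetical.

```python
def translate_component(mu, sigma, dim, k):
    """Return a copy of mu with mu[dim] shifted by k standard deviations."""
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    shifted = list(mu)  # copy so the original component mean is not modified
    shifted[dim] = mu[dim] + k * sigma[dim]
    return shifted


if __name__ == "__main__":
    mu = [1.0, 2.0, 3.0]      # toy component mean
    sigma = [0.5, 0.5, 0.5]   # toy per-dimension standard deviations
    pitch_dim = 1             # hypothetical index of a pitch-related dimension
    for k in (0, -2, -4):
        print(k, translate_component(mu, sigma, pitch_dim, k))
```

Because only one coordinate moves, any attribute tied to the other dimensions would be expected to stay roughly the same, which is in line with the observation that duration or pitch contour is preserved.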
Extra Flowtron samples
To reverb or not to reverb
Queen's accent
Last Halloween...
This Corona year...