Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

We propose style equalization, which enables generative sequence models to control style and content separately, without requiring any style labels during training or inference. Our method enables many novel applications, including auto-completing and auto-correcting handwriting, generating speech in different voices, generating missing training samples for a downstream recognizer, and analyzing the biases and failure modes of a recognizer.

Our proposed method is generic and can be easily applied to various signal domains. Below, we showcase our models on two different tasks: handwriting and speech synthesis.
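The inference-time interface implied above can be sketched as follows: content and style are provided as two separate inputs, and the model reproduces the content while transferring only the style. This is a hedged toy illustration of that separation, not the paper's actual model; `StyleContentModel` and its internals are stand-ins.

```python
class StyleContentModel:
    """Toy stand-in for a controllable generative sequence model."""

    def encode_style(self, reference):
        # Stand-in style encoder: summarize the reference signal by a
        # simple statistic (a real model would learn a style embedding).
        return sum(reference) / len(reference)

    def generate(self, content, style_reference):
        # Render each content token with the extracted style; here the
        # "style" is just an additive offset, so the content/style
        # separation is easy to inspect.
        style = self.encode_style(style_reference)
        return [c + style for c in content]

model = StyleContentModel()
out = model.generate(content=[0.0, 1.0, 2.0], style_reference=[4.0, 6.0])
print(out)  # content preserved up to a constant style offset of 5.0
```

Changing `style_reference` shifts every output token identically while the content sequence stays intact, which mirrors the goal of the method: the same content rendered in different styles.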

Speech Synthesis

Given a reference speech audio, our model generates new audio that sounds as if it were recorded in the same environment by the same speaker. In other words, we mimic the voice characteristics of the speaker, the background noise, the echo, the microphone response, etc., but with our target content.

In the video below, we type the content in the input text box (top row), use the slider to choose a random speech audio as the style reference input (middle row), and synthesize the input text with the style of the reference audio (bottom row). Please turn on your audio.

As can be seen, our method accurately mimics the style of the reference example while producing the correct content.

Here is a quick comparison with Global Style Tokens, which is also an unsupervised method.
The goal is to read the input text in the same style (e.g., voice characteristics, background noise, echo) as the style input.

Input text 1: I did not see any reason to change the captain.

Style text 1: When the candle ends sent up their conical yellow flames, all the colored figures from Austria stood out clear and full of meaning against the green boughs.
Style text 2: The man shrugged his broad shoulders and turned back into the arabesque chamber.
(For each style text, the page provides three audio samples: the style input itself, the Global Style Tokens output, and our proposed model's output.)

Input text 2: Next year it plans to open an office in Tokyo.

Style text 1: I had meant it to be the story of my life, but how little of my life is in it!
Style text 2: Every landscape, low and high, seems doomed to be trampled and harried.

Please click to see a detailed comparison.

Handwriting Synthesis

Given a reference handwriting sample, which comprises a sequence of pen movements, our model generates new handwriting in the same writing style.
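The pen-movement representation mentioned above can be made concrete with a small sketch. A common convention for online handwriting (used here as an assumption, not the paper's exact data format) encodes each timestep as a relative offset plus a binary end-of-stroke flag, `(dx, dy, pen_up)`; the helper below recovers absolute-coordinate strokes from such a sequence.

```python
def deltas_to_strokes(deltas):
    """Convert relative pen movements into absolute-coordinate strokes.

    Each element of `deltas` is (dx, dy, pen_up); pen_up == 1 marks the
    end of the current stroke (the pen is lifted).
    """
    strokes, current = [], []
    x = y = 0.0
    for dx, dy, pen_up in deltas:
        x += dx
        y += dy
        current.append((x, y))
        if pen_up:  # pen lifted: close the current stroke
            strokes.append(current)
            current = []
    if current:  # trailing points without a final pen-up
        strokes.append(current)
    return strokes

# Two short strokes: a horizontal segment, then a diagonal one.
sample = [(1, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
print(deltas_to_strokes(sample))
# → [[(1.0, 0.0), (2.0, 0.0)], [(2.0, 1.0), (3.0, 2.0)]]
```

Rendering the generated sequence of strokes (the bottom row of the video) amounts to drawing each stroke's points as a connected polyline.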

In the video below, we type the content in the input text box (top row), use the slider to choose a random style (the rasterized style handwriting is shown alongside the selection in the middle row), and synthesize the input content in the selected handwriting style (shown as a sequence of strokes in the bottom row).

*For privacy reasons, the style references used in this video are synthetic. They are close reproductions of unseen real styles, produced by a generative model with a different architecture. The generations shown here are very similar when real samples are used as the style input. Note that all evaluations reported in the paper use real, unseen style examples.

Please click to see more handwriting examples.

Quick introduction video