Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph

Zero-shot voice conversion samples

Note:

1. To hear the fine differences between conversion results, please wear headphones when listening to the following samples.

2. 'target' means the voice conversion system extracts style information directly from the utterance in this column.

3. 'authentic target' is shown for convenient comparison. It is a real recording of the target speaker speaking the same sentence as the source utterance. However, the voice conversion system does not use this authentic target utterance during conversion. Instead, other utterances spoken by the same target speaker are fed into the voice conversion system for style extraction.

LibriSpeech samples

Source Target Converted

System comparison (style extracted from 5 target utterances)

Source Authentic Target AutoVC AdaIN-VC FragmentVC S2VC Retriever (ours)

Conversion results using different numbers of style tokens

Source Authentic Target 1 style token 5 style tokens 10 style tokens 60 style tokens

Ablation

Source Authentic Target Too narrow bottleneck Too wide bottleneck AdaIN decoder Retriever

Conversion results using different numbers of target utterances during inference

Source Authentic Target 1 target utterance 3 target utterances 5 target utterances 10 target utterances