\documentclass[pmlr]{jmlr}% new name PMLR (Proceedings of Machine Learning Research)
% Template adapted for the 1st Workshop on Emerging AI Technologies for Music, as part of AAAI
% https://amaai-lab.github.io/EAIM2026/

 % The following packages will be automatically loaded:
 % amsmath, amssymb, natbib, graphicx, url, algorithm2e

 %\usepackage{rotating}% for sideways figures and tables
\usepackage{longtable}% for long tables
\usepackage{graphicx}
 % The booktabs package is used by this sample document
 % (it provides \toprule, \midrule and \bottomrule).
 % Remove the next line if you don't require it.
\usepackage{booktabs}
 % The siunitx package is used by this sample document
 % to align numbers in a column by their decimal point.
 % Remove the next line if you don't require it.
\usepackage{siunitx}
% ----- ADD THIS CODE BLOCK FOR CENTERED HEADERS -----
% \usepackage{fancyhdr}
% \pagestyle{fancy}
% \fancyhf{} % Clear all default header and footer fields
% \fancyhead[EC]{\leftmark}  % Center the header on Even pages (your title)
% \fancyhead[OC]{\rightmark} % Center the header on Odd pages (the authors)
% \fancyfoot[C]{\thepage}    % Put the page number in the Center of the footer
% \renewcommand{\headrulewidth}{0pt} % Optional: removes the line under the header
% -----------------------------------------------------
 % The following command is just for this sample document:
\newcommand{\cs}[1]{\texttt{\char`\\#1}}

 % Define an unnumbered theorem just for this sample document:
\theorembodyfont{\upshape}
\theoremheaderfont{\scshape}
\theorempostheader{:}
\theoremsep{\newline}
\newtheorem*{note}{Note}

 % change the arguments, as appropriate, in the following:
\jmlrvolume{303}
\jmlryear{2026}
\jmlrworkshop{EAIM2026 at AAAI}

\title[Conditional Vocal Timbral Technique Conversion]{Conditional Vocal Timbral Technique Conversion via Embedding-Guided Dual Attribute Modulation}
 % Use \Name{Author Name} to specify the name.

 % Spaces are used to separate forenames from the surname so that
 % the surnames can be picked up for the page header and copyright footer.
 
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % *** Make sure there's no spurious space before \nametag ***

% Author information should be kept anonymous for the double-blind policy. Add in this information for the camera ready version.
 % Two authors with the same address
  \author{\Name{Ting-Chao Hsu} \Email{r12942155@ntu.ee.tw}\and
   \Name{Yi-Hsuan Yang} \Email{yhyangtw@ntu.ee.tw}\\
   }


\editors{D. Herremans, K. Bhandari, A. Roy, S. Colton, M. Barthet}

\begin{document}

\maketitle

\begin{abstract}
Vocal timbral techniques, such as whisper, falsetto, and vocal fry scream, uniquely shape the spectral properties of the human voice, which makes converting between them while preserving the original speaker's identity a complex challenge. Traditional voice conversion methods, while effective at altering speaker identity or broad timbral qualities, often struggle to transform specialized timbral techniques without compromising speaker-specific traits. Similarly, existing style-transfer models, which are designed to capture broad categories like emotional expressiveness or singing styles, lack the necessary granularity to handle technique-specific variations. To address this, we propose FABYOL, a novel framework for timbral technique conversion built upon FACodec. FABYOL leverages supervised contrastive learning to generate embeddings that encode specific timbral techniques. These embeddings are then used to modulate timbre and prosody, enabling authentic technique conversion while preserving speaker identity. Experimental evaluation, using both tailored objective metrics and a user study, demonstrates that FABYOL achieves promising performance and offers significant improvements in fidelity and flexibility compared to state-of-the-art models. To support this task, we also introduce the EMO dataset, a high-quality, paired corpus developed with a specific focus on vocal fry scream. Audio samples, source code, pre-trained checkpoints, and the EMO dataset are available at \url{https://alberthsu0509.github.io/FABYOL/}.

\end{abstract}
\begin{keywords}
Vocal Timbral Technique, Voice Conversion 
\end{keywords}

\section{Introduction}
\label{sec:intro}
Vocal timbre is fundamentally a combination of a speaker's identity and their applied timbral technique. These techniques are pervasive across diverse audio domains, including screaming in heavy metal music, whisper in cinema, and falsetto in voice acting. It is important to distinguish these techniques, which primarily alter the texture of the voice, from pitch-related techniques such as vibrato or trills. Defined by distinct vocal fold vibration patterns and spectral characteristics, they enhance expressiveness and convey specific artistic intent. Developing models that can controllably convert these techniques would unlock significant creative applications in the audio and music industry. For example, creators could instantly re-style a performance into a whisper to enhance narrative intimacy, while artists could generate extreme vocalizations—such as vocal fry screaming—without the need for extensive specialized training.

As this is a nascent research direction, we begin with speech, because current speech corpora offer more distinct, higher-contrast timbral technique variations than available singing datasets. For example, prominent singing datasets such as VocalSet \citep{chou2018vocalset} and GTSinger \citep{hsu2024gtsinger} are often limited to subtler distinctions, such as ``breathy'' or ``mixed voice''. Furthermore, dedicated datasets for melodic screaming are non-existent. This scarcity of appropriate singing data motivates our focus on the speech domain.
% Establishing a robust speech framework is thus a critical first step before tackling the broader challenges of the singing domain.

However, this task remains a significant challenge.
While specialized methods exist for whisper or Lombard speech~\citep{cotescu2020voice,hu2021whispered}, each is typically treated as a separate task.
Existing voice conversion models also show clear limitations for this purpose. Models like CosyVoice \citep{du2024cosyvoicescalablemultilingualzeroshot} and FreeVC \citep{li2022freevchighqualitytextfreeoneshot}, which are built for cross-speaker conversion, process timbre information in order to replace the speaker identity, thereby discarding the original identity and neglecting technique control. Similarly, FACodec~\citep{ju2024naturalspeech3zeroshotspeech}, while capable of broader timbre adjustments, lacks precision for techniques like scream or whisper and fails to retain speaker traits due to its generalized timbre handling. Style-transfer methods~\citep{du2022disentanglementemotionalstylespeaker,9413391}, in turn, are designed to disentangle style or emotion and overlook timbral techniques as distinct style elements. These gaps highlight the need for an approach tailored to timbral technique conversion that preserves speaker identity.

To address this, we propose FABYOL, a novel framework for timbral technique conversion built upon a pre-trained, frozen FACodec \citep{ju2024naturalspeech3zeroshotspeech} architecture. FABYOL employs a  BYOL-TT encoder, which uses targeted augmentation and contrastive learning to derive robust technique-specific embeddings. These embeddings guide a dual attribute modulation, implemented via adaptive layer normalization (AdaLN) \citep{peebles2023scalablediffusionmodelstransformers}. This design stems from our key finding: while modulating timbre is an expected component of conversion, we found that prosody modulation is equally critical for achieving authentic results. This lightweight approach allows for targeted technique conversion while maintaining the speaker's core identity.

A further challenge is the evaluation of this task, given the lack of established metrics to quantify technique similarity. Our objective evaluation therefore employs a combination of proxy metrics from vocal analysis to assess technique-specific features and a specific protocol to measure speaker preservation. This is supplemented by a user study measuring perceptual authenticity across technique similarity, speaker similarity, and naturalness. To address the scarcity of paired screaming data and facilitate future research, we also introduce the EMO dataset, which is a one-hour paired dataset from a single speaker containing modal voice and vocal fry scream.
The contributions of this paper are as follows:

\setlength{\leftmargini}{1em}

\begin{itemize}

    \item We demonstrate that prosody modulation, in addition to timbre modulation, is essential for effective timbral technique conversion.

    \item We present FABYOL, the first model to our knowledge that performs timbral technique conversion while preserving the speaker identity.

    \item We introduce the EMO dataset and an evaluation framework that addresses the unique challenges of assessing timbral technique conversion quality.

\end{itemize}



We encourage readers to visit our demo website for audio samples that demonstrate our model's performance.

\section{Related Work}

With the growing interest in vocal-related research, academic studies on vocal timbral techniques have gained popularity. Previous works have focused on vocal technique detection \citep{yamamoto2022analysisdetectionsingingtechniques, kalbag2022screamdetectionheavymetal}, style-controlled voice conversion \citep{dai2025everyonecansingzeroshotsingingvoice, du2022disentanglementemotionalstylespeaker, 9413391}, and representation learning for various vocal-related attributes \citep{ju2024naturalspeech3zeroshotspeech, elbanna2022byolslearningselfsupervisedspeech, Yakura2022}. These studies have laid the foundation for our work.


\textbf{Vocal Technique Detection}. Detecting vocal techniques is foundational for vocal manipulation. A previous study \citep{yamamoto2022analysisdetectionsingingtechniques} focused on detecting vocal techniques of J-POP solo singers, demonstrating the ability to distinguish between pitch and timbral techniques in J-POP singing. Moreover, \citet{kalbag2022screamdetectionheavymetal} studied the detection of harsh vocal effects such as screaming, using spectral features to classify extreme timbres in heavy metal music.

\textbf{Style-Controlled Voice Conversion}. Singing voice conversion has progressed toward singing technique control (e.g., falsetto, vibrato) using diffusion models~\citep{dai2025everyonecansingzeroshotsingingvoice}, yet rarely focuses on timbral techniques like whisper or scream. Speech style models~\citep{du2022disentanglementemotionalstylespeaker, 9413391} concentrate on emotional style transfer in spoken voice. While adept at disentangling emotional attributes, these approaches overlook timbral techniques as style components.

\textbf{Representation Learning}. Several general-purpose audio representation learning approaches \citep{défossez2022highfidelityneuralaudio, saeed2020contrastivelearninggeneralpurposeaudio, Niizumi2023} have shown success in various applications. More recently, representations specialized for a particular audio source, such as speech or singing voice \citep{elbanna2022byolslearningselfsupervisedspeech, Yakura2022}, have proven superior in their target domains. Motivated by the growing demand for finer control over vocals, FACodec \citep{ju2024naturalspeech3zeroshotspeech} focuses on disentangling attributes including timbre, prosody, and content. Although FACodec can disentangle timbre, it falls short of isolating timbral techniques from identity or content, which limits finer control of the timbre attribute.

While prior work has made significant progress, timbral techniques remain largely unexplored across these domains. Our work addresses this gap by integrating \emph{technique embeddings} into FACodec \citep{ju2024naturalspeech3zeroshotspeech}, enabling speaker-consistent conversion.

\section{Timbral Technique Extractor}
\label{sec:math}
In our proposed method, FABYOL, we aim to develop a conditional generator, denoted as $\mathcal{G}(\mathbf{x}, \mathbf{h}_\text{tech})$, that transforms an input audio signal $\mathbf{x} $ into an output $\mathbf{y}$, where $\mathbf{x}$ and $\mathbf{y}$ are temporally aligned and share the same speaker identity and linguistic content, but exhibit different timbral techniques. The technique embedding $\mathbf{h}_\text{tech}$, a learnable representation of timbral techniques, is derived by the embedding extractor $E$ from a reference audio signal $\mathbf{x}_{\text{ref}}$, such that $\mathbf{h}_\text{tech} = E(\mathbf{x}_{\text{ref}})$.

To build an effective timbral technique conversion system, we first need a representation of vocal techniques that generalizes across speakers and linguistic content.


\subsection{Contrastive Objective of Timbral Technique}

\noindent \textbf{BYOL framework}.
To derive robust timbral technique representations, we adopt Bootstrap Your Own Latent (BYOL)~\citep{grill2020bootstraplatentnewapproach}, a self-supervised learning method that learns from positive pairs without needing negative samples, unlike traditional contrastive approaches such as SimCLR~\citep{chen2020simpleframeworkcontrastivelearning}. BYOL uses two networks: the online network $f_\theta$ processes input $\mathbf{x}$, while the target network $f_\xi$ handles an augmented version $\mathbf{u}$. A predictor $q_\theta$ aligns the online output to the target, with $f_\xi$’s parameters updated as an exponential moving average of $f_\theta$’s parameters, controlled by a decay rate $\tau \in [0, 1]$. The loss minimizes the mean squared error:

\[
\mathcal{L}_{\text{BYOL}} = \| q_\theta(f_\theta(\mathbf{x})) - f_\xi(\mathbf{u}) \|_2^2.
\]
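
For reference, the exponential moving average mentioned above takes the following form in the standard BYOL formulation. We state it as a reading aid, under the assumption that the update is used unchanged here:
\[
\xi \leftarrow \tau\,\xi + (1 - \tau)\,\theta .
\]
With $\tau$ close to $1$, the target network changes slowly, which keeps the regression target stable. No gradient is propagated through $f_\xi$.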


% \noindent \textbf{Disentanglemet of timbral technique, speaker and linguistic content}
\noindent \textbf{Disentanglement objective}.
Building on this foundation, we introduce BYOL-TT (BYOL for Timbral Techniques), which leverages BYOL-A’s audio encoder~\citep{Niizumi2023} to disentangle vocal signals into three distinct components: timbral technique, speaker identity, and linguistic content. This disentanglement is critical for enabling accurate timbral technique conversion while preserving speaker identity. Techniques like whisper or scream drastically alter vocal spectra and are often entangled with speaker characteristics, making it difficult for models to modify the technique without inadvertently changing who the speaker sounds like. 

Therefore, the effectiveness of BYOL-TT relies on constructing meaningful positive pairs, that is, samples that vary in speaker identity and linguistic content but share the same timbral technique. Such pairs guide the model to treat the technique as the invariant factor and to learn robust, disentangled representations. To achieve this, we apply targeted augmentation~\citep{Yakura2022}, crafting transformations that isolate timbral technique while varying other attributes. We propose two pair-generation methods: \textbf{DSP Augmentation} and \textbf{Real-world Data Selection}.



\subsection{DSP Augmentation}

Our first augmentation strategy, DSP augmentation, generates synthetic positive pairs from a single audio sample by applying signal processing techniques that selectively modify different attributes of the audio while keeping the timbral technique unchanged.

We apply \textbf{Sequence Perturbation (SP)}~\citep{deng2024learningdisentangledspeechrepresentations} to change the linguistic content of the audio. This is done by splitting the audio into several segments and shuffling their order, which alters the content without affecting the speaker's voice or the technique being used. We denote the resulting audio as:
\[
\mathbf{x}_{\text{SP}} = \text{SP}(\mathbf{x}).
\]
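
As an illustrative sketch of this operation, suppose the waveform is split into $K$ contiguous segments $\mathbf{s}_1, \dots, \mathbf{s}_K$. The symbols $K$, $\mathbf{s}_k$, and $\pi$ are introduced here for illustration only, and the segment count is a hyperparameter. Sequence perturbation can then be written as
\[
\mathbf{x} = [\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_K], \qquad
\text{SP}(\mathbf{x}) = [\mathbf{s}_{\pi(1)}, \mathbf{s}_{\pi(2)}, \dots, \mathbf{s}_{\pi(K)}],
\]
where $\pi$ is a random permutation of $\{1, \dots, K\}$. Each segment is left intact, so the local spectral texture within a segment is unchanged while the order of the linguistic content is scrambled.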

In parallel, we apply \textbf{Vocal Tract Length Perturbation (VTLP)}~\citep{Jaitly2013VocalTL} to simulate a different speaker identity by warping the spectral characteristics of the voice. This changes how the speaker sounds, while preserving both the original content and the vocal technique. The perturbed version is denoted as:
\[
\mathbf{x}_{\text{VTLP}} = \text{VTLP}(\mathbf{x}, \alpha),
\]
where $\alpha$ denotes the warping factor.

By pairing $\mathbf{x}_{\text{SP}}$ and $\mathbf{x}_{\text{VTLP}}$, we construct a positive pair where the only shared attribute is the timbral technique. This encourages the model to learn representations that are invariant to speaker and content, and sensitive only to vocal technique. We apply both pre-norm and post-norm to the embeddings and compute the BYOL-style contrastive loss:
\[
\mathcal{L}_{\text{DSP}} = \left\| q_\theta(f_\theta(\mathbf{x}_{\text{SP}})) - f_\xi(\mathbf{x}_{\text{VTLP}}) \right\|_2^2.
\]


\subsection{Real-world Data Selection}

While DSP Augmentation offers a controllable approach, it relies on synthetic transformations that might not fully capture natural variations. This led us to explore a complementary strategy: Real-world Data Selection.

Real-world data selection leverages our labeled dataset to create positive pairs from distinct audio clips $\mathbf{x}_1$ and $\mathbf{x}_2$ that share the same timbral technique but have different speaker identities and contain different linguistic content. By carefully selecting samples that meet these criteria, we ensure that the model focuses on technique-specific features. The loss for Selection is formulated as:
\[
\mathcal{L}_{\text{Sel}} = \| q_\theta(f_\theta(\mathbf{x}_1)) - f_\xi(\mathbf{x}_2) \|_2^2.
\]

In our experiments, this approach outperforms DSP Augmentation. We attribute this to the greater diversity of real samples, which expose the model to a broader range of speakers and content and may improve the generalization of the technique embedding $\mathbf{h}_\text{tech}$. Because it relies on the dataset's natural variability rather than on synthetic changes, it may also yield more faithful representations.

In our setup, the trained BYOL-TT encoder $E = f_{\theta}$ generates $\mathbf{h}_\text{tech} = E(\mathbf{x}_{\text{ref}})$ from a reference signal $\mathbf{x}_{\text{ref}}$, conditioning the conversion module $\mathcal{G}(\mathbf{x}, \mathbf{h}_\text{tech})$ in FABYOL.




\section{FABYOL: Timbral Technique Conversion Framework}
\label{sec:vec}



We selected FACodec~\citep{ju2024naturalspeech3zeroshotspeech} as the foundation for FABYOL due to its robust attribute disentanglement and high-fidelity audio reconstruction, effectively separating speaker identity, content, prosody, and acoustic details. FACodec employs a neural codec with factorized vector quantization (FVQ) to decompose speech into distinct subspaces, enabling precise manipulation of vocal attributes. However, FACodec was originally designed and trained for speech, not singing. In our preliminary tests, we observed that the pre-trained model does not generalize well to singing voice. For this reason, we limit the scope of this paper to speech-based techniques (e.g., modal speech, whisper, and non-melodic scream), leaving the extension to singing as a direction for future work.

The FACodec process is briefly outlined as follows: an input waveform \(\mathbf{x}\) is encoded into a latent representation \(\mathbf{h}\), which is then factorized into content embeddings \(\mathbf{z}_\text{c}\) and acoustic-detail embeddings \(\mathbf{z}_\text{d}\), using their respective quantizers, and timbre embeddings \(\mathbf{h}_\text{tim}\), extracted by a transformer encoder. Prosody embeddings \(\mathbf{z}_\text{p}\) are derived from frame-wise acoustic features using a separate transformer and quantizer. These components are recombined, conditioned by timbre via adaptive layer normalization, and decoded into the output waveform \(\mathbf{y}\). This architecture disentangles speaker-specific traits from content and prosody, leveraging vector quantization to preserve nuanced vocal features, making it a strong base for our conversion system.

\begin{figure}[t]
\floatconts
  {fig:fabyol_framework}% label
  {\caption{The FABYOL framework for timbral technique conversion. It integrates a BYOL-TT technique extractor and employs AdaLN to modulate both prosody and timbre. Black arrows depict the original FACodec pipeline, red arrows highlight the additional FABYOL pipeline, and the red dashed arrow indicates supervision. The fire icons mark the only components trained during the process while all other parts of the framework remain frozen.}}% caption
  {%
    \includegraphics[width=\linewidth]{FABYOL_model.png}%
  }
\end{figure}
\subsection{Dual Attribute Modulation}
Our analysis of FACodec revealed that its timbre subspace, originally intended to encode speaker identity, also entangles timbral technique. This entanglement complicates the task of converting specific vocal techniques. Moreover, our experiments show that realistic technique conversion cannot rely on timbre alone. Techniques like whisper and vocal fry exhibit distinct prosodic behaviors: whisper often features reduced pitch variation and energy, while vocal fry is characterized by irregular, low-frequency modulations. These observations suggest that technique modeling requires joint modulation of timbre and prosody.

To achieve that, we evaluated several conditioning strategies, including concatenation~\citep{qian2019autovczeroshotvoicestyle}, cross-attention~\citep{li2024sefvcspeakerembeddingfree}, AdaIN~\citep{chou2019oneshotvoiceconversionseparating}, and FiLM~\citep{perez2017filmvisualreasoninggeneral}, and ultimately chose AdaLN~\citep{peebles2023scalablediffusionmodelstransformers}, a frame-level AdaIN variant. AdaLN effectively removes source-specific global information before injecting target traits~\citep{chou2019oneshotvoiceconversionseparating}, which is beneficial for our setting where source techniques vary across training. Its frame-level modulation also better captures the temporal nuances of vocal techniques, which are not strictly global attributes like speaker identity.

FABYOL retains the original encoders, attribute vector quantizers, and decoder structure from FACodec, all kept frozen throughout the process. Central to our method, the technique embedding $\mathbf{h}_\text{tech} \in \mathbb{R}^{C_\text{tech}}$ is processed via a  multilayer perceptron to produce scale and shift parameters:
\begin{equation}
[\boldsymbol{\gamma}_p(\mathbf{h}_\text{tech}), \boldsymbol{\beta}_p(\mathbf{h}_\text{tech}), \boldsymbol{\gamma}_t(\mathbf{h}_\text{tech}), \boldsymbol{\beta}_t(\mathbf{h}_\text{tech})] = \text{MLP}(\mathbf{h}_\text{tech}) ,
\end{equation}
where $\boldsymbol{\gamma}_p(\mathbf{h}_\text{tech}), \boldsymbol{\beta}_p(\mathbf{h}_\text{tech}) \in \mathbb{R}^{C}$ are parameters for conditioning the prosody subspace, and $\boldsymbol{\gamma}_t(\mathbf{h}_\text{tech}), \boldsymbol{\beta}_t(\mathbf{h}_\text{tech}) \in \mathbb{R}^{C}$ condition the timbre subspace. The prosody parameters are then broadcast along the time dimension so that the same scale and shift apply to each of the $T'$ frames.

We apply AdaLN to modulate both prosody and timbre components, as depicted by the red arrows in Figure~\ref{fig:fabyol_framework}. For prosody, we normalize each time frame to zero mean and unit variance across the channel dimension.
We compute the modulated prosody embedding frame as
\begin{align}
\mathbf{z}'_{\text{p},t} &= \text{AdaLN}(\mathbf{z}_{\text{p},t}, \mathbf{h}_\text{tech}) \notag \\
&= \boldsymbol{\gamma}_p(\mathbf{h}_\text{tech}) \cdot \frac{\mathbf{z}_{\text{p},t} - \mu(\mathbf{z}_{\text{p},t})}{\sigma(\mathbf{z}_{\text{p},t})} + \boldsymbol{\beta}_p(\mathbf{h}_\text{tech}),
\end{align}
where $\mu(\mathbf{z}_{\text{p},t})$ and $\sigma(\mathbf{z}_{\text{p},t})$ represent the mean and standard deviation of $\mathbf{z}_{\text{p},t}$ over its channels.
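
Written out for a single frame with channel entries $z_{\text{p},t}^{(c)}$, $c = 1, \dots, C$, these statistics take the usual layer-normalization form. The superscript notation and the small constant $\epsilon$ are assumptions added here for illustration, with $\epsilon$ included only for numerical stability:
\[
\mu(\mathbf{z}_{\text{p},t}) = \frac{1}{C}\sum_{c=1}^{C} z_{\text{p},t}^{(c)}, \qquad
\sigma(\mathbf{z}_{\text{p},t}) = \sqrt{\frac{1}{C}\sum_{c=1}^{C}\bigl(z_{\text{p},t}^{(c)} - \mu(\mathbf{z}_{\text{p},t})\bigr)^{2} + \epsilon}.
\]
Because $\boldsymbol{\gamma}_p$ and $\boldsymbol{\beta}_p$ depend only on $\mathbf{h}_\text{tech}$, the same scale and shift are applied to every frame $t$, while the normalization statistics are recomputed per frame.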

Similarly, for the timbre embedding, we use
\begin{align}
\mathbf{h}'_\text{tim} &= \text{AdaLN}(\mathbf{h}_\text{tim}, \mathbf{h}_\text{tech}) \notag \\
&= \boldsymbol{\gamma}_t(\mathbf{h}_\text{tech}) \cdot \frac{\mathbf{h}_\text{tim} - \mu(\mathbf{h}_\text{tim})}{\sigma(\mathbf{h}_\text{tim})} + \boldsymbol{\beta}_t(\mathbf{h}_\text{tech}),
\end{align}
where $\mu(\mathbf{h}_\text{tim})$ and $\sigma(\mathbf{h}_\text{tim})$ are computed across the timbre vector's dimension. This dual application of AdaLN ensures both prosody and timbre are reconfigured to reflect the target technique's characteristics.

In the final stage, the modulated prosody embeddings $\mathbf{z}'_\text{p}$, along with $\mathbf{z}_\text{c}$ and $\mathbf{z}_\text{d}$, are summed to form $\mathbf{z}'_{\text{sum}} \in \mathbb{R}^{C \times T'}$. This sum is then conditioned on the modulated timbre embedding through conditional layer normalization, $\mathbf{z}_{\text{cond}} = \text{AdaLN}(\mathbf{z}'_{\text{sum}}, \mathbf{h}'_\text{tim})$, and passed through the frozen decoder to synthesize the output waveform $\hat{\mathbf{x}}$. Inspired by adaptive normalization in style transfer~\citep{chou2019oneshotvoiceconversionseparating} and diffusion transformers~\citep{peebles2023scalablediffusionmodelstransformers}, our AdaLN-based $\mathcal{G}(\mathbf{x}, \mathbf{h}_\text{tech})$ efficiently transfers target technique traits with minimal modification to the existing architecture.
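
The steps of this final stage can be collected in one place as follows, where $\text{Dec}(\cdot)$ is shorthand introduced here for the frozen FACodec decoder and does not appear elsewhere in the text:
\begin{align*}
\mathbf{z}'_{\text{sum}} &= \mathbf{z}'_\text{p} + \mathbf{z}_\text{c} + \mathbf{z}_\text{d}, \\
\mathbf{z}_{\text{cond}} &= \text{AdaLN}(\mathbf{z}'_{\text{sum}}, \mathbf{h}'_\text{tim}), \\
\hat{\mathbf{x}} &= \text{Dec}(\mathbf{z}_{\text{cond}}).
\end{align*}
The technique embedding thus enters the signal path twice: once through $\mathbf{z}'_\text{p}$ inside the sum, and once through $\mathbf{h}'_\text{tim}$ in the conditioning step.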

\subsection{Cross-Speaker Unpaired Reference}

To handle the practical case in which users supply reference samples unrelated to the source, we adopt an unpaired reference scheme during training, similar to \citet{chen2024zeroshotamplifiermodelingonetomany}. We transform a source utterance $\mathbf{x}$ using a randomly chosen reference $\mathbf{x}_{\text{ref}}$ that exhibits the target technique $t_{\text{ref}}$ and comes from a different speaker with different linguistic content than $\mathbf{x}$. This mirrors practical use, unlike traditional supervised training, which relies on ground-truth audio~\citep{liu2024zeroshotvoiceconversiondiffusion}. By decoupling technique-specific learning from speaker identity and content, this approach improves disentanglement and generalization. Our training combines reconstruction with bidirectional conversion on paired data across all transformations, improving the model's adaptability to diverse, unseen speakers.



\subsection{Training Objective}
Our training objective builds on that of FACodec, with the total loss
\begin{align*}
\mathcal{L}_{\text{total}} &= \lambda_{\text{mel}} \mathcal{L}_{\text{mel}} + \lambda_{\text{aux}} \big( \mathcal{L}_{\text{p}} + \mathcal{L}_{\text{tim}} + \mathcal{L}_{\text{tech}} + \mathcal{L}_{\text{cls\_spkr}} + \mathcal{L}_{\text{cls\_tech}} + \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{feat}} \big).
\end{align*}

We retain the original FACodec losses $\mathcal{L}_{\text{mel}}$, $\mathcal{L}_{\text{adv}}$, and $\mathcal{L}_{\text{feat}}$. Additionally, we minimize $L_1$ distances for prosody, $\mathcal{L}_{\text{p}} = \|\mathbf{z}'_{\text{p}} - \mathbf{z}_{\text{p}}^{\text{GT}}\|_1$, timbre, $\mathcal{L}_{\text{tim}} = \|\mathbf{h}'_{\text{tim}} - \mathbf{h}_{\text{tim}}^{\text{GT}}\|_1$, and technique, $\mathcal{L}_{\text{tech}} = \|\mathbf{h}'_\text{tech} - \mathbf{h}_\text{tech}^{\text{GT}}\|_1$. The cross-entropy losses $\mathcal{L}_{\text{cls\_spkr}}$ and $\mathcal{L}_{\text{cls\_tech}}$ supervise speaker identity and technique classification, respectively. Empirically, we set $\lambda_{\text{mel}} = 10$ and $\lambda_{\text{aux}} = 5$.
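
For concreteness, substituting these weights into $\mathcal{L}_{\text{total}}$ gives the following (a direct substitution that introduces no new terms):
\begin{align*}
\mathcal{L}_{\text{total}} &= 10\,\mathcal{L}_{\text{mel}} + 5 \big( \mathcal{L}_{\text{p}} + \mathcal{L}_{\text{tim}} + \mathcal{L}_{\text{tech}} + \mathcal{L}_{\text{cls\_spkr}} + \mathcal{L}_{\text{cls\_tech}} + \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{feat}} \big),
\end{align*}
so, before accounting for differences in the natural scale of each term, a unit change in $\mathcal{L}_{\text{mel}}$ contributes twice as much to the objective as a unit change in any single auxiliary term.
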
\section{Experimental Setup}
\label{sec:floats}


\subsection{Dataset}
Our study used audio from 130 unique speakers across several datasets. After manual quality filtering, the JVS dataset~\citep{takamichi2019jvs} provided 1,500 parallel clips (500 modal voice, 500 whisper, 500 falsetto) from 100 speakers. The EMVD dataset~\citep{10859205} provided 270 clips (135 modal voice and 135 scream) from 27 vocalists. Our self-recorded data comprised the EMO dataset, which consists of 706 clips (353 modal voice and 353 scream) recorded by a single professional metal singer in both Chinese and English, and 6 clips (3 modal voice and 3 scream) from two other speakers. Finally, we used one scream sample from the \texttt{Genera Studios Metal Screams}\footnote{\url{https://generastudios.com/products/metal-screams}} sample pack.

All audio was resampled to 16~kHz and trimmed via voice activity detection (VAD)~\citep{SileroVAD}. Paired data alignment varied by source: JVS and EMVD clips were time-stretched with \texttt{pyrubberband}\footnote{\url{https://github.com/bmcfee/pyrubberband}} using a DTW time map~\citep{muller2007dynamic}, while the EMO clips were manually aligned in a digital audio workstation. Loudness normalization depended on the technique: modal, falsetto, and scream clips were normalized to a modal RMS reference, while whisper clips were normalized to a separate whisper RMS reference. The data was then partitioned: JVS and EMVD were split into speaker-disjoint training and test sets at approximately a 9:1 ratio, and the EMO dataset was randomly split at the same 9:1 ratio. A reference set was constructed by selecting two clips for each technique from the test-set speakers. Our evaluation protocol involved two distinct tasks: 1) Reconstruction: each test file was processed using itself as the reference; 2) Conversion: each test file was converted using every file in the reference set.
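
The RMS normalization can be sketched as follows, under the assumption that each clip is scaled by a single gain; $r_\tau$ is a placeholder for the technique-dependent reference level (the modal reference for modal, falsetto, and scream, and the whisper reference for whisper), whose numerical values are not specified here:
\begin{align*}
\text{RMS}(\mathbf{x}) &= \sqrt{\frac{1}{N}\sum_{n=1}^{N} x_n^2}, &
\tilde{\mathbf{x}} &= \frac{r_\tau}{\text{RMS}(\mathbf{x})}\,\mathbf{x},
\end{align*}
where $N$ is the number of samples in the clip, so that $\text{RMS}(\tilde{\mathbf{x}}) = r_\tau$.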

\subsection{Implementation Details}
We trained the BYOL-TT encoder on an NVIDIA RTX 3090 GPU. It transforms 1-second Mel-spectrograms (16~kHz, 1024-point FFT, 1024-sample window, 160-sample hop) into 1024-dimensional technique embeddings $\mathbf{h}_\text{tech}$. Training took one day with a batch size of 256 and a learning rate of $10^{-4}$. After training, the encoder is frozen and reused in FABYOL. FABYOL was also trained on an NVIDIA RTX 3090 GPU. It processes 1-second Mel-spectrograms with the same parameters and uses a 10-layer MLP with SiLU activation to project $\mathbf{h}_\text{tech}$ into 256-dimensional parameters that condition the prosody and timbre subspaces. Training took approximately two days with a batch size of 16 and a learning rate of $2 \times 10^{-4}$.
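
As a rough check on the input size implied by these settings (illustrative arithmetic only; the exact frame count depends on the padding convention of the STFT implementation, which is not stated here), a 1-second clip at 16~kHz contains 16{,}000 samples, the hop corresponds to $160/16000 = 10$~ms, and the window to $1024/16000 = 64$~ms. The number of frames is then
\begin{align*}
T_{\text{no pad}} &= \left\lfloor \frac{16000 - 1024}{160} \right\rfloor + 1 = 94, &
T_{\text{centered}} &= \left\lfloor \frac{16000}{160} \right\rfloor + 1 = 101,
\end{align*}
without padding and with centered (half-window) padding, respectively.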

\subsection{Evaluation Metrics}

\textbf{Objective Metrics.}  
We first employ Mel Cepstral Distortion (MCD)~\citep{kubichek1993mel} to measure the spectral distance between converted and ground-truth (GT) audio samples. To evaluate whisper conversion, we use the Harmonic-to-Noise Ratio (HNR)~\citep{fernandes2018harmonic}, a well-established metric in traditional vocal analysis research, because whispering typically reduces harmonic content; values closer to GT indicate better conversion. For falsetto conversion, we use the average fundamental frequency (AF0) as a proxy. While falsetto is a complex acoustic phenomenon, F0 is its most obvious characteristic, with falsetto voice typically exhibiting a significantly higher F0 than modal voice~\citep{keating2014acoustic}. Results nearer to the ground-truth AF0 therefore reflect higher conversion fidelity. To assess speaker identity preservation, we first evaluated existing speaker verification (SV) models across timbral techniques. We found that SV models often assign lower similarity scores when comparing a speaker's ground-truth modal voice with the same speaker's utterances in other techniques than when comparing modal utterances from different speakers. This suggests a bias toward modal conditions and a limited ability to capture speaker identity across techniques. To address this, we propose a more robust, technique-agnostic evaluation: cross-gender modal-to-modal conversion. We compute the speaker embedding cosine similarity (\textbf{SEC}) between source and converted samples using Resemblyzer\footnote{\url{https://github.com/resemble-ai/Resemblyzer}}.
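
For reference, the commonly used forms of two of these metrics are given below. They are standard definitions, stated here as an assumption about the setup: the cepstral order $D$, the treatment of the zeroth coefficient, and any frame alignment are not specified in this section. For mel-cepstral coefficients $c_{t,d}$ (ground truth) and $\hat{c}_{t,d}$ (converted) over $T$ frames, and speaker embeddings $\mathbf{e}_\text{src}$ and $\mathbf{e}_\text{conv}$ of the source and converted samples:
\begin{align*}
\text{MCD} &= \frac{1}{T}\sum_{t=1}^{T} \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\big(c_{t,d}-\hat{c}_{t,d}\big)^2}, \\
\text{SEC} &= \frac{\mathbf{e}_\text{src}^\top \mathbf{e}_\text{conv}}{\|\mathbf{e}_\text{src}\|_2\,\|\mathbf{e}_\text{conv}\|_2}.
\end{align*}
MCD is expressed in dB and is lower for closer spectra, while SEC lies in $[-1, 1]$ and is higher for more similar speaker embeddings.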

\noindent \textbf{Subjective Metrics.}  
We employ three subjective metrics in our user study: TSMOS to evaluate timbral technique similarity, NMOS to assess perceptual quality and naturalness, and SSMOS to measure speaker similarity. Twenty listeners evaluated eight sets of samples, rating the outputs of the baseline and proposed models against a reference audio clip.


\subsection{Baseline Models}

We compare FABYOL, the timbral technique conversion model proposed in this paper, against three state-of-the-art baseline models representing distinct paradigms in voice conversion and timbral control: 1) FreeVC~\citep{li2022freevchighqualitytextfreeoneshot}: a text-free, one-shot voice conversion system built on the VITS framework; 2) CosyVoice~\citep{du2024cosyvoicescalablemultilingualzeroshot}: a scalable, zero-shot TTS system based on supervised semantic tokens; 3) FACodec~\citep{ju2024naturalspeech3zeroshotspeech}: a neural codec that factorizes speech into multiple attributes.

\section{Experimental Results}







\subsection{Reconstruction}

Table~\ref{tab:speaker_similarity} reports the multi-resolution STFT (MRSTFT) loss~\citep{yamamoto2020parallel} for reconstruction across vocal techniques. FACodec reconstructs whisper and falsetto with losses comparable to modal voice (0.86 and 0.91 versus 0.86), and scream with a somewhat higher loss (1.14), despite the absence of these techniques from its training data. This suggests a robust disentanglement of content, prosody, timbre, and acoustic details that adapts to new vocal styles, which makes FACodec a suitable foundation for our work.

However, FACodec's reconstruction quality depends on how well its timbre extractor works. Because the acoustic characteristics of these techniques are entangled with speaker identity, an extractor that captured only speaker identity would degrade reconstruction. The fact that FACodec handles such variety well suggests that its timbre representation is flexible and has good potential for our work. FABYOL builds on this by adding technique embeddings as conditioning inputs, at the cost of a moderately higher reconstruction loss (1.281 versus 0.948 overall).

\begin{table}[t]
\floatconts
  {tab:speaker_similarity}% label
  {\caption{Comparison of reconstruction MRSTFT loss~\citep{yamamoto2020parallel} across vocal techniques: M = Modal, F = Falsetto, W = Whisper, S = Scream.}}% caption
  {%
    \small % Reduces font size
    \renewcommand{\arraystretch}{0.9} % Reduces row height (Compactness)
    \begin{tabular}{lccccc}
      \toprule
      Model & M & F & W & S & Overall \\
      \midrule
      FACodec~\citep{ju2024naturalspeech3zeroshotspeech} & 0.86 & 0.91 & 0.86 & 1.14 & 0.948 \\
      FABYOL & 1.26 & 1.24 & 1.01 & 1.56 & 1.281 \\
      \bottomrule
    \end{tabular}
  }
\end{table}
\vspace{-1em} % Manually reduces space between table and the next section


\subsection{Efficacy of Timbral Technique Conversion}
\label{subsec:efficacy}

\noindent \textbf{Objective Performance}. As shown in Table~\ref{tab:timbral_conversion}, FABYOL achieves the lowest MCD (7.59, against 8.28 to 8.93 for the baselines), indicating higher spectral fidelity across whisper, falsetto, and scream. It outperforms FreeVC and CosyVoice, which favor broad timbral shifts over technique-specific details, and FACodec, which struggles to fully convert timbral techniques due to its lack of prosody modulation. In whisper conversion, FABYOL's HNR ($-1.27$) is the closest to the ground truth (1.32), capturing the weak tonality of whispers, whereas the baselines, lacking prosody modulation, produce overly harmonic outputs (10.55 to 17.01). Similarly, FABYOL's AF0 (379) nearly matches the ground truth (381), thanks to prosody embedding modulation enabling precise spectral control, an area where the baselines lag behind (248 to 328). FABYOL also best preserves speaker identity, with the highest SEC (0.79). We attribute this to targeted augmentations, unpaired reference training, and classifier guidance, which together help separate technique from identity.

\noindent \textbf{Subjective Performance}. The subjective results (Table~\ref{tab:timbral_conversion}) reinforce FABYOL's core strengths. Its technique similarity (TSMOS) of 3.64 is well above that of the baselines (1.49 and 2.18) and approaches the ground truth (3.93), confirming that the conversion moves in the intended direction. FABYOL also achieved the highest speaker similarity (SSMOS) score (4.66, against 2.32 and 2.45), supporting its ability to preserve speaker identity. However, its naturalness (NMOS) is the lowest of the evaluated models (2.86), as listeners perceived synthesis artifacts. We hypothesize that this is a limitation of our lightweight conditioning approach: while AdaLN successfully steers the type of technique, its affine transformation may lack the capacity to fully condition the backbone's internal features, which artifact-free synthesis requires, especially for extreme techniques. Training a generative model from scratch would be ideal but is currently infeasible given the scarcity of large-scale, diverse timbral datasets. Our approach therefore prioritized data efficiency and establishing feasibility. While the lower NMOS is a clear limitation, the high TSMOS and SSMOS scores indicate that we met the primary objective of this study: showing that controlled, speaker-preserving technique conversion is possible. Improving synthesis quality, either by exploring more powerful conditioning methods or by collecting the diverse data needed to train a dedicated model, is the clear next step for future work.


\begin{table}[htbp]
\renewcommand{\arraystretch}{1.2}
\floatconts
  {tab:timbral_conversion}%
  {\caption{Performance comparison of timbral technique conversion across models.}}%
  {%
    \scriptsize
    \begin{tabular}{lccccccc}
      \toprule
      Models & MCD $\downarrow$ & HNR & AF0 & SEC $\uparrow$ & TSMOS $\uparrow$ & NMOS $\uparrow$ & SSMOS $\uparrow$ \\
      \midrule

      GT & --- & 1.32 & 381 & --- & 3.93 $\pm$ 0.86 & 4.08 $\pm$ 0.69 & --- \\
      \midrule
      FreeVC~\citep{li2022freevchighqualitytextfreeoneshot} & 8.93 & 17.01 & 248 & 0.58 & --- & --- & ---\\
      CosyVoice~\citep{du2024cosyvoicescalablemultilingualzeroshot} & 8.28 & 11.97 & 328 & 0.68 & 1.49 $\pm$ 0.25 & \textbf{4.50 $\pm$ 0.20} & 2.32 $\pm$ 1.67 \\
      FACodec~\citep{ju2024naturalspeech3zeroshotspeech} & 8.55 & 10.55 & 298 & 0.74 & 2.18 $\pm$ 0.78 & 3.23 $\pm$ 0.47 & 2.45 $\pm$ 0.35 \\
      \midrule
      \textbf{FABYOL (ours)} & \textbf{7.59} & \textbf{$-$1.27} & \textbf{379} & \textbf{0.79} &
      \textbf{3.64 $\pm$ 0.43} & 2.86 $\pm$ 0.49 & \textbf{4.66 $\pm$ 0.25} \\
      \bottomrule
    \end{tabular}
  }
\end{table}
\vspace{-2em}


\subsection{Ablation Study}

\noindent \textbf{Dual Attribute Modulation}.
As detailed in Table~\ref{tab:ablation_study}, removing prosody modulation noticeably degrades spectral fidelity (MCD rises from 7.59 to 9.27), especially for techniques like falsetto, where AF0 drops from 379 to 198.
Speaker identity remains stable with or without prosody modulation (SEC 0.79 versus 0.78), indicating that prosody modulation primarily improves technique accuracy rather than identity consistency.
These results underscore the importance of prosody-aware modulation for faithfully transferring fine-grained vocal timbral techniques.

\noindent \textbf{Augmentation Strategies}.
Results in Table~\ref{tab:ablation_aug} indicate that DSP-based augmentations improve technique representation through targeted transformations, but methods like VTLP often compromise speaker consistency due to weaker disentanglement (SEC 0.71 versus 0.79).
In contrast, real-data selection strikes the best balance, capturing subtle vocal techniques, preserving spectral detail (MCD 7.59 versus 7.90), and maintaining speaker identity, which makes it the most effective strategy for timbral technique transfer.


\begin{table}[t]
\centering 
\label{tab:combined_ablation}
\begin{minipage}[t]{0.46\textwidth}
  \centering
  \caption{Ablation study results of dual modulation.}
  \label{tab:ablation_study}
  \resizebox{\textwidth}{!}{%
    \begin{tabular}{lcccc}
      \toprule
      Modulations & MCD $\downarrow$ & HNR & AF0 & SEC $\uparrow$ \\
      \midrule
      w/o prosody & 9.27 & \textbf{$-$0.94} & 198 & 0.78 \\
      FABYOL & \textbf{7.59} & $-$1.27 & \textbf{379} & \textbf{0.79} \\
      \bottomrule
    \end{tabular}
  }
\end{minipage}
\hfill
\begin{minipage}[t]{0.48\textwidth}
  \centering
  \caption{Ablation study results of different augmentations.}
  \label{tab:ablation_aug}
  \resizebox{\textwidth}{!}{%
    \begin{tabular}{lcccc}
      \toprule
      Augmentations & MCD $\downarrow$ & HNR & AF0 & SEC $\uparrow$ \\
      \midrule
      BYOL-TT-DSP & 7.90 & \textbf{$-$1.04} & 392 & 0.71 \\
      BYOL-TT-SEL & \textbf{7.59} & $-$1.27 & \textbf{379} & \textbf{0.79} \\
      \bottomrule
    \end{tabular}
  }
\end{minipage}
\end{table}


\section{Conclusion}
We present FABYOL, to our knowledge the first model for vocal timbral technique conversion that preserves speaker identity. Our approach surpasses prior work in spectral fidelity, technique similarity, and identity consistency by modulating both timbre and prosody via embedding-guided AdaLN. We also introduce the EMO dataset, a high-quality paired corpus for this task with a specific focus on vocal fry scream. Future work should focus on improving output naturalness and developing scream-specific metrics. A key priority is to extend the model to handle melodic content, enabling the application of timbral techniques to singing. We also plan to expand the datasets to cover more timbral techniques and languages, and to work toward real-time applications.


\bibliography{eaim}


\end{document}
