\section{Method}\label{sec:method}

This section describes the melody and harmony representations, the single- and dual-encoder architectures, the training procedure, and the inference strategies.

\subsection{Melody and harmony representation}

A quarter-note resolution is sufficient to capture all harmonic details in the datasets used for training and testing, with no overlapping chords within the same segment. Melody events occurring within each quarter note are grouped and represented as a binary \textit{pitch-class} piano roll with an additional binary column marking bar boundaries. Formally, the pitch-class matrix is defined as $\mathbf{PC} \in \{0,1\}^{L \times 13}$, where $L$ is the number of quarter-note steps. The first 12 columns correspond to the 12 pitch classes, following~\cite{rhyu2022translating}, with active pitch classes indicated by~1. The 13th column is zero everywhere except at bar onsets, where it is~1 and all 12 pitch-class columns are zero.
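
To make the layout of $\mathbf{PC}$ concrete, the following Python sketch builds the matrix for a toy melody. It is an illustration only, not the implementation used in this work: the function name \texttt{pitch\_class\_matrix}, the nested-list input format, and the choice to give each bar onset its own row (one reading of the description above) are assumptions.

\begin{verbatim}
```python
def pitch_class_matrix(bars):
    """Build an L x 13 binary matrix (as a list of rows).

    bars: list of bars; each bar is a list of quarter-note steps;
          each step is a list of MIDI pitch numbers sounding in it.
    Assumption: every bar onset gets its own row in which only the
    13th column is set.
    """
    rows = []
    for bar in bars:
        rows.append([0] * 12 + [1])      # bar-onset row
        for step in bar:
            row = [0] * 13
            for pitch in step:
                row[pitch % 12] = 1      # fold octaves into pitch classes
            rows.append(row)
    return rows


# Hypothetical two-bar melody with two quarter notes per bar:
# (C4 + E4, G4) | (rest, A4)
pc = pitch_class_matrix([[[60, 64], [67]], [[], [69]]])
```
\end{verbatim}

For this toy input the result has six rows: a bar row, two melody rows, a second bar row, an all-zero row for the rest, and a final melody row.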

Harmony is represented as a sequence of chord tokens from a fixed vocabulary $\mathcal{V}$, denoted $\mathbf{y} \in \mathcal{V}^L$. Chord symbols are normalized following the \texttt{mir\_eval} convention~\cite{raffel2014mir_eval} (e.g., \texttt{C:maj7} instead of \texttt{C$\triangle$}). The vocabulary includes $12 \times 29 = 348$ chord symbols (12 root pitch classes $\times$ 29 chord qualities). Harmony is aligned to the same quarter-note grid: if a chord spans multiple steps, it is repeated for its duration. For example, a \texttt{C:maj7} spanning two beats occupies two grid positions. Special tokens cover the remaining cases: \texttt{<nc>} denotes ``no chord,'' \texttt{<pad>} fills trailing positions beyond the harmonization length, and \texttt{<bar>} marks bar boundaries. Both melody and harmony representations thus encode bar-level structure explicitly. Figure~\ref{fig:pianoroll}~(a) shows an example segment from the test dataset.
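
The grid alignment of the harmony sequence can be sketched as follows. This is a hedged illustration rather than the actual preprocessing code: the function name \texttt{harmony\_tokens}, the \texttt{(chord, duration)} input format, and the example chords are hypothetical.

\begin{verbatim}
```python
def harmony_tokens(bars, total_length=None):
    """Expand chords onto the quarter-note grid.

    bars: list of bars; each bar is a list of (chord, n_quarter_notes),
          where chord is a label such as "C:maj7" or None for no chord.
    total_length: if given, pad the sequence with <pad> up to this length.
    """
    tokens = []
    for bar in bars:
        tokens.append("<bar>")                      # explicit bar boundary
        for chord, beats in bar:
            label = chord if chord is not None else "<nc>"
            tokens.extend([label] * beats)          # repeat for the duration
    if total_length is not None:
        tokens.extend(["<pad>"] * (total_length - len(tokens)))
    return tokens


# Hypothetical two-bar progression in 4/4, padded to 12 positions.
tokens = harmony_tokens(
    [[("C:maj7", 2), ("A:min7", 2)], [(None, 1), ("G:7", 3)]],
    total_length=12,
)
```
\end{verbatim}

Here the two bars yield ten tokens (two \texttt{<bar>} markers plus eight quarter-note positions), followed by two \texttt{<pad>} tokens.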


\begin{figure}[!ht]
   \centering
\begin{tabular}{cc}
\adjustbox{valign=m}{\includegraphics[width=0.55\textwidth]{figs/pianoroll_example.png}}&
\adjustbox{valign=m}{\includegraphics[width=0.41\textwidth]{figs/architectures.drawio.png}}\\
(a) Music representation & (b) \texttt{SE} and \texttt{DE} architectures \\
\end{tabular}
    \caption{(a) Example of a pitch-class piano roll (shown as a $13 \times L$ matrix, i.e., $\mathbf{PC}^\top$) and the respective harmony tokens as x-axis labels. (b) Overview of \texttt{SE} and \texttt{DE} architectures.}
    \label{fig:pianoroll}
\end{figure}

\subsection{Model architectures}

The proposed transformer architectures, abstractly illustrated in Figure~\ref{fig:pianoroll}~(b), are based on BERT~\cite{devlin2019bert} and adapted for generation through masked language modeling (MLM). Two variants are explored:  
(a) a \textbf{single-encoder} model (\texttt{SE}), where the input sequence jointly encodes melody and harmony information, and  
(b) a \textbf{dual-encoder} model (\texttt{DE}), with a dedicated melody encoder and a harmony-generative encoder connected via cross attention.  
Both models predict chord tokens conditioned on a melodic context and on a varying proportion of visible (unmasked) harmony tokens.

During inference, the harmony sequence is initially fully masked using \texttt{<mask>} tokens. The model then iteratively unmasks tokens over a sequence of steps indexed by~$t$, receiving at each step a partially masked harmony input $\mathbf{y}_{\text{in}}^{(t)}$. Although accelerated multi-token unmasking strategies exist~\cite{kaliakatsos2025diffusion}, we focus here on single-token unmasking for clarity.

During training, the models learn to estimate the conditional distribution:
%
\begin{equation}
p_\theta\!\left( \mathbf{y}_{\text{target}}^{(k)} \mid \mathbf{y}_{\text{in}}^{(k)}, \mathbf{m} \right),
\label{eq:conditional-prob}
\end{equation}
%
where $\mathbf{y}_{\text{target}}^{(k)}$ denotes the subset of harmony tokens to be predicted at training step~$k$, and $\mathbf{m}$ is the melody matrix $\mathbf{PC} \in \{0,1\}^{L \times 13}$.

The melody matrix is first projected through a linear layer before entering the transformer (the single encoder in \texttt{SE}, the melody encoder in \texttt{DE}). The harmony input (masked and unmasked tokens) is passed through an embedding layer.
In the \texttt{SE} model, the transformer output corresponding to the harmony portion is used to compute a cross-entropy loss for predicting masked harmony tokens, while the melody portion of the output is ignored.
In the \texttt{DE} model, the melody encoder provides contextual information to the harmony encoder via cross attention, enabling the latter to reconstruct harmony tokens at its output.

\subsection{Training and inference}\label{subsec:training_inference}

At the beginning of training, all harmony tokens are masked, and only the melody is visible. This setup compels the model to establish attention pathways between melody and harmony. As training progresses, harmony tokens are gradually revealed, transitioning from full masking to partial visibility. This progression exposes the model to both regimes: full reliance on melody and partial reliance on visible harmony context. Interestingly, models trained entirely in the fully masked regime still produce high-quality harmonizations, as discussed in Section~\ref{sec:results}.

The number of visible harmony tokens at training step~$k$ is defined as
%
\begin{equation}\label{eq:num_unmasked}
    \#\text{unmasked} = \min\!\left(\lfloor v \cdot L \rfloor,\, L-1\right),
\end{equation}
%
where the visible fraction~$v$ follows
%
\begin{equation}\label{eq:visible_percentage}
    v = \left(\frac{k}{k_\text{total}}\right)^5,
\end{equation}
%
with $k$ the current training step and $k_\text{total}$ the total number of steps. The exponent of~5 allocates roughly half of the training duration to the fully masked regime, since the count in Eq.~(\ref{eq:num_unmasked}) remains zero while $k/k_\text{total} < L^{-1/5}$; similar exponents produce comparable performance.
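
The schedule in Eqs.~(\ref{eq:visible_percentage}) and~(\ref{eq:num_unmasked}) can be sketched in a few lines of Python. The sequence length and step counts below are hypothetical values chosen for illustration, and the function names are not taken from the original code base.

\begin{verbatim}
```python
import math


def visible_fraction(k, k_total, exponent=5):
    """Visible fraction v at training step k."""
    return (k / k_total) ** exponent


def num_unmasked(k, k_total, length, exponent=5):
    """Number of visible harmony tokens; at least one token stays masked."""
    v = visible_fraction(k, k_total, exponent)
    return min(math.floor(v * length), length - 1)


def fully_masked_share(length, exponent=5):
    """Share of training during which no harmony token is visible.

    floor(v * L) == 0  <=>  v < 1 / L  <=>  k / k_total < L ** (-1 / exponent)
    """
    return length ** (-1.0 / exponent)


# Hypothetical setting: L = 32 grid positions, 1000 training steps.
counts = [num_unmasked(k, 1000, 32) for k in (0, 499, 500, 900, 1000)]
share = fully_masked_share(32)
```
\end{verbatim}

For $L = 32$ the threshold equals $0.5$, and it decreases slowly for longer sequences (about $0.44$ for $L = 64$), which is in line with the ``roughly half'' figure.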

Let $\mathcal{H}$ denote the set of all harmony positions, $\mathcal{M}^{(k)} \subseteq \mathcal{H}$ the set of masked positions, and $\mathcal{U}^{(k)} = \mathcal{H} \setminus \mathcal{M}^{(k)}$ the set of visible positions at step~$k$. The model input $\mathbf{y}_{\text{in}}^{(k)}$ is defined elementwise as
%
\begin{equation}
y_{\text{in},i}^{(k)} =
\begin{cases}
y_i, & i \in \mathcal{U}^{(k)},\\[2pt]
\texttt{<mask>}, & i \in \mathcal{M}^{(k)}.
\end{cases}
\label{eq:masking-rule}
\end{equation}
%
The prediction targets are the tokens at the masked positions, $\mathbf{y}_{\text{target}}^{(k)} = \{y_i \mid i \in \mathcal{M}^{(k)}\}$, and the MLM loss is computed as
%
\begin{equation}
\mathcal{L}^{(k)} = - \sum_{i \in \mathcal{M}^{(k)}} \log p_\theta\!\left(y_i \mid \mathbf{m}, \mathbf{y}_{\text{in}}^{(k)}\right).
\label{eq:loss-step}
\end{equation}
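
As a minimal sketch of the masking rule in Eq.~(\ref{eq:masking-rule}) and the loss in Eq.~(\ref{eq:loss-step}), consider the following Python fragment. It assumes that the visible positions are drawn uniformly at random, which the text above does not specify, and it represents the model output as hypothetical per-position probability dictionaries instead of calling a real network.

\begin{verbatim}
```python
import math
import random

MASK = "<mask>"


def mask_harmony(tokens, n_visible, rng):
    """Return (model_input, masked_positions) with n_visible tokens shown.

    Assumption: visible positions are sampled uniformly at random.
    """
    visible = set(rng.sample(range(len(tokens)), n_visible))
    masked = [i for i in range(len(tokens)) if i not in visible]
    model_input = [tok if i in visible else MASK
                   for i, tok in enumerate(tokens)]
    return model_input, masked


def mlm_loss(probs, targets, masked):
    """Negative log-likelihood summed over the masked positions only.

    probs[i] maps each candidate token to its predicted probability
    at position i (a stand-in for the model output).
    """
    return -sum(math.log(probs[i][targets[i]]) for i in masked)


# Hypothetical four-step harmony with one visible token.
targets = ["C:maj", "C:maj", "G:7", "G:7"]
model_input, masked = mask_harmony(targets, 1, random.Random(0))
probs = [{tok: 0.5} for tok in targets]   # toy predictions
loss = mlm_loss(probs, targets, masked)
```
\end{verbatim}

With three masked positions and a probability of $0.5$ assigned to each target, the summed loss is $3 \ln 2$.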

At inference time, generation begins from a fully masked harmony sequence and proceeds for~$L$ unmasking steps. At each step, one masked token is selected and predicted according to one of five unmasking strategies:
%
\begin{description}
    \item[\texttt{start}] Sequentially from the first to the last token, mimicking autoregressive decoding.
    \item[\texttt{end}] From the last to the first token, prioritizing cadential regions~\cite{allan2004harmonising}.
    \item[\texttt{random}] Selecting masked positions uniformly at random.
    \item[\texttt{certain}] Selecting the position with the lowest logit entropy (highest model confidence).
    \item[\texttt{uncertain}] Selecting the position with the highest logit entropy (lowest model confidence).
\end{description}
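The five strategies differ only in how the next position is chosen, which the following Python sketch makes explicit. It is an illustration under assumptions: ``logit entropy'' is interpreted as the Shannon entropy of the softmax distribution at each position, and the function names and toy distributions are hypothetical.

\begin{verbatim}
```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_position(strategy, masked, dists, rng):
    """Pick the next position to unmask.

    masked: list of still-masked indices.
    dists:  dists[i] is the predicted distribution at position i.
    """
    if strategy == "start":
        return min(masked)
    if strategy == "end":
        return max(masked)
    if strategy == "random":
        return rng.choice(masked)
    if strategy == "certain":
        return min(masked, key=lambda i: entropy(dists[i]))
    if strategy == "uncertain":
        return max(masked, key=lambda i: entropy(dists[i]))
    raise ValueError(f"unknown strategy: {strategy}")


# Toy example: position 0 is confident, position 1 is maximally unsure.
dists = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
masked = [0, 1, 2]
rng = random.Random(0)
picks = {s: select_position(s, masked, dists, rng)
         for s in ("start", "end", "random", "certain", "uncertain")}
```
\end{verbatim}
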
%
Once a position is selected, the model samples a prediction for that position, $\hat{\mathbf{y}}^{(t)} \sim p_\theta(\cdot \mid \mathbf{m}, \mathbf{y}_{\text{in}}^{(t)})$,
and updates the input sequence by replacing the corresponding \texttt{<mask>} token: $\mathbf{y}_{\text{in}}^{(t+1)} = \mathbf{y}_{\text{in}}^{(t)} \cup \hat{\mathbf{y}}^{(t)}$.
All experiments used nucleus sampling ($p=0.9$) with temperature~$0.2$.
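
For readers unfamiliar with nucleus sampling, a minimal Python sketch of the sampling step is given below. It assumes that the temperature is applied to the logits before the nucleus is formed; the order of these two operations is not stated above, and the function name and toy logits are hypothetical.

\begin{verbatim}
```python
import math
import random


def nucleus_sample(logits, rng, top_p=0.9, temperature=0.2):
    """Sample one index with temperature scaling and top-p filtering."""
    # Temperature-scaled softmax (max subtracted for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices normalizes the weights within the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


# Toy logits over three chord tokens.
samples = [nucleus_sample([2.0, 1.0, 0.0], random.Random(seed))
           for seed in range(20)]
```
\end{verbatim}

At temperature~$0.2$ the distribution becomes sharply peaked: for the toy logits above, the most likely token alone exceeds the $0.9$ threshold, so the nucleus contains a single candidate and sampling is effectively greedy.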

All models had 8 layers and 8 heads per layer for each encoder (one encoder for the \texttt{SE} architecture and two for the \texttt{DE} architecture). Models were trained using AdamW with a learning rate of $1\times10^{-4}$ and a batch size of~8 for 200~epochs. For models trained with the gradual unmasking curriculum, the final-epoch version was retained, as it has passed through all curriculum stages. For models trained entirely with fully masked harmony, the checkpoint with the lowest validation loss was used. Training was performed on three NVIDIA RTX~3080 GPUs. The loss was averaged over tokens and batches.