\section{Results}\label{sec:results}

Experiments are conducted on a curated version of the HookTheory dataset~\cite{yeh2021automatic} (15,440 MIDI lead sheets), following previous harmonization studies~\cite{rhyu2022translating, huang2024emotion}. To reflect harmonic rhythm, redundant chord repetitions within bars are removed, and all pieces are transposed to C major or A minor using the Krumhansl key-finding algorithm~\cite{krumhansl2001cognitive}. The data are split into 14,679 training pieces and 761 validation/test pieces (95\%/5\%).

Training and validation losses are shown in Figure~\ref{fig:losses}. Training accuracy (i.e., the percentage of correctly unmasked tokens) reached 90--95\% for all architectures during the all-masked harmony epochs and rose above 99\% as harmony tokens were gradually unmasked. The \texttt{v0} versions (no unmasking) reached over 98\%. Test-set accuracy exceeded 65\% for all architectures during the all-masked harmony epochs, where it remained for the \texttt{v0} versions, and rose above 98\% as the number of unmasked input tokens increased.
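As a point of reference, this accuracy can be written as follows, under the assumption that it is measured over the masked harmony positions only; the symbols $\mathcal{M}$, $y_i$, and $\hat{y}_i$ are introduced here for illustration:
\[
\mathrm{Acc} = \frac{100}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \mathbf{1}\left[\hat{y}_i = y_i\right],
\]
where $\mathcal{M}$ is the set of masked harmony positions, $y_i$ is the ground-truth token at position $i$, and $\hat{y}_i$ is the most probable token predicted by the model at that position.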

\begin{figure}[!ht]
   \centering
\begin{tabular}{cc}
\includegraphics[width=0.47\textwidth]{figs/train_loss.png}&
\includegraphics[width=0.47\textwidth]{figs/val_loss.png}\\
(a) Training loss & (b) Validation loss \\
\end{tabular}
    \caption{Training and validation loss for all examined models.}
    \label{fig:losses}
\end{figure}

Generated melodic harmonizations are evaluated both \textit{in-domain} (HookTheory test split) and \textit{out-of-domain} (650 curated jazz standards). Each model generates harmonizations for the melodies in these sets, which are evaluated against ground-truth harmonies using established chord- and rhythm-based metrics~\cite{sun2021melody, wu2024generating}:
\textbf{CHE} (Chord Histogram Entropy), \textbf{CC} (Chord Coverage), \textbf{CTD} (Chord Tonal Distance), \textbf{CTnCTR} (Chord Tone to non-Chord Tone Ratio), \textbf{PCS} (Pitch Consonance Score), \textbf{MCTD} (Melody--Chord Tonal Distance), \textbf{HRHE} (Harmonic Rhythm Histogram Entropy), \textbf{HRC} (Harmonic Rhythm Coverage), and \textbf{CBS} (Chord Beat Strength).
Average ground-truth statistics for both datasets are shown in Table~\ref{tab:ground_truth}. In future extensions, we plan to supplement these quantitative metrics with listening studies and curated harmonization examples, enabling a more perceptual assessment.
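For orientation, the two chord-histogram metrics are commonly defined as sketched below; the notation is introduced here for illustration, and the logarithm base is an assumption. Let $p_i$ be the relative frequency of chord label $i$ in a harmonization, out of a vocabulary of $|C|$ labels, with the convention $0 \log 0 = 0$. Then
\[
\mathrm{CHE} = -\sum_{i=1}^{|C|} p_i \log p_i, \qquad \mathrm{CC} = \left|\{\, i : p_i > 0 \,\}\right| .
\]
Higher values of either metric indicate a more varied chord vocabulary, which is consistent with the larger CHE and CC of the jazz set in Table~\ref{tab:ground_truth}.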

\begin{table*}[ht]
  \centering
  \caption{Average ground-truth metric values over all pieces in the test set (\textit{in-domain}) and the jazz set (\textit{out-of-domain}).}
  \label{tab:ground_truth}
  \resizebox{\textwidth}{!}{%
\begin{tabular}{lrrrrrrrrr}
\toprule
Ground truth & CHE & CC & CTD & CTnCTR & PCS & MCTD & HRHE & HRC & CBS \\
\midrule
Test set & 1.4078 & 4.9485 & 0.9748 & 0.7769 & 0.4060 & 1.4139 & 0.4542 & 1.9710 & 0.2314 \\
Jazz set & 2.2027 & 11.6471 & 0.8208 & 0.8297 & 0.3145 & 1.4042 & 0.5093 & 2.0607 & 0.2426 \\
\bottomrule
\end{tabular}%
}
\end{table*}


\subsection{Effect of Unmasking Order}

We first evaluate five unmasking strategies during inference: \texttt{start}, \texttt{end}, \texttt{certain}, \texttt{uncertain}, and \texttt{random}. For this comparison we use the single-encoder (\texttt{SE}) model; all other models produced similar results. 
Mean absolute error (MAE) is computed between generated and reference harmonizations across all metrics.
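One way to make this computation explicit, assuming that each metric is evaluated per piece and the absolute errors are then averaged (the symbols $N$, $m_i$, $h_i$, $\hat{h}_i$, and $f_k$ are introduced here for illustration), is
\[
\mathrm{MAE}_k = \frac{1}{N} \sum_{i=1}^{N} \left| f_k(\hat{h}_i, m_i) - f_k(h_i, m_i) \right|, \qquad
\overline{\mathrm{MAE}} = \frac{1}{9} \sum_{k=1}^{9} \mathrm{MAE}_k ,
\]
where $m_i$ is the melody of piece $i$, $h_i$ and $\hat{h}_i$ are its reference and generated harmonizations, $f_k$ is the $k$-th metric, and $\overline{\mathrm{MAE}}$ is the unweighted mean over the nine metrics, corresponding to the last column of Table~\ref{tab:order}.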

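Results (Table~\ref{tab:order}) show that the \texttt{certain} strategy, which unmasks the tokens for which the model exhibits the highest confidence, attains the lowest average MAE and the lowest error on most individual metrics in both in-domain and out-of-domain settings. This suggests that harmonization generation benefits from data-driven uncertainty guidance rather than fixed-order decoding. Notably, the ranking of strategies by average MAE (\texttt{certain}, \texttt{start}, \texttt{end}, \texttt{random}, \texttt{uncertain}) is the same in both settings, suggesting that this inference behavior is robust to musical style; the melody--chord metrics (CTnCTR, PCS, MCTD) vary little across strategies.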
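The confidence-guided selection can be sketched as follows, assuming that one token is unmasked per step and that confidence is measured by the largest predicted token probability at each masked position (an entropy-based measure would be an alternative); the notation is introduced here for illustration. With $m$ the melody, $h^{(t)}$ the partially unmasked harmony at step $t$, and $\mathcal{M}_t$ the set of harmony positions still masked,
\[
i_t^{\mathrm{certain}} = \arg\max_{i \in \mathcal{M}_t} \, \max_{c} \, p_\theta\!\left(h_i = c \mid m, h^{(t)}\right), \qquad
i_t^{\mathrm{uncertain}} = \arg\min_{i \in \mathcal{M}_t} \, \max_{c} \, p_\theta\!\left(h_i = c \mid m, h^{(t)}\right).
\]
Under the same notation, and assuming that \texttt{start} and \texttt{end} proceed left-to-right and right-to-left, these strategies would select $\min \mathcal{M}_t$ and $\max \mathcal{M}_t$ respectively, while \texttt{random} would draw a position uniformly from $\mathcal{M}_t$.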

\begin{table*}[ht]
  \centering
  \caption{Comparison of unmasking order strategies during inference on the \textit{in-domain} test set and the \textit{out-of-domain} jazz set using the \texttt{SE} model architecture. Mean absolute errors (MAEs) are reported, and the smallest value per metric is shown in bold. Rows are sorted in ascending order of average MAE, which is shown in the last column.}
  \label{tab:order}
  \resizebox{\textwidth}{!}{%
\begin{tabular}{lrrrrrrrrrr}
\toprule
Instance & CHE & CC & CTD & CTnCTR & PCS & MCTD & HRHE & HRC & CBS & avg. \\
\midrule
\multicolumn{11}{c}{In-domain / Test set} \\
certain & \textbf{1.3235} & \textbf{4.8536} & \textbf{0.9126} & 0.7933 & 0.3940 & 1.4158 & \textbf{0.4520} & \textbf{2.0383} & \textbf{0.1225} & 1.3673 \\
start & 1.4521 & 5.4406 & 0.9467 & 0.7949 & 0.3910 & 1.4161 & 0.5312 & 2.2243 & 0.1489 & 1.4829 \\
end & 1.5202 & 5.8311 & 0.9525 & 0.7961 & 0.3879 & 1.4166 & 0.5865 & 2.3509 & 0.1640 & 1.5562 \\
random & 1.6076 & 6.3113 & 0.9862 & 0.7962 & 0.3968 & \textbf{1.4125} & 0.7235 & 2.7269 & 0.2094 & 1.6856 \\
uncertain & 1.7405 & 7.1583 & 0.9971 & \textbf{0.7877} & \textbf{0.3848} & 1.4178 & 0.8159 & 2.8958 & 0.2624 & 1.8289 \\
\midrule
\multicolumn{11}{c}{Out-of-domain / Jazz set} \\
certain & \textbf{1.8768} & \textbf{8.6660} & \textbf{0.8723} & 0.8002 & 0.3890 & 1.3851 & \textbf{0.5401} & \textbf{2.5351} & \textbf{0.1098} & 1.9083 \\
start & 2.0047 & 9.5180 & 0.9115 & \textbf{0.7872} & 0.3924 & 1.3860 & 0.6048 & 2.6907 & 0.1252 & 2.0467 \\
end & 2.0551 & 9.8634 & 0.8978 & 0.8051 & \textbf{0.3863} & 1.3807 & 0.6424 & 2.7514 & 0.1381 & 2.1023 \\
random & 2.1457 & 10.7059 & 0.9466 & 0.8047 & 0.3948 & \textbf{1.3782} & 0.8397 & 3.2182 & 0.2006 & 2.2927 \\
uncertain & 2.2651 & 11.8899 & 0.9767 & 0.8021 & 0.3935 & 1.3810 & 0.9283 & 3.3795 & 0.2399 & 2.4729 \\
\bottomrule
\end{tabular}
}
\end{table*}

\subsection{Ablation Study: Architectural Insights}

We next compare single-encoder (\texttt{SE}) and dual-encoder (\texttt{DE}) architectures and their ablations under the \texttt{certain} unmasking regime (Table~\ref{tab:ablations}). 

The \texttt{SE} model achieves the lowest average MAE on the out-of-domain jazz set, owing mainly to the rhythm-related metrics (HRHE, HRC, CBS), and the second-lowest in-domain, despite having less than half the parameters of \texttt{DE}. 
In-domain, \texttt{DE\_noM} (dual encoder without melody self-attention) attains the lowest average MAE, mainly through the chord-histogram metrics (CHE, CC), suggesting that cross-attention can compensate for missing melody self-context. 
Surprisingly, models trained with fully masked harmony throughout training (\texttt{v0}) do not collapse, supporting the hypothesis that harmonic structure can be partly inferred from melodic patterns alone. 
Even more strikingly, the \texttt{DE\_noMH} variant (no self-attention in either encoder) remains functional, suggesting that cross-attention alone can partially encode both melody--harmony and harmony--harmony dependencies, a claim that needs further investigation.
%
% \begin{description}
%     \item[\texttt{SE} and \texttt{DE}] are the ``vanilla'' single and dual encoder architectures, trained with the increasing number of unmasked harmony tokens described in equation~\ref{eq:num_unmasked}.
%     \item[\texttt{SE\_v0} and \texttt{DE\_v0}] are the vanilla architectures trained all the way with only all masked harmony tokens.
%     \item[\texttt{DE\_noH}, \texttt{DE\_noM} and \texttt{DE\_noMH}] are the \texttt{DE} versions with no self attention in the harmony (\texttt{noH}), melody (\texttt{noM}) and none of the encoders (\texttt{noMH}) respectively. These ablations cannot be performed in the \texttt{SE} architecture because of the shared melody-harmony self attention.
% \end{description}

% The vanilla \texttt{SE} architecture is in all domains better than the vanilla \texttt{DE} architecture, especially in the \textit{out-of-domain} where it outperforms all other architectures, mainly because of rhythm-related metrics. This is interesting since the \texttt{SE} architecture has significantly fewer parameters (less than half) in comparison to dual-encoder architectures. In the \textit{in-domain} test the \texttt{DE\_noM} outperforms all others, mainly because of better alignment with the ground truth under chord pitch class distributions. The versions trained all the way with all-masks in harmony tokens (\texttt{v0} variations), are not among the best ones but they do not collapse either, indicating that the some notion of chord pattern is reflected by the melody itself. Furthermore, the variation with no self attention (\texttt{DE\_noMH}) does not collapse, which allows the speculation that cross-attention is capable, to some extent, of capturing sequence-related patterns - a claim that needs further investigation.

\begin{table*}[ht]
  \centering
  \caption{Comparison of ablations on the \textit{in-domain} test set and the \textit{out-of-domain} jazz set using the \texttt{certain} unmasking order. Mean absolute errors (MAEs) are reported, and the smallest value per metric is shown in bold. Rows are sorted in ascending order of average MAE, which is shown in the last column.}
  \label{tab:ablations}
  \resizebox{\textwidth}{!}{%
\begin{tabular}{lrrrrrrrrrr}
\toprule
Instance & CHE & CC & CTD & CTnCTR & PCS & MCTD & HRHE & HRC & CBS & avg. \\
\midrule
\multicolumn{11}{c}{In-domain / Test set} \\
DE\_noM & \textbf{1.2017} & \textbf{4.0778} & 0.9547 & 0.7920 & 0.4217 & 1.4117 & \textbf{0.4492} & 2.0831 & 0.1263 & 1.2798 \\
SE & 1.3235 & 4.8536 & \textbf{0.9126} & 0.7933 & \textbf{0.3940} & 1.4158 & 0.4520 & \textbf{2.0383} & \textbf{0.1225} & 1.3673 \\
DE & 1.3181 & 4.6293 & 0.9235 & \textbf{0.7895} & 0.4235 & 1.4105 & 0.7327 & 2.7639 & 0.2220 & 1.4681 \\
% DE\_lp & 1.4133 & 5.1913 & 0.9720 & 0.7913 & 0.4183 & 1.4120 & 0.9064 & 3.1728 & 0.2996 & 1.6197 \\
DE\_v0 & 1.4143 & 5.2916 & 0.9595 & 0.7981 & 0.4288 & 1.4049 & 0.8867 & 3.1161 & 0.2928 & 1.6214 \\
DE\_noH & 1.4295 & 5.3351 & 0.9547 & 0.8006 & 0.4264 & 1.4065 & 0.8821 & 3.1069 & 0.2871 & 1.6254 \\
DE\_noMH & 1.3928 & 5.3259 & 0.9484 & 0.8056 & 0.4349 & 1.4046 & 0.9980 & 3.4024 & 0.3093 & 1.6691 \\
SE\_v0 & 1.5219 & 5.9354 & 0.9741 & 0.7999 & 0.4345 & \textbf{1.4045} & 1.0972 & 3.6240 & 0.3869 & 1.7976 \\
\midrule
\multicolumn{11}{c}{Out-of-domain / Jazz set} \\
SE & 1.8768 & 8.6660 & 0.8723 & 0.8002 & 0.3890 & \textbf{1.3851} & \textbf{0.5401} & \textbf{2.5351} & \textbf{0.1098} & 1.9083 \\
DE\_noH & 1.7924 & 8.1879 & 0.8040 & 0.7630 & 0.3801 & 1.4018 & 0.9228 & 3.3510 & 0.2539 & 1.9841 \\
DE\_noM & 1.7932 & 8.1935 & \textbf{0.8030} & \textbf{0.7626} & 0.3795 & 1.4016 & 0.9253 & 3.3548 & 0.2543 & 1.9853 \\
DE\_v0 & \textbf{1.7919} & 8.1879 & \textbf{0.8030} & 0.7631 & 0.3800 & 1.4015 & 0.9275 & 3.3662 & 0.2535 & 1.9861 \\
SE\_v0 & \textbf{1.7919} & \textbf{8.1860} & \textbf{0.8030} & 0.7630 & \textbf{0.3794} & 1.4022 & 0.9276 & 3.3700 & 0.2537 & 1.9863 \\
% DE\_lp & 1.7933 & 8.2087 & 0.8036 & 0.7628 & 0.3796 & 1.4016 & 0.9232 & 3.3529 & 0.2532 & 1.9865 \\
DE\_noMH & 1.7925 & 8.1954 & 0.8040 & 0.7631 & 0.3800 & 1.4016 & 0.9263 & 3.3662 & 0.2543 & 1.9871 \\
DE & 1.7935 & 8.2030 & 0.8040 & 0.7633 & 0.3800 & 1.4016 & 0.9299 & 3.3700 & 0.2549 & 1.9889 \\
\bottomrule
\end{tabular}
}
\end{table*}

\subsection{Attention Dynamics}

Figure~\ref{fig:attn_maps} visualizes averaged attention maps across layers and heads for representative models.  
Even when harmony tokens remain masked during all training epochs (\texttt{DE\_v0}), coherent self-attention structures emerge in the harmony encoder. 
When melody self-attention is removed (\texttt{DE\_noM}), harmony self-attention reorganizes, seemingly compensating for missing melodic structure. 
Cross-attention in \texttt{DE\_noM} remains similar to the full model, while in \texttt{DE\_noMH} it becomes diffuse, implying an adaptive redistribution of representational load. 
These emergent behaviors highlight the model’s ability to develop internal harmonic organization even under heavily constrained or degenerate training regimes. A complete analysis should compare these attention patterns with those of randomly initialized encoders. We leave this comparison to future work, but note that such baselines would clarify which structures truly reflect learned harmonic representations.

\begin{figure}[!ht]
   \centering
\begin{tabular}{ccc}
\includegraphics[width=0.32\textwidth]{figs/self_DE_AVG_ALL.png}&
\includegraphics[width=0.32\textwidth]{figs/self_DE_v0_AVG_ALL.png}&
\includegraphics[width=0.32\textwidth]{figs/self_DE_no_Mself_AVG_ALL.png}\\
(a) Self \texttt{DE} & (b) Self \texttt{DE\_v0} & (c) Self \texttt{DE\_noM} \\
\includegraphics[width=0.32\textwidth]{figs/cross_DE_AVG_ALL.png}&
\includegraphics[width=0.32\textwidth]{figs/cross_DE_no_MHself_AVG_ALL.png}&
\includegraphics[width=0.32\textwidth]{figs/cross_DE_no_Mself_AVG_ALL.png}\\
(d) Cross \texttt{DE} & (e) Cross \texttt{DE\_noMH} & (f) Cross \texttt{DE\_noM}
\end{tabular}
    \caption{Attention maps in the harmony encoder for selected ablations, averaged over all layers and heads and over the harmonizations of all test pieces generated with the \texttt{certain} unmasking method.}
    \label{fig:attn_maps}
\end{figure}