\section{Results}
\label{sec:results}

\subsection{Datasets and Evaluation Metrics}

Experiments are conducted on a curated version of the HookTheory dataset~\cite{yeh2021automatic} (15,440 MIDI lead sheets), following previous harmonization studies~\cite{rhyu2022translating, huang2024emotion}. To reflect harmonic rhythm, redundant chord repetitions within bars are removed. All pieces are transposed to C major or A minor, with keys estimated by the Krumhansl key-finding algorithm~\cite{krumhansl2001cognitive}. The split comprises 14,679 training pieces and 761 validation/test pieces (approximately 95\%/5\%).

Generalization is evaluated both \textit{in-domain} (HookTheory test split) and \textit{out-of-domain} (650 curated jazz standards). Each model generates harmonizations for the melodies in these sets, which are evaluated against ground-truth harmonies using established chord- and rhythm-based metrics~\cite{sun2021melody, wu2024generating}:
\textbf{CHE} (Chord Histogram Entropy), \textbf{CC} (Chord Coverage), \textbf{CTD} (Chord Tonal Distance), \textbf{CTnCTR} (Chord Tone to non-Chord Tone Ratio), \textbf{PCS} (Pitch Consonance Score), \textbf{MCTD} (Melody--Chord Tonal Distance), \textbf{HRHE} (Harmonic Rhythm Histogram Entropy), \textbf{HRC} (Harmonic Rhythm Coverage), and \textbf{CBS} (Chord Beat Strength).
Average ground-truth statistics for both datasets are shown in Table~\ref{tab:ground_truth}.
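
For concreteness, three of the chord-progression metrics can be written out as they are commonly defined in the harmonization literature. This is a sketch rather than a restatement of our evaluation code, and the symbols $p_i$, $|C|$, $n_c$, $n_n$, and $n_p$ are notation introduced here for illustration only. Let $p_i$ be the relative frequency of the $i$-th chord class in the chord histogram of a harmonization, with $|C|$ chord classes in total. Then
\[
\mathrm{CHE} = -\sum_{i=1}^{|C|} p_i \log p_i ,
\qquad
\mathrm{CC} = \bigl|\{\, i : p_i > 0 \,\}\bigr| ,
\]
with the convention $0 \log 0 = 0$. Writing $n_c$ for the number of melody notes that are chord tones, $n_n$ for the number of non-chord tones, and $n_p$ for the number of ``proper'' non-chord tones (assumed here to be non-chord tones whose following melody note lies within two semitones), the chord-tone ratio is
\[
\mathrm{CTnCTR} = \frac{n_c + n_p}{n_c + n_n} .
\]
Under these definitions, a harmonization that uses a single chord throughout has $\mathrm{CHE} = 0$ and $\mathrm{CC} = 1$, and a harmonization in which every melody note is a chord tone has $\mathrm{CTnCTR} = 1$.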

\subsection{Effect of Unmasking Order}

We first evaluate five unmasking strategies during inference: \texttt{start}, \texttt{end}, \texttt{certain}, \texttt{uncertain}, and \texttt{random}, using the single-encoder (\texttt{SE}) model. 
For each metric, the mean absolute error (MAE) is computed between the values obtained for the generated harmonizations and those obtained for the reference harmonizations.
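
One way to write this error explicitly is sketched below. It assumes that each metric is evaluated per piece before averaging; the symbols $m$, $N$, $h_n$, and $\hat{h}_n$ are introduced here for illustration. For a metric $m$ and a test set of $N$ melodies with reference harmonizations $h_n$ and generated harmonizations $\hat{h}_n$,
\[
\mathrm{MAE}_m = \frac{1}{N} \sum_{n=1}^{N} \bigl| m(\hat{h}_n) - m(h_n) \bigr| .
\]
Lower values indicate that the generated harmonizations match the reference statistics more closely, and $\mathrm{MAE}_m = 0$ holds only if every generated piece reproduces the reference value of $m$ exactly. If the metric values were instead averaged over the dataset before comparison, the expression would reduce to the absolute difference of the two dataset means, which is never larger than the per-piece form above.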

Results (Table~\ref{tab:order}) show that the \texttt{certain} strategy, which unmasks the tokens for which the model exhibits the highest confidence, consistently outperforms the others in both in-domain and out-of-domain settings. This suggests that harmonization generation benefits from data-driven uncertainty guidance rather than fixed-order decoding. Notably, the same ranking of strategies holds across all metrics, suggesting that inference behavior is robust across musical styles.
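
The confidence-guided selection can be sketched formally as follows. The notation is introduced here for illustration, and two simplifying assumptions are made: confidence is measured by the largest predicted token probability, and a single position is unmasked per step. Let $x$ denote the melody, $h^{(t)}$ the partially unmasked harmony sequence at step $t$, $M_t$ the set of positions that are still masked, and $p_\theta(h_i = c \mid x, h^{(t)})$ the predicted probability of chord token $c$ at position $i$. The \texttt{certain} strategy then selects
\[
i^{\ast} = \arg\max_{i \in M_t} \; \max_{c} \; p_\theta\bigl(h_i = c \mid x, h^{(t)}\bigr),
\]
while \texttt{uncertain} replaces the outer $\arg\max$ with $\arg\min$. Under the same notation, \texttt{start} and \texttt{end} would correspond to choosing the smallest and largest index in $M_t$, and \texttt{random} to sampling $i^{\ast}$ uniformly from $M_t$. The selected position is filled with a token drawn from the predicted distribution, removed from $M_t$, and the procedure repeats until $M_t$ is empty.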

\subsection{Ablation Study: Architectural Insights}

We next compare single-encoder (\texttt{SE}) and dual-encoder (\texttt{DE}) architectures and their ablations under the \texttt{certain} unmasking regime (Table~\ref{tab:ablations}). 

The \texttt{SE} model achieves the best overall results, particularly on rhythm-related metrics for the out-of-domain jazz set, despite having fewer than half the parameters of \texttt{DE}.
In-domain, \texttt{DE\_noM} (dual encoder without melody self-attention) performs slightly better, indicating that cross-attention can compensate for missing melody self-context.
Surprisingly, models whose harmony input is fully masked throughout training (\texttt{v0}) do not collapse, supporting the hypothesis that harmonic structure can be inferred indirectly from melodic patterns alone.
Even more strikingly, the \texttt{DE\_noMH} variant (no self-attention in either encoder) remains functional, suggesting that cross-attention alone can partially encode both melody--harmony and harmony--harmony dependencies, a finding that merits further investigation.

\subsection{Attention Dynamics}

Figure~\ref{fig:attn_maps} visualizes averaged attention maps across layers and heads for representative models.  
Even when harmony tokens remain masked during all training epochs (\texttt{DE\_v0}), coherent self-attention structures emerge in the harmony encoder. 
When melody self-attention is removed (\texttt{DE\_noM}), harmony self-attention reorganizes, seemingly compensating for missing melodic structure. 
Cross-attention in \texttt{DE\_noM} remains similar to the full model, while in \texttt{DE\_noMH} it becomes diffuse, implying an adaptive redistribution of representational load. 
These emergent behaviors highlight the model's ability to develop internal harmonic organization even under heavily constrained or degenerate training regimes.
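
The averaging behind such maps can be written as follows, assuming a uniform average over layers and heads; the symbols $N_L$, $N_H$, and $A^{(l,k)}$ are notation introduced here for illustration. With $A^{(l,k)}$ denoting the attention matrix of head $k$ in layer $l$,
\[
\bar{A} = \frac{1}{N_L N_H} \sum_{l=1}^{N_L} \sum_{k=1}^{N_H} A^{(l,k)} .
\]
Because every $A^{(l,k)}$ is produced by a row-wise softmax, each row of $\bar{A}$ also sums to one, so the averaged map can still be read as a distribution over attended positions. A diffuse map in this sense is one whose rows are close to uniform, whereas a structured map concentrates its mass on a few positions per row.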

