\section{Conclusions}\label{sec:conlusions}

This paper presented a non-autoregressive approach to melodic harmonization based on an encoder-only transformer trained with a synchronized melody–harmony grid. The study compared multiple architectural and inference variants, showing that single-encoder models can achieve superior harmonization quality with fewer parameters than dual-encoder counterparts. Furthermore, the results highlighted two intriguing phenomena: (i) harmony-related self-attention patterns emerge even when harmony tokens remain fully masked during training, and (ii) models relying solely on cross-attention can perform comparably to those using both self- and cross-attention, suggesting that cross-attention may implicitly capture harmony–harmony relations.

Experiments on inference unmasking strategies revealed that different decoding orders (such as starting from cadential regions or from high-confidence tokens) can meaningfully affect the musical structure and coherence of generated harmonies. Future work will focus on understanding the mechanisms underlying these emergent attention behaviors, refining inference scheduling strategies, and exploring the cognitive and music-theoretical interpretations of non-autoregressive harmonization dynamics.

This study focuses on objective chord- and rhythm-based metrics, which provide reliable and widely used indicators of harmonization quality. Nevertheless, perceptual aspects such as naturalness and stylistic preference are not captured by such metrics. Incorporating listening-based evaluations is an important direction for future work and could further validate the musical qualities observed in our generated harmonizations. Another promising direction involves human-in-the-loop evaluation of controllable harmonization, assessing how flexible unmasking strategies and partial conditioning support real-world compositional workflows.