\section{Introduction}\label{sec:intro}

Transformer architectures have emerged as powerful sequence modeling frameworks across domains such as language, vision, and music~\cite{vaswani2017attention, huang2018musictransformer}. Within symbolic music generation, melodic harmonization (the task of generating a harmonic accompaniment for a given melody) poses unique challenges: it requires both local melodic compatibility and long-range harmonic coherence. Harmonization therefore serves as a strong testbed for studying how sequence models integrate and structure multiple musical dimensions across time.

Early neural approaches to automatic harmonization relied on recurrent architectures~\cite{lim2017chord, yeh2021automatic, chen2021surprisenet, yi2022accomontage2}, while more recent transformer-based methods~\cite{rhyu2022translating, huang2024emotion, wu2024melodyt5, bhandari2025improvnet} typically frame harmonization as a sequence-to-sequence translation task, generating harmony autoregressively. However, autoregressive decoding enforces a left-to-right generation order, which limits flexibility when harmonic constraints (e.g., fixed cadences or key modulations) must be imposed prior to generation. Among non-autoregressive approaches, some perform diffusion in a continuous approximation of the discrete token space~\cite{mittal2021symbolic, lv2023getmusic}, while others apply diffusion in the latent space of a variational autoencoder (VAE)~\cite{zhang2023fast}.

Non-autoregressive, encoder-only transformers offer a compelling alternative. Such models have been explored in other domains through the lens of discrete diffusion~\cite{austin2021structured} or masked token modeling, as in MaskGIT~\cite{chang2022maskgit}, which iteratively refines partially masked token sequences. These approaches enable flexible conditioning and can generate substantially faster than autoregressive models, since many tokens are predicted in parallel at each step.

Applied to melodic harmonization, this framework represents melody and harmony on a synchronized temporal grid and progressively unmasks harmony tokens until a complete harmonization is produced~\cite{kaliakatsos2025diffusion}. This formulation naturally supports partial conditioning and parallel generation, aligning more closely with the inherently bidirectional nature of harmonic reasoning.

Recent work has shown that training such models effectively requires careful curriculum design. When harmony tokens are gradually unmasked during training, the model learns to rely on melodic context for harmonic inference rather than on trivial self-copying patterns. Interestingly, even when all harmony tokens remain masked for long portions of training, harmony-related self-attention patterns still emerge in the generative encoder. This suggests that the model develops an implicit sense of harmonic organization purely through its exposure to melodic structures.

Experiments presented in this paper with cross-attention-only architectures, where neither melody nor harmony uses self-attention, reveal that performance remains comparable to that of full-attention models. This finding hints that cross-attention can, to some degree, internalize self-like relational patterns among harmony tokens. Similar phenomena have been observed in multimodal transformers~\cite{tsai2019multimodal, alayrac2022flamingo}, where cross-modal layers spontaneously capture intra-modal dependencies despite the absence of explicit self-attention mechanisms.

Finally, inference in non-autoregressive harmonization models introduces new challenges: since generation need not proceed sequentially, the order of unmasking can be chosen flexibly. We therefore investigate several unmasking strategies (starting from cadential positions, from high-confidence predictions, or from random tokens) and analyze their musical and structural implications through both quantitative metrics and attention visualizations.
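
To make the role of the unmasking order concrete, the iterative procedure can be sketched as follows. The notation is introduced here purely for illustration and is an assumption, not the formulation of any cited work. Let $m = (m_1, \dots, m_T)$ denote the melody tokens on a grid of $T$ positions, let $h^{(k)}$ denote the partially masked harmony sequence after $k$ steps, and let $M_k \subseteq \{1, \dots, T\}$ be the set of positions that are still masked, with $M_0 = \{1, \dots, T\}$ when no harmony is given in advance. Assuming a model $p_\theta$ that outputs a distribution over harmony tokens at every position, one generic step selects a nonempty subset $S_k \subseteq M_k$ and sets
\[
h^{(k+1)}_t =
\left\{
\begin{array}{ll}
\hat{h}_t \sim p_\theta\bigl(h_t \mid m, h^{(k)}\bigr), & t \in S_k, \\[2pt]
h^{(k)}_t, & \mbox{otherwise},
\end{array}
\right.
\qquad
M_{k+1} = M_k \setminus S_k .
\]
Generation stops once $M_k = \emptyset$, which takes at most $|M_0|$ steps. Under this sketch, an unmasking strategy is simply a rule for choosing $S_k$. A confidence-based rule could, for example, take $S_k = \{\arg\max_{t \in M_k} \max_{v} p_\theta(h_t = v \mid m, h^{(k)})\}$, a random rule would draw $S_k$ uniformly from $M_k$, and user-imposed constraints correspond to removing the constrained positions from $M_0$ before the first step.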

In summary, this work explores how encoder-only, non-autoregressive transformers learn harmonic structure under different attention and unmasking regimes. Beyond their practical advantages for constraint-based harmonization, the observed emergent behaviors offer new insights into how structured musical relations can arise from weak or indirect supervision. Future work will further examine the underlying mechanisms and draw connections to broader studies on emergent attention dynamics in multimodal and self-supervised learning systems. The code for this work is available online\footnote{\url{https://github.com/NeuraLLMuse/EncoderOnlyMelHarmSelfCross.git}}.