\section{Reproducibility}\label{apd:reproducibility}

\subsection{Code \& Data Release}
Upon acceptance, we will release:
\begin{itemize}[leftmargin=*,nosep]
    \item Scoring scripts (structural masking, NLL computation, debiasing, meta-attacker).
    \item MAESTRO splits (JSON format) and the trained model checkpoint.
    \item Evaluation protocol implementations (three views: raw, length-matched, calibrated).
    \item Composer-stratified cross-validation fold assignments.
\end{itemize}

\subsection{Software Environment}
\begin{itemize}[leftmargin=*,nosep]
    \item Python 3.10
    \item PyTorch 2.1
    \item Transformers 4.35
    \item scikit-learn 1.3
    \item scipy 1.11
    \item A full \texttt{requirements.txt} is provided in the repository.
\end{itemize}

\subsection{Random Seeds}
All experiments use a fixed random seed of 1337.
Results are reported for this single seed; multi-seed stability analysis is left as future work (Section~\ref{sec:limitations}).
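
A minimal sketch of how the fixed seed might be applied across libraries is given below. The helper name \texttt{seed\_everything} is hypothetical and is not taken from the released code; NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random

SEED = 1337  # fixed seed reported in this section


def seed_everything(seed: int = SEED) -> None:
    """Seed every random number generator that is available (illustrative helper)."""
    # Only affects subprocesses started after this point; hash randomization
    # of the current interpreter is decided at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything()
first_draw = [random.random() for _ in range(3)]
seed_everything()
second_draw = [random.random() for _ in range(3)]
```

Seeding alone does not guarantee bitwise-identical GPU results; fully deterministic kernels would have to be requested separately.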

\subsection{Computational Resources}
\begin{itemize}[leftmargin=*,nosep]
    \item \textbf{Training}: 2$\times$ NVIDIA A6000 (48\,GB), $\sim$12 hours for 10 epochs.
    \item \textbf{Evaluation}: Single GPU, $\sim$30 minutes for full pipeline (scoring, debiasing, meta-attacker).
\end{itemize}

\section{Additional Experimental Details}\label{apd:details}

\subsection{Structural Mask Unit Tests}
We validated the structural masking function on 20 diverse MAESTRO pieces (REMI) and 10 ABC test cases:
\begin{itemize}[leftmargin=*,nosep]
    \item \textbf{REMI}: 100\% accuracy in tagging \texttt{Bar}, \texttt{Position}, and \texttt{Tempo} tokens; no false positives on note, velocity, or duration tokens.
    \item \textbf{ABC}: 100\% header exclusion (all lines before first body token); correct tagging of \texttt{|}, \texttt{:}, \texttt{[}, \texttt{]}, \texttt{\textbackslash n} in body.
\end{itemize}
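
The REMI check above can be illustrated with a small sketch. It assumes token strings of the form \texttt{<Type>\_<value>} (for example \texttt{Bar\_None} or \texttt{Pitch\_60}); this naming, and the function name \texttt{remi\_structural\_mask}, are assumptions made for illustration and may differ from the released scripts.

```python
from typing import List

# Token types treated as structural for REMI, as listed in the unit tests.
STRUCTURAL_TYPES = ("Bar", "Position", "Tempo")


def remi_structural_mask(tokens: List[str]) -> List[int]:
    """Return m_t = 1 for structural tokens and m_t = 0 otherwise."""
    mask = []
    for tok in tokens:
        # Assumed naming convention: "<Type>_<value>".
        token_type = tok.split("_", 1)[0]
        mask.append(1 if token_type in STRUCTURAL_TYPES else 0)
    return mask


# Hypothetical token sequence used as a unit-test fixture.
tokens = ["Bar_None", "Position_0", "Tempo_120",
          "Pitch_60", "Velocity_80", "Duration_1.0.8"]
mask = remi_structural_mask(tokens)
n_struct = sum(mask)  # number of structural tokens in the sequence
```

A unit test in this style asserts that every \texttt{Bar}, \texttt{Position}, and \texttt{Tempo} token is tagged and that no note, velocity, or duration token is.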

\subsection{Checkpoint Scan Protocol}
Checkpoints were evaluated at epochs $\{2, 4, 6, 8, 10\}$.
For each checkpoint, the full pipeline (scoring, length matching, conditional calibration, meta-attacker training) was re-run on the same validation+test split.
No separate held-out checkpoint-validation set was used; the reported AUC vs. epoch curve (Figure~\ref{fig:ckpt_auc}) reflects fixed-split evaluation across checkpoints.

\subsection{Hyperparameter Grid}
\begin{itemize}[leftmargin=*,nosep]
    \item \textbf{Top-$k$ values}: $k \in \{32, 64, 96, 128\}$.
    \item \textbf{Temperature} (optional): $T \in \{0.8, 1.0, 1.2\}$ for logit scaling (default $T=1.0$).
    \item \textbf{Meta-attacker}: L2-regularized logistic regression with inverse regularization strength $C=1.0$ and \texttt{class\_weight='balanced'}.
    \item \textbf{Cross-validation}: 5-fold, composer-stratified.
    \item \textbf{Length matching}: Nearest-neighbor pairing on $n_{\text{struct}} = \sum_t m_t$.
    \item \textbf{Conditional calibration}: Linear regression $s \sim \log n_{\text{struct}}$ fitted on non-members only.
\end{itemize}
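
The last two items can be sketched in a few lines of dependency-free Python. The sketch makes two assumptions that the text does not state: pairing is greedy and without replacement, and the calibrated score is the residual $\tilde{s} = s - (\hat{a} + \hat{b}\,\log n_{\text{struct}})$, where $\hat{a}$ and $\hat{b}$ are least-squares estimates from non-members only. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def nearest_neighbor_pairs(member_n: Sequence[int],
                           nonmember_n: Sequence[int]) -> List[Tuple[int, int]]:
    """Greedily pair each member with the unused non-member closest in n_struct."""
    unused = set(range(len(nonmember_n)))
    pairs = []
    for i, n in enumerate(member_n):
        if not unused:
            break  # more members than non-members: the rest stay unmatched
        # Ties are broken by the smaller non-member index.
        j = min(unused, key=lambda k: (abs(nonmember_n[k] - n), k))
        unused.remove(j)
        pairs.append((i, j))
    return pairs


def fit_log_length_trend(scores: Sequence[float],
                         n_struct: Sequence[int]) -> Tuple[float, float]:
    """Least-squares fit of s = a + b * log(n_struct); needs >= 2 distinct lengths."""
    x = [math.log(n) for n in n_struct]
    mean_x = sum(x) / len(x)
    mean_s = sum(scores) / len(scores)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxs = sum((xi - mean_x) * (si - mean_s) for xi, si in zip(x, scores))
    b = sxs / sxx
    a = mean_s - b * mean_x
    return a, b


def calibrate(scores: Sequence[float], n_struct: Sequence[int],
              a: float, b: float) -> List[float]:
    """Subtract the fitted length trend from each score."""
    return [s - (a + b * math.log(n)) for s, n in zip(scores, n_struct)]


# Toy data: non-member scores follow 1 + 2 * log(n) exactly.
nonmember_n = [10, 100, 1000]
nonmember_s = [1 + 2 * math.log(n) for n in nonmember_n]
member_n = [12, 900]
member_s = [1 + 2 * math.log(n) - 0.5 for n in member_n]  # shifted by -0.5

pairs = nearest_neighbor_pairs(member_n, nonmember_n)
a_hat, b_hat = fit_log_length_trend(nonmember_s, nonmember_n)  # non-members only
member_cal = calibrate(member_s, member_n, a_hat, b_hat)
```

Fitting on non-members only means that any length-independent offset in member scores survives calibration, which is the signal the attack is meant to measure.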

\subsection{NotaGen ABC Conversion Pipeline}
\begin{enumerate}[leftmargin=*,nosep]
    \item MAESTRO MIDI $\to$ MusicXML using \texttt{music21}~\citep{music21}.
    \item MusicXML $\to$ ABC using NotaGen's \texttt{xml2abc.py} script.
    \item Success rate: 1,267 of 1,276 files (99.3\%); the 9 failures are due to \texttt{duplex-maxima} duration overflow (a MusicXML standard limitation).
    \item Header exclusion: all lines matching \texttt{\^{}[XTMQKLV]:} or \texttt{\^{}\%\%} that precede the first body token are excluded.
    \item Body structural mask: Characters in $\{\texttt{|}, \texttt{:}, \texttt{[}, \texttt{]}, \texttt{\textbackslash n}\}$.
\end{enumerate}
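
The header exclusion and body masking steps above can be sketched as follows. The sketch assumes character-level processing of the ABC body and treats the first line that matches neither header pattern as the start of the body; the function names and the example tune are hypothetical.

```python
import re
from typing import List, Tuple

# Header lines: information fields X, T, M, Q, K, L, V, or "%%" directives.
HEADER_LINE = re.compile(r"^(?:[XTMQKLV]:|%%)")
# Structural characters in the body, as listed in the last step.
STRUCTURAL_CHARS = set("|:[]\n")


def split_header(abc_text: str) -> Tuple[str, str]:
    """Split ABC text into (header, body) at the first non-header line."""
    lines = abc_text.splitlines(keepends=True)
    first_body = len(lines)  # default: the whole text is header
    for idx, line in enumerate(lines):
        if not HEADER_LINE.match(line):
            first_body = idx
            break
    return "".join(lines[:first_body]), "".join(lines[first_body:])


def abc_body_mask(body: str) -> List[int]:
    """Return 1 for each structural character of the body, 0 otherwise."""
    return [1 if ch in STRUCTURAL_CHARS else 0 for ch in body]


# Hypothetical example tune.
abc = "X:1\nT:Example\n%%score 1\nL:1/8\nM:4/4\nK:C\n[V:1]C2 D2|E4:|\n"
header, body = split_header(abc)
mask = abc_body_mask(body)
n_struct = sum(mask)
```

Only header lines before the first body line are removed; a line such as \texttt{[V:1]} inside the body starts with a bracket rather than a field letter, so it is kept and its \texttt{[}, \texttt{:}, and \texttt{]} characters are tagged as structural.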
