\section{Results}
\label{sec:results}

We evaluate TS-RaMIA under three analysis views: (\textit{i}) raw scores without adjustment; (\textit{ii}) length-matched pairs to control for structural confounding; and (\textit{iii}) conditionally calibrated scores to further mitigate non-causal correlations. Hypothesis tests on AUC use DeLong's method, and AUCs are reported with 95\% confidence intervals; intervals for TPR@FPR and pAUC use the percentile bootstrap (1{,}000 composer-stratified resamples).
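
To make the operating-point metric concrete, the display below gives one standard definition. The notation ($s$, $\mathcal{M}$, $\mathcal{N}$, $\tau_\alpha$) is introduced here for illustration and assumes that higher scores indicate membership. For a target false-positive rate $\alpha$, the threshold is set on the non-member scores and the true-positive rate is read off on the members:
\[
\tau_\alpha = \inf\Bigl\{\tau : \frac{1}{|\mathcal{N}|}\sum_{x\in\mathcal{N}} \mathbf{1}[s(x) > \tau] \le \alpha\Bigr\},
\qquad
\mathrm{TPR@}\alpha = \frac{1}{|\mathcal{M}|}\sum_{x\in\mathcal{M}} \mathbf{1}[s(x) > \tau_\alpha].
\]
Under this definition, and without ROC interpolation, $\alpha = 0.01$ with 314 non-members admits at most $\lfloor 0.01 \times 314 \rfloor = 3$ false positives, which is why interval estimates matter at this operating point. Given $B = 1{,}000$ bootstrap replicates $\hat\theta^{*}_{1},\dots,\hat\theta^{*}_{B}$ of a statistic, the percentile interval is
\[
\bigl[\, q_{0.025}(\hat\theta^{*}),\; q_{0.975}(\hat\theta^{*}) \,\bigr],
\]
where $q_p$ denotes the empirical $p$-quantile of the replicates.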

\subsection{Main Evaluation: REMI Transformer}

Table~\ref{tab:main} reports performance on the REMI Transformer trained on MAESTRO. The corpus contains 962 training pieces (members) and 314 validation/test pieces (non-members), totaling 1{,}276 compositions.

\begin{table}[t]
\centering
\caption{Performance on the REMI Transformer. AUC with 95\% CIs; TPR reported at fixed FPR thresholds. Bold marks the best value in each column. All AUCs significantly exceed random guessing (DeLong $p<10^{-4}$).}
\label{tab:main}
\small
\begin{tabular}{lcccc}
\toprule
Method & AUC & TPR@1\%FPR & TPR@5\%FPR & TPR@10\%FPR \\
\midrule
\multicolumn{5}{l}{\textit{Raw Scores}} \\
Baseline (mean NLL)   & 0.730 {\tiny [0.706, 0.754]} & 1.8\% & 7.2\%  & 15.4\% \\
StructTail-64         & 0.780 {\tiny [0.758, 0.802]} & 3.1\% & 11.8\% & 22.6\% \\
StructTail+Fusion     & 0.812 {\tiny [0.791, 0.833]} & 8.3\% & 24.1\% & 38.7\% \\
\midrule
\multicolumn{5}{l}{\textit{Length-Matched Pairs (313 pairs)}} \\
Baseline              & 0.563 {\tiny [0.521, 0.605]} & 0.9\% & 3.2\%  & 7.8\%  \\
StructTail-64         & 0.692 {\tiny [0.654, 0.730]} & 2.4\% & 9.5\%  & 18.3\% \\
StructTail+Fusion     & \textbf{0.826 {\tiny [0.803, 0.848]}} & \textbf{14.6\%} & \textbf{32.7\%} & \textbf{48.2\%} \\
\midrule
\multicolumn{5}{l}{\textit{Conditionally Calibrated}} \\
Baseline              & 0.571 {\tiny [0.529, 0.613]} & 1.0\% & 3.5\%  & 8.1\%  \\
StructTail-64         & 0.701 {\tiny [0.663, 0.739]} & 2.6\% & 9.8\%  & 19.0\% \\
StructTail+Fusion     & 0.818 {\tiny [0.795, 0.841]} & 13.9\% & 31.2\% & 46.8\% \\
\bottomrule
\end{tabular}
\end{table}

\paragraph{Debiasing Effect.}
The baseline (global mean NLL) attains a raw AUC of 0.730 but drops to 0.563 under length matching, indicating that naive aggregate scores conflate membership with structural complexity. Tail aggregation (StructTail-64) retains more of the signal under this control (0.692), and the linear meta-fusion (StructTail+Fusion) improves discrimination further, reaching 0.826 (length-matched) and 0.818 (calibrated) (Table~\ref{tab:main}).
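
The size of the length effect can be read off Table~\ref{tab:main} as the change in AUC from the raw view to the length-matched view, written here as $\Delta = \mathrm{AUC}_{\mathrm{matched}} - \mathrm{AUC}_{\mathrm{raw}}$ (our notation for this illustration):
\[
\begin{array}{lcl}
\Delta_{\mathrm{Baseline}} & = & 0.563 - 0.730 = -0.167, \\
\Delta_{\mathrm{StructTail\mbox{-}64}} & = & 0.692 - 0.780 = -0.088, \\
\Delta_{\mathrm{StructTail+Fusion}} & = & 0.826 - 0.812 = +0.014.
\end{array}
\]
These differences are descriptive only: the length-matched view is computed on 313 matched pairs rather than on the full corpus, so the two columns are not paired estimates on the same sample.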

\paragraph{Low-FPR Performance.}
For auditing scenarios that prioritize precision, StructTail+Fusion reaches a TPR@1\%FPR of 14.6\% (length-matched) and 13.9\% (calibrated), compared with 0.9\% and 1.0\% for the baseline under the same views. The standardized partial AUC over FPR~$\in[0,0.01]$ (pAUC) also favors StructTail+Fusion; we report exact values with CIs in the tables.
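
As a point of reference for the pAUC figures, a common standardization is that of McClish; we assume it here for illustration, since the text above does not state which standardization is used. With $\alpha$ the upper FPR limit and $\mathrm{TPR}(u)$ the ROC curve,
\[
\mathrm{pAUC}(\alpha) = \int_{0}^{\alpha} \mathrm{TPR}(u)\,du,
\qquad
\mathrm{pAUC}_{\mathrm{std}}(\alpha) = \frac{1}{2}\left[\,1 + \frac{\mathrm{pAUC}(\alpha) - \alpha^{2}/2}{\alpha - \alpha^{2}/2}\,\right].
\]
For $\alpha = 0.01$ the chance-level area is $\alpha^{2}/2 = 5\times10^{-5}$ and the maximum area is $\alpha = 0.01$, so a chance-level attack maps to $0.5$ and a perfect attack maps to $1$, as with the full AUC.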

\paragraph{Statistical Testing.}
Pairwise DeLong tests confirm improvements: StructTail-64 vs.\ Baseline ($p<10^{-5}$), and StructTail+Fusion vs.\ StructTail-64 ($p<10^{-4}$). Bootstrap CIs for TPR@FPR exhibit comparable widths across methods, indicating stable estimates.
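
For reference, the DeLong comparison of two correlated AUCs estimated on the same pieces uses the statistic below, where $\hat{A}_1$ and $\hat{A}_2$ denote the two AUC estimates (notation introduced here for illustration):
\[
z = \frac{\hat{A}_1 - \hat{A}_2}{\sqrt{\widehat{\mathrm{Var}}(\hat{A}_1) + \widehat{\mathrm{Var}}(\hat{A}_2) - 2\,\widehat{\mathrm{Cov}}(\hat{A}_1,\hat{A}_2)}},
\]
with $z$ referred to a standard normal distribution. Assuming the intervals in Table~\ref{tab:main} are Wald-type on the AUC scale (an assumption, not stated above), each standard error can be approximated from the interval endpoints $[L, U]$ as $\mathrm{SE} \approx (U - L)/(2 \times 1.96)$. For the length-matched rows this gives
\[
\mathrm{SE}_{\mathrm{Baseline}} \approx \frac{0.605 - 0.521}{3.92} \approx 0.021,
\qquad
\mathrm{SE}_{\mathrm{StructTail+Fusion}} \approx \frac{0.848 - 0.803}{3.92} \approx 0.011.
\]
The covariance term cannot be recovered from the table, so the pairwise $p$-values cannot be reproduced from the tabulated intervals alone.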

\paragraph{Visualization.}
Figure~\ref{fig:results_four_in_row} summarizes the visual analyses: panel~(a) shows full ROC curves under the length-matched view; panel~(b) zooms into the low-FPR regime (0--5\%); panel~(c) depicts member/non-member score distributions for the structural top-64 statistic; panel~(d) presents checkpoint risk scanning (AUC vs.\ training epochs). All curves use composer-stratified folds; shaded bands or markers denote 95\% confidence intervals where applicable.

\begin{figure*}[t]
    \centering
    \begin{minipage}[t]{0.24\textwidth}
        \centering
        \includegraphics[width=\linewidth]{figures/roc_main.pdf}
        \vspace{2pt}
        {\scriptsize (a) ROC (length-matched).}
    \end{minipage}\hfill
    \begin{minipage}[t]{0.22\textwidth}
        \centering
        \includegraphics[width=\linewidth]{figures/roc_lowfpr_academic.pdf}
        \vspace{2pt}
        {\scriptsize (b) Low-FPR zoom (0--5\%).}
    \end{minipage}\hfill
    \begin{minipage}[t]{0.28\textwidth}
        \centering
        \includegraphics[width=\linewidth]{figures/score_distribution.pdf}
        \vspace{2pt}
        {\scriptsize (c) Score distribution (structural top-64).}
    \end{minipage}\hfill
    \begin{minipage}[t]{0.24\textwidth}
        \centering
        \includegraphics[width=\linewidth]{figures/checkpoint_auc.pdf}
        \vspace{2pt}
        {\scriptsize (d) Checkpoint risk scan (AUC vs.\ epochs).}
    \end{minipage}
    \vspace{4pt}
    \caption{Combined results: (a) full ROC under the length-matched view; (b) zoomed ROC emphasizing the low-FPR regime (0--5\%); (c) distribution of structural top-64 scores for members vs.\ non-members; (d) attack AUC across training checkpoints. Composer-stratified folds throughout; 95\% confidence intervals shown where applicable.}
    \label{fig:results_four_in_row}
\end{figure*}

\subsection{Cross-Representation Transfer: NotaGen (ABC)}

We assess transfer to character-level ABC notation using NotaGen~\citep{von2024notagen}, a hierarchical patch--character Transformer pretrained on 1.6M ABC sheets. We compute character-level NLL via teacher forcing on 1{,}267 MAESTRO pieces converted to ABC and apply the ABC structural mask and top-64 tail (as defined in \S\ref{subsec:structural_mask} and \S\ref{subsec:topk}). We obtain raw AUC 0.73 and TPR@1\%FPR 8.9\%; under length matching the AUC is 0.71 (95\% CI [0.68, 0.74]). Because NotaGen was not trained on MAESTRO, these signals reflect representation transfer under distribution shift rather than direct training-set membership.

\subsection{Privacy--Utility Trade-off via Checkpoint Analysis}

Figure~\ref{fig:results_four_in_row}(d) plots attack AUC across training epochs for the REMI model. AUC increases as training proceeds, indicating that susceptibility to the attack grows with the number of epochs. Practitioners can monitor this curve and select an earlier checkpoint to reduce risk, ideally with limited degradation in generation quality.
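
One way to make this selection concrete is sketched below; it is an illustration rather than the procedure used in our experiments. Let $A(e)$ denote the attack AUC at epoch $e$, $Q(e)$ a generation-quality score for which higher is better, and $\rho$ a risk budget chosen by the auditor; all three symbols are hypothetical notation introduced for this sketch. The selected checkpoint is then
\[
e^{\star} = \arg\max_{e \,:\, A(e) \le \rho} Q(e),
\]
that is, the highest-quality checkpoint whose measured attack AUC stays within the budget. A more conservative variant replaces $A(e)$ with the upper end of its 95\% confidence interval.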