\section{Ablations \& Robustness}
\label{sec:ablations}

\subsection{Top-$k$ Sweep}
We vary $k \in \{32, 64, 96, 128\}$ (Table~\ref{tab:ablations}).
$k=64$ gives the highest AUC and TPR@1\%FPR, although the 95\% CIs of all four settings overlap. Smaller $k$ yields noisier scores, while larger $k$ dilutes the tail with less extreme values.
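Top-$k$ aggregation can be sketched as follows. This is a minimal illustration, not the paper's code. It assumes each piece yields an array of per-unit membership signals (per token or per chunk) that are oriented so larger values look more member-like. The helper name `top_k_score` and that orientation are hypothetical.

```python
import numpy as np


def top_k_score(unit_scores, k: int = 64) -> float:
    """Aggregate one piece's per-unit signals by averaging the k largest.

    Hypothetical helper: `unit_scores` holds one membership signal per token
    or chunk, oriented so that larger means more member-like (an assumption).
    Pieces with fewer than k units fall back to averaging all of them.
    """
    scores = np.asarray(unit_scores, dtype=float).ravel()
    if scores.size == 0:
        raise ValueError("piece has no scored units")
    k_eff = min(k, scores.size)
    cut = scores.size - k_eff
    # After np.partition, every entry at index >= cut is >= scores[cut],
    # so the last k_eff slots hold the k_eff largest values (unsorted).
    top = np.partition(scores, cut)[cut:]
    return float(top.mean())


# Sweep k over the values used in the ablation for one synthetic piece.
rng = np.random.default_rng(0)
piece = rng.normal(size=500)
for k in (32, 64, 96, 128):
    print(k, round(top_k_score(piece, k), 3))
```

The mean of the top $k$ can only decrease as $k$ grows, because each additional unit is no larger than those already included. This is the "dilution" effect. A small $k$ averages fewer values and so has higher variance.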

\begin{table}[h]
\centering
\caption{Ablation results (length-matched, 95\% CIs from 5-fold CV).}
\label{tab:ablations}
\small
\begin{tabular}{lcc}
\toprule
Configuration & AUC & TPR@1\%FPR \\
\midrule
Top-32 & 0.678 $\pm$ 0.018 & 2.1\% \\
Top-64 (default) & \textbf{0.692 $\pm$ 0.015} & \textbf{2.4\%} \\
Top-96 & 0.685 $\pm$ 0.016 & 2.3\% \\
Top-128 & 0.670 $\pm$ 0.019 & 2.0\% \\
\midrule
+ Windowed $p_{95}$ & 0.709 $\pm$ 0.014 & 3.2\% \\
\midrule
Equal-N (8 segments) & 0.688 $\pm$ 0.016 & 2.3\% \\
\bottomrule
\end{tabular}
\end{table}
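The caption does not state how the intervals were computed. One common choice is a Student-$t$ interval over the five per-fold AUCs, and the sketch below uses it purely for illustration. The helper name and the reading of the $\pm$ values as CI half-widths are assumptions.

```python
import numpy as np

# 0.975 quantile of Student's t with 4 degrees of freedom (5 folds - 1).
T_975_DF4 = 2.776445


def fold_mean_and_ci(fold_aucs) -> tuple:
    """Mean and 95% t-interval half-width of per-fold AUCs.

    Hypothetical helper; assumes exactly 5 folds so the hard-coded
    t quantile applies.
    """
    aucs = np.asarray(fold_aucs, dtype=float).ravel()
    if aucs.size != 5:
        raise ValueError("expected exactly 5 per-fold AUCs")
    sd = aucs.std(ddof=1)  # sample standard deviation across folds
    half_width = T_975_DF4 * sd / np.sqrt(aucs.size)
    return float(aucs.mean()), float(half_width)


mean_auc, hw = fold_mean_and_ci([0.68, 0.69, 0.70, 0.69, 0.69])
print(f"{mean_auc:.3f} +/- {hw:.3f}")
```

CV folds share training data, so the per-fold AUCs are not independent. An interval of this kind should therefore be read as approximate.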

\subsection{Windowed Extremes}
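Adding windowed 95th-percentile features (computed over a sliding window of chunks) to the Top-64 default gives a small but statistically significant lift. AUC rises from 0.692 to 0.709 (+0.017; DeLong test, $p=0.03$), and TPR@1\%FPR rises from 2.4\% to 3.2\%.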
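The following is a minimal sketch of windowed extreme features. It assumes each piece has one score per chunk. The window length, stride, and the choice to summarize the per-window percentiles by their max and mean are hypothetical, because the text does not specify them.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def windowed_p95_features(chunk_scores, window: int = 8, stride: int = 1) -> dict:
    """Per-window 95th percentiles of a piece's chunk scores, summarized.

    Hypothetical parameters: `window` (chunks per window), `stride`, and the
    max/mean summary are illustrative, not taken from the paper.
    """
    scores = np.asarray(chunk_scores, dtype=float).ravel()
    if scores.size == 0:
        raise ValueError("need at least one chunk score")
    w = min(window, scores.size)  # short pieces get a single full-length window
    windows = sliding_window_view(scores, w)[::stride]  # shape (n_windows, w)
    p95 = np.percentile(windows, 95, axis=1)  # one 95th percentile per window
    return {"p95_max": float(p95.max()), "p95_mean": float(p95.mean())}


print(windowed_p95_features(np.arange(4), window=2))
```

A per-window percentile reacts to localized bursts of member-like chunks, which a single piece-level top-$k$ mean can average away.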

\subsection{Equal-N Robustness}
Scoring a fixed set of 8 resampled segments per piece, which controls for variable piece length, yields AUC 0.688 versus 0.692 for full pieces. The 0.004 gap lies well within the 95\% CIs, indicating that the tail signal is not purely a length artifact.
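Equal-N scoring can be sketched as below. The segment length, the uniform random start positions (drawn with replacement), and the per-segment scorer passed in as `segment_score` are all hypothetical stand-ins for details the text does not give.

```python
import numpy as np


def sample_equal_n_segments(tokens, n_segments: int = 8, seg_len: int = 256, seed: int = 0):
    """Draw a fixed number of equal-length contiguous segments from one piece.

    Hypothetical helper: `seg_len` and uniform random starts are assumptions.
    Starts are drawn with replacement, so segments may overlap.
    """
    tokens = np.asarray(tokens)
    if tokens.size < seg_len:
        raise ValueError("piece is shorter than one segment")
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so the last valid start is size - seg_len.
    starts = rng.integers(0, tokens.size - seg_len + 1, size=n_segments)
    return [tokens[s:s + seg_len] for s in starts]


def equal_n_piece_score(tokens, segment_score, **kwargs) -> float:
    """Average a (hypothetical) per-segment scorer over the sampled segments,
    so every piece contributes the same number of scored units."""
    segments = sample_equal_n_segments(tokens, **kwargs)
    return float(np.mean([segment_score(seg) for seg in segments]))


print(equal_n_piece_score(np.arange(5000), lambda seg: float(seg.mean())))
```

Because every piece contributes exactly `n_segments` segments of the same length, long pieces no longer receive more chances to produce an extreme score.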

\subsection{Negative Results}

\paragraph{Note-Only Attack.}
Masking only note, pitch, and velocity tokens (structural tokens are left unmasked) yields AUC 0.32.
This is \emph{not} a failure. AUC $<$ 0.5 indicates a consistent inverse signal, with non-members having \emph{higher} note-level NLL than members. This is likely because data augmentation (transposition, velocity perturbation) diffuses note-specific memorization.
Inverting the score (e.g., $1 - \mathrm{NLL}$) recovers AUC $= 1 - 0.32 = 0.68$, but this requires knowing the direction of the signal a priori.
The key finding is that \emph{structural tokens are the primary leak channel}: note tokens alone do not provide a usable attack without inversion.
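The inversion argument can be checked with a small AUC routine. The Mann-Whitney form below is standard. The synthetic data, in which members tend to score lower (mimicking an AUC below 0.5), is an illustrative assumption.

```python
import numpy as np


def auc(member_scores, nonmember_scores) -> float:
    """ROC AUC as the probability that a random member outscores a random
    non-member, counting ties as 1/2 (the Mann-Whitney form)."""
    m = np.asarray(member_scores, dtype=float).reshape(-1, 1)
    n = np.asarray(nonmember_scores, dtype=float).reshape(1, -1)
    # Broadcast to an (n_members, n_nonmembers) grid of pairwise comparisons.
    return float((m > n).mean() + 0.5 * (m == n).mean())


# Synthetic scores where members tend to score lower, giving AUC < 0.5.
rng = np.random.default_rng(0)
members, nonmembers = rng.normal(-0.5, 1, 200), rng.normal(0, 1, 300)
a = auc(members, nonmembers)
# Negating the score reverses every pairwise comparison: AUC -> 1 - AUC.
print(round(a, 3), round(auc(-members, -nonmembers), 3))
```

Any strictly decreasing transform of the score ($-s$, $1-s$, \ldots) flips every pairwise comparison, so the inverted AUC is exactly $1 - \mathrm{AUC}$. The ranking information is the same either way. Only the decision direction must be known in advance.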

\paragraph{EVT Tail Modeling.}
Fitting a generalized Pareto distribution (GPD) to the non-member tail scores (top 1\%) and computing tail $p$-values for candidate pieces yields AUC 0.66.
This extreme value theory (EVT) approach is unstable with a small non-member pool ($n=314$), because few scores fall in the top 1\%. It also adds complexity without improving on the simpler top-$k$ aggregation.
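A minimal EVT sketch follows, assuming one scalar score per piece with larger values more member-like. It substitutes the method-of-moments GPD estimator so that only NumPy is needed. The text does not state which fitting method was used, so this estimator and the helper names are assumptions.

```python
import numpy as np


def fit_gpd_tail(nonmember_scores, tail_q: float = 0.99):
    """Fit a generalized Pareto distribution to non-member exceedances.

    Method-of-moments estimator (valid for shape xi < 1/2), swapped in for a
    dependency-free sketch. Returns (threshold u, shape xi, scale sigma,
    exceedance rate zeta).
    """
    s = np.asarray(nonmember_scores, dtype=float).ravel()
    u = float(np.quantile(s, tail_q))
    y = s[s > u] - u  # exceedances over the threshold
    if y.size < 2 or y.var(ddof=1) == 0:
        raise ValueError("too few distinct exceedances to fit a GPD")
    m, v = y.mean(), y.var(ddof=1)
    ratio = m * m / v  # equals 1 - 2*xi for a GPD
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (1.0 + ratio)
    return u, float(xi), float(sigma), y.size / s.size


def tail_p_value(x, nonmember_scores, u, xi, sigma, zeta) -> float:
    """P(non-member score >= x): empirical at or below u, GPD tail above u."""
    if x <= u:
        return float(np.mean(np.asarray(nonmember_scores, dtype=float) >= x))
    y = x - u
    if abs(xi) < 1e-9:
        sf = np.exp(-y / sigma)  # exponential limit as xi -> 0
    else:
        base = 1.0 + xi * y / sigma
        # For xi < 0 the GPD has a finite upper endpoint; beyond it sf = 0.
        sf = base ** (-1.0 / xi) if base > 0 else 0.0
    return float(zeta * sf)


rng = np.random.default_rng(0)
pool = rng.exponential(size=5000)
params = fit_gpd_tail(pool, tail_q=0.9)
print(params, tail_p_value(6.0, pool, *params))
```

A small $p$-value is evidence of membership, so $-\log p$ could serve as the attack score. With 314 scores and `tail_q=0.99`, only about four scores exceed the threshold. The moment estimates are then highly variable, which is consistent with the instability noted above.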

