\section{Experiments}
\label{sec:setup}

\subsection{Experimental Aims and Hypotheses}
\label{subsec:aims}
Our goal is to assess TS-RaMIA for low-false-positive auditing, with TPR at 1\% FPR as the pre-registered primary endpoint (ROC-AUC and pAUC are secondary). We hypothesize that (i) tail-of-loss aggregation over structural tokens (\S\ref{subsec:topk}) improves low-FPR detection over uniform averaging, (ii) conditional calibration mitigates length-driven score inflation (\S\ref{subsec:debiasing}), and (iii) the method transfers across symbolic representations (REMI, ABC) with minimal adaptation.

\subsection{Datasets}
\label{subsec:datasets}
We use \textbf{MAESTRO-v3.0.0}~\citep{hawthorne2019maestro} (1{,}276 performances; splits: 962 train, 137 validation, 177 test). Members are the train split; non-members are validation$\cup$test (314 performances). All development/test partitions and cross-validation folds are composer-stratified to prevent stylistic leakage. We audit cross-split near-duplicates via metadata (composer/title/movement) and find none under our policy. For cross-representation analysis, we convert MAESTRO MIDI to ABC (MIDI$\rightarrow$MusicXML$\rightarrow$ABC) and retain only files that satisfy the header/body formatting required by our masking rules; files that fail to parse are excluded. Because conversion can shift distributions, we treat ABC as a representation-transfer setting rather than a direct replication. Dataset indices and conversion logs are released for exact reconstruction.

\subsection{Models}
\label{subsec:models}
\textbf{REMI Transformer (main).} GPT-2-style decoder (12 layers, 768 hidden, 12 heads; $\approx$67M params) with a REMI tokenizer and a 1{,}024-token context. The model is trained from scratch on the MAESTRO train split with AdamW and checkpointed at a fixed cadence for risk scanning; seeds are fixed for data order, initialization, and dropout. Full hyperparameters are in the Appendix.

\noindent\textbf{NotaGen (cross-representation).} Hierarchical GPT with patch planner and character decoder ($\approx$45M params), ABC representation, 2{,}048-char context, pretrained on an external 1.6M-ABC corpus. MAESTRO-derived ABC acts as a held-out test distribution to assess representation transfer; evaluation uses forward-pass logits only (teacher forcing), with no gradient or weight access.

\subsection{Evaluation Protocol}
\label{subsec:protocol}
We evaluate under three analysis views (\S\ref{subsec:debiasing}): \emph{Raw} scores; \emph{Length-Matched} scores, where each non-member is paired to the nearest member by structural-token count $n_{\text{struct}}$ and evaluation is restricted to the paired subset; and \emph{Calibrated} scores, obtained by applying the conditional calibration fitted on non-members only. 
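As a sketch of the \emph{Length-Matched} view, write $\mathcal{M}$ and $\mathcal{N}$ for the member and non-member pools and $\pi$ for the pairing map (notation introduced here for illustration only). Nearest-neighbour pairing on structural-token count can then be expressed as
\[
  \pi(j) \;=\; \arg\min_{i\in\mathcal{M}} \bigl|\, n_{\text{struct}}(x_i) - n_{\text{struct}}(x_j) \,\bigr|, \qquad j\in\mathcal{N},
\]
with evaluation restricted to the pairs $\{(x_{\pi(j)}, x_j)\}_{j\in\mathcal{N}}$. This expression leaves tie-breaking open, as well as whether one member may be paired with several non-members.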
For thresholded reporting we use a Neyman--Pearson procedure: we choose $\tau$ on a composer-stratified development split of non-members to meet a target FPR $\in\{1\%,5\%,10\%\}$, then report TPR on held-out members.
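A minimal sketch of this thresholding step, assuming that larger scores indicate membership and writing $s(x)$ for the score, $\alpha$ for the target FPR, $\mathcal{N}_{\text{dev}}$ for the development non-members, and $\mathcal{M}_{\text{test}}$ for the held-out members (all notation introduced here for illustration), is
\begin{align*}
  \hat{\tau}_{\alpha} &= \min\Bigl\{\tau \;:\; \frac{1}{|\mathcal{N}_{\text{dev}}|}\sum_{x\in\mathcal{N}_{\text{dev}}} \mathbf{1}\bigl[s(x) > \tau\bigr] \le \alpha \Bigr\},\\
  \widehat{\mathrm{TPR}}(\alpha) &= \frac{1}{|\mathcal{M}_{\text{test}}|}\sum_{x\in\mathcal{M}_{\text{test}}} \mathbf{1}\bigl[s(x) > \hat{\tau}_{\alpha}\bigr].
\end{align*}
Because the empirical FPR moves in steps of $1/|\mathcal{N}_{\text{dev}}|$, the attainable operating points are coarse at $\alpha = 1\%$: even the full non-member pool of $137+177=314$ performances admits at most $\lfloor 0.01\times 314\rfloor = 3$ false positives, and a development subset admits fewer.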
For the meta-attacker (\S\ref{subsec:meta}), we run composer-stratified 5-fold cross-validation; within each fold, the scaler, calibration model, and classifier are fit on the training split only, applied to the held-out split, and aggregated as out-of-fold predictions. 
All matching, calibration, and scaling are performed \emph{within} folds to avoid leakage. 
The primary endpoint is TPR at 1\% FPR; ROC-AUC and pAUC(0--1\%) are secondary. 
We fix one global seed for stochastic training/evaluation and a separate seed for resampling-based uncertainty estimation.

\subsection{Metrics \& Statistical Testing}
\label{subsec:metrics}
We compute ROC-AUC with 95\% confidence intervals via the nonparametric DeLong method~\citep{delong1988comparing}. 
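For reference, the empirical AUC underlying the DeLong interval is the Mann--Whitney statistic. Writing $X_1,\dots,X_m$ for member scores and $Y_1,\dots,Y_n$ for non-member scores (notation introduced here, assuming larger scores indicate membership),
\[
  \widehat{\mathrm{AUC}} \;=\; \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\psi(X_i,Y_j),
  \qquad
  \psi(X,Y)=\begin{cases} 1 & X>Y,\\ \tfrac{1}{2} & X=Y,\\ 0 & X<Y. \end{cases}
\]
The DeLong variance estimate is $\widehat{\mathrm{Var}}(\widehat{\mathrm{AUC}}) = S_{10}/m + S_{01}/n$, where $S_{10}$ and $S_{01}$ are the sample variances of the components $V_{10}(X_i)=\frac{1}{n}\sum_{j}\psi(X_i,Y_j)$ and $V_{01}(Y_j)=\frac{1}{m}\sum_{i}\psi(X_i,Y_j)$; a normal-approximation 95\% interval is then $\widehat{\mathrm{AUC}}\pm 1.96\,\widehat{\mathrm{Var}}(\widehat{\mathrm{AUC}})^{1/2}$.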
For TPR@FPR and pAUC(0--1\%), we use a percentile bootstrap with 10{,}000 composer-stratified resamples and fixed seeds; ties are handled by average ranks before threshold selection.
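As a sketch in notation introduced here: with $\mathrm{TPR}(f)$ denoting the ROC curve as a function of the false-positive rate $f$, the unnormalized partial AUC over the 0--1\% band is
\[
  \mathrm{pAUC}(0\text{--}1\%) \;=\; \int_{0}^{0.01} \mathrm{TPR}(f)\,\mathrm{d}f \;\in\; [0,\,0.01],
\]
and a rescaled variant divides by $0.01$. For a statistic $\hat{\theta}$ (TPR@FPR or pAUC) with bootstrap replicates $\hat{\theta}^{*}_{1},\dots,\hat{\theta}^{*}_{B}$, $B=10{,}000$, the 95\% percentile interval is
\[
  \bigl[\, q_{0.025}\bigl(\hat{\theta}^{*}_{1:B}\bigr),\; q_{0.975}\bigl(\hat{\theta}^{*}_{1:B}\bigr) \,\bigr],
\]
where $q_{p}$ denotes the empirical $p$-quantile of the replicates.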
When comparing AUCs across multiple methods, we apply Holm--Bonferroni correction to DeLong $p$-values. 
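The Holm--Bonferroni step-down rule can be summarized as follows, with $K$ denoting the number of comparisons and $\alpha$ the family-wise level (both symbols introduced here). After sorting the $p$-values as $p_{(1)}\le\dots\le p_{(K)}$, the adjusted values are
\[
  \tilde{p}_{(i)} \;=\; \max_{j\le i}\;\min\bigl\{1,\;(K-j+1)\,p_{(j)}\bigr\},
\]
and the $i$-th ordered hypothesis is rejected when $\tilde{p}_{(i)}\le\alpha$. As a purely illustrative example (made-up numbers, not results), $K=3$ with $p$-values $0.010, 0.020, 0.040$ gives adjusted values $0.030, 0.040, 0.040$, all below $\alpha=0.05$.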
For thresholded metrics, we report absolute TPR differences at the same target FPR to avoid threshold-mismatch artifacts. 
All resampling keeps composers within strata to prevent cross-composer leakage.

\subsection{Baselines}
\label{subsec:baselines}
We include baselines that each target a specific assumption.
\emph{Global-Mean NLL} averages losses over all tokens (no masking), probing length/global-difficulty confounding. 
\emph{Note-Only} runs the pipeline while excluding structural tokens, testing the necessity of structural masking. 
\emph{Random Score} assigns i.i.d.\ noise, serving as a sanity check for metric computation. 
TS-RaMIA variants are: \emph{StructTail} (Top-$k$ only, \S\ref{subsec:topk}); \emph{StructTail+Calib} (adds conditional calibration, \S\ref{subsec:debiasing}); and \emph{StructTail+Fusion} (adds the meta-attacker, \S\ref{subsec:meta}). 
All baselines/variants are evaluated under the three analysis views.
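For concreteness, the \emph{Global-Mean NLL} baseline can be written in the following form, where $x_{1:T}$ is the token sequence of a piece and $p_{\theta}$ the audited model (symbols introduced here for illustration, ignoring any chunking into context windows):
\[
  \ell_t \;=\; -\log p_{\theta}(x_t \mid x_{<t}),
  \qquad
  s_{\text{global}}(x) \;=\; -\frac{1}{T}\sum_{t=1}^{T}\ell_t .
\]
The leading minus sign is an assumed convention under which larger scores (lower average loss) indicate membership. Every position contributes equally to this score, in contrast to the StructTail variants, which aggregate only over structural tokens (\S\ref{subsec:topk}).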

\subsection{Ablations}
\label{subsec:ablations}
We vary the tail size $k\in\{32,64,128\}$ to examine bias--variance trade-offs in tail aggregation. 
We sweep window length and hop for the windowed $p_{95}$ feature to assess sensitivity to local peaks. 
We compare non-overlapping chunking with overlapping windows (stride smaller than the window length $L$) to test context effects.
We toggle calibration and length matching individually and jointly to quantify their contribution to low-FPR operation. 
We test alternative structural sets (REMI: \{\texttt{Bar}, \texttt{Position}, \texttt{Tempo}\}; ABC: \{\texttt{|}, \texttt{:}, \texttt{[}, \texttt{]}\}) to check robustness to mask definitions. 
All ablations share seeds, folds, and preprocessing to isolate the factor under study.
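To make the chunking comparison concrete: for a sequence of $T$ tokens, window length $L$, and stride $h$ (the symbols $T$ and $h$ are introduced here), the number of full windows is
\[
  W \;=\; \left\lfloor \frac{T-L}{h} \right\rfloor + 1, \qquad T\ge L .
\]
With illustrative values $T=4096$ and $L=1024$, non-overlapping chunking ($h=L$) yields $W=4$ windows, whereas $h=512$ yields $W=7$, so every token outside the first and last $512$ positions is scored in two windows with different amounts of left context.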

\subsection{Robustness}
\label{subsec:robustness}
We stress-test across sequence length extremes, high event density, and composer imbalance to evaluate stability of TPR at 1\% FPR. 
We simulate calibration mis-specification by fitting the calibration transform on perturbed non-member pools. 
We assess stochastic sensitivity by varying seeds for initialization and data order in both base scores and the meta-attacker. 
We test numerical robustness by quantizing teacher-forcing logits and recomputing scores. 
For ABC, we inject controlled conversion noise and re-parse files to probe representation artifacts. 
Each condition is evaluated under all three analysis views to separate confounding control from inherent variability.
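One way to formalize the logit-quantization test, assuming uniform rounding with step size $\Delta$ (an assumption made here for illustration), is
\[
  \tilde{z}_{t,v} \;=\; \Delta\cdot\mathrm{round}\!\left(\frac{z_{t,v}}{\Delta}\right),
  \qquad
  \tilde{\ell}_t \;=\; -\log\frac{\exp(\tilde{z}_{t,x_t})}{\sum_{v}\exp(\tilde{z}_{t,v})},
\]
where $z_{t,v}$ denotes the teacher-forcing logit of vocabulary item $v$ at position $t$. Since each logit moves by at most $\Delta/2$ and the log-sum-exp is 1-Lipschitz in the sup norm, the per-token loss changes by at most $|\tilde{\ell}_t-\ell_t|\le\Delta$, which bounds the perturbation of any score built from averages of token losses.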
