\section{Introduction}
\label{sec:intro}

Generative AI models for symbolic music~\citep{huang2018musictransformer,huang2020remi,hawthorne2019maestro} raise an urgent question for artists and copyright holders: \emph{Has my work been used without permission to train these systems?}
Membership inference attacks (MIAs)~\citep{shokri2017membership,yeom2018privacy,carlini2021extracting} provide a technical foundation for such auditing, enabling statistical tests of whether specific works were in a model's training set.

However, existing MIAs~\citep{yeom2018privacy,carlini2021extracting,carlini2019secret} treat all tokens uniformly: they derive their signal from aggregate metrics such as loss or perplexity, averaged over the entire sequence~\citep{yeom2018privacy,carlini2019secret}. This uniformity assumption sits poorly with the hierarchical structure of symbolic music, which has already motivated specialized, structure-aware tokenization~\citep{zeng2021musicbert}. Unlike text, music is organized by \emph{structural tokens} (bar lines, beat positions, tempo/meter markers) that encode form, distinct from the \emph{event tokens} that carry melodic and harmonic content~\citep{zeng2021musicbert}. Averaging across these functionally different classes lets the predictable signal from the many structural tokens dilute the memorization signal from event tokens. This structural confounding, on top of known issues with stylistic complexity, makes uniform attacks prone to high false positive rates and thus unreliable for auditing musical data~\citep{rezaei2021difficulty}.

\textbf{Our core hypothesis} is that training-set pieces exhibit sparse, high-loss pockets on structural tokens due to memorized compositional patterns (e.g., bar phrasing or tempo patterns specific to a composer or piece).
These pockets are detectable via \emph{tail-of-loss aggregation} (mean NLL of the largest $k$ structural-token losses), which amplifies sparse memorization signals that whole-sequence perplexity obscures by averaging over thousands of tokens~\citep{watson2021de}.
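One way to write this down, using illustrative notation that is an assumption of this sketch rather than a definition fixed elsewhere in the paper: let $\ell_t = -\log p_\theta(x_t \mid x_{<t})$ be the teacher-forced negative log-likelihood of token $x_t$ in a piece $x$ of length $T$, let $\mathcal{S}$ be the set of structural-token positions, and let $\mathcal{T}_k \subseteq \mathcal{S}$ contain the $k$ positions in $\mathcal{S}$ with the largest $\ell_t$. The tail score and the whole-sequence mean NLL (the logarithm of perplexity) are then
\[
  s_k(x) = \frac{1}{k} \sum_{t \in \mathcal{T}_k} \ell_t ,
  \qquad
  \bar{\ell}(x) = \frac{1}{T} \sum_{t=1}^{T} \ell_t .
\]
Since $\mathcal{T}_k$ keeps only the largest losses, $s_k(x)$ is never smaller than the mean loss over all of $\mathcal{S}$, and a handful of atypical structural positions can shift $s_k(x)$ noticeably while barely changing $\bar{\ell}(x)$ when $T \gg k$. For pieces with fewer than $k$ structural tokens, one plausible convention (again an assumption) is to replace $k$ by $\min(k, |\mathcal{S}|)$.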

We introduce \textbf{TS-RaMIA} (Time- and Structure-Range Membership Inference Attack), a structure-aware, tail-of-loss, debiased MIA framework for symbolic music auditing.
Under a gray-box threat model (access to per-token log-probabilities via teacher forcing, which many open-source checkpoints and some log-probability-exposing APIs provide), TS-RaMIA proceeds in four steps: it isolates structural tokens; aggregates top-$k$ hard tails (mean NLL of the largest $k$ losses, $k \in \{32,64,128\}$); debiases the length and event-density (events per bar) confounders via matched pairs and regression on non-members; and fuses the resulting cues with a linear meta-attacker.
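To make the debiasing and fusion steps concrete, one possible instantiation is sketched here; the linear functional form and all symbols are illustrative assumptions, not the exact specification. Let $s(x)$ be a raw per-piece score such as the top-$k$ structural tail mean, $n(x)$ the sequence length, and $d(x)$ the event density. With coefficients fitted by least squares on known non-members only, the debiased score is the residual
\[
  \tilde{s}(x) = s(x) - \bigl(\hat{\beta}_0 + \hat{\beta}_1\, n(x) + \hat{\beta}_2\, d(x)\bigr),
\]
and a linear meta-attacker combines the residuals of the $J$ cues (for example, one per $k \in \{32,64,128\}$) as
\[
  m(x) = w_0 + \sum_{j=1}^{J} w_j\, \tilde{s}_j(x).
\]
A piece is flagged as a member when $m(x)$ exceeds a threshold $\tau$ chosen to meet a target false positive rate on held-out non-members.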

% ━━━ One-line key results (REMI + NotaGen) ━━━
On a 67M-parameter REMI Transformer trained on MAESTRO, TS-RaMIA achieves AUC~0.826 and 14.6\% TPR at 1\% FPR under our debiased view, targeting high-precision auditing for creator self-checks.
On NotaGen, a hierarchical ABC model, TS-RaMIA attains AUC~0.730 with 8.9\% TPR at 1\% FPR, indicating cross-representation transfer despite conversion-induced shift (details and caveats in Section~\ref{sec:results}).

Our contributions are:
\begin{itemize}[leftmargin=*,nosep]
    \item \textbf{A structure-aware, debiased MIA for symbolic music}: structural masking combined with tail-of-loss aggregation under a forward-pass-only assumption (\emph{to our knowledge, the first such combination}).
    \item \textbf{Evidence that structural tokens are primary leakage channels}: ablations show that bar, position, and tempo tokens dominate the signal, while note-only cues are weak.
    \item \textbf{A confounder-robust evaluation protocol}: length matching and conditional calibration align low-FPR metrics with auditing needs, and composer-stratified cross-validation yields fair generalization estimates.
    \item \textbf{A simple meta-fusion and cross-representation validation}: a linear meta-attacker improves low-FPR performance, and results replicate in trend on an ABC model (NotaGen).
\end{itemize}



