
\section{Method: TS-RaMIA}
\label{sec:method}

\subsection{Notation and Preliminaries}
\label{subsec:notation}
Let $\mathbf{x}=(x_1,\dots,x_T)$ denote a tokenized music sequence over vocabulary $V$. Teacher forcing provides per-token logits $\mathbf{z}t\in\mathbb{R}^{|V|}$ when conditioning on $(x_1,\dots,x{t-1})$. We use a binary \emph{structural mask} $m_t\in{0,1}$ to select lattice tokens (bars, beat positions, meter/tempo markers), and write $n_{\text{struct}}=\sum_{t=1}^T m_t$ for their count. The per-token negative log-likelihood (NLL) is $\ell_t$, and a piece-level score is $s(\mathbf{x})$. Tail aggregation over the $k$ hardest structural tokens yields $s_{\text{top-}k}$ with $k’=\min(k,n_{\text{struct}})$. A calibrated score $s_{\text{calib}}$ removes residual dependence on structural length. We consider two representations: \emph{REMI} (event-level tokens) and \emph{ABC} (character-level streams).

\subsection{Overview}
Figure~\ref{fig:tsramia_framework} illustrates the pipeline. Given $\mathbf{x}$ we (i) \emph{tokenize} into REMI or ABC to expose structure; (ii) \emph{apply structural masking} to isolate lattice tokens, reducing formatting noise and confounders; (iii) \emph{compute per-token NLLs} via teacher forcing to obtain fine-grained difficulty; (iv) \emph{aggregate the tail-of-loss} over structural tokens (top-$k$) to amplify sparse leakage pockets; (v) for controlled evaluation, \emph{debias} via length matching or conditional calibration; and (vi) \emph{fuse cues} with a lightweight meta-attacker to form the final decision score. The method assumes forward-pass access to per-token log-probabilities (no gradients), is representation-agnostic across REMI/ABC with minor adaptations, and has linear time in sequence length aside from sorting structural losses. See Appendix for hyperparameters.
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/overview.png}
\caption{TS-RaMIA pipeline.}
\label{fig:tsramia_framework}
\end{figure}

\subsection{Structural Masking}
\label{subsec:structural_mask}
We select tokens that encode musical lattice coordinates and exclude formatting artifacts so that downstream statistics concentrate on structure rather than layout or metadata.
We define
\begin{equation}
m_t=\mathbb{1}!\left[x_t\in\mathcal{S}{\text{struct}}\right],\qquad
n{\text{struct}}=\sum_{t=1}^T m_t,
\end{equation}
where $\mathcal{S}_{\text{struct}}$ is representation-specific.

\paragraph{REMI.} $m_t=1$ iff $x_t\in{\texttt{Bar},\ \texttt{Position},\ \texttt{Tempo}}$.

\paragraph{ABC.} We exclude headers \emph{only before} the first body line ($\texttt{X:}$, $\texttt{T:}$, $\texttt{M:}$, $\texttt{L:}$, $\texttt{Q:}$, $\texttt{K:}$, $\texttt{V:}$). Body-internal directives (e.g., mid-tune $\texttt{K:}$/$\texttt{M:}$/$\texttt{V:}$) are retained. For body characters, $m_t=1$ for $x_t\in{\texttt{|},\ \texttt{:},\ \texttt{[},\ \texttt{]}}$ (bar lines, repeats, brackets). Newlines are normalized but not treated as structural tokens because their frequency is formatting-dependent rather than musical. Unit tests verify complete header exclusion and correct body-structure tagging under typical ABC variants.

\subsection{Sample-Level NLL}
\label{subsec:sample_nll}
Per-sequence perplexity conflates heterogeneity across token types and suffers from Jensen effects when averaging exponentiated losses; instead we use per-token NLL under teacher forcing. For token $x_t$ with logits $\mathbf{z}t$,
\begin{equation}
\ell_t=-\log\frac{\exp(z{t,x_t})}{\sum_{v\in V}\exp(z_{t,v})}.
\end{equation}
To control context resets and keep complexity linear, long sequences are chunked (REMI by events; ABC by characters) using non-overlapping windows and excluding chunk-initial tokens. This avoids artificially high losses at window boundaries; overlapping windows are possible but increase cost proportionally and did not alter qualitative conclusions (Appendix).

\subsection{Debiasing Protocol}
\label{subsec:debiasing}
Naïve likelihood correlates with structural length; we report controlled analysis views that mitigate inflation from $n_{\text{struct}}$.

\paragraph{Length-matched view.} Each non-member is paired to the closest member by $|n_{\text{struct}}^{(i)}-n_{\text{struct}}^{(j)}|$, using nearest-neighbor matching with replacement and deterministic tie-breaking. Scores are evaluated within pairs to control structural complexity.

\paragraph{Conditional calibration.} On non-members only, we regress $s$ on $\log n_{\text{struct}}$ and use residuals as calibrated scores:
\begin{equation}
s_{\text{calib}}=s-(\hat{\beta}0+\hat{\beta}1\log n{\text{struct}}),\quad
(\hat{\beta}0,\hat{\beta}1)=\arg\min{\beta}\sum{\mathbf{x}\in\mathcal{D}{\text{non}}}!\big(s-\beta_0-\beta_1\log n_{\text{struct}}\big)^2.
\end{equation}
This removes first-order dependence on structural length while preserving token-level irregularities. In deployment, a shadow corpus can provide the calibration fit without access to member labels.

\subsection{Tail-of-Loss Aggregation}
\label{subsec:topk}
Global means dilute sparse memorization; we emphasize the hardest structural tokens. Let ${\ell_t:m_t=1}$ denote structural losses, sorted $\ell_{(1)}\ge\dots\ge\ell_{(n_{\text{struct}})}$. The tail score is
\begin{equation}
s_{\text{top-}k}=\frac{1}{k’}\sum_{i=1}^{k’}\ell_{(i)},\qquad k’=\min(k,n_{\text{struct}}).
\end{equation}
Choosing $k$ trades variance (small $k$) against signal dilution (large $k$); we evaluate fixed $k$ values for stability (Appendix). We optionally complement tail means with windowed high-percentiles (e.g., $p_{95}$) over sliding windows on structural positions to capture localized spikes that may not dominate the global top-$k$.

\subsection{Meta-Attacker Fusion}
\label{subsec:meta}
We assemble a 9D feature vector per piece comprising three tail scores $s_{\text{top-}k}$ ($k\in{32,64,128}$), three windowed $p_{95}$ statistics (fixed window and hop per representation; see Appendix), and three optional reverse/hierarchical features if available (ablated when disabled). Features are $z$-scored using a scaler fitted on each training fold. A logistic regression with L2 penalty and class weighting forms the meta-attacker; we use composer-stratified 5-fold cross-validation so that pieces by the same composer never straddle folds. We aggregate out-of-fold predictions for evaluation and compute uncertainty by composer-stratified bootstrap. The final score is the meta-model decision value on held-out data.

\subsection{Cross-Representation Extension: ABC/NotaGen}
The procedure transfers to ABC with minimal changes: character-level chunking, header exclusion, and body-only structural masking. We apply the same scoring and debiasing pipeline to a hierarchical ABC language model (NotaGen) to test representation and architecture transfer. Because MIDI$\to$ABC conversion may induce distributional shift, we report cross-representation trends and discuss limitations in the Results section.

\subsection{Checkpoint-Risk Scanning}
We assess privacy–utility dynamics during training by scoring intermediate checkpoints at fixed intervals using the identical pipeline. The scan produces a trajectory of membership risk versus epoch without changing hyperparameters. Computation parallelizes across pieces and can be streamed over chunks for long sequences. The resulting curves are summarized in the Results section.