LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Published: 03 Jun 2026, Last Modified: 03 Jun 2026
Venue: AI4GOOD Workshop 2026 Regular
License: CC BY 4.0
Keywords: RL contamination, representation geometry, layer-wise dynamics, membership inference.
TL;DR: LaRA detects RL contamination via layer-wise representation geometry. Contaminated samples show amplified perturbation sensitivity, abnormal directional dynamics, and altered local invariance across layers.
Abstract: Reinforcement learning (RL) improves reasoning in large language models (LLMs) but can also induce contamination through reward-driven memorization. Existing contamination detection methods mainly rely on output-level signals, such as likelihood or entropy, which become unstable after RL post-training. We propose LaRA, a layer-wise representation analysis framework for contamination detection in RL-trained LLMs. LaRA introduces three complementary metrics—Representation Shift Magnitude (RSM), Directional Collapse (DC), and Representation Stability Index (RSI)—to measure perturbation sensitivity, directional concentration, and local representation variability under controlled perturbations. Our analyses show that contamination induces progressive geometric deviations across layers, characterized by amplified perturbation sensitivity, abnormal directional concentration dynamics, and unstable local variability. Based on these observations, we develop a layer-aware detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that LaRA consistently improves contamination detection over existing output-level baselines.
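The abstract names the three metrics and the layer-aware aggregation but does not give their formulas. The sketch below is one plausible reading, not the authors' definitions. It assumes RSM is the mean relative L2 shift under perturbation, DC is the length of the mean unit shift direction, RSI is the mean cosine similarity between clean and perturbed representations, and the detection score is the mean absolute z-score against reference statistics from samples known to be uncontaminated. The names `layer_metrics` and `lara_style_score` are hypothetical.

```python
import math

EPS = 1e-12  # guards against division by zero


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def layer_metrics(clean, perturbed):
    """Return (rsm, dc, rsi) for one layer.

    clean:     hidden vector of the original input (list of floats)
    perturbed: hidden vectors of perturbed variants (list of lists)
    All three formulas are assumptions made for illustration.
    """
    shifts = [[p_i - c_i for p_i, c_i in zip(p, clean)] for p in perturbed]
    norms = [_norm(s) for s in shifts]

    # Assumed RSM: mean shift norm relative to the clean norm.
    rsm = (sum(norms) / len(norms)) / (_norm(clean) + EPS)

    # Assumed DC: length of the mean unit shift direction
    # (near 1 if all shifts point the same way, near 0 if they cancel).
    units = [[x / (n + EPS) for x in s] for s, n in zip(shifts, norms)]
    mean_dir = [sum(col) / len(units) for col in zip(*units)]
    dc = _norm(mean_dir)

    # Assumed RSI: mean cosine similarity between clean and perturbed.
    cosines = [
        sum(a * b for a, b in zip(p, clean)) / (_norm(p) * _norm(clean) + EPS)
        for p in perturbed
    ]
    rsi = sum(cosines) / len(cosines)
    return rsm, dc, rsi


def lara_style_score(per_layer, reference):
    """Aggregate deviations across layers and metrics.

    per_layer: one (rsm, dc, rsi) tuple per layer for the test sample
    reference: per layer, three (mean, std) pairs estimated on samples
               assumed to be uncontaminated
    Returns the mean absolute z-score; larger means more anomalous.
    """
    total, count = 0.0, 0
    for metrics, ref in zip(per_layer, reference):
        for value, (mu, sigma) in zip(metrics, ref):
            total += abs(value - mu) / (sigma + EPS)
            count += 1
    return total / count
```

In practice the per-layer vectors would come from the model's hidden states (for example, pooled over tokens), and a sample would be flagged when its score exceeds a threshold chosen on held-out data. Both choices are left open here because the abstract does not specify them.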
Submission Number: 290