Abstract: Detecting memorization in LLMs is essential for assessing privacy risks, intellectual property exposure, and the reliability of benchmark evaluations. Yet existing detection methods face one of two practical limitations: they either require access to the training corpus to verify verbatim reproduction, or rely on logits that are sometimes unavailable in commercial black-box deployments. We introduce PEARL, a black-box framework that audits memorization based essentially on a model’s input and output behavior. PEARL operationalizes the input perturbation sensitivity hypothesis (PSH): memorized instances occupy narrow attractors in input space and degrade sharply under semantically preserving perturbations, while generalized instances remain stable. Building on this principle, PEARL produces an instance-level memorization score through a calibrated comparison between the perturbation neighborhoods of known and unknown samples, requiring neither training data nor logit access. We evaluate PEARL on the Pythia model suite, with sizes ranging from 70M to 2.8B parameters, and find that its detection performance scales monotonically with model capacity, reaching AUC 0.81 on Pythia-2.8B and substantially outperforming the gray-box ACR baseline (AUC ≈ 0.59), the black-box CDD baseline (AUC ≈ 0.57), and the gray-box membership inference upper bound (AUC ≈ 0.67). We further show that PEARL and membership inference attack methods are complementary: they agree on fewer than half of their detections, and together identify 80.7% of true members, a 33% relative gain over the strongest individual method. This establishes PEARL as a practical, training-data-free auditing tool that captures generative memorization missed by existing approaches.
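The perturbation sensitivity hypothesis stated in the abstract can be illustrated with a minimal sketch. This is not the PEARL implementation: the function names (`similarity`, `sensitivity_score`), the use of `difflib` string similarity, the toy models, and the hand-written perturbations are all assumptions made for illustration, and the calibration against known and unknown samples mentioned in the abstract is omitted. The sketch only shows the black-box idea of comparing a model's output on an original prompt with its outputs on semantically similar prompts.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Sequence


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (an illustrative choice of metric)."""
    return SequenceMatcher(None, a, b).ratio()


def sensitivity_score(
    generate: Callable[[str], str],
    prefix: str,
    continuation: str,
    perturbed_prefixes: Sequence[str],
) -> float:
    """Drop in reproduction quality when the prefix is perturbed.

    `generate` is any black-box text-in, text-out function (hypothetical here).
    A large positive value means the model reproduces `continuation` from the
    exact prefix but not from its perturbed neighbors.
    """
    if not perturbed_prefixes:
        raise ValueError("need at least one perturbed prefix")
    base = similarity(generate(prefix), continuation)
    perturbed = mean(
        similarity(generate(p), continuation) for p in perturbed_prefixes
    )
    return base - perturbed


# Toy stand-ins for a black-box model (assumptions, for illustration only).
def brittle_model(prompt: str) -> str:
    # Reproduces the continuation only for the exact prompt.
    memorized = {"The secret code is": " 4921-ALPHA"}
    return memorized.get(prompt, " something else")


def stable_model(prompt: str) -> str:
    # Gives the same answer regardless of how the prompt is phrased.
    return " 4921-ALPHA"


neighbors = ["The secret code was", "The hidden code is"]
brittle = sensitivity_score(brittle_model, "The secret code is", " 4921-ALPHA", neighbors)
stable = sensitivity_score(stable_model, "The secret code is", " 4921-ALPHA", neighbors)
```

In this toy setting `brittle` is positive and `stable` is zero, which mirrors the hypothesized contrast between memorized and generalized instances. A real audit would need an automatic source of semantically preserving perturbations and a calibrated threshold, neither of which is specified here.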
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Alessandro_De_Palma1
Submission Number: 9693