Subliminal Learning Leaves Traceable Representations in MNIST Autoencoders

TMLR Paper8931 Authors

14 May 2026 (modified: 30 May 2026) | Under review for TMLR | CC BY 4.0

Abstract: Knowledge distillation is a widely adopted technique for efficiently producing cheaper but capable student models from expensive-to-deploy teacher models. However, distillation can have a side effect: through a phenomenon called subliminal learning, the student inherits traits from the teacher that were not the intended objective of the distillation. In this short note, we ask whether an unintended trait in a distilled student can be traced back to the teacher from which it was subliminally acquired. We use an auxiliary-logit distillation setup for subliminal learning, similar to prior studies. We demonstrate that, in an MNIST autoencoder, a student trained only to imitate auxiliary logits on random noise inputs subliminally acquires reconstruction performance. Moreover, we can trace students back to their source teachers with high accuracy by comparing their internal representations. Where prior work demonstrates transfer of behavioral traits or classifier performance, our result shows that a mechanistic representational trait is also transmitted and can be used to identify the source teacher.
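As a rough illustration of what tracing a student to its teacher "by comparing their internal representations" could look like, here is a minimal sketch. The abstract does not say which similarity measure is used; linear CKA is an assumption made here, chosen because it is insensitive to rotations of the feature space. The names `linear_cka` and `trace_teacher` are hypothetical, and the synthetic matrices merely stand in for hidden activations recorded on a shared set of probe inputs.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) activation matrices."""
    # Centre each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def trace_teacher(student_rep, teacher_reps):
    """Return the index of the most similar teacher and all similarity scores."""
    scores = [linear_cka(student_rep, t) for t in teacher_reps]
    return int(np.argmax(scores)), scores


# Synthetic stand-ins (assumption): three unrelated "teachers", and a
# "student" whose representation is a rotated, noisy copy of teacher 1.
rng = np.random.default_rng(0)
n_probe, dim = 200, 16
teacher_reps = [rng.normal(size=(n_probe, dim)) for _ in range(3)]
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal matrix
student_rep = teacher_reps[1] @ rotation + 0.1 * rng.normal(size=(n_probe, dim))

best, scores = trace_teacher(student_rep, teacher_reps)
print("closest teacher:", best)
print("similarity scores:", [round(float(s), 3) for s in scores])
```

In this toy setting the student is attributed to the teacher whose representation it was derived from, even though its features have been rotated. Whether the paper's actual comparison behaves this way depends on details not given in the abstract.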

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Dmitry_Kobak2

Submission Number: 8931
