Keywords: cryptanalysis, interpretability, AI for math
TL;DR: Interpretability analysis of transformers trained on hard math tasks
Abstract: Recent work has demonstrated that transformers can be trained to recover sparse, binary cryptographic secrets in the Learning With Errors (LWE) problem, a foundational problem underlying many post-quantum cryptographic schemes. However, as architectures have evolved toward efficient encoder-only models, the mechanism by which these models recover the secret has become more opaque. In this paper, we present the first layer-wise and embedding-level mechanistic interpretability analysis of encoder-only transformers trained on LWE samples. We reveal a surprising phenomenon: despite achieving near-zero exact prediction accuracy on the training objective, the models successfully recover the secret by bypassing the standard predictive pathways. Using dimensionality reduction, causal intervention, and linear probing, we find that the secret is implicitly encoded in the positional embeddings. Building on this mechanistic understanding, we introduce an architectural intervention that applies $L_1$ sparsity regularization directly to the positional embeddings. This modification forces the model to explicitly isolate the latent secret, turning the computationally expensive post-hoc secret recovery process into a direct, human-interpretable parameter inspection. Our findings provide fundamental insights into how transformers allocate representational capacity when faced with high-noise, structured combinatorial problems.
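For readers unfamiliar with the setting, the following is a minimal sketch of how LWE samples with a sparse, binary secret can be generated, using the standard relation b = A·s + e (mod q). The function name `sample_lwe`, the parameter values (n, m, q, Hamming weight), and the bounded uniform noise used in place of a discrete Gaussian are illustrative assumptions, not the settings used in the paper.

```python
import random


def sample_lwe(n=16, m=8, q=251, hamming_weight=3, noise_width=2, seed=0):
    """Generate m LWE samples (A, b) for a sparse binary secret.

    All defaults are hypothetical toy values chosen for readability.
    """
    rng = random.Random(seed)

    # Sparse binary secret: exactly `hamming_weight` ones among n coordinates.
    secret = [0] * n
    for i in rng.sample(range(n), hamming_weight):
        secret[i] = 1

    # Uniformly random matrix A with entries in Z_q (m rows, n columns).
    A = [[rng.randrange(q) for _ in range(n)] for _ in range(m)]

    # Small bounded noise (assumption: stands in for a discrete Gaussian).
    e = [rng.randint(-noise_width, noise_width) for _ in range(m)]

    # b_i = <A_i, s> + e_i (mod q)
    b = [
        (sum(a * s for a, s in zip(row, secret)) + e_i) % q
        for row, e_i in zip(A, e)
    ]
    return A, b, secret, e


A, b, secret, e = sample_lwe()
```

Recovering `secret` from `(A, b)` alone is the task the trained models address; the noise term `e` is what makes the problem hard compared with solving a plain linear system modulo q.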
Submission Number: 232