On the Theory of Neural Network Surjectivity: Can You Elicit Any Behavior from Your Model?

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Theory, Safety, Neural Network, Generative Model
Abstract: Given a trained neural network, can any specified output be generated by some input? Equivalently, is the function computed by the network surjective? In generative modeling, surjectivity implies that any output, including harmful or undesirable ones, can in principle be produced by the network, raising concerns about model safety and jailbreaks. In this paper, we prove that many fundamental building blocks of modern neural architectures, such as networks with pre-layer normalization and linear-attention modules, are almost always surjective. As corollaries, widely used generative frameworks, including GPT-style transformers and flow-matching models, admit inverse mappings for arbitrary outputs. By studying the surjectivity of these modern, commonly used architectures, we contribute a formalism that sheds light on their vulnerability to a broad class of adversarial attacks.
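To make the surjectivity claim concrete, here is a minimal sketch (not taken from the paper) that numerically searches for a preimage of an arbitrary target under a pre-layer-norm residual block f(x) = x + MLP(LayerNorm(x)), the kind of building block the abstract names. The block definition, dimensions, and optimizer settings below are illustrative assumptions; if the block is surjective, gradient descent on the input should drive the residual toward zero for essentially any target.

```python
# Illustrative sketch, assuming a pre-LN residual block f(x) = x + MLP(LN(x)).
# We freeze the (randomly initialized) network and optimize only the input x
# to hit an arbitrary target output y, i.e. we look for a preimage of y.
import torch

torch.manual_seed(0)
d = 16  # hidden width; an arbitrary choice for the demo


class PreLNBlock(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d, d),
        )

    def forward(self, x):
        # Pre-LN residual connection: the identity branch is what makes
        # "almost always surjective" plausible for such blocks.
        return x + self.mlp(self.norm(x))


f = PreLNBlock(d)
for p in f.parameters():
    p.requires_grad_(False)  # the trained network is fixed; only x is free

y = torch.randn(d)                       # arbitrary target output
x = torch.zeros(d, requires_grad=True)   # candidate preimage
opt = torch.optim.Adam([x], lr=1e-1)

for step in range(2000):
    opt.zero_grad()
    loss = (f(x) - y).pow(2).sum()       # squared error to the target
    loss.backward()
    opt.step()

print(f"residual ||f(x) - y|| = {(f(x) - y).norm().item():.2e}")
```

On a typical run the residual shrinks to near numerical zero, i.e. an input mapping (approximately) to the arbitrary target is found; this is the empirical face of the inverse mappings the paper establishes theoretically, not a substitute for the proof.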
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 22781