On the Theory of Neural Network Surjectivity: Can You Elicit Any Behavior from Your Model?

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Theory, Safety, Neural Network, Generative Model
Abstract: Given a trained neural network, can any specified output be generated by some input? Equivalently, is the function computed by the network surjective? In generative modeling, surjectivity implies that any output, including harmful or undesirable ones, can in principle be produced by the network, raising concerns about model safety and jailbreaks. In this paper, we prove that many fundamental building blocks of modern neural architectures, such as networks with pre-layer normalization and linear-attention modules, are almost always surjective. As corollaries, widely used generative frameworks, including GPT-style transformers and flow-matching models, admit inverse mappings for arbitrary outputs. By studying the surjectivity of these modern, commonly used architectures, we contribute a formalism that sheds light on their vulnerability to a broad class of adversarial attacks.
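To make the surjectivity claim concrete, here is a minimal sketch (not taken from the paper) that numerically searches for a preimage of an arbitrary target under a pre-layer-norm residual block f(x) = x + MLP(LayerNorm(x)), the kind of building block the abstract names. The block definition, dimensions, and optimizer settings below are illustrative assumptions; if the block is surjective, gradient descent on the input should drive the residual toward zero for essentially any target.

```python
# Illustrative sketch, assuming a pre-LN residual block f(x) = x + MLP(LN(x)).
# We freeze the (randomly initialized) network and optimize only the input x
# to hit an arbitrary target output y, i.e. we look for a preimage of y.
import torch

torch.manual_seed(0)
d = 16  # hidden width; an arbitrary choice for the demo


class PreLNBlock(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.norm = torch.nn.LayerNorm(d)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d),
            torch.nn.GELU(),
            torch.nn.Linear(4 * d, d),
        )

    def forward(self, x):
        # Pre-LN residual connection: the identity branch is what makes
        # "almost always surjective" plausible for such blocks.
        return x + self.mlp(self.norm(x))


f = PreLNBlock(d)
for p in f.parameters():
    p.requires_grad_(False)  # the trained network is fixed; only x is free

y = torch.randn(d)                       # arbitrary target output
x = torch.zeros(d, requires_grad=True)   # candidate preimage
opt = torch.optim.Adam([x], lr=1e-1)

for step in range(2000):
    opt.zero_grad()
    loss = (f(x) - y).pow(2).sum()       # squared error to the target
    loss.backward()
    opt.step()

print(f"residual ||f(x) - y|| = {(f(x) - y).norm().item():.2e}")
```

On a typical run the residual shrinks to near numerical zero, i.e. an input mapping (approximately) to the arbitrary target is found; this is the empirical face of the inverse mappings the paper establishes theoretically, not a substitute for the proof.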
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 22781