Physics-Aligned Decoding (PAD) for Discrete Protein Structure Representations

05 Feb 2026 (modified: 02 Mar 2026) · Submitted to Sci4DL 2026 · CC BY 4.0
Keywords: Vector-Quantized Autoencoders (VQ-VAE), Discrete Representation Learning, Structured Learning, Loss Design, Protein Structure Modelling, Generative Modelling
Abstract: Discrete representations learned by deep autoencoders are increasingly reused as intermediate state spaces in generative, conditional, and autoregressive models. In this work, we empirically identify an objective-level failure mode in discrete protein structure tokenizers trained with reconstruction-aligned losses: despite low global reconstruction error, learned tokens encode locally unphysical geometry, including covalent distortions and steric clashes. We show that these violations are deterministic and persistent under reuse. We test the hypothesis that this behavior arises from objective misspecification rather than architectural limitations, and introduce Physics-Aligned Decoding (PAD), a minimal intervention that augments reconstruction objectives with differentiable physical priors. Without changing architecture or regenerating the codebook, PAD reshapes token semantics and restores physical validity while preserving reconstruction accuracy. Our results highlight how loss geometry determines representation semantics, and demonstrate the importance of objective alignment when discrete representations are reused beyond static reconstruction.
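The abstract describes PAD as augmenting a reconstruction objective with differentiable physical priors targeting covalent distortions and steric clashes. A minimal sketch of such a combined loss is shown below; the function name `pad_loss`, the weights, and the geometric constants are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative constants (assumptions, not taken from the paper).
IDEAL_CA_CA = 3.8   # approximate ideal distance (in Angstroms) between consecutive C-alpha atoms
CLASH_CUTOFF = 3.0  # distance below which non-bonded atom pairs are penalized

def pad_loss(pred, target, w_bond=1.0, w_clash=1.0):
    """Reconstruction error plus two physics-motivated penalty terms.

    pred, target: (N, 3) arrays of C-alpha coordinates.
    """
    # Standard reconstruction term: mean squared coordinate error.
    recon = np.mean((pred - target) ** 2)

    # Covalent prior: penalize deviation of consecutive C-alpha
    # distances from the ideal value.
    d = np.linalg.norm(pred[1:] - pred[:-1], axis=-1)
    bond = np.mean((d - IDEAL_CA_CA) ** 2)

    # Steric-clash prior: hinge penalty on non-bonded pairs that
    # come closer than the cutoff.
    diff = pred[:, None, :] - pred[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    n = len(pred)
    nonbonded = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) > 1
    clash = np.mean(np.maximum(CLASH_CUTOFF - dist[nonbonded], 0.0) ** 2)

    return recon + w_bond * bond + w_clash * clash
```

In a real tokenizer this loss would be written in an autodiff framework so the physical priors reshape the decoder (and hence token semantics) through gradients; the NumPy version above only illustrates the structure of the objective.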
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Challenge: This submission is an entry to the science of DL improvement challenge.
Submission Number: 116