Keywords: Biology, Sequence Modeling, CoT, Pretraining
Abstract: Large language models trained on natural language (e.g., English) have shown that generating auxiliary tokens, intermediate outputs that are not part of the final answer, enables powerful capabilities such as error correction, self-reflection, and more reliable reasoning. Methods like Chain-of-Thought prompting exploit the high expressiveness of natural language, allowing models to verbalize internal states and perform complex reasoning in text space. In contrast, biological sequence models (e.g., for proteins, RNA, or DNA) operate over token spaces of limited expressiveness, restricted to amino acid or nucleotide tokens. As a result, these models lack mechanisms for externalized reasoning and are confined to producing only final sequence tokens, without self-correction.
In this work, we introduce Bio-reflection pretraining, a new framework that augments biological sequence models with an auxiliary <reflect> token. We choose a reflection token because it provides a flexible mechanism for token-level modifications, such as error flagging, correction, swapping, and deletion, that directly target the most common and most consequential mistakes in biological sequence generation. By injecting synthetic errors during training and requiring the model to explicitly mark and correct them, we teach the model to reflect on and self-correct its own outputs. This approach increases the effective expressiveness of biological sequence languages, enabling intermediate reasoning steps previously unattainable in this domain.
We evaluate our method on the challenging task of de novo peptide sequencing, where intermediate reasoning is critical and the ground-truth label is unique and clearly defined. We demonstrate both theoretically and empirically that reflection pretraining substantially improves model accuracy on this task and enhances robustness against overfitting. Beyond accuracy gains, our framework enables human-in-the-loop interaction, allowing experts to guide or override reflection points during sequence generation. Taken together, reflection pretraining offers a principled path toward more interpretable and steerable biological sequence models, narrowing the gap between natural language models and their biological counterparts.
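To make the synthetic error-injection idea concrete, the following is a minimal sketch of how a reflection-style training target might be built for a peptide sequence. The target format, the `<reflect>` token placement, and the function name are assumptions for illustration, not the paper's actual implementation: a randomly chosen residue is corrupted, and the target asks the model to emit the wrong token, flag it with `<reflect>`, and then emit the correction.

```python
import random

# The 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def make_reflection_example(seq: str, rng: random.Random) -> str:
    """Corrupt one residue of `seq` and build a space-separated training
    target in which the corrupted token is followed by <reflect> and the
    corrected residue (hypothetical target format)."""
    pos = rng.randrange(len(seq))
    # Pick a wrong residue different from the true one.
    wrong = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    target = list(seq)
    # Model output at `pos`: wrong token, reflection flag, correction.
    target[pos] = f"{wrong} <reflect> {seq[pos]}"
    return " ".join(target)

rng = random.Random(0)
print(make_reflection_example("PEPTIDE", rng))
```

Dropping the token preceding each `<reflect>` together with the flag itself recovers the ground-truth sequence, which is what makes such targets trainable with an ordinary next-token objective.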
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 11760