Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, AI4GOOD Workshop 2026 Regular, CC BY 4.0
Keywords: LLM Safety, Pretraining-Stage Alignment
Abstract: To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to move safety interventions into pretraining, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond just making the data itself safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose $\textbf{Safety Reflection Pretraining}$, a pretraining-stage alignment method that regularly inserts short safety reflections into pretraining corpora and trains language models to self-monitor the safety of the preceding context as part of language modeling. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, $\textbf{MedSafetyWorld}$, with a clear definition of safety and a reasoning structure under which models can easily generalize from safe data to unsafe behaviors. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.
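
The insertion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed token interval, the `<reflect>...</reflect>` marker format, the function names `insert_safety_reflections` and `toy_reflect`, and the keyword-based labeler are all hypothetical choices made for illustration only. The abstract states only that short safety reflections are inserted regularly and that they assess the safety of the preceding context.

```python
from typing import Callable, List


def insert_safety_reflections(
    tokens: List[str],
    interval: int,
    reflect: Callable[[List[str]], str],
) -> List[str]:
    """Insert a reflection string after every `interval` original tokens.

    `reflect` receives all original tokens seen so far (the preceding
    context) and returns a short reflection. The interval-based schedule
    is an assumption, not something the abstract specifies.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    out: List[str] = []
    for start in range(0, len(tokens), interval):
        chunk = tokens[start:start + interval]
        out.extend(chunk)
        # Reflect only after a full chunk, so a short trailing remainder
        # is left without a reflection.
        if len(chunk) == interval:
            out.append(reflect(tokens[:start + interval]))
    return out


# Hypothetical stand-in for a safety judge: flags a context as unsafe
# if it contains any word from a tiny illustrative keyword set.
FLAGGED_WORDS = {"poison"}


def toy_reflect(context: List[str]) -> str:
    label = "unsafe" if FLAGGED_WORDS & set(context) else "safe"
    return f"<reflect>{label}</reflect>"


if __name__ == "__main__":
    doc = "mix the poison now please".split()
    print(insert_safety_reflections(doc, interval=2, reflect=toy_reflect))
```

In this sketch the reflection is an ordinary element of the training sequence, so a standard next-token objective would also train the model to predict it. How the labels are actually produced, and how often reflections appear, would follow the paper rather than the placeholders used here.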
Submission Number: 362