Keywords: representation learning, generative models, learning theory, applications to neuroscience & cognitive science
TL;DR: We mimic human subliminal priming studies for LLMs to embed and trigger out-of-context reasoning.
Abstract: We mimic human subliminal priming studies for large language models (LLMs): we fine-tune models on a few short ex-template descriptions of a fictitious character's behaviour, mixed into a large corpus of longer but unrelated in-template instructions, and then elicit demonstrations of that behaviour with suitable trigger prompts. Our theoretical motivation comes from the observation that optimising models with the standard per-token cross-entropy loss is equivalent to training on a weighted context classification task in which shorter contexts carry higher weight. While we cannot measure an LLM's unawareness of the descriptions, we show that prompting strategies motivated by projective psychology and psychoanalytic theory succeed where naive questions fail, even with potent chain-of-thought (CoT) initiators. This work extends research on out-of-context reasoning (OOCR), a precursor to situational awareness, in which LLMs "read between the lines" or "think outside the box" by performing reasoning hops over internalised knowledge. We show that simple manipulations of the training data enable and improve the embedding of specific response behaviour that can be triggered only with the correct prompting strategy, hinting at the possibility of undetected alignment hazards in current LLMs.
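The weighting claim in the abstract can be illustrated with a minimal sketch. Assuming the loss is the mean over sequences of the per-sequence mean token cross-entropy (one common convention; the paper's exact setup may differ), each token in a length-n sequence carries weight 1/(B·n) for batch size B, so tokens from short descriptions weigh more than tokens from long instructions. The function name and lengths below are illustrative, not from the paper:

```python
def token_weights(seq_lens):
    """Per-token weight when the batch loss is
    mean over sequences of (mean over that sequence's tokens)."""
    B = len(seq_lens)
    return [1.0 / (B * n) for n in seq_lens]

# Hypothetical lengths: a 10-token description vs. a 100-token instruction.
w = token_weights([10, 100])
# Each token of the short sequence carries 10x the weight
# of a token in the long one: w[0] / w[1] == 10.
```

Under a global per-token mean (summing all token losses and dividing by total token count) this length bias disappears, which is why the loss-averaging convention matters for the equivalence the abstract describes.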
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10661