Keywords: language models, reasoning, cognitive reflection task, logit lens
Abstract: Given any input, a language model (LM) performs the same kind of computation to produce an output: a single forward pass through the underlying neural network. Inspired by findings in cognitive psychology, we investigate potential signatures of "deeper" and "shallower" computation within a forward pass, without allowing the model to generate intermediate reasoning steps. We prompt LMs with contrasting statements designed to trigger deeper or shallower reasoning on a set of cognitive reflection tasks. We find suggestive evidence that LMs' preferences for correct (deeper) or intuitive (shallower) answers can be manipulated through prompts related not only to general personality traits, but also to situational metabolic, physical, and social factors. We then use the logit lens to investigate how an LM might achieve this behavior. Our results suggest that intuitive answers are preferred in early layers, even when the final behavior is consistent with the correct answer or deeper reasoning. These findings motivate further mechanistic analyses of high-level cognition and reasoning in LMs.
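The abstract's layer-wise analysis relies on the logit lens: projecting each intermediate hidden state through the model's final layer norm and unembedding matrix to read off a per-layer next-token distribution. The sketch below illustrates the general technique, not the authors' exact code; the model name, prompt, and GPT-2-specific attribute names (`transformer.ln_f`, `lm_head`) are illustrative assumptions.

```python
# Minimal logit-lens sketch (illustrative, not the paper's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any decoder-only causal LM works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical cognitive-reflection-style prompt, chosen only for illustration.
prompt = "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. The ball costs $"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors of shape [1, seq_len, d_model]:
# the embedding output followed by each transformer block's output.
hidden_states = outputs.hidden_states
final_ln = model.transformer.ln_f  # GPT-2's final layer norm
unembed = model.lm_head            # (tied) unembedding matrix

for layer_idx, h in enumerate(hidden_states):
    # Project the last-position hidden state into vocabulary space.
    logits = unembed(final_ln(h[:, -1, :]))
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d}: top next-token prediction = {top_token!r}")
```

Comparing the rank or probability of the intuitive answer token against the correct answer token at each layer gives a layer-by-layer trace of which answer the model prefers, which is the kind of signal the abstract describes.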
Submission Number: 41