Keywords: evaluations, AI safety, situational awareness
TL;DR: We benchmark LLMs' understanding of how they can wield influence in the world, and how well they can guess which stage in their lifecycle a text came from
Abstract: Among the facts that LLMs can learn is knowledge about themselves and their situation. This knowledge, and ability to make inferences based on it, is called situational awareness. Situationally aware models can be more helpful, but also pose risks. For example, situationally aware models could game testing setups by knowing they are being tested and acting differently. We create a new benchmark, SAD (Situational Awareness Dataset), for LLM situational awareness in two categories that are especially relevant for future AI risks. SAD-influence tests whether LLMs can accurately assess how they can or cannot influence the world. SAD-stages tests if LLMs can recognize if a particular input is likely to have come from a given stage of the LLM lifecycle (pretraining, supervised fine-tuning, testing, and deployment). Only the most capable models do better than chance. If the prompt tells the model that it is an LLM, scores increase by 9-21 percentage points for models on SAD-influence, while having mixed effects on SAD-stages.
Submission Number: 84
Loading