Taming AI Bots: Controllability of Neural States in Large Language Models

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Explainability, hallucination, controllability, generative language models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: AI bots are controllable stochastic dynamical systems, but their controllability in the space of "meanings" is unknown. We formalize the problem and derive necessary and sufficient conditions, a first step toward the analysis and design of safe AI bots.
Abstract: We tackle the question of whether an agent can, by a suitable choice of prompts, control an AI bot to any state. We view large language models (LLMs) and their corresponding conversational interfaces (AI bots) as discrete-time dynamical systems evolving in the embedding space of (sub-)word tokens, where they are trivially controllable. However, we are interested not in controlling AI bots to produce individual words but rather sequences, or sentences, that convey certain "meanings". To tackle the question of controllability in the space of meanings, we first describe how meanings are represented in an LLM: after pre-training, the LLM is a deterministic map from incomplete sequences of discrete tokens to an inner-product space of discriminant vectors ("embeddings") for the next token; after fine-tuning and reinforcement, the same LLM maps complete sequences to a vector space. Since no token follows the special end-of-sequence token during pre-training, that vector space can be co-opted to represent meanings and align them with human supervision during fine-tuning. Accordingly, "meanings" in trained LLMs can be viewed simply as equivalence classes of complete trajectories of tokens. Although rudimentary, this characterization of meanings is compatible with so-called deflationary theories in epistemology. More importantly, defining meanings as equivalence classes of sentences allows us to frame the key question as determining the controllability of a dynamical system evolving in the quotient space of discrete trajectories induced by the model itself, a problem that, to the best of our knowledge, has never been tackled before. To do so, we characterize a "well-trained LLM" through conditions that are largely met by today's LLMs, and we show that, when restricted to the space of meanings, a well-trained AI bot is controllable under verifiable conditions. More precisely, we introduce a functional characterization of AI bots and derive necessary and sufficient conditions for controllability. The fact that AI bots are controllable means that they can be designed to counteract adverse actions and avoid reaching undesirable states before their boundary is crossed.
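To make the abstract's setup concrete, the following is a minimal LaTeX sketch of the dynamical-system view it describes. The notation (the vocabulary $V$, the model $\phi_\theta$, the equivalence relation $\sim$) is ours and chosen for illustration; it is not necessarily the paper's own formalization.

% Hedged sketch of the abstract's setup; notation is illustrative, not the paper's.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Let $V$ be the token vocabulary and $x_t = (w_1,\dots,w_t) \in V^t$ the
conversation state at step $t$. The bot evolves autoregressively under a
user prompt $u_t$ (the control input):
\[
  w_{t+1} \sim \phi_\theta(\,\cdot \mid x_t, u_t\,), \qquad
  x_{t+1} = (x_t, u_t, w_{t+1}),
\]
where $\phi_\theta$ is the trained LLM. A ``meaning'' is an equivalence
class of complete token sequences,
\[
  [s] = \{\, s' \in V^* \;:\; s' \sim s \,\},
\]
with $\sim$ induced by the model's own representation after fine-tuning.
Controllability in the space of meanings then asks: for every meaning
$[m]$, is there a finite prompt sequence $u_1,\dots,u_k$ that steers the
bot to some complete sequence $s \in [m]$?
\end{document}

Under this reading, the paper's claim is that a "well-trained" bot satisfies conditions making the answer affirmative, and that the same conditions can be checked to keep the trajectory away from undesirable equivalence classes.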
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6458