In-Context Neurofeedback: Can Large Language Models Control Their Internal Representations through Privileged Access?

Published: 03 Jun 2026, Last Modified: 09 Jun 2026. Venue: AI4GOOD Workshop 2026 Spotlight. License: CC BY 4.0
Keywords: metacognition, neurofeedback, in-context learning, privileged access, AI safety
Abstract: Whether large language models (LLMs) can control their own internal representations matters for both machine metacognition and AI safety. A recent study applied neurofeedback to LLMs and claimed that they can control their internal representations. However, the reported control may rely on superficial mechanisms rather than genuine internal access, because the control targets in that study are not privileged: a third party can infer them from the prompt. We redesign the neurofeedback paradigm for LLMs so that the control target satisfies the privileged access requirement, bringing the paradigm closer to neurofeedback experiments in human cognitive neuroscience. Under this stricter setting, the models do not demonstrate reliable control over privileged internal representations. This suggests that the previously reported results cannot rule out control that relies on superficial mechanisms. Our results indicate that rigorous assessment of metacognition in LLMs requires evaluation methods that demand privileged access.
Submission Number: 224