Cognitive Assessment of Language Models

Published: 18 Jun 2024 · Last Modified: 26 Jul 2024 · ICML 2024 Workshop on LLMs and Cognition (Poster) · License: CC BY 4.0
Keywords: Language Models, LLM, Cognition, Assessment
TL;DR: We propose a set of tasks inspired by evidence-based human cognitive assessments from the field of (neuro)psychology and create a battery of questions called the Cognitive Assessments for Language Models (CALM) dataset.
Abstract: Large language models (LLMs) are a subclass of generative artificial intelligence that can interpret language inputs to generate novel responses. These capabilities are conceptualized as a significant step forward in artificial intelligence because the models can seemingly mimic the thinking and reasoning of human cognition better than earlier generations of machine learning models, which were limited to identifying numeric classes and clusters. Performance benchmarks are necessary for tracking the progress of these models, and many existing tasks have served as useful tools. In the current work, we propose a set of such tasks inspired by evidence-based human cognitive assessments from the field of (neuro)psychology and create a battery of questions called the \textbf{Cognitive Assessments for Language Models (CALM) dataset}. We investigate the capabilities of LLMs in distinct domains of cognitive performance, including numeric reasoning, visual-spatial reasoning, attention, simple, working, and short-term memory, and executive functioning, among others. We compare performance across tasks in relation to the size of the LLM. Results demonstrate wide variability in performance across distinct cognitive domains. Of note, the number of parameters was predictive of performance on executive functioning, reasoning, and memory tasks. All models performed strongly on real-world reasoning and narrative interpretation tasks. Models universally performed poorly on visual-spatial reasoning tasks.
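As a hypothetical illustration of the evaluation protocol the abstract describes, the sketch below scores a CALM-style battery by aggregating accuracy per cognitive domain. The file name, question schema, and `query_model` helper are assumptions for illustration, not part of the paper or its released dataset.

```python
import json
from collections import defaultdict

def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; swap in any chat/completions API."""
    raise NotImplementedError

def score_battery(path: str = "calm_questions.json") -> dict[str, float]:
    """Score a battery of questions, each tagged with a cognitive domain
    (e.g., working memory, executive functioning); return per-domain accuracy."""
    with open(path) as f:
        # Assumed schema: [{"domain": ..., "prompt": ..., "answer": ...}, ...]
        items = json.load(f)

    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item in items:
        response = query_model(item["prompt"])
        total[item["domain"]] += 1
        # Simple containment check; real scoring may need task-specific parsing.
        if item["answer"].strip().lower() in response.strip().lower():
            correct[item["domain"]] += 1

    return {domain: correct[domain] / total[domain] for domain in total}
```

Running such a scorer across models of different sizes would yield the per-domain comparisons the abstract reports, such as parameter count predicting memory and executive-functioning performance.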
Submission Number: 9