The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks

Isaac Galatzer-Levy; David Alexander Munday; Xin Liu; Danny Karmon; Ilia Labzovsky; Rivka Moroshko; Amir Zait; Daniel McDuff

24 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: LLM, VLM, GenAI, Abstract Reasoning, Cognitive Benchmarking, Discrete Intellectual Abilities, Memory

Abstract: There is increasing interest in tracking the capabilities of general intelligence foundation models. This study benchmarks leading large language models (LLMs) and vision language models (VLMs) against human performance on the Wechsler Adult Intelligence Scale (WAIS-IV), a comprehensive, population-normed assessment of underlying human cognition and intellectual abilities, focusing on the domains of Verbal Comprehension (VCI), Working Memory (WMI), and Perceptual Reasoning (PRI). Most models demonstrated exceptional capabilities in the storage, retrieval, and manipulation of tokens such as arbitrary sequences of letters and numbers, with performance on the Working Memory Index (WMI) at or above the 99.5th percentile relative to human population norms. Performance on the Verbal Comprehension Index (VCI), which measures retrieval of acquired information and linguistic understanding of the meaning of words and their relationships to each other, was also consistently at or above the 98th percentile. Despite these broad strengths, multimodal models performed consistently poorly on the Perceptual Reasoning Index (PRI; range 0.1st to 10th percentile), indicating a profound inability to interpret and reason about visual information. Some more nuanced differences in performance were also observed. Models were consistently stronger on the WMI than on the VCI, indicating stronger capabilities in storage, manipulation, and retrieval of data than in language understanding. Smaller and older model versions consistently performed worse, indicating that training data, parameter count, and advances in tuning are producing significant advances in cognitive ability.
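To put the percentiles quoted in the abstract on the familiar index-score scale, the following minimal sketch converts between population percentiles and composite scores. It assumes the conventional standard-score scale for WAIS-IV indices (mean 100, standard deviation 15) and a normal distribution; the function names are hypothetical, and the published norm tables, not this approximation, determine actual scores.

```python
from statistics import NormalDist

# Assumption: composite index scores (VCI, WMI, PRI) are reported on a
# standard-score scale with mean 100 and SD 15, approximated as normal.
INDEX_SCALE = NormalDist(mu=100, sigma=15)


def percentile_to_index(percentile: float) -> float:
    """Map a population percentile (strictly between 0 and 100) to an index score."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must be strictly between 0 and 100")
    return INDEX_SCALE.inv_cdf(percentile / 100)


def index_to_percentile(score: float) -> float:
    """Map an index score back to a population percentile."""
    return 100 * INDEX_SCALE.cdf(score)


if __name__ == "__main__":
    # Percentiles taken from the abstract.
    for label, pct in [
        ("WMI lower bound", 99.5),
        ("VCI lower bound", 98.0),
        ("PRI upper end", 10.0),
        ("PRI lower end", 0.1),
    ]:
        print(f"{label}: {pct}th percentile ~ index score {percentile_to_index(pct):.1f}")
```

Under these assumptions, the 99.5th percentile corresponds to an index score of roughly 139 and the 98th to roughly 131, while the PRI range of the 0.1st to 10th percentile spans roughly 54 to 81.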

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 3912
