Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset

Published: 26 Sept 2023, Last Modified: 02 Nov 2023NeurIPS 2023 Datasets and Benchmarks PosterEveryoneRevisionsBibTeX
Keywords: Natural language processing, Large language models, creative problem solving, creativity, Remote Associates Test, cognitive neuroscience, artificial general intelligence, human-imitative AI, GPT, in-context learning
TL;DR: We present a novel dataset and tasks for evaluating creative problem solving in LLMs that mimic analogical cognitive neuroscience (RAT) tests. We report baseline evaluation results on a suite of PLMs, LLMs and human performance.
Abstract: The quest for human imitative AI has been an enduring topic in AI research since inception. The technical evolution and emerging capabilities of the latest cohort of large language models (LLMs) have reinvigorated the subject beyond academia to cultural zeitgeist. While recent NLP evaluation benchmark tasks test some aspects of human-imitative behaviour (e.g., BIG-bench's `human-like behavior' tasks), few, if not none, examine *creative problem solving* abilities. Creative problem solving in humans is a well-studied topic in cognitive neuroscience with standardized tests that predominantly use ability to associate (heterogeneous) connections among clue words as a metric for creativity. Exposure to misleading stimuli --- distractors dubbed *red herrings* --- impede human performance in such tasks via the *fixation effect* and Einstellung paradigm. In cognitive neuroscience studies, such fixations are experimentally induced by pre-exposing participants to orthographically similar incorrect words to subsequent word-fragments or clues. The popular British quiz show Only Connect's *Connecting Wall* segment essentially mimics Mednick's Remote Associates Test (RAT) formulation with built-in, deliberate red herrings, that makes it an ideal proxy dataset to explore and study fixation effect and Einstellung paradigm from cognitive neuroscience in LLMs. In addition to presenting the novel Only Connect Wall (OCW) dataset, we also report results from our evaluation of selected pre-trained language models and LLMs (including OpenAI's GPT series) on creative problem solving tasks like grouping clue words by heterogeneous connections, and identifying correct open knowledge domain connections in respective groups. We synthetically generate two additional datasets: OCW-Randomized, OCW-WordNet to further analyze our red-herrings hypothesis in language models. The code and link to the dataset is available at [url](
Supplementary Material: pdf
Submission Number: 231