Track: long paper (up to 10 pages)
Keywords: olfaction, benchmark, large language models, SMILES, molecular reasoning, sensory perception, odor prediction, multilingual evaluation, olfactory receptors, structure-odor relationships
TL;DR: We introduce a 1,010-question benchmark evaluating LLM olfactory reasoning across eight tasks, finding that the best model reaches 64.4% accuracy but that models rely on lexical associations rather than molecular structure understanding.
Abstract: We introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, primary odor descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representation. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean $\approx$ +7 points), suggesting that current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4\% overall accuracy, highlighting both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP benchmark across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with the best-performing language ensemble reaching AUROC = 0.86. These results argue that LLMs should be able to handle olfactory information, not only visual or auditory input.
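To make the two prompt formats concrete, here is a minimal Python sketch of how a single benchmark item could be rendered under each molecular representation. The field names, prompt wording, and example item are hypothetical illustrations, not the benchmark's actual templates or data.

```python
def build_prompts(item: dict) -> dict:
    """Render one multiple-choice question in both representations:
    compound name vs. isomeric SMILES (hypothetical template)."""
    question = (
        "Which primary odor descriptor best matches this molecule? "
        f"Options: {', '.join(item['options'])}."
    )
    return {
        "compound_name": f"Molecule: {item['compound_name']}. {question}",
        "isomeric_smiles": f"Molecule (isomeric SMILES): {item['smiles']}. {question}",
    }

# Example item: vanillin (4-hydroxy-3-methoxybenzaldehyde).
item = {
    "compound_name": "vanillin",
    "smiles": "COC1=CC(C=O)=CC=C1O",
    "options": ["vanilla", "citrus", "sulfurous", "woody"],
}
for fmt, prompt in build_prompts(item).items():
    print(fmt, "->", prompt)
```

And a toy illustration of the language-ensemble idea on synthetic data (not the paper's results): one simple way to "aggregate predictions across languages" is to average per-language probability scores before computing AUROC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)            # binary ground truth per item
# Synthetic per-language scores, shape (21 languages, 200 items),
# weakly correlated with the labels to mimic noisy predictors.
per_language = 0.6 * labels + 0.4 * rng.random((21, 200))
ensemble = per_language.mean(axis=0)             # average across languages
print(f"ensemble AUROC: {roc_auc_score(labels, ensemble):.2f}")
```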
Presenter: ~Eftychia_Makri1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 186