$\texttt{LAUGHS}$: An LLM-compatible Molecular String Representation

Published: 30 May 2026, Last Modified: 30 May 2026
Venue: ICML2026-AI4Science Poster
License: CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: Large language models, Molecular representations, Natural Language, AI for Chemistry
TL;DR: We introduce LAUGHS, a natural language-compatible molecular representation that hierarchically organizes named moieties, enabling accurate property explanations and site-specific editing with LLMs.
Abstract: Large language models (LLMs) are increasingly applied to chemistry, yet their performance depends strongly on how molecules are represented as text. IUPAC names become syntactically unwieldy for complex structures, while graph-serialized strings disperse chemically meaningful moieties across the sequence. Here, we present $\texttt{LAUGHS}$, an LLM-compatible molecular string representation that decomposes a molecule into named moieties, hierarchically organizes them into a tree structure, and linearizes the result into a natural-language-like string. Tokenization analysis reveals that LAUGHS units align near-perfectly with tokenizer spans, suggesting strong compatibility with LLMs. On the property explanation task, LAUGHS matches IUPAC-level performance across all metrics; on site-specific editing, it substantially outperforms all baselines with a 91.4\% exact match rate among valid outputs. Together, our results suggest that semantic mismatch between molecular representations and natural language syntax is a key bottleneck for LLMs in chemistry, and that LAUGHS offers an effective way to address it.
Submission Number: 103