Understanding The Limits Of Text-Only Molecular Reasoning: A Case Study In Synthetic Chain-Of-Thought Supervision
Keywords: Large Language Models; reasoning; molecule property prediction; SFT; tools; SMILES; IUPAC
TL;DR: Mol2Synth improves LLM molecular property prediction by fine-tuning on synthetic, critic-verified chemical reasoning, boosting toxicity F1 (0.71→0.81); IUPAC representations and tool-grounded reasoning generalize best.
Abstract: While large language models show promise for scientific reasoning, their applicability to molecular property prediction remains unclear. We present Mol2Synth, a controlled study that examines whether synthetic chain-of-thought supervision can enable text-only LLMs to match conventional topological fingerprint methods for toxicity prediction. Our results reveal fundamental limitations: even with tool-grounded reasoning and optimized representations, our best configuration (F1=0.88) underperforms classical ECFP fingerprints (F1=0.96), suggesting an inherent information bottleneck in textual molecular representations. Through systematic ablations across molecular representations (SMILES vs. IUPAC), data scaling, and tool-grounded generation, we demonstrate that reasoning-augmented fine-tuning stabilizes training and improves performance over zero-shot LLMs and label-only supervision, but cannot overcome structural parsing failures inherent to text-only inputs. Our qualitative analysis reveals that the primary failure mode is not faulty chemical reasoning but unreliable SMILES-to-structure interpretation, a bottleneck that tool integration partially addresses but cannot eliminate. These findings establish both the utility and the fundamental limits of synthetic chain-of-thought supervision for molecular tasks, motivating hybrid architectures that combine natural language reasoning with explicit structural encoders.
Submission Number: 101