Keywords: Large Language Models, Spatial Reasoning, Benchmark Evaluation, Materials Science, Crystalline Materials, Geometric Operations
Abstract: Large Language Models (LLMs) excel at textual reasoning and are beginning to
develop spatial understanding, prompting the question of whether these abilities
can be combined for complex, domain-specific tasks. This question is essential in
fields like materials science, where deep understanding of 3D atomic structures
is fundamental. While initial studies have successfully applied LLMs to tasks
involving crystal structure generation or coordinate understanding, a standardized
benchmark to systematically evaluate their core reasoning abilities across diverse
atomic structures has been notably absent. To address this gap, we introduce
the AtomWorld benchmark to evaluate LLMs on tasks based on Crystallographic
Information Files (CIFs), a standard structure representation format. These tasks,
including structural editing, CIF perception, and property-guided modeling, reveal
a critical limitation: current models, despite establishing promising baselines,
consistently fall short in structural understanding and spatial reasoning. Our experiments
show that these models make frequent errors on structure modification tasks, and
even in basic CIF format comprehension, potentially leading to cumulative
errors in subsequent analysis and materials insights. By defining these standardized
tasks, AtomWorld lays the groundwork for advancing LLMs toward robust atomic-scale
modeling, crucial for accelerating materials research and automating scientific
workflows.
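Illustrative example: to make the task setting concrete, below is a minimal, hypothetical sketch (not taken from the submission) of the kind of CIF-based structural edit the benchmark targets, assuming the pymatgen library. A structure is built, serialized to CIF text (the representation a model would reason over), and a species substitution is applied.

```python
# Minimal sketch of a CIF-based structural-editing operation, assuming pymatgen.
from pymatgen.core import Lattice, Structure

# Build a conventional rock-salt NaCl cell from its space group and asymmetric unit.
lattice = Lattice.cubic(5.64)  # lattice parameter in angstroms (illustrative value)
structure = Structure.from_spacegroup(
    "Fm-3m", lattice, ["Na", "Cl"], [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
)

# Serialize to CIF text: the representation an LLM would be asked to reason over.
cif_text = structure.to(fmt="cif")
print(cif_text.splitlines()[0])

# A simple "structural editing" step: substitute every Na site with K,
# then inspect the resulting composition.
structure.replace_species({"Na": "K"})
print(structure.composition.reduced_formula)  # -> KCl
```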
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 22569