Keywords: molecular property prediction, large language models, memorization
Abstract: Large Language Models (LLMs) show promise for changing how we interact with and control the design of other modalities, such as drugs, materials, and proteins, and for enabling scientific reasoning and planning. However, LLMs have several weaknesses: they tend to memorize rather than understand, and their implicit knowledge does not always propagate well between semantically similar inputs. In this work, we seek to distinguish what scientific LLMs have memorized from what they actually understand. To do so, we propose a new comprehensive benchmark dataset for evaluating LLM performance on molecular property prediction. We consider Galactica 1.3B, a state-of-the-art scientific LLM, and find that different prompting strategies exhibit vastly different error rates. We find that in-context learning generally improves performance over zero-shot prompting, and that the effect is twice as great for computed properties as for experimental ones. Furthermore, we show that the model is brittle and relies on memorized information, which may limit the application of LLMs to controlling molecular discovery. Based on these findings, we suggest the development of novel methods to enhance information propagation within LLMs: if we want LLMs to help us control molecular design and the scientific process, they must learn a sufficient understanding of how molecules work in the real world.
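As a rough illustration of the two prompting regimes contrasted in the abstract, the sketch below compares a zero-shot prompt with an in-context (few-shot) prompt for Galactica 1.3B via Hugging Face Transformers. The prompt templates, the example molecules, and the choice of molecular weight as the target property are illustrative assumptions and are not taken from the paper's benchmark.

```python
# A minimal sketch (not the paper's benchmark) of zero-shot vs. in-context
# prompting for molecular property prediction with Galactica 1.3B.
# The templates, molecules, and target property are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-1.3b")

# Zero-shot: ask for the property of the query molecule directly.
zero_shot = (
    "Question: What is the molecular weight of "
    "[START_I_SMILES]CCO[END_I_SMILES]?\n\nAnswer:"
)

# In-context: prepend a few solved examples before the same query molecule.
in_context = (
    "Question: What is the molecular weight of "
    "[START_I_SMILES]C[END_I_SMILES]?\n\nAnswer: 16.04\n\n"
    "Question: What is the molecular weight of "
    "[START_I_SMILES]CC[END_I_SMILES]?\n\nAnswer: 30.07\n\n"
    "Question: What is the molecular weight of "
    "[START_I_SMILES]CCO[END_I_SMILES]?\n\nAnswer:"
)

for name, prompt in [("zero-shot", zero_shot), ("in-context", in_context)]:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    # Keep only the newly generated tokens after the prompt.
    completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
    print(f"{name}: {completion.strip()}")
```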
Submission Track: Original Research
Submission Number: 107