Tokenizer Effect on Functional Material Prediction: Investigating Contextual Word Embeddings for Knowledge Discovery

Published: 27 Oct 2023, Last Modified: 03 Dec 2023, AI4Mat-2023 Poster
Submission Track: Papers
Submission Category: AI-Guided Design
Keywords: Material science, Natural language processing, Contextual embeddings, Large Language Models
Supplementary Material: pdf
TL;DR: A tokenizer effect is found in rank prediction of materials using contextual embeddings, and it is the key to improving ranking performance.
Abstract: Exploring the predictive capabilities of natural language processing models in material science is a subject of ongoing interest. This study examines material property prediction, relying on models to extract latent knowledge from compound names and material properties. We assessed various methods for contextual embeddings and explored pre-trained models like BERT and GPT. Our findings indicate that using information-dense embeddings from the third layer of domain-specific BERT models, such as MatBERT, combined with the context-average method, is the optimal approach for utilizing unsupervised word embeddings from material science literature to identify material-property relationships. The stark contrast between the domain-specific MatBERT and the general BERT model emphasizes the value of domain-specific training and tokenization for material prediction. Our research identifies a "tokenizer effect", highlighting the importance of specialized tokenization techniques to capture material names effectively during the pretraining phase. We discovered that a tokenizer which preserves compound names entirely, while maintaining a consistent token count, enhances the efficacy of context-aware embeddings in functional material prediction.
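The context-average method mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the `context_average` helper and the stand-in random vectors are assumptions, standing in for the hidden-state vectors one would actually extract from the third layer of a model such as MatBERT for each mention of a compound name in the literature.

```python
import numpy as np

def context_average(occurrence_embeddings):
    """Average the contextual embeddings of a compound name across
    all of the sentences (contexts) in which it occurs, producing a
    single vector to use for material-property ranking."""
    return np.mean(np.stack(occurrence_embeddings), axis=0)

# Stand-in vectors: in practice each would be the layer-3 hidden
# state of a domain-specific BERT (e.g., MatBERT) at the position
# of one mention of a compound such as "SrTiO3".
rng = np.random.default_rng(0)
mentions = [rng.standard_normal(768) for _ in range(5)]  # 5 contexts

embedding = context_average(mentions)
print(embedding.shape)  # (768,)
```

Note that this averaging is only well-defined when the tokenizer maps the compound name to a consistent token span; per the abstract, a tokenizer that preserves compound names whole, with a consistent token count, is what makes such context-aware embeddings effective.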
Digital Discovery Special Issue: Yes
Submission Number: 98