Enhancing Language Models for Technical Domains with Dynamic Token Injection

Published: 27 Oct 2023, Last Modified: 22 Nov 2023 · GenBio@NeurIPS 2023 Poster
Keywords: dynamic vocabulary; augmented language models; domain-specialized language models
TL;DR: Inject an exogenous domain into a frozen language model using a functional mapping over an augmented vocabulary
Abstract: Large language models (LLMs) are rapidly advancing the frontier of natural language understanding and generation. While their generalist nature makes them adept at handling a wide range of tasks, they often lack the depth and precision required by highly specialized and rapidly evolving technical domains, such as genomics and engineering design. Fine-tuning these models for specific domains can be effective but requires large amounts of data and compromises their general reasoning capabilities. In this work, we introduce a scalable method to infuse specialized knowledge into generalist language models by dynamically extending their vocabulary with specialist tokens. By applying a lightweight functional mapping over the extended vocabulary and adjusting the logit distribution, we enable the model to grasp domain-specific nuances. We demonstrate this with an application in genomics, where we extend a standard LLM with knowledge about a large set of genes, allowing it to proficiently tackle tasks involving both textual and genetic data. Functional alignment lets the model handle novel gene tokens that were never encountered during training, providing domain-aware out-of-distribution capabilities in generalist language models.
Submission Number: 5
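The mechanism described in the abstract — mapping exogenous domain features into a frozen model's representation space and extending the logit distribution over an augmented vocabulary — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the variable names, the single linear mapping, and the toy dimensions are all assumptions. Because the mapping is a function of domain features rather than a learned per-token embedding, it can score specialist tokens (e.g. genes) never seen during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen generalist LM components (toy sizes; real models are far larger).
d_model, vocab_size = 16, 100
W_out = rng.normal(size=(vocab_size, d_model))  # frozen output head

# Exogenous domain features for specialist tokens (e.g. gene representations).
n_new, d_domain = 5, 8
G = rng.normal(size=(n_new, d_domain))

# Lightweight functional mapping from domain space into the LM's space.
# (A single linear map here as an assumption; the paper does not pin this down.)
M = rng.normal(size=(d_domain, d_model)) * 0.1

def inject(domain_feats: np.ndarray) -> np.ndarray:
    """Map domain feature vectors to pseudo-embeddings in the LM's space."""
    return domain_feats @ M

def extended_logits(h: np.ndarray) -> np.ndarray:
    """Logits over the augmented vocabulary for a hidden state h.

    The frozen head scores the original vocabulary; the mapped domain
    features score the injected specialist tokens. Concatenating both
    yields a single distribution over vocab_size + n_new tokens.
    """
    base = W_out @ h        # unchanged logits from the frozen model
    new = inject(G) @ h     # logits for dynamically injected tokens
    return np.concatenate([base, new])

h = rng.normal(size=d_model)      # stand-in for a decoder hidden state
logits = extended_logits(h)
```

Because `inject` is applied at inference time, adding a previously unseen gene only requires its domain feature vector — no retraining or change to the frozen weights.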