Abstract: We propose a novel high-level approach to model analysis: estimating the amount of information retained by a model using a crafted set of control statements. To this end, we introduce a new metric, RIG (Raw Information Gain).
Any LLM (large language model) can be considered a “black box” of compressed information. It is hard to measure how much information about any given domain is stored inside the model. The contrast between the size of a trained model, around 43 GB, and the roughly 15 trillion tokens of training data is staggering.
The other issue is figuring out where the limits come from: are they an architectural constraint, or do they stem from the data used in training? So far, the most common way to determine whether a model is properly trained and contains the necessary information is to run it through a list of benchmarks and decide based on either its ranking or an educated guess at a score threshold. Keeping in mind that most of those benchmarks become part of the training data for upcoming models, we face a vicious cycle of never-ending benchmark creation.
Given the constant growth in size of both language models and datasets, we face the challenge of losing track of what is and is not efficient for training models. Moreover, the sheer scale of the datasets makes them nearly impossible to supervise at all, which is an immense obstacle when a language model must be updated for the different environments it is deployed in, while addressing ethical issues, the currency of human knowledge, and controversial statements all at once.
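As a rough illustration of the idea, and not the paper's actual procedure, the sketch below scores a causal LM on paired true/false control statements and treats the surprisal gap as a raw-information-gain signal. The model checkpoint, the control pairs, and the `rig_score` helper are all hypothetical stand-ins, assuming RIG is read as a reduction in per-token negative log-likelihood on factual statements relative to matched counterfactuals.

```python
# A minimal sketch of a RIG-style probe (assumed formulation, not the
# paper's implementation): compare the model's average surprisal on a
# true control statement against a minimally edited false variant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder; any causal LM checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def nll_per_token(text: str) -> float:
    """Average negative log-likelihood (nats/token) of `text` under the LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)
    return out.loss.item()  # mean cross-entropy over tokens

# Hypothetical control set: each true statement is paired with a
# minimally edited false variant from the same domain.
controls = [
    ("Water boils at 100 degrees Celsius at sea level.",
     "Water boils at 40 degrees Celsius at sea level."),
    ("Paris is the capital of France.",
     "Paris is the capital of Brazil."),
]

def rig_score(pairs) -> float:
    """Mean surprisal gap between false and true statements: a model that
    has absorbed the underlying facts should find the true variant less
    surprising, yielding a positive score."""
    gaps = [nll_per_token(false) - nll_per_token(true) for true, false in pairs]
    return sum(gaps) / len(gaps)

print(f"RIG-style score: {rig_score(controls):.3f} nats/token")
```

Under this reading, the score is benchmark-free: it depends only on the crafted control set and the model's own likelihoods, sidestepping the benchmark-contamination cycle described above.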
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: RIG, metric, Interpretability, information gain, information theory
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis, Theory
Languages Studied: English
Submission Number: 5714