Classification or Generation? Understanding Paradigm Shift for Knowledge-Intensive Tasks


1. Abstract

Knowledge-intensive tasks such as entity retrieval are challenging even for cutting-edge NLP models, since they require applying knowledge about the world. Previous studies typically treat such tasks as classification. Recently, a new paradigm has emerged that reformulates knowledge-intensive tasks as natural language generation. This post summarizes the paradigm shift and reviews the new generative methodology for the ICLR community, raising philosophical questions and suggesting new directions.

2. Introduction to Entity Retrieval

Search engines have become part of our daily lives. We use Google (Bing, Yandex, Baidu, etc.) as the main gateway to information on the Web. With a specific type of content in mind, we may search directly on a particular site or service, e.g., on Facebook or LinkedIn for people, organizations, and events; on Amazon or eBay for products; or on YouTube or Spotify for music. Accustomed to a search box somewhere near the top of the screen, we have also raised our expectations of the quality and speed of the responses to our searches.

Information retrieval (IR), at the highest level of abstraction, is about matching information needs with information objects. When a user issues a query, i.e., an expression ranging from a few keywords (e.g., Apple) to a natural-language question (e.g., who is the CEO of Apple?), the search engine responds with a ranked list of information objects, traditionally relevant documents.

Supported by the rapid development of large-scale structured knowledge bases, we have witnessed a transition from “documents” to “answers”: search engines directly return related entities or facts instead of merely “ten blue links”. Knowledge bases organize information around specific things and objects referred to as entities. The need to make search engines respond to queries with related entities brings us to the field of entity retrieval (ER), which is also the main problem tackled by the paper discussed here, Autoregressive Entity Retrieval [1] by Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni.

2.1 Problem Definition

Formally, entities are uniquely identifiable objects or things (such as persons, organizations, and places), characterized by their types, attributes, and relationships to other entities. In an entity retrieval task, we have a collection of entities $\mathcal{E}$ (e.g., Wikipedia articles) where each entity is an entry in a Knowledge Base (KB) such as Wikipedia. Given a textual input source $x$ (e.g., a question), a model has to return the most relevant entities from $\mathcal{E}$ with respect to $x$. We assume that each $e \in \mathcal{E}$ is uniquely assigned a textual representation (i.e., its name): a sequence of tokens $y$ (e.g., Wikipedia pages are identified by their titles).

Concretely, the following tasks are involved in this paper:

  • Entity Disambiguation (ED), where an input $x$ is annotated with a mention and a system has to either select its corresponding entity from $\mathcal{E}$ or predict that there is no corresponding entry in the KB (see Figure 1 for an example).
  • End-To-End Entity Linking (EL). This task is to jointly detect entity mentions $m$ from an input $x$ and link those mentions to their respective KB entities $e \in \mathcal{E}$.
  • Page-level Document Retrieval (DR). The input $x$ is intended as a query and $\mathcal{E}$ as a collection of documents identified by their unique titles (e.g., Wikipedia articles).

Figure 1: An example of entity retrieval.

3. Reformulation of the Problem

In previous research, entity retrieval has been modeled as a multi-class classification problem where each entity is assigned a unique atomic label. A typical retrieval system consists of two parts:

  1. An encoder model that converts input queries into hidden representations;
  2. A retrieval model that captures the affinity between the context and each entity, usually via vector dot products.

The outputs of the retrieval model are sorted, and the top-k most similar candidates are chosen as matches (a minimal sketch of this pipeline follows the list of drawbacks below). This process has several obvious drawbacks:

  • Training the system requires constructing negative samples, i.e., mismatched query-entity pairs fed into the model, and the choice of negative pairs has a strong influence on the final performance;
  • With large sets of entities, storing their dense representations requires a large memory footprint;
  • A vector dot product may fail to model the fine-grained interactions between the context and the entities.
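To make this classification-style pipeline concrete, here is a minimal sketch of dot-product retrieval over pre-computed entity embeddings. It is only an illustration under assumed, hypothetical names (retrieve_top_k, entity_matrix), not the implementation of any particular system.

```python
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, entity_matrix: np.ndarray, k: int = 5):
    """Rank pre-computed entity embeddings against a query embedding.

    query_vec:     (d,)   dense representation of the query from the encoder
    entity_matrix: (N, d) one dense vector per entity; for millions of
                          entities this matrix dominates the memory footprint
    """
    scores = entity_matrix @ query_vec        # dot-product affinity, shape (N,)
    top_k = np.argsort(-scores)[:k]           # indices of the k highest-scoring entities
    return top_k, scores[top_k]
```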

Now, let’s return to basics: by classifying or ranking the interactions between queries and entities, what is this system supposed to achieve? In a page-level Document Retrieval problem, we expect the model to output the most relevant documents (or sentences) in the KB given queries containing certain entity mentions; in an Entity Disambiguation problem, we want the model to output the entities mentioned in the given queries.

In other words, we can reformulate these retrieval problems as a generation task in which the system takes an input sentence and outputs another: that’s exactly what a Seq2Seq model does!

But hold on: since the model may produce unexpected answers that do not appear in the given KB, we need to add some constraints. To ensure that the outputs strictly follow the KB’s content, we can build a trie, i.e., a prefix tree, and use it to constrain the decoding process, since generation proceeds from left to right. We will discuss the details in later sections. With this Seq2Seq alternative, it is perhaps surprising to find that the problems mentioned above are alleviated:

  • In a Seq2Seq task, we don’t have to worry about constructing negative samples: since the token-level cross-entropy is normalized over the whole vocabulary, all other sequences already serve as negatives to a certain extent;
  • The memory overhead of a Seq2Seq model depends mainly on the beam size and the average length of the output sequences, which is much smaller than the cost of storing dense representations of all entities;
  • The Seq2Seq model, together with the prefix constraints, captures interactions at the token level, which is intuitively better than a dot product between representation vectors.

4. Methodology

We have now covered the main idea behind the paradigm proposed in this paper, “GENRE” (Generative ENtity REtrieval); here are some more details.

Concretely, the paper leverages a transformer-based architecture pre-trained with a language model objective (i.e., the BART model) and fine-tuned to generate entity names. GENRE ranks each entity $e \in \mathcal{E}$ by calculating a score with an autoregressive formulation:

\[\operatorname{score}(e \mid x)=p_{\theta}(y \mid x)=\prod_{i=1}^{N} p_{\theta}\left(y_{i} \mid y_{<i}, x\right),\]

where $y$ is the sequence of $N$ tokens in the identifier of $e$, and $\theta$ denotes the parameters of the model.
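In practice, this score can be read directly from the token-level log-probabilities of a Seq2Seq model. The sketch below is a minimal illustration using a plain pre-trained BART checkpoint from the Hugging Face transformers library (not a GENRE fine-tuned model); the function entity_score and the candidate list are hypothetical.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").eval()

def entity_score(x: str, entity_name: str) -> float:
    """log p(y | x): sum of log-probabilities of the entity-name tokens."""
    inputs = tokenizer(x, return_tensors="pt")
    labels = tokenizer(entity_name, return_tensors="pt").input_ids
    with torch.no_grad():
        # the model shifts `labels` right internally, so logits[:, i] predicts labels[:, i]
        logits = model(**inputs, labels=labels).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

# Rank a couple of candidate identifiers for a query mentioning "Apple":
candidates = ["Apple Inc.", "Apple (fruit)"]
print(sorted(candidates, key=lambda c: -entity_score("who is the CEO of Apple?", c)))
```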

4.1 Prefix Tree

Now let’s take a closer look at the trie constraints applied to the decoding part.

In computer science, a trie, also called digital tree or prefix tree, is a type of search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key (to recover its value, change it, or remove it), the trie is traversed depth-first, following the links between nodes, which represent each character in the key…

All the children of a node have a common prefix of the string associated with that parent node, and the root is associated with the empty string. – Wikipedia [2]

In the prefix tree used here, each node is associated with a token rather than an individual character. For example, given the following phrases:

English language
English literature
France

we can build a prefix tree as shown in Figure 2:

Sentences sharing the same prefix tokens are merged in the tree, and each complete path (i.e., a path that begins at the BOS node and ends at an EOS node) represents one sentence. We can search for a sentence efficiently by comparing an input sequence of tokens against the tokens associated with the nodes along a path.

During decoding, given the tokens generated so far, we set the probability of any token that does not appear among the children of the current node to zero, and let the model choose among the remaining tokens until an EOS node is reached. In this way, we make sure the model only outputs “legal” sentences that appear in our KB. The trie also reduces the search space of beam search during inference.
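A minimal token-level trie might look like the sketch below, with string tokens standing in for tokenizer IDs; this is an illustration, not the paper’s implementation. In practice, allowed_next would be called inside constrained beam search (e.g., via the prefix_allowed_tokens_fn argument of transformers’ generate(), with token IDs as keys).

```python
class TokenTrie:
    """Prefix tree over token sequences for constrained decoding."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:              # each sequence ends with an EOS marker
            node = self.root
            for token in seq:
                node = node.setdefault(token, {})

    def allowed_next(self, prefix):
        """Tokens that may follow the tokens generated so far."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []                  # prefix not in the KB: nothing is allowed
            node = node[token]
        return list(node.keys())

# Toy trie for the three phrases above (string tokens instead of token IDs):
trie = TokenTrie([
    ["English", "language", "<eos>"],
    ["English", "literature", "<eos>"],
    ["France", "<eos>"],
])
print(trie.allowed_next([]))              # ['English', 'France']
print(trie.allowed_next(["English"]))     # ['language', 'literature']
```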

Another advantage of a trie is its low memory overhead (e.g., constraining on Wikipedia titles using the BART tokenizer produces a trie with ∼6M leaves and ∼17M internal nodes that occupies ∼600MB of disk space), since it is a compressed representation of a set of entity names and can be pre-computed and stored in memory.

4.2 Autoregressive End-To-End Entity Linking

To extend the autoregressive framework to the end-to-end Entity Linking (EL) problem, a markup annotation is used in which span boundaries are flagged with special tokens and accompanied by their corresponding entity identifiers. As an example, given the input sentence:

In 1503, Leonardo began painting the Mona Lisa.

where the mention “Leonardo” refers to the entity “Leonardo da Vinci”, and the mention “Mona Lisa” refers to the entity “Mona Lisa” in the knowledge base, its corresponding output will be:

In 1503, [Leonardo](Leonardo da Vinci) began painting the [Mona Lisa](Mona Lisa).

Since the annotated output space is exponentially large, pre-computing a trie for decoding becomes intractable, and the decoding constraints are computed dynamically instead. In such a dynamic decoding strategy, there are three different conditions at each generation step (a toy sketch of this state logic follows Figure 3):

  1. Outside a mention, where the decoder can either start a new mention with a special token (i.e., [) or continue by copying the next input token;
  2. Inside an entity mention, where the decoder can either continue with the next input token or end the mention with a special token (i.e., ]);
  3. Inside an entity link, where the decoder follows the entity trie discussed above to generate a valid entity identifier.

The model is constrained differently under these circumstances, as shown in Figure 3.

Figure 3: Dynamically computed decoding constraints.
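As a toy illustration of these three conditions (a sketch under simplifying assumptions, not the paper’s implementation), the current decoding state can be recovered from the unmatched markup tokens generated so far; the allowed continuations are then either the next source token, a markup token, or the children of the current entity-trie node:

```python
def decoding_state(generated_tokens):
    """Return which of the three constraint regimes applies at this step.

    `generated_tokens` is the markup-annotated output produced so far, e.g.
    ["In", "1503,", "[", "Leonardo", "]", "("].
    """
    opened_links = generated_tokens.count("(") - generated_tokens.count(")")
    opened_mentions = generated_tokens.count("[") - generated_tokens.count("]")
    if opened_links > 0:
        return "inside_entity_link"   # follow the entity trie
    if opened_mentions > 0:
        return "inside_mention"       # copy the next source token or emit "]"
    return "outside"                  # copy the next source token or emit "["

print(decoding_state(["In", "1503,", "["]))                        # inside_mention
print(decoding_state(["In", "1503,", "[", "Leonardo", "]", "("]))  # inside_entity_link
```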

5. Experiments and Analyses

Extensive evaluations on more than 20 datasets across three tasks (Entity Disambiguation, end-to-end Entity Linking, and page-level Document Retrieval) demonstrate the effectiveness of the GENRE paradigm.

Overall, GENRE achieves very competitive results in all three settings, being the best performing system on average, especially on the page-level retrieval tasks of the KILT benchmark (Table 1):

Table 1: R-Precision for page-level retrieval on KILT test data. Columns are grouped by task: Fact Checking (FEV), Entity Disambiguation (AY2, WnWi, WnCw), Slot Filling (T-REx, zsRE), Open Domain QA (NQ, HoPo, TQA, ELI5), and Dialogue (WoW).

| Model | FEV | AY2 | WnWi | WnCw | T-REx | zsRE | NQ | HoPo | TQA | ELI5 | WoW | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DPR + BERT | 72.9 | - | - | - | - | 40.1 | 60.7 | 25.0 | 43.4 | - | - | - |
| DPR | 55.3 | 1.8 | 0.3 | 0.5 | 13.3 | 28.9 | 54.3 | 25.0 | 44.5 | 10.7 | 25.5 | 23.6 |
| tf-idf | 50.9 | 3.7 | 0.24 | 2.1 | 44.7 | 60.8 | 28.1 | 34.1 | 46.4 | 13.7 | 49.0 | 30.5 |
| DPR + BART | 55.3 | 75.5 | 45.2 | 46.9 | 13.3 | 28.9 | 54.3 | 25.0 | 44.4 | 10.7 | 25.4 | 38.6 |
| RAG | 61.9 | 72.6 | 48.1 | 47.6 | 28.7 | 53.7 | 59.5 | 30.6 | 48.7 | 11.0 | 57.8 | 47.3 |
| BLINK + flair | 63.7 | 81.5 | 80.2 | 68.8 | 59.6 | 78.8 | 24.5 | 46.1 | 65.6 | 9.3 | 38.2 | 56.0 |
| GENRE | 83.6 | 89.9 | 87.4 | 71.2 | 79.4 | 95.8 | 60.3 | 51.3 | 69.2 | 15.8 | 62.9 | 69.7 |

While outperforming other SotA models, GENRE significantly reduces memory overhead, occupying 14 times less memory than BLINK and 34 times less memory than DPR. As the entity names are stored in the prefix tree in advance, GENRE also has an advantage in the cold-start setting, where only the names of entities are available in the KB.

6. Classification, Generation, and Prompt-based Learning

To build on the success of this paradigm shift and apply autoregressive generative models to other classification problems, we need to understand the intrinsic reasons behind the superiority of generative models over classification models.

For knowledge-intensive tasks like entity retrieval, entity names carry rich semantic information that previous classification methods, which treat entities as atomic labels, simply ignore. In the autoregressive scheme, by contrast, fine-grained interactions between entity names and contexts are captured and contribute to the improvements.

This paradigm is similar to the recently popular prompt-based learning paradigm. Inspired by the remarkable few-shot performance of GPT-3 [3], which leverages natural-language prompts and a few task demonstrations as input context, researchers modify the input using a template (called a “prompt”) with some unfilled slots and transform traditional categorical classification into token classification. The prompt-based classification schema enjoys an overwhelming advantage over traditional classification in few-shot and even zero-shot settings. Figure 4 depicts MLM pre-training, standard fine-tuning, and LM-BFF [4] prompt-based fine-tuning; a toy cloze-prompt example follows the figure.

Figure 4: MLM pre-training, standard fine-tuning, and LM-BFF prompt-based fine-tuning.
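As a toy example of such a cloze-style prompt (using roberta-base and a hypothetical verbalizer, not the LM-BFF code), a masked LM can score label words directly in the template slot:

```python
from transformers import pipeline

# The MLM fills the slot in the template; a verbalizer such as
# {"great": "positive", "terrible": "negative"} maps tokens back to class labels.
fill_mask = pipeline("fill-mask", model="roberta-base")
prompt = "The movie was thrilling from start to finish. It was <mask>."
for candidate in fill_mask(prompt, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```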

The success of prompt-based learning can be attributed to the consistency between the masked-language-model pre-training objective and the slot-filling objective in fine-tuning. LMs capture the direct semantic interaction between the prompt tokens and the predicted label tokens and utilize it to make decisions.

From a prompt-based perspective, we can also reformulate autoregressive generation as a series of consecutive prompt-based classifications, where the previously generated tokens can be viewed as the prompt context. Note that the generative model decodes from left to right and hence only leverages unidirectional information.

Furthermore, the output of the generative model can be guided by incorporating extra information, such as keywords, domain tags, or other signals used to control the generated text [5]. These extra prompts help better exploit task information and may provide a valuable direction for future work on prompt learning and controlled text generation.

7. Conclusion

This post discussed a new paradigm that autoregressively generates entity names under prefix constraints. This plain and simple approach surprisingly shatters some existing benchmarks with a lower memory footprint and without separate candidate search or reranking. We compared this scheme with categorical classification and analyzed the intrinsic reasons for its advantages. Finally, we discussed the relationship between autoregressive generation and prompt-based learning and offered the community new directions.

References