Classification or Generation? Understanding Structure Prediction for Knowledge-Intensive Tasks

Figure 1: Entity retrieval.

2. Reformulation of the Problem

In previous research, entity retrieval has been modeled as a multi-class classification problem where each entity is assigned a unique atomic label. A typical retrieval system consists of two parts:

  1. An encoder model that converts input queries into hidden representations;
  2. A retrieval model that captures the affinity between context and entities, usually via vector dot products.

The outputs of the retrieval model are sorted, and the top-k most similar candidates are chosen as matches. This process has several obvious drawbacks:

  • Training the system requires constructing negative samples, i.e., mismatched entity–query pairs fed into the model, and the choice of negative pairs strongly influences the final performance;
  • When the set of entities is large, storing their dense representations requires a large memory footprint;
  • A single vector dot product may fail to model the fine-grained interactions between the context and the entities.

Now, let’s return to the basics: by classifying or ranking the output of interactions between the queries and the entities, what are we supposed to achieve with this system? In a page-level Document Retrieval problem, we expect the model to output the most relevant documents (or sentences) in the KB given queries containing certain entity mentions; in an Entity Disambiguation problem, we want the model to output the mentioned entities in the given queries.

In other words, we can reformulate the retrieval problems as a generation task where the system gets an input sentence and outputs another - that’s exactly what a Seq2Seq model does!

But hold on: since we may get unexpected answers from the model that do not appear in the given KB, we need to add some constraints. To ensure the outputs strictly follow the KB's content, we can build a trie, i.e., a prefix tree, and use it to constrain the decoding process, since generation is performed from left to right. We will discuss the details in later sections. Now, with this Seq2Seq alternative, it is perhaps surprising to find that the problems mentioned above are alleviated:

  • In a Seq2Seq task, we don’t have to worry about constructing negative samples, as all other sentences already serve as negative samples to a certain extent;
  • The memory overhead of a Seq2Seq model depends mainly on the beam size and the average length of the output sequence, which is much smaller than storing dense representations of all entities;
  • The Seq2Seq model, together with the prefix constraints, captures interactions at the token level, which is intuitively better than a dot product between representation vectors.

3. Methodology

We have now covered the main idea behind the paradigm proposed in this paper, “GENRE” (for Generative ENtity REtrieval); here are some more details.

Concretely, the paper leverages a transformer-based architecture pre-trained with a language modeling objective (i.e., the BART model) and fine-tuned to generate entity names. GENRE ranks each entity $e \in \mathcal{E}$ by computing a score with an autoregressive formulation: \(\operatorname{score}(e \mid x)=p_{\theta}(y \mid x)=\prod_{i=1}^{N} p_{\theta}\left(y_{i} \mid y_{<i}, x\right),\) where $y$ is the sequence of $N$ tokens in the identifier of $e$, and $\theta$ denotes the parameters of the model.
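As a concrete illustration, here is a minimal scoring sketch in Python. It uses a plain Hugging Face BART checkpoint as a stand-in for the fine-tuned GENRE model; the query, candidate names, and the `score` helper are made up for the example.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# Plain BART stands in for the fine-tuned GENRE model (illustration only).
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").eval()

def score(entity_name: str, query: str) -> float:
    """log p(y | x) = sum_i log p(y_i | y_<i, x) for one candidate identifier."""
    x = tokenizer(query, return_tensors="pt").input_ids
    y = tokenizer(entity_name, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes BART shift them right for the decoder input,
        # so logits[:, i] predicts y[:, i].
        logits = model(input_ids=x, labels=y).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, y.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()

# Rank candidate identifiers for a query by their autoregressive score.
query = "In 1503, Leonardo began painting the Mona Lisa."
for name in ["Leonardo da Vinci", "Leonardo DiCaprio"]:
    print(name, score(name, query))
```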

3.1 Prefix Tree

Now let’s take a closer look at the trie constraints applied to the decoding part.

In computer science, a trie, also called digital tree or prefix tree, is a type of search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key (to recover its value, change it, or remove it), the trie is traversed depth-first, following the links between nodes, which represent each character in the key…

All the children of a node have a common prefix of the string associated with that parent node, and the root is associated with the empty string. – Wikipedia

In the prefix tree we mentioned here, each node is associated with a token instead of an individual character. For example, given the following phrases:

English language
English literature
France

we can build a prefix tree as shown in Figure 2:

Sentences that share prefix tokens are aggregated, and each complete path (i.e., a path that begins at the BOS node and ends at an EOS node) represents a sentence. We can search for a sentence efficiently by comparing an input sequence of tokens with the tokens associated with the nodes.
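To make the structure concrete, here is a small sketch that builds such a token-level trie out of nested dictionaries; the BART tokenizer and the three example phrases are assumptions for illustration only.

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

# The example phrases from above; any list of KB entity names would do.
names = ["English language", "English literature", "France"]

def build_trie(names):
    """Nested-dict trie keyed by token id; every complete path runs from BOS to EOS."""
    trie = {}
    for name in names:
        node = trie
        # The tokenizer adds <s> ... </s>, so BOS/EOS bound every complete path.
        for tok in tokenizer(name).input_ids:
            node = node.setdefault(tok, {})
    return trie

trie = build_trie(names)
```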

During decoding, given the tokens already generated, we set the probability of any token that does not appear among the children of the current node to zero, so the model can only choose legal continuations until it reaches an EOS node (as sketched below). In this way, we make sure the model only outputs “legal” sentences that appear in our KB. The trie also reduces the search space of beam search during inference.
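Reusing the tokenizer and trie from the sketch above, this constraint can be approximated with the `prefix_allowed_tokens_fn` hook of Hugging Face's `generate()`; this is a rough sketch under those assumptions, not the paper's implementation.

```python
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").eval()

def allowed_next(trie, prefix_ids):
    """Token ids that may legally follow the generated prefix; [] if off-trie."""
    node = trie
    for tok in prefix_ids:
        node = node.get(tok)
        if node is None:
            return []
    return list(node.keys())

query = "He wrote several classics of English literature."  # made-up query
inputs = tokenizer(query, return_tensors="pt")

# generate() calls prefix_allowed_tokens_fn at every beam-search step and masks
# out all other tokens. BART's decoder starts from </s>, so we drop that first
# token to align the generated prefix with the trie built above.
outputs = model.generate(
    **inputs,
    num_beams=5,
    max_length=16,
    prefix_allowed_tokens_fn=lambda _, sent: allowed_next(trie, sent.tolist()[1:])
    or [tokenizer.eos_token_id],
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```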

Another advantage of a trie is its low memory overhead (e.g., constraining on Wikipedia titles using the BART tokenizer produces a trie with ∼6M leaves and ∼17M internal nodes that occupies ∼600MB of disk space), since it is a compressed representation of the set of entity names and can be pre-computed and stored in memory.

3.2 Autoregressive End-To-End Entity Linking

When pushing the autoregressive framework further to address the end-to-end Entity Linking (EL) problem, a markup annotation is used where span boundaries are flagged with special tokens and accompanied by their corresponding entity identifiers. As an example, given an input sentence:

In 1503, Leonardo began painting the Mona Lisa.

where the mention “Leonardo” refers to the entity “Leonardo da Vinci”, and the mention “Mona Lisa” refers to the entity “Mona Lisa” in the knowledge base, its corresponding output will be:

In 1503, [Leonardo](Leonardo da Vinci) began painting the [Mona Lisa](Mona Lisa).
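To ground the markup format, here is a tiny sketch (not from the paper's code) that parses such an annotated output back into (mention, entity) pairs with a regular expression; the helper name is hypothetical.

```python
import re

# Matches the [mention](entity identifier) markup used in the example above.
MARKUP = re.compile(r"\[(?P<mention>[^\]]+)\]\((?P<entity>[^)]+)\)")

def parse_markup(annotated: str):
    return [(m.group("mention"), m.group("entity")) for m in MARKUP.finditer(annotated)]

output = "In 1503, [Leonardo](Leonardo da Vinci) began painting the [Mona Lisa](Mona Lisa)."
print(parse_markup(output))
# [('Leonardo', 'Leonardo da Vinci'), ('Mona Lisa', 'Mona Lisa')]
```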

Since the annotated output space is exponentially large, it becomes intractable to pre-compute a trie for decoding, so the constraint is computed dynamically instead. In such a dynamic decoding strategy, there are three different conditions at each generation step:

  1. Outside a mention, where the decoder can either start a new mention with a special token (i.e., [) or continue by copying the next input token;
  2. Inside an entity mention, where the decoder can either continue copying the next input token or end the mention with a special token (i.e., ]);
  3. Inside an entity link, where the decoder follows the entity trie discussed above to generate a valid entity identifier.

The model is constrained differently under these circumstances, as shown in Figure 3.

Figure 3: Dynamic decoding constraints.
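A simplified, word-level sketch of these three states is given below; the token handling, the toy entity trie, and the function names are illustrative assumptions, while the real model works on subword ids and the full markup tokens.

```python
# Toy entity trie over words; ")" closes an identifier (a simplification of the
# actual markup, which also uses "(" after the closing bracket).
ENTITY_TRIE = {
    "Leonardo": {"da": {"Vinci": {")": {}}}},
    "Mona": {"Lisa": {")": {}}},
}

def allowed_next(source, generated):
    """Return the tokens the decoder may emit next, given the output so far."""
    state, copied, ent_prefix = "outside", 0, []
    for tok in generated:                      # replay the output to find the state
        if state == "outside":
            if tok == "[":
                state = "mention"
            else:
                copied += 1                    # copied one more source token
        elif state == "mention":
            if tok == "]":
                state, ent_prefix = "entity", []
            else:
                copied += 1                    # mention tokens are copies too
        else:                                  # inside the entity identifier
            if tok == ")":
                state = "outside"
            else:
                ent_prefix.append(tok)

    if state == "outside":                     # copy the next source token or open "["
        nxt = [source[copied]] if copied < len(source) else ["<eos>"]
        return nxt + ["["]
    if state == "mention":                     # keep copying or close with "]"
        nxt = [source[copied]] if copied < len(source) else []
        return nxt + ["]"]
    node = ENTITY_TRIE                         # follow the entity trie
    for tok in ent_prefix:
        node = node.get(tok, {})
    return list(node.keys())

source = "In 1503 , Leonardo began painting".split()
print(allowed_next(source, ["In", "1503", ",", "["]))   # -> ['Leonardo', ']']
```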

4. Experiments and Analyses

Extensive evaluations on more than 20 datasets across three tasks (Entity Disambiguation, end-to-end Entity Linking (EL), and page-level Document Retrieval) demonstrate the effectiveness of the GENRE paradigm.

Overall, GENRE achieves highly competitive results in all three settings, being the best-performing system on average across all of them, especially on the page-level retrieval tasks of the KILT benchmark (Table 1):

Table 1: R-Precision for page-level retrieval on KILT test data. Columns are grouped by task: Fact Checking (FEV), Entity Disambiguation (AY2, WnWi, WnCw), Slot Filling (T-REx, zsRE), Open Domain QA (NQ, HoPo, TQA, ELI5), and Dialogue (WoW).

| Model | FEV | AY2 | WnWi | WnCw | T-REx | zsRE | NQ | HoPo | TQA | ELI5 | WoW | Avg. |
|-------|-----|-----|------|------|-------|------|----|------|-----|------|-----|------|
| DPR + BERT | 72.9 | - | - | - | - | 40.1 | 60.7 | 25.0 | 43.4 | - | - | - |
| DPR | 55.3 | 1.8 | 0.3 | 0.5 | 13.3 | 28.9 | 54.3 | 25.0 | 44.5 | 10.7 | 25.5 | 23.6 |
| tf-idf | 50.9 | 3.7 | 0.24 | 2.1 | 44.7 | 60.8 | 28.1 | 34.1 | 46.4 | 13.7 | 49.0 | 30.5 |
| DPR + BART | 55.3 | 75.5 | 45.2 | 46.9 | 13.3 | 28.9 | 54.3 | 25.0 | 44.4 | 10.7 | 25.4 | 38.6 |
| RAG | 61.9 | 72.6 | 48.1 | 47.6 | 28.7 | 53.7 | 59.5 | 30.6 | 48.7 | 11.0 | 57.8 | 47.3 |
| BLINK + flair | 63.7 | 81.5 | 80.2 | 68.8 | 59.6 | 78.8 | 24.5 | 46.1 | 65.6 | 9.3 | 38.2 | 56.0 |
| GENRE | 83.6 | 89.9 | 87.4 | 71.2 | 79.4 | 95.8 | 60.3 | 51.3 | 69.2 | 15.8 | 62.9 | 69.7 |

While outperforming other SotA models, GENRE also significantly reduces the memory overhead, occupying 14 times less memory than BLINK and 34 times less memory than DPR. As the entity names are stored in the prefix tree in advance, the GENRE model also has an advantage in the cold-start setting where only the names of entities are available in the KB.

5. Classification vs. Generation

To push forward the success of this paradigm shift and apply generative models to more classification problems, we need to find out the intrinsic reasons behind the superiority of generative models over classification models.

Generation is technically a hierarchical classification procedure: at each generation step, the decoder chooses one token to output based on the ranked softmax logits; in other words, it performs token-level classification, and each step narrows the search space over the remaining categories (i.e., whole sequences). The categories are clustered by their preceding tokens, i.e., related categories (similar entities in the entity retrieval problem) that share prefix tokens are grouped in the same search subspace.


6. Conclusion

Entity retrieval is the task of finding the exact entity that a piece of natural language refers to. Existing approaches treat it as a search problem, where one retrieves an entity from a KG given a piece of text.

This work proposes a straightforward paradigm: finding an entity identifier by autoregressively generating it under prefix constraints. Effectively, this means cross-encoding entities and their context, with the advantages that the memory footprint scales with the vocabulary size rather than with the number of entities, and that no negative sampling is needed. Without additional search or reranking, this plain and simple approach surpasses several existing benchmarks, which is rather surprising.

References

  • [1] De Cao, N., Izacard, G., Riedel, S., & Petroni, F. (2020). Autoregressive entity retrieval. arXiv preprint arXiv:2010.00904.
  • [2] Wikipedia contributors. (2022, January 8). Trie. In Wikipedia, The Free Encyclopedia. Retrieved 03:30, January 14, 2022, from https://en.wikipedia.org/w/index.php?title=Trie&oldid=1064464503