Prototypical Representation Learning for Low-resource Knowledge Extraction: Summary and Perspective

Recent years have witnessed the success of prototypical representation learning across a wide range of low-resource tasks, ever since “Prototypical Networks for Few-shot Learning (NeurIPS 2017)[1]” proposed to represent each class as a prototype, the mean of its instance embeddings, and to learn a metric space in which classification can be performed by computing distances to prototypes. A recent paper, “Prototypical Representation Learning for Relation Extraction[2]”, accepted at ICLR 2021 and a member of the growing zoo of prototypical networks, addresses prototypical representation learning for low-resource knowledge extraction. In this post, we briefly survey this line of work, highlighting the ICLR paper. Unlike vanilla prototypical networks, the ICLR paper tackles low-resource knowledge extraction (1) by considering both intra-prototype compactness and inter-prototype separability, and (2) by leveraging contrastive learning with prototypes projected into a geometric space. We also point out some shortcomings of the paper and put forward several promising directions.

Content

1. Low-resource Knowledge Extraction

Knowledge Extraction (KE) aims at extracting structured information from unstructured text; representative tasks include Relation Extraction (RE) and Event Extraction (EE). For instance, as seen in Figure 1, given the sentence “Jack is married to the Iraqi microbiologist known as Dr. Germ.”,

  • the RE task should identify the relation of the given entity pair <Jack, Dr. Germ> as ‘husband_of’;
  • the EE task should identify the event type as ‘Marry’, where the word ‘married’ triggers the event and (Jack, Dr. Germ) participate in the event as husband and wife, respectively.
Knowledge Extraction
Figure 1. Knowledge Extraction
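The two extractions above can be represented as structured records. A minimal sketch in Python (the field names are illustrative, not tied to any particular KE schema):

```python
sentence = "Jack is married to the Iraqi microbiologist known as Dr. Germ."

# RE output: a (head, relation, tail) triple for the given entity pair
re_output = {"head": "Jack", "relation": "husband_of", "tail": "Dr. Germ"}

# EE output: event type, trigger word, and role-labeled participants
ee_output = {
    "event_type": "Marry",
    "trigger": "married",
    "arguments": {"husband": "Jack", "wife": "Dr. Germ"},
}
```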


As most KE models assume a sufficient training corpus, which is indispensable for learning versatile vectors for relations and events, relations or events with extremely limited instances struggle to achieve satisfactory performance, as Figure 2[17] shows. Hence it is crucial for KE models to be capable of extracting knowledge from low-resource training instances, including in long-tail, few-shot, and zero-shot settings.

Low-resource Scenarios in Knowledge Extraction
Figure 2. Low-resource Scenarios in Knowledge Extraction


In previous studies, several approaches to low-resource knowledge extraction have been proposed. In this blog, we focus on methods based on prototypical representation learning, which has proven robust in low-resource scenarios, taking the ICLR paper “Prototypical Representation Learning for Relation Extraction[2]” as our point of departure.

2. Prototypical Representation Learning

In vanilla prototypical networks[1], a class is represented by averaging the embeddings of its instances, and this class embedding is deemed the class prototype (also called the centroid), as Figure 3(a) shows. Then, by computing the distance from a query instance’s embedding to each prototype, we can classify the instance by its closest prototype.
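The vanilla procedure from [1] can be sketched in a few lines of NumPy; this is a toy illustration with 2-D embeddings (a real system would obtain the embeddings from a learned encoder):

```python
import numpy as np

def prototypes(embeddings, labels):
    # each class prototype is the mean of its support-instance embeddings
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(query, protos):
    # assign the query to the class whose prototype is nearest (Euclidean)
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))

support = np.array([[0.9, 0.1], [1.1, -0.1],   # class 0 instances
                    [0.0, 1.0], [0.2, 0.8]])   # class 1 instances
labels = np.array([0, 0, 1, 1])

protos = prototypes(support, labels)
print(classify(np.array([0.1, 0.9]), protos))  # → 1
```

In a few-shot episode, the support set is the handful of labeled instances per class, and the query is the new instance to classify.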
In recent years, various extensions of vanilla prototypical representation learning have emerged. We group these approaches into three categories:

  • (1) Intra-prototype Learning;
  • (2) Inter-prototype Learning;
  • (3) Joint Intra- and Inter-Prototype Learning.

The ICLR 2021 paper “Prototypical Representation Learning for Relation Extraction[2]”, introduced in this blog, is based on joint intra- and inter-prototype learning. Below, we outline the three kinds of methods one by one.

2.1 Intra-prototype Learning

Figure 3 illustrates the core idea of intra-prototype learning, in comparison to vanilla prototypical networks. Different from vanilla prototype learning, which simply averages instance embeddings for each class, intra-prototype learning aims to achieve more robust prototypes by

  • (1) improving the representations of instance embeddings (intra-instance),
  • (2) attentively aggregating the representations of instance embeddings (inter-instance),
  • (3) highlighting both crucial features and crucial instances (joint intra- and inter-instance).
Illustration of Intra-prototype Learning
Figure 3. Illustration of Intra-prototype Learning

intra-instance: on feature-level

Intra-instance methods for intra-prototype learning aim to achieve more robust instance embeddings, as illustrated in Figure 3(b).
Fan et al.[3] consider recognized entities of interest to generate fine-grained features for instance embeddings in few-shot relation classification, and adopt large-margin learning to improve the generalization of prototypical networks on long-tail relations. Wang et al.[4] focus on trigger biases (trigger overlapping and trigger separability) in few-shot event classification, and propose to tackle the context-bypassing problem with trigger-uniform sampling and confusion sampling.

inter-instance: on sentence-level

Inter-instance methods for intra-prototype learning attentively aggregate instance embeddings rather than merely averaging them equally, as illustrated in Figure 3(c).
Ye et al.[5] have proposed multi-level matching and aggregation strategies for few-shot relation classification, where the class prototype is formed by attention-based instance matching and attentive aggregation. Lai et al.[6] have proposed to exploit the relationships between training tasks for few-shot event detection, where prototypes are computed based on cross-task modeling.

joint intra- and inter-instance

Some evolved prototypical networks integrate intra- and inter-instance learning for low-resource KE, such as:
Gao et al.[7] have improved vanilla prototypical networks with hybrid attention for few-shot relation extraction, i.e., feature-level attention for instances and instance-level attention for prototypes. Deng et al.[8] have utilized dynamic memory modules to enhance prototype learning for few-shot event detection, implicitly highlighting crucial features of instances and refining the instance embeddings for each prototype.

2.2 Inter-prototype Learning

Figure 4 illustrates the core idea of inter-prototype learning, with comparison to vanilla prototypical networks.

Illustration of Inter-prototype Learning
Figure 4. Illustration of Inter-prototype Learning

considering long-tail distribution of prototypes

Cao et al.[9] have proposed to facilitate long-tail relation extraction by transferring knowledge from relations with sufficient training instances via relation prototypes, which reflect the meanings of relations as well as their proximities for transfer learning.

considering label dependency of prototypes

Cong et al.[10] have proposed a prototypical amortized conditional random field to model the label dependency in few-shot event detection, by generating the transition scores to achieve adaptation ability for novel event types based on the label prototypes.

considering knowledge constraint of prototypes

Yu et al.[11] have studied the few-shot relational triple extraction problem and proposed a multi-prototype embedding network that implicitly injects correlations between entities and relations, so that relations linked to the same entity type can be jointly learned. For example, the head entity type must be PERSON for both the born_in and live_in relations.

2.3 Joint Intra- and Inter-Prototype Learning

Ding et al.[2] have proposed to represent prototypes by integrating the advantages of intra- and inter-prototype learning: they learn a prototype for each relation in few-shot RE, considering intra-prototype compactness and inter-prototype separability via contrastive learning in a geometric space.
Assume that $s$ denotes an instance embedding generated by an instance encoder, and that a prototype $z$ for relation $r$ is an embedding in the same metric space as $s$. Given $\mathcal{S} = \{s_1, \dots, s_N\}$, the set of all instance embeddings in the batch $\mathcal{B} = \{(s_1, r_1), \dots, (s_N, r_N)\}$, and a fixed prototype $z^r$ for relation $r$, Ding et al. denote by $\mathcal{S}^r$ the subset of all instances $s_i \in \mathcal{S}$ with relation $r$, by $\mathcal{S}^{-r}$ the set of the remaining instances, and by $\mathcal{Z}^{-r}$ the set of prototypes $z'$ of all relations other than $r$.
Intra-prototype compactness means that, for a specific relation $r$, the ‘‘distance’’ between $z^r$ and any instance with the same relation $r$ should be smaller than the ‘‘distance’’ between $z^r$ and any instance with a relation $r' \neq r$.
Inter-prototype separability means that the ‘‘distance’’ between $z^r$ and any instance with relation $r$ should be smaller than the ‘‘distance’’ between any prototype $z' \in \mathcal{Z}^{-r}$ and instances with relation $r$.
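Written out, the two conditions amount to the following pair of constraints (our notation, with $\mathrm{dist}(\cdot, \cdot)$ standing for the conceptual ‘‘distance’’ above):

$$ \forall\, s \in \mathcal{S}^r,\; s' \in \mathcal{S}^{-r}: \;\; \mathrm{dist}(z^r, s) < \mathrm{dist}(z^r, s'), \qquad \forall\, s \in \mathcal{S}^r,\; z' \in \mathcal{Z}^{-r}: \;\; \mathrm{dist}(z^r, s) < \mathrm{dist}(z', s). $$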
Geometric space: Different from vanilla prototypical networks, Ding et al. have embedded prototypes in a geometric space, where a prototype is a unit vector starting from the origin and ending on the surface of a unit ball, and the instances of that prototype are unit vectors with approximately the same direction, clustered around the prototype. Under the optimal condition, different prototype vectors would be uniformly dispersed, with the angles between them as large as possible, as illustrated in Figure 5.

Geometrical Contrastive Representation of Prototypical Learning
Figure 5. Geometrical Contrastive Representation of Prototypical Learning


Geometrical contrastive representation of prototypical learning, considering intra-prototype compactness and inter-prototype separability, is based on:

  • instance level: instance-instance contrastive learning, and
  • instance-prototype level: instance-prototype contrastive learning.

Learning on instance-level

Given a batch $\mathcal{B} = \{(s_1, r_1), \dots, (s_N, r_N)\}$ of instance-relation pairs, the similarity metric between two instance embeddings, $d(s_i, s_j)$, is defined by:

$$ d(s_i, s_j) = \frac{1}{1 + \exp\left(-\frac{s_i}{||s_i||} \cdot \frac{s_j}{||s_j||}\right)}. $$

Geometrically, as illustrated in Figure 6, this metric is based on the angle between the normalized embeddings restricted to a unit ball, and the similarity metric between an instance embedding and a prototype, $d(z, s)$, follows the same principle.
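As a sanity check, the metric can be transcribed directly into NumPy. The sketch below assumes the convention in which the score increases as the two vectors align (the logistic function of their cosine), which matches how $d$ is used in the objectives that follow:

```python
import numpy as np

def d(u, v):
    # cosine of the angle between the unit-normalized vectors ...
    cos = np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))
    # ... squashed by the logistic function into a score in (0, 1)
    return 1.0 / (1.0 + np.exp(-cos))

a = np.array([1.0, 0.0])
print(d(a, np.array([2.0, 0.1])))   # close to sigma(1): nearly parallel
print(d(a, np.array([-1.0, 0.0])))  # sigma(-1), about 0.27: opposite directions
```

Note that only the direction of each vector matters; scaling an embedding leaves the score unchanged.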

Similarity Metric for Geometrical Prototypical Representation Learning
Figure 6. Similarity Metric for Geometrical Prototypical Representation Learning


To ensure instance-level intra-prototype compactness and inter-prototype separability in the representation space, Ding et al. have defined a contrastive objective function $\mathcal{L}_{\text{S2S}}$ over instance embeddings:

$$ \mathcal{L}_{\text{S2S}} = -\frac{1}{N^2} \sum_{i, j} \log \frac{\exp (\delta(s_i, s_j)\, d(s_i, s_j))}{\sum_{j'} \exp((1 - \delta(s_i, s_{j'}))\, d(s_i, s_{j'}))}, $$

where $\delta(s_i, s_j)$ indicates whether $s_i$ and $s_j$ correspond to the same relation, i.e., given $(s_i, r_i)$ and $(s_j, r_j)$, $\delta(s_i, s_j) = 1 \; \text{if}\; r_i = r_j \; \text{else} \; 0$.
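One way to transcribe $\mathcal{L}_{\text{S2S}}$ is as a log-softmax-style contrastive term over instance pairs: same-relation pairs feed the numerator, while different-relation pairs populate the denominator. A toy NumPy sketch of this reading (our transcription, not the authors' implementation):

```python
import numpy as np

def sim(u, v):
    # logistic of the cosine between unit-normalized embeddings
    cos = np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))
    return 1.0 / (1.0 + np.exp(-cos))

def l_s2s(S, labels):
    N = len(S)
    # pairwise similarity matrix D and same-relation indicator delta
    D = np.array([[sim(S[i], S[j]) for j in range(N)] for i in range(N)])
    delta = (labels[:, None] == labels[None, :]).astype(float)
    loss = 0.0
    for i in range(N):
        for j in range(N):
            num = np.exp(delta[i, j] * D[i, j])
            den = np.sum(np.exp((1.0 - delta[i]) * D[i]))
            loss += np.log(num / den)
    return -loss / N**2

S = np.array([[1.0, 0.0], [0.9, 0.1],      # relation 0 instances
              [-1.0, 0.0], [-0.9, -0.1]])  # relation 1 instances
labels = np.array([0, 0, 1, 1])
print(l_s2s(S, labels))  # a small positive scalar
```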

Learning on Instance-prototype level

Recall that $\mathcal{S}^r$ is the subset of all instances $s_i \in \mathcal{S}$ with relation $r$, $\mathcal{S}^{-r}$ is the set of the remaining instances, and $\mathcal{Z}^{-r}$ is the set of prototypes $z'$ of all relations other than $r$. The similarity metric between an instance embedding and a prototype, $d(z, s)$ (illustrated in Figure 6), is defined by:

$$ d(z, s) = \frac{1}{1 + \exp\left(-\frac{s}{||s||} \cdot \frac{z}{||z||}\right)}. $$

To realize intra-prototype compactness between instances and prototypes, Ding et al. have defined an objective function $\mathcal{L}_{\text{S2Z}}$:

$$ \mathcal{L}_{\text{S2Z}} = -\frac{1}{N^2} \sum_{s_i \in \mathcal{S}^{r}, s_j \in \mathcal{S}^{-r}} \big[\log d(z^r, s_i) + \log(1 - d(z^r, s_j))\big]. $$

To realize inter-prototype separability between instances and prototypes, Ding et al. have defined an objective function $\mathcal{L}_{\text{S2Z'}}$:

$$ \mathcal{L}_{\text{S2Z'}} = -\frac{1}{N^2} \sum_{s_i \in \mathcal{S}^{r}, z' \in \mathcal{Z}^{-r}} \big[\log d(z^r, s_i) + \log(1 - d(z', s_i))\big]. $$

These objectives can effectively split the data representations into $K$ disjoint manifolds centered at different prototypes, where $K$ is the number of relations.
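Both instance-prototype terms have a binary cross-entropy shape: instances of relation $r$ attract $z^r$, while the remaining instances ($\mathcal{L}_{\text{S2Z}}$) or the remaining prototypes ($\mathcal{L}_{\text{S2Z'}}$) are pushed away. A toy NumPy transcription under the same similarity metric (our sketch, not the authors' code):

```python
import numpy as np

def sim(u, v):
    # logistic of the cosine between unit-normalized embeddings
    cos = np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))
    return 1.0 / (1.0 + np.exp(-cos))

def l_s2z(z_r, S_r, S_neg, N):
    # compactness: pull relation-r instances toward z^r, push the rest away
    total = sum(np.log(sim(z_r, s_i)) + np.log(1.0 - sim(z_r, s_j))
                for s_i in S_r for s_j in S_neg)
    return -total / N**2

def l_s2z_prime(z_r, S_r, Z_neg, N):
    # separability: relation-r instances attract z^r but repel other prototypes
    total = sum(np.log(sim(z_r, s_i)) + np.log(1.0 - sim(z_p, s_i))
                for s_i in S_r for z_p in Z_neg)
    return -total / N**2

z_r = np.array([1.0, 0.0])       # prototype of relation r
S_r = [np.array([0.9, 0.1])]     # instances labeled with relation r
S_neg = [np.array([-1.0, 0.0])]  # instances of other relations
Z_neg = [np.array([-1.0, 0.0])]  # prototypes of other relations
print(l_s2z(z_r, S_r, S_neg, 2), l_s2z_prime(z_r, S_r, Z_neg, 2))
```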
Comparison Analysis of Loss Functions: Compared with the conventional cross-entropy loss in prototypical learning, $\mathcal{L}_{\text{S2Z}}$ and $\mathcal{L}_{\text{S2Z'}}$ have clear advantages:

  • Cross-entropy loss: relies solely on instance-level supervision, with no interaction between different instances, which is particularly fragile in a noisy-label setting;
  • $\mathcal{L}_{\text{S2Z}}$ and $\mathcal{L}_{\text{S2Z'}}$: consider distances between different instances and prototypes, exploiting the interactions between instances; this type of interaction effectively serves as a regularization of the decision boundary.

To further regularize the semantics of the prototypes, Ding et al. also use a prototype-level classification objective:

$$ \mathcal{L}_{\text{CLS}} = -\frac{1}{K} \sum_k \log p_\gamma(r^k | z^k), $$

where $\gamma$ denotes the parameters of an auxiliary classifier. Finally, with hyper-parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$, the full loss is defined as:

$$ \mathcal{L} = \lambda_1 \mathcal{L}_{\text{S2S}} + \lambda_2 (\mathcal{L}_{\text{S2Z}} + \mathcal{L}_{\text{S2Z'}}) + \lambda_3 \mathcal{L}_{\text{CLS}}. $$

3. Promising Research Directions

Although Ding et al. have produced predictive and robust prototype representations via joint intra- and inter-prototype learning, they focus merely on the contrast among instances and prototypes, and consequently may ignore the inherent semantic correlations among prototypes, such as hierarchy and entailment. Besides, other task-specific modeling spaces may also enhance prototypical representation learning for low-resource KE. We therefore discuss some promising research directions.

3.1 Knowledge-enhanced Prototype Learning

We argue that injecting the inherent semantics among instances and classes, such as concept-level and class-level knowledge, may further promote prototypical learning.

injecting concept-level knowledge

Gong et al.[12] have improved prototypical networks with side information built from keywords, hypernyms of named entities, and labels and their synonyms, and Zhang et al.[13] have likewise imposed concept-level KGs to better capture the semantics of low-resource relation types, demonstrating effectiveness and robustness in zero-shot and few-shot relation classification.

injecting class-level knowledge

Zheng et al.[14] have proposed a taxonomy-aware prototypical learning framework to model the hierarchy of event types in few-shot event detection, addressing the problems of class-centroid distribution and taxonomy-aware distribution in vanilla prototypical networks. In addition to hierarchy, Deng et al.[16] have injected the temporality and causality of event types.

3.2 Geometrical Prototype Learning

We also argue that modeling prototypes in non-Euclidean spaces may help capture complicated semantics, such as taxonomies in hyperbolic space and class correlations in hyperspherical space.

in hyperbolic space

Zheng et al.[14] have projected the event label taxonomy into hyperbolic space based on the Poincaré model[15], which significantly outperforms Euclidean embeddings on data with latent hierarchies, in order to obtain a label-hierarchy embedding for each event type, and have integrated each prototype vector with its taxonomy-aware label embedding.

in hyperspherical space

Deng et al.[17] have leveraged a knowledge-aware hyperspherical prototype network to model the entailment correlations among relations and the causality among events, as hyperspherical prototype networks[18] have demonstrated the effectiveness of imposing class semantics. We think more class semantics can be further explored, such as temporal and inverse correlations.

References

[1] Prototypical Networks for Few-shot Learning (NeurIPS 2017)

[2] Prototypical Representation Learning for Relation Extraction (ICLR 2021)

[3] Large Margin Prototypical Network for Few-shot Relation Classification with Fine-grained Features (CIKM 2019)

[4] Behind the Scenes: An Exploration of Trigger Biases Problem in Few-Shot Event Classification (CIKM 2021)

[5] Multi-Level Matching and Aggregation Network for Few-Shot Relation Classification (ACL 2019)

[6] Learning Prototype Representations Across Few-Shot Tasks for Event Detection (EMNLP 2021)

[7] Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification (AAAI 2019)

[8] Meta-Learning with Dynamic-Memory-Based Prototypical Network for Few-Shot Event Detection (WSDM 2020)

[9] Learning Relation Prototype from Unlabeled Texts for Long-tail Relation Extraction (TKDE, 2021)

[10] Few-Shot Event Detection with Prototypical Amortized Conditional Random Field (ACL 2021)

[11] Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction (COLING 2020)

[12] Zero-shot Relation Classification from Side Information (CIKM 2021)

[13] Knowledge-Enhanced Domain Adaptation in Few-Shot Relation Classification (KDD 2021)

[14] Taxonomy-aware Learning for Few-Shot Event Detection (WWW 2021)

[15] Poincaré Embeddings for Learning Hierarchical Representations (NeurIPS 2017)

[16] OntoED: Low-resource Event Detection with Ontology Embedding (ACL 2021)

[17] Low-resource Extraction with Knowledge-aware Pairwise Prototype Learning (Knowledge-Based Systems, 2022)

[18] Hyperspherical Prototype Networks (NeurIPS 2019)

