---
title: "MakeSense Analysis Scroll"
author: "[REDACTED]"
output: tint::tintHtml
  #github_document
  #html_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, echo = FALSE, dpi = 300, fig.retina = 1, fig.width = 5, fig.height=5)
library(tidyverse)
library(fs)
library(broom)

theme_set(theme_bw(base_size = 16))
```

# WiC: Word in Context

**Authors:** Pilehvar and Camacho-Collados, 2019

**Task:** Given sentences with the same target word, determine if the target word is being used with the same sense.

(a) *I went to the **bank** to withdraw money* and *The **bank's** manager declined my request.* **[TRUE]**
(b) *There's a lot of trash on the **bed** of the river* and *I keep a glass of water next to my **bed** when I sleep.* **[FALSE]**

**Evaluation Metric:** Accuracy

**Notes:** 

- [SuperGLUE Task](https://super.gluebenchmark.com/); humans achieve 80% accuracy.
- Comes with train/dev/test splits, but the test split's ground truth is hidden, so we use the dev split as our evaluation dataset.

```{r}
cosine_results <- dir_ls("data/results/bert_base_uncased/", regexp = "\\.csv$") %>%
  map_df(read_csv, .id = "file") %>%
  mutate(
    approximation_mode = str_extract(file, "(?<=\\d_)(.*?)(?=\\.csv)"),
    label = case_when(
      label ~ "Same Sense",
      TRUE ~ "Different Sense"
    )
  ) %>%
  select(-file)
```

## **Sense similarity analysis on WiC**

**Goal:** *Compare similarity-based properties afforded by performing MakeSense approximation on original BERT embeddings.*

**Investigation:** We combine the train and dev sets and investigate whether words used in the same sense are, on average, more similar to each other than words used in different senses.
This analysis focuses on the geometric properties of the original BERT-generated vector space and how it is altered by our MakeSense approximation approach, which pulls together the embeddings of words with the same sense.

```{r fig.margin=TRUE, fig.width=5, fig.height=5, fig.cap="**Figure 1:** Per-layer delta values in the `bert-base` model for MakeSense and Original embeddings."}
cosine_results %>% 
  filter(approximation_mode != "ser") %>%
  mutate(
    approximation_mode = case_when(
      approximation_mode == "laser" ~ "MAKESENSE",
      TRUE ~ "Original"
    )
  ) %>%
  group_by(approximation_mode, label, layer) %>% 
  summarize(cosine = mean(cosine)) %>% 
  pivot_wider(names_from = label, values_from = cosine) %>%
  janitor::clean_names() %>%
  ungroup() %>%
  mutate(delta = same_sense - different_sense) %>%
  ggplot(aes(layer, delta, color = approximation_mode, shape = approximation_mode)) + 
  geom_point(size = 4) +
  geom_line(size = 1) +
  annotate(geom = "label", x = 3, y = 0.13, label = "Words with same sense are\nmore similar on average", family = "Times", size = 5, fontface = 'italic') +
  annotate(geom = "label", x = 9, y = -0.05, label = "Words with same sense are\nless similar on average", family = "Times", size = 5, fontface = 'italic') +
  geom_hline(yintercept = 0.0, linetype = "dashed") +
  scale_color_manual(values = c("#2978b5", "#f7a440")) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL) +
  scale_y_continuous(limits = c(-0.1, 0.2)) +
  theme_bw(base_size = 20, base_family = "Times") +
  theme(
    legend.position = "top",
    plot.margin = margin(0.1, 0.2, 0.1, 0.1, "cm"),
    panel.background = element_rect(fill = "transparent"), # bg of the panel
    plot.background = element_rect(fill = "transparent", color = NA)
  ) +
  labs(
    x = "Layer",
    y = "Delta",
    color = "Representation",
    shape = "Representation"
  )

ggsave("Figures/wicsim.pdf", height = 4.54, width = 6.33)
```


**Hypothesis:** The MakeSense modification should result in a vector space in which words with the same sense are more similar to each other than words with different senses.

Let $S = \{(s^1_1, s^2_1), \dots, (s^1_n, s^2_n)\}$ be the set of embedding pairs for same-sense word pairs and $D = \{(d^1_1, d^2_1), \dots, (d^1_m, d^2_m)\}$ the set of embedding pairs for different-sense word pairs in our train+dev WiC data. We define $\Delta$ as the difference between the average similarity of same-sense pairs and the average similarity of different-sense pairs. That is,

\[
  \Delta = \underbrace{\frac{1}{n}\sum_{i=1}^n \cos(s^1_i, s^2_i)}_{\text{average similarity between}\atop\text{embeddings of same sense}} - \underbrace{\frac{1}{m}\sum_{i=1}^m \cos(d^1_i, d^2_i)}_{\text{average similarity between}\atop\text{embeddings of diff sense}}.
\]

We plot this for all layers in Figure 1.
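Concretely, the statistic is just a difference of two means over per-pair cosines; a minimal sketch in R (the similarity vectors below are made-up toy values, not the actual data):

```r
# Delta = mean cosine over same-sense pairs minus mean cosine over
# different-sense pairs (see the equation above).
delta <- function(cos_same, cos_diff) {
  mean(cos_same) - mean(cos_diff)
}

# Toy illustration: a positive Delta means same-sense pairs are
# more similar on average.
delta(cos_same = c(0.8, 0.6), cos_diff = c(0.4, 0.2))
```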

## **Threshold-based Classifier**

```{r fig.margin=TRUE, fig.cap="**Figure 2:** Results from the cosine threshold classification experiments."}
dir_ls("data/threshold_results/", regexp = "\\.csv$") %>%
  map_df(read_csv) %>%
  filter(class != "ser") %>%
  mutate(
    class = case_when(
      class == "laser" ~ "MAKESENSE",
      TRUE ~ "Original"
    )
  ) %>%
  ggplot(aes(layer, validation, color = class, shape = class)) +
  geom_point(size = 3) + 
  geom_line(size = 0.7) +
  scale_color_manual(values = c("#2978b5", "#f7a440")) +
  scale_y_continuous(limits = c(0.5, 0.7), labels = scales::percent_format()) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL) +
  theme_bw(base_size = 18, base_family = "Times") +
  theme(
    legend.position = "top"
  ) +
  labs(
    x = "Layer",
    y = "Accuracy",
    color = "Embedding",
    shape = "Embedding"
  )

```

**Goal:** Investigate whether embeddings of same-sense and different-sense word pairs become easier to distinguish after the MakeSense modification.

**Investigation:** Construct a threshold-based classifier $T(x_1, x_2)$ with threshold $\theta$ such that:

\[
T(x_1, x_2) = \begin{cases}
\text{True} & \cos(x_1, x_2) \geq \theta\\
\text{False} & \cos(x_1, x_2) < \theta
\end{cases}
\]

We tune the threshold on the WiC train set, for each layer, by a grid search over $[0, 1]$ in increments of $0.02$.
We then evaluate the resulting classifier per layer on the dev set.
We use $x_1, x_2$ generated using the original `bert-base` model and their MakeSense-modified counterparts.

We plot the results (accuracy per layer) in Figure 2.
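The tuning loop described above can be sketched as follows (a minimal sketch; the argument names `cosine` and `label` are assumptions, not the actual pipeline's):

```r
# Grid-search the threshold theta on the train split (pair cosines plus
# gold TRUE/FALSE labels), then apply the best theta to the dev split.
tune_threshold <- function(cosine, label, step = 0.02) {
  grid <- seq(0, 1, by = step)
  accs <- vapply(grid, function(theta) mean((cosine >= theta) == label), numeric(1))
  grid[which.max(accs)]  # first theta achieving the best train accuracy
}

threshold_accuracy <- function(theta, cosine, label) {
  mean((cosine >= theta) == label)
}
```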

## **Diagnostic Classifier-based Probing**

```{r fig.margin = TRUE, fig.cap="**Figure 3:** Results from our Probing experiments."}
wic_probing <- dir_ls("data/wic_probing_results/", regexp = "_256_probing_results\\.csv$") %>%
  map_df(read_csv)

wic_probing %>%
  mutate(
    class = case_when(
      class == "laser" ~ "MAKESENSE",
      TRUE ~ "Original"
    )
  ) %>%
  ggplot(aes(layer, accuracy, color = class, shape = class)) +
  geom_point(size = 4) + 
  geom_line(size = 1) +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  annotate(geom = "label", x = 9.5, y = 0.51, label = "Chance performance", family = "Times", fontface = "italic", size = 6) +
  scale_y_continuous(limits = c(0.5, 0.65), labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL)  +
  scale_color_manual(values = c("#2978b5", "#f7a440")) +
  theme_bw(base_size = 20, base_family = "Times") +
  theme(
    legend.position = "top",
    plot.margin = margin(0.1, 0.2, 0.1, 0.1, "cm"),
    panel.background = element_rect(fill = "transparent"), # bg of the panel
    plot.background = element_rect(fill = "transparent", color = NA)
  ) +
  labs(
    x = "Layer",
    y = "Accuracy",
    color = "Representation",
    shape = "Representation"
  )

ggsave("Figures/wicprobe.pdf", height = 4.54, width = 6.33)

```

**Goal:** Investigate whether the properties afforded by MakeSense make sense-distinguishing information more readily decodable.

**Investigation:** We construct a simple probe with a single hidden layer (three layers in total: input, hidden, output) and a $\mathrm{ReLU}$ non-linearity to decode whether two usages of a word share the same sense in our WiC data:

\[\text{Probe}(x_1, x_2) = W_2(\mathrm{ReLU}(W_1[x_1;x_2] + b_1)) + b_2,\]
where $W_1 \in \mathbb{R}^{256 \times 1536}$ and $W_2 \in \mathbb{R}^{2 \times 256}$.

Results shown in Figure 3.
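For illustration, the probe's forward pass can be sketched in base R with randomly initialized (untrained, made-up) weights, taking $W_1$ to map $\mathbb{R}^{1536} \to \mathbb{R}^{256}$ and $W_2$ to map $\mathbb{R}^{256} \to \mathbb{R}^{2}$:

```r
# Forward pass of the probe: logits = W2 ReLU(W1 [x1; x2] + b1) + b2.
probe_forward <- function(x1, x2, W1, b1, W2, b2) {
  h <- pmax(W1 %*% c(x1, x2) + b1, 0)  # hidden layer with ReLU
  as.vector(W2 %*% h + b2)             # two logits: same vs. different sense
}

set.seed(1)
d <- 768; hidden <- 256                 # bert-base embeddings are 768-d
W1 <- matrix(rnorm(hidden * 2 * d, sd = 0.01), hidden, 2 * d)
b1 <- numeric(hidden)
W2 <- matrix(rnorm(2 * hidden, sd = 0.01), 2, hidden)
b2 <- numeric(2)
logits <- probe_forward(rnorm(d), rnorm(d), W1, b1, W2, b2)
```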

# WHiC: Word Hypernyms in Context

**Authors:** Vyas and Carpuat, 2017

**Task:** Given two sentences with marked target words, determine whether the first target word is a hyponym of the second.

(a) *Magnus Carlsen is the world **chess** champion.* and *The championship **game** was played in NYC.* **[TRUE]**
(b) *Magnus Carlsen is the world **chess** champion.* and *The poachers hunted the big **game**.* **[FALSE]**

**Evaluation Metric:** Weighted F1 score (more negative samples than positive).

**Usefulness:** 

- Requires models to represent asymmetric, structured sense information, which is arguably more difficult than simply representing sense information.
- Is controlled for lexical overlap --- train, dev, and test have disjoint target-word vocabularies.
- Is controlled for direction --- for every (hypo, hyper) pair, we include the reverse (hyper, hypo) pair as a negative sample.
- Is controlled for contextual sensitivity --- for every (hypo, hyper_1) pair, we include a (hypo, hyper_2) pair that is a negative sample due to a different sense of hypo.

**Notes:** 

- The dataset is extremely under-used --- to our knowledge, we are the first to use it to evaluate contextualized word embeddings (CWE).

```{r}
whic_probing <- dir_ls("data/whic_torch_results/", regexp = "\\.csv$") %>%
  keep(str_detect(., "shuffle\\.csv$")) %>%
  map_df(read_csv) %>%
  mutate(
    class = case_when(
      class == "laser" ~ "MAKESENSE",
      TRUE ~ "Original"
    )
  )
```

## **Probing for Word Hypernyms in Context**

```{r fig.margin = TRUE, fig.cap="**Figure 4:** F1-scores achieved by our probing classifiers on the full WHiC test data."}
whic_probing %>%
  filter(class != "ser") %>%
  ggplot(aes(layer, f1, color = class, shape = class)) +
  geom_point(size = 4) + 
  geom_line(size = 1) +
  scale_y_continuous(limits = c(0.65, 0.80), labels = scales::percent_format(accuracy = 1)) +
  geom_hline(yintercept = 0.6623, linetype = "dashed") +
  annotate(geom = "label", x = 9.5, y = 0.672, label = "Chance performance", family = "Times", fontface = "italic", size = 6) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL)  +
  scale_color_manual(values = c("#2978b5", "#f7a440")) +
  theme_bw(base_size = 20, base_family = "Times") +
  theme(
    legend.position = "top",
    plot.margin = margin(0.1, 0.2, 0.1, 0.1, "cm"),
    panel.background = element_rect(fill = "transparent"), # bg of the panel
    plot.background = element_rect(fill = "transparent", color = NA)
  ) +
  labs(
    x = "Layer",
    y = "Weighted F1",
    color = "Representation",
    shape = "Representation"
  )

ggsave("Figures/whicoverall.pdf", height = 4.54, width = 6.33)

```


**Goal:** Investigate the extent to which the properties afforded by MakeSense make sense- and direction-sensitive hypernymy information readily available in a supervised setting (i.e., is it decodable from the representations?).

**Investigation:** As before, we construct a simple probing classifier that predicts whether the first target word $x_1$ is a hyponym of the second target word $x_2$:

\[\text{Probe}(x_1, x_2) = W_2(\mathrm{ReLU}(W_1[x_1;x_2] + b_1)) + b_2,\]
where $W_1 \in \mathbb{R}^{256 \times 1536}$ and $W_2 \in \mathbb{R}^{2 \times 256}$.

Note that the training data includes instances where the relationship is reversed, as well as instances where the relationship does not hold because either the first or the second word is used in a different sense.

We plot the overall weighted F1-scores in Figure 4.

## **Assessing Directional Sensitivity in Representations**

```{r fig.margin = TRUE, fig.cap="**Figure 5:** Results from Directional Sensitivity Analysis."}
whic_probing %>%
  filter(class != "ser") %>%
  ggplot(aes(layer, directional_accuracy, color = class, shape = class)) +
  geom_point(size = 4) + 
  geom_line(size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  annotate(geom = "label", x = 9.5, y = 0.03, label = "Chance performance", family = "Times", fontface = "italic", size = 6) +
  scale_y_continuous(limits = c(0, 0.5), labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL)  +
  scale_color_manual(values = c("#2978b5", "#f7a440")) +
  theme_bw(base_size = 20, base_family = "Times") +
  theme(
    legend.position = "top"
  ) +
  labs(
    x = "Layer",
    y = "Directional Accuracy",
    color = "Representation",
    shape = "Representation"
  )

ggsave("Figures/whicdirectional.pdf", height = 4.54, width = 6.33)
```

**Goal:** Test whether original and MakeSense representations differ in their ability to distinguish hypernymy from hyponymy.

**Investigation:** Use the probe learned in the previous experiment on a direction-sensitive test set. Specifically, test whether the directionally correct pair $(x_1, x_2)$ is classified as "TRUE" while its reverse $(x_2, x_1)$ is classified as "FALSE".

**Metric:** Pairwise accuracy.

**Most Frequent Class baseline performance:** 0%

Results per layer shown in Figure 5.
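The pairwise scoring rule can be sketched as follows (a minimal sketch; the argument names are assumptions):

```r
# An ordered pair counts as correct only if the probe predicts TRUE for the
# correct direction (x1, x2) AND FALSE for the reversed pair (x2, x1).
pairwise_accuracy <- function(pred_forward, pred_reverse) {
  mean(pred_forward & !pred_reverse)
}
```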

## **Assessing Contextual Sensitivity during Hypernymy Attribution**

**Goal:** Test whether original and MakeSense representations differ in their sensitivity to sense information when attributing words to their hypernyms.

```{r fig.margin = TRUE, fig.cap="**Figure 6:** Results from Contextual Sensitivity Analysis."}
whic_probing %>%
  filter(class != "ser") %>%
  ggplot(aes(layer, contextual_sensitivity, color = class, shape = class)) +
  geom_point(size = 4) + 
  geom_line(size = 1) +
  scale_y_continuous(limits = c(0.10, 0.40), labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL)  +
  scale_color_manual(values = c("#2978b5", "#f7a440")) +
  theme_bw(base_size = 18, base_family = "Times") +
  theme(
    legend.position = "top"
  ) +
  labs(
    x = "Layer",
    y = "Per-word F1",
    color = "Representation",
    shape = "Representation"
  )

ggsave("Figures/whiccontextual.pdf", height = 4.54, width = 6.33)
```


**Investigation:** Use the probe learned in the main experiment on a context-sensitive test set. Specifically, for each word, permute the test data so that we obtain a paired comparison between each word's correct and incorrect hypernym attribution. For instance, *chess* and *game* (in the sense of something people play) should be classified as true, while *chess* (in the same sense as before) and *game* (in the sense of an animal that is hunted) should be classified as false.

**Metric:** Average F1 per word.

**Most Frequent Class baseline performance:** 0%

Results per layer shown in Figure 6.
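The per-word F1 metric can be sketched as follows (the argument names are assumptions, not the actual pipeline's):

```r
# Binary F1 for a single word's predictions vs. gold labels.
f1 <- function(pred, label) {
  tp <- sum(pred & label)
  fp <- sum(pred & !label)
  fn <- sum(!pred & label)
  if (tp == 0) return(0)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Compute F1 within each target word's paired examples, then average
# across words.
avg_f1_per_word <- function(pred, label, word) {
  mean(tapply(seq_along(pred), word, function(i) f1(pred[i], label[i])))
}
```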

```{r}
directional <- dir_ls("data/whic_torch_results/", regexp = "directional\\.csv$") %>%
  map_df(read_csv) %>%
  mutate(
    class = case_when(
      class == "laser" ~ "MAKESENSE",
      TRUE ~ "Original"
    )
  )
```

```{r}
directional %>%
  ggplot(aes(layer, directional, color = class, shape = class)) +
  geom_point(size = 4) + 
  geom_line(size = 1) +
  scale_y_continuous(limits = c(0.8, 0.9), labels = scales::percent_format()) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL) +
  scale_color_manual(values = c("#2978b5", "#f7a440")) +
  theme_bw(base_size = 20, base_family = "Times") +
  theme(
    legend.position = "top",
    plot.margin = margin(0.1, 0.2, 0.1, 0.1, "cm"),
    panel.background = element_rect(fill = "transparent"), # bg of the panel
    plot.background = element_rect(fill = "transparent", color = NA)
  ) +
  labs(
    x = "Layer",
    y = "Directional Accuracy",
    color = "Representation",
    shape = "Representation"
  )

ggsave("Figures/whicdirectionalbalanced.pdf", height = 4.54, width = 6.33)

```


```{r}
whic_probing %>%
  filter(class != "ser") %>%
  ggplot(aes(layer, directional_similar, color = class, shape = class)) +
  geom_point(size = 3) + 
  geom_line(size = 0.7) +
  scale_y_continuous(limits = c(0.4, 0.9), labels = scales::percent_format()) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL)  +
  theme_bw(base_size = 18, base_family = "Times") +
  theme(
    legend.position = "top"
  ) +
  labs(
    x = "Layer",
    y = "Pairwise Similar Prediction",
    color = "Embedding",
    shape = "Embedding"
  )
```


## **USIM Results**

```{r}
correlations <- read_csv('data/annot_sim.csv') %>%
  pivot_longer(ol0:al12, names_to = 'type', values_to = 'similarity') %>%
  mutate(
    embedding = case_when(
      str_detect(type, "^o") ~ "original",
      str_detect(type, "^a") ~ "makesense"
    ),
    layer = as.integer(str_extract(type, "\\d{1,2}"))
  ) %>%
  group_by(layer, embedding) %>%
  nest() %>%
  mutate(
    cor = map(data, function(x) {
      cor.test(x$similarity, x$judgment, method = "spearman") %>% 
        tidy()
    })
  ) %>%
  unnest(cor) %>%
  select(-data)
```

```{r}
correlations %>%
  ggplot(aes(layer, estimate, color = embedding)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(limits = c(0, 0.6)) +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL)
```


```{r}
usim <- read_csv("data/usim_umid.csv") %>%
  pivot_longer(usim_o:umid_a, names_to = 'type', values_to = 'correlation') %>%
  mutate(
    embedding = case_when(
      str_detect(type, "_o$") ~ "Original",
      str_detect(type, "_a$") ~ "Makesense"
    ),
    type = str_replace(type, "_(o|a)", ""),
    type = factor(type, levels = c("usim", "umid"))
  )

usim %>%
  ggplot(aes(layer, correlation, color = embedding, linetype = type, shape = embedding)) +
  geom_point(size = 4) +
  geom_line(size = 1) +
  geom_hline(yintercept = 0.0, linetype = "dashed") +
  scale_x_continuous(breaks = 0:12, minor_breaks = NULL) +
  scale_color_manual(values = c("#2978b5", "#f7a440")) +
  theme_bw(base_size = 20, base_family = "Times") +
  theme(
    legend.position = "top",
    plot.margin = margin(0.1, 0.2, 0.1, 0.1, "cm"),
    panel.background = element_rect(fill = "transparent"), # bg of the panel
    plot.background = element_rect(fill = "transparent", color = NA),
    legend.margin = margin(6, 6, -1, 4)
  ) +
  labs(
    x = "Layer",
    y = "Spearman's R",
    color = "Representation",
    shape = "Representation",
    linetype = "Judgment"
  ) +
  guides(
    color = guide_legend(
      title.position = "left",
      direction = "vertical",
      hjust = -1
    ),
    shape = guide_legend(
      title.position = "left",
      direction = "vertical",
      hjust = -1
    ),
    linetype = guide_legend(
      title.position = "left",
      direction = "vertical"
    )
  ) 
ggsave("Figures/usim.pdf", height = 6, width = 7)

```

