# Conditional Probability Memorization Dynamics

We know that the entropy of a string affects how hard it is for models to memorize it.
To better understand how entropy interacts with the memorability of a string, we test how conditional entropy affects memorability.

By the n-conditional entropy $H_n(s)$ of a string $s$, we refer to the entropy $H_n(s) = H(s_i \mid s_{i-n}, \dots, s_{i-1})$, i.e. the entropy of a token in $s$ given that the preceding $n$ tokens (the preceding n-gram) are known.
We are interested in whether, at the same level of unconditional entropy $H(s)$ (i.e. 0-conditional entropy), strings with different levels of n-conditional entropy $H_n(s)$ differ in their memorability.

## Methodology

**Privileged continuation tokens**:
We create a string $s$ over an alphabet $A$ with a certain level of n-conditional entropy by assigning each possible n-gram $g$ over $A$ a *privileged continuation token* $t_g$.
E.g. for $A = \{a, b\}$ there are the 2-grams $aa, ab, ba, bb$, and each of them has a privileged continuation, e.g. $b$ for $aa$, $a$ for $ab$, etc.

**Constructing strings with different levels of conditional entropy**:
To sample a string $s$, we first sample $n$ tokens from $A$ uniformly at random.
To sample the next token $s_i$, we take its preceding n-gram $g = s_{i-n}, \dots, s_{i-1}$, look up its privileged token $t_g$, and then sample a token from $A$ with *$k \times$ relative probability* $p_k = k \cdot p_u$ for $t_g$ and uniform probability $p_u$ for each other token $t \in A \setminus \{t_g\}$.
I.e. we are $k$ times more likely to sample the privileged token $t_g$ as a continuation to $g$ than the other tokens in $A$.
We obtain $p_k$ as $p_k = \frac{k}{|A| -1 + k}$ and $p_u = \frac{1 - p_k}{|A| - 1}$.
Increasing the relative probability factor $k$ (and with it $p_k$) beyond 1 lowers the conditional entropy $H_n(s)$ of string $s$; at $k = 1$ all continuations are equally likely.
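
The sampling procedure above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function names, the dictionary representation of the privileged mapping, and the use of Python's `random` module are all assumptions, and it assumes $n \geq 1$. `analytic_conditional_entropy` computes the entropy of the continuation distribution, which has the same shape for every n-gram and which the empirical $H_n(s)$ of a long string should approach.

```python
import math
import random


def continuation_probs(alphabet_size, k):
    """Return (p_k, p_u) for a given alphabet size and relative factor k."""
    p_k = k / (alphabet_size - 1 + k)
    p_u = (1 - p_k) / (alphabet_size - 1)
    return p_k, p_u


def analytic_conditional_entropy(alphabet_size, k):
    """Entropy (in bits) of the continuation distribution for any n-gram."""
    p_k, p_u = continuation_probs(alphabet_size, k)
    probs = [p_k] + [p_u] * (alphabet_size - 1)
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sample_string(alphabet, privileged, n, k, length, rng=None):
    """Sample `length` tokens; `privileged` maps n-gram tuples to tokens.

    Hypothetical helper: assumes n >= 1 and that `privileged` has an
    entry for every n-gram over `alphabet`.
    """
    rng = rng or random.Random()
    # The first n tokens are drawn uniformly at random.
    tokens = [rng.choice(alphabet) for _ in range(n)]
    while len(tokens) < length:
        g = tuple(tokens[-n:])
        t_g = privileged[g]
        # Weight k for the privileged token, weight 1 for all others.
        weights = [k if t == t_g else 1 for t in alphabet]
        tokens.append(rng.choices(alphabet, weights=weights, k=1)[0])
    return tokens
```

For example, `sample_string(["a", "b"], {("a",): "b", ("b",): "a"}, n=1, k=4, length=50)` draws a 50-token string in which each token is followed by the other one with probability $p_k = 4/5$.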

**Ensuring the same level of unconditional entropy**:
To ensure that strings with different $p_k$ have the same unconditional entropy $H(s)$, we ensure that each token $t \in A$ appears the same number of times as a privileged continuation token.
I.e. for 1-grams, where there are $|A|$ combinations (single tokens from $A$), each $t \in A$ appears once as the privileged token of a 1-gram.
For 2-grams, with $|A|^2$ possible combinations, each token appears $|A|$ times as the privileged token, etc.
E.g. for 2-grams over $A = \{a, b\}$, the privileged token mapping $aa \rightarrow b, ab \rightarrow b, ba \rightarrow a, bb \rightarrow a$ would be valid, whereas the mapping $aa \rightarrow b, ab \rightarrow b, ba \rightarrow b, bb \rightarrow b$ would not.
Making each token appear the same number of times as privileged continuation ensures that the overall probability of each $t \in A$ is the same, and thus the unconditional entropy of the strings is the same.
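
One way to build such a balanced mapping is sketched below. The text does not say how the mapping is chosen beyond the balance constraint, so the random shuffle and the function name `balanced_privileged_mapping` are assumptions made for illustration.

```python
import itertools
import random
from collections import Counter


def balanced_privileged_mapping(alphabet, n, rng=None):
    """Map every n-gram over `alphabet` to a privileged continuation token.

    Each token is used as a privileged continuation exactly
    |A|^(n-1) times. The random assignment is an assumption; any
    assignment with equal counts satisfies the constraint.
    """
    rng = rng or random.Random()
    ngrams = list(itertools.product(alphabet, repeat=n))
    # |A|^n n-grams in total, so repeat the alphabet |A|^(n-1) times.
    continuations = list(alphabet) * (len(ngrams) // len(alphabet))
    rng.shuffle(continuations)
    return dict(zip(ngrams, continuations))


mapping = balanced_privileged_mapping(["a", "b"], n=2, rng=random.Random(0))
# Every token is privileged equally often (here: twice each).
print(Counter(mapping.values()))
```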

**Model training**:
As usual, we train models for 100 epochs to memorize strings with alphabets of different sizes (i.e. entropy levels) and record their memorization dynamics.
We also compute the empirical unconditional and conditional entropy of the sampled strings.
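
The empirical entropies can be estimated from token and n-gram counts. The sketch below uses plug-in (maximum-likelihood) estimates in bits; the function names and the choice of estimator are assumptions, since the text does not specify how the empirical values are computed.

```python
import math
from collections import Counter


def unconditional_entropy(tokens):
    """Plug-in estimate of H(s) in bits from unigram frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy(tokens, n):
    """Plug-in estimate of H_n(s) = H(s_i | s_{i-n}, ..., s_{i-1}) in bits."""
    joint = Counter()    # counts of (n-gram, next token) pairs
    context = Counter()  # counts of n-grams that have a continuation
    for i in range(n, len(tokens)):
        g = tuple(tokens[i - n:i])
        joint[(g, tokens[i])] += 1
        context[g] += 1
    total = sum(joint.values())
    # H_n = -sum over (g, t) of p(g, t) * log2 p(t | g)
    return -sum(
        c / total * math.log2(c / context[g]) for (g, _), c in joint.items()
    )
```

With $n = 0$ the context is empty for every position, so `conditional_entropy(tokens, 0)` reduces to the unconditional estimate.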
