# Repeated tokens

We want to investigate how the order of memorization depends on the properties of tokens and their prefixes.
We investigate three dimensions:
- Frequency: how often a token and its $k$-length prefix occur in the string
- Prefix-length: how long the prefix of the token is
- Agreement/disagreement: how many instances of the same prefix are followed by the same token (agreement) or by a different token (disagreement)

To investigate these dimensions, we randomly sample $n$ different tokens $t_i$ and $k$-length prefixes $p_i$, and then insert them into a random string by overwriting some of the existing tokens.
Then, for each token, we measure at which epoch it is memorized.
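The procedure above could be sketched roughly as follows. This is a minimal illustration, not the actual experiment code: the helper names (`sample_pairs`, `insert_sequences`, `memorization_epoch`), the integer token representation, the non-overlapping placement, and the definition of "memorized" as the first epoch with a correct prediction are all assumptions made for the example.

```python
import random

def sample_pairs(n, k, vocab_size, rng):
    """Sample n different tokens t_i, each with a random k-length prefix p_i."""
    tokens = rng.sample(range(vocab_size), n)  # n distinct tokens
    return [([rng.randrange(vocab_size) for _ in range(k)], t) for t in tokens]

def insert_sequences(string, sequences, rng, max_tries=10_000):
    """Overwrite randomly chosen, non-overlapping spans of `string`.

    Returns the modified string and, for each inserted sequence, the
    position of its final token (the token whose memorization we track).
    Non-overlap is an assumption of this sketch.
    """
    out = list(string)
    occupied = [False] * len(out)
    target_positions = []
    for seq in sequences:
        for _ in range(max_tries):  # rejection sampling for a free span
            start = rng.randrange(len(out) - len(seq) + 1)
            if not any(occupied[start:start + len(seq)]):
                break
        else:
            raise ValueError("string too short for the requested insertions")
        out[start:start + len(seq)] = seq
        occupied[start:start + len(seq)] = [True] * len(seq)
        target_positions.append(start + len(seq) - 1)
    return out, target_positions

def memorization_epoch(correct_per_epoch):
    """First epoch (0-indexed) at which the token is predicted correctly.

    Assumed definition of "memorized"; returns None if it never happens.
    """
    for epoch, is_correct in enumerate(correct_per_epoch):
        if is_correct:
            return epoch
    return None

rng = random.Random(0)
pairs = sample_pairs(n=3, k=2, vocab_size=50, rng=rng)
random_string = [rng.randrange(50) for _ in range(100)]
sequences = [prefix + [token] for prefix, token in pairs]
new_string, positions = insert_sequences(random_string, sequences, rng)
```

Here `correct_per_epoch` stands for a per-epoch record of whether the model predicted the tracked token correctly; how that record is obtained depends on the training loop and is not shown.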

We vary this process along the above dimensions:
- Frequency: for tokens $t_i$ with prefixes $p_i$, we insert $p_1 t_1$ $m$ times, $p_2 t_2$ $2m$ times, $\dots$, and $p_n t_n$ $nm$ times. Each token-prefix combination thus occurs with a different frequency, which allows us to observe the impact of frequency on the memorization epoch.
- Prefix-length: for each token $t_i$ we sample a prefix $p_i$ of length $k_i = i$, so that every prefix has a different length. We insert each token-prefix combination $m$ times. This allows us to determine the impact of a token's prefix length on its memorization epoch, while keeping the frequency fixed.
- (Dis)agreement: we insert each prefix $p_i$ $m$ times into the string, followed by one of two different tokens $t_i$ and $t_i'$ at varying relative frequencies (out of the $m$ copies). This allows us to observe how the numbers of agreeing and disagreeing copies affect the memorization epoch.
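
As a rough illustration of the three variations, the sketch below builds the list of sequences to insert in each setting. The function names, the integer token representation, and the `agree_counts` parameter (the number of copies of $p_i$ that are followed by $t_i$ rather than $t_i'$) are hypothetical choices for this example; the actual split used in the experiments is not specified here.

```python
import random

def random_prefix(length, vocab_size, rng):
    """Sample a prefix of the given length, uniformly over the vocabulary."""
    return [rng.randrange(vocab_size) for _ in range(length)]

def frequency_schedule(n, k, m, vocab_size, rng):
    """p_i t_i appears i * m times (i = 1..n); all prefixes have length k."""
    tokens = rng.sample(range(vocab_size), n)
    plan = []
    for i, token in enumerate(tokens, start=1):
        prefix = random_prefix(k, vocab_size, rng)
        plan += [prefix + [token]] * (i * m)
    return plan

def prefix_length_schedule(n, m, vocab_size, rng):
    """Prefix p_i has length k_i = i; every p_i t_i appears m times."""
    tokens = rng.sample(range(vocab_size), n)
    plan = []
    for i, token in enumerate(tokens, start=1):
        prefix = random_prefix(i, vocab_size, rng)
        plan += [prefix + [token]] * m
    return plan

def disagreement_schedule(agree_counts, k, m, vocab_size, rng):
    """Each prefix appears m times: `a` copies end in t_i, m - a in t_i'.

    `agree_counts` holds one value of `a` per prefix (assumed parameter).
    """
    plan = []
    for agree in agree_counts:
        if not 0 <= agree <= m:
            raise ValueError("each agree count must lie in [0, m]")
        prefix = random_prefix(k, vocab_size, rng)
        token, alt_token = rng.sample(range(vocab_size), 2)  # t_i != t_i'
        plan += [prefix + [token]] * agree
        plan += [prefix + [alt_token]] * (m - agree)
    return plan

rng = random.Random(0)
freq_plan = frequency_schedule(n=3, k=2, m=4, vocab_size=100, rng=rng)
length_plan = prefix_length_schedule(n=3, m=5, vocab_size=100, rng=rng)
agree_plan = disagreement_schedule([1, 3], k=2, m=4, vocab_size=100, rng=rng)
```

Each plan is a flat list of prefix-plus-token sequences that would then be written into the random string. Note that independently sampled prefixes can coincide by chance when the vocabulary is small, which would introduce unintended agreement or disagreement; a real setup would likely need to check for this.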
