{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e355b99d",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from lib_project.notebook import setup_notebook\n",
    "setup_notebook(\"../../../../\")\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acbd71c6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from lib_project.visualization import with_paper_style\n",
    "from defs import BASE_FIGURE_DIR\n",
    "from experiments.memorability.repeated_strings import results as res_util"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1206ca3-0050-4a98-9e42-355d6aeaab2c",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Shuffled Substrings: Alphabet Size and Entropy\n",
    "\n",
    "We investigate how the entropy and regularity of a string relate to its memorability.\n",
    "To do so, we construct long random strings out of copies of short random substrings.\n",
    "We then test how the entropy of the substrings and the regularity of the string (determined by the length of the substrings and the number of unique substrings) affect the relative speed of memorizing the string.\n",
    "This allows us to study whether memorization speed is primarily a function of entropy or of string regularity.\n",
    "\n",
    "## Methodology\n",
    "\n",
    "### String construction\n",
    "\n",
    "To construct a string, we sample a base substring $s$ of length $k$.\n",
    "We derive $n$ variations of $s$ by shuffling the tokens in $s$, i.e. each variant $s_1, \\dots, s_n$ has the same tokens as $s$, but in a different, unique order.\n",
    "Because shuffling only reorders tokens, every variant has the same token counts as $s$ and therefore the same entropy.\n",
    "\n",
    "We create the full string $S$ by concatenating $m$ copies of each substring variant $s_i$ in random order.\n",
    "For example, for substring variants $s_1, s_2, s_3$ and $m = 2$, $S$ might have the form $S = s_2 \\circ s_3 \\circ s_2 \\circ s_1 \\circ s_1 \\circ s_3$, or some other ordering of the $s_i$ copies.\n",
    "This process creates a long random string whose entropy level is set by the base substring $s$ and whose regularity is set by the number and frequency of its repeated patterns.\n",
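    "\n",
    "As a rough sanity check on the sizes involved, and assuming that each of the $n$ variants appears exactly $m$ times, the length of $S$ in tokens is\n",
    "\n",
    "$$|S| = n \\cdot m \\cdot k.$$\n",
    "\n",
    "Under that assumption, a 1024-token string built from 8 unique substrings of length 8 would contain $m = 1024 / (8 \\cdot 8) = 16$ copies of each variant.\n",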
    "\n",
    "### Evaluation\n",
    "\n",
    "Numbers in parentheses denote $m$, the number of copies of each unique substring.\n",
    "\n",
    "\n",
    "## Takeaways"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e9c98c3-7b94-490f-ba1e-87a673fbabfc",
   "metadata": {},
   "source": [
    "## 1024 tokens, 16 unique substrings, substring length 8\n",
    "\n",
    "Results are averaged over 3 repetitions with different random seeds (see `seed_ids` in the next cell)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a279c98b-dac4-407d-85b5-6f716c0db15c",
   "metadata": {},
   "outputs": [],
   "source": [
    "seed_ids = list(range(3))\n",
    "results = {\n",
    "    f\"A {alphabet_size}\": res_util.load(f\"pyt-1b_a-{alphabet_size}_sl-16_ns-8\", seed_ids)\n",
    "    for alphabet_size in [2, 4, 7, 13, 26]\n",
    "} | {\n",
    "    f\"H {target_size}\": res_util.load(\n",
    "        f\"pyt-1b_h-{target_size}_sl-16_ns-8\",\n",
    "        # Seed 3 is excluded for H 2; build a list rather than a lazy filter object.\n",
    "        [s for s in seed_ids if s != 3] if target_size == 2 else seed_ids,\n",
    "    )\n",
    "    for target_size in [2, 4, 7, 13]\n",
    "}\n",
    "res_util.show_results(results, \"Substring Length\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04664389-e0f4-458f-a15e-8743e60c7c41",
   "metadata": {},
   "source": [
    "## 1024 tokens, 8 unique substrings, substring length 8\n",
    "\n",
    "Results are averaged over 3 repetitions with different random seeds (see `seed_ids` in the next cell)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2248edaa-5a05-43c4-a763-44118bc38749",
   "metadata": {},
   "outputs": [],
   "source": [
    "seed_ids = list(range(3))\n",
    "results = {\n",
    "    f\"A {alphabet_size}\": res_util.load(f\"pyt-1b_a-{alphabet_size}_sl-8_ns-8\", seed_ids)\n",
    "    for alphabet_size in [2, 4, 7, 13, 26]\n",
    "} | {\n",
    "    f\"H {target_size}\": res_util.load(\n",
    "        f\"pyt-1b_h-{target_size}_sl-8_ns-8\",\n",
    "        # Seed 3 is excluded for H 2; build a list rather than a lazy filter object.\n",
    "        [s for s in seed_ids if s != 3] if target_size == 2 else seed_ids,\n",
    "    )\n",
    "    for target_size in [2, 4, 7, 13]\n",
    "}\n",
    "res_util.show_results(results, \"Substring Length\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3341cec-0464-4638-a04a-ba103bad739c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Upload\n",
    "res_util.publish(\"entropy\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de626a7a-9492-4fe0-ad94-b525e91d1c84",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
