{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "W6UyzFGfBSdA"
      },
      "source": [
        "# Understanding SAE Features with the Logit Lens\n",
        "\n",
        "This notebook demonstrates how to use the mats_sae_training library to perform the analysis documented the post \"[Understanding SAE Features with the Logit Lens](https://www.alignmentforum.org/posts/qykrYY6rXXM7EEs8Q/understanding-sae-features-with-the-logit-lens)\".\n",
        "\n",
        "As such, the notebook will include sections for:\n",
        "- Loading in GPT2-Small Residual Stream SAEs from Huggingface.\n",
        "- Performing Virtual Weight Based Analysis of features (specifically looking at the logit weight distributions).\n",
        "- Programmatically opening neuronpedia tabs to engage with public dashboards on [neuronpedia](https://www.neuronpedia.org/).\n",
        "- Performing Token Set Enrichment Analysis (based on Gene Set Enrichment Analysis)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ff-PciWXBSdB"
      },
      "source": [
        "## Set Up\n",
        "\n",
        "Here we'll load various functions for things like:\n",
        "- downloading and loading our SAEs from huggingface.\n",
        "- opening neuronpedia from a jupyter cell.\n",
        "- calculating statistics of the logit weight distributions.\n",
        "- performing Token Set Enrichment Analysis (TSEA) and plotting the results."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vImmTg-8BSdC"
      },
      "source": [
        "### Imports"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sOd2C0e1BfN1"
      },
      "outputs": [],
      "source": [
        "try:\n",
        "    # For Google Colab, a high RAM instance is needed\n",
        "    import google.colab # type: ignore\n",
        "    from google.colab import output\n",
        "    %pip install sae-lens transformer-lens\n",
        "except:\n",
        "    from IPython import get_ipython # type: ignore\n",
        "    ipython = get_ipython(); assert ipython is not None\n",
        "    ipython.run_line_magic(\"load_ext\", \"autoreload\")\n",
        "    ipython.run_line_magic(\"autoreload\", \"2\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QmdAd_25BSdC"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import torch\n",
        "import plotly_express as px\n",
        "\n",
        "from transformer_lens import HookedTransformer\n",
        "\n",
        "# Model Loading\n",
        "from sae_lens import SAE\n",
        "from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list\n",
        "\n",
        "# Virtual Weight / Feature Statistics Functions\n",
        "from sae_lens.analysis.feature_statistics import (\n",
        "    get_all_stats_dfs,\n",
        "    get_W_U_W_dec_stats_df,\n",
        ")\n",
        "\n",
        "# Enrichment Analysis Functions\n",
        "from sae_lens.analysis.tsea import (\n",
        "    get_enrichment_df,\n",
        "    manhattan_plot_enrichment_scores,\n",
        "    plot_top_k_feature_projections_by_token_and_category,\n",
        ")\n",
        "from sae_lens.analysis.tsea import (\n",
        "    get_baby_name_sets,\n",
        "    get_letter_gene_sets,\n",
        "    generate_pos_sets,\n",
        "    get_test_gene_sets,\n",
        "    get_gene_set_from_regex,\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0VEAe5FjBSdD"
      },
      "source": [
        "### Loading GPT2 Small and SAE Weights\n",
        "\n",
        "This will take a while the first time you run it, but will be quick thereafter."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HQ904zDOBSdD"
      },
      "outputs": [],
      "source": [
        "model = HookedTransformer.from_pretrained(\"gpt2-small\")\n",
        "# this is an outdated way to load the SAE. We need to have feature spartisity loadable through the new interface to remove it.\n",
        "gpt2_small_sparse_autoencoders = {}\n",
        "gpt2_small_sae_sparsities = {}\n",
        "\n",
        "for layer in range(12):\n",
        "    sae, original_cfg_dict, sparsity = SAE.from_pretrained(\n",
        "        release=\"gpt2-small-res-jb\",\n",
        "        sae_id=\"blocks.0.hook_resid_pre\",\n",
        "        device=\"cpu\",\n",
        "    )\n",
        "    gpt2_small_sparse_autoencoders[f\"blocks.{layer}.hook_resid_pre\"] = sae\n",
        "    gpt2_small_sae_sparsities[f\"blocks.{layer}.hook_resid_pre\"] = sparsity\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4pFYJKeNBSdD"
      },
      "source": [
        "# Statistical Properties of Feature Logit Distributions\n",
        "\n",
        "In the post I study layer 8 (for no particular reason). At the end of this notebook is code for visualizing these statistics across all layers. Feel free to change the layer here and explore different layers."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LEK1jEpEBSdD"
      },
      "outputs": [],
      "source": [
        "# In the post, I focus on layer 8\n",
        "layer = 8\n",
        "\n",
        "# get the corresponding SAE and feature sparsities.\n",
        "sparse_autoencoder = gpt2_small_sparse_autoencoders[f\"blocks.{layer}.hook_resid_pre\"]\n",
        "log_feature_sparsity = gpt2_small_sae_sparsities[f\"blocks.{layer}.hook_resid_pre\"].cpu()\n",
        "\n",
        "W_dec = sparse_autoencoder.W_dec.detach().cpu()\n",
        "\n",
        "# calculate the statistics of the logit weight distributions\n",
        "W_U_stats_df_dec, dec_projection_onto_W_U = get_W_U_W_dec_stats_df(\n",
        "    W_dec, model, cosine_sim=False\n",
        ")\n",
        "W_U_stats_df_dec[\"sparsity\"] = (\n",
        "    log_feature_sparsity  # add feature sparsity since it is often interesting.\n",
        ")\n",
        "display(W_U_stats_df_dec)"
      ]
    },
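    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For intuition, here is a minimal sketch of what a logit weight distribution and its moments look like, using random toy matrices rather than the real model. The names `W_dec_toy` and `W_U_toy` and all shapes are hypothetical, and `get_W_U_W_dec_stats_df` may use different conventions (for example cosine similarity or excess kurtosis), so treat this as an illustration rather than the library's implementation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "# Toy stand-ins: the shapes are hypothetical and chosen only for illustration.\n",
        "rng = np.random.default_rng(0)\n",
        "n_features, d_model, d_vocab = 4, 8, 50\n",
        "W_dec_toy = rng.normal(size=(n_features, d_model))  # one decoder direction per feature\n",
        "W_U_toy = rng.normal(size=(d_model, d_vocab))  # unembedding matrix\n",
        "\n",
        "# Logit weights: the direct effect of each feature on every vocabulary logit.\n",
        "logit_weights = W_dec_toy @ W_U_toy  # shape (n_features, d_vocab)\n",
        "\n",
        "# Standardise each feature's distribution over the vocabulary.\n",
        "mean = logit_weights.mean(axis=1, keepdims=True)\n",
        "std = logit_weights.std(axis=1, keepdims=True)\n",
        "z = (logit_weights - mean) / std\n",
        "\n",
        "skewness = (z**3).mean(axis=1)  # third standardised moment\n",
        "kurtosis = (z**4).mean(axis=1)  # fourth standardised moment (3 for a Gaussian)\n",
        "\n",
        "print(logit_weights.shape, skewness.shape, kurtosis.shape)"
      ]
    },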
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GK56PDl3BSdE"
      },
      "outputs": [],
      "source": [
        "# Let's look at the distribution of the 3rd / 4th moments. I found these aren't as useful on their own as joint distributions can be.\n",
        "px.histogram(\n",
        "    W_U_stats_df_dec,\n",
        "    x=\"skewness\",\n",
        "    width=800,\n",
        "    height=300,\n",
        "    nbins=1000,\n",
        "    title=\"Skewness of the Logit Weight Distributions\",\n",
        ").show()\n",
        "\n",
        "px.histogram(\n",
        "    W_U_stats_df_dec,\n",
        "    x=np.log10(W_U_stats_df_dec[\"kurtosis\"]),\n",
        "    width=800,\n",
        "    height=300,\n",
        "    nbins=1000,\n",
        "    title=\"Kurtosis of the Logit Weight Distributions\",\n",
        ").show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4_6c12ftBSdE"
      },
      "outputs": [],
      "source": [
        "fig = px.scatter(\n",
        "    W_U_stats_df_dec,\n",
        "    x=\"skewness\",\n",
        "    y=\"kurtosis\",\n",
        "    color=\"std\",\n",
        "    color_continuous_scale=\"Portland\",\n",
        "    hover_name=\"feature\",\n",
        "    width=800,\n",
        "    height=500,\n",
        "    log_y=True,  # Kurtosis has larger outliers so logging creates a nicer scale.\n",
        "    labels={\"x\": \"Skewness\", \"y\": \"Kurtosis\", \"color\": \"Standard Deviation\"},\n",
        "    title=f\"Layer {8}: Skewness vs Kurtosis of the Logit Weight Distributions\",\n",
        ")\n",
        "\n",
        "# decrease point size\n",
        "fig.update_traces(marker=dict(size=3))\n",
        "\n",
        "\n",
        "fig.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KuWSPqqjBSdE"
      },
      "outputs": [],
      "source": [
        "# then you can query accross combinations of the statistics to find features of interest and open them in neuronpedia.\n",
        "tmp_df = W_U_stats_df_dec[[\"feature\", \"skewness\", \"kurtosis\", \"std\"]]\n",
        "# tmp_df = tmp_df[(tmp_df[\"std\"] > 0.04)]\n",
        "# tmp_df = tmp_df[(tmp_df[\"skewness\"] > 0.65)]\n",
        "tmp_df = tmp_df[(tmp_df[\"skewness\"] > 3)]\n",
        "tmp_df = tmp_df.sort_values(\"skewness\", ascending=False).head(10)\n",
        "display(tmp_df)\n",
        "\n",
        "# if desired, open the features in neuronpedia\n",
        "get_neuronpedia_quick_list(list(tmp_df.feature), layer=layer)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AHGG86wkBSdE"
      },
      "source": [
        "# Token Set Enrichment Analysis\n",
        "\n",
        "We now proceed to token set enrichment analysis.  I highly recommend reading my AlignmentForum post (espeically the case studies) before reading too much into any of these results.\n",
        "Also read this [post](https://transformer-circuits.pub/2024/qualitative-essay/index.html) for good general perspectives on statistics here."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J4tuqT9SBSdE"
      },
      "source": [
        "## Defining Our Token Sets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7mUxqd6mBSdE"
      },
      "outputs": [],
      "source": [
        "import nltk\n",
        "\n",
        "nltk.download(\"averaged_perceptron_tagger\")\n",
        "# get the vocab we need to filter to formulate token sets.\n",
        "vocab = model.tokenizer.get_vocab()  # type: ignore\n",
        "\n",
        "# make a regex dictionary to specify more sets.\n",
        "regex_dict = {\n",
        "    \"starts_with_space\": r\"Ġ.*\",\n",
        "    \"starts_with_capital\": r\"^Ġ*[A-Z].*\",\n",
        "    \"starts_with_lower\": r\"^Ġ*[a-z].*\",\n",
        "    \"all_digits\": r\"^Ġ*\\d+$\",\n",
        "    \"is_punctuation\": r\"^[^\\w\\s]+$\",\n",
        "    \"contains_close_bracket\": r\".*\\).*\",\n",
        "    \"contains_open_bracket\": r\".*\\(.*\",\n",
        "    \"all_caps\": r\"Ġ*[A-Z]+$\",\n",
        "    \"1 digit\": r\"Ġ*\\d{1}$\",\n",
        "    \"2 digits\": r\"Ġ*\\d{2}$\",\n",
        "    \"3 digits\": r\"Ġ*\\d{3}$\",\n",
        "    \"4 digits\": r\"Ġ*\\d{4}$\",\n",
        "    \"length_1\": r\"^Ġ*\\w{1}$\",\n",
        "    \"length_2\": r\"^Ġ*\\w{2}$\",\n",
        "    \"length_3\": r\"^Ġ*\\w{3}$\",\n",
        "    \"length_4\": r\"^Ġ*\\w{4}$\",\n",
        "    \"length_5\": r\"^Ġ*\\w{5}$\",\n",
        "}\n",
        "\n",
        "# print size of gene sets\n",
        "all_token_sets = get_letter_gene_sets(vocab)\n",
        "for key, value in regex_dict.items():\n",
        "    gene_set = get_gene_set_from_regex(vocab, value)\n",
        "    all_token_sets[key] = gene_set\n",
        "\n",
        "# some other sets that can be interesting\n",
        "baby_name_sets = get_baby_name_sets(vocab)\n",
        "pos_sets = generate_pos_sets(vocab)\n",
        "arbitrary_sets = get_test_gene_sets(model)\n",
        "\n",
        "all_token_sets = {**all_token_sets, **pos_sets}\n",
        "all_token_sets = {**all_token_sets, **arbitrary_sets}\n",
        "all_token_sets = {**all_token_sets, **baby_name_sets}\n",
        "\n",
        "# for each gene set, convert to string and  print the first 5 tokens\n",
        "for token_set_name, gene_set in sorted(\n",
        "    all_token_sets.items(), key=lambda x: len(x[1]), reverse=True\n",
        "):\n",
        "    tokens = [model.to_string(id) for id in list(gene_set)][:10]  # type: ignore\n",
        "    print(f\"{token_set_name}, has {len(gene_set)} genes\")\n",
        "    print(tokens)\n",
        "    print(\"----\")"
      ]
    },
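    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a rough illustration of how a regex can define a token set, the sketch below filters a toy vocabulary with `re.match`. The helper `token_ids_matching` and the toy vocabulary are hypothetical; `get_gene_set_from_regex` may differ in details such as its matching mode or return type."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import re\n",
        "\n",
        "\n",
        "def token_ids_matching(vocab, pattern):\n",
        "    # Return the ids of all tokens whose string matches the regex from its start.\n",
        "    compiled = re.compile(pattern)\n",
        "    return [token_id for token, token_id in vocab.items() if compiled.match(token)]\n",
        "\n",
        "\n",
        "# Hypothetical toy vocabulary in the GPT-2 style, where \"Ġ\" marks a leading space.\n",
        "toy_vocab = {\"Ġthe\": 0, \"The\": 1, \"Ġ42\": 2, \"7\": 3, \"(\": 4, \"ĠNASA\": 5}\n",
        "\n",
        "print(token_ids_matching(toy_vocab, r\"Ġ.*\"))  # tokens that start with a space\n",
        "print(token_ids_matching(toy_vocab, r\"^Ġ*\\d+$\"))  # tokens made only of digits\n",
        "print(token_ids_matching(toy_vocab, r\"^Ġ*[A-Z].*\"))  # tokens starting with a capital"
      ]
    },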
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vxctX05KBSdE"
      },
      "source": [
        "## Performing Token Set Enrichment Analysis\n",
        "\n",
        "Below we perform token set enrichment analysis on various token sets. In practice, we'd likely perform tests accross all tokens and large libraries of sets simultaneously but to make it easier to run, we look at features with higher skew and select of a few token sets at a time to consider."
      ]
    },
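    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the idea concrete, here is a minimal sketch of the classic unweighted running-sum statistic from Gene Set Enrichment Analysis, applied to a single toy feature. The function name `enrichment_score` and the toy data are hypothetical, and the sketch assumes the token set is a non-empty proper subset of the vocabulary. The scores returned by `get_enrichment_df` may be weighted or scaled differently, so this illustrates the technique rather than reproducing the library's numbers."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "\n",
        "def enrichment_score(scores, token_set):\n",
        "    # Unweighted GSEA-style running-sum statistic for one feature.\n",
        "    order = np.argsort(-scores)  # token ids ranked from highest to lowest score\n",
        "    in_set = np.isin(order, list(token_set))\n",
        "    n_hit = in_set.sum()\n",
        "    n_miss = len(scores) - n_hit\n",
        "    # Step up by 1/n_hit at each set member and down by 1/n_miss otherwise.\n",
        "    steps = np.where(in_set, 1.0 / n_hit, -1.0 / n_miss)\n",
        "    running_sum = np.cumsum(steps)\n",
        "    # The enrichment score is the largest deviation of the walk from zero.\n",
        "    return running_sum[np.argmax(np.abs(running_sum))]\n",
        "\n",
        "\n",
        "# Toy logit weights for one feature over a six-token vocabulary (hypothetical values).\n",
        "toy_scores = np.array([0.9, 0.8, 0.1, -0.2, 0.7, -0.5])\n",
        "print(enrichment_score(toy_scores, {0, 1, 4}))  # set holds the three top-ranked tokens\n",
        "print(enrichment_score(toy_scores, {3, 5}))  # set holds the two bottom-ranked tokens"
      ]
    },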
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "eHwM7qVlBSdF"
      },
      "outputs": [],
      "source": [
        "features_ordered_by_skew = (\n",
        "    W_U_stats_df_dec[\"skewness\"].sort_values(ascending=False).head(5000).index.to_list()\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QxuiBx4NBSdF"
      },
      "outputs": [],
      "source": [
        "# filter our list.\n",
        "token_sets_index = [\n",
        "    \"starts_with_space\",\n",
        "    \"starts_with_capital\",\n",
        "    \"all_digits\",\n",
        "    \"is_punctuation\",\n",
        "    \"all_caps\",\n",
        "]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "\n",
        "# calculate the enrichment scores\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U,  # use the logit weight values as our rankings over tokens.\n",
        "    features_ordered_by_skew,  # subset by these features\n",
        "    token_set_selected,  # use token_sets\n",
        ")\n",
        "\n",
        "manhattan_plot_enrichment_scores(\n",
        "    df_enrichment_scores, label_threshold=0, top_n=3  # use our enrichment scores\n",
        ").show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "yQ0n0aKTBSdF"
      },
      "outputs": [],
      "source": [
        "fig = px.scatter(\n",
        "    df_enrichment_scores.apply(lambda x: -1 * np.log(1 - x)).T,\n",
        "    x=\"starts_with_space\",\n",
        "    y=\"starts_with_capital\",\n",
        "    marginal_x=\"histogram\",\n",
        "    marginal_y=\"histogram\",\n",
        "    labels={\n",
        "        \"starts_with_space\": \"Starts with Space\",\n",
        "        \"starts_with_capital\": \"Starts with Capital\",\n",
        "    },\n",
        "    title=\"Enrichment Scores for Starts with Space vs Starts with Capital\",\n",
        "    height=800,\n",
        "    width=800,\n",
        ")\n",
        "# reduce point size on the scatter only\n",
        "fig.update_traces(marker=dict(size=2), selector=dict(mode=\"markers\"))\n",
        "fig.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RcmU_6I9BSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"1 digit\", \"2 digits\", \"3 digits\", \"4 digits\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BboYni5ZBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"nltk_pos_PRP\", \"nltk_pos_VBZ\", \"nltk_pos_NNP\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xKrcnE7GBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"nltk_pos_VBN\", \"nltk_pos_VBG\", \"nltk_pos_VB\", \"nltk_pos_VBD\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sKVdyyxoBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"nltk_pos_WP\", \"nltk_pos_RBR\", \"nltk_pos_WDT\", \"nltk_pos_RB\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "YYC3GFscBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"a\", \"e\", \"i\", \"o\", \"u\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JvmgQ_YmBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"negative_words\", \"positive_words\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ltYmy5lMBSdF"
      },
      "outputs": [],
      "source": [
        "fig = px.scatter(\n",
        "    df_enrichment_scores.apply(lambda x: -1 * np.log(1 - x))\n",
        "    .T.reset_index()\n",
        "    .rename(columns={\"index\": \"feature\"}),\n",
        "    x=\"negative_words\",\n",
        "    y=\"positive_words\",\n",
        "    marginal_x=\"histogram\",\n",
        "    marginal_y=\"histogram\",\n",
        "    labels={\n",
        "        \"starts_with_space\": \"Starts with Space\",\n",
        "        \"starts_with_capital\": \"Starts with Capital\",\n",
        "    },\n",
        "    title=\"Enrichment Scores for Starts with Space vs Starts with Capital\",\n",
        "    height=800,\n",
        "    width=800,\n",
        "    hover_name=\"feature\",\n",
        ")\n",
        "# reduce point size on the scatter only\n",
        "fig.update_traces(marker=dict(size=2), selector=dict(mode=\"markers\"))\n",
        "fig.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5otoyu2SBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"contains_close_bracket\", \"contains_open_bracket\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JlN-ScWhBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\n",
        "    \"1910's\",\n",
        "    \"1920's\",\n",
        "    \"1930's\",\n",
        "    \"1940's\",\n",
        "    \"1950's\",\n",
        "    \"1960's\",\n",
        "    \"1970's\",\n",
        "    \"1980's\",\n",
        "    \"1990's\",\n",
        "    \"2000's\",\n",
        "    \"2010's\",\n",
        "]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SDmCs9OhBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"positive_words\", \"negative_words\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores, label_threshold=0.98).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ilzC33VlBSdF"
      },
      "outputs": [],
      "source": [
        "token_sets_index = [\"boys_names\", \"girls_names\"]\n",
        "token_set_selected = {\n",
        "    k: set(v) for k, v in all_token_sets.items() if k in token_sets_index\n",
        "}\n",
        "df_enrichment_scores = get_enrichment_df(\n",
        "    dec_projection_onto_W_U, features_ordered_by_skew, token_set_selected\n",
        ")\n",
        "manhattan_plot_enrichment_scores(df_enrichment_scores).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "D0mTCic7BSdG"
      },
      "outputs": [],
      "source": [
        "tmp_df = df_enrichment_scores.apply(lambda x: -1 * np.log(1 - x)).T\n",
        "color = (\n",
        "    W_U_stats_df_dec.sort_values(\"skewness\", ascending=False)\n",
        "    .head(5000)[\"skewness\"]\n",
        "    .values\n",
        ")\n",
        "fig = px.scatter(\n",
        "    tmp_df.reset_index().rename(columns={\"index\": \"feature\"}),\n",
        "    x=\"boys_names\",\n",
        "    y=\"girls_names\",\n",
        "    marginal_x=\"histogram\",\n",
        "    marginal_y=\"histogram\",\n",
        "    # color = color,\n",
        "    labels={\n",
        "        \"boys_names\": \"Enrichment Score (Boys Names)\",\n",
        "        \"girls_names\": \"Enrichment Score (Girls Names)\",\n",
        "    },\n",
        "    height=600,\n",
        "    width=800,\n",
        "    hover_name=\"feature\",\n",
        ")\n",
        "# reduce point size on the scatter only\n",
        "fig.update_traces(marker=dict(size=3), selector=dict(mode=\"markers\"))\n",
        "# annotate any features where the absolute distance between boys names and girls names > 3\n",
        "for feature in df_enrichment_scores.columns:\n",
        "    if abs(tmp_df[\"boys_names\"][feature] - tmp_df[\"girls_names\"][feature]) > 2.9:\n",
        "        fig.add_annotation(\n",
        "            x=tmp_df[\"boys_names\"][feature] - 0.4,\n",
        "            y=tmp_df[\"girls_names\"][feature] + 0.1,\n",
        "            text=f\"{feature}\",\n",
        "            showarrow=False,\n",
        "        )\n",
        "\n",
        "\n",
        "fig.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VwVcsxnkBSdG"
      },
      "source": [
        "## Digging into Particular Features\n",
        "\n",
        "When we do these enrichments, I generate the logit weight histograms by category using the following function. It's important to make sure the categories you group by are in the columns of df_enrichment_scores."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GgndTFdFBSdG"
      },
      "outputs": [],
      "source": [
        "for category in [\"boys_names\"]:\n",
        "    plot_top_k_feature_projections_by_token_and_category(\n",
        "        token_set_selected,\n",
        "        df_enrichment_scores,\n",
        "        category=category,\n",
        "        dec_projection_onto_W_U=dec_projection_onto_W_U,\n",
        "        model=model,\n",
        "        log_y=False,\n",
        "        histnorm=None,\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AKP3u0D2BSdG"
      },
      "source": [
        "# Appendix Results: Logit Weight distribution Statistics Accross All Layers"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Vhht2iCQBSdG"
      },
      "outputs": [],
      "source": [
        "W_U_stats_df_dec_all_layers = get_all_stats_dfs(\n",
        "    gpt2_small_sparse_autoencoders, gpt2_small_sae_sparsities, model, cosine_sim=True\n",
        ")\n",
        "\n",
        "display(W_U_stats_df_dec_all_layers.shape)\n",
        "display(W_U_stats_df_dec_all_layers.head())"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ok3DgORLBSdG"
      },
      "outputs": [],
      "source": [
        "# Let's plot the percentiles of the skewness and kurtosis by layer\n",
        "tmp_df = W_U_stats_df_dec_all_layers.groupby(\"layer\")[\"skewness\"].describe(\n",
        "    percentiles=[0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]\n",
        ")\n",
        "tmp_df = tmp_df[[\"1%\", \"5%\", \"10%\", \"25%\", \"50%\", \"75%\", \"90%\", \"95%\", \"99%\"]]\n",
        "\n",
        "fig = px.area(\n",
        "    tmp_df,\n",
        "    title=\"Skewness by Layer\",\n",
        "    width=800,\n",
        "    height=600,\n",
        "    color_discrete_sequence=px.colors.sequential.Turbo,\n",
        ").show()\n",
        "\n",
        "\n",
        "tmp_df = W_U_stats_df_dec_all_layers.groupby(\"layer\")[\"kurtosis\"].describe(\n",
        "    percentiles=[0.01, 0.05, 0.10, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]\n",
        ")\n",
        "tmp_df = tmp_df[[\"1%\", \"5%\", \"10%\", \"25%\", \"50%\", \"75%\", \"90%\", \"95%\", \"99%\"]]\n",
        "\n",
        "fig = px.area(\n",
        "    tmp_df,\n",
        "    title=\"Kurtosis by Layer\",\n",
        "    width=800,\n",
        "    height=600,\n",
        "    color_discrete_sequence=px.colors.sequential.Turbo,\n",
        ")\n",
        "\n",
        "fig.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UOkRZiunBSdG"
      },
      "outputs": [],
      "source": [
        "# let's make a pretty color scheme\n",
        "from plotly.colors import n_colors\n",
        "\n",
        "colors = n_colors(\"rgb(5, 200, 200)\", \"rgb(200, 10, 10)\", 13, colortype=\"rgb\")\n",
        "\n",
        "# Make a box plot of the skewness by layer\n",
        "fig = px.box(\n",
        "    W_U_stats_df_dec_all_layers,\n",
        "    x=\"layer\",\n",
        "    y=\"skewness\",\n",
        "    color=\"layer\",\n",
        "    color_discrete_sequence=colors,\n",
        "    height=600,\n",
        "    width=1200,\n",
        "    title=\"Skewness cos(W_U,W_dec) by Layer in GPT2 Small Residual Stream SAEs\",\n",
        "    labels={\"layer\": \"Layer\", \"skewnss\": \"Skewness\"},\n",
        ")\n",
        "fig.update_xaxes(showticklabels=True, dtick=1)\n",
        "\n",
        "# increase font size\n",
        "fig.update_layout(font=dict(size=16))\n",
        "fig.show()\n",
        "\n",
        "# Make a box plot of the skewness by layer\n",
        "fig = px.box(\n",
        "    W_U_stats_df_dec_all_layers,\n",
        "    x=\"layer\",\n",
        "    y=\"kurtosis\",\n",
        "    color=\"layer\",\n",
        "    color_discrete_sequence=colors,\n",
        "    height=600,\n",
        "    width=1200,\n",
        "    log_y=True,\n",
        "    title=\"log kurtosis cos(W_U,W_dec) by Layer in GPT2 Small Residual Stream SAEs\",\n",
        "    labels={\"layer\": \"Layer\", \"kurtosis\": \"Log Kurtosis\"},\n",
        ")\n",
        "fig.update_xaxes(showticklabels=True, dtick=1)\n",
        "\n",
        "# increase font size\n",
        "fig.update_layout(font=dict(size=16))\n",
        "fig.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "hYNYdY3wBSdG"
      },
      "outputs": [],
      "source": [
        "# scatter\n",
        "fig = px.scatter(\n",
        "    W_U_stats_df_dec_all_layers[W_U_stats_df_dec_all_layers.log_feature_sparsity >= -9],\n",
        "    # W_U_stats_df_dec_all_layers[W_U_stats_df_dec_all_layers.layer == 8],\n",
        "    x=\"skewness\",\n",
        "    y=\"kurtosis\",\n",
        "    color=\"std\",\n",
        "    color_continuous_scale=\"Portland\",\n",
        "    hover_name=\"feature\",\n",
        "    # color_continuous_midpoint = 0,\n",
        "    # range_color = [-4,-1],\n",
        "    log_y=True,\n",
        "    height=800,\n",
        "    # width = 2000,\n",
        "    # facet_col=\"layer\",\n",
        "    # facet_col_wrap=5,\n",
        "    animation_frame=\"layer\",\n",
        ")\n",
        "fig.update_yaxes(matches=None)\n",
        "fig.for_each_yaxis(lambda yaxis: yaxis.update(showticklabels=True))\n",
        "\n",
        "# decrease point size\n",
        "fig.update_traces(marker=dict(size=5))\n",
        "fig.show()\n",
        "fig.write_html(\"skewness_kurtosis_scatter_all_layers.html\")"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "machine_shape": "hm",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.7"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
