{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Context Aware Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:35:41.180578Z",
     "start_time": "2024-09-03T02:35:41.174816Z"
    }
   },
   "outputs": [],
   "source": [
    "# For data manipulation and analysis\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# For text preprocessing\n",
    "import re\n",
    "import nltk\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "import datetime\n",
    "import string\n",
    "\n",
    "# For multilabel classification\n",
    "from sklearn.preprocessing import MultiLabelBinarizer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.multiclass import OneVsRestClassifier\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "\n",
    "# For neural networks\n",
    "\n",
    "\n",
    "\n",
    "# For model evaluation\n",
    "from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Context Vectors (For CB Model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Explicit Context - extract context features from MovieLens dataframe or imdb link \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Movies: \n",
    "Movie Ids\n",
    "---------\n",
    "\n",
    "Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id `1` corresponds to the URL <https://movielens.org/movies/1>). Movie ids are consistent between `ratings.csv`, `tags.csv`, `movies.csv`, and `links.csv` (i.e., the same id refers to the same movie across these four data files).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Movies Data File Structure (movies.csv)\n",
    "---------------------------------------\n",
    "\n",
    "Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:\n",
    "\n",
    "    movieId,title,genres\n",
    "\n",
    "Movie titles are entered manually or imported from <https://www.themoviedb.org/>, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.\n",
    "\n",
    "Genres are a pipe-separated list, and are selected from the following:\n",
    "\n",
    "* Action\n",
    "* Adventure\n",
    "* Animation\n",
    "* Children's\n",
    "* Comedy\n",
    "* Crime\n",
    "* Documentary\n",
    "* Drama\n",
    "* Fantasy\n",
    "* Film-Noir\n",
    "* Horror\n",
    "* Musical\n",
    "* Mystery\n",
    "* Romance\n",
    "* Sci-Fi\n",
    "* Thriller\n",
    "* War\n",
    "* Western\n",
    "* (no genres listed)"
   ]
  },
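  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how the pipe-separated `genres` field could be turned into a multi-hot matrix with the `MultiLabelBinarizer` imported at the top of this notebook. The three-row `sample` frame and the names `genre_lists` and `genre_matrix` are illustrative assumptions, not part of the dataset files or of the pipeline that follows.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from sklearn.preprocessing import MultiLabelBinarizer\n",
    "\n",
    "# Illustrative rows in the movies.csv layout (not read from disk)\n",
    "sample = pd.DataFrame({\n",
    "    'movieId': [1, 5, 6],\n",
    "    'title': ['Toy Story (1995)', 'Father of the Bride Part II (1995)', 'Heat (1995)'],\n",
    "    'genres': ['Adventure|Animation|Children|Comedy|Fantasy', 'Comedy', 'Action|Crime|Thriller'],\n",
    "})\n",
    "\n",
    "# Split each pipe-separated string into a list of genre labels\n",
    "genre_lists = sample['genres'].str.split('|')\n",
    "\n",
    "# One binary column per genre, one row per movie\n",
    "mlb = MultiLabelBinarizer()\n",
    "genre_matrix = pd.DataFrame(\n",
    "    mlb.fit_transform(genre_lists),\n",
    "    columns=mlb.classes_,\n",
    "    index=sample['movieId'],\n",
    ")\n",
    "print(genre_matrix)\n",
    "```\n",
    "\n",
    "Keeping `movieId` as the index makes it straightforward to join the genre columns back onto other per-movie features."
   ]
  },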
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:35:48.852815Z",
     "start_time": "2024-09-03T02:35:48.842698Z"
    }
   },
   "outputs": [],
   "source": [
    "movies = pd.read_csv(\"../dataset/ml-20m/filtered_movies.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:35:50.951665Z",
     "start_time": "2024-09-03T02:35:50.943367Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>Action|Crime|Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6232</th>\n",
       "      <td>130856</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>Comedy|Documentary</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6233</th>\n",
       "      <td>130958</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>Horror</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6234</th>\n",
       "      <td>130984</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>Action|Fantasy|Horror</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6235</th>\n",
       "      <td>131011</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>Crime|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6236</th>\n",
       "      <td>131015</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>Horror|Thriller</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6237 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId                               title  \\\n",
       "0           1                    Toy Story (1995)   \n",
       "1           2                      Jumanji (1995)   \n",
       "2           4            Waiting to Exhale (1995)   \n",
       "3           5  Father of the Bride Part II (1995)   \n",
       "4           6                         Heat (1995)   \n",
       "...       ...                                 ...   \n",
       "6232   130856                 Severe Clear (2010)   \n",
       "6233   130958             Killer Crocodile (1989)   \n",
       "6234   130984          Santo vs. las lobas (1976)   \n",
       "6235   131011              Execution Squad (1972)   \n",
       "6236   131015                     Hellgate (2011)   \n",
       "\n",
       "                                           genres  \n",
       "0     Adventure|Animation|Children|Comedy|Fantasy  \n",
       "1                      Adventure|Children|Fantasy  \n",
       "2                            Comedy|Drama|Romance  \n",
       "3                                          Comedy  \n",
       "4                           Action|Crime|Thriller  \n",
       "...                                           ...  \n",
       "6232                           Comedy|Documentary  \n",
       "6233                                       Horror  \n",
       "6234                        Action|Fantasy|Horror  \n",
       "6235                                  Crime|Drama  \n",
       "6236                              Horror|Thriller  \n",
       "\n",
       "[6237 rows x 3 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies.drop(columns=['Unnamed: 0'], inplace=True)\n",
    "\n",
    "movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:35:54.609053Z",
     "start_time": "2024-09-03T02:35:54.599438Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>Action|Crime|Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>7</td>\n",
       "      <td>Sabrina (1995)</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>10</td>\n",
       "      <td>GoldenEye (1995)</td>\n",
       "      <td>Action|Adventure|Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>11</td>\n",
       "      <td>American President, The (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>12</td>\n",
       "      <td>Dracula: Dead and Loving It (1995)</td>\n",
       "      <td>Comedy|Horror</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>14</td>\n",
       "      <td>Nixon (1995)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId                               title  \\\n",
       "0        1                    Toy Story (1995)   \n",
       "1        2                      Jumanji (1995)   \n",
       "2        4            Waiting to Exhale (1995)   \n",
       "3        5  Father of the Bride Part II (1995)   \n",
       "4        6                         Heat (1995)   \n",
       "5        7                      Sabrina (1995)   \n",
       "6       10                    GoldenEye (1995)   \n",
       "7       11      American President, The (1995)   \n",
       "8       12  Dracula: Dead and Loving It (1995)   \n",
       "9       14                        Nixon (1995)   \n",
       "\n",
       "                                        genres  \n",
       "0  Adventure|Animation|Children|Comedy|Fantasy  \n",
       "1                   Adventure|Children|Fantasy  \n",
       "2                         Comedy|Drama|Romance  \n",
       "3                                       Comedy  \n",
       "4                        Action|Crime|Thriller  \n",
       "5                               Comedy|Romance  \n",
       "6                    Action|Adventure|Thriller  \n",
       "7                         Comedy|Drama|Romance  \n",
       "8                                Comedy|Horror  \n",
       "9                                        Drama  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Links Data File Structure (links.csv)\n",
    "---------------------------------------\n",
    "\n",
    "Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:\n",
    "\n",
    "    movieId,imdbId,tmdbId\n",
    "\n",
    "movieId is an identifier for movies used by <https://movielens.org>. E.g., the movie Toy Story has the link <https://movielens.org/movies/1>.\n",
    "\n",
    "imdbId is an identifier for movies used by <http://www.imdb.com>. E.g., the movie Toy Story has the link <http://www.imdb.com/title/tt0114709/>.\n",
    "\n",
    "tmdbId is an identifier for movies used by <https://www.themoviedb.org>. E.g., the movie Toy Story has the link <https://www.themoviedb.org/movie/862>.\n",
    "\n",
    "Use of the resources listed above is subject to the terms of each provider."
   ]
  },
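  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of how these identifiers map back to URLs: `links.csv` stores `imdbId` as a plain integer, so the leading zeros are lost, and pandas reads `tmdbId` as a float when some values are missing. The helper names `imdb_url` and `tmdb_url` are hypothetical; the URL patterns are the ones quoted above.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def imdb_url(imdb_id):\n",
    "    # Zero-pad the numeric id to at least 7 digits and add the 'tt' prefix\n",
    "    return f'http://www.imdb.com/title/tt{int(imdb_id):07d}/'\n",
    "\n",
    "def tmdb_url(tmdb_id):\n",
    "    # Return None when the id is missing (None or NaN)\n",
    "    if tmdb_id is None or (isinstance(tmdb_id, float) and math.isnan(tmdb_id)):\n",
    "        return None\n",
    "    return f'https://www.themoviedb.org/movie/{int(tmdb_id)}'\n",
    "\n",
    "print(imdb_url(114709))  # Toy Story's imdbId as stored in links.csv\n",
    "print(tmdb_url(862.0))   # Toy Story's tmdbId as read by pandas\n",
    "```\n",
    "\n",
    "Formatting through `int(...)` avoids URLs that end in `.0` when the id column has been read as a float."
   ]
  },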
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:36:02.214687Z",
     "start_time": "2024-09-03T02:36:02.206106Z"
    }
   },
   "outputs": [],
   "source": [
    "links = pd.read_csv(\"../dataset/ml-20m/links.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:36:02.751042Z",
     "start_time": "2024-09-03T02:36:02.744176Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>113228</td>\n",
       "      <td>15602.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7</td>\n",
       "      <td>114319</td>\n",
       "      <td>11860.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8</td>\n",
       "      <td>112302</td>\n",
       "      <td>45325.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>9</td>\n",
       "      <td>114576</td>\n",
       "      <td>9091.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>10</td>\n",
       "      <td>113189</td>\n",
       "      <td>710.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId  imdbId   tmdbId\n",
       "0        1  114709    862.0\n",
       "1        2  113497   8844.0\n",
       "2        3  113228  15602.0\n",
       "3        4  114885  31357.0\n",
       "4        5  113041  11862.0\n",
       "5        6  113277    949.0\n",
       "6        7  114319  11860.0\n",
       "7        8  112302  45325.0\n",
       "8        9  114576   9091.0\n",
       "9       10  113189    710.0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "links.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:36:06.065069Z",
     "start_time": "2024-09-03T02:36:06.056384Z"
    }
   },
   "outputs": [],
   "source": [
    "df_movies = pd.merge(links, movies, on='movieId', how='inner')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:36:07.065496Z",
     "start_time": "2024-09-03T02:36:07.061449Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949.0</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>Action|Crime|Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6232</th>\n",
       "      <td>130856</td>\n",
       "      <td>494826</td>\n",
       "      <td>48376.0</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>Comedy|Documentary</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6233</th>\n",
       "      <td>130958</td>\n",
       "      <td>143338</td>\n",
       "      <td>78402.0</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>Horror</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6234</th>\n",
       "      <td>130984</td>\n",
       "      <td>208423</td>\n",
       "      <td>317168.0</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>Action|Fantasy|Horror</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6235</th>\n",
       "      <td>131011</td>\n",
       "      <td>69109</td>\n",
       "      <td>79572.0</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>Crime|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6236</th>\n",
       "      <td>131015</td>\n",
       "      <td>1430116</td>\n",
       "      <td>143928.0</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>Horror|Thriller</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6237 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId   imdbId    tmdbId                               title  \\\n",
       "0           1   114709     862.0                    Toy Story (1995)   \n",
       "1           2   113497    8844.0                      Jumanji (1995)   \n",
       "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
       "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
       "4           6   113277     949.0                         Heat (1995)   \n",
       "...       ...      ...       ...                                 ...   \n",
       "6232   130856   494826   48376.0                 Severe Clear (2010)   \n",
       "6233   130958   143338   78402.0             Killer Crocodile (1989)   \n",
       "6234   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
       "6235   131011    69109   79572.0              Execution Squad (1972)   \n",
       "6236   131015  1430116  143928.0                     Hellgate (2011)   \n",
       "\n",
       "                                           genres  \n",
       "0     Adventure|Animation|Children|Comedy|Fantasy  \n",
       "1                      Adventure|Children|Fantasy  \n",
       "2                            Comedy|Drama|Romance  \n",
       "3                                          Comedy  \n",
       "4                           Action|Crime|Thriller  \n",
       "...                                           ...  \n",
       "6232                           Comedy|Documentary  \n",
       "6233                                       Horror  \n",
       "6234                        Action|Fantasy|Horror  \n",
       "6235                                  Crime|Drama  \n",
       "6236                              Horror|Thriller  \n",
       "\n",
       "[6237 rows x 5 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Converting to correct data types\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:36:10.272269Z",
     "start_time": "2024-09-03T02:36:10.265882Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "movieId     int64\n",
      "imdbId     object\n",
      "tmdbId     object\n",
      "title      object\n",
      "genres     object\n",
      "dtype: object\n"
     ]
    }
   ],
   "source": [
    "dtype_dict = {col: 'str' for col in df_movies.columns}\n",
    "dtype_dict['movieId'] = 'int'\n",
    "df_movies = df_movies.astype(dtype_dict)\n",
    "\n",
    "# Verify the conversion\n",
    "print(df_movies.dtypes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing the links \n",
    "\n",
    "- Using web scraping to access the storyline/synopsis from each link\n",
    "- Stored in separate columns - \"imdb_doc\", \"tmdb_doc\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install IMDbPY"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:36:20.675117Z",
     "start_time": "2024-09-03T02:36:13.935189Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A little boy named Andy loves to be in his room, playing with his toys, especially his doll named \"Woody\". But, what do the toys do when Andy is not with them, they come to life. Woody believes that his life (as a toy) is good. However, he must worry about Andy's family moving, and what Woody does not know is about Andy's birthday party. Woody does not realize that Andy's mother gave him an action figure known as Buzz Lightyear, who does not believe that he is a toy, and quickly becomes Andy's new favorite toy. Woody, who is now consumed with jealousy, tries to get rid of Buzz. Then, both Woody and Buzz are now lost. They must find a way to get back to Andy before he moves without them, but they will have to pass through a ruthless toy killer, Sid Phillips.\n"
     ]
    }
   ],
   "source": [
    "from imdb import IMDb\n",
    "\n",
    "ia = IMDb()\n",
    "movie = ia.get_movie('0114709')  # Use IMDb ID without \"tt\"\n",
    "print(movie.get('plot outline'))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:36:23.615572Z",
     "start_time": "2024-09-03T02:36:23.610611Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>113228</td>\n",
       "      <td>15602.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27273</th>\n",
       "      <td>131254</td>\n",
       "      <td>466713</td>\n",
       "      <td>4436.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27274</th>\n",
       "      <td>131256</td>\n",
       "      <td>277703</td>\n",
       "      <td>9274.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27275</th>\n",
       "      <td>131258</td>\n",
       "      <td>3485166</td>\n",
       "      <td>285213.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27276</th>\n",
       "      <td>131260</td>\n",
       "      <td>249110</td>\n",
       "      <td>32099.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27277</th>\n",
       "      <td>131262</td>\n",
       "      <td>1724965</td>\n",
       "      <td>286971.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>27278 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       movieId   imdbId    tmdbId\n",
       "0            1   114709     862.0\n",
       "1            2   113497    8844.0\n",
       "2            3   113228   15602.0\n",
       "3            4   114885   31357.0\n",
       "4            5   113041   11862.0\n",
       "...        ...      ...       ...\n",
       "27273   131254   466713    4436.0\n",
       "27274   131256   277703    9274.0\n",
       "27275   131258  3485166  285213.0\n",
       "27276   131260   249110   32099.0\n",
       "27277   131262  1724965  286971.0\n",
       "\n",
       "[27278 rows x 3 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "links"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Due to it taking a long time - only using tmdb for now, and then we can add this if necessary "
   ]
  },
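  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because each lookup is a network round trip, one way to make a slow IMDb pass tolerable is to fetch in small, resumable batches. This is a minimal sketch and not this notebook's method: `fill_synopses`, its `fetch` callable and `batch_size` are hypothetical names, and the stub `fake_fetch` stands in for a real client call such as `ia.get_movie(imdb_id).get('plot outline')`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def fill_synopses(df, fetch, batch_size=100, column='imdb_syn'):\n",
    "    # Fill up to batch_size missing synopses in place; return how many rows were attempted\n",
    "    if column not in df.columns:\n",
    "        df[column] = None\n",
    "    # Only rows that still lack a synopsis are attempted, so a rerun resumes where it stopped\n",
    "    pending = df[df[column].isna()].index[:batch_size]\n",
    "    for idx in pending:\n",
    "        imdb_id = str(df.at[idx, 'imdbId']).zfill(7)\n",
    "        try:\n",
    "            df.at[idx, column] = fetch(imdb_id)\n",
    "        except Exception as exc:\n",
    "            # Keep going; the row stays empty and is retried on the next call\n",
    "            print(f'Failed for IMDb ID {imdb_id}: {exc}')\n",
    "    return len(pending)\n",
    "\n",
    "# Stub fetcher for illustration only (no network access)\n",
    "def fake_fetch(imdb_id):\n",
    "    return f'plot outline for tt{imdb_id}'\n",
    "\n",
    "demo = pd.DataFrame({'imdbId': [114709, 113497, 114885]})\n",
    "fill_synopses(demo, fake_fetch, batch_size=2)  # first call fills two rows\n",
    "fill_synopses(demo, fake_fetch, batch_size=2)  # second call fills the remaining row\n",
    "print(demo)\n",
    "```\n",
    "\n",
    "Saving the frame to CSV between calls would let a long scrape be spread over several sessions."
   ]
  },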
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:45:25.059803Z",
     "start_time": "2024-09-03T02:36:49.036818Z"
    }
   },
   "outputs": [],
   "source": [
    "# import pandas as pd\n",
    "# from imdb import IMDb\n",
    "# \n",
    "# # Assuming df_movies is your DataFrame and it has a column named 'imdbId'\n",
    "# # Initialize IMDb object\n",
    "# ia = IMDb()\n",
    "# \n",
    "# # Create a new column 'imdb_syn' initialized with None only if it does not exist\n",
    "# if 'imdb_syn' not in df_movies.columns:\n",
    "#     df_movies['imdb_syn'] = None\n",
    "# \n",
    "# # Loop through the first 100 IMDb IDs in the DataFrame, using .loc on the default integer index\n",
    "# for index in range(0, 100):\n",
    "#     try:\n",
    "#         imdb_id = str(df_movies.loc[index, 'imdbId']).zfill(7)  # Ensure the IMDb ID has leading zeros up to 7 digits\n",
    "#         movie = ia.get_movie(imdb_id)  # Use IMDb ID without \"tt\"\n",
    "#         plot_outline = movie.get('plot outline')\n",
    "#         \n",
    "#         # Assign the plot outline to the corresponding entry in 'imdb_syn' column\n",
    "#         df_movies.loc[index, 'imdb_syn'] = plot_outline\n",
    "#     except Exception as e:\n",
    "#         print(f\"An error occurred for index {index}, IMDb ID {imdb_id}: {e}\")\n",
    "# \n",
    "# # Now df_movies['imdb_syn'] will contain the plot outlines for the first 100 IMDb IDs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# df_movies.loc[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:53:37.445736Z",
     "start_time": "2024-09-03T02:53:37.399813Z"
    }
   },
   "outputs": [],
   "source": [
    "# df_movies.to_csv(\"../dataset/context_imdb.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:53:38.673646Z",
     "start_time": "2024-09-03T02:53:38.655182Z"
    }
   },
   "outputs": [],
   "source": [
    "df_movies = pd.read_csv(\"../dataset/context_imdb.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T02:53:46.978095Z",
     "start_time": "2024-09-03T02:53:46.968046Z"
    }
   },
   "outputs": [],
   "source": [
    "# df_movies.drop(columns=['Unnamed: 0'], inplace=True)\n",
    "# df_movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T01:18:24.928659Z",
     "start_time": "2024-09-03T01:18:24.904629Z"
    }
   },
   "outputs": [],
   "source": [
    "# # df_movies_null = df_movies[df_movies['imdb_syn'].notna().index]\n",
    "# non_na_indices = df_movies.index[df_movies['imdb_syn'].notna()]\n",
    "# if not non_na_indices.empty:\n",
    "#     df_movies_null_num = non_na_indices.max() + 1\n",
    "# else:\n",
    "#     df_movies_null_num = 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For TMDb, we use the Python package `tmdbv3api`:\n",
    "https://github.com/AnthonyBloomer/tmdbv3api"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T01:18:48.102394Z",
     "start_time": "2024-09-03T01:18:46.966552Z"
    }
   },
   "outputs": [],
   "source": [
    "# !pip install tmdbv3api"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A TMDb API key is required for the requests below. Keep the key out of the notebook, for example in an environment variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:11:18.431728Z",
     "start_time": "2024-09-03T02:54:31.244584Z"
    }
   },
   "outputs": [],
   "source": [
    "# import os\n",
    "# import time\n",
    "# \n",
    "# import pandas as pd\n",
    "# from concurrent.futures import ThreadPoolExecutor\n",
    "# from tmdbv3api import TMDb, Movie\n",
    "# \n",
    "# # Initialize TMDb and Movie objects\n",
    "# tmdb = TMDb()\n",
    "# movie = Movie()\n",
    "# \n",
    "# # Your TMDb API key, read from the environment instead of being hard-coded\n",
    "# tmdb.api_key = os.environ['TMDB_API_KEY']  # TMDB_API_KEY is an assumed variable name\n",
    "# \n",
    "# # Function to fetch movie overview\n",
    "# def fetch_overview(tmdb_id):\n",
    "#     try:\n",
    "#         if pd.notna(tmdb_id):  # NaN is truthy, so a plain `if tmdb_id` lets missing IDs through\n",
    "#             details = movie.details(int(tmdb_id))  # tmdbId is stored as a float (e.g. 862.0)\n",
    "#             if details:\n",
    "#                 return details.overview\n",
    "#             else:\n",
    "#                 print(f\"Resource with ID {tmdb_id} could not be found.\")\n",
    "#                 return 'N/A'\n",
    "#         else:\n",
    "#             return 'N/A'\n",
    "#     except Exception as e:\n",
    "#         print(f\"An error occurred: {e}\")\n",
    "#         return 'N/A'\n",
    "#     finally:\n",
    "#         # Sleep briefly to avoid overwhelming the API and hitting rate limits\n",
    "#         time.sleep(0.1)  \n",
    "# \n",
    "# # Function to handle each batch\n",
    "# def process_batch(batch):\n",
    "#     max_workers = 5\n",
    "#     with ThreadPoolExecutor(max_workers=max_workers) as executor:\n",
    "#         return list(executor.map(fetch_overview, batch))\n",
    "# \n",
    "# # Batch size (adjust based on your requirements)\n",
    "# batch_size = 50\n",
    "# \n",
    "# # Initialize an empty list to hold the results\n",
    "# all_results = []\n",
    "# \n",
    "# # Process each batch\n",
    "# for i in range(0, len(df_movies['tmdbId']), batch_size):\n",
    "#     batch = df_movies['tmdbId'][i:i + batch_size]\n",
    "#     batch_results = process_batch(batch)\n",
    "#     all_results.extend(batch_results)\n",
    "# \n",
    "# # Add the results back into the DataFrame\n",
    "# df_movies['tmdb_syn'] = all_results\n",
    "# \n",
    "# # Display the updated DataFrame\n",
    "# print(df_movies)\n",
    "# \n",
    "# \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:13:33.296398Z",
     "start_time": "2024-09-03T03:13:33.240455Z"
    }
   },
   "outputs": [],
   "source": [
    "\n",
    "df_movies.to_csv(\"../dataset/tmdb_syn_labelled.csv\", index=False)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T01:20:28.807326Z",
     "start_time": "2024-09-03T01:19:50.652159Z"
    }
   },
   "outputs": [],
   "source": [
    "# import os\n",
    "# import pandas as pd\n",
    "# from concurrent.futures import ThreadPoolExecutor\n",
    "# from tmdbv3api import TMDb, Movie\n",
    "\n",
    "# # Initialize TMDb and Movie objects\n",
    "# tmdb = TMDb()\n",
    "# movie = Movie()\n",
    "\n",
    "# # Your TMDb API key, read from the environment instead of being hard-coded\n",
    "# tmdb.api_key = os.environ['TMDB_API_KEY']  # TMDB_API_KEY is an assumed variable name\n",
    "\n",
    "# # Function to fetch movie overview\n",
    "# def fetch_overview(tmdb_id):\n",
    "#     try:\n",
    "#         if pd.notna(tmdb_id):  # NaN is truthy, so a plain `if tmdb_id` lets missing IDs through\n",
    "#             details = movie.details(int(tmdb_id))  # tmdbId is stored as a float (e.g. 862.0)\n",
    "#             if details:\n",
    "#                 return details.overview\n",
    "#             else:\n",
    "#                 print(f\"Resource with ID {tmdb_id} could not be found.\")\n",
    "#                 return 'N/A'\n",
    "#         else:\n",
    "#             return 'N/A'\n",
    "#     except Exception as e:\n",
    "#         print(f\"An error occurred: {e}\")\n",
    "#         return 'N/A'\n",
    "\n",
    "\n",
    "# # Function to handle each batch\n",
    "# def process_batch(batch):\n",
    "#     max_workers = 10\n",
    "#     with ThreadPoolExecutor(max_workers=max_workers) as executor:\n",
    "#         return list(executor.map(fetch_overview, batch))\n",
    "\n",
    "# # Batch size (adjust based on your requirements)\n",
    "# batch_size = 100\n",
    "\n",
    "# # Initialize an empty list to hold the results\n",
    "# all_results = []\n",
    "\n",
    "# # Process each batch\n",
    "# for i in range(0, len(df_movies['tmdbId']), batch_size):\n",
    "#     batch = df_movies['tmdbId'][i:i + batch_size]\n",
    "#     batch_results = process_batch(batch)\n",
    "#     all_results.extend(batch_results)\n",
    "\n",
    "# # Add the results back into the DataFrame\n",
    "# df_movies['tmdb_syn'] = all_results\n",
    "\n",
    "# # Display the updated DataFrame\n",
    "# print(df_movies)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:49:07.655953Z",
     "start_time": "2024-09-03T03:49:07.613335Z"
    }
   },
   "outputs": [],
   "source": [
    "# df_movies.to_csv(\"../dataset/tmdb_syn_labelled.csv\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Combining the IMDb and TMDb synopses into one column\n",
    "- Note (2 Oct): summarise once the IMDb synopses are done, if they are used.\n",
    "- Combine and summarise the synopses using the gensim package (`gensim.summarization.summarize`, which requires gensim 3.x; the module was removed in gensim 4.0).\n",
    "- `ratio` (float, optional): number between 0 and 1 that sets the proportion of sentences from the original text kept in the summary.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_movies = pd.read_csv(\"../dataset/tmdb_syn_labelled.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:42:08.748282Z",
     "start_time": "2024-09-03T03:35:57.676445Z"
    }
   },
   "outputs": [],
   "source": [
    "# from transformers import pipeline\n",
    "# \n",
    "# # Initialize the summarizer pipeline\n",
    "# summarizer = pipeline(\"summarization\")\n",
    "# \n",
    "# def combine_and_summarize(imdb_syn, tmdb_syn):\n",
    "#     combined_syn = imdb_syn + \" \" + tmdb_syn\n",
    "#     # Perform summarization\n",
    "#     summary = summarizer(combined_syn, max_length=150, min_length=30, do_sample=False)\n",
    "#     # Extract the summarized text\n",
    "#     summarized_syn = summary[0]['summary_text']\n",
    "#     return summarized_syn if summarized_syn else combined_syn  # Use original if summarization fails\n",
    "# \n",
    "# # Apply the function to your DataFrame\n",
    "# df_movies['summarized_syn'] = df_movies.apply(lambda x: combine_and_summarize(x['imdb_syn'], x['tmdb_syn']), axis=1)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting gensim==3.6.0\n",
      "  Downloading gensim-3.6.0.tar.gz (23.1 MB)\n",
      "  Preparing metadata (setup.py) ... done\n",
      "Requirement already satisfied: numpy>=1.11.3 in /Users/jiayi/anaconda3/lib/python3.11/site-packages (from gensim==3.6.0) (1.24.4)\n",
      "Requirement already satisfied: scipy>=0.18.1 in /Users/jiayi/anaconda3/lib/python3.11/site-packages (from gensim==3.6.0) (1.10.1)\n",
      "Requirement already satisfied: six>=1.5.0 in /Users/jiayi/anaconda3/lib/python3.11/site-packages (from gensim==3.6.0) (1.16.0)\n",
      "Requirement already satisfied: smart_open>=1.2.1 in /Users/jiayi/anaconda3/lib/python3.11/site-packages (from gensim==3.6.0) (5.2.1)\n",
      "Building wheels for collected packages: gensim\n",
      "  Building wheel for gensim (setup.py) ... done\n",
      "  Created wheel for gensim: filename=gensim-3.6.0-cp311-cp311-macosx_11_0_arm64.whl size=23218283 sha256=701682c488fd8945db7da094f87e6e8bbc7327bcae9739eb0f604d29824e1250\n",
      "  Stored in directory: /Users/jiayi/Library/Caches/pip/wheels/6b/84/4d/ea7977f42f89ebaa24f5702cc2d5013300934c7641d48c1521\n",
      "Successfully built gensim\n",
      "Installing collected packages: gensim\n",
      "  Attempting uninstall: gensim\n",
      "    Found existing installation: gensim 4.3.0\n",
      "    Uninstalling gensim-4.3.0:\n",
      "      Successfully uninstalled gensim-4.3.0\n",
      "Successfully installed gensim-3.6.0\n"
     ]
    }
   ],
   "source": [
    "# !pip install gensim\n",
    "# !pip3 install gensim==3.6.0\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdb_syn</th>\n",
       "      <th>tmdb_syn</th>\n",
       "      <th>summarized_syn</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "      <td>A little boy named Andy loves to be in his roo...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "      <td>A little boy named Andy loves to be in his roo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "      <td>Jumanji, one of the most unique--and dangerous...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "      <td>Jumanji, one of the most unique--and dangerous...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "      <td>This story based on the best selling novel by ...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "      <td>This story based on the best selling novel by ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>In this sequel to \"Father of the Bride\", Georg...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "      <td>In this sequel to \"Father of the Bride\", Georg...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949.0</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>Action|Crime|Thriller</td>\n",
       "      <td>Hunters and their prey--Neil and his professio...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "      <td>Hunters and their prey--Neil and his professio...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6178</th>\n",
       "      <td>130856</td>\n",
       "      <td>494826</td>\n",
       "      <td>48376.0</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>Comedy|Documentary</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "      <td>Severe Clear is a film based on the memoirs o...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6179</th>\n",
       "      <td>130958</td>\n",
       "      <td>143338</td>\n",
       "      <td>78402.0</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>Horror</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6180</th>\n",
       "      <td>130984</td>\n",
       "      <td>208423</td>\n",
       "      <td>317168.0</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>Action|Fantasy|Horror</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6181</th>\n",
       "      <td>131011</td>\n",
       "      <td>69109</td>\n",
       "      <td>79572.0</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>Crime|Drama</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "      <td>Bertone is a moderately honest homicide cop.\\n...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6182</th>\n",
       "      <td>131015</td>\n",
       "      <td>1430116</td>\n",
       "      <td>143928.0</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>Horror|Thriller</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "      <td>A western businessman, his Thai wife and son ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6183 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId   imdbId    tmdbId                               title  \\\n",
       "0           1   114709     862.0                    Toy Story (1995)   \n",
       "1           2   113497    8844.0                      Jumanji (1995)   \n",
       "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
       "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
       "4           6   113277     949.0                         Heat (1995)   \n",
       "...       ...      ...       ...                                 ...   \n",
       "6178   130856   494826   48376.0                 Severe Clear (2010)   \n",
       "6179   130958   143338   78402.0             Killer Crocodile (1989)   \n",
       "6180   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
       "6181   131011    69109   79572.0              Execution Squad (1972)   \n",
       "6182   131015  1430116  143928.0                     Hellgate (2011)   \n",
       "\n",
       "                                           genres  \\\n",
       "0     Adventure|Animation|Children|Comedy|Fantasy   \n",
       "1                      Adventure|Children|Fantasy   \n",
       "2                            Comedy|Drama|Romance   \n",
       "3                                          Comedy   \n",
       "4                           Action|Crime|Thriller   \n",
       "...                                           ...   \n",
       "6178                           Comedy|Documentary   \n",
       "6179                                       Horror   \n",
       "6180                        Action|Fantasy|Horror   \n",
       "6181                                  Crime|Drama   \n",
       "6182                              Horror|Thriller   \n",
       "\n",
       "                                               imdb_syn  \\\n",
       "0     A little boy named Andy loves to be in his roo...   \n",
       "1     Jumanji, one of the most unique--and dangerous...   \n",
       "2     This story based on the best selling novel by ...   \n",
       "3     In this sequel to \"Father of the Bride\", Georg...   \n",
       "4     Hunters and their prey--Neil and his professio...   \n",
       "...                                                 ...   \n",
       "6178                                                NaN   \n",
       "6179                                                NaN   \n",
       "6180                                                NaN   \n",
       "6181                                                NaN   \n",
       "6182                                                NaN   \n",
       "\n",
       "                                               tmdb_syn  \\\n",
       "0     Led by Woody, Andy's toys live happily in his ...   \n",
       "1     When siblings Judy and Peter discover an encha...   \n",
       "2     Cheated on, mistreated and stepped on, the wom...   \n",
       "3     Just when George Banks has recovered from his ...   \n",
       "4     Obsessive master thief Neil McCauley leads a t...   \n",
       "...                                                 ...   \n",
       "6178  Severe Clear is a film based on the memoirs of...   \n",
       "6179  A group of environmentalists arrives at a fara...   \n",
       "6180             Also known as Santo vs. the She-Wolves   \n",
       "6181  Bertone is a moderately honest homicide cop. U...   \n",
       "6182  A western businessman, his Thai wife and son e...   \n",
       "\n",
       "                                         summarized_syn  \n",
       "0     A little boy named Andy loves to be in his roo...  \n",
       "1     Jumanji, one of the most unique--and dangerous...  \n",
       "2     This story based on the best selling novel by ...  \n",
       "3     In this sequel to \"Father of the Bride\", Georg...  \n",
       "4     Hunters and their prey--Neil and his professio...  \n",
       "...                                                 ...  \n",
       "6178   Severe Clear is a film based on the memoirs o...  \n",
       "6179  A group of environmentalists arrives at a fara...  \n",
       "6180             Also known as Santo vs. the She-Wolves  \n",
       "6181  Bertone is a moderately honest homicide cop.\\n...  \n",
       "6182   A western businessman, his Thai wife and son ...  \n",
       "\n",
       "[6183 rows x 8 columns]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:32:21.228156Z",
     "start_time": "2024-09-03T03:32:20.989650Z"
    }
   },
   "outputs": [],
   "source": [
    "from gensim.summarization import summarize  # available in gensim 3.x only\n",
    "\n",
    "def count_sentences(text):\n",
    "    # Rough sentence count: a word character followed by ., ! or ?\n",
    "    return len(re.findall(r'\\w[.!?]', text))\n",
    "\n",
    "# Combine and summarize the synopses\n",
    "def combine_and_summarize(imdb_syn, tmdb_syn):\n",
    "    # Treat missing synopses (NaN) as empty strings\n",
    "    imdb_syn = str(imdb_syn) if not pd.isna(imdb_syn) else \"\"\n",
    "    tmdb_syn = str(tmdb_syn) if not pd.isna(tmdb_syn) else \"\"\n",
    "    # strip() avoids a stray leading or trailing space when one synopsis is missing\n",
    "    combined_syn = (imdb_syn + \" \" + tmdb_syn).strip()\n",
    "    \n",
    "    # summarize() needs more than one sentence, so shorter texts are returned unchanged\n",
    "    if count_sentences(combined_syn) > 1:\n",
    "        try:\n",
    "            summarized_syn = summarize(combined_syn, ratio=0.7)  # Adjust the ratio as needed\n",
    "            return summarized_syn if summarized_syn else combined_syn  # Use original if the summary is empty\n",
    "        except Exception:\n",
    "            return combined_syn  # Use original if summarization fails\n",
    "    else:\n",
    "        return combined_syn\n",
    "\n",
    "# Apply the function to your DataFrame\n",
    "df_movies['summarized_syn'] = df_movies.apply(lambda x: combine_and_summarize(x['imdb_syn'], x['tmdb_syn']), axis=1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Reading in the tmdb column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:18:57.935403Z",
     "start_time": "2024-09-03T03:18:57.876634Z"
    }
   },
   "outputs": [],
   "source": [
    "df_movies = pd.read_csv(\"../dataset/tmdb_syn_labelled.csv\")\n",
    "\n",
    "# Second copy of the same file, used below as the source of the 'tmdb_syn' column\n",
    "tmdb = pd.read_csv(\"../dataset/tmdb_syn_labelled.csv\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:19:09.863175Z",
     "start_time": "2024-09-03T03:19:09.860985Z"
    }
   },
   "outputs": [],
   "source": [
    "# tmdb['movieId'] = tmdb['movieId'].astype('int')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:19:10.464625Z",
     "start_time": "2024-09-03T03:19:10.457670Z"
    }
   },
   "outputs": [],
   "source": [
    "# tmdb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:29:36.337066Z",
     "start_time": "2024-09-03T03:29:36.283281Z"
    }
   },
   "outputs": [],
   "source": [
    "# Merge df_movies with tmdb based on 'movieId' to add the 'tmdb_syn' column.\n",
    "# Skip the merge if df_movies already has 'tmdb_syn'; otherwise pandas would\n",
    "# create duplicate 'tmdb_syn_x' / 'tmdb_syn_y' columns.\n",
    "if 'tmdb_syn' not in df_movies.columns:\n",
    "    df_movies = pd.merge(df_movies, tmdb[['movieId', 'tmdb_syn']], on='movieId', how='left')\n",
    "\n",
    "# Display the updated DataFrame\n",
    "df_movies[[\"movieId\", \"title\", \"summarized_syn\"]].head(10)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:29:14.472898Z",
     "start_time": "2024-09-03T03:29:14.466826Z"
    }
   },
   "outputs": [],
   "source": [
    "df_movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:17:31.829334Z",
     "start_time": "2024-09-03T03:17:31.794798Z"
    }
   },
   "outputs": [],
   "source": [
    "df_movies['tmdb_syn']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Investigating movies with NO tmdb_syn:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:19:48.540331Z",
     "start_time": "2024-09-03T03:19:48.530701Z"
    }
   },
   "outputs": [],
   "source": [
    "\n",
    "# Create a DataFrame with only the rows where 'tmdb_syn' is NaN\n",
    "df_movies_missing_tmdb_syn = df_movies[pd.isna(df_movies['tmdb_syn'])]\n",
    "\n",
    "# Display or analyze the DataFrame\n",
    "print(df_movies_missing_tmdb_syn)\n",
    "\n",
    "# If you want to know the number of such rows\n",
    "print(\"Number of rows with missing tmdb_syn:\", len(df_movies_missing_tmdb_syn))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For these 56 movies with a missing `tmdb_syn`, we try to fetch the plot outline from IMDb instead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:19:56.386106Z",
     "start_time": "2024-09-03T03:19:56.378403Z"
    }
   },
   "outputs": [],
   "source": [
    "df_movies_missing_tmdb_syn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:24:33.064769Z",
     "start_time": "2024-09-03T03:20:12.813794Z"
    }
   },
   "outputs": [],
   "source": [
    "from imdb import IMDb\n",
    "\n",
    "# Initialize IMDb object\n",
    "ia = IMDb()\n",
    "\n",
    "# Function to fetch the plot outline\n",
    "def get_imdb_syn(imdbId):\n",
    "    try:\n",
    "        movie = ia.get_movie(str(imdbId))\n",
    "        return movie.get('plot outline', 'N/A')\n",
    "    except Exception as e:\n",
    "        print(f\"An error occurred: {e}\")\n",
    "        return 'N/A'\n",
    "\n",
    "# Find rows where 'tmdb_syn' is NaN\n",
    "missing_tmdb_syn_idx = df_movies[pd.isna(df_movies['tmdb_syn'])].index.tolist()\n",
    "\n",
    "# Initialize an empty list to keep track of updated rows\n",
    "updated_rows = []\n",
    "\n",
    "# Fetch 'imdb_syn' for these rows\n",
    "for idx in missing_tmdb_syn_idx:\n",
    "    imdbId = df_movies.loc[idx, 'imdbId']\n",
    "    new_syn = get_imdb_syn(imdbId)\n",
    "    \n",
    "    if new_syn != 'N/A':\n",
    "        df_movies.loc[idx, 'imdb_syn'] = new_syn\n",
    "        updated_rows.append(idx)\n",
    "\n",
    "# Show the rows that were updated\n",
    "print(\"Updated rows:\")\n",
    "print(df_movies.loc[updated_rows])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import pandas as pd\n",
    "# from imdb import IMDb\n",
    "# import time\n",
    "# \n",
    "# # Initialize IMDb object\n",
    "# ia = IMDb()\n",
    "# \n",
    "# # Function to fetch the plot outline\n",
    "# def get_imdb_syn(imdbId):\n",
    "#     try:\n",
    "#         movie = ia.get_movie(str(imdbId))\n",
    "#         return movie.get('plot outline', 'N/A')\n",
    "#     except Exception as e:\n",
    "#         print(f\"An error occurred while fetching plot outline for {imdbId}: {e}\")\n",
    "#         return 'N/A'\n",
    "# \n",
    "# # Sample DataFrame (replace with your actual DataFrame)\n",
    "# # df_movies = pd.DataFrame({\n",
    "# #     'imdbId': ['0111161', '0133093', '0133093'],  # Example IMDb IDs\n",
    "# #     'tmdb_syn': [None, 'Some Synopsis', None]\n",
    "# # })\n",
    "# \n",
    "# # Find rows where 'tmdb_syn' is NaN\n",
    "# missing_tmdb_syn_idx = df_movies[pd.isna(df_movies['tmdb_syn'])].index.tolist()\n",
    "# \n",
    "# # Initialize an empty list to keep track of updated rows\n",
    "# updated_rows = []\n",
    "# \n",
    "# # Fetch 'imdb_syn' for these rows\n",
    "# for idx in missing_tmdb_syn_idx:\n",
    "#     imdbId = df_movies.loc[idx, 'imdbId']\n",
    "#     new_syn = get_imdb_syn(imdbId)\n",
    "# \n",
    "#     if new_syn != 'N/A':\n",
    "#         df_movies.loc[idx, 'imdb_syn'] = new_syn\n",
    "#         updated_rows.append(idx)\n",
    "# \n",
    "#     # Add a delay to avoid hitting API rate limits\n",
    "#     time.sleep(1)\n",
    "# \n",
    "# # Show the rows that were updated\n",
    "# print(\"Updated rows:\")\n",
    "# print(df_movies.loc[updated_rows])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:46:07.141341Z",
     "start_time": "2024-09-03T03:46:07.128948Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdb_syn</th>\n",
       "      <th>tmdb_syn_x</th>\n",
       "      <th>tmdb_syn_y</th>\n",
       "      <th>tmdb_syn</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "      <td>A little boy named Andy loves to be in his roo...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "      <td>Jumanji, one of the most unique--and dangerous...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "      <td>This story based on the best selling novel by ...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>In this sequel to \"Father of the Bride\", Georg...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949.0</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>Action|Crime|Thriller</td>\n",
       "      <td>Hunters and their prey--Neil and his professio...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6178</th>\n",
       "      <td>130856</td>\n",
       "      <td>494826</td>\n",
       "      <td>48376.0</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>Comedy|Documentary</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6179</th>\n",
       "      <td>130958</td>\n",
       "      <td>143338</td>\n",
       "      <td>78402.0</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>Horror</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6180</th>\n",
       "      <td>130984</td>\n",
       "      <td>208423</td>\n",
       "      <td>317168.0</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>Action|Fantasy|Horror</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6181</th>\n",
       "      <td>131011</td>\n",
       "      <td>69109</td>\n",
       "      <td>79572.0</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>Crime|Drama</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6182</th>\n",
       "      <td>131015</td>\n",
       "      <td>1430116</td>\n",
       "      <td>143928.0</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>Horror|Thriller</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6183 rows × 9 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId   imdbId    tmdbId                               title  \\\n",
       "0           1   114709     862.0                    Toy Story (1995)   \n",
       "1           2   113497    8844.0                      Jumanji (1995)   \n",
       "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
       "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
       "4           6   113277     949.0                         Heat (1995)   \n",
       "...       ...      ...       ...                                 ...   \n",
       "6178   130856   494826   48376.0                 Severe Clear (2010)   \n",
       "6179   130958   143338   78402.0             Killer Crocodile (1989)   \n",
       "6180   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
       "6181   131011    69109   79572.0              Execution Squad (1972)   \n",
       "6182   131015  1430116  143928.0                     Hellgate (2011)   \n",
       "\n",
       "                                           genres  \\\n",
       "0     Adventure|Animation|Children|Comedy|Fantasy   \n",
       "1                      Adventure|Children|Fantasy   \n",
       "2                            Comedy|Drama|Romance   \n",
       "3                                          Comedy   \n",
       "4                           Action|Crime|Thriller   \n",
       "...                                           ...   \n",
       "6178                           Comedy|Documentary   \n",
       "6179                                       Horror   \n",
       "6180                        Action|Fantasy|Horror   \n",
       "6181                                  Crime|Drama   \n",
       "6182                              Horror|Thriller   \n",
       "\n",
       "                                               imdb_syn  \\\n",
       "0     A little boy named Andy loves to be in his roo...   \n",
       "1     Jumanji, one of the most unique--and dangerous...   \n",
       "2     This story based on the best selling novel by ...   \n",
       "3     In this sequel to \"Father of the Bride\", Georg...   \n",
       "4     Hunters and their prey--Neil and his professio...   \n",
       "...                                                 ...   \n",
       "6178                                                NaN   \n",
       "6179                                                NaN   \n",
       "6180                                                NaN   \n",
       "6181                                                NaN   \n",
       "6182                                                NaN   \n",
       "\n",
       "                                             tmdb_syn_x  \\\n",
       "0     Led by Woody, Andy's toys live happily in his ...   \n",
       "1     When siblings Judy and Peter discover an encha...   \n",
       "2     Cheated on, mistreated and stepped on, the wom...   \n",
       "3     Just when George Banks has recovered from his ...   \n",
       "4     Obsessive master thief Neil McCauley leads a t...   \n",
       "...                                                 ...   \n",
       "6178  Severe Clear is a film based on the memoirs of...   \n",
       "6179  A group of environmentalists arrives at a fara...   \n",
       "6180             Also known as Santo vs. the She-Wolves   \n",
       "6181  Bertone is a moderately honest homicide cop. U...   \n",
       "6182  A western businessman, his Thai wife and son e...   \n",
       "\n",
       "                                             tmdb_syn_y  \\\n",
       "0     Led by Woody, Andy's toys live happily in his ...   \n",
       "1     When siblings Judy and Peter discover an encha...   \n",
       "2     Cheated on, mistreated and stepped on, the wom...   \n",
       "3     Just when George Banks has recovered from his ...   \n",
       "4     Obsessive master thief Neil McCauley leads a t...   \n",
       "...                                                 ...   \n",
       "6178  Severe Clear is a film based on the memoirs of...   \n",
       "6179  A group of environmentalists arrives at a fara...   \n",
       "6180             Also known as Santo vs. the She-Wolves   \n",
       "6181  Bertone is a moderately honest homicide cop. U...   \n",
       "6182  A western businessman, his Thai wife and son e...   \n",
       "\n",
       "                                               tmdb_syn  \n",
       "0     Led by Woody, Andy's toys live happily in his ...  \n",
       "1     When siblings Judy and Peter discover an encha...  \n",
       "2     Cheated on, mistreated and stepped on, the wom...  \n",
       "3     Just when George Banks has recovered from his ...  \n",
       "4     Obsessive master thief Neil McCauley leads a t...  \n",
       "...                                                 ...  \n",
       "6178  Severe Clear is a film based on the memoirs of...  \n",
       "6179  A group of environmentalists arrives at a fara...  \n",
       "6180             Also known as Santo vs. the She-Wolves  \n",
       "6181  Bertone is a moderately honest homicide cop. U...  \n",
       "6182  A western businessman, his Thai wife and son e...  \n",
       "\n",
       "[6183 rows x 9 columns]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just for now, until the imdb API works -> we are removing these movieIds:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:26:18.266886Z",
     "start_time": "2024-09-03T03:26:18.255430Z"
    }
   },
   "outputs": [],
   "source": [
    "missing_tmdb_syn_idx = df_movies[pd.isna(df_movies['tmdb_syn'])].index.tolist()\n",
    "# missing_tmdb_syn_idx # for \n",
    "\n",
    "# Remove rows with indices in missing_tmdb_syn_idx from df_movies\n",
    "df_movies.drop(missing_tmdb_syn_idx, inplace=True)\n",
    "\n",
    "# Reset index if needed\n",
    "df_movies.reset_index(drop=True, inplace=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_movies.to_csv(\"../dataset/imdb_tmdb_syn_labelled.csv\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_movies= pd.read_csv(\"../dataset/imdb_tmdb_syn_labelled.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Text Pre-processing \n",
    "- Need to apply to the following colunns: summarized_syn, title, genre\n",
    "\n",
    "\n",
    "1. Removal of special characters  \n",
    "2. Uniform size of letters - lowercase\n",
    "3. Remove punctuation and quotation marks\n",
    "4. Remove possessive pronouns \n",
    "5. Lemmatisation \n",
    "6. Removal of “stop words”"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Removal of special characters  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-03T03:27:00.170506Z",
     "start_time": "2024-09-03T03:27:00.129154Z"
    }
   },
   "outputs": [],
   "source": [
    "# Replace non-alphabetical characters with empty string, leaving spaces intact\n",
    "df_movies['summarized_syn_cleaned'] = df_movies['summarized_syn'].str.replace(r'[^a-zA-Z\\s]', '', regex=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Uniform size of letters - lowercase"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_movies['summarized_syn'] = df_movies['summarized_syn'].str.lower() #lowercase\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Remove punctuation and quotation marks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_movies['summarized_syn'] = df_movies['summarized_syn'].str.replace(r'[^\\w\\s]', '', regex=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. Remove possessive pronouns "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      movieId   imdbId    tmdbId                               title  \\\n",
      "0           1   114709     862.0                    Toy Story (1995)   \n",
      "1           2   113497    8844.0                      Jumanji (1995)   \n",
      "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
      "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
      "4           6   113277     949.0                         Heat (1995)   \n",
      "...       ...      ...       ...                                 ...   \n",
      "6178   130856   494826   48376.0                 Severe Clear (2010)   \n",
      "6179   130958   143338   78402.0             Killer Crocodile (1989)   \n",
      "6180   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
      "6181   131011    69109   79572.0              Execution Squad (1972)   \n",
      "6182   131015  1430116  143928.0                     Hellgate (2011)   \n",
      "\n",
      "                                           genres  \\\n",
      "0     Adventure|Animation|Children|Comedy|Fantasy   \n",
      "1                      Adventure|Children|Fantasy   \n",
      "2                            Comedy|Drama|Romance   \n",
      "3                                          Comedy   \n",
      "4                           Action|Crime|Thriller   \n",
      "...                                           ...   \n",
      "6178                           Comedy|Documentary   \n",
      "6179                                       Horror   \n",
      "6180                        Action|Fantasy|Horror   \n",
      "6181                                  Crime|Drama   \n",
      "6182                              Horror|Thriller   \n",
      "\n",
      "                                               imdb_syn  \\\n",
      "0     A little boy named Andy loves to be in his roo...   \n",
      "1     Jumanji, one of the most unique--and dangerous...   \n",
      "2     This story based on the best selling novel by ...   \n",
      "3     In this sequel to \"Father of the Bride\", Georg...   \n",
      "4     Hunters and their prey--Neil and his professio...   \n",
      "...                                                 ...   \n",
      "6178                                                NaN   \n",
      "6179                                                NaN   \n",
      "6180                                                NaN   \n",
      "6181                                                NaN   \n",
      "6182                                                NaN   \n",
      "\n",
      "                                               tmdb_syn  \\\n",
      "0     Led by Woody, Andy's toys live happily in his ...   \n",
      "1     When siblings Judy and Peter discover an encha...   \n",
      "2     Cheated on, mistreated and stepped on, the wom...   \n",
      "3     Just when George Banks has recovered from his ...   \n",
      "4     Obsessive master thief Neil McCauley leads a t...   \n",
      "...                                                 ...   \n",
      "6178  Severe Clear is a film based on the memoirs of...   \n",
      "6179  A group of environmentalists arrives at a fara...   \n",
      "6180             Also known as Santo vs. the She-Wolves   \n",
      "6181  Bertone is a moderately honest homicide cop. U...   \n",
      "6182  A western businessman, his Thai wife and son e...   \n",
      "\n",
      "                                         summarized_syn  \\\n",
      "0     a little boy named andy loves to be in room pl...   \n",
      "1     jumanji one of the most uniqueand dangerousboa...   \n",
      "2     this story based on the best selling novel by ...   \n",
      "3     in this sequel to father of the bride george b...   \n",
      "4     hunters and preyneil and professional criminal...   \n",
      "...                                                 ...   \n",
      "6178  severe clear is a film based on the memoirs of...   \n",
      "6179  a group of environmentalists arrives at a fara...   \n",
      "6180               also known as santo vs the shewolves   \n",
      "6181  bertone is a moderately honest homicide cop be...   \n",
      "6182  a western businessman thai wife and son experi...   \n",
      "\n",
      "                                 summarized_syn_cleaned  \n",
      "0     A little boy named Andy loves to be in his roo...  \n",
      "1     Jumanji one of the most uniqueand dangerousboa...  \n",
      "2     This story based on the best selling novel by ...  \n",
      "3     In this sequel to Father of the Bride George B...  \n",
      "4     Hunters and their preyNeil and his professiona...  \n",
      "...                                                 ...  \n",
      "6178   Severe Clear is a film based on the memoirs o...  \n",
      "6179  A group of environmentalists arrives at a fara...  \n",
      "6180               Also known as Santo vs the SheWolves  \n",
      "6181  Bertone is a moderately honest homicide cop\\nB...  \n",
      "6182   A western businessman his Thai wife and son e...  \n",
      "\n",
      "[6183 rows x 9 columns]\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "\n",
    "# Define a regular expression to match possessive pronouns, with word boundaries\n",
    "possessive_pronouns = r'\\b(my|your|his|her|its|our|their)\\b'\n",
    "\n",
    "# Replace possessive pronouns with empty strings\n",
    "df_movies['summarized_syn'] = df_movies['summarized_syn'].str.replace(possessive_pronouns, '', regex=True)\n",
    "\n",
    "# Remove extra spaces (since the possessive pronouns might leave extra spaces when removed)\n",
    "df_movies['summarized_syn'] = df_movies['summarized_syn'].str.replace(r'\\s+', ' ', regex=True).str.strip()\n",
    "\n",
    "print(df_movies)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. Lemmatisation "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
      "[nltk_data]     /Users/jiayi/nltk_data...\n",
      "[nltk_data]   Package averaged_perceptron_tagger is already up-to-\n",
      "[nltk_data]       date!\n",
      "[nltk_data] Downloading package punkt to /Users/jiayi/nltk_data...\n",
      "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      movieId   imdbId    tmdbId                               title  \\\n",
      "0           1   114709     862.0                    Toy Story (1995)   \n",
      "1           2   113497    8844.0                      Jumanji (1995)   \n",
      "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
      "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
      "4           6   113277     949.0                         Heat (1995)   \n",
      "...       ...      ...       ...                                 ...   \n",
      "6178   130856   494826   48376.0                 Severe Clear (2010)   \n",
      "6179   130958   143338   78402.0             Killer Crocodile (1989)   \n",
      "6180   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
      "6181   131011    69109   79572.0              Execution Squad (1972)   \n",
      "6182   131015  1430116  143928.0                     Hellgate (2011)   \n",
      "\n",
      "                                           genres  \\\n",
      "0     Adventure|Animation|Children|Comedy|Fantasy   \n",
      "1                      Adventure|Children|Fantasy   \n",
      "2                            Comedy|Drama|Romance   \n",
      "3                                          Comedy   \n",
      "4                           Action|Crime|Thriller   \n",
      "...                                           ...   \n",
      "6178                           Comedy|Documentary   \n",
      "6179                                       Horror   \n",
      "6180                        Action|Fantasy|Horror   \n",
      "6181                                  Crime|Drama   \n",
      "6182                              Horror|Thriller   \n",
      "\n",
      "                                               imdb_syn  \\\n",
      "0     A little boy named Andy loves to be in his roo...   \n",
      "1     Jumanji, one of the most unique--and dangerous...   \n",
      "2     This story based on the best selling novel by ...   \n",
      "3     In this sequel to \"Father of the Bride\", Georg...   \n",
      "4     Hunters and their prey--Neil and his professio...   \n",
      "...                                                 ...   \n",
      "6178                                                NaN   \n",
      "6179                                                NaN   \n",
      "6180                                                NaN   \n",
      "6181                                                NaN   \n",
      "6182                                                NaN   \n",
      "\n",
      "                                               tmdb_syn  \\\n",
      "0     Led by Woody, Andy's toys live happily in his ...   \n",
      "1     When siblings Judy and Peter discover an encha...   \n",
      "2     Cheated on, mistreated and stepped on, the wom...   \n",
      "3     Just when George Banks has recovered from his ...   \n",
      "4     Obsessive master thief Neil McCauley leads a t...   \n",
      "...                                                 ...   \n",
      "6178  Severe Clear is a film based on the memoirs of...   \n",
      "6179  A group of environmentalists arrives at a fara...   \n",
      "6180             Also known as Santo vs. the She-Wolves   \n",
      "6181  Bertone is a moderately honest homicide cop. U...   \n",
      "6182  A western businessman, his Thai wife and son e...   \n",
      "\n",
      "                                         summarized_syn  \\\n",
      "0     a little boy name andy love to be in room play...   \n",
      "1     jumanji one of the most uniqueand dangerousboa...   \n",
      "2     this story base on the best sell novel by terr...   \n",
      "3     in this sequel to father of the bride george b...   \n",
      "4     hunter and preyneil and professional criminal ...   \n",
      "...                                                 ...   \n",
      "6178  severe clear be a film base on the memoir of f...   \n",
      "6179  a group of environmentalist arrives at a faraw...   \n",
      "6180                 also know as santo v the shewolves   \n",
      "6181  bertone be a moderately honest homicide cop be...   \n",
      "6182  a western businessman thai wife and son experi...   \n",
      "\n",
      "                                 summarized_syn_cleaned  \n",
      "0     A little boy named Andy loves to be in his roo...  \n",
      "1     Jumanji one of the most uniqueand dangerousboa...  \n",
      "2     This story based on the best selling novel by ...  \n",
      "3     In this sequel to Father of the Bride George B...  \n",
      "4     Hunters and their preyNeil and his professiona...  \n",
      "...                                                 ...  \n",
      "6178   Severe Clear is a film based on the memoirs o...  \n",
      "6179  A group of environmentalists arrives at a fara...  \n",
      "6180               Also known as Santo vs the SheWolves  \n",
      "6181  Bertone is a moderately honest homicide cop\\nB...  \n",
      "6182   A western businessman his Thai wife and son e...  \n",
      "\n",
      "[6183 rows x 9 columns]\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "from nltk.corpus import wordnet\n",
    "\n",
    "# Download necessary NLTK data\n",
    "nltk.download('averaged_perceptron_tagger')\n",
    "nltk.download('punkt')  # for word_tokenize\n",
    "\n",
    "# Initialize the WordNetLemmatizer\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "\n",
    "# Function to map NLTK's POS tags to the first character used by WordNetLemmatizer\n",
    "def pos_tagger(nltk_tag):\n",
    "    if nltk_tag.startswith('J'):\n",
    "        return wordnet.ADJ\n",
    "    elif nltk_tag.startswith('V'):\n",
    "        return wordnet.VERB\n",
    "    elif nltk_tag.startswith('N'):\n",
    "        return wordnet.NOUN\n",
    "    elif nltk_tag.startswith('R'):\n",
    "        return wordnet.ADV\n",
    "    else:         \n",
    "        return None\n",
    "\n",
    "# Function to lemmatize a single word (removed the keep check)\n",
    "def lemmatize_word(word):\n",
    "    pos = nltk.pos_tag([word])[0][1]  # POS tagging\n",
    "    wordnet_pos = pos_tagger(pos)     # Map POS tag to first character used by WordNetLemmatizer\n",
    "    if wordnet_pos is None:\n",
    "        return word\n",
    "    else:\n",
    "        return lemmatizer.lemmatize(word, wordnet_pos)\n",
    "\n",
    "# Tokenize and then lemmatize\n",
    "df_movies['summarized_syn'] = df_movies['summarized_syn'].apply(\n",
    "    lambda text: ' '.join([lemmatize_word(word) for word in word_tokenize(text)])\n",
    ")\n",
    "\n",
    "# Display the updated DataFrame\n",
    "print(df_movies)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6. Removal of “stop words”"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdb_syn</th>\n",
       "      <th>tmdb_syn</th>\n",
       "      <th>summarized_syn</th>\n",
       "      <th>summarized_syn_cleaned</th>\n",
       "      <th>summarized_syn_tokens</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "      <td>A little boy named Andy loves to be in his roo...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "      <td>a little boy name andy love to be in room play...</td>\n",
       "      <td>[little, boy, andy, love, room, play, toy, esp...</td>\n",
       "      <td>[a, little, boy, name, andy, love, to, be, in,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "      <td>Jumanji, one of the most unique--and dangerous...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "      <td>jumanji one of the most uniqueand dangerousboa...</td>\n",
       "      <td>[jumanji, uniqueand, dangerousboard, game, fal...</td>\n",
       "      <td>[jumanji, one, of, the, most, uniqueand, dange...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "      <td>This story based on the best selling novel by ...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "      <td>this story base on the best sell novel by terr...</td>\n",
       "      <td>[story, base, best, sell, novel, terry, mcmill...</td>\n",
       "      <td>[this, story, base, on, the, best, sell, novel...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>In this sequel to \"Father of the Bride\", Georg...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "      <td>in this sequel to father of the bride george b...</td>\n",
       "      <td>[sequel, father, bride, george, bank, accept, ...</td>\n",
       "      <td>[in, this, sequel, to, father, of, the, bride,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949.0</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>Action|Crime|Thriller</td>\n",
       "      <td>Hunters and their prey--Neil and his professio...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "      <td>hunter and preyneil and professional criminal ...</td>\n",
       "      <td>[hunter, preyneil, professional, criminal, cre...</td>\n",
       "      <td>[hunter, and, preyneil, and, professional, cri...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6178</th>\n",
       "      <td>130856</td>\n",
       "      <td>494826</td>\n",
       "      <td>48376.0</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>Comedy|Documentary</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "      <td>severe clear be a film base on the memoir of f...</td>\n",
       "      <td>[severe, clear, film, base, memoir, lieutenant...</td>\n",
       "      <td>[severe, clear, be, a, film, base, on, the, me...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6179</th>\n",
       "      <td>130958</td>\n",
       "      <td>143338</td>\n",
       "      <td>78402.0</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>Horror</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "      <td>a group of environmentalist arrives at a faraw...</td>\n",
       "      <td>[group, environmentalist, arrives, faraway, tr...</td>\n",
       "      <td>[a, group, of, environmentalist, arrives, at, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6180</th>\n",
       "      <td>130984</td>\n",
       "      <td>208423</td>\n",
       "      <td>317168.0</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>Action|Fantasy|Horror</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "      <td>also know as santo v the shewolves</td>\n",
       "      <td>[know, santo, v, shewolves]</td>\n",
       "      <td>[also, know, as, santo, v, the, shewolves]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6181</th>\n",
       "      <td>131011</td>\n",
       "      <td>69109</td>\n",
       "      <td>79572.0</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>Crime|Drama</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "      <td>bertone be a moderately honest homicide cop be...</td>\n",
       "      <td>[bertone, moderately, honest, homicide, cop, e...</td>\n",
       "      <td>[bertone, be, a, moderately, honest, homicide,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6182</th>\n",
       "      <td>131015</td>\n",
       "      <td>1430116</td>\n",
       "      <td>143928.0</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>Horror|Thriller</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "      <td>a western businessman thai wife and son experi...</td>\n",
       "      <td>[western, businessman, thai, wife, son, experi...</td>\n",
       "      <td>[a, western, businessman, thai, wife, and, son...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6183 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId   imdbId    tmdbId                               title  \\\n",
       "0           1   114709     862.0                    Toy Story (1995)   \n",
       "1           2   113497    8844.0                      Jumanji (1995)   \n",
       "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
       "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
       "4           6   113277     949.0                         Heat (1995)   \n",
       "...       ...      ...       ...                                 ...   \n",
       "6178   130856   494826   48376.0                 Severe Clear (2010)   \n",
       "6179   130958   143338   78402.0             Killer Crocodile (1989)   \n",
       "6180   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
       "6181   131011    69109   79572.0              Execution Squad (1972)   \n",
       "6182   131015  1430116  143928.0                     Hellgate (2011)   \n",
       "\n",
       "                                           genres  \\\n",
       "0     Adventure|Animation|Children|Comedy|Fantasy   \n",
       "1                      Adventure|Children|Fantasy   \n",
       "2                            Comedy|Drama|Romance   \n",
       "3                                          Comedy   \n",
       "4                           Action|Crime|Thriller   \n",
       "...                                           ...   \n",
       "6178                           Comedy|Documentary   \n",
       "6179                                       Horror   \n",
       "6180                        Action|Fantasy|Horror   \n",
       "6181                                  Crime|Drama   \n",
       "6182                              Horror|Thriller   \n",
       "\n",
       "                                               imdb_syn  \\\n",
       "0     A little boy named Andy loves to be in his roo...   \n",
       "1     Jumanji, one of the most unique--and dangerous...   \n",
       "2     This story based on the best selling novel by ...   \n",
       "3     In this sequel to \"Father of the Bride\", Georg...   \n",
       "4     Hunters and their prey--Neil and his professio...   \n",
       "...                                                 ...   \n",
       "6178                                                NaN   \n",
       "6179                                                NaN   \n",
       "6180                                                NaN   \n",
       "6181                                                NaN   \n",
       "6182                                                NaN   \n",
       "\n",
       "                                               tmdb_syn  \\\n",
       "0     Led by Woody, Andy's toys live happily in his ...   \n",
       "1     When siblings Judy and Peter discover an encha...   \n",
       "2     Cheated on, mistreated and stepped on, the wom...   \n",
       "3     Just when George Banks has recovered from his ...   \n",
       "4     Obsessive master thief Neil McCauley leads a t...   \n",
       "...                                                 ...   \n",
       "6178  Severe Clear is a film based on the memoirs of...   \n",
       "6179  A group of environmentalists arrives at a fara...   \n",
       "6180             Also known as Santo vs. the She-Wolves   \n",
       "6181  Bertone is a moderately honest homicide cop. U...   \n",
       "6182  A western businessman, his Thai wife and son e...   \n",
       "\n",
       "                                         summarized_syn  \\\n",
       "0     a little boy name andy love to be in room play...   \n",
       "1     jumanji one of the most uniqueand dangerousboa...   \n",
       "2     this story base on the best sell novel by terr...   \n",
       "3     in this sequel to father of the bride george b...   \n",
       "4     hunter and preyneil and professional criminal ...   \n",
       "...                                                 ...   \n",
       "6178  severe clear be a film base on the memoir of f...   \n",
       "6179  a group of environmentalist arrives at a faraw...   \n",
       "6180                 also know as santo v the shewolves   \n",
       "6181  bertone be a moderately honest homicide cop be...   \n",
       "6182  a western businessman thai wife and son experi...   \n",
       "\n",
       "                                 summarized_syn_cleaned  \\\n",
       "0     [little, boy, andy, love, room, play, toy, esp...   \n",
       "1     [jumanji, uniqueand, dangerousboard, game, fal...   \n",
       "2     [story, base, best, sell, novel, terry, mcmill...   \n",
       "3     [sequel, father, bride, george, bank, accept, ...   \n",
       "4     [hunter, preyneil, professional, criminal, cre...   \n",
       "...                                                 ...   \n",
       "6178  [severe, clear, film, base, memoir, lieutenant...   \n",
       "6179  [group, environmentalist, arrives, faraway, tr...   \n",
       "6180                        [know, santo, v, shewolves]   \n",
       "6181  [bertone, moderately, honest, homicide, cop, e...   \n",
       "6182  [western, businessman, thai, wife, son, experi...   \n",
       "\n",
       "                                  summarized_syn_tokens  \n",
       "0     [a, little, boy, name, andy, love, to, be, in,...  \n",
       "1     [jumanji, one, of, the, most, uniqueand, dange...  \n",
       "2     [this, story, base, on, the, best, sell, novel...  \n",
       "3     [in, this, sequel, to, father, of, the, bride,...  \n",
       "4     [hunter, and, preyneil, and, professional, cri...  \n",
       "...                                                 ...  \n",
       "6178  [severe, clear, be, a, film, base, on, the, me...  \n",
       "6179  [a, group, of, environmentalist, arrives, at, ...  \n",
       "6180         [also, know, as, santo, v, the, shewolves]  \n",
       "6181  [bertone, be, a, moderately, honest, homicide,...  \n",
       "6182  [a, western, businessman, thai, wife, and, son...  \n",
       "\n",
       "[6183 rows x 10 columns]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "from spacy.lang.en import STOP_WORDS\n",
    "\n",
    "# Function to remove stop words from a list of words\n",
    "def remove_stop_words(words_list):\n",
    "    return [word for word in words_list if word.lower() not in STOP_WORDS]\n",
    "\n",
    "# First, tokenize the sentences into words\n",
    "df_movies['summarized_syn_tokens'] = df_movies['summarized_syn'].apply(\n",
    "    lambda element: word_tokenize(element) if isinstance(element, str) else element\n",
    ")\n",
    "\n",
    "# Now remove stop words from the 'summarized_syn_tokens' column\n",
    "df_movies['summarized_syn_cleaned'] = df_movies['summarized_syn_tokens'].apply(\n",
    "    lambda element: remove_stop_words(element) if isinstance(element, list) else element\n",
    ")\n",
    "\n",
    "# Display the updated DataFrame\n",
    "df_movies\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Encoding genre column using GloVe vectors\n",
    "\n",
    "- Using pre-trained word vectors (wikipedia) - 200d\n",
    "ref: https://github.com/stanfordnlp/GloVe \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_movies['genres'] = df_movies['genres'].apply(lambda x: x.split('|'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get a distinct list of genres\n",
    "\n",
    "\n",
    "# Assuming df_movies['genres'] has been split into lists of strings\n",
    "unique_genres = set()\n",
    "\n",
    "# Iterate over the 'genres' column to populate the unique_genres set\n",
    "for genre_list in df_movies['genres']:\n",
    "    unique_genres.update(genre_list)\n",
    "\n",
    "# Convert the set to a list, if needed\n",
    "unique_genres = list(unique_genres)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "unique_genres = [genre.lower() for genre in unique_genres]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['imax',\n",
       " 'comedy',\n",
       " '(no genres listed)',\n",
       " 'crime',\n",
       " 'animation',\n",
       " 'drama',\n",
       " 'horror',\n",
       " 'mystery',\n",
       " 'adventure',\n",
       " 'action',\n",
       " 'fantasy',\n",
       " 'film-noir',\n",
       " 'documentary',\n",
       " 'war',\n",
       " 'sci-fi',\n",
       " 'children',\n",
       " 'western',\n",
       " 'romance',\n",
       " 'thriller',\n",
       " 'musical']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_genres"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"film-noir\" has been replaced to \"noir\"\n",
    "- As film-noir is not detected in the glove vectors \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['imax',\n",
       " 'comedy',\n",
       " '(no genres listed)',\n",
       " 'crime',\n",
       " 'animation',\n",
       " 'drama',\n",
       " 'horror',\n",
       " 'mystery',\n",
       " 'adventure',\n",
       " 'action',\n",
       " 'fantasy',\n",
       " 'noir',\n",
       " 'documentary',\n",
       " 'war',\n",
       " 'sci-fi',\n",
       " 'children',\n",
       " 'western',\n",
       " 'romance',\n",
       " 'thriller',\n",
       " 'musical']"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_genres = [genre.replace(\"film-noir\", \"noir\") for genre in unique_genres]\n",
    "unique_genres"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting the glove vectors for each unique genre "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found 400001 word vectors.\n"
     ]
    }
   ],
   "source": [
    "# ref: https://keras.io/examples/nlp/pretrained_word_embeddings/\n",
    "\n",
    "def get_glove(file_path):\n",
    "\n",
    "    embeddings_index = {}\n",
    "    with open(path_to_glove_file, 'r', encoding='utf-8') as f:\n",
    "        for line in f:\n",
    "            word, coefs = line.split(maxsplit=1)\n",
    "            coefs = np.fromstring(coefs, \"f\", sep=\" \")\n",
    "            embeddings_index[word] = coefs\n",
    "    return embeddings_index\n",
    "\n",
    "path_to_glove_file = \"../pretrain_model/glove.6B/glove.6B.200d.txt\"\n",
    "glove_vec = get_glove(path_to_glove_file)\n",
    "\n",
    "print(\"Found %s word vectors.\" % len(glove_vec))\n",
    "\n",
    " # 200 zero vec is assigned when the word is not found in the GloVe index "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No GloVe representation found for genre: (no genres listed)\n"
     ]
    }
   ],
   "source": [
    "# Initialize an empty dictionary to hold the GloVe vectors for each unique genre\n",
    "genre_glove_vec = {}\n",
    "\n",
    "# Populate the genre_glove_vec dictionary\n",
    "for genre in unique_genres:\n",
    "    if genre in glove_vec:  # Check if the genre name is available in the GloVe vocab\n",
    "        genre_glove_vec[genre] = glove_vec[genre]\n",
    "    else:\n",
    "        print(f\"No GloVe representation found for genre: {genre}\")\n",
    "        genre_glove_vec[genre] = None  # You can also populate with a default vector if needed\n",
    "\n",
    "# Now, genre_glove_vec contains the GloVe representation for each unique genre\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Number of movies with  no genres: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of movies with no genres: 0\n"
     ]
    }
   ],
   "source": [
    "# Count the number of movies with no genres\n",
    "count_no_genres = df_movies['genres'].apply(lambda x: len(x) == 0).sum()\n",
    "\n",
    "print(f'Number of movies with no genres: {count_no_genres}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All movies have an assigned genre!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# passing the genre_glove_vec to file - which will be read in the CB model \n",
    "import pickle\n",
    "\n",
    "# Writing to file\n",
    "with open('../pretrain_model/genre_glove_vec.pkl', 'wb') as f:\n",
    "    pickle.dump(genre_glove_vec, f)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# put movieId and genres into one dataframe too"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_output= df_movies[[\"movieId\", \"genres\"]]\n",
    "df_output.to_csv(\"../dataset/movie_genre.csv\",index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdb_syn</th>\n",
       "      <th>tmdb_syn</th>\n",
       "      <th>summarized_syn</th>\n",
       "      <th>summarized_syn_cleaned</th>\n",
       "      <th>summarized_syn_tokens</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>[Adventure, Animation, Children, Comedy, Fantasy]</td>\n",
       "      <td>A little boy named Andy loves to be in his roo...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "      <td>a little boy name andy love to be in room play...</td>\n",
       "      <td>[little, boy, andy, love, room, play, toy, esp...</td>\n",
       "      <td>[a, little, boy, name, andy, love, to, be, in,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>[Adventure, Children, Fantasy]</td>\n",
       "      <td>Jumanji, one of the most unique--and dangerous...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "      <td>jumanji one of the most uniqueand dangerousboa...</td>\n",
       "      <td>[jumanji, uniqueand, dangerousboard, game, fal...</td>\n",
       "      <td>[jumanji, one, of, the, most, uniqueand, dange...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>[Comedy, Drama, Romance]</td>\n",
       "      <td>This story based on the best selling novel by ...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "      <td>this story base on the best sell novel by terr...</td>\n",
       "      <td>[story, base, best, sell, novel, terry, mcmill...</td>\n",
       "      <td>[this, story, base, on, the, best, sell, novel...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>[Comedy]</td>\n",
       "      <td>In this sequel to \"Father of the Bride\", Georg...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "      <td>in this sequel to father of the bride george b...</td>\n",
       "      <td>[sequel, father, bride, george, bank, accept, ...</td>\n",
       "      <td>[in, this, sequel, to, father, of, the, bride,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949.0</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>[Action, Crime, Thriller]</td>\n",
       "      <td>Hunters and their prey--Neil and his professio...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "      <td>hunter and preyneil and professional criminal ...</td>\n",
       "      <td>[hunter, preyneil, professional, criminal, cre...</td>\n",
       "      <td>[hunter, and, preyneil, and, professional, cri...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6178</th>\n",
       "      <td>130856</td>\n",
       "      <td>494826</td>\n",
       "      <td>48376.0</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>[Comedy, Documentary]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "      <td>severe clear be a film base on the memoir of f...</td>\n",
       "      <td>[severe, clear, film, base, memoir, lieutenant...</td>\n",
       "      <td>[severe, clear, be, a, film, base, on, the, me...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6179</th>\n",
       "      <td>130958</td>\n",
       "      <td>143338</td>\n",
       "      <td>78402.0</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>[Horror]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "      <td>a group of environmentalist arrives at a faraw...</td>\n",
       "      <td>[group, environmentalist, arrives, faraway, tr...</td>\n",
       "      <td>[a, group, of, environmentalist, arrives, at, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6180</th>\n",
       "      <td>130984</td>\n",
       "      <td>208423</td>\n",
       "      <td>317168.0</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>[Action, Fantasy, Horror]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "      <td>also know as santo v the shewolves</td>\n",
       "      <td>[know, santo, v, shewolves]</td>\n",
       "      <td>[also, know, as, santo, v, the, shewolves]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6181</th>\n",
       "      <td>131011</td>\n",
       "      <td>69109</td>\n",
       "      <td>79572.0</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>[Crime, Drama]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "      <td>bertone be a moderately honest homicide cop be...</td>\n",
       "      <td>[bertone, moderately, honest, homicide, cop, e...</td>\n",
       "      <td>[bertone, be, a, moderately, honest, homicide,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6182</th>\n",
       "      <td>131015</td>\n",
       "      <td>1430116</td>\n",
       "      <td>143928.0</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>[Horror, Thriller]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "      <td>a western businessman thai wife and son experi...</td>\n",
       "      <td>[western, businessman, thai, wife, son, experi...</td>\n",
       "      <td>[a, western, businessman, thai, wife, and, son...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6183 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId   imdbId    tmdbId                               title  \\\n",
       "0           1   114709     862.0                    Toy Story (1995)   \n",
       "1           2   113497    8844.0                      Jumanji (1995)   \n",
       "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
       "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
       "4           6   113277     949.0                         Heat (1995)   \n",
       "...       ...      ...       ...                                 ...   \n",
       "6178   130856   494826   48376.0                 Severe Clear (2010)   \n",
       "6179   130958   143338   78402.0             Killer Crocodile (1989)   \n",
       "6180   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
       "6181   131011    69109   79572.0              Execution Squad (1972)   \n",
       "6182   131015  1430116  143928.0                     Hellgate (2011)   \n",
       "\n",
       "                                                 genres  \\\n",
       "0     [Adventure, Animation, Children, Comedy, Fantasy]   \n",
       "1                        [Adventure, Children, Fantasy]   \n",
       "2                              [Comedy, Drama, Romance]   \n",
       "3                                              [Comedy]   \n",
       "4                             [Action, Crime, Thriller]   \n",
       "...                                                 ...   \n",
       "6178                              [Comedy, Documentary]   \n",
       "6179                                           [Horror]   \n",
       "6180                          [Action, Fantasy, Horror]   \n",
       "6181                                     [Crime, Drama]   \n",
       "6182                                 [Horror, Thriller]   \n",
       "\n",
       "                                               imdb_syn  \\\n",
       "0     A little boy named Andy loves to be in his roo...   \n",
       "1     Jumanji, one of the most unique--and dangerous...   \n",
       "2     This story based on the best selling novel by ...   \n",
       "3     In this sequel to \"Father of the Bride\", Georg...   \n",
       "4     Hunters and their prey--Neil and his professio...   \n",
       "...                                                 ...   \n",
       "6178                                                NaN   \n",
       "6179                                                NaN   \n",
       "6180                                                NaN   \n",
       "6181                                                NaN   \n",
       "6182                                                NaN   \n",
       "\n",
       "                                               tmdb_syn  \\\n",
       "0     Led by Woody, Andy's toys live happily in his ...   \n",
       "1     When siblings Judy and Peter discover an encha...   \n",
       "2     Cheated on, mistreated and stepped on, the wom...   \n",
       "3     Just when George Banks has recovered from his ...   \n",
       "4     Obsessive master thief Neil McCauley leads a t...   \n",
       "...                                                 ...   \n",
       "6178  Severe Clear is a film based on the memoirs of...   \n",
       "6179  A group of environmentalists arrives at a fara...   \n",
       "6180             Also known as Santo vs. the She-Wolves   \n",
       "6181  Bertone is a moderately honest homicide cop. U...   \n",
       "6182  A western businessman, his Thai wife and son e...   \n",
       "\n",
       "                                         summarized_syn  \\\n",
       "0     a little boy name andy love to be in room play...   \n",
       "1     jumanji one of the most uniqueand dangerousboa...   \n",
       "2     this story base on the best sell novel by terr...   \n",
       "3     in this sequel to father of the bride george b...   \n",
       "4     hunter and preyneil and professional criminal ...   \n",
       "...                                                 ...   \n",
       "6178  severe clear be a film base on the memoir of f...   \n",
       "6179  a group of environmentalist arrives at a faraw...   \n",
       "6180                 also know as santo v the shewolves   \n",
       "6181  bertone be a moderately honest homicide cop be...   \n",
       "6182  a western businessman thai wife and son experi...   \n",
       "\n",
       "                                 summarized_syn_cleaned  \\\n",
       "0     [little, boy, andy, love, room, play, toy, esp...   \n",
       "1     [jumanji, uniqueand, dangerousboard, game, fal...   \n",
       "2     [story, base, best, sell, novel, terry, mcmill...   \n",
       "3     [sequel, father, bride, george, bank, accept, ...   \n",
       "4     [hunter, preyneil, professional, criminal, cre...   \n",
       "...                                                 ...   \n",
       "6178  [severe, clear, film, base, memoir, lieutenant...   \n",
       "6179  [group, environmentalist, arrives, faraway, tr...   \n",
       "6180                        [know, santo, v, shewolves]   \n",
       "6181  [bertone, moderately, honest, homicide, cop, e...   \n",
       "6182  [western, businessman, thai, wife, son, experi...   \n",
       "\n",
       "                                  summarized_syn_tokens  \n",
       "0     [a, little, boy, name, andy, love, to, be, in,...  \n",
       "1     [jumanji, one, of, the, most, uniqueand, dange...  \n",
       "2     [this, story, base, on, the, best, sell, novel...  \n",
       "3     [in, this, sequel, to, father, of, the, bride,...  \n",
       "4     [hunter, and, preyneil, and, professional, cri...  \n",
       "...                                                 ...  \n",
       "6178  [severe, clear, be, a, film, base, on, the, me...  \n",
       "6179  [a, group, of, environmentalist, arrives, at, ...  \n",
       "6180         [also, know, as, santo, v, the, shewolves]  \n",
       "6181  [bertone, be, a, moderately, honest, homicide,...  \n",
       "6182  [a, western, businessman, thai, wife, and, son...  \n",
       "\n",
       "[6183 rows x 10 columns]"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdb_syn</th>\n",
       "      <th>tmdb_syn</th>\n",
       "      <th>summarized_syn</th>\n",
       "      <th>summarized_syn_cleaned</th>\n",
       "      <th>summarized_syn_tokens</th>\n",
       "      <th>...</th>\n",
       "      <th>Film-Noir</th>\n",
       "      <th>Horror</th>\n",
       "      <th>IMAX</th>\n",
       "      <th>Musical</th>\n",
       "      <th>Mystery</th>\n",
       "      <th>Romance</th>\n",
       "      <th>Sci-Fi</th>\n",
       "      <th>Thriller</th>\n",
       "      <th>War</th>\n",
       "      <th>Western</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>[Adventure, Animation, Children, Comedy, Fantasy]</td>\n",
       "      <td>A little boy named Andy loves to be in his roo...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "      <td>a little boy name andy love to be in room play...</td>\n",
       "      <td>[little, boy, andy, love, room, play, toy, esp...</td>\n",
       "      <td>[a, little, boy, name, andy, love, to, be, in,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>[Adventure, Children, Fantasy]</td>\n",
       "      <td>Jumanji, one of the most unique--and dangerous...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "      <td>jumanji one of the most uniqueand dangerousboa...</td>\n",
       "      <td>[jumanji, uniqueand, dangerousboard, game, fal...</td>\n",
       "      <td>[jumanji, one, of, the, most, uniqueand, dange...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>[Comedy, Drama, Romance]</td>\n",
       "      <td>This story based on the best selling novel by ...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "      <td>this story base on the best sell novel by terr...</td>\n",
       "      <td>[story, base, best, sell, novel, terry, mcmill...</td>\n",
       "      <td>[this, story, base, on, the, best, sell, novel...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>[Comedy]</td>\n",
       "      <td>In this sequel to \"Father of the Bride\", Georg...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "      <td>in this sequel to father of the bride george b...</td>\n",
       "      <td>[sequel, father, bride, george, bank, accept, ...</td>\n",
       "      <td>[in, this, sequel, to, father, of, the, bride,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949.0</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>[Action, Crime, Thriller]</td>\n",
       "      <td>Hunters and their prey--Neil and his professio...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "      <td>hunter and preyneil and professional criminal ...</td>\n",
       "      <td>[hunter, preyneil, professional, criminal, cre...</td>\n",
       "      <td>[hunter, and, preyneil, and, professional, cri...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6178</th>\n",
       "      <td>130856</td>\n",
       "      <td>494826</td>\n",
       "      <td>48376.0</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>[Comedy, Documentary]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "      <td>severe clear be a film base on the memoir of f...</td>\n",
       "      <td>[severe, clear, film, base, memoir, lieutenant...</td>\n",
       "      <td>[severe, clear, be, a, film, base, on, the, me...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6179</th>\n",
       "      <td>130958</td>\n",
       "      <td>143338</td>\n",
       "      <td>78402.0</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>[Horror]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "      <td>a group of environmentalist arrives at a faraw...</td>\n",
       "      <td>[group, environmentalist, arrives, faraway, tr...</td>\n",
       "      <td>[a, group, of, environmentalist, arrives, at, ...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6180</th>\n",
       "      <td>130984</td>\n",
       "      <td>208423</td>\n",
       "      <td>317168.0</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>[Action, Fantasy, Horror]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "      <td>also know as santo v the shewolves</td>\n",
       "      <td>[know, santo, v, shewolves]</td>\n",
       "      <td>[also, know, as, santo, v, the, shewolves]</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6181</th>\n",
       "      <td>131011</td>\n",
       "      <td>69109</td>\n",
       "      <td>79572.0</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>[Crime, Drama]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "      <td>bertone be a moderately honest homicide cop be...</td>\n",
       "      <td>[bertone, moderately, honest, homicide, cop, e...</td>\n",
       "      <td>[bertone, be, a, moderately, honest, homicide,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6182</th>\n",
       "      <td>131015</td>\n",
       "      <td>1430116</td>\n",
       "      <td>143928.0</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>[Horror, Thriller]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "      <td>a western businessman thai wife and son experi...</td>\n",
       "      <td>[western, businessman, thai, wife, son, experi...</td>\n",
       "      <td>[a, western, businessman, thai, wife, and, son...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6183 rows × 30 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId   imdbId    tmdbId                               title  \\\n",
       "0           1   114709     862.0                    Toy Story (1995)   \n",
       "1           2   113497    8844.0                      Jumanji (1995)   \n",
       "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
       "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
       "4           6   113277     949.0                         Heat (1995)   \n",
       "...       ...      ...       ...                                 ...   \n",
       "6178   130856   494826   48376.0                 Severe Clear (2010)   \n",
       "6179   130958   143338   78402.0             Killer Crocodile (1989)   \n",
       "6180   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
       "6181   131011    69109   79572.0              Execution Squad (1972)   \n",
       "6182   131015  1430116  143928.0                     Hellgate (2011)   \n",
       "\n",
       "                                                 genres  \\\n",
       "0     [Adventure, Animation, Children, Comedy, Fantasy]   \n",
       "1                        [Adventure, Children, Fantasy]   \n",
       "2                              [Comedy, Drama, Romance]   \n",
       "3                                              [Comedy]   \n",
       "4                             [Action, Crime, Thriller]   \n",
       "...                                                 ...   \n",
       "6178                              [Comedy, Documentary]   \n",
       "6179                                           [Horror]   \n",
       "6180                          [Action, Fantasy, Horror]   \n",
       "6181                                     [Crime, Drama]   \n",
       "6182                                 [Horror, Thriller]   \n",
       "\n",
       "                                               imdb_syn  \\\n",
       "0     A little boy named Andy loves to be in his roo...   \n",
       "1     Jumanji, one of the most unique--and dangerous...   \n",
       "2     This story based on the best selling novel by ...   \n",
       "3     In this sequel to \"Father of the Bride\", Georg...   \n",
       "4     Hunters and their prey--Neil and his professio...   \n",
       "...                                                 ...   \n",
       "6178                                                NaN   \n",
       "6179                                                NaN   \n",
       "6180                                                NaN   \n",
       "6181                                                NaN   \n",
       "6182                                                NaN   \n",
       "\n",
       "                                               tmdb_syn  \\\n",
       "0     Led by Woody, Andy's toys live happily in his ...   \n",
       "1     When siblings Judy and Peter discover an encha...   \n",
       "2     Cheated on, mistreated and stepped on, the wom...   \n",
       "3     Just when George Banks has recovered from his ...   \n",
       "4     Obsessive master thief Neil McCauley leads a t...   \n",
       "...                                                 ...   \n",
       "6178  Severe Clear is a film based on the memoirs of...   \n",
       "6179  A group of environmentalists arrives at a fara...   \n",
       "6180             Also known as Santo vs. the She-Wolves   \n",
       "6181  Bertone is a moderately honest homicide cop. U...   \n",
       "6182  A western businessman, his Thai wife and son e...   \n",
       "\n",
       "                                         summarized_syn  \\\n",
       "0     a little boy name andy love to be in room play...   \n",
       "1     jumanji one of the most uniqueand dangerousboa...   \n",
       "2     this story base on the best sell novel by terr...   \n",
       "3     in this sequel to father of the bride george b...   \n",
       "4     hunter and preyneil and professional criminal ...   \n",
       "...                                                 ...   \n",
       "6178  severe clear be a film base on the memoir of f...   \n",
       "6179  a group of environmentalist arrives at a faraw...   \n",
       "6180                 also know as santo v the shewolves   \n",
       "6181  bertone be a moderately honest homicide cop be...   \n",
       "6182  a western businessman thai wife and son experi...   \n",
       "\n",
       "                                 summarized_syn_cleaned  \\\n",
       "0     [little, boy, andy, love, room, play, toy, esp...   \n",
       "1     [jumanji, uniqueand, dangerousboard, game, fal...   \n",
       "2     [story, base, best, sell, novel, terry, mcmill...   \n",
       "3     [sequel, father, bride, george, bank, accept, ...   \n",
       "4     [hunter, preyneil, professional, criminal, cre...   \n",
       "...                                                 ...   \n",
       "6178  [severe, clear, film, base, memoir, lieutenant...   \n",
       "6179  [group, environmentalist, arrives, faraway, tr...   \n",
       "6180                        [know, santo, v, shewolves]   \n",
       "6181  [bertone, moderately, honest, homicide, cop, e...   \n",
       "6182  [western, businessman, thai, wife, son, experi...   \n",
       "\n",
       "                                  summarized_syn_tokens  ...  Film-Noir  \\\n",
       "0     [a, little, boy, name, andy, love, to, be, in,...  ...          0   \n",
       "1     [jumanji, one, of, the, most, uniqueand, dange...  ...          0   \n",
       "2     [this, story, base, on, the, best, sell, novel...  ...          0   \n",
       "3     [in, this, sequel, to, father, of, the, bride,...  ...          0   \n",
       "4     [hunter, and, preyneil, and, professional, cri...  ...          0   \n",
       "...                                                 ...  ...        ...   \n",
       "6178  [severe, clear, be, a, film, base, on, the, me...  ...          0   \n",
       "6179  [a, group, of, environmentalist, arrives, at, ...  ...          0   \n",
       "6180         [also, know, as, santo, v, the, shewolves]  ...          0   \n",
       "6181  [bertone, be, a, moderately, honest, homicide,...  ...          0   \n",
       "6182  [a, western, businessman, thai, wife, and, son...  ...          0   \n",
       "\n",
       "      Horror  IMAX  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western  \n",
       "0          0     0        0        0        0       0         0    0        0  \n",
       "1          0     0        0        0        0       0         0    0        0  \n",
       "2          0     0        0        0        1       0         0    0        0  \n",
       "3          0     0        0        0        0       0         0    0        0  \n",
       "4          0     0        0        0        0       0         1    0        0  \n",
       "...      ...   ...      ...      ...      ...     ...       ...  ...      ...  \n",
       "6178       0     0        0        0        0       0         0    0        0  \n",
       "6179       1     0        0        0        0       0         0    0        0  \n",
       "6180       1     0        0        0        0       0         0    0        0  \n",
       "6181       0     0        0        0        0       0         0    0        0  \n",
       "6182       1     0        0        0        0       0         1    0        0  \n",
       "\n",
       "[6183 rows x 30 columns]"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from sklearn.preprocessing import MultiLabelBinarizer\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "# Initialize MultiLabelBinarizer\n",
    "mlb = MultiLabelBinarizer()\n",
    "\n",
    "# Transform the 'genres' column to a binary matrix\n",
    "binary_genres = mlb.fit_transform(df_movies['genres'])\n",
    "\n",
    "# Create a new DataFrame from the binary matrix\n",
    "df_genres = pd.DataFrame(binary_genres, columns=mlb.classes_)\n",
    "\n",
    "# Concatenate the original DataFrame and the new DataFrame\n",
    "df_movies = pd.concat([df_movies, df_genres], axis=1)\n",
    "\n",
    "df_movies\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>[Adventure, Animation, Children, Comedy, Fantasy]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>[Adventure, Children, Fantasy]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>[Comedy, Drama, Romance]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>[Comedy]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>[Action, Crime, Thriller]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>7</td>\n",
       "      <td>[Comedy, Romance]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>10</td>\n",
       "      <td>[Action, Adventure, Thriller]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>11</td>\n",
       "      <td>[Comedy, Drama, Romance]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>12</td>\n",
       "      <td>[Comedy, Horror]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>14</td>\n",
       "      <td>[Drama]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId                                             genres\n",
       "0        1  [Adventure, Animation, Children, Comedy, Fantasy]\n",
       "1        2                     [Adventure, Children, Fantasy]\n",
       "2        4                           [Comedy, Drama, Romance]\n",
       "3        5                                           [Comedy]\n",
       "4        6                          [Action, Crime, Thriller]\n",
       "5        7                                  [Comedy, Romance]\n",
       "6       10                      [Action, Adventure, Thriller]\n",
       "7       11                           [Comedy, Drama, Romance]\n",
       "8       12                                   [Comedy, Horror]\n",
       "9       14                                            [Drama]"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies[[\"movieId\", \"genres\"]].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>(no genres listed)</th>\n",
       "      <th>Action</th>\n",
       "      <th>Adventure</th>\n",
       "      <th>Animation</th>\n",
       "      <th>Children</th>\n",
       "      <th>Comedy</th>\n",
       "      <th>Crime</th>\n",
       "      <th>Documentary</th>\n",
       "      <th>Drama</th>\n",
       "      <th>Fantasy</th>\n",
       "      <th>Film-Noir</th>\n",
       "      <th>Horror</th>\n",
       "      <th>IMAX</th>\n",
       "      <th>Musical</th>\n",
       "      <th>Mystery</th>\n",
       "      <th>Romance</th>\n",
       "      <th>Sci-Fi</th>\n",
       "      <th>Thriller</th>\n",
       "      <th>War</th>\n",
       "      <th>Western</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6178</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6179</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6180</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6181</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6182</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6183 rows × 20 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      (no genres listed)  Action  Adventure  Animation  Children  Comedy  \\\n",
       "0                      0       0          1          1         1       1   \n",
       "1                      0       0          1          0         1       0   \n",
       "2                      0       0          0          0         0       1   \n",
       "3                      0       0          0          0         0       1   \n",
       "4                      0       1          0          0         0       0   \n",
       "...                  ...     ...        ...        ...       ...     ...   \n",
       "6178                   0       0          0          0         0       1   \n",
       "6179                   0       0          0          0         0       0   \n",
       "6180                   0       1          0          0         0       0   \n",
       "6181                   0       0          0          0         0       0   \n",
       "6182                   0       0          0          0         0       0   \n",
       "\n",
       "      Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  IMAX  Musical  \\\n",
       "0         0            0      0        1          0       0     0        0   \n",
       "1         0            0      0        1          0       0     0        0   \n",
       "2         0            0      1        0          0       0     0        0   \n",
       "3         0            0      0        0          0       0     0        0   \n",
       "4         1            0      0        0          0       0     0        0   \n",
       "...     ...          ...    ...      ...        ...     ...   ...      ...   \n",
       "6178      0            1      0        0          0       0     0        0   \n",
       "6179      0            0      0        0          0       1     0        0   \n",
       "6180      0            0      0        1          0       1     0        0   \n",
       "6181      1            0      1        0          0       0     0        0   \n",
       "6182      0            0      0        0          0       1     0        0   \n",
       "\n",
       "      Mystery  Romance  Sci-Fi  Thriller  War  Western  \n",
       "0           0        0       0         0    0        0  \n",
       "1           0        0       0         0    0        0  \n",
       "2           0        1       0         0    0        0  \n",
       "3           0        0       0         0    0        0  \n",
       "4           0        0       0         1    0        0  \n",
       "...       ...      ...     ...       ...  ...      ...  \n",
       "6178        0        0       0         0    0        0  \n",
       "6179        0        0       0         0    0        0  \n",
       "6180        0        0       0         0    0        0  \n",
       "6181        0        0       0         0    0        0  \n",
       "6182        0        0       0         1    0        0  \n",
       "\n",
       "[6183 rows x 20 columns]"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_genres"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CBOW TF-IDF Method to generate Context Vector "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. tf-IDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer()\n",
    "tfidf_matrix = vectorizer.fit_transform(df_movies['summarized_syn_cleaned'].apply(' '.join))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<6183x20223 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 128331 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# Assuming df_movies['summarized_syn_cleaned'] is a Series of lists of words\n",
    "# We'll join the lists into single strings for each document\n",
    "tfidf_matrix = vectorizer.fit_transform(df_movies['summarized_syn_cleaned'].apply(' '.join))\n",
    "\n",
    "# Get the words corresponding to the features of the TF-IDF matrix\n",
    "feature_names = vectorizer.get_feature_names_out()\n",
    "\n",
    "# Convert the tfidf_matrix to a dense matrix, then to a DataFrame\n",
    "tfidf_dataframe = pd.DataFrame(tfidf_matrix.todense(), columns=feature_names)\n",
    "\n",
    "# Show the head of the DataFrame for a preview\n",
    "tfidf_matrix\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<6183x20223 sparse matrix of type '<class 'numpy.float64'>'\n",
       "\twith 128331 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfidf_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. CBOW Representation with weight with TF-IDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Need to use 200-dimensional pre-trained word vectors\n",
    "Why? Because this matches the dimensionality of the vectors that we use for the GloVe vectors\n",
    "- This is what we need to compare in the WSD operation. They need to have the same dimensionality.\n",
    "- **Put this in thesis.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using GloVe vectors in your Continuous Bag-of-Words (CBOW) model is a common practice. The idea is to replace the word vectors you generate in the CBOW model with pre-trained GloVe vectors. This way, you can leverage the semantic information captured during the GloVe pre-training process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "def get_glove(file_path):\n",
    "\n",
    "    \n",
    "    embeddings_index = {}\n",
    "    with open(path_to_glove_file, 'r', encoding='utf-8') as f:\n",
    "        for line in f:\n",
    "            word, coefs = line.split(maxsplit=1)\n",
    "            coefs = np.fromstring(coefs, \"f\", sep=\" \")\n",
    "            embeddings_index[word] = coefs\n",
    "    return embeddings_index\n",
    "\n",
    "path_to_glove_file = \"../pretrain_model/glove.6B/glove.6B.200d.txt\"\n",
    "glove_vec = get_glove(path_to_glove_file)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_weighted_cbow_vector(tokens, tfidf_scores, vocab, model):\n",
    "    weighted_vector = np.zeros((200,))  # Updated dimensionality to match the GloVe vectors\n",
    "    for token, score in zip(tokens, tfidf_scores):\n",
    "        if token in vocab:\n",
    "            try:\n",
    "                weighted_vector += model[token] * score  # Used 'model' parameter\n",
    "            except KeyError:  # Token might not be in the model vocabulary\n",
    "                continue\n",
    "    if len(tokens) > 0:\n",
    "        return weighted_vector / len(tokens)\n",
    "    else:\n",
    "        return np.zeros((200,))\n",
    "\n",
    "\n",
    "# Convert sparse tfidf_matrix to dense form and iterate to compute weighted CBOW vectors, WHY: because not all algorithms work well with sparse form. Helps with element-wise operations. \n",
    "dense_tfidf = tfidf_matrix.todense()\n",
    "vocab = set(vectorizer.get_feature_names_out())\n",
    "df_movies['weighted_cbow_synopsis'] = [get_weighted_cbow_vector(tokens, dense_tfidf[i].tolist()[0], vocab, glove_vec) for i, tokens in enumerate(df_movies['summarized_syn_cleaned'])]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How is tf-IDF used here? With Filtering (vocab): You ensure that the token is not only in the GloVe vocabulary but also relevant in your specific corpus according to TF-IDF. This could be more precise for your use-case but may miss out on some broader semantic relationships captured by GloVe.\n",
    "- Gives more importance to context words that are relevant within your specific corpus. \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transforming genre matrix and weighted cbow for each movie in df_movies\n",
    "\n",
    "These will be separate columns\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Convert the 'context_vector_synopsis' and 'binary_genres' to NumPy arrays if they are not already\n",
    "df_movies['final_context_vector'] = df_movies['weighted_cbow_synopsis'].apply(np.array)\n",
    "df_genres_np = df_genres.to_numpy()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 1, ..., 0, 0, 0],\n",
       "       [0, 0, 1, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       ...,\n",
       "       [0, 1, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 1, 0, 0]])"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_genres_np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdb_syn</th>\n",
       "      <th>tmdb_syn</th>\n",
       "      <th>summarized_syn</th>\n",
       "      <th>summarized_syn_cleaned</th>\n",
       "      <th>summarized_syn_tokens</th>\n",
       "      <th>...</th>\n",
       "      <th>IMAX</th>\n",
       "      <th>Musical</th>\n",
       "      <th>Mystery</th>\n",
       "      <th>Romance</th>\n",
       "      <th>Sci-Fi</th>\n",
       "      <th>Thriller</th>\n",
       "      <th>War</th>\n",
       "      <th>Western</th>\n",
       "      <th>weighted_cbow_synopsis</th>\n",
       "      <th>final_context_vector</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>[Adventure, Animation, Children, Comedy, Fantasy]</td>\n",
       "      <td>A little boy named Andy loves to be in his roo...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "      <td>a little boy name andy love to be in room play...</td>\n",
       "      <td>[little, boy, andy, love, room, play, toy, esp...</td>\n",
       "      <td>[a, little, boy, name, andy, love, to, be, in,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>[Adventure, Children, Fantasy]</td>\n",
       "      <td>Jumanji, one of the most unique--and dangerous...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "      <td>jumanji one of the most uniqueand dangerousboa...</td>\n",
       "      <td>[jumanji, uniqueand, dangerousboard, game, fal...</td>\n",
       "      <td>[jumanji, one, of, the, most, uniqueand, dange...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>[Comedy, Drama, Romance]</td>\n",
       "      <td>This story based on the best selling novel by ...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "      <td>this story base on the best sell novel by terr...</td>\n",
       "      <td>[story, base, best, sell, novel, terry, mcmill...</td>\n",
       "      <td>[this, story, base, on, the, best, sell, novel...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>[Comedy]</td>\n",
       "      <td>In this sequel to \"Father of the Bride\", Georg...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "      <td>in this sequel to father of the bride george b...</td>\n",
       "      <td>[sequel, father, bride, george, bank, accept, ...</td>\n",
       "      <td>[in, this, sequel, to, father, of, the, bride,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949.0</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>[Action, Crime, Thriller]</td>\n",
       "      <td>Hunters and their prey--Neil and his professio...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "      <td>hunter and preyneil and professional criminal ...</td>\n",
       "      <td>[hunter, preyneil, professional, criminal, cre...</td>\n",
       "      <td>[hunter, and, preyneil, and, professional, cri...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6178</th>\n",
       "      <td>130856</td>\n",
       "      <td>494826</td>\n",
       "      <td>48376.0</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>[Comedy, Documentary]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "      <td>severe clear be a film base on the memoir of f...</td>\n",
       "      <td>[severe, clear, film, base, memoir, lieutenant...</td>\n",
       "      <td>[severe, clear, be, a, film, base, on, the, me...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6179</th>\n",
       "      <td>130958</td>\n",
       "      <td>143338</td>\n",
       "      <td>78402.0</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>[Horror]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "      <td>a group of environmentalist arrives at a faraw...</td>\n",
       "      <td>[group, environmentalist, arrives, faraway, tr...</td>\n",
       "      <td>[a, group, of, environmentalist, arrives, at, ...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6180</th>\n",
       "      <td>130984</td>\n",
       "      <td>208423</td>\n",
       "      <td>317168.0</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>[Action, Fantasy, Horror]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "      <td>also know as santo v the shewolves</td>\n",
       "      <td>[know, santo, v, shewolves]</td>\n",
       "      <td>[also, know, as, santo, v, the, shewolves]</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6181</th>\n",
       "      <td>131011</td>\n",
       "      <td>69109</td>\n",
       "      <td>79572.0</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>[Crime, Drama]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "      <td>bertone be a moderately honest homicide cop be...</td>\n",
       "      <td>[bertone, moderately, honest, homicide, cop, e...</td>\n",
       "      <td>[bertone, be, a, moderately, honest, homicide,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6182</th>\n",
       "      <td>131015</td>\n",
       "      <td>1430116</td>\n",
       "      <td>143928.0</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>[Horror, Thriller]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "      <td>a western businessman thai wife and son experi...</td>\n",
       "      <td>[western, businessman, thai, wife, son, experi...</td>\n",
       "      <td>[a, western, businessman, thai, wife, and, son...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6183 rows × 32 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId   imdbId    tmdbId                               title  \\\n",
       "0           1   114709     862.0                    Toy Story (1995)   \n",
       "1           2   113497    8844.0                      Jumanji (1995)   \n",
       "2           4   114885   31357.0            Waiting to Exhale (1995)   \n",
       "3           5   113041   11862.0  Father of the Bride Part II (1995)   \n",
       "4           6   113277     949.0                         Heat (1995)   \n",
       "...       ...      ...       ...                                 ...   \n",
       "6178   130856   494826   48376.0                 Severe Clear (2010)   \n",
       "6179   130958   143338   78402.0             Killer Crocodile (1989)   \n",
       "6180   130984   208423  317168.0          Santo vs. las lobas (1976)   \n",
       "6181   131011    69109   79572.0              Execution Squad (1972)   \n",
       "6182   131015  1430116  143928.0                     Hellgate (2011)   \n",
       "\n",
       "                                                 genres  \\\n",
       "0     [Adventure, Animation, Children, Comedy, Fantasy]   \n",
       "1                        [Adventure, Children, Fantasy]   \n",
       "2                              [Comedy, Drama, Romance]   \n",
       "3                                              [Comedy]   \n",
       "4                             [Action, Crime, Thriller]   \n",
       "...                                                 ...   \n",
       "6178                              [Comedy, Documentary]   \n",
       "6179                                           [Horror]   \n",
       "6180                          [Action, Fantasy, Horror]   \n",
       "6181                                     [Crime, Drama]   \n",
       "6182                                 [Horror, Thriller]   \n",
       "\n",
       "                                               imdb_syn  \\\n",
       "0     A little boy named Andy loves to be in his roo...   \n",
       "1     Jumanji, one of the most unique--and dangerous...   \n",
       "2     This story based on the best selling novel by ...   \n",
       "3     In this sequel to \"Father of the Bride\", Georg...   \n",
       "4     Hunters and their prey--Neil and his professio...   \n",
       "...                                                 ...   \n",
       "6178                                                NaN   \n",
       "6179                                                NaN   \n",
       "6180                                                NaN   \n",
       "6181                                                NaN   \n",
       "6182                                                NaN   \n",
       "\n",
       "                                               tmdb_syn  \\\n",
       "0     Led by Woody, Andy's toys live happily in his ...   \n",
       "1     When siblings Judy and Peter discover an encha...   \n",
       "2     Cheated on, mistreated and stepped on, the wom...   \n",
       "3     Just when George Banks has recovered from his ...   \n",
       "4     Obsessive master thief Neil McCauley leads a t...   \n",
       "...                                                 ...   \n",
       "6178  Severe Clear is a film based on the memoirs of...   \n",
       "6179  A group of environmentalists arrives at a fara...   \n",
       "6180             Also known as Santo vs. the She-Wolves   \n",
       "6181  Bertone is a moderately honest homicide cop. U...   \n",
       "6182  A western businessman, his Thai wife and son e...   \n",
       "\n",
       "                                         summarized_syn  \\\n",
       "0     a little boy name andy love to be in room play...   \n",
       "1     jumanji one of the most uniqueand dangerousboa...   \n",
       "2     this story base on the best sell novel by terr...   \n",
       "3     in this sequel to father of the bride george b...   \n",
       "4     hunter and preyneil and professional criminal ...   \n",
       "...                                                 ...   \n",
       "6178  severe clear be a film base on the memoir of f...   \n",
       "6179  a group of environmentalist arrives at a faraw...   \n",
       "6180                 also know as santo v the shewolves   \n",
       "6181  bertone be a moderately honest homicide cop be...   \n",
       "6182  a western businessman thai wife and son experi...   \n",
       "\n",
       "                                 summarized_syn_cleaned  \\\n",
       "0     [little, boy, andy, love, room, play, toy, esp...   \n",
       "1     [jumanji, uniqueand, dangerousboard, game, fal...   \n",
       "2     [story, base, best, sell, novel, terry, mcmill...   \n",
       "3     [sequel, father, bride, george, bank, accept, ...   \n",
       "4     [hunter, preyneil, professional, criminal, cre...   \n",
       "...                                                 ...   \n",
       "6178  [severe, clear, film, base, memoir, lieutenant...   \n",
       "6179  [group, environmentalist, arrives, faraway, tr...   \n",
       "6180                        [know, santo, v, shewolves]   \n",
       "6181  [bertone, moderately, honest, homicide, cop, e...   \n",
       "6182  [western, businessman, thai, wife, son, experi...   \n",
       "\n",
       "                                  summarized_syn_tokens  ...  IMAX  Musical  \\\n",
       "0     [a, little, boy, name, andy, love, to, be, in,...  ...     0        0   \n",
       "1     [jumanji, one, of, the, most, uniqueand, dange...  ...     0        0   \n",
       "2     [this, story, base, on, the, best, sell, novel...  ...     0        0   \n",
       "3     [in, this, sequel, to, father, of, the, bride,...  ...     0        0   \n",
       "4     [hunter, and, preyneil, and, professional, cri...  ...     0        0   \n",
       "...                                                 ...  ...   ...      ...   \n",
       "6178  [severe, clear, be, a, film, base, on, the, me...  ...     0        0   \n",
       "6179  [a, group, of, environmentalist, arrives, at, ...  ...     0        0   \n",
       "6180         [also, know, as, santo, v, the, shewolves]  ...     0        0   \n",
       "6181  [bertone, be, a, moderately, honest, homicide,...  ...     0        0   \n",
       "6182  [a, western, businessman, thai, wife, and, son...  ...     0        0   \n",
       "\n",
       "      Mystery  Romance  Sci-Fi  Thriller  War  Western  \\\n",
       "0           0        0       0         0    0        0   \n",
       "1           0        0       0         0    0        0   \n",
       "2           0        1       0         0    0        0   \n",
       "3           0        0       0         0    0        0   \n",
       "4           0        0       0         1    0        0   \n",
       "...       ...      ...     ...       ...  ...      ...   \n",
       "6178        0        0       0         0    0        0   \n",
       "6179        0        0       0         0    0        0   \n",
       "6180        0        0       0         0    0        0   \n",
       "6181        0        0       0         0    0        0   \n",
       "6182        0        0       0         1    0        0   \n",
       "\n",
       "                                 weighted_cbow_synopsis  \\\n",
       "0     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "1     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "2     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "3     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "4     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "...                                                 ...   \n",
       "6178  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "6179  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "6180  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "6181  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "6182  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "\n",
       "                                   final_context_vector  \n",
       "0     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "1     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "2     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "3     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "4     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "...                                                 ...  \n",
       "6178  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "6179  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "6180  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "6181  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "6182  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "\n",
       "[6183 rows x 32 columns]"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6183"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(binary_genres)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the movie context vectors to a JSON file so they can be reloaded later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Persist the movie dataframe, including the context vector columns, as JSON\n",
    "df_movies.to_json(\"../dataset/df_context_vec.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sparse Users Method \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in the recommendations from the CF model and isolate the users whose recommendation lists are empty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>tag</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>un-lemmatised</th>\n",
       "      <th>glove_vec</th>\n",
       "      <th>has_glove_vec</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>318</td>\n",
       "      <td>260</td>\n",
       "      <td>s</td>\n",
       "      <td>2015-02-20 22:42:49</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[ 0.18209    0.88297   -0.49805    0.53137   -...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>318</td>\n",
       "      <td>115149</td>\n",
       "      <td>action</td>\n",
       "      <td>2015-02-21 15:58:30</td>\n",
       "      <td>action</td>\n",
       "      <td>[ 2.0240e-02  8.4992e-01 -7.8150e-01 -8.2769e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>320</td>\n",
       "      <td>2762</td>\n",
       "      <td>twist</td>\n",
       "      <td>2006-04-25 11:33:52</td>\n",
       "      <td>twist</td>\n",
       "      <td>[-9.5859e-02 -1.7472e-01 -3.4692e-02 -3.7307e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>320</td>\n",
       "      <td>2959</td>\n",
       "      <td>twist</td>\n",
       "      <td>2006-04-25 11:30:58</td>\n",
       "      <td>twist</td>\n",
       "      <td>[-9.5859e-02 -1.7472e-01 -3.4692e-02 -3.7307e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>320</td>\n",
       "      <td>3996</td>\n",
       "      <td>overrate</td>\n",
       "      <td>2006-04-25 11:32:28</td>\n",
       "      <td>overrated</td>\n",
       "      <td>[ 2.8151e-01 -4.2171e-01 -3.8275e-01  1.5364e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50186</th>\n",
       "      <td>138280</td>\n",
       "      <td>116797</td>\n",
       "      <td>history</td>\n",
       "      <td>2015-01-30 23:07:25</td>\n",
       "      <td>history</td>\n",
       "      <td>[ 4.5847e-02  7.4334e-02  1.5092e-02 -2.6392e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50187</th>\n",
       "      <td>138280</td>\n",
       "      <td>116797</td>\n",
       "      <td>informatics</td>\n",
       "      <td>2015-01-30 23:07:35</td>\n",
       "      <td>informatics</td>\n",
       "      <td>[ 1.7728e-01  1.5395e-01  7.7811e-01  1.6527e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50188</th>\n",
       "      <td>138280</td>\n",
       "      <td>116797</td>\n",
       "      <td>mathematics</td>\n",
       "      <td>2015-01-30 23:07:17</td>\n",
       "      <td>mathematics</td>\n",
       "      <td>[ 1.0033e+00  3.8874e-01  6.4312e-01 -6.8630e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50189</th>\n",
       "      <td>138280</td>\n",
       "      <td>117871</td>\n",
       "      <td>image</td>\n",
       "      <td>2015-01-30 23:09:16</td>\n",
       "      <td>image</td>\n",
       "      <td>[ 1.1091e-02  4.8461e-01  1.9142e-02  8.3725e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50190</th>\n",
       "      <td>138280</td>\n",
       "      <td>117871</td>\n",
       "      <td>story</td>\n",
       "      <td>2015-01-30 23:09:25</td>\n",
       "      <td>story</td>\n",
       "      <td>[-0.35058    0.58245   -0.065584  -0.41768    ...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>50191 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       userId  movieId          tag            timestamp un-lemmatised  \\\n",
       "0         318      260            s  2015-02-20 22:42:49           NaN   \n",
       "1         318   115149       action  2015-02-21 15:58:30        action   \n",
       "2         320     2762        twist  2006-04-25 11:33:52         twist   \n",
       "3         320     2959        twist  2006-04-25 11:30:58         twist   \n",
       "4         320     3996     overrate  2006-04-25 11:32:28     overrated   \n",
       "...       ...      ...          ...                  ...           ...   \n",
       "50186  138280   116797      history  2015-01-30 23:07:25       history   \n",
       "50187  138280   116797  informatics  2015-01-30 23:07:35   informatics   \n",
       "50188  138280   116797  mathematics  2015-01-30 23:07:17   mathematics   \n",
       "50189  138280   117871        image  2015-01-30 23:09:16         image   \n",
       "50190  138280   117871        story  2015-01-30 23:09:25         story   \n",
       "\n",
       "                                               glove_vec  has_glove_vec  \n",
       "0      [ 0.18209    0.88297   -0.49805    0.53137   -...           True  \n",
       "1      [ 2.0240e-02  8.4992e-01 -7.8150e-01 -8.2769e-...           True  \n",
       "2      [-9.5859e-02 -1.7472e-01 -3.4692e-02 -3.7307e-...           True  \n",
       "3      [-9.5859e-02 -1.7472e-01 -3.4692e-02 -3.7307e-...           True  \n",
       "4      [ 2.8151e-01 -4.2171e-01 -3.8275e-01  1.5364e-...           True  \n",
       "...                                                  ...            ...  \n",
       "50186  [ 4.5847e-02  7.4334e-02  1.5092e-02 -2.6392e-...           True  \n",
       "50187  [ 1.7728e-01  1.5395e-01  7.7811e-01  1.6527e-...           True  \n",
       "50188  [ 1.0033e+00  3.8874e-01  6.4312e-01 -6.8630e-...           True  \n",
       "50189  [ 1.1091e-02  4.8461e-01  1.9142e-02  8.3725e-...           True  \n",
       "50190  [-0.35058    0.58245   -0.065584  -0.41768    ...           True  \n",
       "\n",
       "[50191 rows x 7 columns]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read in the preprocessed tags (userId, movieId, tag, timestamp, GloVe vector)\n",
    "tags = pd.read_csv(\"../dataset/tags_contentbased.csv\")\n",
    "# One-off cleanup, left commented out: drop stray index columns and re-save the file\n",
    "# tags.drop(columns=['Unnamed: 0','Unnamed: 0.1','Unnamed: 0.2'], inplace=True)\n",
    "# tags.to_csv(\"../dataset/tags_contentbased.csv\",index=False)\n",
    "tags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdb_syn</th>\n",
       "      <th>tmdb_syn</th>\n",
       "      <th>summarized_syn</th>\n",
       "      <th>summarized_syn_cleaned</th>\n",
       "      <th>summarized_syn_tokens</th>\n",
       "      <th>...</th>\n",
       "      <th>IMAX</th>\n",
       "      <th>Musical</th>\n",
       "      <th>Mystery</th>\n",
       "      <th>Romance</th>\n",
       "      <th>Sci-Fi</th>\n",
       "      <th>Thriller</th>\n",
       "      <th>War</th>\n",
       "      <th>Western</th>\n",
       "      <th>weighted_cbow_synopsis</th>\n",
       "      <th>final_context_vector</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>[Adventure, Animation, Children, Comedy, Fantasy]</td>\n",
       "      <td>A little boy named Andy loves to be in his roo...</td>\n",
       "      <td>Led by Woody, Andy's toys live happily in his ...</td>\n",
       "      <td>a little boy name andy love to be in room play...</td>\n",
       "      <td>[little, boy, andy, love, room, play, toy, esp...</td>\n",
       "      <td>[a, little, boy, name, andy, love, to, be, in,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>[Adventure, Children, Fantasy]</td>\n",
       "      <td>Jumanji, one of the most unique--and dangerous...</td>\n",
       "      <td>When siblings Judy and Peter discover an encha...</td>\n",
       "      <td>jumanji one of the most uniqueand dangerousboa...</td>\n",
       "      <td>[jumanji, uniqueand, dangerousboard, game, fal...</td>\n",
       "      <td>[jumanji, one, of, the, most, uniqueand, dange...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>[Comedy, Drama, Romance]</td>\n",
       "      <td>This story based on the best selling novel by ...</td>\n",
       "      <td>Cheated on, mistreated and stepped on, the wom...</td>\n",
       "      <td>this story base on the best sell novel by terr...</td>\n",
       "      <td>[story, base, best, sell, novel, terry, mcmill...</td>\n",
       "      <td>[this, story, base, on, the, best, sell, novel...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>[Comedy]</td>\n",
       "      <td>In this sequel to \"Father of the Bride\", Georg...</td>\n",
       "      <td>Just when George Banks has recovered from his ...</td>\n",
       "      <td>in this sequel to father of the bride george b...</td>\n",
       "      <td>[sequel, father, bride, george, bank, accept, ...</td>\n",
       "      <td>[in, this, sequel, to, father, of, the, bride,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>113277</td>\n",
       "      <td>949</td>\n",
       "      <td>Heat (1995)</td>\n",
       "      <td>[Action, Crime, Thriller]</td>\n",
       "      <td>Hunters and their prey--Neil and his professio...</td>\n",
       "      <td>Obsessive master thief Neil McCauley leads a t...</td>\n",
       "      <td>hunter and preyneil and professional criminal ...</td>\n",
       "      <td>[hunter, preyneil, professional, criminal, cre...</td>\n",
       "      <td>[hunter, and, preyneil, and, professional, cri...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6178</th>\n",
       "      <td>130856</td>\n",
       "      <td>494826</td>\n",
       "      <td>48376</td>\n",
       "      <td>Severe Clear (2010)</td>\n",
       "      <td>[Comedy, Documentary]</td>\n",
       "      <td>None</td>\n",
       "      <td>Severe Clear is a film based on the memoirs of...</td>\n",
       "      <td>severe clear be a film base on the memoir of f...</td>\n",
       "      <td>[severe, clear, film, base, memoir, lieutenant...</td>\n",
       "      <td>[severe, clear, be, a, film, base, on, the, me...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6179</th>\n",
       "      <td>130958</td>\n",
       "      <td>143338</td>\n",
       "      <td>78402</td>\n",
       "      <td>Killer Crocodile (1989)</td>\n",
       "      <td>[Horror]</td>\n",
       "      <td>None</td>\n",
       "      <td>A group of environmentalists arrives at a fara...</td>\n",
       "      <td>a group of environmentalist arrives at a faraw...</td>\n",
       "      <td>[group, environmentalist, arrives, faraway, tr...</td>\n",
       "      <td>[a, group, of, environmentalist, arrives, at, ...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6180</th>\n",
       "      <td>130984</td>\n",
       "      <td>208423</td>\n",
       "      <td>317168</td>\n",
       "      <td>Santo vs. las lobas (1976)</td>\n",
       "      <td>[Action, Fantasy, Horror]</td>\n",
       "      <td>None</td>\n",
       "      <td>Also known as Santo vs. the She-Wolves</td>\n",
       "      <td>also know as santo v the shewolves</td>\n",
       "      <td>[know, santo, v, shewolves]</td>\n",
       "      <td>[also, know, as, santo, v, the, shewolves]</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6181</th>\n",
       "      <td>131011</td>\n",
       "      <td>69109</td>\n",
       "      <td>79572</td>\n",
       "      <td>Execution Squad (1972)</td>\n",
       "      <td>[Crime, Drama]</td>\n",
       "      <td>None</td>\n",
       "      <td>Bertone is a moderately honest homicide cop. U...</td>\n",
       "      <td>bertone be a moderately honest homicide cop be...</td>\n",
       "      <td>[bertone, moderately, honest, homicide, cop, e...</td>\n",
       "      <td>[bertone, be, a, moderately, honest, homicide,...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6182</th>\n",
       "      <td>131015</td>\n",
       "      <td>1430116</td>\n",
       "      <td>143928</td>\n",
       "      <td>Hellgate (2011)</td>\n",
       "      <td>[Horror, Thriller]</td>\n",
       "      <td>None</td>\n",
       "      <td>A western businessman, his Thai wife and son e...</td>\n",
       "      <td>a western businessman thai wife and son experi...</td>\n",
       "      <td>[western, businessman, thai, wife, son, experi...</td>\n",
       "      <td>[a, western, businessman, thai, wife, and, son...</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "      <td>[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6183 rows × 32 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId   imdbId  tmdbId                               title  \\\n",
       "0           1   114709     862                    Toy Story (1995)   \n",
       "1           2   113497    8844                      Jumanji (1995)   \n",
       "2           4   114885   31357            Waiting to Exhale (1995)   \n",
       "3           5   113041   11862  Father of the Bride Part II (1995)   \n",
       "4           6   113277     949                         Heat (1995)   \n",
       "...       ...      ...     ...                                 ...   \n",
       "6178   130856   494826   48376                 Severe Clear (2010)   \n",
       "6179   130958   143338   78402             Killer Crocodile (1989)   \n",
       "6180   130984   208423  317168          Santo vs. las lobas (1976)   \n",
       "6181   131011    69109   79572              Execution Squad (1972)   \n",
       "6182   131015  1430116  143928                     Hellgate (2011)   \n",
       "\n",
       "                                                 genres  \\\n",
       "0     [Adventure, Animation, Children, Comedy, Fantasy]   \n",
       "1                        [Adventure, Children, Fantasy]   \n",
       "2                              [Comedy, Drama, Romance]   \n",
       "3                                              [Comedy]   \n",
       "4                             [Action, Crime, Thriller]   \n",
       "...                                                 ...   \n",
       "6178                              [Comedy, Documentary]   \n",
       "6179                                           [Horror]   \n",
       "6180                          [Action, Fantasy, Horror]   \n",
       "6181                                     [Crime, Drama]   \n",
       "6182                                 [Horror, Thriller]   \n",
       "\n",
       "                                               imdb_syn  \\\n",
       "0     A little boy named Andy loves to be in his roo...   \n",
       "1     Jumanji, one of the most unique--and dangerous...   \n",
       "2     This story based on the best selling novel by ...   \n",
       "3     In this sequel to \"Father of the Bride\", Georg...   \n",
       "4     Hunters and their prey--Neil and his professio...   \n",
       "...                                                 ...   \n",
       "6178                                               None   \n",
       "6179                                               None   \n",
       "6180                                               None   \n",
       "6181                                               None   \n",
       "6182                                               None   \n",
       "\n",
       "                                               tmdb_syn  \\\n",
       "0     Led by Woody, Andy's toys live happily in his ...   \n",
       "1     When siblings Judy and Peter discover an encha...   \n",
       "2     Cheated on, mistreated and stepped on, the wom...   \n",
       "3     Just when George Banks has recovered from his ...   \n",
       "4     Obsessive master thief Neil McCauley leads a t...   \n",
       "...                                                 ...   \n",
       "6178  Severe Clear is a film based on the memoirs of...   \n",
       "6179  A group of environmentalists arrives at a fara...   \n",
       "6180             Also known as Santo vs. the She-Wolves   \n",
       "6181  Bertone is a moderately honest homicide cop. U...   \n",
       "6182  A western businessman, his Thai wife and son e...   \n",
       "\n",
       "                                         summarized_syn  \\\n",
       "0     a little boy name andy love to be in room play...   \n",
       "1     jumanji one of the most uniqueand dangerousboa...   \n",
       "2     this story base on the best sell novel by terr...   \n",
       "3     in this sequel to father of the bride george b...   \n",
       "4     hunter and preyneil and professional criminal ...   \n",
       "...                                                 ...   \n",
       "6178  severe clear be a film base on the memoir of f...   \n",
       "6179  a group of environmentalist arrives at a faraw...   \n",
       "6180                 also know as santo v the shewolves   \n",
       "6181  bertone be a moderately honest homicide cop be...   \n",
       "6182  a western businessman thai wife and son experi...   \n",
       "\n",
       "                                 summarized_syn_cleaned  \\\n",
       "0     [little, boy, andy, love, room, play, toy, esp...   \n",
       "1     [jumanji, uniqueand, dangerousboard, game, fal...   \n",
       "2     [story, base, best, sell, novel, terry, mcmill...   \n",
       "3     [sequel, father, bride, george, bank, accept, ...   \n",
       "4     [hunter, preyneil, professional, criminal, cre...   \n",
       "...                                                 ...   \n",
       "6178  [severe, clear, film, base, memoir, lieutenant...   \n",
       "6179  [group, environmentalist, arrives, faraway, tr...   \n",
       "6180                        [know, santo, v, shewolves]   \n",
       "6181  [bertone, moderately, honest, homicide, cop, e...   \n",
       "6182  [western, businessman, thai, wife, son, experi...   \n",
       "\n",
       "                                  summarized_syn_tokens  ...  IMAX  Musical  \\\n",
       "0     [a, little, boy, name, andy, love, to, be, in,...  ...     0        0   \n",
       "1     [jumanji, one, of, the, most, uniqueand, dange...  ...     0        0   \n",
       "2     [this, story, base, on, the, best, sell, novel...  ...     0        0   \n",
       "3     [in, this, sequel, to, father, of, the, bride,...  ...     0        0   \n",
       "4     [hunter, and, preyneil, and, professional, cri...  ...     0        0   \n",
       "...                                                 ...  ...   ...      ...   \n",
       "6178  [severe, clear, be, a, film, base, on, the, me...  ...     0        0   \n",
       "6179  [a, group, of, environmentalist, arrives, at, ...  ...     0        0   \n",
       "6180         [also, know, as, santo, v, the, shewolves]  ...     0        0   \n",
       "6181  [bertone, be, a, moderately, honest, homicide,...  ...     0        0   \n",
       "6182  [a, western, businessman, thai, wife, and, son...  ...     0        0   \n",
       "\n",
       "      Mystery  Romance  Sci-Fi  Thriller  War  Western  \\\n",
       "0           0        0       0         0    0        0   \n",
       "1           0        0       0         0    0        0   \n",
       "2           0        1       0         0    0        0   \n",
       "3           0        0       0         0    0        0   \n",
       "4           0        0       0         1    0        0   \n",
       "...       ...      ...     ...       ...  ...      ...   \n",
       "6178        0        0       0         0    0        0   \n",
       "6179        0        0       0         0    0        0   \n",
       "6180        0        0       0         0    0        0   \n",
       "6181        0        0       0         0    0        0   \n",
       "6182        0        0       0         1    0        0   \n",
       "\n",
       "                                 weighted_cbow_synopsis  \\\n",
       "0     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "1     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "2     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "3     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "4     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "...                                                 ...   \n",
       "6178  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "6179  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "6180  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "6181  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "6182  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   \n",
       "\n",
       "                                   final_context_vector  \n",
       "0     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "1     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "2     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "3     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "4     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "...                                                 ...  \n",
       "6178  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "6179  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "6180  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "6181  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "6182  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  \n",
       "\n",
       "[6183 rows x 32 columns]"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reload the movie context vectors from the JSON file\n",
    "df_movies = pd.read_json(\"../dataset/df_context_vec.json\")\n",
    "df_movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>recommendations</th>\n",
       "      <th>tags_movies</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>160</th>\n",
       "      <td>31935</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416</th>\n",
       "      <td>77374</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>555</th>\n",
       "      <td>120748</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>586</th>\n",
       "      <td>29096</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>943</th>\n",
       "      <td>117259</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1014</th>\n",
       "      <td>98756</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1073</th>\n",
       "      <td>115495</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1095</th>\n",
       "      <td>11585</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>50687</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1404</th>\n",
       "      <td>54248</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      userId recommendations tags_movies\n",
       "160    31935              []          []\n",
       "416    77374              []          []\n",
       "555   120748              []          []\n",
       "586    29096              []          []\n",
       "943   117259              []          []\n",
       "1014   98756              []          []\n",
       "1073  115495              []          []\n",
       "1095   11585              []          []\n",
       "1105   50687              []          []\n",
       "1404   54248              []          []"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the per-user recommendations\n",
    "df_rec = pd.read_json('../dataset/df_rec.json', orient='split')\n",
    "\n",
    "# Sparse users: those whose recommendations list is empty\n",
    "df_rec_sparse = df_rec[df_rec['recommendations'].str.len() == 0]\n",
    "\n",
    "df_rec_sparse"
   ]
  },
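  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Testing: for each sparse user (no recommendations), find movies whose genre vectors are similar to the movies the user tagged, and collect the tags of those movies."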
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['IMAX',\n",
       " 'Comedy',\n",
       " '(no genres listed)',\n",
       " 'Crime',\n",
       " 'Animation',\n",
       " 'Drama',\n",
       " 'Horror',\n",
       " 'Mystery',\n",
       " 'Adventure',\n",
       " 'Action',\n",
       " 'Fantasy',\n",
       " 'Film-Noir',\n",
       " 'Documentary',\n",
       " 'War',\n",
       " 'Sci-Fi',\n",
       " 'Children',\n",
       " 'Western',\n",
       " 'Romance',\n",
       " 'Thriller',\n",
       " 'Musical']"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "# Extracting genre vectors\n",
    "# df_movies['genre_vector'] = df_movies[all_genres].values.tolist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     movieId                                                tag  userId\n",
      "0         41                         [shakespeare, shakespeare]   31935\n",
      "1         73                                   [jewish, lawyer]   31935\n",
      "2        527  [atmospheric, biography, classic, disturb, his...   31935\n",
      "3        760                                          [germany]   31935\n",
      "4       1090  [action, action, s, vietnam, action, drama, na...   31935\n",
      "..       ...                                                ...     ...\n",
      "795   126387                                          [latvian]   54248\n",
      "796   127453  [crime, history, investigation, moscow, poland...   54248\n",
      "797   128604                            [horrible, pretentious]   54248\n",
      "798   128948                                            [greek]   54248\n",
      "799   129303                                             [camp]   54248\n",
      "\n",
      "[5416 rows x 3 columns]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "# One-hot encode the genres for each movie (non-list values become empty lists)\n",
    "df_movies['genres'] = df_movies['genres'].apply(lambda x: x if isinstance(x, list) else [])\n",
    "all_genres = set([genre for sublist in df_movies['genres'].tolist() for genre in sublist])\n",
    "for genre in all_genres:\n",
    "    df_movies[genre] = df_movies['genres'].apply(lambda x: 1 if genre in x else 0)\n",
    "\n",
    "# Extracting genre vectors; fix the column order once so the vector and matrix agree\n",
    "genre_columns = list(all_genres)\n",
    "df_movies['genre_vector'] = df_movies[genre_columns].values.tolist()\n",
    "genre_matrix = df_movies[genre_columns].values\n",
    "\n",
    "def get_similar_movies_genre_vector(user_movie_vector, threshold=0.98):\n",
    "    # Cosine similarity between the given movie and every movie in df_movies\n",
    "    similarities = cosine_similarity([user_movie_vector], genre_matrix)[0]\n",
    "    # Select by boolean mask instead of writing a scratch column into df_movies\n",
    "    return df_movies.loc[similarities >= threshold, 'movieId'].tolist()\n",
    "\n",
    "def get_movie_tags(similar_movie_ids, user):\n",
    "    # `tags` is the tags DataFrame defined earlier in the notebook.\n",
    "    # Collect the tags (from all users) of the similar movies, one row per movie.\n",
    "    movie_tags_df = (\n",
    "        tags[tags['movieId'].isin(similar_movie_ids)]\n",
    "        .groupby('movieId')['tag']\n",
    "        .apply(list)\n",
    "        .reset_index()\n",
    "    )\n",
    "    movie_tags_df['userId'] = user\n",
    "    return movie_tags_df\n",
    "\n",
    "# Main function using genre vectors\n",
    "def context_aware_model_for_sparse_users_genre_vector():\n",
    "    sparse_users = df_rec_sparse['userId'].unique()\n",
    "    all_tags_for_similar_movies = []\n",
    "\n",
    "    for user in sparse_users:\n",
    "        # Getting movies tagged by the user\n",
    "        user_movies = tags[tags['userId'] == user]['movieId'].unique()\n",
    "\n",
    "        for movie in user_movies:\n",
    "            movie_rows = df_movies.loc[df_movies['movieId'] == movie, 'genre_vector']\n",
    "            if movie_rows.empty:\n",
    "                # Skip movies that have no entry in df_movies\n",
    "                continue\n",
    "            user_movie_vector = np.array(movie_rows.iloc[0])\n",
    "\n",
    "            similar_movie_ids = get_similar_movies_genre_vector(user_movie_vector)\n",
    "            all_tags_for_similar_movies.append(get_movie_tags(similar_movie_ids, user))\n",
    "\n",
    "    # pd.concat raises ValueError on an empty list, so return an empty frame instead\n",
    "    if not all_tags_for_similar_movies:\n",
    "        return pd.DataFrame(columns=['movieId', 'tag', 'userId'])\n",
    "    return pd.concat(all_tags_for_similar_movies)\n",
    "\n",
    "result_df = context_aware_model_for_sparse_users_genre_vector()\n",
    "print(result_df)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>tag</th>\n",
       "      <th>userId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>41</td>\n",
       "      <td>[shakespeare, shakespeare]</td>\n",
       "      <td>31935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>73</td>\n",
       "      <td>[jewish, lawyer]</td>\n",
       "      <td>31935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>527</td>\n",
       "      <td>[atmospheric, biography, classic, disturb, his...</td>\n",
       "      <td>31935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>760</td>\n",
       "      <td>[germany]</td>\n",
       "      <td>31935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1090</td>\n",
       "      <td>[action, action, s, vietnam, action, drama, na...</td>\n",
       "      <td>31935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>795</th>\n",
       "      <td>126387</td>\n",
       "      <td>[latvian]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>796</th>\n",
       "      <td>127453</td>\n",
       "      <td>[crime, history, investigation, moscow, poland...</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>797</th>\n",
       "      <td>128604</td>\n",
       "      <td>[horrible, pretentious]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>798</th>\n",
       "      <td>128948</td>\n",
       "      <td>[greek]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>799</th>\n",
       "      <td>129303</td>\n",
       "      <td>[camp]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5416 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     movieId                                                tag  userId\n",
       "0         41                         [shakespeare, shakespeare]   31935\n",
       "1         73                                   [jewish, lawyer]   31935\n",
       "2        527  [atmospheric, biography, classic, disturb, his...   31935\n",
       "3        760                                          [germany]   31935\n",
       "4       1090  [action, action, s, vietnam, action, drama, na...   31935\n",
       "..       ...                                                ...     ...\n",
       "795   126387                                          [latvian]   54248\n",
       "796   127453  [crime, history, investigation, moscow, poland...   54248\n",
       "797   128604                            [horrible, pretentious]   54248\n",
       "798   128948                                            [greek]   54248\n",
       "799   129303                                             [camp]   54248\n",
       "\n",
       "[5416 rows x 3 columns]"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the tags of similar movies for the sparse users\n",
    "result_df.to_csv(\"../dataset/df_sparse_similar_movies.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>tag</th>\n",
       "      <th>userId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>14</td>\n",
       "      <td>[president, drama, nan, politics, president]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>26</td>\n",
       "      <td>[shakespeare, shakespeare, shakespeare, shakes...</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>31</td>\n",
       "      <td>[inspirational, inspirational, nan, education,...</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>43</td>\n",
       "      <td>[historical, england]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>55</td>\n",
       "      <td>[musician]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>795</th>\n",
       "      <td>126387</td>\n",
       "      <td>[latvian]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>796</th>\n",
       "      <td>127453</td>\n",
       "      <td>[crime, history, investigation, moscow, poland...</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>797</th>\n",
       "      <td>128604</td>\n",
       "      <td>[horrible, pretentious]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>798</th>\n",
       "      <td>128948</td>\n",
       "      <td>[greek]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>799</th>\n",
       "      <td>129303</td>\n",
       "      <td>[camp]</td>\n",
       "      <td>54248</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>800 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     movieId                                                tag  userId\n",
       "0         14       [president, drama, nan, politics, president]   54248\n",
       "1         26  [shakespeare, shakespeare, shakespeare, shakes...   54248\n",
       "2         31  [inspirational, inspirational, nan, education,...   54248\n",
       "3         43                              [historical, england]   54248\n",
       "4         55                                         [musician]   54248\n",
       "..       ...                                                ...     ...\n",
       "795   126387                                          [latvian]   54248\n",
       "796   127453  [crime, history, investigation, moscow, poland...   54248\n",
       "797   128604                            [horrible, pretentious]   54248\n",
       "798   128948                                            [greek]   54248\n",
       "799   129303                                             [camp]   54248\n",
       "\n",
       "[800 rows x 3 columns]"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_df[result_df['userId'] == 54248]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>tag</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>un-lemmatised</th>\n",
       "      <th>glove_vec</th>\n",
       "      <th>has_glove_vec</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>22702</th>\n",
       "      <td>54248</td>\n",
       "      <td>119430</td>\n",
       "      <td>casino</td>\n",
       "      <td>2014-12-24 23:16:11</td>\n",
       "      <td>casino</td>\n",
       "      <td>[ 1.5207e-02  1.9634e-01 -6.1053e-01  2.6841e-...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22703</th>\n",
       "      <td>54248</td>\n",
       "      <td>119430</td>\n",
       "      <td>crap</td>\n",
       "      <td>2014-12-24 23:16:18</td>\n",
       "      <td>craps</td>\n",
       "      <td>[ 0.51509    0.036047   0.14851   -0.081221  -...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22704</th>\n",
       "      <td>54248</td>\n",
       "      <td>119430</td>\n",
       "      <td>gamble</td>\n",
       "      <td>2014-12-24 23:16:39</td>\n",
       "      <td>gamble</td>\n",
       "      <td>[-0.43183    0.22825   -0.46183    0.30202    ...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22705</th>\n",
       "      <td>54248</td>\n",
       "      <td>119430</td>\n",
       "      <td>gamble</td>\n",
       "      <td>2014-12-24 23:16:07</td>\n",
       "      <td>gambling</td>\n",
       "      <td>[-0.43183    0.22825   -0.46183    0.30202    ...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22706</th>\n",
       "      <td>54248</td>\n",
       "      <td>119430</td>\n",
       "      <td>hustle</td>\n",
       "      <td>2014-12-24 23:16:33</td>\n",
       "      <td>hustle</td>\n",
       "      <td>[ 0.45497   -0.11613   -0.62842    0.25815    ...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22707</th>\n",
       "      <td>54248</td>\n",
       "      <td>119430</td>\n",
       "      <td>poker</td>\n",
       "      <td>2014-12-24 23:16:21</td>\n",
       "      <td>poker</td>\n",
       "      <td>[ 0.42254    0.24923   -0.15362   -0.73099    ...</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       userId  movieId     tag            timestamp un-lemmatised  \\\n",
       "22702   54248   119430  casino  2014-12-24 23:16:11        casino   \n",
       "22703   54248   119430    crap  2014-12-24 23:16:18         craps   \n",
       "22704   54248   119430  gamble  2014-12-24 23:16:39        gamble   \n",
       "22705   54248   119430  gamble  2014-12-24 23:16:07      gambling   \n",
       "22706   54248   119430  hustle  2014-12-24 23:16:33        hustle   \n",
       "22707   54248   119430   poker  2014-12-24 23:16:21         poker   \n",
       "\n",
       "                                               glove_vec  has_glove_vec  \n",
       "22702  [ 1.5207e-02  1.9634e-01 -6.1053e-01  2.6841e-...           True  \n",
       "22703  [ 0.51509    0.036047   0.14851   -0.081221  -...           True  \n",
       "22704  [-0.43183    0.22825   -0.46183    0.30202    ...           True  \n",
       "22705  [-0.43183    0.22825   -0.46183    0.30202    ...           True  \n",
       "22706  [ 0.45497   -0.11613   -0.62842    0.25815    ...           True  \n",
       "22707  [ 0.42254    0.24923   -0.15362   -0.73099    ...           True  "
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tags[tags['userId'] == 54248]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
