{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------------- \n",
    "## 42578 - Advanced Business Analytics - Quiz 1\n",
    "### March 2021\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You just got a nicely cleaned dataset of hotel reviews from your colleagues, and now need to develop a service where people can upload their latest hotel review and your algorithm returns the most similar review from the entire dataset.\n",
    "\n",
    "The idea is that you will apply topic modeling to the entire dataset, so that each review is represented by a set of 5 topic coefficients (i.e. the proportions of each topic in each review). \n",
    "\n",
    "Then, when you get a query, you also obtain its topic proportions and compute its cosine similarity against every review in the dataset. You then retrieve the review with the highest similarity."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We make the essential imports for you (you are free to add whatever else you need!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "import numpy as np\n",
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "from sklearn.decomposition import LatentDirichletAllocation\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from stopwords import get_stopwords\n",
    "import string\n",
    "from nltk.stem import WordNetLemmatizer\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You need to load the reviews file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reviews_df=pd.read_csv(\"reviews_clean.csv\", encoding='latin1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "\n",
    "stop_words=get_stopwords('en')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 1. We also created a stopwords list (in the file \"my_stopwords.txt\"). You should add these words to the default ones you just loaded with \"get_stopwords\" in the cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "f_stop=open(\"my_stopwords.txt\")"
   ]
  },
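  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# append the custom stopwords (one per line), then release the file handle\n",
    "stop_words+=[stopword.strip() for stopword in f_stop.readlines()]\n",
    "f_stop.close()"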
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 2. Before you apply LDA, you need to clean up the individual reviews. You should apply the following sequence: \n",
    "- remove punctuation\n",
    "- apply lowercase\n",
    "- remove stopwords\n",
    "- apply stemming/lemmatization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def text_processing(text):\n",
    "    # remove punctuation\n",
    "    text = \"\".join([c for c in text\n",
    "                    if c not in string.punctuation])\n",
    "    # lowercase\n",
    "    text = text.lower()\n",
    "    # remove stopwords\n",
    "    text = \" \".join([w for w in text.split()\n",
    "                     if w not in stop_words])\n",
    "    # stemming / lemmatizing (optional)\n",
    "    text = \" \".join([lemmatizer.lemmatize(w) for w in text.split()])\n",
    "    return text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_msgs=[text_processing(text) for text in reviews_df['Review']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 3 - Create your Latent Dirichlet Allocation (LDA) model with 5 topics. Please print the top 10 words for each of those 5 topics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_topics = 5\n",
    "lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,\n",
    "                                learning_method='online',\n",
    "                                learning_offset=50.,\n",
    "                                random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# bag-of-words counts: one row per cleaned review, one column per vocabulary word\n",
    "count_vect = CountVectorizer()\n",
    "bow_counts = count_vect.fit_transform(clean_msgs)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_lda = lda.fit_transform(bow_counts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_top_words(model, feature_names, n_top_words):\n",
    "    norm = model.components_.sum(axis=1)[:, np.newaxis]\n",
    "    for topic_idx, topic in enumerate(model.components_):\n",
    "        print(80 * \"-\")\n",
    "        print(\"Topic {}\".format(topic_idx))\n",
    "        for i in topic.argsort()[:-n_top_words - 1:-1]:\n",
    "            print(\"{:.3f}\".format(topic[i] / norm[topic_idx][0]) \n",
    "                  + '\\t' + feature_names[i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Topics in LDA model:\n",
      "--------------------------------------------------------------------------------\n",
      "Topic 0\n",
      "0.002\tbuilding\n",
      "0.002\tfriday\n",
      "0.002\tregulatory\n",
      "0.001\ttgi\n",
      "0.001\tshangrila\n",
      "0.001\ttranquil\n",
      "0.001\tred\n",
      "0.001\tfriendship\n",
      "0.001\tbeautiful\n",
      "0.001\tguo\n",
      "--------------------------------------------------------------------------------\n",
      "Topic 1\n",
      "0.021\troom\n",
      "0.009\tgood\n",
      "0.008\tstaff\n",
      "0.008\tgreat\n",
      "0.007\tbreakfast\n",
      "0.006\tstay\n",
      "0.006\tservice\n",
      "0.006\trestaurant\n",
      "0.006\tnight\n",
      "0.006\tlocation\n",
      "--------------------------------------------------------------------------------\n",
      "Topic 2\n",
      "0.001\tloong\n",
      "0.001\tsloppy\n",
      "0.001\tforgetful\n",
      "0.001\tuncompleted\n",
      "0.001\tapplicant\n",
      "0.001\tsorbet\n",
      "0.001\tassignment\n",
      "0.001\taimlessly\n",
      "0.001\tasyet\n",
      "0.001\treceptionpros\n",
      "--------------------------------------------------------------------------------\n",
      "Topic 3\n",
      "0.001\ttheft\n",
      "0.001\tdeg\n",
      "0.001\tquotfree\n",
      "0.001\tcherry\n",
      "0.001\tankle\n",
      "0.001\tappartment\n",
      "0.001\tswiming\n",
      "0.001\tchop\n",
      "0.001\tcdn\n",
      "0.001\tgoldfish\n",
      "--------------------------------------------------------------------------------\n",
      "Topic 4\n",
      "0.001\thair\n",
      "0.001\tstylusthingie\n",
      "0.001\tgray\n",
      "0.001\tstarwoods\n",
      "0.001\tvodka\n",
      "0.001\twoken\n",
      "0.001\tshoulder\n",
      "0.001\tcompete\n",
      "0.001\tskin\n",
      "0.001\tjane\n"
     ]
    }
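   ],
   "source": [
    "print(\"\\nTopics in LDA model:\")\n",
    "# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() when it exists\n",
    "if hasattr(count_vect, \"get_feature_names_out\"):\n",
    "    counts_feature_names = count_vect.get_feature_names_out()\n",
    "else:\n",
    "    counts_feature_names = count_vect.get_feature_names()\n",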
    "n_top_words = 10\n",
    "print_top_words(lda, counts_feature_names, n_top_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 4 - Extend your reviews_df dataframe so that each review has 5 extra columns holding the topic proportions from your model (the column names should be LDA_topic1, LDA_topic2, LDA_topic3, LDA_topic4, LDA_topic5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "topics=lda.transform(bow_counts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "topic_features=[]\n",
    "for i in range(n_topics):\n",
    "    topic_feature='LDA_topic'+str(i+1)\n",
    "    topic_features.append(topic_feature)\n",
    "    reviews_df[topic_feature]=topics[:,i]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>Header</th>\n",
       "      <th>Review</th>\n",
       "      <th>LDA_topic1</th>\n",
       "      <th>LDA_topic2</th>\n",
       "      <th>LDA_topic3</th>\n",
       "      <th>LDA_topic4</th>\n",
       "      <th>LDA_topic5</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Oct 12 2009</td>\n",
       "      <td>Nice trendy hotel location not too bad.</td>\n",
       "      <td>I stayed in this hotel for one night. As this ...</td>\n",
       "      <td>0.106707</td>\n",
       "      <td>0.888603</td>\n",
       "      <td>0.001563</td>\n",
       "      <td>0.001563</td>\n",
       "      <td>0.001563</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Sep 25 2009</td>\n",
       "      <td>Great Budget Hotel!</td>\n",
       "      <td>Stayed two nights at Aloft on the most recent ...</td>\n",
       "      <td>0.001845</td>\n",
       "      <td>0.992647</td>\n",
       "      <td>0.001836</td>\n",
       "      <td>0.001836</td>\n",
       "      <td>0.001836</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Aug 4 2009</td>\n",
       "      <td>Excellent value - location not a big problem.</td>\n",
       "      <td>We stayed at the Aloft Beijing Haidian for 5 n...</td>\n",
       "      <td>0.001451</td>\n",
       "      <td>0.968000</td>\n",
       "      <td>0.001450</td>\n",
       "      <td>0.027649</td>\n",
       "      <td>0.001450</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Jul 17 2009</td>\n",
       "      <td>Stylish clean reasonable value poor location</td>\n",
       "      <td>I am glad to be the first person to post photo...</td>\n",
       "      <td>0.052681</td>\n",
       "      <td>0.943824</td>\n",
       "      <td>0.001167</td>\n",
       "      <td>0.001164</td>\n",
       "      <td>0.001164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>May 30 2009</td>\n",
       "      <td>Remote but excellent value for money</td>\n",
       "      <td>Stayed there for one night. The hotel is locat...</td>\n",
       "      <td>0.004352</td>\n",
       "      <td>0.917213</td>\n",
       "      <td>0.069738</td>\n",
       "      <td>0.004349</td>\n",
       "      <td>0.004349</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4250</th>\n",
       "      <td>Jun 9 2007</td>\n",
       "      <td>Comfortable hotel but couldn't care less staff...</td>\n",
       "      <td>I stayed 3 days in this hotel before heading h...</td>\n",
       "      <td>0.052811</td>\n",
       "      <td>0.942228</td>\n",
       "      <td>0.001653</td>\n",
       "      <td>0.001655</td>\n",
       "      <td>0.001653</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4251</th>\n",
       "      <td>May 26 2007</td>\n",
       "      <td>Wonderful place to stay</td>\n",
       "      <td>I am currently here at the Westin, Beijing. Th...</td>\n",
       "      <td>0.001757</td>\n",
       "      <td>0.992979</td>\n",
       "      <td>0.001755</td>\n",
       "      <td>0.001755</td>\n",
       "      <td>0.001755</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4252</th>\n",
       "      <td>May 17 2007</td>\n",
       "      <td>Good Hardware; Poor \"Software\"</td>\n",
       "      <td>If you're here with a tour group, as we were, ...</td>\n",
       "      <td>0.001223</td>\n",
       "      <td>0.995116</td>\n",
       "      <td>0.001222</td>\n",
       "      <td>0.001220</td>\n",
       "      <td>0.001220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4253</th>\n",
       "      <td>Apr 29 2007</td>\n",
       "      <td>Shiny and new but with problems</td>\n",
       "      <td>I wrapped up a recent trip to China with a thr...</td>\n",
       "      <td>0.000850</td>\n",
       "      <td>0.859235</td>\n",
       "      <td>0.138220</td>\n",
       "      <td>0.000848</td>\n",
       "      <td>0.000848</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4254</th>\n",
       "      <td>Apr 28 2007</td>\n",
       "      <td>Wow...Very impressed!</td>\n",
       "      <td>This was my first stay in Beijing &amp;amp; I was ...</td>\n",
       "      <td>0.099659</td>\n",
       "      <td>0.893674</td>\n",
       "      <td>0.002223</td>\n",
       "      <td>0.002223</td>\n",
       "      <td>0.002223</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>4255 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "              Date                                             Header  \\\n",
       "0     Oct 12 2009             Nice trendy hotel location not too bad.   \n",
       "1     Sep 25 2009                                 Great Budget Hotel!   \n",
       "2      Aug 4 2009       Excellent value - location not a big problem.   \n",
       "3     Jul 17 2009        Stylish clean reasonable value poor location   \n",
       "4     May 30 2009                Remote but excellent value for money   \n",
       "...            ...                                                ...   \n",
       "4250    Jun 9 2007  Comfortable hotel but couldn't care less staff...   \n",
       "4251  May 26 2007                             Wonderful place to stay   \n",
       "4252   May 17 2007                     Good Hardware; Poor \"Software\"   \n",
       "4253  Apr 29 2007                     Shiny and new but with problems   \n",
       "4254  Apr 28 2007                               Wow...Very impressed!   \n",
       "\n",
       "                                                 Review  LDA_topic1  \\\n",
       "0     I stayed in this hotel for one night. As this ...    0.106707   \n",
       "1     Stayed two nights at Aloft on the most recent ...    0.001845   \n",
       "2     We stayed at the Aloft Beijing Haidian for 5 n...    0.001451   \n",
       "3     I am glad to be the first person to post photo...    0.052681   \n",
       "4     Stayed there for one night. The hotel is locat...    0.004352   \n",
       "...                                                 ...         ...   \n",
       "4250  I stayed 3 days in this hotel before heading h...    0.052811   \n",
       "4251  I am currently here at the Westin, Beijing. Th...    0.001757   \n",
       "4252  If you're here with a tour group, as we were, ...    0.001223   \n",
       "4253  I wrapped up a recent trip to China with a thr...    0.000850   \n",
       "4254  This was my first stay in Beijing &amp; I was ...    0.099659   \n",
       "\n",
       "      LDA_topic2  LDA_topic3  LDA_topic4  LDA_topic5  \n",
       "0       0.888603    0.001563    0.001563    0.001563  \n",
       "1       0.992647    0.001836    0.001836    0.001836  \n",
       "2       0.968000    0.001450    0.027649    0.001450  \n",
       "3       0.943824    0.001167    0.001164    0.001164  \n",
       "4       0.917213    0.069738    0.004349    0.004349  \n",
       "...          ...         ...         ...         ...  \n",
       "4250    0.942228    0.001653    0.001655    0.001653  \n",
       "4251    0.992979    0.001755    0.001755    0.001755  \n",
       "4252    0.995116    0.001222    0.001220    0.001220  \n",
       "4253    0.859235    0.138220    0.000848    0.000848  \n",
       "4254    0.893674    0.002223    0.002223    0.002223  \n",
       "\n",
       "[4255 rows x 8 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reviews_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Question 5 - It is time to make your query system. Write a Python function that receives a query and returns the review whose topic proportions have the highest cosine similarity to those of the query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "query=\"Good location and staff members were nice. However, the room was very dirty! Cleaning people very sloppy!\""
   ]
  },
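  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_review_by_query(query, lda_model, reviews_df):\n",
    "    # clean the query with the same pipeline used for the reviews\n",
    "    clean_target = text_processing(query)\n",
    "    # bag-of-words counts over the vocabulary fitted on the reviews\n",
    "    bow_count_target = count_vect.transform([clean_target])\n",
    "    # topic proportions of the query (use the model passed in, not the global one)\n",
    "    query_lda = lda_model.transform(bow_count_target)\n",
    "    # similarities has shape (1, n_reviews); argmax picks the closest review\n",
    "    similarities = cosine_similarity(query_lda, reviews_df[topic_features])\n",
    "    query_result = reviews_df.iloc[np.argmax(similarities)]\n",
    "    return query_result"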
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_result=get_review_by_query(query,lda, reviews_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Header: Not so clean\n",
      "Review: Stayed in the hotel for 2 nights. Didn't enjoy my stay here as the room reeked of cigarettes even though we had asked for a non-smoking room. The smell is extremely strong after the bathroom ventilator is turned on. No offence to smokers pls!We are not early wakers, as such, we couldn't make it for the breakfast as it ended at 9 in the morning. The corridor is noisy, especially in the morning.The carpet is very stained and the bathroom floor tiles were coming off. The towel looks unclean too. There were fruits in the room when we checked in, but a staff had come to collect it back and we had no idea why!Overall, it had been an unpleasant stay for us.\n"
     ]
    }
   ],
   "source": [
    "print(\"Header:\", query_result['Header'])\n",
    "print(\"Review:\", query_result['Review'])\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
