{ "cells": [ { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import necessary packages\n", "import pandas as pd\n", "import re\n", "from nltk.corpus import stopwords\n", "from nltk.stem.porter import PorterStemmer\n", "from gensim import corpora\n", "from gensim import models\n", "import nltk\n", "from sklearn.externals import joblib\n", "from nltk.tag import pos_tag\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "import shelve\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Topic Analysis with Latent Dirichlet Allocation (LDA) " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Read in the data" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# read the data \n", "review = pd.read_csv(\"data/reviews.csv\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Text Manipulation" ] },
{ "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wished it was better..\n", "After watching man vs. food I decided to stop by, décor was not that homey and welcoming, and the neighborhood was bad, but nothing I haven't been around before. The ribs were very fatty and grisly, it was disappointing and I didn't get enough sauce and when I asked for a little more they wanted to charge me, the coleslaw was awesome! I noticed a hair in my food and it turned me off to the rest of it, so i threw it away , I wont be returning...\n", "sorry guys\n" ] } ], "source": [ "# we focus on the text column (the reviews provided by customers)\n", "raw_all = review.text.tolist()\n", "first_review_train = raw_all[0]\n", "print(first_review_train)" ] },
{ "cell_type": "code", "execution_count": 120, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with shelve.open('result/first_review_train') as result:\n", "    result['first_review_train'] = first_review_train" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As the example above shows, we extract the text column, which contains the raw reviews written by customers." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To tokenize the reviews, we first lowercase all words in each review for consistency in the topic analysis. Then, we apply some string manipulation to keep only meaningful words and numbers. Next, we remove all stop words from each review: common function words, such as \"for\" and \"or\", that carry no meaning for the topic model. Finally, we keep only the nouns for further analysis." ] },
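{ "cell_type": "markdown", "metadata": {}, "source": [ "Before defining the full tokenizer, here is a minimal sketch of the same pipeline on a single made-up sentence, so each step is visible in isolation (it assumes the required NLTK data has been downloaded, e.g. via **nltk.download('stopwords')** and **nltk.download('averaged_perceptron_tagger')**):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# illustrative sentence (not taken from the data set)\n", "sample = \"The ribs were fatty but the coleslaw was awesome\"\n", "\n", "stop = set(stopwords.words('english'))\n", "text = re.sub(\"[^a-z0-9]\", \" \", sample.lower())         # lowercase; keep letters/digits\n", "words = [w for w in text.split() if w not in stop]       # drop stop words\n", "nouns = [w for w, pos in pos_tag(words) if pos == 'NN']  # keep singular nouns only\n", "print(nouns)  # exact output depends on the tagger; plural nouns (NNS) are dropped\n" ] },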
{ "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tokenize_text(raw):\n", "    \"\"\"\n", "    function to get the tokenized list of reviews from the raw reviews\n", "\n", "    Input\n", "    ----------\n", "    raw: a list of review texts\n", "\n", "    Output\n", "    ------\n", "    text_array: a list of tokenized review word lists\n", "\n", "    \"\"\"\n", "    # define the stopwords\n", "    stop = set(stopwords.words('english'))\n", "\n", "    text_array = []\n", "    for i in range(len(raw)):\n", "        # for each review, change the words to lowercase\n", "        text = raw[i].lower()\n", "\n", "        # get rid of \\r and \\n in each string\n", "        text = text.replace('\\r\\n', ' ')\n", "\n", "        # get rid of all characters that are not letters or digits\n", "        text = re.sub(\"[^a-z0-9]\", \" \", text)\n", "\n", "        # tokenization segments a document into its atomic elements\n", "        words = text.split()\n", "\n", "        # delete stop words\n", "        words = [j for j in words if j not in stop]\n", "\n", "        # only keep the words that are nouns, since we only need to find the subtopics\n", "        tagged_sent = pos_tag(words)\n", "        words = [word for word, pos in tagged_sent if pos == 'NN']\n", "\n", "        text_array.append(words)\n", "\n", "    return text_array\n" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "text_array = tokenize_text(raw_all)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The **text_array** contains all the tokenized reviews in our data set. Then, we use the Dictionary() function to traverse the texts, assigning a unique integer id to each unique token while also collecting word counts and related statistics. To see each token’s unique integer id, try **print(dictionary.token2id)**" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# save the dictionary results\n", "dictionary = corpora.Dictionary(text_array)\n", "dictionary.save('result/dictionary.dic')\n", "\n", "\n", "# save the text_array\n", "with shelve.open('result/text_array') as result:\n", "    result['text_array'] = text_array" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we use the doc2bow() function to convert each tokenized document into a bag-of-words vector based on the dictionary. The result, corpus, is a list of vectors, one per document. Each document vector is a series of tuples." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "corpus = [dictionary.doc2bow(text) for text in text_array]" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n" ] } ], "source": [ "print(dictionary.token2id['man'])" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]\n" ] } ], "source": [ "print(corpus[0])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The above list of tuples represents our first document (i.e. the first review after tokenizing). The tuples are (term id, term frequency) pairs. Since **dictionary.token2id['man']** shows that \"man\"’s integer id is 0, the first tuple indicates that \"man\" appears once in the first tokenized review document. **doc2bow()** only includes terms that actually occur: terms that do not occur in a document will not appear in that document’s vector." ] },
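{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, a bag-of-words vector can be decoded back into readable (word, count) pairs by indexing the dictionary with each term id:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# decode the first document's vector back to (word, count) pairs\n", "readable = [(dictionary[term_id], count) for term_id, count in corpus[0]]\n", "print(readable)\n" ] },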
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Training Set and Validation Set" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Then, to split the data set into a training set and a validation set, let's find the restaurant with the largest number of reviews." ] },
{ "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "business = pd.read_csv(\"data/restaurant.csv\")" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "896\n" ] } ], "source": [ "# choose the business with the max review_count\n", "example_business = business.business_id[business.review_count == max(business.review_count)]\n", "print(max(business.review_count))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We use the 896 reviews from the restaurant with the largest number of reviews as the validation set. The rest of the reviews form the training set." ] },
{ "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# validation set \n", "bool_list_vali = review['business_id'].isin(example_business)\n", "review_example = review[bool_list_vali]\n", "vali_index = review_example.index.tolist()\n", "vali_corpus = [corpus[i] for i in vali_index]\n", "\n", "# training set\n", "bool_list_train = [not i for i in bool_list_vali]\n", "review_train = review[bool_list_train]\n", "train_index = review_train.index.tolist()\n", "train_corpus = [corpus[i] for i in train_index]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Latent Dirichlet Allocation Model" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To discover latent topics in each review, we use Latent Dirichlet Allocation (LDA), a topic model that generates topics based on word frequencies in a set of documents. LDA assumes that (1) each document contains multiple latent topics; (2) each document is produced by a generative process defined by a probabilistic model; and (3) each topic is characterized by a distribution over a fixed vocabulary. More specifically, the joint distribution of the hidden topics and the observed variables (words) is:" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "$$p(\phi_{1:K}, \theta_{1:D}, Z_{1:D}, W_{1:D}) = \prod_{i=1}^K p(\phi_{i}) \prod_{d=1}^D \Big( p(\theta_{d}) \prod_{n=1}^N p(Z_{d,n}|\theta_{d})\, p(W_{d,n}|\phi_{1:K}, Z_{d,n}) \Big)$$ " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "where\n", "$$\phi_{1:K}: \text{the topics; each } \phi_k \text{ is a distribution over the vocabulary; } \phi_{k}\sim Dirichlet_V(\beta)$$ \n", "\n", "$$\theta_{1:D}: \text{the topic proportions for documents } 1:D;\ \theta_d \sim Dirichlet_K(\alpha)$$\n", "\n", "$$Z_{1:D}: \text{the topic assignments for documents } 1:D;\ Z_{d,n} \sim Multinomial_K(\theta_d)$$\n", "\n", "$$W_{1:D}: \text{the observed words for documents } 1:D;\ W_{d,n} \sim Multinomial_V(\phi_{Z_{d,n}})$$" ] },
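{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the generative process concrete, the following toy sketch samples a single document from the model with numpy. The topic count, vocabulary size, document length, and hyperparameters here are arbitrary illustrative choices, not values estimated from our data:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rng = np.random.RandomState(0)\n", "K, V, N = 3, 8, 10                # topics, vocabulary size, words in the document\n", "alpha, beta = 0.5, 0.1            # symmetric Dirichlet hyperparameters\n", "\n", "phi = rng.dirichlet(beta * np.ones(V), size=K)  # phi_k ~ Dirichlet_V(beta)\n", "theta = rng.dirichlet(alpha * np.ones(K))       # theta_d ~ Dirichlet_K(alpha)\n", "z = rng.choice(K, size=N, p=theta)              # Z_{d,n} ~ Multinomial_K(theta_d)\n", "w = [rng.choice(V, p=phi[k]) for k in z]        # W_{d,n} ~ Multinomial_V(phi_z)\n", "print(list(zip(z, w)))                          # (topic, word id) pairs\n" ] },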
{ "cell_type": "markdown", "metadata": {}, "source": [ "LDA learns these distributions (e.g. the distribution over topics, the word probabilities associated with each topic, the topic of each word, and the particular topic mixture of each document) using Bayesian inference. After repeating the updating process many times, the model reaches a steady state and can be used to estimate the hidden topics, the topic mixture of each document, and the words associated with each topic." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We use the **LdaModel** class in the gensim package to apply the LDA model to our training set. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The **LdaModel** class is described in detail in the gensim documentation. Parameters used in our example are:\n", "\n", "**num_topics**: required. An LDA model requires the user to decide how many topics should be generated. We tried 6, 8, 10, 12, and 15 topics, and num_topics=10 appears to work best. Thus, we fit the model with num_topics=10 only, and save the model for further analysis. \n", "\n", "**id2word**: required. The LdaModel class requires our previous dictionary to map ids to strings.\n", "\n", "**passes**: optional. The number of passes the model takes through the corpus. More passes generally make the model more accurate, but many passes can be slow on a very large corpus.\n", "\n", "**random_state**: optional. It acts like a random seed: fixing random_state ensures the result is the same every time we run the model. " ] },
{ "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# fit lda model\n", "ldamodel = models.ldamodel.LdaModel(train_corpus, num_topics=10, id2word=dictionary, passes=20, random_state=259)\n" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['result/finalized_model_10.sav']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# save the lda model\n", "filepath = 'result/finalized_model_10.sav'\n", "joblib.dump(ldamodel, filepath)" ] },
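{ "cell_type": "markdown", "metadata": {}, "source": [ "Because the fitted model is persisted to disk, later sessions can reload it instead of refitting (20 passes over the full corpus is slow). A minimal sketch, assuming the file written above exists:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reload the fitted model and query it for an arbitrary document's topic mix\n", "loaded_model = joblib.load('result/finalized_model_10.sav')\n", "print(loaded_model.get_document_topics(train_corpus[0]))\n" ] },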
{ "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0,\n", " '0.109*\"food\" + 0.075*\"place\" + 0.040*\"service\" + 0.024*\"time\" + 0.020*\"restaurant\" + 0.016*\"menu\" + 0.013*\"staff\" + 0.012*\"everything\"'),\n", " (1,\n", " '0.047*\"pizza\" + 0.028*\"place\" + 0.017*\"cleveland\" + 0.011*\"time\" + 0.010*\"way\" + 0.009*\"line\" + 0.009*\"root\" + 0.009*\"home\"'),\n", " (2,\n", " '0.052*\"food\" + 0.035*\"time\" + 0.034*\"service\" + 0.019*\"experience\" + 0.018*\"order\" + 0.018*\"night\" + 0.015*\"place\" + 0.015*\"restaurant\"'),\n", " (3,\n", " '0.018*\"sauce\" + 0.015*\"salad\" + 0.014*\"flavor\" + 0.014*\"dinner\" + 0.013*\"pork\" + 0.013*\"chicken\" + 0.013*\"meal\" + 0.011*\"cream\"'),\n", " (4,\n", " '0.055*\"thai\" + 0.037*\"sushi\" + 0.035*\"spicy\" + 0.032*\"rice\" + 0.030*\"roll\" + 0.025*\"tea\" + 0.024*\"curry\" + 0.020*\"shrimp\"'),\n", " (5,\n", " '0.086*\"coffee\" + 0.069*\"breakfast\" + 0.054*\"brunch\" + 0.027*\"bacon\" + 0.020*\"egg\" + 0.019*\"toast\" + 0.018*\"morning\" + 0.018*\"hash\"'),\n", " (6,\n", " '0.066*\"beef\" + 0.030*\"pho\" + 0.026*\"pork\" + 0.025*\"soup\" + 0.020*\"cleveland\" + 0.016*\"bowl\" + 0.014*\"pot\" + 0.014*\"meat\"'),\n", " (7,\n", " '0.067*\"beer\" + 0.053*\"place\" + 0.047*\"bar\" + 0.047*\"food\" + 0.034*\"selection\" + 0.020*\"service\" + 0.018*\"night\" + 0.017*\"cleveland\"'),\n", " (8,\n", " '0.043*\"sandwich\" + 0.034*\"wife\" + 0.028*\"bbq\" + 0.024*\"dog\" + 0.020*\"meat\" + 0.019*\"melt\" + 0.017*\"beef\" + 0.017*\"chicken\"'),\n", " (9,\n", " '0.109*\"burger\" + 0.051*\"hour\" + 0.030*\"bar\" + 0.023*\"tacos\" + 0.023*\"bartender\" + 0.022*\"spot\" + 0.019*\"b\" + 0.015*\"taco\"')]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ldamodel.print_topics(num_topics=10, num_words=8)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The LDA model finds 10 topics; for each topic, the 8 most probable words are shown above. The 10 topics are relatively interpretable. By associating and categorizing the high-frequency words of each topic, we name the topics as follows:" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0, 'Service1')\n", "(1, 'Location')\n", "(2, 'Service2')\n", "(3, 'American1')\n", "(4, 'Asian1')\n", "(5, 'Breakfast')\n", "(6, 'Asian2')\n", "(7, 'Bar')\n", "(8, 'American2')\n", "(9, 'Mexican')\n" ] } ], "source": [ "topic_dict = {0: \"Service1\", 1: \"Location\", 2: \"Service2\", 3: \"American1\",\n", "              4: \"Asian1\", 5: \"Breakfast\", 6: \"Asian2\", 7: \"Bar\", 8: \"American2\", 9: \"Mexican\"}\n", "for keys, values in topic_dict.items():\n", "    print((keys, values))" ] },
{ "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# save the names of the topics\n", "with shelve.open('result/topic_name') as result:\n", "    result['topic_name'] = topic_dict" ] },
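{ "cell_type": "markdown", "metadata": {}, "source": [ "With the names assigned, it can be helpful to print each named topic next to its top words. This is purely a display convenience; the name-to-topic mapping itself is the judgment call made above:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# show each topic's assigned name with its five most probable words\n", "for k, name in topic_dict.items():\n", "    top_words = \", \".join(word for word, prob in ldamodel.show_topic(k, topn=5))\n", "    print(\"%-10s %s\" % (name, top_words))\n" ] },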
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Further Interpretation of Topics with a Review" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at an original review from the validation set." ] },
{ "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "If I didn't have to pay the bill, I'd enjoy this restaurant a lot more.\n", "\n", "Yea, yea, I know -- I could say that about any place. But this one seems to fit that statement more than almost any other in Cleveland. \n", "\n", "Two burgers -- perfectly cooked and well seasoned with just the right amount of salt and other mouth-watering dashes of spice (plus the bun was nicely seasoned, something that many burger joints neglect). A decent side of fries. About a dozen chicken wings -- Falling off the bone and \"Chef-ed up\" with lemon juice, scallions, jalapeno and garlic, not simply smothered in a thick reddish-orange sauce.\n", "\n", "But why does all of that still have to cost $50? (And those were some of the least expensive items on the menu). \n", "\n", "Nevertheless, the drinks were great -- I had two different unique takes on the Old-Fashioned (who would have thought that Curacao works with Bourbon) and Jeannene had a new spin on the French 75 before trying one of the Old-Fashioneds -- and they were well worth another $50.\n", "\n", "Coupled with attentive service and a great patio seating on E. 4th Street, I am happy to give it four stars. But the prices seem to promise a 5-star experience.\n" ] } ], "source": [ "print(review_example['text'][35868])" ] },
{ "cell_type": "code", "execution_count": 123, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# save the first raw review of the validation set\n", "with shelve.open('result/first_review_vali') as result:\n", "    result['first_review_vali'] = review_example['text'][35868]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Then, let's look at the topic probabilities associated with this review." ] },
{ "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(2, 0.29715302229392604),\n", " (0, 0.24879956111281701),\n", " (9, 0.21169559380609276),\n", " (3, 0.16659952745901788),\n", " (4, 0.057228951701560421)]" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_topics = []\n", "for doc in vali_corpus:\n", "    topics = sorted(ldamodel[doc], key=lambda x: x[1], reverse=True)\n", "    all_topics.append(topics)\n", "\n", "all_topics[0]" ] },
{ "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# save the topic probabilities of the first review in the validation set\n", "with shelve.open('result/first_review_topics_vali') as result:\n", "    result['first_review_topics_vali'] = all_topics[0]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Topics 0 and 2 are the service topics. Topics 9, 3, and 4 represent Mexican, American, and Asian food respectively. The topics look reasonable: the review talks a lot about the service, and the service topics have the highest probabilities for this review. However, the restaurant appears to be an American restaurant, yet the Mexican food topic has the highest probability among the three food topics. This suggests that topic modeling may not describe every aspect of a restaurant perfectly." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Thus, we do not expect high accuracy when predicting customer ratings from the topics, and the next section confirms this concern." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Customer Rating Prediction with Topics" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we use the topic probabilities found by the LDA model to predict the ratings given by customers. Traditional linear regression is applied here, in order to see whether the topics of a review are strongly associated with its customer rating." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "First, we use the probability of each topic as the elements of the design matrix." ] },
{ "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def design_matrix_creation(n_topics, topics_prob_list):\n", "    \"\"\"\n", "    function to get the design matrix for linear regression from the LDA topics \n", "\n", "    Input\n", "    ----------\n", "    n_topics: number of LDA topics\n", "    topics_prob_list: a list of lists of (LDA topic, probability) tuples, one list per review\n", "\n", "    Output\n", "    ------\n", "    design_matrix: a matrix containing the LDA topic probabilities of each review as an observation\n", "\n", "    \"\"\"\n", "    nrows = len(topics_prob_list)\n", "    design_matrix = np.zeros((nrows, n_topics))\n", "    for i in range(nrows):\n", "        items = topics_prob_list[i]\n", "        for s in items:\n", "            topic_prob = s[1]\n", "            design_matrix[i, s[0]] = topic_prob\n", "    return design_matrix" ] },
{ "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# get the design matrix for the validation set\n", "design_matrix = design_matrix_creation(10, all_topics)" ] },
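{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick check of the helper on two fabricated topic lists (two documents, four topics) shows how topics that LDA omits for a document are filled with zeros:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# two made-up documents: lists of (topic id, probability) tuples\n", "toy_topics = [[(0, 0.6), (2, 0.4)], [(1, 0.9), (3, 0.1)]]\n", "\n", "# rows are documents, columns are topics; absent topics stay 0\n", "print(design_matrix_creation(4, toy_topics))\n" ] },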
{ "cell_type": "markdown", "metadata": {}, "source": [ "After creating the design matrix, we merge a few topics that reflect similar content. We merge topics 0 and 2, since they both reflect the service of the restaurants. Topics 3 and 8 are merged, because they both represent American food. Finally, topics 4 and 6 are merged, since they are both Asian food topics. The design matrix therefore ends up with 7 features. Our response variable is the customer rating." ] },
{ "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# merge the similar topics\n", "design_matrix[:, 0] += design_matrix[:, 2]\n", "design_matrix[:, 3] += design_matrix[:, 8]\n", "design_matrix[:, 4] += design_matrix[:, 6]\n", "\n", "design_matrix = design_matrix[:, [0, 1, 3, 4, 5, 7, 9]]" ] },
{ "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create linear regression object\n", "regr = LinearRegression()\n", "\n", "# train the model using the design matrix and the ratings\n", "regr.fit(design_matrix, review_example.stars)\n", "\n", "\n", "# the mean squared error\n", "mse = np.mean((regr.predict(design_matrix) - review_example.stars) ** 2)\n" ] },
{ "cell_type": "code", "execution_count": 125, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lm_result = {'Coefficients': regr.coef_, 'Intercept': regr.intercept_, 'Mean squared error': mse}\n", "\n", "# save the linear regression result\n", "with shelve.open('result/lm_result') as result:\n", "    result['lm_result'] = lm_result" ] },
{ "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Coefficients\n", "[-3.37275057 -2.04183867 -2.21252578 -1.61272055 -1.7564765  -1.5980306\n", " -3.56868465]\n", "Intercept\n", "6.27726471738\n", "Mean squared error\n", "1.6797105653141962\n" ] } ], "source": [ "for keys, values in lm_result.items():\n", "    print(keys)\n", "    print(values)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We use MSE (mean squared error) as the accuracy metric, since the original ratings are integers while the predicted ratings are floating-point numbers. From the result above, the MSE is quite high, which means that using the topic probabilities to predict customer ratings is not very accurate for our data. Therefore, another supervised learning model, multinomial logistic regression, is applied to the data as well; that analysis and its results can be found in another notebook." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Testing" ] },
{ "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def test_tokenize_text_input():\n", "    \"\"\"Test that if the input is not a list of strings, an AttributeError is raised\"\"\"\n", "    try:\n", "        file = [1, 2, 3, 4]\n", "        tokenize_text(file)\n", "    except AttributeError:\n", "        assert True\n", "    else:\n", "        assert False\n", "\n", "\n", "def test_tokenize_text_output():\n", "    \"\"\"Test that the output of tokenize_text is a list containing lists of strings. 
\"\"\"\n", "    file = [\"Today is a sunny day\", \"I meet Eli at Evans\"]\n", "    obs = tokenize_text(file)\n", "    for text_list in obs:\n", "        obs1 = isinstance(text_list, list)\n", "        exp1 = True\n", "        assert obs1 == exp1\n", "\n", "        for text in text_list:\n", "            obs2 = isinstance(text, str)\n", "            exp2 = True\n", "            assert obs2 == exp2\n", "\n", "\n", "def test_design_matrix_creation_input():\n", "    \"\"\"Test that when the input n_topics is not an integer, a TypeError is raised\"\"\"\n", "    try:\n", "        topic_prob = [[(0, 0.5), (1, 0.5)], [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)]]\n", "        m = design_matrix_creation(4.5, topic_prob)\n", "    except TypeError:\n", "        assert True\n", "    else:\n", "        assert False\n", "\n", "\n", "def test_design_matrix_creation_output():\n", "    \"\"\"Test that the design matrix has the right dimensions. \"\"\"\n", "    topic_prob = [[(0, 0.5), (1, 0.5)], [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)]]\n", "    m = design_matrix_creation(4, topic_prob)\n", "    obs1 = m.shape[0]\n", "    obs2 = m.shape[1]\n", "    exp1 = 2\n", "    exp2 = 4\n", "    assert obs1 == exp1\n", "    assert obs2 == exp2\n" ] },
{ "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": true }, "outputs": [], "source": [ "test_tokenize_text_input()\n", "test_tokenize_text_output()\n", "test_design_matrix_creation_input()\n", "test_design_matrix_creation_output()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }