{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic Regression Modeling with Cross Validation on Word Frequency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw in the previous notebook, the topic probabilities of reviews have little power to predict customers’ ratings of restaurants. Thus, in this notebook, we investigate how multinomial logistic regression performs. We calculate the TF-IDF statistics from the reviews and use them to predict customers’ ratings. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: load dependencies and read data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.feature_selection import SelectKBest, chi2\n", "import numpy as np\n", "from sklearn.model_selection import KFold\n", "from sklearn.linear_model import LogisticRegression\n", "import unittest\n", "import matplotlib.pyplot as plt\n", "import re\n", "import scipy.sparse\n", "import random" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# set the seed\n", "random.seed(259)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# load the data\n", "review = pd.read_csv(\"data/reviews.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: _Term Frequency - Inverse Document Frequency (TF-IDF)_ Transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TF-IDF stands for \"Term Frequency - Inverse Document Frequency\". It is a widely used technique for detecting important words in a collection of documents. 
\"Term Frequency\" (TF) meansures the frequency of word $w_i$ in document $d_j$, and the \"Inverse Document Frequency\" (IDF) measures how much information the word provides, i.e., the frequency of word $w_i$ in the collection of documents. The TF-IDF value for a word $w_i$ in document $d_j$ is positively associated with word frequencies and negatively associated with document frequencies. The math formula for TF-IDF is:\n", "\n", "$$TF-IDF(w_i, d_j) = TF(w_i, d_j) \\times IDF(w_i)$$\n", "\n", "And IDF can be smoothed using the formula:\n", "\n", "$$IDF_{smooth}(w_i) = log(\\frac{N}{1 + n_i})$$\n", "\n", "where $N$ is the number of documents considered and $n_i$ is the frequency of $w_i$ in the all documents considered.\n", "\n", "In this project, TF-IDF is used in logistic regression classification. In the following analysis, we did several steps to fit the best logistic regression model:\n", "\n", "1. constructed the TF-IDF matrix, \n", "2. used $\\chi^2$ independent test to select top $1,000$ keywords from training set,\n", "3. computed the TF-IDF values of the $1,000$ keywords,\n", "4. splited the whole dataset into training set and validation set using 10-fold cross-valudation,\n", "5. used the _TF-IDF values_ as covariates, the _star values_ of review (ratings) as responses, to build a logistic regression model in the training set,\n", "6. tried 3 different tuning parameters respectively,\n", "7. applied the models built in training set to validation set and obtained the predicted _star values_ for each tuning parameter,\n", "8. computed the Mean Squared Error (MSE) between true _star value_ and predicted _star value_ in validation set, \n", "9. and chose the optimal tuning parameters which produces lowest MSE." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### construct the TF-IDF matrix" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# put the raw dataset into the format needed for the TF-IDF transformation\n", "star = np.array(review.stars)\n", "# drop the first two characters and the last one of each review, then restore escaped newlines\n", "text = list(map(lambda x: x[2:-1].replace(\"\\\\n\",\"\\n\"), review.text))\n", "# replace punctuation with spaces and lowercase the text\n", "pat = re.compile(r\"[^\\w\\s]\")\n", "text_clean = np.array(list(map(lambda x: pat.sub(\" \",x).lower(), text)))\n", "\n", "# create TF-IDF\n", "vectorizer = TfidfVectorizer(stop_words = \"english\")\n", "text_features = vectorizer.fit_transform(text_clean)\n", "# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()\n", "vocab = vectorizer.get_feature_names_out()\n", "# save\n", "scipy.sparse.save_npz('result/text_features.npz', text_features)\n", "np.save('result/star', star)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of observations in the text_features dataset is 60222 \n", "Number of covariates in the text_features dataset is 50137\n" ] } ], "source": [ "print(\"Number of observations in the text_features dataset is\", text_features.shape[0],\n", "      \"\\nNumber of covariates in the text_features dataset is\", text_features.shape[1])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The format of text_feature is\n", " (0, 17785)\t0.060146376659\n", " (0, 19534)\t0.0664307345244\n", " (0, 37080)\t0.0971322128585\n", " (0, 6868)\t0.215221170104\n", " (0, 41758)\t0.14424586198\n", " (0, 16242)\t0.115560138128\n", " (0, 3668)\t0.169463104229\n", " (0, 33687)\t0.18134186701\n", " (0, 42347)\t0.244903534338\n", " (0, 30214)\t0.173935675158\n", " (0, 19663)\t0.219337002949\n", " (0, 10430)\t0.210710863093\n", " (0, 38753)\t0.253060975895\n", " (0, 11918)\t0.548103837493\n", " (0, 1483)\t0.283007805783\n", " (0, 2376)\t0.371248887227\n", " (0, 43674)\t0.274472118608\n", 
"The format of star is\n", " [2 4 5 ..., 4 5 5]\n" ] } ], "source": [ "print(\"The format of text_feature is\\n\", text_features[-1:])\n", "print(\"The format of star is\\n\", star)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Compute MSE\n", "### Write a function to compute step 2 to 8" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def compute_CV_mse(df_text, df_star, n_fold, n_words, seed, parameters):\n", " \"\"\"Return the Mean Squared Error (MSE) between predictied reponses and true responses in validation set.\n", " \n", " Parameters\n", " ----------\n", " \n", " df_text: TF-IDF format sparse matrix\n", " df_star: array of responses in logistic model\n", " n_fold: number of folds in cross-validation, positive integer\n", " n_words: number of keywords selected, positive integer\n", " seed: random seed for splitting training and validation set\n", " parameters: tuning parameters of logistic regression, positive float vector\n", " \n", " Return\n", " ------\n", " Array\n", " A numeric Array where each value in dimension 0 is the tuning parameter, \n", " and each value in dimension 1 is MSE computed using the corresponding tuning parameter\n", " \n", " Example\n", " -------\n", " \n", " >>> text_features = vectorizer.fit_transform(text_clean)\n", " ... star = np.array(review.stars)\n", " ... 
compute_CV_mse(text_features, star, 2, 10, 1, (100.0, 1000.0))\n", "    \"\"\"\n", "    # n_fold and n_words must be positive integers\n", "    if not isinstance(n_fold, int):\n", "        raise TypeError(\"n_fold is not an integer\")\n", "    if not n_fold > 0:\n", "        raise ValueError(\"n_fold should be positive\")\n", "    if not isinstance(n_words, int):\n", "        raise TypeError(\"n_words is not an integer\")\n", "    if not n_words > 0:\n", "        raise ValueError(\"n_words should be positive\")\n", "    # every tuning parameter must be positive\n", "    for i in parameters:\n", "        if not i > 0:\n", "            raise ValueError(\"parameters should be positive\")\n", "    # create K-folds\n", "    kf = KFold(n_fold, shuffle = True, random_state = seed)\n", "    # per-fold MSE: one row per fold, one column per tuning parameter\n", "    mse = np.zeros([n_fold, len(parameters)])\n", "    k = 0\n", "    for train_idx,val_idx in kf.split(df_text):\n", "        # create training and validation sets\n", "        text_features_train = df_text[train_idx]\n", "        text_features_val = df_text[val_idx]\n", "        star_train = df_star[train_idx]\n", "        star_val = df_star[val_idx]\n", "        # use the chi-squared test to select the top n_words keywords from the training set\n", "        fselect = SelectKBest(chi2, k = n_words)\n", "        # fit the selector on the training set and keep only the selected columns\n", "        text_features_train = fselect.fit_transform(text_features_train, star_train)\n", "        # keep the same columns in the validation set\n", "        text_features_val = text_features_val[:, fselect.get_support()]\n", "        # compute MSE for each parameter\n", "        t = 0\n", "        for para in parameters:\n", "            # logistic regression with C = parameter,\n", "            # where C is a positive float, the \"inverse of regularization strength\"; \n", "            # smaller values specify stronger regularization.\n", "            mod_temp = LogisticRegression(C = para)\n", "            # fit regression on training set\n", "            mod_temp.fit(X = text_features_train, y = star_train)\n", "            # predict star values on validation set\n", "            pred = mod_temp.predict(X = 
text_features_val)\n", "            # store the MSE of this fold (row k) for this tuning parameter (column t)\n", "            mse[k,t] = np.mean((pred - star_val)**2)\n", "            t+= 1\n", "        k+= 1\n", "    # compute overall MSE by averaging over all folds (rows 0 to n_fold - 1)\n", "    mse_out = np.mean(mse[:n_fold], axis = 0)\n", "    return(np.vstack((parameters, mse_out)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Execute the function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the MSE with 10-fold cross-validation on the top 1,000 keywords, with the random seed for splitting the training and validation sets set to 1. The tuning parameters originally tried were (1, 100, 1000, 10000, 100000).\n", "\n", "**NOTE:** The original tuning parameter range was [1, 100,000]; the current range [10, 100] was selected after many trials as the best range of tuning parameters." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# output\n", "para, mse = compute_CV_mse(df_text = text_features, df_star = star, n_fold = 10, \n", "                           n_words = 1000, seed = 1, parameters = list(range(10, 110, 10)))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sorted MSE and corresponding parameters: small to big\n" ] }, { "data": { "text/html": [ "
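<table>\n",
"  <thead>\n",
"    <tr><th></th><th>mse</th><th>parameters</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>4</th><td>0.748335</td><td>50.0</td></tr>\n",
"    <tr><th>3</th><td>0.749202</td><td>40.0</td></tr>\n",
"    <tr><th>1</th><td>0.749553</td><td>20.0</td></tr>\n",
"    <tr><th>2</th><td>0.749811</td><td>30.0</td></tr>\n",
"    <tr><th>5</th><td>0.749885</td><td>60.0</td></tr>\n",
"    <tr><th>6</th><td>0.750014</td><td>70.0</td></tr>\n",
"    <tr><th>7</th><td>0.751435</td><td>80.0</td></tr>\n",
"    <tr><th>8</th><td>0.753058</td><td>90.0</td></tr>\n",
"    <tr><th>0</th><td>0.753077</td><td>10.0</td></tr>\n",
"    <tr><th>9</th><td>0.753630</td><td>100.0</td></tr>\n",
"  </tbody>\n",
"</table>\n", "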