In [14]:
# import necessary packages
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import shelve
import scipy
from sklearn.externals import joblib
Summary: Modeling Ohio’s Restaurant Yelp Review Data¶
Author: Ningning Long, Yue You, Tian Xia
Abstract¶
When running a business like a restaurant or cafe, owners care a lot about how they can make use of the reviews customers leave after visiting. It is also often too time-consuming to go through every review of a business, so people tend to pay attention only to the ratings. However, an overall rating cannot convey the experience that led a reviewer to that score, and people have different rating standards, so a rating alone may be misleading. For example, a person who cares a lot about ambiance may feel disappointed after visiting a restaurant that is highly rated for the taste of its food but has below-average ambiance. Therefore, we hope to extract the topics discussed in customers’ reviews in order to see which factors influence customer ratings. The insights mined from reviews can also help a business discover its weaknesses and improve its service, products, or ambiance.
This study uses machine learning and natural language processing (NLP) techniques to analyze customer reviews from the Yelp Dataset Challenge. Specifically, we aim to identify the topics involved in each review and the association between reviews and customer ratings. We focus on a subset of the full challenge dataset: restaurant reviews in the state of Ohio. We use Latent Dirichlet Allocation (LDA), an unsupervised machine learning algorithm that generates topics based on word frequency across a set of documents, to find the topics underlying customers’ reviews. We then use the topics to investigate how well they match the reviews and to predict customer ratings. For comparison, we also compute the TF-IDF of the reviews and use it in a multinomial logistic regression to predict customer ratings.
Data Cleaning and Exploration¶
The raw data are JSON files from the Round 10 Yelp Dataset Challenge. We subset and extract part of the data and save it in the ‘data’ folder. The ‘restaurant.csv’ file contains the metadata for restaurants in the state of Ohio. The ‘reviews.csv’ file, which we mainly work with, contains customers’ reviews of restaurants in Ohio. Our Latent Dirichlet Allocation (LDA) model and multinomial logistic regression model mainly use the reviews in that file.
Here is a quick look at the ‘restaurant.csv’ file in the ‘data’ folder:
In [2]:
pd.read_csv('./data/restaurant.csv',index_col=0).head()
Out[2]:
state | city | address | name | business_id | stars | review_count | categories | |
---|---|---|---|---|---|---|---|---|
0 | OH | Painesville | 1 S State St | Sidewalk Cafe Painesville | Bl7Y-ATTzXytQnCceg5k6w | 3.0 | 26 | ['American (Traditional)', 'Breakfast & Brunch... |
1 | OH | Northfield | 10430 Northfield Rd | Zeppe's Pizzeria | 7HFRdxVttyY9GiMpywhhYw | 3.0 | 7 | ['Pizza', 'Caterers', 'Italian', 'Wraps', 'Eve... |
2 | OH | Mentor | 9209 Mentor Ave | Firehouse Subs | lXcxSdPa2m__LqhsaL9t9A | 3.5 | 9 | ['Restaurants', 'Sandwiches', 'Delis', 'Fast F... |
3 | OH | Cleveland | 13181 Cedar Rd | Richie Chan's Chinese Restaurant | Pawavw9U8rjxWVPU-RB7LA | 3.5 | 22 | ['Chinese', 'Restaurants'] |
4 | OH | Northfield | 134 E Aurora Rd | Romeo's Pizza | RzVHK8Jfcy8RvXjn_z3OBw | 4.0 | 4 | ['Restaurants', 'Pizza'] |
Here is a quick look at the ‘reviews.csv’ file in the ‘data’ folder. In our analysis we primarily work with the ‘text’ column, which stores the customers’ reviews, and the ‘stars’ column, which stores the customers’ actual ratings of the restaurants.
In [3]:
pd.read_csv('./data/reviews.csv',index_col=0).head()
Out[3]:
business_id | cool | date | funny | review_id | stars | text | useful | user_id | |
---|---|---|---|---|---|---|---|---|---|
52 | tulUhFYMvBkYHsjmn30A9w | 1 | 2013-11-19 | 0 | FsS5TUFPI8QJEE60-HR3dw | 2 | Wished it was better..\nAfter watching man vs.... | 1 | bWh4k_cCuVt5GLVd33xIxg |
53 | tulUhFYMvBkYHsjmn30A9w | 1 | 2014-12-18 | 0 | 7xGHiLP1vAaGmX6srC_XXw | 4 | Decor and service leave much to be desired, bu... | 0 | nQ4e81UdfczimYcIUtO3HA |
54 | tulUhFYMvBkYHsjmn30A9w | 1 | 2014-09-12 | 0 | ZWlXWc9LHPLiOksrp-enyw | 5 | My husband and I ate here tonight for the firs... | 0 | gJPa95ZRozMhiOqvENpspA |
55 | tulUhFYMvBkYHsjmn30A9w | 1 | 2012-02-28 | 1 | KpRwKYyQ93ypyDSdA7IXfw | 2 | Don't believe the hype. Nooooo! \n\nIn the Cle... | 5 | bAwfPH4lXNzgcYp9JFy6ow |
56 | tulUhFYMvBkYHsjmn30A9w | 3 | 2014-10-06 | 6 | OZvrgp4vWBsYqIt3-YMSEw | 3 | Don't believe the hype!\n\nAfter seeing this l... | 10 | BjtJ3VkMOxV2Lan037AFuw |
To summarize the data: we focus on the 316 restaurants in the state of Ohio that have at least 100 customer reviews. The maximum number of reviews for a single restaurant in our sample is around 900. The distribution of mean star ratings across restaurants is skewed, with a peak around 4 stars.
Topic Analysis with Latent Dirichlet Allocation (LDA)¶
Text Manipulation¶
Before applying the LDA model, we manipulate the customer reviews in order to obtain tokenized review data. For instance, an original review from the Yelp dataset is shown below.
In [4]:
with shelve.open('result/first_review_train') as result:
first_review_train = result['first_review_train']
print(first_review_train)
Wished it was better..
After watching man vs. food I decided to stop by, décor was not that homey and welcoming, and the neighborhood was bad, but nothing I haven't been around before. The ribs were very fatty and grisly, it was disappointing and I didn't get enough sauce and when I asked for a little more they wanted to charge me, the coleslaw was awesome! I noticed a hair in my food and it turned me off to the rest of it, so i threw it away , I wont be returning...
sorry guys
In order to find the topics, we need to tokenize the reviews. First, we convert all words in each review to lower case for convenience in the topic analysis. Then, we apply some string manipulation to keep only meaningful words and numbers. Next, we delete the stop words from each review: certain parts of English speech, like ‘for’ or ‘or’, that carry no meaning for the topic model. Finally, we decide to keep only the nouns for further analysis.
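Below is a minimal sketch of this preprocessing pipeline, written with NLTK as an illustrative assumption; the regular expression, stop-word list, and part-of-speech tagger shown here may differ from the exact steps in LDA.ipynb.

import re
import nltk
from nltk.corpus import stopwords

# NLTK resources needed once (uncomment on first run):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger')

def tokenize_review(review):
    """Lower-case, keep words/numbers, drop stop words, keep nouns only (illustrative)."""
    text = review.lower()                                   # 1. lower case
    text = re.sub(r"[^a-z0-9\s]", " ", text)                # 2. keep meaningful words and numbers
    tokens = nltk.word_tokenize(text)
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]          # 3. remove stop words
    tagged = nltk.pos_tag(tokens)
    return [word for word, tag in tagged if tag.startswith("NN")]  # 4. keep nouns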
After the text manipulation, the above raw review becomes the following word list.
In [5]:
with shelve.open('result/text_array') as result:
text_array = result['text_array']
print(text_array[0])
['man', 'vs', 'food', 'cor', 'homey', 'neighborhood', 'nothing', 'ribs', 'sauce', 'charge', 'coleslaw', 'awesome', 'hair', 'food', 'rest', 'threw']
The word list contains words about the food, service, and location of the restaurant, which are helpful for our topic modeling.
Then, we use the Dictionary() function to traverse text_array, assigning a unique integer id to each unique token while also collecting word counts and related statistics. Next, we use the doc2bow() function to convert each tokenized review into a bag-of-words representation. The details can be found in LDA.ipynb. The resulting corpus is used as the input to the LDA model.
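A minimal sketch of this step with gensim, assuming text_array holds the tokenized reviews as above (the actual calls are in LDA.ipynb):

from gensim.corpora import Dictionary

# Assign a unique integer id to each token and collect word counts
dictionary = Dictionary(text_array)

# Convert each tokenized review to a bag-of-words: a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(tokens) for tokens in text_array]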
Training Set and Validation Set¶
In order to fit and apply the LDA model, we split the dataset into a training set and a validation set. We use the 896 reviews of the restaurant with the largest number of reviews as the validation set, and the remaining reviews as the training set. The training set is used to train the LDA model and find the topics, and the trained LDA model is applied to the validation set for further analysis and linear regression.
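A sketch of how such a split could be carried out with pandas; the variable names are illustrative, and the actual splitting code is in LDA.ipynb.

import pandas as pd

reviews = pd.read_csv('./data/reviews.csv', index_col=0)

# The restaurant with the most reviews supplies the validation set (896 reviews in our data);
# every other review goes into the training set.
top_business = reviews['business_id'].value_counts().idxmax()
validation_reviews = reviews[reviews['business_id'] == top_business]
training_reviews = reviews[reviews['business_id'] != top_business]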
Latent Dirichlet Allocation Model¶
To discover latent topics in each review, we use Latent Dirichlet Allocation (LDA), a topic model that generates topics based on word frequency from a set of documents. LDA assumes that (1) documents contain multiple latent topics; (2) each document is generated by a generative process defined by a probabilistic model; and (3) each topic is characterized by a distribution over a fixed vocabulary. More specifically, the joint distribution of the hidden variables (topics) and the observed variables (words) is:
\[p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}),\]
where \(\beta_{1:K}\) are the topics (each a distribution over the vocabulary), \(\theta_d\) is the topic proportion vector of document \(d\), \(z_{d,n}\) is the topic assignment of the \(n\)-th word in document \(d\), and \(w_{d,n}\) is the observed \(n\)-th word in document \(d\).
LDA learns the distributions (i.e. the set of topics, their associated word probabilities, the topic of each word, and the particular topic mixture of each document) using Bayesian inference. After repeating the updating process a large number of times, the model reaches a steady state and can be used to estimate the hidden topics, the topic mixture of each document, and the words associated with each topic. We use LdaModel from the gensim package to fit the LDA model on our training set. When fitting the model, we tried 6, 8, 10, 12, and 15 as the number of topics, and 10 topics appeared to work best. Thus, we save the model with 10 topics for further analysis.
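A minimal sketch of the fitting call with gensim’s LdaModel, assuming corpus and dictionary were built as above; the number of passes and the random seed are illustrative assumptions, and the model actually used below is loaded from result/finalized_model_10.sav.

from gensim.models import LdaModel

# Fit LDA on the bag-of-words corpus from the training reviews
ldamodel = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,     # chosen after trying 6, 8, 10, 12, and 15 topics
    passes=10,         # assumed number of passes over the corpus
    random_state=0,    # assumed seed for reproducibility
)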
In [7]:
# read in the lda model
ldamodel = joblib.load("result/finalized_model_10.sav")
# the topics found by lda model
ldamodel.print_topics(num_topics=10, num_words=8)
Out[7]:
[(0,
'0.109*"food" + 0.075*"place" + 0.040*"service" + 0.024*"time" + 0.020*"restaurant" + 0.016*"menu" + 0.013*"staff" + 0.012*"everything"'),
(1,
'0.047*"pizza" + 0.028*"place" + 0.017*"cleveland" + 0.011*"time" + 0.010*"way" + 0.009*"line" + 0.009*"root" + 0.009*"home"'),
(2,
'0.052*"food" + 0.035*"time" + 0.034*"service" + 0.019*"experience" + 0.018*"order" + 0.018*"night" + 0.015*"place" + 0.015*"restaurant"'),
(3,
'0.018*"sauce" + 0.015*"salad" + 0.014*"flavor" + 0.014*"dinner" + 0.013*"pork" + 0.013*"chicken" + 0.013*"meal" + 0.011*"cream"'),
(4,
'0.055*"thai" + 0.037*"sushi" + 0.035*"spicy" + 0.032*"rice" + 0.030*"roll" + 0.025*"tea" + 0.024*"curry" + 0.020*"shrimp"'),
(5,
'0.086*"coffee" + 0.069*"breakfast" + 0.054*"brunch" + 0.027*"bacon" + 0.020*"egg" + 0.019*"toast" + 0.018*"morning" + 0.018*"hash"'),
(6,
'0.066*"beef" + 0.030*"pho" + 0.026*"pork" + 0.025*"soup" + 0.020*"cleveland" + 0.016*"bowl" + 0.014*"pot" + 0.014*"meat"'),
(7,
'0.067*"beer" + 0.053*"place" + 0.047*"bar" + 0.047*"food" + 0.034*"selection" + 0.020*"service" + 0.018*"night" + 0.017*"cleveland"'),
(8,
'0.043*"sandwich" + 0.034*"wife" + 0.028*"bbq" + 0.024*"dog" + 0.020*"meat" + 0.019*"melt" + 0.017*"beef" + 0.017*"chicken"'),
(9,
'0.109*"burger" + 0.051*"hour" + 0.030*"bar" + 0.023*"tacos" + 0.023*"bartender" + 0.022*"spot" + 0.019*"b" + 0.015*"taco"')]
The LDA model finds 10 topics; the 8 most frequent words in each topic are shown above. The 10 topics are relatively interpretable. By associating and categorizing the high-frequency words of each topic, we name the topics as follows:
In [8]:
with shelve.open('result/topic_name') as result:
topic_dict = result['topic_name']
for keys,values in topic_dict.items():
print((keys, values))
(0, 'Service1')
(1, 'Location')
(2, 'Service2')
(3, 'American1')
(4, 'Asian1')
(5, 'Breakfast')
(6, 'Asian2')
(7, 'Bar')
(8, 'American2')
(9, 'Mexican')
Further Interpretation of Topics with a Review¶
In order to see if the topic modeling makes sense, we have extracted a review from the validation set and applied the LDA model to find the topic probabilities of this review.
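A sketch of how the trained model can score a held-out review, reusing the tokenizer and dictionary sketched earlier; the minimum_probability threshold is an assumption chosen to mirror the truncated list shown below.

# Tokenize the validation review and map it with the training dictionary
bow = dictionary.doc2bow(tokenize_review(example_review))

# Per-topic probabilities for this single review, largest first
topic_prob = ldamodel.get_document_topics(bow, minimum_probability=0.05)
topic_prob = sorted(topic_prob, key=lambda pair: pair[1], reverse=True)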
Let’s take a look at an original review from the validation set.
In [10]:
with shelve.open('result/first_review_vali') as result:
example_review = result['first_review_vali']
print(example_review)
If I didn't have to pay the bill, I'd enjoy this restaurant a lot more.
Yea, yea, I know -- I could say that about any place. But this one seems to fit that statement more than almost any other in Cleveland.
Two burgers -- perfectly cooked and well seasoned with just the right amount of salt and other mouth-watering dashes of spice (plus the bun was nicely seasoned, something that many burger joints neglect). A decent side of fries. About a dozen chicken wings -- Falling off the bone and "Chef-ed up" with lemon juice, scallions, jalapeno and garlic, not simply smothered in a thick reddish-orange sauce.
But why does all of that still have to cost $50? (And those were some of the least expensive items on the menu).
Nevertheless, the drinks were great -- I had two different unique takes on the Old-Fashioned (who would have thought that Curacao works with Bourbon) and Jeannene had a new spin on the French 75 before trying one of the Old-Fashioneds -- and they were well worth another $50.
Coupled with attentive service and a great patio seating on E. 4th Street, I am happy to give it four stars. But the prices seem to promise a 5-star experience.
Then, let’s see the topic probabilities associated with this review.
In [12]:
with shelve.open('result/first_review_topics_vali') as result:
topic_prob = result['first_review_topics_vali']
topic_prob
Out[12]:
[(2, 0.29715302229392604),
(0, 0.24879956111281701),
(9, 0.21169559380609276),
(3, 0.16659952745901788),
(4, 0.057228951701560421)]
Topics 0 and 2 are the service topics. Topics 9, 3, and 4 represent Mexican food, American food, and Asian food, respectively. The topics look relatively reasonable, since the review talks a lot about the service and the service topics have the highest probabilities in this review. However, although the review suggests an American restaurant, the Mexican food topic has the highest probability among the three food topics. This shows that the topic modeling may not perfectly describe every aspect of a restaurant.
Thus, we do not expect high accuracy when predicting customer ratings from the topics, and the next section confirms our concern.
Customer Rating Prediction with Topics¶
In this section, we use the topic probabilities found by the LDA model to predict the ratings given by customers. A traditional linear regression is applied here, in order to see whether the topics of a review are highly associated with the customer’s rating.
First, we use the probability of each topic as the elements of the design matrix. After creating the design matrix, we merge a few topics that reflect similar content: topics 0 and 2 are merged, since both reflect the service of the restaurants; topics 3 and 8 are merged, because both represent American food; and topics 4 and 6 are merged, since both are Asian food topics. Thus, the design matrix eventually has 7 features. Our response variable is the customer rating. We use MSE (Mean Squared Error) as the accuracy metric, since the original ratings are integers while the predicted ratings are floats. A sketch of the regression setup is shown below, followed by the saved results.
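A sketch of this regression with scikit-learn, assuming topic_matrix is an (n_reviews × 10) array of per-review topic probabilities from the LDA model and ratings holds the corresponding star values; both names are illustrative, and the results actually reported are loaded in the next cell.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Merge topics with similar content into single features (7 features in total)
X = np.column_stack([
    topic_matrix[:, 0] + topic_matrix[:, 2],   # Service (topics 0 and 2)
    topic_matrix[:, 3] + topic_matrix[:, 8],   # American food (topics 3 and 8)
    topic_matrix[:, 4] + topic_matrix[:, 6],   # Asian food (topics 4 and 6)
    topic_matrix[:, 1],                        # Location
    topic_matrix[:, 5],                        # Breakfast
    topic_matrix[:, 7],                        # Bar
    topic_matrix[:, 9],                        # Mexican
])

lm = LinearRegression().fit(X, ratings)
mse = mean_squared_error(ratings, lm.predict(X))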
In [13]:
# read in the linear regression result
with shelve.open('result/lm_result') as result:
lm_result = result['lm_result']
for keys,values in lm_result.items():
print(keys)
print(values)
Coefficients
[-3.37275057 -2.04183867 -2.21252578 -1.61272055 -1.7564765 -1.5980306
-3.56868465]
Intercept
6.27726471738
Mean squared error
1.6797105653141962
From the above result, we can see that the MSE is fairly high, which means that using topic probabilities to predict customer ratings is not very accurate for our data. Therefore, another supervised learning model, multinomial logistic regression, is applied to the data as well. The analysis and results can be found in the next section and in Logistic.ipynb.
Why does using topic probabilities to predict customer ratings not work well here? There are several possible explanations. For instance, when training the LDA model we only tried 6, 8, 10, 12, and 15 as the number of topics, choosing 10 as the best due to time limitations; a different number of topics might fit the data better. Therefore, we believe that further tuning of the LDA model’s parameters could improve the prediction results. Fitting the LDA model on a larger dataset could help as well.
Multinomial Logistic Regression with TF-IDF¶
TF-IDF transformation¶
As we have seen in the previous section, topic probabilities of reviews do not have good predictive power for customers’ ratings of restaurants. Thus, in this section, we investigate how multinomial logistic regression performs. We calculate TF-IDF statistics from the reviews and use them to predict customers’ ratings.
Definition of TF-IDF Transformation¶
TF-IDF stands for “Term Frequency - Inverse Document Frequency”. It is a powerful technique for detecting important words in a collection of documents. “Term Frequency” (TF) measures the frequency of word \(w_i\) in document \(d_j\), and “Inverse Document Frequency” (IDF) measures how much information the word provides, i.e., how common or rare the word is across the collection of documents. The TF-IDF value of word \(w_i\) in document \(d_j\) is positively associated with its frequency in the document and negatively associated with its document frequency. The formula for TF-IDF is:
\[\text{tf-idf}(w_i, d_j) = \text{tf}(w_i, d_j) \times \text{idf}(w_i),\]
and IDF can be smoothed using the formula:
\[\text{idf}(w_i) = \log\frac{1 + N}{1 + n_i} + 1,\]
where \(N\) is the number of documents considered and \(n_i\) is the number of documents that contain \(w_i\).
In this project, the TF-IDF values are used as features for logistic regression classification. In the following analysis, we performed several steps to fit the best logistic regression model:
Steps in TF-IDF Transformation¶
- constructed the TF-IDF matrix ‘text_features’, a sparse matrix; the responses of all observations are stored in ‘star’, an array (a sketch of this construction follows below).
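A minimal sketch of how such a matrix could be constructed with scikit-learn’s TfidfVectorizer; the vectorizer settings here are assumptions, and the matrix actually used below is loaded from result/text_features.npz.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = pd.read_csv('./data/reviews.csv', index_col=0)

# Sparse TF-IDF matrix: one row per review, one column per vocabulary term
vectorizer = TfidfVectorizer(stop_words='english', smooth_idf=True)
text_features = vectorizer.fit_transform(reviews['text'])

# Response vector: the observed star ratings
star = reviews['stars'].values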
In [16]:
# load transformed data
text_features = scipy.sparse.load_npz('result/text_features.npz')
star = np.load('result/star.npy')
In [18]:
print("Number of observations in the text_features dataset is", text_features.shape[0],
"\nNumber of covariates in the text_features dataset is", text_features.shape[1])
print("The format of text_feature is\n", text_features[-1:])
print("The format of star is\n", star)
Number of observations in the text_features dataset is 60222
Number of covariates in the text_features dataset is 50137
The format of text_feature is
(0, 17785) 0.060146376659
(0, 19534) 0.0664307345244
(0, 37080) 0.0971322128585
(0, 6868) 0.215221170104
(0, 41758) 0.14424586198
(0, 16242) 0.115560138128
(0, 3668) 0.169463104229
(0, 33687) 0.18134186701
(0, 42347) 0.244903534338
(0, 30214) 0.173935675158
(0, 19663) 0.219337002949
(0, 10430) 0.210710863093
(0, 38753) 0.253060975895
(0, 11918) 0.548103837493
(0, 1483) 0.283007805783
(0, 2376) 0.371248887227
(0, 43674) 0.274472118608
The format of star is
[2 4 5 ..., 4 5 5]
Next Steps¶
- split the whole dataset into training and validation sets using 10-fold cross-validation,
- used the TF-IDF values as covariates and the star values of the reviews as responses to build a logistic regression model on the training set,
- tried a set of different tuning parameters,
- applied the models built on the training set to the validation set and obtained predicted star values for each tuning parameter,
- computed the Mean Squared Error (MSE) between the true and predicted star values on the validation set,
- and chose the optimal tuning parameter, which produces the lowest MSE.
Here, we run a function called ‘compute_CV_mse’ to carry out the steps above.
We computed the MSE with 10-fold cross-validation on the first 1,000 keywords, with the random splitting seed for training and validation sets set to 1, and with initial tuning parameters (1, 100, 1000, 10000, 100000). After many trials, we selected the range [10, 100] as the optimal range of tuning parameters. A sketch of what the cross-validation routine could look like is shown below.
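This sketch of ‘compute_CV_mse’ assumes the tuning parameter corresponds to the inverse-regularization strength C of scikit-learn’s multinomial LogisticRegression; the real implementation is in Logistic.ipynb.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def compute_CV_mse(X, y, params, n_splits=10, seed=1):
    """Cross-validated MSE of a multinomial logistic regression for each tuning parameter."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mses = []
    for C in params:
        fold_mse = []
        for train_idx, vali_idx in kf.split(X):
            clf = LogisticRegression(C=C, multi_class='multinomial',
                                     solver='lbfgs', max_iter=1000)
            clf.fit(X[train_idx], y[train_idx])
            fold_mse.append(mean_squared_error(y[vali_idx], clf.predict(X[vali_idx])))
        mses.append(np.mean(fold_mse))
    return pd.DataFrame({'mse': mses, 'parameters': params})

# e.g. compute_CV_mse(text_features[:, :1000], star, params=np.arange(10, 110, 10))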
Here we show the output MSEs and their corresponding tuning parameters, sorted from smallest to largest:
In [23]:
df_sorted = pd.read_hdf('result/df_sorted.h5', 'df_sorted')
print("Sorted MSE and corresponding parameters: small to big")
df_sorted
Sorted MSE and corresponding parameters: small to big
Out[23]:
mse | parameters | |
---|---|---|
4 | 0.748335 | 50.0 |
3 | 0.749202 | 40.0 |
1 | 0.749553 | 20.0 |
2 | 0.749811 | 30.0 |
5 | 0.749885 | 60.0 |
6 | 0.750014 | 70.0 |
7 | 0.751435 | 80.0 |
8 | 0.753058 | 90.0 |
0 | 0.753077 | 10.0 |
9 | 0.753630 | 100.0 |
And the trend of MSE by tuning parameter is plotted:
In [24]:
print("The minimum MSE is", np.round(df_sorted.iloc[0][0], 4), "with tuning parameter =", df_sorted.iloc[0][1])
The minimum MSE is 0.7483 with tuning parameter = 50.0
We observe that the multinomial logistic regression performs reasonably well: the cross-validated mean squared error (MSE) is 0.748. This means that the TF-IDF statistics of the reviews have explanatory and predictive power for customers’ ratings of restaurants.
Conclusion¶
In this project, we found the topics of Yelp customer reviews with LDA and then used the topic probabilities to predict customers’ ratings. LDA identified 10 different topics in our training set, and these topics reflect different aspects of the restaurants, such as location, service, drinks, and types of food. However, the topic probabilities of reviews do not have good predictive power for customers’ ratings. Thus, in order to predict customers’ ratings, we also investigated how multinomial logistic regression performs with TF-IDF statistics computed from the reviews. The results show that the TF-IDF statistics have better explanatory and predictive power than the LDA topics.
Since time was limited, there are many improvements we could make to our analysis. For instance, we could tune more parameters when applying LDA, in order to find more interpretable topics. We could also look at data from other states to see how the topics differ across states. Furthermore, trying more machine learning models for predicting customers’ ratings from the reviews would be interesting as well.
Author Contributions¶
This repository and project are a collaboration between Ningning Long, Yue You, and Tian Xia. Their contributions are summarized as follows:
Ningning Long:
- ‘LDA.ipynb’, Latent Dirichlet Allocation modeling and analysis
- ‘environment.yml’
- Some write-up of ‘main.ipynb’
- Project website with Sphinx

Yue You:
- ‘Logistic.ipynb’, statistical modeling with multinomial logistic regression
- ‘.gitignore’
- Some write-up of ‘main.ipynb’

Tian Xia:
- ‘Data_Cleaning.ipynb’, data extraction and cleaning
- ‘README.md’
- Some write-up of ‘main.ipynb’
- ‘LICENSE.md’
- ‘Makefile’