In [2]:
# import necessary packages
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from gensim import corpora
from gensim import models
import nltk
import joblib  # note: sklearn.externals.joblib is deprecated in recent scikit-learn
from nltk.tag import pos_tag
import numpy as np
from sklearn.linear_model import LinearRegression
import shelve
Topic Analysis with Latent Dirichlet Allocation (LDA)¶
Read in the data¶
In [5]:
# read the data
review = pd.read_csv("data/reviews.csv")
Text Manipulation¶
In [119]:
# we focus on the text column (the reviews provided by customers)
raw_all = review.text.tolist()
first_review_train = raw_all[0]
print(first_review_train)
Wished it was better..
After watching man vs. food I decided to stop by, décor was not that homey and welcoming, and the neighborhood was bad, but nothing I haven't been around before. The ribs were very fatty and grisly, it was disappointing and I didn't get enough sauce and when I asked for a little more they wanted to charge me, the coleslaw was awesome! I noticed a hair in my food and it turned me off to the rest of it, so i threw it away , I wont be returning...
sorry guys
In [120]:
with shelve.open('result/first_review_train') as result:
    result['first_review_train'] = first_review_train
As the example above shows, we extract the text column, which contains the raw reviews from customers.
To tokenize the reviews, we first convert every word in each review to lowercase for convenience in the topic analysis. Then we apply some string manipulation to keep only meaningful words and numbers. Next, we remove all stop words from each review: words such as "for" or "or" that carry no meaning for the topic model. Finally, we keep only nouns for further analysis, since we only need to identify subtopics.
In [8]:
def tokenize_text(raw):
    """
    function to get the tokenized list of reviews from the raw reviews
    Input
    ----------
    raw: a list of review text
    Output
    ------
    text_array: a list of tokenized review word sets
    """
    # define the stopwords
    stop = set(stopwords.words('english'))
    text_array = []
    for i in range(len(raw)):
        # for each review, change the words to lowercase
        text = raw[i].lower()
        # get rid of \r and \n in each string
        text = text.replace('\r\n', ' ')
        # get rid of all the elements that are not characters or numbers
        text = re.sub("[^a-z0-9]", " ", text)
        # tokenization segments a document into its atomic elements
        words = text.split()
        # delete stop words
        words = [j for j in words if j not in stop]
        # only keep the words that are nouns, since we only need to find the subtopics
        tagged_sent = pos_tag(words)
        words = [word for word, pos in tagged_sent if pos == 'NN']
        text_array.append(words)
    return text_array
In [10]:
text_array = tokenize_text(raw_all)
The text_array contains all the tokenized reviews in our data set. Then, we use the Dictionary() function to traverse the texts, assigning a unique integer id to each unique token while also collecting word counts and other relevant statistics. To see each token’s unique integer id, try print(dictionary.token2id).
In [11]:
# save the dictionary results
dictionary = corpora.Dictionary(text_array)
dictionary.save('result/dictionary.dic')
# save the text_array
with shelve.open('result/text_array') as result:
    result['text_array'] = text_array
Next, we use the doc2bow() function to convert each tokenized document into a bag-of-words vector. The result, corpus, is a list with one vector per document, and each document vector is a list of tuples.
In [13]:
corpus = [dictionary.doc2bow(text) for text in text_array]
In [14]:
print(dictionary.token2id['man'])
0
In [15]:
print(corpus[0])
[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
The above list of tuples represents our first document (i.e. the first review after tokenizing). The tuples are (term id, term frequency) pairs. Since dictionary.token2id['man'] shows that the integer id of "man" is 0, the first tuple indicates that "man" appears once in the first tokenized review. doc2bow() only includes terms that actually occur: terms that do not occur in a document will not appear in that document's vector.
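As a quick check (a minimal sketch, not part of the original notebook), the dictionary can also map the term ids in a bag-of-words vector back to readable (word, count) pairs; this relies only on the dictionary and corpus objects built above:
# decode the first document's bag-of-words vector into (word, count) pairs;
# dictionary[term_id] returns the token string for a given integer id
print([(dictionary[term_id], count) for term_id, count in corpus[0]])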
Training Set and Validation Set¶
Then, to split the data set into a training set and a validation set, let’s find the restaurant that has the largest number of reviews.
In [17]:
business = pd.read_csv("data/restaurant.csv")
In [18]:
# choose the business with the max review_count
example_business = business.business_id[business.review_count == max(business.review_count)]
print(max(business.review_count))
896
We use the 896 reviews from the restaurant with the largest number of reviews as the validation set. The remaining reviews form the training set.
In [26]:
# validation set
bool_list_vali = review['business_id'].isin(example_business)
review_example = review[bool_list_vali]
vali_index = review_example.index.tolist()
vali_corpus = [corpus[i] for i in vali_index]
# training set
bool_list_train = [not i for i in bool_list_vali]
review_train = review[bool_list_train]
train_index = review_train.index.tolist()
train_corpus = [corpus[i] for i in train_index]
Latent Dirichlet Allocation Model¶
To discover latent topics in each review, we use Latent Dirichlet Allocation (LDA), a topic model that generates topics based on word frequency from a set of documents. LDA assumes that (1) documents contain multiple latent topics; (2) each document is generated by a generative process defined by a probabilistic model; and (3) each topic is characterized by a distribution over a fixed vocabulary. More specifically, the joint distribution of the hidden topics and observed variables (words) is:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
where $\beta_{1:K}$ are the $K$ topics, each $\beta_k$ a distribution over the vocabulary; $\theta_d$ is the vector of topic proportions for document $d$; $z_{d,n}$ is the topic assignment of the $n$-th word in document $d$; and $w_{d,n}$ is the observed $n$-th word in document $d$.
LDA learns the distributions (i.e. the topic mixture of each document, the word probabilities associated with each topic, and the topic assignment of each word) by Bayesian inference. After repeating the updating process many times, the model reaches a steady state and can be used to estimate the hidden topics, the topic mixture of each document, and the words associated with each topic.
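Concretely (standard LDA background, stated here for completeness), the inference targets the posterior of the hidden variables given the observed words:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}$$
The denominator, the marginal probability of the observed corpus, is intractable to compute exactly, which is why approximate inference is required; gensim's LdaModel implements online variational Bayes.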
We use the LdaModel class in the gensim package to apply the LDA model to our training set.
The LdaModel class is described in detail in the gensim documentation. Parameters used in our example are:
num_topics: required. An LDA model requires the user to specify how many topics should be generated. We tried 6, 8, 10, 12, and 15 as num_topics, and num_topics=10 appears to work best. Thus, we only fit the model with num_topics=10 and save it for further analysis.
id2word: required. The LdaModel class requires our previous dictionary to map ids to strings.
passes: optional. The number of passes the model takes through the corpus. More passes generally make the model more accurate, but many passes can be slow on a very large corpus.
random_state: optional. It acts like a random seed; fixing random_state ensures the result is the same every time we run the model.
In [16]:
# fit lda model
ldamodel = models.ldamodel.LdaModel(train_corpus, num_topics=10, id2word=dictionary, passes=20, random_state=259)
In [17]:
# save the lda model
filepath = 'result/finalized_model_10.sav'
joblib.dump(ldamodel, filepath)
Out[17]:
['result/finalized_model_10.sav']
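When the fitted model is needed again later, it can be reloaded from the same path (a minimal sketch mirroring the dump call above):
# reload the saved LDA model from disk for further analysis
ldamodel = joblib.load('result/finalized_model_10.sav')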
In [21]:
ldamodel.print_topics(num_topics=10, num_words=8)
Out[21]:
[(0,
'0.109*"food" + 0.075*"place" + 0.040*"service" + 0.024*"time" + 0.020*"restaurant" + 0.016*"menu" + 0.013*"staff" + 0.012*"everything"'),
(1,
'0.047*"pizza" + 0.028*"place" + 0.017*"cleveland" + 0.011*"time" + 0.010*"way" + 0.009*"line" + 0.009*"root" + 0.009*"home"'),
(2,
'0.052*"food" + 0.035*"time" + 0.034*"service" + 0.019*"experience" + 0.018*"order" + 0.018*"night" + 0.015*"place" + 0.015*"restaurant"'),
(3,
'0.018*"sauce" + 0.015*"salad" + 0.014*"flavor" + 0.014*"dinner" + 0.013*"pork" + 0.013*"chicken" + 0.013*"meal" + 0.011*"cream"'),
(4,
'0.055*"thai" + 0.037*"sushi" + 0.035*"spicy" + 0.032*"rice" + 0.030*"roll" + 0.025*"tea" + 0.024*"curry" + 0.020*"shrimp"'),
(5,
'0.086*"coffee" + 0.069*"breakfast" + 0.054*"brunch" + 0.027*"bacon" + 0.020*"egg" + 0.019*"toast" + 0.018*"morning" + 0.018*"hash"'),
(6,
'0.066*"beef" + 0.030*"pho" + 0.026*"pork" + 0.025*"soup" + 0.020*"cleveland" + 0.016*"bowl" + 0.014*"pot" + 0.014*"meat"'),
(7,
'0.067*"beer" + 0.053*"place" + 0.047*"bar" + 0.047*"food" + 0.034*"selection" + 0.020*"service" + 0.018*"night" + 0.017*"cleveland"'),
(8,
'0.043*"sandwich" + 0.034*"wife" + 0.028*"bbq" + 0.024*"dog" + 0.020*"meat" + 0.019*"melt" + 0.017*"beef" + 0.017*"chicken"'),
(9,
'0.109*"burger" + 0.051*"hour" + 0.030*"bar" + 0.023*"tacos" + 0.023*"bartender" + 0.022*"spot" + 0.019*"b" + 0.015*"taco"')]
The LDA model finds 10 topics, shown above with the 8 highest-frequency words in each topic. The 10 topics are relatively interpretable. By associating and categorizing the high-frequency words of each topic, we name the topics as follows:
In [24]:
topic_dict = {0: "Service1", 1: "Location", 2: "Service2", 3: "American1",
              4: "Asian1", 5: "Breakfast", 6: "Asian2", 7: "Bar", 8: "American2", 9: "Mexican"}
for keys, values in topic_dict.items():
    print((keys, values))
(0, 'Service1')
(1, 'Location')
(2, 'Service2')
(3, 'American1')
(4, 'Asian1')
(5, 'Breakfast')
(6, 'Asian2')
(7, 'Bar')
(8, 'American2')
(9, 'Mexican')
In [122]:
# save the names of the topics
with shelve.open('result/topic_name') as result:
    result['topic_name'] = topic_dict
Further Interpretation of Topics with a Review¶
Let’s take a look at an original review from the validation set.
In [36]:
print(review_example['text'][35868])
If I didn't have to pay the bill, I'd enjoy this restaurant a lot more.
Yea, yea, I know -- I could say that about any place. But this one seems to fit that statement more than almost any other in Cleveland.
Two burgers -- perfectly cooked and well seasoned with just the right amount of salt and other mouth-watering dashes of spice (plus the bun was nicely seasoned, something that many burger joints neglect). A decent side of fries. About a dozen chicken wings -- Falling off the bone and "Chef-ed up" with lemon juice, scallions, jalapeno and garlic, not simply smothered in a thick reddish-orange sauce.
But why does all of that still have to cost $50? (And those were some of the least expensive items on the menu).
Nevertheless, the drinks were great -- I had two different unique takes on the Old-Fashioned (who would have thought that Curacao works with Bourbon) and Jeannene had a new spin on the French 75 before trying one of the Old-Fashioneds -- and they were well worth another $50.
Coupled with attentive service and a great patio seating on E. 4th Street, I am happy to give it four stars. But the prices seem to promise a 5-star experience.
In [123]:
# save the first raw review of validation set
with shelve.open('result/first_review_vali') as result:
    result['first_review_vali'] = review_example['text'][35868]
Then, let’s see the topics probability associated with this review.
In [89]:
all_topics = []
for doc in vali_corpus:
    # sort the (topic id, probability) pairs by probability, descending
    topics = sorted(ldamodel[doc], key=lambda x: x[1], reverse=True)
    all_topics.append(topics)
all_topics[0]
Out[89]:
[(2, 0.29715302229392604),
(0, 0.24879956111281701),
(9, 0.21169559380609276),
(3, 0.16659952745901788),
(4, 0.057228951701560421)]
In [124]:
# save the topics probability of the first review in validation set
with shelve.open('result/first_review_topics_vali') as result:
    result['first_review_topics_vali'] = all_topics[0]
Topics 0 and 2 are the service topics. Topics 9, 3, and 4 represent Mexican food, American food, and Asian food respectively. The topics look relatively reasonable: the review talks a lot about the service, and the service topics have the highest probability in this review. However, according to the review the restaurant seems to be an American restaurant, yet the Mexican food topic has the highest probability among the three food topics. This shows that the topic model may not describe every aspect of a restaurant perfectly.
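To read this mixture with the topic names assigned earlier (a small illustrative snippet using the topic_dict and all_topics objects defined above):
# label the first validation review's topic mixture with the topic names
print([(topic_dict[topic_id], round(prob, 3)) for topic_id, prob in all_topics[0]])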
Thus, we do not expect high accuracy when predicting customer ratings from the topics, and the next section confirms this concern.
Customer Rating Prediction with Topics¶
In this section, we try to use the topic probabilities found by the LDA model to predict the rating given by the customers. Traditional linear regression is applied here to see whether the topics in a review are strongly associated with the customer rating.
First, we use the probability of each topic as the elements of the design matrix.
In [90]:
def design_matrix_creation(n_topics, topics_prob_list):
    """
    function to get the design matrix for linear regression from the LDA topics
    Input
    ----------
    n_topics: number of LDA topics
    topics_prob_list: a list of tuples with the LDA topics and the corresponding probability in each review
    Output
    ------
    design_matrix: a matrix containing the LDA topic probabilities in each review as an observation
    """
    nrows = len(topics_prob_list)
    design_matrix = np.zeros((nrows, n_topics))
    for i in range(nrows):
        items = topics_prob_list[i]
        for s in items:
            # s is a (topic id, probability) tuple
            topic_prob = s[1]
            design_matrix[i, s[0]] = topic_prob
    return design_matrix
In [91]:
# get the design matrix for the validation set
design_matrix = design_matrix_creation(10, all_topics)
After creating the design matrix, we decide to merge a few topics together, since they reflect similar content. We merge topics 0 and 2, since they both reflect the service of the restaurants. Topics 3 and 8 are merged because they both represent American food. Finally, topics 4 and 6 are merged, since they are both Asian food topics. Thus, the design matrix eventually has 7 features. Our response variable is the customer rating.
In [94]:
# merge the similar topics
design_matrix[:, 0] += design_matrix[:, 2]
design_matrix[:, 3] += design_matrix[:, 8]
design_matrix[:, 4] += design_matrix[:, 6]
design_matrix = design_matrix[:, [0, 1, 3, 4, 5, 7, 9]]
In [95]:
# Create linear regression object
regr = LinearRegression()
# Train the model using the training sets
regr.fit(design_matrix, review_example.stars)
# The mean squared error
mse = np.mean((regr.predict(design_matrix) - review_example.stars) ** 2)
In [125]:
lm_result = {'Coefficients': regr.coef_, 'Intercept': regr.intercept_, "Mean squared error": mse}
# save the linear regression result
with shelve.open('result/lm_result') as result:
    result['lm_result'] = lm_result
In [128]:
for keys, values in lm_result.items():
    print(keys)
    print(values)
Coefficients
[-3.37275057 -2.04183867 -2.21252578 -1.61272055 -1.7564765 -1.5980306
-3.56868465]
Intercept
6.27726471738
Mean squared error
1.6797105653141962
We use MSE (mean squared error) as the accuracy metric, since the original ratings are integers while the predicted ratings are floats. From the above result, we can see that the MSE is quite high, which means that using the topic probabilities to predict customer ratings is not very accurate for our data. Therefore, another supervised learning model, multinomial logistic regression, is applied to the data as well. The analysis and results can be found in another notebook.
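For context (a minimal sketch, not part of the original analysis), the MSE above can be compared with a trivial baseline that always predicts the mean rating of the validation set; that baseline's MSE is simply the variance of the ratings, so the regression is only useful to the extent that it beats this number:
# baseline: always predict the mean star rating;
# its MSE equals the variance of the ratings in the validation set
baseline_mse = np.mean((review_example.stars.mean() - review_example.stars) ** 2)
print(baseline_mse)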
Testing¶
In [115]:
def test_tokenize_text_input():
    """Test that if the input is not a list of strings, an AttributeError is raised"""
    try:
        file = [1, 2, 3, 4]
        tokenize_text(file)
    except AttributeError:
        assert True
    else:
        assert False

def test_tokenize_text_output():
    """Test that the output of tokenize_text is a list containing lists of strings."""
    file = ["Today is a sunny day", "I meet Eli at Evans"]
    obs = tokenize_text(file)
    for text_list in obs:
        obs1 = isinstance(text_list, list)
        exp1 = True
        assert obs1 == exp1
        for text in text_list:
            obs2 = isinstance(text, str)
            exp2 = True
            assert obs2 == exp2

def test_design_matrix_creation_input():
    """Test that when the input n_topics is not an integer, a TypeError is raised"""
    try:
        topic_prob = [[(0, 0.5), (1, 0.5)], [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.35)]]
        m = design_matrix_creation(4.5, topic_prob)
    except TypeError:
        assert True
    else:
        assert False

def test_design_matrix_creation_output():
    """Test that the design matrix has the right dimensions."""
    topic_prob = [[(0, 0.5), (1, 0.5)], [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.35)]]
    m = design_matrix_creation(4, topic_prob)
    obs1 = m.shape[0]
    obs2 = m.shape[1]
    exp1 = 2
    exp2 = 4
    assert obs1 == exp1
    assert obs2 == exp2
In [117]:
test_tokenize_text_input()
test_tokenize_text_output()
test_design_matrix_creation_input()
test_design_matrix_creation_output()