In [1]:

import json
from collections import Counter
import pandas as pd
import numpy as np
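# json_normalize is a top-level pandas function since pandas 1.0
# (older releases: from pandas.io.json import json_normalize)
from pandas import json_normalize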
import matplotlib.pyplot as plt
import seaborn as sns
import shelve

Data Cleaning and Exploratory Data Analysis

The original data is too large to upload to GitHub, so the pre-processing and extraction were run locally with the code below. The original data can be retrieved from the Yelp Dataset. We extracted a subset and saved the data we worked with in the “data” folder in this repository. To replicate the data cleaning and extraction for this project, download the raw JSON data and this notebook, then run the process locally on your computer.

Extract Restaurants Data

In [2]:

# local paths of the JSON files we need
json_business_path='./data/business.json'
json_review_path='./data/review.json'

In [3]:

# extract business information
business_contents=[]
with open(json_business_path) as fin:
    for line in fin:
        line=json.loads(line)
        business_contents.append(line)
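
The loop above parses one JSON object per line. As an alternative sketch (not what this notebook used), pandas can read the same line-delimited layout directly with `read_json(..., lines=True)`. The two records below are made up for illustration and stand in for the real `business.json` file.

```python
import io

import pandas as pd

# Two made-up records in the same one-object-per-line layout as business.json
sample = io.StringIO(
    '{"business_id": "a1", "state": "OH", "categories": ["Restaurants", "Pizza"]}\n'
    '{"business_id": "b2", "state": "AZ", "categories": ["Shopping"]}\n'
)

# lines=True tells pandas to treat every line as a separate JSON document
business_df = pd.read_json(sample, lines=True)
print(business_df[["business_id", "state"]])
```

For the full file, the explicit loop keeps memory use easy to reason about, while `read_json` returns a data frame in one call. Which is preferable depends on how much of the file is needed afterwards.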

As shown below, the Yelp Dataset Challenge data covers businesses in many states and of many types. Among the most common locations and business types, we focus on restaurants in Ohio.

In [4]:

# To limit the scope of the project, we take the state OH (10930 businesses)
state=[]
for line in business_contents:
    state.append(line['state'])
# the 10 most popular locations
Counter(state).most_common(10)

Out[4]:

[('AZ', 47376),
 ('NV', 30571),
 ('ON', 26520),
 ('NC', 11299),
 ('OH', 10930),
 ('PA', 8916),
 ('QC', 7273),
 ('WI', 4190),
 ('EDH', 3561),
 ('BW', 3071)]

In [5]:

# we focus on Restaurants in OH (4513 businesses)
cat_of_business=[]
for line in business_contents:
    if line['state']=='OH':
        # 'categories' may be missing or None; treat that as no categories
        cat_of_business.extend(line.get('categories') or [])

# The 10 most popular business types; we focus on 'Restaurants'
Counter(cat_of_business).most_common(10)

Out[5]:

[('Restaurants', 4513),
 ('Food', 1916),
 ('Shopping', 1628),
 ('Nightlife', 1180),
 ('Bars', 1083),
 ('Beauty & Spas', 930),
 ('American (Traditional)', 891),
 ('Automotive', 738),
 ('Pizza', 706),
 ('Sandwiches', 677)]

In [4]:

# we have 4513 restaurants in OH to research
OH_Restaurants=[]
for line in business_contents:
    # guard against a missing or None 'categories' field
    if (line['state']=='OH') and ('Restaurants' in (line.get('categories') or [])):
        OH_Restaurants.append(line)
print('we will research %d restaurants in OH'%len(OH_Restaurants))

we will research 4513 restaurants in OH

For convenience and reproducibility, we save the data extracted from the original JSON files as data frames, store them in the ‘data’ folder, and use them for later analysis and statistical modeling. The restaurant data is written to a CSV file called ‘restaurant.csv’; the following steps use only a subset of the restaurant information.

In [5]:

# columns to keep before converting to csv
res_col=['state','city','address','name','business_id','stars','review_count','categories']
restaurants=json_normalize(OH_Restaurants)[res_col]

In [6]:

# a glimpse of the restaurant
restaurants.head()

Out[6]:

	state	city	address	name	business_id	stars	review_count	categories
0	OH	Painesville	1 S State St	Sidewalk Cafe Painesville	Bl7Y-ATTzXytQnCceg5k6w	3.0	26	[American (Traditional), Breakfast & Brunch, R...
1	OH	Northfield	10430 Northfield Rd	Zeppe's Pizzeria	7HFRdxVttyY9GiMpywhhYw	3.0	7	[Pizza, Caterers, Italian, Wraps, Event Planni...
2	OH	Mentor	9209 Mentor Ave	Firehouse Subs	lXcxSdPa2m__LqhsaL9t9A	3.5	9	[Restaurants, Sandwiches, Delis, Fast Food]
3	OH	Cleveland	13181 Cedar Rd	Richie Chan's Chinese Restaurant	Pawavw9U8rjxWVPU-RB7LA	3.5	22	[Chinese, Restaurants]
4	OH	Northfield	134 E Aurora Rd	Romeo's Pizza	RzVHK8Jfcy8RvXjn_z3OBw	4.0	4	[Restaurants, Pizza]

In [9]:

restaurants.shape

Out[9]:

(4513, 8)

Extract Reviews for Restaurants

After extracting the restaurant information, we extract the reviews for those restaurants. Because the review file is large, we keep only the reviews that belong to restaurants in our sample as we read the JSON file.

In [35]:

# extract review information
# build the ID set once so each membership test is a fast hash lookup
restaurant_ids=set(restaurants.business_id)
review_contents=[]
with open(json_review_path) as fin:
    for line in fin:
        line=json.loads(line)
        if line['business_id'] in restaurant_ids:
            review_contents.append(line)

In [66]:

# save all the reviews for restaurants in Ohio. It's around 600 MB, so
# we just store them locally
with shelve.open('./result/ohio_all_reviews') as db:
    db['reviews'] = review_contents
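
To reuse the stored reviews in a later session, the shelf can be reopened and read by key. The following is a minimal sketch that uses a temporary directory and a tiny made-up review list in place of the real `./result/ohio_all_reviews` file; the names `sample_reviews` and `loaded_reviews` are illustrative only.

```python
import os
import shelve
import tempfile

# Made-up stand-in for review_contents
sample_reviews = [{"business_id": "a1", "stars": 4, "text": "Great pizza"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    shelf_path = os.path.join(tmp_dir, "ohio_all_reviews")

    # Write the list under the same key the notebook uses
    with shelve.open(shelf_path) as db:
        db["reviews"] = sample_reviews

    # Reopen read-only and load the list back
    with shelve.open(shelf_path, flag="r") as db:
        loaded_reviews = db["reviews"]

print(len(loaded_reviews))
```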

In [178]:

# total 154764 reviews on 4513 restaurants
len(review_contents)

Out[178]:

154764

In [11]:

# convert into data frame
reviews=json_normalize(review_contents)

We’d like to remove restaurants with too few reviews, so we keep only restaurants that have more than 100 reviews. The summary statistics below show the number of reviews per restaurant.

In [12]:

reviews.business_id.value_counts().describe()

Out[12]:

count    4513.000000
mean       34.292932
std        57.777458
min         3.000000
25%         6.000000
50%        15.000000
75%        39.000000
max       896.000000
Name: business_id, dtype: float64

In [13]:

# only keep the restaurants which have more than 100 reviews
review_counts=reviews.business_id.value_counts()
keep_busID=review_counts.index[review_counts>100]
reviews_keep=reviews.loc[reviews.business_id.isin(keep_busID),:]

In [50]:

# we can find the summary statistics for number of reviews for restaurants
# after we exclude some restaurants with few reviews.
reviews_keep.business_id.value_counts().describe()

Out[50]:

count    316.000000
mean     190.575949
std      122.290224
min      101.000000
25%      117.750000
50%      149.000000
75%      206.250000
max      896.000000
Name: business_id, dtype: float64
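
The threshold filter above can also be written with a grouped transform, which attaches each restaurant's review count to its own rows. The sketch below uses a made-up four-row data frame and a threshold of 2 in place of the notebook's 100; `toy_reviews`, `min_reviews`, and `toy_keep` are illustrative names.

```python
import pandas as pd

# Made-up reviews: restaurant "a" has 3 reviews, "b" has 1
toy_reviews = pd.DataFrame({
    "business_id": ["a", "a", "a", "b"],
    "stars": [5, 4, 3, 2],
})

min_reviews = 2  # stands in for the notebook's threshold of 100

# Attach each restaurant's review count to every one of its rows
n_reviews = toy_reviews.groupby("business_id")["stars"].transform("size")

# Keep rows whose restaurant has strictly more than min_reviews reviews
toy_keep = toy_reviews[n_reviews > min_reviews]
print(toy_keep)
```

Both forms use a strict comparison, so a restaurant with exactly the threshold number of reviews is dropped.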

Only 316 restaurants remain after we exclude those with 100 or fewer reviews. As the following plot shows, the number of reviews per restaurant ranges from 101 to 896 after filtering.

In [176]:

# plot the restaurant on x axis and number of reviews on y axis
reviews_keep.business_id.value_counts().plot()
plt.xlabel('Restaurants')
plt.ylabel('Number of Reviews')

n_obs=len(reviews_keep.business_id.unique())
sns.rugplot(np.arange(0,n_obs,1))

plt.tick_params(axis='x',labelbottom=False)
plt.title('Number of Reviews Per Restaurant')

plt.savefig('./fig/NumberOfReviewsPerRestaurant.png')
plt.show()

As the following plot shows, the distribution of mean star ratings is skewed to the left, with most restaurants having a mean rating between 3 and 4. Few restaurants have a mean rating below 2.

In [177]:

# the mean star rating for restaurants
# select 'stars' before averaging so non-numeric columns are not aggregated
sns.distplot(reviews_keep.groupby('business_id')['stars'].mean(),rug=True)
plt.xlabel('Mean Star Ratings')
plt.title('Mean Star Ratings for Restaurants')

plt.savefig('./fig/MeanRatings.png')
plt.show()
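
The direction of the skew can also be checked numerically: a negative sample skewness corresponds to a longer left tail. The sketch below runs on made-up ratings rather than the real reviews, so its value says nothing about the actual data; it only illustrates the check.

```python
import pandas as pd

# Made-up reviews for four restaurants
toy_reviews = pd.DataFrame({
    "business_id": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "stars": [4, 4, 4, 3, 4, 4, 1, 2],
})

# Mean star rating per restaurant: a=4.0, b=3.5, c=4.0, d=1.5
mean_stars = toy_reviews.groupby("business_id")["stars"].mean()

# Negative skewness means the tail extends toward low ratings
skewness = mean_stars.skew()
print(skewness)
```

On the real data, the same call would be applied to the per-restaurant means of `reviews_keep`.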

In [57]:

# we save the restaurants data and reviews data into csv for future analysis
reviews_keep.to_csv('./data/reviews.csv')
restaurants.to_csv('./data/restaurant.csv')
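
One detail to keep in mind when reading these files back: CSV stores the list-valued `categories` column as plain text. A minimal sketch of restoring the lists with `ast.literal_eval`, using a made-up one-row data frame and an in-memory buffer in place of `restaurant.csv`:

```python
import ast
import io

import pandas as pd

# Made-up stand-in for the restaurants data frame
toy_restaurants = pd.DataFrame({
    "business_id": ["a1"],
    "categories": [["Restaurants", "Pizza"]],
})

# Write to an in-memory buffer instead of ./data/restaurant.csv
buffer = io.StringIO()
toy_restaurants.to_csv(buffer, index=False)
buffer.seek(0)

# After the round trip the list is text such as "['Restaurants', 'Pizza']"
restored = pd.read_csv(buffer)
restored["categories"] = restored["categories"].apply(ast.literal_eval)
print(restored.loc[0, "categories"])
```

The sketch passes `index=False`, which the notebook's `to_csv` calls do not; without it, the saved files carry the row index as an extra unnamed column.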
