{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json\n", "from collections import Counter\n", "import pandas as pd\n", "import numpy as np\n", "from pandas.io.json import json_normalize\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import shelve" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Cleaning and Exploratory Data Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The original data is too large to be uploaded onto GitHub. Thus, the pre-processing and extracting of data were conducted locally with the following code. The original data can be retrived from [Yelp Dataset](https://www.yelp.com/dataset/challenge). We extract a subset from it and save the data we worked with in the \"data\" folder in this repository. If you’d like to replicate the data cleaning and extraction process for this project, please download the raw JSON data, this notebook and run the process locally from your computer. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract Restaurants Data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# the local path of JSON files that needed\n", "json_business_path='./data/business.json'\n", "json_review_path='./data/review.json'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# extract business information\n", "business_contents=[]\n", "with open(json_business_path) as fin:\n", " for line in fin:\n", " line=json.loads(line)\n", " business_contents.append(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see below, the Yelp Dataset Challenge contains business information from different states and different business types. Among the most popular locations and business types, we focus on restaurants at Ohio.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('AZ', 47376),\n", " ('NV', 30571),\n", " ('ON', 26520),\n", " ('NC', 11299),\n", " ('OH', 10930),\n", " ('PA', 8916),\n", " ('QC', 7273),\n", " ('WI', 4190),\n", " ('EDH', 3561),\n", " ('BW', 3071)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# In order to control the scope the project, we take OH state, 10930 businesses\n", "state=[]\n", "for line in business_contents:\n", " state.append(line['state']) \n", "# the 10 most popular locations\n", "Counter(state).most_common(10)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Restaurants', 4513),\n", " ('Food', 1916),\n", " ('Shopping', 1628),\n", " ('Nightlife', 1180),\n", " ('Bars', 1083),\n", " ('Beauty & Spas', 930),\n", " ('American (Traditional)', 891),\n", " ('Automotive', 738),\n", " ('Pizza', 706),\n", " ('Sandwiches', 677)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we focus on Restaurants in OH, around 4513 businesses\n", "cat_of_business=[]\n", "for line in business_contents:\n", " if line['state']=='OH':\n", " try:\n", " cat_of_business.extend(line['categories'])\n", " except:\n", " next \n", " \n", "# The 10 most popular business types,focus on 'Restaurants'\n", "Counter(cat_of_business).most_common(10)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "we will research 4513 restaurants in OH\n" ] } ], "source": [ "# we have 4513 restaurants in OH to be reseaerched\n", "OH_Restaurants=[]\n", "for line in business_contents:\n", " if (line['state']=='OH') and ('Restaurants' in line['categories']):\n", " OH_Restaurants.append(line)\n", "print('we will research %d restaurants in OH'%len(OH_Restaurants))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the convenience of research and reproducibility, we will save the data that extracted from the original JSON data into data frames, put them in ‘data’ folder and use them for future analysis and statistical modeling. Note that we convert the restaurant data into csv file called ‘restaurant.csv’. We will subset it and use part of all restaurant information in the following steps." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# we need to keep those columns and convert to csv\n", "res_col=['state','city','address','name','business_id','stars','review_count','categories']\n", "restaurants=json_normalize(OH_Restaurants)[res_col]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | state | \n", "city | \n", "address | \n", "name | \n", "business_id | \n", "stars | \n", "review_count | \n", "categories | \n", "
---|---|---|---|---|---|---|---|---|
0 | \n", "OH | \n", "Painesville | \n", "1 S State St | \n", "Sidewalk Cafe Painesville | \n", "Bl7Y-ATTzXytQnCceg5k6w | \n", "3.0 | \n", "26 | \n", "[American (Traditional), Breakfast & Brunch, R... | \n", "
1 | \n", "OH | \n", "Northfield | \n", "10430 Northfield Rd | \n", "Zeppe's Pizzeria | \n", "7HFRdxVttyY9GiMpywhhYw | \n", "3.0 | \n", "7 | \n", "[Pizza, Caterers, Italian, Wraps, Event Planni... | \n", "
2 | \n", "OH | \n", "Mentor | \n", "9209 Mentor Ave | \n", "Firehouse Subs | \n", "lXcxSdPa2m__LqhsaL9t9A | \n", "3.5 | \n", "9 | \n", "[Restaurants, Sandwiches, Delis, Fast Food] | \n", "
3 | \n", "OH | \n", "Cleveland | \n", "13181 Cedar Rd | \n", "Richie Chan's Chinese Restaurant | \n", "Pawavw9U8rjxWVPU-RB7LA | \n", "3.5 | \n", "22 | \n", "[Chinese, Restaurants] | \n", "
4 | \n", "OH | \n", "Northfield | \n", "134 E Aurora Rd | \n", "Romeo's Pizza | \n", "RzVHK8Jfcy8RvXjn_z3OBw | \n", "4.0 | \n", "4 | \n", "[Restaurants, Pizza] | \n", "