Restaurant Ratings Group Project

1. Overview of Dataset

We tested several hypotheses about restaurants in major US cities. We used Yelp’s Challenge Data, an open dataset with the following characteristics:

4.1M reviews and 947K tips by 1M users for 144K businesses
1.1M business attributes, e.g., hours, parking availability, ambience.
Aggregated check-ins over time for each of the 125K businesses
200,000 pictures from the included businesses

2. Key Business Questions and Data Analysis Process

We sought primarily to answer the following questions:

What does the distribution of ratings and checkins look like for Yelp-listed restaurants in major US metropolitan areas?
To what extent can restaurant location (using population and median income of the zip code in which it is located as proxies) explain the variance in restaurant ratings and/or checkins?
Are there any other meaningful relationships or inferences we can draw between restaurant ratings and restaurant location, customer propensity to review the restaurant, or customer propensity to check in at the restaurant?

We divided the data into training and test sets so that we could build a predictive model for restaurant ratings, checkins, or both on the training set and then evaluate it on the test set.

We ran ordered logistic regressions and linear regressions on this data to determine the predictive power of various restaurant features on the outcomes of interest, namely stars and checkins.
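As a rough illustration of the two model types named above, the following sketch fits an ordered logistic regression with `polr` from the MASS package and a linear regression with `lm`. It runs on simulated data: the data frame `toy`, the coefficients, and the cut points are invented for illustration, and only the column names `stars`, `checkin_count`, `median_income`, and `population` are taken from this project.

```r
library(MASS)  # provides polr() for ordered logistic regression

set.seed(42)
n <- 500

# Simulated stand-ins for the scaled zip-code predictors (assumption)
toy <- data.frame(median_income = rnorm(n), population = rnorm(n))

# Simulated star rating: an ordered factor cut from a latent logistic variable
latent <- 0.5 * toy$median_income - 0.3 * toy$population + rlogis(n)
toy$stars <- cut(latent,
                 breaks = c(-Inf, -1, 0, 1, Inf),
                 labels = c("2", "3", "4", "5"),
                 ordered_result = TRUE)

# Simulated checkin counts
toy$checkin_count <- rpois(n, lambda = exp(3 + 0.2 * toy$median_income))

# Ordered logistic regression: the response must be an ordered factor
ord_fit <- polr(stars ~ median_income + population, data = toy, Hess = TRUE)

# Linear regression for the checkin counts
lin_fit <- lm(checkin_count ~ median_income + population, data = toy)

print(coef(ord_fit))
print(coef(lin_fit))
```

With real data, `stars` would first need converting to an ordered factor, for example with `factor(stars, ordered = TRUE)`.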

3. Data Import, Cleaning, and Analysis

#uncompress data
library(foreign)

#import data and create dataframes 
library(jsonlite)
YelpCheckins <- stream_in(file("C:/Users/Matthew/Documents/R/GroupProjectFinal/yelp_academic_dataset_checkin.json"),pagesize = 500)

## opening file input connection.

 Imported 125532 records. Simplifying...

## closing file input connection.

YelpBusinesses <- stream_in(file("C:/Users/Matthew/Documents/R/GroupProjectFinal/yelp_academic_dataset_business.json"),pagesize = 500)

## opening file input connection.

 Imported 144072 records. Simplifying...

## closing file input connection.

#make each individual checkin its own row
require(dplyr)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

require(tidyr)

## Loading required package: tidyr

library(splitstackshape)

## Loading required package: data.table

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

YelpCheckins_v2 <- YelpCheckins
YelpCheckins_v2 <- cSplit(YelpCheckins_v2, "time", sep = ",", direction = "long")
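The reshaping step above turns one row per business into one row per checkin entry. The base R sketch below shows the same idea on invented data; the toy table `checkins` and its packed `time` strings are assumptions, not the actual Yelp format.

```r
# Toy checkin table: one row per business, entries packed into one string (assumption)
checkins <- data.frame(
  business_id = c("a", "b"),
  time = c("Fri-0:2,Sat-1:1,Sun-3:4", "Mon-12:1"),
  stringsAsFactors = FALSE
)

# Split each packed string into its individual entries
pieces <- strsplit(checkins$time, ",", fixed = TRUE)

# Repeat each business_id once per entry and flatten the entries
checkins_long <- data.frame(
  business_id = rep(checkins$business_id, lengths(pieces)),
  time = unlist(pieces),
  stringsAsFactors = FALSE
)

print(checkins_long)
```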

require(RCurl)

## Loading required package: RCurl

## Loading required package: bitops

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

require(xlsx)

## Loading required package: xlsx

## Loading required package: rJava

require(readxl)

## Loading required package: readxl

urlfile <-'http://www.psc.isr.umich.edu/dis/census/Features/tract2zip/MedianZIP-3.xlsx'
destfile <- "census20062010.xlsx"
download.file(urlfile, destfile, mode="wb")
census <- read_excel(destfile, sheet = "Median")

# clean up data
names(census) <- c('postal_code','median_income','population')
census$median_income <- as.character(census$median_income)
census$median_income <- as.numeric(gsub(',','',census$median_income))
print(head(census,5))

## # A tibble: 5 × 3
##   postal_code median_income population
##         <dbl>         <dbl>      <dbl>
## 1        1001      56662.57      16445
## 2        1002      49853.42      28069
## 3        1003      28462.00       8491
## 4        1005      75423.00       4798
## 5        1007      79076.35      12962
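The printed `postal_code` values are numeric, so ZIP codes that begin with a zero appear as `1001` rather than `01001`. If the Yelp `postal_code` column is stored as text (an assumption here), such codes would not match in a later merge. The sketch below pads the codes back to five characters on a small invented table, `census_toy`, whose income strings are written with thousands separators to mimic the cleaning step above.

```r
# Toy copy of a few census rows; the comma-formatted strings are an assumption
census_toy <- data.frame(
  postal_code = c(1001, 1002, 1003),
  median_income = c("56,662.57", "49,853.42", "28,462.00"),
  stringsAsFactors = FALSE
)

# Remove thousands separators before converting to numeric
census_toy$median_income <- as.numeric(gsub(",", "", census_toy$median_income, fixed = TRUE))

# Left-pad the ZIP codes with zeros to five characters
census_toy$postal_code <- sprintf("%05d", as.integer(census_toy$postal_code))

print(census_toy)
```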

#strip out businesses that aren't restaurants, have 25 or fewer reviews, or are no longer active
require(plyr)

## Loading required package: plyr

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: readr
## Loading tidyverse: purrr

## Conflicts with tidy packages ----------------------------------------------

## arrange():   dplyr, plyr
## between():   dplyr, data.table
## compact():   purrr, plyr
## complete():  tidyr, RCurl
## count():     dplyr, plyr
## failwith():  dplyr, plyr
## filter():    dplyr, stats
## first():     dplyr, data.table
## id():        dplyr, plyr
## lag():       dplyr, stats
## last():      dplyr, data.table
## mutate():    dplyr, plyr
## rename():    dplyr, plyr
## summarise(): dplyr, plyr
## summarize(): dplyr, plyr
## transpose(): purrr, data.table

library(stringr)

Restaurants_v1 <- subset(YelpBusinesses,is_open == 1)
Restaurants_v2 <- filter(Restaurants_v1 , grepl('Restaurants',categories))
Restaurants_v3 <- subset(Restaurants_v2,review_count > 25)

#merge checkins data with restaurants data, matching on business_id (restaurants located outside the US are removed further down via the postal code filter)
Restaurants_v4 <- merge(x=Restaurants_v3, y=YelpCheckins_v2, by.x="business_id", by.y="business_id", all=TRUE)
Restaurants_v5 <- subset(Restaurants_v4,is_open == 1)

Restaurants_v5$business_id <- as.character(Restaurants_v5$business_id)

Restaurants_v5$checkin_count <- as.numeric(ave(Restaurants_v5$business_id, Restaurants_v5$business_id, FUN = length))
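The count above is the number of rows per `business_id`. One thing worth checking is that after a full outer merge a restaurant with no checkin rows still occupies one row, so a row count would give it a count of 1. The sketch below, on an invented table `visits`, counts only the non-missing `time` entries instead.

```r
# Toy merged table: business "c" has no checkins, so its time is NA (assumption)
visits <- data.frame(
  business_id = c("a", "a", "a", "b", "c"),
  time = c("t1", "t2", "t3", "t4", NA),
  stringsAsFactors = FALSE
)

# Row count per business: "c" is counted once despite having no checkins
visits$row_count <- ave(seq_along(visits$business_id), visits$business_id, FUN = length)

# Count of non-missing time entries per business: "c" gets 0
visits$checkin_count <- ave(as.integer(!is.na(visits$time)), visits$business_id, FUN = sum)

# Keep one row per business
per_business <- visits[!duplicated(visits$business_id), c("business_id", "row_count", "checkin_count")]
print(per_business)
```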

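#keep one row per restaurant and drop column 17 (apparently the per-checkin time column)
Restaurants_v6 <- subset(Restaurants_v5,!duplicated(business_id),-c(17))
#keep only 5-character postal codes, which removes restaurants located outside the US
Restaurants_v7 <- subset(Restaurants_v6,nchar(postal_code)==5)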

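#attach zip-code median income and population; the stray scale(population) argument was never evaluated and is dropped
Restaurants_v8 <- merge(x=Restaurants_v7,y=census,by.x="postal_code",by.y = "postal_code")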

ScaledMedianIncome <- scale(Restaurants_v8$median_income)
ScaledPopulation <- scale(Restaurants_v8$population)
Restaurants_v8$median_income <- ScaledMedianIncome
Restaurants_v8$population <- ScaledPopulation
Restaurants_final <- Restaurants_v8
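A detail of `scale()` that can be useful here: it returns a one-column matrix carrying centering and scaling attributes, so assigning its result straight into a data frame creates a matrix column. The sketch below, on an invented data frame `df`, wraps the call in `as.numeric()` to keep plain numeric vectors.

```r
# Toy predictors (invented values)
df <- data.frame(
  median_income = c(30000, 50000, 70000),
  population = c(5000, 15000, 25000)
)

# scale() centers to mean 0 and divides by the standard deviation;
# as.numeric() drops the matrix shape and attributes
df$median_income <- as.numeric(scale(df$median_income))
df$population <- as.numeric(scale(df$population))

print(df)
```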

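#Divides restaurant dataset into training set and test set
#(the test set is a random sample of 11431/2 rows; no seed is set, so the sample differs on every run)
RestaurantsTestSet <- Restaurants_final[sample(nrow(Restaurants_final), 11431/2), ]
#full outer merge against the test set; the filter is intended to keep the rows with no match in the test set
Restaurants_New <- merge(x=Restaurants_final,y=RestaurantsTestSet, by.x = "business_id", by.y = "business_id",all=TRUE)
RestaurantsTrainingSet <- subset(Restaurants_New,attributes.y == "NA")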
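An alternative way to make the same kind of split, sketched here on invented data, is to sample row indices once and use them to select both halves. The seed value, the toy data frame `restaurants`, and the names `test_idx`, `test_set`, and `training_set` are assumptions for illustration. This keeps the original column names (no `.x` and `.y` suffixes) and makes the split reproducible.

```r
set.seed(123)  # arbitrary seed so the split can be reproduced (assumption)

# Toy stand-in for the final restaurant table
restaurants <- data.frame(
  business_id = sprintf("b%02d", 1:11),
  stars = seq(2.5, 5, length.out = 11),
  stringsAsFactors = FALSE
)

# Sample half of the row indices (rounded down) for the test set
n_test <- floor(nrow(restaurants) / 2)
test_idx <- sample(nrow(restaurants), n_test)

# Rows in the sample form the test set; all remaining rows form the training set
test_set <- restaurants[test_idx, ]
training_set <- restaurants[-test_idx, ]

print(c(test = nrow(test_set), training = nrow(training_set)))
```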

#Checks for equivalence of training set and test set 
mean(RestaurantsTestSet$stars,na.rm=TRUE)

## [1] 3.621522

mean(RestaurantsTrainingSet$stars.x,na.rm=TRUE)

## [1] 3.6236

sd(RestaurantsTestSet$stars,na.rm=TRUE)

## [1] 0.6209718

sd(RestaurantsTrainingSet$stars.x,na.rm=TRUE)

## [1] 0.6163316

mean(RestaurantsTestSet$stars,na.rm=TRUE) - mean(RestaurantsTrainingSet$stars.x,na.rm=TRUE)

## [1] -0.00207811

sd(RestaurantsTestSet$stars,na.rm=TRUE) - sd(RestaurantsTrainingSet$stars.x,na.rm=TRUE)

## [1] 0.004640246

mean(RestaurantsTestSet$checkin_count,na.rm=TRUE)

## [1] 71.68224

mean(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)

## [1] 72.0007

sd(RestaurantsTestSet$checkin_count,na.rm=TRUE)

## [1] 30.93848

sd(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)

## [1] 30.84472

mean(RestaurantsTestSet$checkin_count,na.rm=TRUE) - mean(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)

## [1] -0.3184601

sd(RestaurantsTestSet$checkin_count,na.rm=TRUE) - sd(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)

## [1] 0.09376093

mean(RestaurantsTestSet$review_count,na.rm=TRUE)

## [1] 149.1207

mean(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)

## [1] 146.3754

sd(RestaurantsTestSet$review_count,na.rm=TRUE)

## [1] 257.6816

sd(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)

## [1] 227.1995

mean(RestaurantsTestSet$review_count,na.rm=TRUE) - mean(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)

## [1] 2.745298

sd(RestaurantsTestSet$review_count,na.rm=TRUE) - sd(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)

## [1] 30.48209

sapply(RestaurantsTestSet,mean, na.rm=TRUE)

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

##   postal_code   business_id          name  neighborhood       address 
##            NA            NA            NA            NA            NA 
##          city         state      latitude     longitude         stars 
##            NA            NA  3.647610e+01 -1.011885e+02  3.621522e+00 
##  review_count       is_open    attributes    categories         hours 
##  1.491207e+02  1.000000e+00            NA            NA            NA 
##        type.x        type.y checkin_count median_income    population 
##            NA            NA  7.168224e+01 -1.965074e-03 -2.510127e-03

sapply(RestaurantsTrainingSet,mean,na.rm=TRUE)

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

##     business_id   postal_code.x          name.x  neighborhood.x 
##              NA              NA              NA              NA 
##       address.x          city.x         state.x      latitude.x 
##              NA              NA              NA    3.643328e+01 
##     longitude.x         stars.x  review_count.x       is_open.x 
##   -1.015357e+02    3.623600e+00    1.463754e+02    1.000000e+00 
##    attributes.x    categories.x         hours.x        type.x.x 
##              NA              NA              NA              NA 
##        type.y.x checkin_count.x median_income.x    population.x 
##              NA    7.200070e+01    1.964731e-03    2.509688e-03 
##   postal_code.y          name.y  neighborhood.y       address.y 
##              NA              NA              NA              NA 
##          city.y         state.y      latitude.y     longitude.y 
##              NA              NA             NaN             NaN 
##         stars.y  review_count.y       is_open.y    attributes.y 
##             NaN             NaN             NaN              NA 
##    categories.y         hours.y        type.x.y        type.y.y 
##              NA              NA              NA              NA 
## checkin_count.y median_income.y    population.y 
##             NaN             NaN             NaN
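The warnings above arise because `mean` is applied to non-numeric columns as well. A possible way to compare the two sets without them, sketched on invented data, is to restrict the calculation to numeric columns and place the results side by side. The helper `numeric_means` and both toy data frames are hypothetical.

```r
# Toy test and training sets (invented values)
test_set <- data.frame(name = c("A", "B"), stars = c(3.5, 4.0),
                       review_count = c(30, 50), stringsAsFactors = FALSE)
training_set <- data.frame(name = c("C", "D"), stars = c(3.0, 4.5),
                           review_count = c(40, 80), stringsAsFactors = FALSE)

# Hypothetical helper: column means for the numeric columns only
numeric_means <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  vapply(df[num_cols], mean, numeric(1), na.rm = TRUE)
}

# One row per numeric column, with the test-minus-training difference
comparison <- cbind(test = numeric_means(test_set),
                    training = numeric_means(training_set))
comparison <- cbind(comparison,
                    difference = comparison[, "test"] - comparison[, "training"])

print(comparison)
```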

Restaurants_v8 <- merge(x=Restaurants_v7,y=census,by.x="postal_code",by.y = "postal_code")

ScaledMedianIncome <- scale(Restaurants_v8$median_income)
ScaledPopulation <- scale(Restaurants_v8$population)
Restaurants_v8$median_income <- ScaledMedianIncome
Restaurants_v8$population <- ScaledPopulation
Restaurants_final <- Restaurants_v8


#Divides restaurant dataset into training set and test set 
RestaurantsTestSet <- Restaurants_final[sample(nrow(Restaurants_final), 11431/2), ]
Restaurants_New <- merge(x=Restaurants_final,y=RestaurantsTestSet, by.x = "business_id", by.y = "business_id",all=TRUE)
RestaurantsTrainingSet <- subset(Restaurants_New,attributes.y == "NA")

#Checks for equivalence of training set and test set 
mean(RestaurantsTestSet$stars,na.rm=TRUE)

## [1] 3.629396

mean(RestaurantsTrainingSet$stars.x,na.rm=TRUE)

## [1] 3.615728

sd(RestaurantsTestSet$stars,na.rm=TRUE)

## [1] 0.6088344

sd(RestaurantsTrainingSet$stars.x,na.rm=TRUE)

## [1] 0.6282495

mean(RestaurantsTestSet$stars,na.rm=TRUE) - mean(RestaurantsTrainingSet$stars.x,na.rm=TRUE)

## [1] 0.01366854

sd(RestaurantsTestSet$stars,na.rm=TRUE) - sd(RestaurantsTrainingSet$stars.x,na.rm=TRUE)

## [1] -0.01941517

mean(RestaurantsTestSet$checkin_count,na.rm=TRUE)

## [1] 71.58128

mean(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)

## [1] 72.10164

sd(RestaurantsTestSet$checkin_count,na.rm=TRUE)

## [1] 31.05963

sd(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)

## [1] 30.72136

mean(RestaurantsTestSet$checkin_count,na.rm=TRUE) - mean(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)

## [1] -0.5203672

sd(RestaurantsTestSet$checkin_count,na.rm=TRUE) - sd(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)

## [1] 0.3382641

mean(RestaurantsTestSet$review_count,na.rm=TRUE)

## [1] 146.2483

mean(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)

## [1] 149.2474

sd(RestaurantsTestSet$review_count,na.rm=TRUE)

## [1] 249.672

sd(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)

## [1] 235.9701

mean(RestaurantsTestSet$review_count,na.rm=TRUE) - mean(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)

## [1] -2.999082

sd(RestaurantsTestSet$review_count,na.rm=TRUE) - sd(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)

## [1] 13.70188

sapply(RestaurantsTestSet,mean, na.rm=TRUE)

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

##   postal_code   business_id          name  neighborhood       address 
##            NA            NA            NA            NA            NA 
##          city         state      latitude     longitude         stars 
##            NA            NA  3.650779e+01 -1.011359e+02  3.629396e+00 
##  review_count       is_open    attributes    categories         hours 
##  1.462483e+02  1.000000e+00            NA            NA            NA 
##        type.x        type.y checkin_count median_income    population 
##            NA            NA  7.158128e+01  1.401877e-02 -3.197292e-03

sapply(RestaurantsTrainingSet,mean,na.rm=TRUE)

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

##     business_id   postal_code.x          name.x  neighborhood.x 
##              NA              NA              NA              NA 
##       address.x          city.x         state.x      latitude.x 
##              NA              NA              NA    3.640160e+01 
##     longitude.x         stars.x  review_count.x       is_open.x 
##   -1.015882e+02    3.615728e+00    1.492474e+02    1.000000e+00 
##    attributes.x    categories.x         hours.x        type.x.x 
##              NA              NA              NA              NA 
##        type.y.x checkin_count.x median_income.x    population.x 
##              NA    7.210164e+01   -1.401632e-02    3.196732e-03 
##   postal_code.y          name.y  neighborhood.y       address.y 
##              NA              NA              NA              NA 
##          city.y         state.y      latitude.y     longitude.y 
##              NA              NA             NaN             NaN 
##         stars.y  review_count.y       is_open.y    attributes.y 
##             NaN             NaN             NaN              NA 
##    categories.y         hours.y        type.x.y        type.y.y 
##              NA              NA              NA              NA 
## checkin_count.y median_income.y    population.y 
##             NaN             NaN             NaN
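
The warnings above arise because `mean` is applied to every column, including the character and list columns. A minimal sketch of one way to avoid them is to restrict the call to numeric columns first. The `toy` data frame below is a hypothetical stand-in for `RestaurantsTrainingSet`, not the project data.

```r
# Hypothetical stand-in for RestaurantsTrainingSet (not the project data)
toy <- data.frame(
  name         = c("A", "B", "C"),
  stars        = c(3.5, 4.0, NA),
  review_count = c(10, 250, 40),
  stringsAsFactors = FALSE
)

# Flag the numeric columns, then average only those, ignoring missing values
numeric_cols  <- vapply(toy, is.numeric, logical(1))
numeric_means <- sapply(toy[numeric_cols], mean, na.rm = TRUE)
print(numeric_means)
```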

#require packages for logistic regression
require(foreign)
require(nnet)

## Loading required package: nnet

require(ggplot2)
require(reshape2)

## Loading required package: reshape2

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

## The following object is masked from 'package:tidyr':
## 
##     smiths

require(MASS)

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

require(ResourceSelection)

## Loading required package: ResourceSelection

## ResourceSelection 0.3-0   2016-11-04

#run ordered logistic regressions to assess the predictive power of checkins, number of ratings, median income, and population on restaurant ratings
Stars_ReviewCount_LogitModel <- polr(as.factor(stars.x) ~ review_count.x, data=RestaurantsTrainingSet, Hess = TRUE)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Stars_Checkin_LogitModel <- polr(as.factor(stars.x) ~ checkin_count.x, data=RestaurantsTrainingSet, Hess = TRUE)
Stars_MedianIncome_LogitModel <- polr(as.factor(stars.x) ~ median_income.x, data=RestaurantsTrainingSet, Hess = TRUE)
Stars_Population_LogitModel <- polr(as.factor(stars.x) ~ population.x, data=RestaurantsTrainingSet, Hess = TRUE)

#run linear regressions to assess the predictive power of checkins, number of ratings, population, and median income on restaurant ratings 
Stars_ReviewCount_LinearModel <- glm((stars.x) ~ review_count.x, data=RestaurantsTrainingSet)
Stars_Checkin_LinearModel <- glm((stars.x) ~ checkin_count.x, data=RestaurantsTrainingSet)
Stars_Income_LinearModel <- glm((stars.x) ~ median_income.x, data=RestaurantsTrainingSet)
Stars_Population_LinearModel <- glm((stars.x) ~ population.x, data=RestaurantsTrainingSet)

#run linear regressions to assess the predictive power of stars, number of ratings, population, and income on restaurant checkins 
Checkins_ReviewCount_LinearModel <- glm((checkin_count.x) ~ review_count.x, data=RestaurantsTrainingSet)
Checkins_Stars_LinearModel <- glm((checkin_count.x) ~ stars.x, data=RestaurantsTrainingSet)
Checkins_Income_LinearModel <- glm((checkin_count.x) ~ median_income.x, data=RestaurantsTrainingSet)
Checkins_Population_LinearModel <- glm((checkin_count.x) ~ population.x, data=RestaurantsTrainingSet)
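
Because no `family` argument is given, `glm` uses its default gaussian family with an identity link, which amounts to ordinary least squares. The sketch below uses simulated data (the values are illustrative assumptions, not the project data) to show that `glm` and `lm` then return the same coefficients.

```r
set.seed(1)
# Simulated stand-in data; values are illustrative only
toy <- data.frame(review_count = rpois(200, 150))
toy$stars <- 3.5 + 0.0005 * toy$review_count + rnorm(200, sd = 0.6)

glm_fit <- glm(stars ~ review_count, data = toy)  # default family = gaussian
lm_fit  <- lm(stars ~ review_count, data = toy)

# TRUE when the two fits agree up to numerical tolerance
same_coefficients <- isTRUE(all.equal(coef(glm_fit), coef(lm_fit)))
print(same_coefficients)
```

One practical difference is that `summary()` on an `lm` fit reports R-squared directly, whereas the `glm` summary reports deviances.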

#Produce summaries
summary(Stars_ReviewCount_LogitModel)

## Call:
## polr(formula = as.factor(stars.x) ~ review_count.x, data = RestaurantsTrainingSet, 
##     Hess = TRUE)
## 
## Coefficients:
##                   Value Std. Error t value
## review_count.x 0.001661  0.0001215   13.67
## 
## Intercepts:
##       Value    Std. Error t value 
## 1|1.5  -7.7582   0.7018   -11.0547
## 1.5|2  -5.0465   0.1835   -27.5051
## 2|2.5  -3.4201   0.0842   -40.6202
## 2.5|3  -2.0526   0.0475   -43.1728
## 3|3.5  -0.8572   0.0343   -24.9695
## 3.5|4   0.4056   0.0324    12.5234
## 4|4.5   2.1285   0.0450    47.2916
## 4.5|5   5.0761   0.1463    34.7068
## 
## Residual Deviance: 18245.69 
## AIC: 18263.69

summary(Stars_Checkin_LogitModel)

## Call:
## polr(formula = as.factor(stars.x) ~ checkin_count.x, data = RestaurantsTrainingSet, 
##     Hess = TRUE)
## 
## Coefficients:
##                     Value Std. Error t value
## checkin_count.x -0.001057  0.0007821  -1.351
## 
## Intercepts:
##       Value    Std. Error t value 
## 1|1.5  -8.0332   0.7081   -11.3442
## 1.5|2  -5.3214   0.1917   -27.7573
## 2|2.5  -3.6978   0.1008   -36.6748
## 2.5|3  -2.3377   0.0731   -31.9689
## 3|3.5  -1.1587   0.0651   -17.7934
## 3.5|4   0.0715   0.0634     1.1275
## 4|4.5   1.7443   0.0686    25.4115
## 4.5|5   4.6342   0.1514    30.6023
## 
## Residual Deviance: 18465.77 
## AIC: 18483.77

summary(Stars_MedianIncome_LogitModel)

## Call:
## polr(formula = as.factor(stars.x) ~ median_income.x, data = RestaurantsTrainingSet, 
##     Hess = TRUE)
## 
## Coefficients:
##                    Value Std. Error t value
## median_income.x -0.04757     0.0237  -2.007
## 
## Intercepts:
##       Value    Std. Error t value 
## 1|1.5  -7.9578   0.7067   -11.2610
## 1.5|2  -5.2449   0.1830   -28.6537
## 2|2.5  -3.6211   0.0830   -43.6200
## 2.5|3  -2.2608   0.0452   -49.9877
## 3|3.5  -1.0817   0.0304   -35.5579
## 3.5|4   0.1491   0.0265     5.6185
## 4|4.5   1.8224   0.0382    47.6911
## 4.5|5   4.7119   0.1407    33.4984
## 
## Residual Deviance: 18463.56 
## AIC: 18481.56

summary(Stars_Population_LogitModel)

## Call:
## polr(formula = as.factor(stars.x) ~ population.x, data = RestaurantsTrainingSet, 
##     Hess = TRUE)
## 
## Coefficients:
##                Value Std. Error t value
## population.x 0.01526    0.02361  0.6462
## 
## Intercepts:
##       Value    Std. Error t value 
## 1|1.5  -7.9573   0.7068   -11.2587
## 1.5|2  -5.2447   0.1831   -28.6513
## 2|2.5  -3.6207   0.0830   -43.6170
## 2.5|3  -2.2603   0.0452   -49.9817
## 3|3.5  -1.0811   0.0304   -35.5457
## 3.5|4   0.1493   0.0265     5.6278
## 4|4.5   1.8215   0.0382    47.6787
## 4.5|5   4.7104   0.1407    33.4882
## 
## Residual Deviance: 18467.18 
## AIC: 18485.18
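
`summary()` for a `polr` fit reports t values but no p-values. A rough way to gauge significance is to treat each t value as approximately standard normal, which is a large-sample assumption rather than something `polr` guarantees. The sketch below applies that approximation to the coefficients and standard errors printed above and converts each coefficient to an odds ratio. The data frame layout is an illustrative choice.

```r
# Coefficients and standard errors copied from the polr summaries above
polr_results <- data.frame(
  predictor = c("review_count.x", "checkin_count.x",
                "median_income.x", "population.x"),
  estimate  = c(0.001661, -0.001057, -0.04757, 0.01526),
  std_error = c(0.0001215, 0.0007821, 0.0237, 0.02361)
)

# Wald statistic and two-sided p-value under a normal approximation
polr_results$z       <- polr_results$estimate / polr_results$std_error
polr_results$p_value <- 2 * pnorm(-abs(polr_results$z))

# Odds ratio for a one-unit increase in the predictor
polr_results$odds_ratio <- exp(polr_results$estimate)
print(polr_results)
```

Under this approximation, review count and median income fall below the 0.05 threshold, while checkin count and population do not.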

summary(Stars_ReviewCount_LinearModel)

## 
## Call:
## glm(formula = (stars.x) ~ review_count.x, data = RestaurantsTrainingSet)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.55812  -0.55566  -0.06008   0.42174   1.44483  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.542e+00  9.665e-03  366.54   <2e-16 ***
## review_count.x 4.913e-04  3.462e-05   14.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.3813214)
## 
##     Null deviance: 2255.7  on 5715  degrees of freedom
## Residual deviance: 2178.9  on 5714  degrees of freedom
## AIC: 10714
## 
## Number of Fisher Scoring iterations: 2

summary(Stars_Checkin_LinearModel)

## 
## Call:
## glm(formula = (stars.x) ~ checkin_count.x, data = RestaurantsTrainingSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6204  -0.6061  -0.1109   0.3862   1.3940  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.6255159  0.0212020 170.999   <2e-16 ***
## checkin_count.x -0.0001358  0.0002705  -0.502    0.616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.3947492)
## 
##     Null deviance: 2255.7  on 5715  degrees of freedom
## Residual deviance: 2255.6  on 5714  degrees of freedom
## AIC: 10912
## 
## Number of Fisher Scoring iterations: 2

summary(Stars_Income_LinearModel)

## 
## Call:
## glm(formula = (stars.x) ~ median_income.x, data = RestaurantsTrainingSet)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.63177  -0.58771  -0.09972   0.38805   1.41374  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.615530   0.008309 435.122   <2e-16 ***
## median_income.x -0.014077   0.008404  -1.675    0.094 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.3945728)
## 
##     Null deviance: 2255.7  on 5715  degrees of freedom
## Residual deviance: 2254.6  on 5714  degrees of freedom
## AIC: 10910
## 
## Number of Fisher Scoring iterations: 2

summary(Stars_Population_LinearModel)

## 
## Call:
## glm(formula = (stars.x) ~ population.x, data = RestaurantsTrainingSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6155  -0.6087  -0.1106   0.3861   1.3913  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.615715   0.008310 435.088   <2e-16 ***
## population.x 0.004012   0.008269   0.485    0.628    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.3947503)
## 
##     Null deviance: 2255.7  on 5715  degrees of freedom
## Residual deviance: 2255.6  on 5714  degrees of freedom
## AIC: 10912
## 
## Number of Fisher Scoring iterations: 2

summary(Checkins_ReviewCount_LinearModel)

## 
## Call:
## glm(formula = (checkin_count.x) ~ review_count.x, data = RestaurantsTrainingSet)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -264.736   -18.133     0.405    16.264    95.324  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    62.538390   0.418583  149.41   <2e-16 ***
## review_count.x  0.064077   0.001499   42.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 715.3087)
## 
##     Null deviance: 5393830  on 5715  degrees of freedom
## Residual deviance: 4087274  on 5714  degrees of freedom
## AIC: 53795
## 
## Number of Fisher Scoring iterations: 2

summary(Checkins_Stars_LinearModel )

## 
## Call:
## glm(formula = (checkin_count.x) ~ stars.x, data = RestaurantsTrainingSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -71.464  -22.855   -0.464   19.861   96.185  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  73.2754     2.3740  30.866   <2e-16 ***
## stars.x      -0.3246     0.6469  -0.502    0.616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 943.9258)
## 
##     Null deviance: 5393830  on 5715  degrees of freedom
## Residual deviance: 5393592  on 5714  degrees of freedom
## AIC: 55380
## 
## Number of Fisher Scoring iterations: 2

summary(Checkins_Income_LinearModel)

## 
## Call:
## glm(formula = (checkin_count.x) ~ median_income.x, data = RestaurantsTrainingSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -75.399  -22.685   -0.382   19.769   96.600  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      72.0650     0.4050 177.948  < 2e-16 ***
## median_income.x  -2.6158     0.4096  -6.387 1.83e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 937.2769)
## 
##     Null deviance: 5393830  on 5715  degrees of freedom
## Residual deviance: 5355600  on 5714  degrees of freedom
## AIC: 55340
## 
## Number of Fisher Scoring iterations: 2

summary(Checkins_Population_LinearModel)

## 
## Call:
## glm(formula = (checkin_count.x) ~ population.x, data = RestaurantsTrainingSet)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -72.661  -22.730   -0.532   19.917   97.111  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   72.1046     0.4062 177.509   <2e-16 ***
## population.x  -0.9099     0.4042  -2.251   0.0244 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 943.1308)
## 
##     Null deviance: 5393830  on 5715  degrees of freedom
## Residual deviance: 5389049  on 5714  degrees of freedom
## AIC: 55375
## 
## Number of Fisher Scoring iterations: 2
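
For a gaussian `glm`, the share of variance explained (R-squared) is one minus the ratio of residual deviance to null deviance. The sketch below applies this to the deviances printed in the summaries above. The data frame layout and model labels are illustrative choices.

```r
# Null and residual deviances copied from the glm summaries above
fits <- data.frame(
  model = c("stars ~ review_count", "stars ~ checkin_count",
            "stars ~ median_income", "stars ~ population",
            "checkins ~ review_count", "checkins ~ stars",
            "checkins ~ median_income", "checkins ~ population"),
  null_deviance  = c(2255.7, 2255.7, 2255.7, 2255.7,
                     5393830, 5393830, 5393830, 5393830),
  resid_deviance = c(2178.9, 2255.6, 2254.6, 2255.6,
                     4087274, 5393592, 5355600, 5389049)
)

# Proportion of variance explained by each single-predictor model
fits$r_squared <- 1 - fits$resid_deviance / fits$null_deviance
print(fits[order(-fits$r_squared), c("model", "r_squared")])
```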

4. Data Visualization

The following charts are intended to allow us to visualize:

the distribution of checkins, stars, and number of reviews for all the restaurants in our dataset
the equivalence between the training and test sets with respect to the key features we focused on
the geographic distribution of the restaurants in our sample
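
Alongside the charts, the equivalence of the training and test sets can be checked numerically. The sketch below is one hedged way to do so on simulated data: the 70/30 split, the column values, and the choice of a Welch t-test are all assumptions made for illustration, not the procedure used in this project.

```r
set.seed(42)
# Simulated stand-in for Restaurants_final; values are illustrative only
toy <- data.frame(
  stars         = sample(seq(1, 5, by = 0.5), 1000, replace = TRUE),
  checkin_count = rpois(1000, lambda = 70)
)

# Hypothetical 70/30 split (the ratio is an assumption)
train_rows <- sample(nrow(toy), size = 700)
train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

# Side-by-side means; similar values suggest a balanced split
comparison <- rbind(train = colMeans(train), test = colMeans(test))
print(comparison)

# Welch two-sample t-test on checkins as a rough numerical check
checkin_test <- t.test(train$checkin_count, test$checkin_count)
print(checkin_test$p.value)
```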

#visualize data
require(ggplot2)
require(ggmap)

## Loading required package: ggmap

## Google Maps API Terms of Service: http://developers.google.com/maps/terms.

## Please cite ggmap if you use it: see citation('ggmap') for details.

qplot(data=Restaurants_final,checkin_count)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(data=Restaurants_final,stars)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
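
The `stat_bin()` message appears because no bin width was supplied. Star ratings in this data move in half-star steps (see the `polr` intercepts such as `3.5|4`), so a bin width of 0.5 gives one bar per rating; in `qplot` this would be passed as `binwidth = 0.5`. The base R sketch below shows the same binning on made-up ratings.

```r
# Made-up ratings for illustration (not the project data)
stars <- c(3.5, 4, 4, 4.5, 3, 3.5, 5, 2.5, 4, 3.5)

# Breaks at the midpoints between half-star values: one bin per rating
breaks <- seq(0.75, 5.25, by = 0.5)
h <- hist(stars, breaks = breaks, plot = FALSE)

# Count of restaurants at each rating, named by the rating value
rating_counts <- setNames(h$counts, h$mids)
print(rating_counts)
```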

qplot(data=Restaurants_final,review_count ,stars,geom="smooth")

## `geom_smooth()` using method = 'gam'

qplot(data=RestaurantsTrainingSet,checkin_count.x)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(data=RestaurantsTrainingSet,stars.x)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(data=RestaurantsTrainingSet,review_count.x ,stars.x,geom="smooth")

## `geom_smooth()` using method = 'gam'

qplot(data=RestaurantsTestSet,checkin_count)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(data=RestaurantsTestSet,stars)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(data=RestaurantsTestSet,review_count ,stars,geom="smooth")

## `geom_smooth()` using method = 'gam'

map<-get_map(location='united states', zoom=4, maptype = "terrain",
             source='google',color='color')

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=united+states&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false

## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=united%20states&sensor=false

#show_guide is not an aesthetic, so the legend is requested with show.legend on the layer instead
ggmap(map) + geom_point(
        aes(x=longitude, y=latitude, colour=checkin_count), 
        data=Restaurants_final, alpha=.5, na.rm = TRUE, show.legend = TRUE)  + 
        scale_color_gradient(low="red", high="blue")

#show_guide is not an aesthetic, so the legend is requested with show.legend on the layer instead
ggmap(map) + geom_point(
        aes(x=longitude, y=latitude, colour=stars), 
        data=Restaurants_final, alpha=.5, na.rm = TRUE, show.legend = TRUE)  + 
        scale_color_gradient(low="red", high="blue")

5. Conclusions and Further Research

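Unfortunately, we found no practically meaningful relationship between restaurant location (the median income and population of the zip code) and restaurant checkins or ratings. Several of the single-predictor models were statistically significant, but most explained almost none of the variance: review count accounted for about 3% of the variance in stars, and median income and population each accounted for less than 1% of the variance in checkins. The one sizeable relationship was between review count and checkins, where review count accounted for roughly 24% of the variance. The weak results may be partly due to our sample size (about 5,700 restaurants in the training set), which may not have been large enough to separate signal from noise. Possible further steps to explore whether there are good predictors of restaurant ratings and checkins include: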

Analyze the review text data using a Natural Language Processing package, to assess whether there are any interesting relationships between written reviews and number of stars. For example, are reviewers who give high ratings more likely to write long reviews, or use certain key words or phrases?
Analyze user data to understand the relationship between how frequently users write reviews on Yelp and their propensity to give very high or very low ratings.
Analyze restaurant categories, tags, and hours, to assess whether they have any relationship with restaurant ratings and/or checkins.

Restaurant Ratings Group Project - BDA January 2017

Matt, Rachana, Sharjeel, Ahmad

January 29, 2017
