We tested several hypotheses about restaurants in major US cities. We used Yelp’s Challenge Data, an open dataset with the following characteristics:
We sought primarily to answer the following questions:
We divided the data into training and test sets so that we could build a predictive model for restaurant ratings, checkins, or both on the training set, and then evaluate that model on the test set.
We ran ordered logistic regressions and linear regressions on this data to determine the predictive power of various restaurant features on the outcomes of interest, namely stars and checkins.
#uncompress data
library(foreign)
#import data and create dataframes
library(jsonlite)
YelpCheckins <- stream_in(file("C:/Users/Matthew/Documents/R/GroupProjectFinal/yelp_academic_dataset_checkin.json"),pagesize = 500)
## opening file input connection.
##
Imported 125532 records. Simplifying...
## closing file input connection.
YelpBusinesses <- stream_in(file("C:/Users/Matthew/Documents/R/GroupProjectFinal/yelp_academic_dataset_business.json"),pagesize = 500)
## opening file input connection.
##
Imported 144072 records. Simplifying...
## closing file input connection.
#make each individual checkin its own row
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(tidyr)
## Loading required package: tidyr
library(splitstackshape)
## Loading required package: data.table
## data.table 1.10.0
## The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
## Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
## Release notes, videos and slides: http://r-datatable.com
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
YelpCheckins_v2 <- YelpCheckins
YelpCheckins_v2 <- cSplit(YelpCheckins_v2, "time", sep = ",", direction = "long")
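The reshaping step above turns one row per business into one row per check-in entry. A minimal base-R sketch of the same idea follows; the `toy_checkins` data frame and the format of its `time` strings are made-up assumptions for illustration, not values read from the Yelp files.

```r
# Toy stand-in for the check-in table: one row per business, with all
# check-in entries packed into one comma-separated string (assumed format).
toy_checkins <- data.frame(
  business_id = c("a", "b"),
  time = c("Fri-0:2,Sat-0:1,Sun-13:4", "Mon-9:1"),
  stringsAsFactors = FALSE
)

# Split each string on commas, then repeat business_id once per piece
# so that every check-in entry ends up on its own row.
pieces <- strsplit(toy_checkins$time, ",", fixed = TRUE)
toy_long <- data.frame(
  business_id = rep(toy_checkins$business_id, lengths(pieces)),
  time = unlist(pieces),
  stringsAsFactors = FALSE
)

print(toy_long)
```

Counting rows per `business_id` in the long table then gives the number of entries per business, which is how `checkin_count` is derived further down.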
require(RCurl)
## Loading required package: RCurl
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
require(xlsx)
## Loading required package: xlsx
## Loading required package: rJava
require(readxl)
## Loading required package: readxl
urlfile <-'http://www.psc.isr.umich.edu/dis/census/Features/tract2zip/MedianZIP-3.xlsx'
destfile <- "census20062010.xlsx"
download.file(urlfile, destfile, mode="wb")
census <- read_excel(destfile, sheet = "Median")
# clean up data
names(census) <- c('postal_code','median_income','population')
census$median_income <- as.character(census$median_income)
census$median_income <- as.numeric(gsub(',','',census$median_income))
print(head(census,5))
## # A tibble: 5 × 3
##   postal_code median_income population
##         <dbl>         <dbl>      <dbl>
## 1        1001      56662.57      16445
## 2        1002      49853.42      28069
## 3        1003      28462.00       8491
## 4        1005      75423.00       4798
## 5        1007      79076.35      12962
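The census ZIP codes print as numbers, so a code such as 01001 appears as 1001. Because the restaurant postal codes are later filtered to five-character strings, a join on these two columns may miss ZIP codes that begin with zero. One possible remedy, sketched on a made-up `toy_census` table (its values are illustrative assumptions), is to pad the codes back to five characters before merging:

```r
# Toy census table with numeric ZIP codes (leading zeros already lost).
toy_census <- data.frame(
  postal_code = c(1001, 1002, 85004),
  median_income = c(50000, 45000, 30000)  # made-up incomes
)

# Restore the leading zeros so the key matches five-character strings.
toy_census$postal_code <- sprintf("%05d", as.integer(toy_census$postal_code))

print(toy_census$postal_code)
```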
#strip out businesses that aren't restaurants, have 25 or fewer reviews, or are no longer active
require(plyr)
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## arrange(): dplyr, plyr
## between(): dplyr, data.table
## compact(): purrr, plyr
## complete(): tidyr, RCurl
## count(): dplyr, plyr
## failwith(): dplyr, plyr
## filter(): dplyr, stats
## first(): dplyr, data.table
## id(): dplyr, plyr
## lag(): dplyr, stats
## last(): dplyr, data.table
## mutate(): dplyr, plyr
## rename(): dplyr, plyr
## summarise(): dplyr, plyr
## summarize(): dplyr, plyr
## transpose(): purrr, data.table
library(stringr)
Restaurants_v1 <- subset(YelpBusinesses,is_open == 1)
Restaurants_v2 <- filter(Restaurants_v1 , grepl('Restaurants',categories))
Restaurants_v3 <- subset(Restaurants_v2,review_count > 25)
#merge checkins data with restaurants data, matching on business_id; remove restaurants located outside the US
Restaurants_v4 <- merge(x=Restaurants_v3, y=YelpCheckins_v2, by.x="business_id", by.y="business_id", all=TRUE)
Restaurants_v5 <- subset(Restaurants_v4,is_open == 1)
Restaurants_v5$business_id <- as.character(Restaurants_v5$business_id)
Restaurants_v5$checkin_count <- as.numeric(ave(Restaurants_v5$business_id, Restaurants_v5$business_id, FUN = length))
Restaurants_v6 <- subset(Restaurants_v5,!duplicated(business_id),-c(17))
Restaurants_v7 <- subset(Restaurants_v6,nchar(postal_code)==5)
Restaurants_v8 <- merge(x=Restaurants_v7,y=census,by.x="postal_code",by.y = "postal_code")
ScaledMedianIncome <- scale(Restaurants_v8$median_income)
ScaledPopulation <- scale(Restaurants_v8$population)
Restaurants_v8$median_income <- ScaledMedianIncome
Restaurants_v8$population <- ScaledPopulation
Restaurants_final <- Restaurants_v8
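Note that `scale()` returns a one-column matrix carrying centering and scaling attributes, so assigning its result directly stores a matrix inside the data frame. A small sketch of converting it to a plain numeric vector first (the input values are arbitrary):

```r
x <- c(10, 20, 30, 40)

scaled_matrix <- scale(x)                   # 4 x 1 matrix with attributes
scaled_vector <- as.numeric(scaled_matrix)  # plain numeric vector, attributes dropped

print(is.matrix(scaled_matrix))
print(scaled_vector)
```

Both forms hold the same standardized values (mean 0, standard deviation 1); the vector form is simply easier to pass to summary functions and model formulas.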
#Divides restaurant dataset into training set and test set
RestaurantsTestSet <- Restaurants_final[sample(nrow(Restaurants_final), 11431/2), ]
Restaurants_New <- merge(x=Restaurants_final,y=RestaurantsTestSet, by.x = "business_id", by.y = "business_id",all=TRUE)
RestaurantsTrainingSet <- subset(Restaurants_New,attributes.y == "NA")
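The split above draws a random half of the rows and recovers the training set through a second merge. A simpler, reproducible alternative is sketched below, assuming a toy data frame (`toy`) and an arbitrary seed: sample row indices once and use them to select both halves.

```r
set.seed(42)  # arbitrary seed, used only so the split can be reproduced

# Toy stand-in for the restaurant table.
toy <- data.frame(
  business_id = sprintf("b%03d", 1:101),
  stars = sample(seq(1, 5, by = 0.5), 101, replace = TRUE),
  stringsAsFactors = FALSE
)

# Draw half of the row indices (rounded down) for the test set.
n_test <- floor(nrow(toy) / 2)
test_idx <- sample(nrow(toy), n_test)

toy_test <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]  # every row that was not sampled

print(c(test = nrow(toy_test), train = nrow(toy_train)))
```

Selecting by index keeps the original column names in both halves, so no `.x` and `.y` suffixes are introduced.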
#Checks for equivalence of training set and test set
mean(RestaurantsTestSet$stars,na.rm=TRUE)
## [1] 3.621522
mean(RestaurantsTrainingSet$stars.x,na.rm=TRUE)
## [1] 3.6236
sd(RestaurantsTestSet$stars,na.rm=TRUE)
## [1] 0.6209718
sd(RestaurantsTrainingSet$stars.x,na.rm=TRUE)
## [1] 0.6163316
mean(RestaurantsTestSet$stars,na.rm=TRUE) - mean(RestaurantsTrainingSet$stars.x,na.rm=TRUE)
## [1] -0.00207811
sd(RestaurantsTestSet$stars,na.rm=TRUE) - sd(RestaurantsTrainingSet$stars.x,na.rm=TRUE)
## [1] 0.004640246
mean(RestaurantsTestSet$checkin_count,na.rm=TRUE)
## [1] 71.68224
mean(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)
## [1] 72.0007
sd(RestaurantsTestSet$checkin_count,na.rm=TRUE)
## [1] 30.93848
sd(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)
## [1] 30.84472
mean(RestaurantsTestSet$checkin_count,na.rm=TRUE) - mean(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)
## [1] -0.3184601
sd(RestaurantsTestSet$checkin_count,na.rm=TRUE) - sd(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)
## [1] 0.09376093
mean(RestaurantsTestSet$review_count,na.rm=TRUE)
## [1] 149.1207
mean(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)
## [1] 146.3754
sd(RestaurantsTestSet$review_count,na.rm=TRUE)
## [1] 257.6816
sd(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)
## [1] 227.1995
mean(RestaurantsTestSet$review_count,na.rm=TRUE) - mean(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)
## [1] 2.745298
sd(RestaurantsTestSet$review_count,na.rm=TRUE) - sd(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)
## [1] 30.48209
sapply(RestaurantsTestSet,mean, na.rm=TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##   postal_code   business_id          name  neighborhood       address
##            NA            NA            NA            NA            NA
##          city         state      latitude     longitude         stars
##            NA            NA  3.647610e+01 -1.011885e+02  3.621522e+00
##  review_count       is_open    attributes    categories         hours
##  1.491207e+02  1.000000e+00            NA            NA            NA
##        type.x        type.y checkin_count median_income    population
##            NA            NA  7.168224e+01 -1.965074e-03 -2.510127e-03
sapply(RestaurantsTrainingSet,mean,na.rm=TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##     business_id   postal_code.x          name.x  neighborhood.x
##              NA              NA              NA              NA
##       address.x          city.x         state.x      latitude.x
##              NA              NA              NA    3.643328e+01
##     longitude.x         stars.x  review_count.x       is_open.x
##   -1.015357e+02    3.623600e+00    1.463754e+02    1.000000e+00
##    attributes.x    categories.x         hours.x        type.x.x
##              NA              NA              NA              NA
##        type.y.x checkin_count.x median_income.x    population.x
##              NA    7.200070e+01    1.964731e-03    2.509688e-03
##   postal_code.y          name.y  neighborhood.y       address.y
##              NA              NA              NA              NA
##          city.y         state.y      latitude.y     longitude.y
##              NA              NA             NaN             NaN
##         stars.y  review_count.y       is_open.y    attributes.y
##             NaN             NaN             NaN              NA
##    categories.y         hours.y        type.x.y        type.y.y
##              NA              NA              NA              NA
## checkin_count.y median_income.y    population.y
##             NaN             NaN             NaN
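The warnings above arise because `mean()` is applied to every column, including character and nested ones. A minimal sketch of restricting the comparison to numeric columns, using a made-up `toy` data frame:

```r
# Toy table mixing a character column with two numeric columns.
toy <- data.frame(
  name = c("x", "y", "z"),
  stars = c(3.5, 4, 4.5),
  review_count = c(30, 120, 60),
  stringsAsFactors = FALSE
)

# Flag the numeric columns, then average only those.
numeric_cols <- vapply(toy, is.numeric, logical(1))
col_means <- vapply(toy[numeric_cols], mean, numeric(1), na.rm = TRUE)

print(col_means)
```

Running the same two lines on the test and training sets would give directly comparable vectors of means without the non-numeric `NA` entries.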
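#Divides restaurant dataset into training set and test set a second time; no seed is set, so this draw differs from the first, and it is the split used by the models below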
RestaurantsTestSet <- Restaurants_final[sample(nrow(Restaurants_final), 11431/2), ]
Restaurants_New <- merge(x=Restaurants_final,y=RestaurantsTestSet, by.x = "business_id", by.y = "business_id",all=TRUE)
RestaurantsTrainingSet <- subset(Restaurants_New,attributes.y == "NA")
#Checks for equivalence of training set and test set
mean(RestaurantsTestSet$stars,na.rm=TRUE)
## [1] 3.629396
mean(RestaurantsTrainingSet$stars.x,na.rm=TRUE)
## [1] 3.615728
sd(RestaurantsTestSet$stars,na.rm=TRUE)
## [1] 0.6088344
sd(RestaurantsTrainingSet$stars.x,na.rm=TRUE)
## [1] 0.6282495
mean(RestaurantsTestSet$stars,na.rm=TRUE) - mean(RestaurantsTrainingSet$stars.x,na.rm=TRUE)
## [1] 0.01366854
sd(RestaurantsTestSet$stars,na.rm=TRUE) - sd(RestaurantsTrainingSet$stars.x,na.rm=TRUE)
## [1] -0.01941517
mean(RestaurantsTestSet$checkin_count,na.rm=TRUE)
## [1] 71.58128
mean(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)
## [1] 72.10164
sd(RestaurantsTestSet$checkin_count,na.rm=TRUE)
## [1] 31.05963
sd(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)
## [1] 30.72136
mean(RestaurantsTestSet$checkin_count,na.rm=TRUE) - mean(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)
## [1] -0.5203672
sd(RestaurantsTestSet$checkin_count,na.rm=TRUE) - sd(RestaurantsTrainingSet$checkin_count.x,na.rm=TRUE)
## [1] 0.3382641
mean(RestaurantsTestSet$review_count,na.rm=TRUE)
## [1] 146.2483
mean(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)
## [1] 149.2474
sd(RestaurantsTestSet$review_count,na.rm=TRUE)
## [1] 249.672
sd(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)
## [1] 235.9701
mean(RestaurantsTestSet$review_count,na.rm=TRUE) - mean(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)
## [1] -2.999082
sd(RestaurantsTestSet$review_count,na.rm=TRUE) - sd(RestaurantsTrainingSet$review_count.x,na.rm=TRUE)
## [1] 13.70188
sapply(RestaurantsTestSet,mean, na.rm=TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##   postal_code   business_id          name  neighborhood       address
##            NA            NA            NA            NA            NA
##          city         state      latitude     longitude         stars
##            NA            NA  3.650779e+01 -1.011359e+02  3.629396e+00
##  review_count       is_open    attributes    categories         hours
##  1.462483e+02  1.000000e+00            NA            NA            NA
##        type.x        type.y checkin_count median_income    population
##            NA            NA  7.158128e+01  1.401877e-02 -3.197292e-03
sapply(RestaurantsTrainingSet,mean,na.rm=TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##     business_id   postal_code.x          name.x  neighborhood.x
##              NA              NA              NA              NA
##       address.x          city.x         state.x      latitude.x
##              NA              NA              NA    3.640160e+01
##     longitude.x         stars.x  review_count.x       is_open.x
##   -1.015882e+02    3.615728e+00    1.492474e+02    1.000000e+00
##    attributes.x    categories.x         hours.x        type.x.x
##              NA              NA              NA              NA
##        type.y.x checkin_count.x median_income.x    population.x
##              NA    7.210164e+01   -1.401632e-02    3.196732e-03
##   postal_code.y          name.y  neighborhood.y       address.y
##              NA              NA              NA              NA
##          city.y         state.y      latitude.y     longitude.y
##              NA              NA             NaN             NaN
##         stars.y  review_count.y       is_open.y    attributes.y
##             NaN             NaN             NaN              NA
##    categories.y         hours.y        type.x.y        type.y.y
##              NA              NA              NA              NA
## checkin_count.y median_income.y    population.y
##             NaN             NaN             NaN
#require packages for logistic regression
require(foreign)
require(nnet)
## Loading required package: nnet
require(ggplot2)
require(reshape2)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## The following object is masked from 'package:tidyr':
##
## smiths
require(MASS)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
require(ResourceSelection)
## Loading required package: ResourceSelection
## ResourceSelection 0.3-0 2016-11-04
#run ordered logistic regressions to assess the predictive power of checkins, number of ratings, median income, and population on restaurant ratings
Stars_ReviewCount_LogitModel <- polr(as.factor(stars.x) ~ review_count.x, data=RestaurantsTrainingSet, Hess = TRUE)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Stars_Checkin_LogitModel <- polr(as.factor(stars.x) ~ checkin_count.x, data=RestaurantsTrainingSet, Hess = TRUE)
Stars_MedianIncome_LogitModel <- polr(as.factor(stars.x) ~ median_income.x, data=RestaurantsTrainingSet, Hess = TRUE)
Stars_Population_LogitModel <- polr(as.factor(stars.x) ~ population.x, data=RestaurantsTrainingSet, Hess = TRUE)
#run linear regressions to assess the predictive power of checkins, number of ratings, population, and median income on restaurant ratings
Stars_ReviewCount_LinearModel <- glm((stars.x) ~ review_count.x, data=RestaurantsTrainingSet)
Stars_Checkin_LinearModel <- glm((stars.x) ~ checkin_count.x, data=RestaurantsTrainingSet)
Stars_Income_LinearModel <- glm((stars.x) ~ median_income.x, data=RestaurantsTrainingSet)
Stars_Population_LinearModel <- glm((stars.x) ~ population.x, data=RestaurantsTrainingSet)
#run linear regressions to assess the predictive power of stars, number of ratings, population, and income on restaurant checkins
Checkins_ReviewCount_LinearModel <- glm((checkin_count.x) ~ review_count.x, data=RestaurantsTrainingSet)
Checkins_Stars_LinearModel <- glm((checkin_count.x) ~ stars.x, data=RestaurantsTrainingSet)
Checkins_Income_LinearModel <- glm((checkin_count.x) ~ median_income.x, data=RestaurantsTrainingSet)
Checkins_Population_LinearModel <- glm((checkin_count.x) ~ population.x, data=RestaurantsTrainingSet)
#Produce summaries
summary(Stars_ReviewCount_LogitModel)
## Call:
## polr(formula = as.factor(stars.x) ~ review_count.x, data = RestaurantsTrainingSet,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## review_count.x 0.001661 0.0001215 13.67
##
## Intercepts:
## Value Std. Error t value
## 1|1.5 -7.7582 0.7018 -11.0547
## 1.5|2 -5.0465 0.1835 -27.5051
## 2|2.5 -3.4201 0.0842 -40.6202
## 2.5|3 -2.0526 0.0475 -43.1728
## 3|3.5 -0.8572 0.0343 -24.9695
## 3.5|4 0.4056 0.0324 12.5234
## 4|4.5 2.1285 0.0450 47.2916
## 4.5|5 5.0761 0.1463 34.7068
##
## Residual Deviance: 18245.69
## AIC: 18263.69
summary(Stars_Checkin_LogitModel)
## Call:
## polr(formula = as.factor(stars.x) ~ checkin_count.x, data = RestaurantsTrainingSet,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## checkin_count.x -0.001057 0.0007821 -1.351
##
## Intercepts:
## Value Std. Error t value
## 1|1.5 -8.0332 0.7081 -11.3442
## 1.5|2 -5.3214 0.1917 -27.7573
## 2|2.5 -3.6978 0.1008 -36.6748
## 2.5|3 -2.3377 0.0731 -31.9689
## 3|3.5 -1.1587 0.0651 -17.7934
## 3.5|4 0.0715 0.0634 1.1275
## 4|4.5 1.7443 0.0686 25.4115
## 4.5|5 4.6342 0.1514 30.6023
##
## Residual Deviance: 18465.77
## AIC: 18483.77
summary(Stars_MedianIncome_LogitModel)
## Call:
## polr(formula = as.factor(stars.x) ~ median_income.x, data = RestaurantsTrainingSet,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## median_income.x -0.04757 0.0237 -2.007
##
## Intercepts:
## Value Std. Error t value
## 1|1.5 -7.9578 0.7067 -11.2610
## 1.5|2 -5.2449 0.1830 -28.6537
## 2|2.5 -3.6211 0.0830 -43.6200
## 2.5|3 -2.2608 0.0452 -49.9877
## 3|3.5 -1.0817 0.0304 -35.5579
## 3.5|4 0.1491 0.0265 5.6185
## 4|4.5 1.8224 0.0382 47.6911
## 4.5|5 4.7119 0.1407 33.4984
##
## Residual Deviance: 18463.56
## AIC: 18481.56
summary(Stars_Population_LogitModel)
## Call:
## polr(formula = as.factor(stars.x) ~ population.x, data = RestaurantsTrainingSet,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## population.x 0.01526 0.02361 0.6462
##
## Intercepts:
## Value Std. Error t value
## 1|1.5 -7.9573 0.7068 -11.2587
## 1.5|2 -5.2447 0.1831 -28.6513
## 2|2.5 -3.6207 0.0830 -43.6170
## 2.5|3 -2.2603 0.0452 -49.9817
## 3|3.5 -1.0811 0.0304 -35.5457
## 3.5|4 0.1493 0.0265 5.6278
## 4|4.5 1.8215 0.0382 47.6787
## 4.5|5 4.7104 0.1407 33.4882
##
## Residual Deviance: 18467.18
## AIC: 18485.18
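The `polr` summaries above report t values but no p-values. A minimal sketch of one common way to approximate them, treating each t value as standard normal. The data here are simulated stand-ins (the names `toy`, `stars`, and `review_count` are hypothetical), not the Yelp training set:

```r
library(MASS)

# Simulated stand-in data (hypothetical); the real analysis uses RestaurantsTrainingSet.
set.seed(1)
n <- 500
review_count <- rpois(n, 20)
latent <- 0.1 * review_count + rlogis(n)

# Cut the latent score into four ordered rating levels.
stars <- cut(latent,
             breaks = quantile(latent, 0:4 / 4),
             include.lowest = TRUE,
             labels = c("2", "3", "4", "5"),
             ordered_result = TRUE)
toy <- data.frame(stars, review_count)

fit <- polr(stars ~ review_count, data = toy, Hess = TRUE)

# coef(summary(fit)) holds Value, Std. Error, and t value for the slope and cutpoints.
ctable <- coef(summary(fit))

# Two-sided p-values under a normal approximation to the t values.
p_value <- 2 * pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)
ctable <- cbind(ctable, "p value" = p_value)
print(ctable)
```

The normal approximation is reasonable with thousands of rows, as in the training set here, but it is an approximation rather than an exact test.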
summary(Stars_ReviewCount_LinearModel)
##
## Call:
## glm(formula = (stars.x) ~ review_count.x, data = RestaurantsTrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.55812 -0.55566 -0.06008 0.42174 1.44483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.542e+00 9.665e-03 366.54 <2e-16 ***
## review_count.x 4.913e-04 3.462e-05 14.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.3813214)
##
## Null deviance: 2255.7 on 5715 degrees of freedom
## Residual deviance: 2178.9 on 5714 degrees of freedom
## AIC: 10714
##
## Number of Fisher Scoring iterations: 2
summary(Stars_Checkin_LinearModel)
##
## Call:
## glm(formula = (stars.x) ~ checkin_count.x, data = RestaurantsTrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6204 -0.6061 -0.1109 0.3862 1.3940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6255159 0.0212020 170.999 <2e-16 ***
## checkin_count.x -0.0001358 0.0002705 -0.502 0.616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.3947492)
##
## Null deviance: 2255.7 on 5715 degrees of freedom
## Residual deviance: 2255.6 on 5714 degrees of freedom
## AIC: 10912
##
## Number of Fisher Scoring iterations: 2
summary(Stars_Income_LinearModel)
##
## Call:
## glm(formula = (stars.x) ~ median_income.x, data = RestaurantsTrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.63177 -0.58771 -0.09972 0.38805 1.41374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.615530 0.008309 435.122 <2e-16 ***
## median_income.x -0.014077 0.008404 -1.675 0.094 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.3945728)
##
## Null deviance: 2255.7 on 5715 degrees of freedom
## Residual deviance: 2254.6 on 5714 degrees of freedom
## AIC: 10910
##
## Number of Fisher Scoring iterations: 2
summary(Stars_Population_LinearModel)
##
## Call:
## glm(formula = (stars.x) ~ population.x, data = RestaurantsTrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6155 -0.6087 -0.1106 0.3861 1.3913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.615715 0.008310 435.088 <2e-16 ***
## population.x 0.004012 0.008269 0.485 0.628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.3947503)
##
## Null deviance: 2255.7 on 5715 degrees of freedom
## Residual deviance: 2255.6 on 5714 degrees of freedom
## AIC: 10912
##
## Number of Fisher Scoring iterations: 2
summary(Checkins_ReviewCount_LinearModel)
##
## Call:
## glm(formula = (checkin_count.x) ~ review_count.x, data = RestaurantsTrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -264.736 -18.133 0.405 16.264 95.324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.538390 0.418583 149.41 <2e-16 ***
## review_count.x 0.064077 0.001499 42.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 715.3087)
##
## Null deviance: 5393830 on 5715 degrees of freedom
## Residual deviance: 4087274 on 5714 degrees of freedom
## AIC: 53795
##
## Number of Fisher Scoring iterations: 2
summary(Checkins_Stars_LinearModel)
##
## Call:
## glm(formula = (checkin_count.x) ~ stars.x, data = RestaurantsTrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -71.464 -22.855 -0.464 19.861 96.185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.2754 2.3740 30.866 <2e-16 ***
## stars.x -0.3246 0.6469 -0.502 0.616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 943.9258)
##
## Null deviance: 5393830 on 5715 degrees of freedom
## Residual deviance: 5393592 on 5714 degrees of freedom
## AIC: 55380
##
## Number of Fisher Scoring iterations: 2
summary(Checkins_Income_LinearModel)
##
## Call:
## glm(formula = (checkin_count.x) ~ median_income.x, data = RestaurantsTrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -75.399 -22.685 -0.382 19.769 96.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.0650 0.4050 177.948 < 2e-16 ***
## median_income.x -2.6158 0.4096 -6.387 1.83e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 937.2769)
##
## Null deviance: 5393830 on 5715 degrees of freedom
## Residual deviance: 5355600 on 5714 degrees of freedom
## AIC: 55340
##
## Number of Fisher Scoring iterations: 2
summary(Checkins_Population_LinearModel)
##
## Call:
## glm(formula = (checkin_count.x) ~ population.x, data = RestaurantsTrainingSet)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -72.661 -22.730 -0.532 19.917 97.111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.1046 0.4062 177.509 <2e-16 ***
## population.x -0.9099 0.4042 -2.251 0.0244 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 943.1308)
##
## Null deviance: 5393830 on 5715 degrees of freedom
## Residual deviance: 5389049 on 5714 degrees of freedom
## AIC: 55375
##
## Number of Fisher Scoring iterations: 2
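To compare the eight linear models on one scale, the share of null deviance each one removes can be computed directly from the summaries above (for a Gaussian `glm` this equals R-squared). A short sketch, with the deviances copied from the printed output and the data frame name `fits` chosen here for illustration:

```r
# Null and residual deviances copied from the glm summaries above.
fits <- data.frame(
  model = c("stars ~ review_count", "stars ~ checkin_count",
            "stars ~ median_income", "stars ~ population",
            "checkins ~ review_count", "checkins ~ stars",
            "checkins ~ median_income", "checkins ~ population"),
  null_deviance = c(rep(2255.7, 4), rep(5393830, 4)),
  residual_deviance = c(2178.9, 2255.6, 2254.6, 2255.6,
                        4087274, 5393592, 5355600, 5389049)
)

# Proportion of deviance explained by the single predictor.
fits$explained <- 1 - fits$residual_deviance / fits$null_deviance

# Sort from most to least explanatory.
print(fits[order(-fits$explained), ], digits = 3)
```

On these numbers, review count accounts for roughly 24% of the deviance in checkins and about 3% of the deviance in stars; every other model is below 1%.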
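The following charts visualize the distributions of checkins and stars, the relationship between review count and stars, and the geographic distribution of checkins and stars.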
#visualize data
require(ggplot2)
require(ggmap)
## Loading required package: ggmap
## Google Maps API Terms of Service: http://developers.google.com/maps/terms.
## Please cite ggmap if you use it: see citation('ggmap') for details.
qplot(data=Restaurants_final,checkin_count)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(data=Restaurants_final,stars)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(data=Restaurants_final,review_count ,stars,geom="smooth")
## `geom_smooth()` using method = 'gam'
qplot(data=RestaurantsTrainingSet,checkin_count.x)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(data=RestaurantsTrainingSet,stars.x)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(data=RestaurantsTrainingSet,review_count.x ,stars.x,geom="smooth")
## `geom_smooth()` using method = 'gam'
qplot(data=RestaurantsTestSet,checkin_count)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(data=RestaurantsTestSet,stars)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(data=RestaurantsTestSet,review_count ,stars,geom="smooth")
## `geom_smooth()` using method = 'gam'
map<-get_map(location='united states', zoom=4, maptype = "terrain",
source='google',color='color')
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=united+states&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=united%20states&sensor=false
#show.legend is a layer argument, not an aesthetic, so it goes outside aes()
ggmap(map) + geom_point(
aes(x=longitude, y=latitude, colour=checkin_count),
data=Restaurants_final, alpha=.5, na.rm = TRUE, show.legend = TRUE) +
scale_color_gradient(low="red", high="blue")
#show.legend is a layer argument, not an aesthetic, so it goes outside aes()
ggmap(map) + geom_point(
aes(x=longitude, y=latitude, colour=stars),
data=Restaurants_final, alpha=.5, na.rm = TRUE, show.legend = TRUE) +
scale_color_gradient(low="red", high="blue")
Unfortunately, we found few practically meaningful relationships between the features we tested and restaurant checkins or ratings. Several coefficients were statistically significant (review count for both stars and checkins, and median income and population for checkins), but only review count explained a substantial share of the variation in checkins (about 24% of the null deviance); every other single-feature model explained less than 4%. This may be because our sample (roughly 5,700 restaurants in the training set) was too small to separate weaker signals from noise. Possible further steps to explore whether there are good predictors of restaurant ratings and checkins include:
Analyze the review text data using a Natural Language Processing package, to assess whether there are any interesting relationships between written reviews and number of stars. For example, are reviewers who give high ratings more likely to write long reviews, or use certain key words or phrases?
Analyze user data to understand the relationship between how often users write reviews on Yelp and their propensity to give very high or very low ratings.
Analyze restaurant categories, tags, and hours, to assess whether they have any relationship with restaurant ratings and/or checkins.
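A further step closer to the existing code would be to score a fitted model on the held-out test set, since the summaries above describe fit on the training set only. The sketch below shows the general pattern on simulated data; the data frame `toy`, its columns, and the 80/20 split are assumptions for illustration. With the real data, the training columns carry a `.x` suffix (for example `review_count.x`) while the test set's do not, so the names would need to be aligned before calling `predict()`.

```r
# Simulated stand-in data (hypothetical); not the Yelp dataset.
set.seed(42)
n <- 1000
toy <- data.frame(review_count = rexp(n, rate = 1 / 150))
toy$checkin_count <- 60 + 0.06 * toy$review_count + rnorm(n, sd = 10)

# 80/20 split into training and test rows.
train_rows <- sample(n, size = 800)
train <- toy[train_rows, ]
test <- toy[-train_rows, ]

# Fit on the training rows only, then predict the held-out rows.
fit <- glm(checkin_count ~ review_count, data = train)
predicted <- predict(fit, newdata = test)

# Root mean squared error of the model versus a constant (training mean) baseline.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
model_rmse <- rmse(test$checkin_count, predicted)
baseline_rmse <- rmse(test$checkin_count, mean(train$checkin_count))
print(c(model = model_rmse, baseline = baseline_rmse))
```

Comparing against the training-mean baseline shows whether a predictor adds anything out of sample, which a significant coefficient alone does not show.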