AirBnB was founded in 2008 by Brian Chesky, Joe Gebbia, and Nathan Blecharczyk as AirBed & Breakfast, an online marketplace and hospitality service for short-term lodging. Over the past years, the share of professional hospitality providers has increased significantly and is now crowding out private providers, threatening AirBnB’s value proposition of unique design and a personal touch. In this context, the marketing department wants to run a campaign to attract more private providers. To support it, they asked the analytics department to create a tool that shows potential landlords how much money they could earn by listing their apartments on AirBnB.
As a pilot, Amsterdam was chosen because of the “AirBnB friendly” policy of the local regulators and the high number of short-term visitors. The proposed solution, however, is designed to be city-independent; the process is therefore built to be easily replicable, using .Rmd files and GitHub.
(Data source: http://tomslee.net/airbnb-data-collection-get-the-data. We acknowledge the following: All material is copyright Tom Slee, licensed under a Creative Commons Attribution-NonCommercial 2.5 Canada License.)
The data is collected from the official AirBnB website by Tom Slee and provided as datasets for a large number of cities at different times. The considered dataset contains 18723 entries with 15 independent variables.
Name | Description |
---|---|
room_id | A unique number identifying an AirBnB listing. The listing has a URL on the AirBnB web site of http://airbnb.com/rooms/room_id |
host_id | A unique number identifying an AirBnB host. The host's page has a URL on the AirBnB web site of http://airbnb.com/users/show/host_id |
room_type | One of Entire home/apt, Private room, Shared room |
neighborhood | A subregion of the city or search area for which the survey is carried out (within this dataset: Bijlmer Centrum, Bijlmer Oost, Bos en Lommer, Buitenveldert / Zuidas, Centrum Oost, Centrum West, De Aker / Nieuw Sloten, De Baarsjes / Oud West, De Pijp / Rivierenbuurt, Gaasperdam / Driemond, Geuzenveld / Slotermeer, Ijburg / Eiland Zeeburg, Noord-West / Noord-Midden, Noord Oost, Noord West, Oostelijk Havengebied / Indische Buurt, Osdorp, Oud Noord, Oud Oost, Slotervaart, Watergraafsmeer, Westerpark, Westpoort) |
reviews | The number of reviews that a listing has received. As 70% of visits end up with a review, the number of reviews can be used to estimate the number of visits. Note that such an estimate will not be reliable for an individual listing, but over a city as a whole it should be a useful metric of traffic |
overall_satisfaction | The average rating (out of five) that the listing has received from those visitors who left a review |
accommodates | The number of guests a listing can accommodate |
bedrooms | The number of bedrooms a listing offers |
minstay | The minimum stay for a visit, as posted by the host |
latitude and longitude | The latitude and longitude of the listing as posted on the AirBnB site |
last_modified | The date and time that the values were read from the AirBnB web site |
price | The price (in USD) for a night stay |
Let’s look into the data for a few AirBnB listings. This is what the first 8 of the 18723 rows look like (transposed, for convenience):
Variable | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 |
---|---|---|---|---|---|---|---|---|
room_id | 10176931 | 8935871 | 14011697 | 6137978 | 18630616 | 5790170 | 934060 | 19590049 |
survey_id | 1476 | 1476 | 1476 | 1476 | 1476 | 1476 | 1476 | 1476 |
host_id | 49180562 | 46718394 | 10346595 | 8685430 | 70191803 | 29968916 | 5037506 | 132687356 |
room_type | Shared room | Shared room | Shared room | Shared room | Shared room | Shared room | Shared room | Shared room |
country | ||||||||
city | Amsterdam | Amsterdam | Amsterdam | Amsterdam | Amsterdam | Amsterdam | Amsterdam | Amsterdam |
borough | ||||||||
neighborhood | De Pijp / Rivierenbuurt | Centrum West | Watergraafsmeer | Centrum West | De Baarsjes / Oud West | De Pijp / Rivierenbuurt | Oostelijk Havengebied / Indische Buurt | Westerpark |
reviews | 7 | 45 | 1 | 7 | 1 | 184 | 67 | 2 |
overall_satisfaction | 4.5 | 4.5 | 0.0 | 5.0 | 0.0 | 4.5 | 5.0 | 0.0 |
accommodates | 2 | 4 | 3 | 4 | 2 | 2 | 16 | 2 |
bedrooms | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
bathrooms | ||||||||
latitude | 52.35621 | 52.37852 | 52.33881 | 52.37632 | 52.37038 | 52.34226 | 52.37755 | 52.37521 |
longitude | 4.887491 | 4.896120 | 4.943592 | 4.890028 | 4.852873 | 4.897126 | 4.930418 | 4.866117 |
Step 1 - Prepare and split the data: At the end of this step, three cleaned-up datasets should be ready: one set for estimation, a second set for validation and a last set for testing.
Step 2 - Exploratory Data Analysis: In this step, a feeling can be established for the available data. Scatterplots, boxplots and correlation matrices can provide useful insights into the data that will help to build a better regression model.
Step 3 - Building a Regression Model: To create the actual model, a suitable algorithm and respective parameters need to be chosen. Steps 3 and 4 are part of an iterative approach that will improve the outcome over time, as the parameters get tweaked.
Step 4 - Validate Prediction Quality: Different methods can be used to determine how well the model predicts prices for a different set of listings (i.e. the validation dataset).
Only after following these steps can the resulting model be used to predict the prices for new listings or the testing dataset. Let’s follow these steps.
In addition to that, the dataset contains listings for which the number of bedrooms was set to 0 (1154 or 6.18%). These datapoints are removed from the dataset.
To validate and test the result of our regression, we split the available data into 3 subsets. We refer to the three data samples as estimation data (90% of the data in our case), validation data (5% of the data) and test data (the remaining 5% of the data). In a more thorough analysis, different or multiple models could be prepared based on the estimation set and an average of these could be used for prediction.
In our case we use 15755 observations in the estimation data, 875 in the validation data, and 876 in the test data.
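The 90/5/5 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: `PricingData` is a synthetic stand-in for the cleaned listings table, and the names `PricingData.estimation`, `PricingData.validation` and `PricingData.test` are assumptions. Rounding the first two shares down and assigning the remainder to the test set is one possible convention; applied to 17506 rows it would give 15755, 875 and 876 observations.

```r
# Sketch of a 90/5/5 split into estimation, validation and test data.
# `PricingData` is synthetic; the real listings table is not reproduced here.
set.seed(42)  # fixed seed so the split is reproducible
PricingData <- data.frame(
  room_id = seq_len(1000),
  price   = round(runif(1000, min = 40, max = 400))
)

n <- nrow(PricingData)
shuffled <- sample(n)  # random permutation of the row indices

# Round the first two shares down; the test set receives the remainder
n_estimation <- floor(0.90 * n)
n_validation <- floor(0.05 * n)

estimation_idx <- shuffled[seq_len(n_estimation)]
validation_idx <- shuffled[n_estimation + seq_len(n_validation)]
test_idx       <- shuffled[(n_estimation + n_validation + 1):n]

PricingData.estimation <- PricingData[estimation_idx, ]
PricingData.validation <- PricingData[validation_idx, ]
PricingData.test       <- PricingData[test_idx, ]

cat(nrow(PricingData.estimation), nrow(PricingData.validation),
    nrow(PricingData.test), "\n")
```

Shuffling the indices once and slicing them guarantees that the three subsets are disjoint and together cover every row.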
We will now examine the data a bit more thoroughly.
The first high-level analysis of the datasets is looking into the distribution of the dependent variable, price, among the listings. The following chart shows a histogram of all prices in the estimation data:
Where the data contains outliers (defined as listings with prices >= 1000, shown in red to the right of the vertical line in the chart above), we exclude these extraordinarily expensive listings. This changes the histogram to the following:
We notice a high concentration of accommodation prices between 100 and 200 USD.
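The outlier rule above (prices >= 1000) amounts to a simple row filter. The sketch below applies it to synthetic prices; the data frame `estimation` and the variable names are illustrative assumptions, and only the cutoff of 1000 comes from the text.

```r
# Sketch: drop listings priced at or above the outlier cutoff.
set.seed(1)
estimation <- data.frame(
  # Synthetic prices plus three deliberately extreme listings
  price = c(round(rlnorm(500, meanlog = 5, sdlog = 0.4)), 1000, 2500, 8000)
)

price_cutoff <- 1000  # from the text: prices >= 1000 count as outliers
is_outlier <- estimation$price >= price_cutoff

# drop = FALSE keeps the result a data frame even with a single column
estimation_clean <- estimation[!is_outlier, , drop = FALSE]

cat("Removed", sum(is_outlier), "of", nrow(estimation), "listings\n")
print(summary(estimation_clean$price))
```

Calling `hist(estimation_clean$price)` on the filtered data would then produce the second histogram.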
A correlation matrix is a table showing correlation coefficients between sets of variables. This allows us to identify pairs with higher correlations:
Variable | room_type | neighborhood | reviews | overall_satisfaction | accommodates | bedrooms | price | latitude | longitude |
---|---|---|---|---|---|---|---|---|---|
room_type | 1.00 | -0.03 | 0.29 | 0.05 | -0.27 | -0.27 | -0.33 | -0.01 | 0.00 |
neighborhood | -0.03 | 1.00 | -0.05 | -0.02 | 0.03 | 0.05 | -0.09 | 0.12 | 0.18 |
reviews | 0.29 | -0.05 | 1.00 | 0.31 | -0.06 | -0.11 | -0.09 | 0.05 | -0.01 |
overall_satisfaction | 0.05 | -0.02 | 0.31 | 1.00 | -0.05 | -0.10 | -0.04 | 0.04 | -0.02 |
accommodates | -0.27 | 0.03 | -0.06 | -0.05 | 1.00 | 0.75 | 0.56 | 0.00 | 0.10 |
bedrooms | -0.27 | 0.05 | -0.11 | -0.10 | 0.75 | 1.00 | 0.53 | -0.01 | 0.11 |
price | -0.33 | -0.09 | -0.09 | -0.04 | 0.56 | 0.53 | 1.00 | 0.01 | 0.04 |
latitude | -0.01 | 0.12 | 0.05 | 0.04 | 0.00 | -0.01 | 0.01 | 1.00 | -0.10 |
longitude | 0.00 | 0.18 | -0.01 | -0.02 | 0.10 | 0.11 | 0.04 | -0.10 | 1.00 |
A couple of observations stand out: overall_satisfaction correlates moderately with reviews (0.31), and bedrooms correlates strongly with accommodates (0.75). Neither is a surprise; both confirm intuition. Furthermore, the dependent variable price shows its highest correlations with accommodates (0.56) and bedrooms (0.53). These indicate the unsurprising relationship between the price of an accommodation and the number of people or bedrooms that the lodging can host.
A simple visualization tool to assess the discriminatory power of the independent variables is the box plot. A box plot visually indicates simple summary statistics of a variable (e.g. median, upper and lower quartiles, minimum and maximum). For example, consider the box plots of the numerical variables in our estimation data.
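A correlation matrix like the one above can be computed with `cor()`. The sketch below uses synthetic listings and assumes that categorical columns such as room_type are converted to their integer level codes first; how the original analysis encoded room_type and neighborhood is not stated, so that step is an assumption.

```r
# Sketch: correlation matrix over numeric and factor-coded columns.
set.seed(7)
n <- 300
bedrooms <- sample(1:4, n, replace = TRUE)
accommodates <- 2 * bedrooms + sample(0:1, n, replace = TRUE)

listings <- data.frame(
  room_type    = factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE)),
  reviews      = rpois(n, 20),
  accommodates = accommodates,
  bedrooms     = bedrooms,
  price        = 50 + 30 * accommodates + rnorm(n, sd = 25)
)

# Factors are replaced by their integer level codes so that cor() accepts them;
# coefficients for such columns depend on the (arbitrary) level ordering.
numeric_listings <- data.frame(lapply(listings, as.numeric))

correlation_matrix <- round(cor(numeric_listings), 2)
print(correlation_matrix)
```

Because level codes are arbitrary for unordered categories, the coefficients involving room_type or neighborhood are best read as rough indicators only.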
The chart helps us to get a first understanding of the numerical variables.
Scatter plots place data points on a horizontal and a vertical axis to show how one variable relates to another. In this case, we plotted price (the main focus of our research) against reviews.
The next scatterplot shows the price versus the neighborhoods. We can see that some neighborhoods have more properties available than others, and we can observe the disparity in pricing.
Next, we show the price versus the satisfaction. Again, some satisfaction scores occur more often than others, and higher prices tend to go along with higher satisfaction levels.
Lastly, we look at the price versus the number of bedrooms. Viewed in isolation, this plot does not provide much information.
We exclude data from neighborhoods that are represented by fewer than 100 listings. Below is the list of neighborhoods and their incidence in our data pool.
This cutoff affects a total of 290 listings in our estimation dataset (1.84%).
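The neighborhood cutoff described above can be sketched with `table()` and a membership filter. The data below is synthetic and the listing counts per neighborhood are invented for illustration; only the threshold of 100 listings comes from the text.

```r
# Sketch: keep only neighborhoods with at least 100 listings.
set.seed(3)
estimation <- data.frame(
  neighborhood = c(rep("Centrum West", 250), rep("Westerpark", 120), rep("Westpoort", 12)),
  price        = round(runif(382, min = 60, max = 300))
)

min_listings <- 100  # threshold from the text

# Number of listings per neighborhood
neighborhood_counts <- table(estimation$neighborhood)
print(sort(neighborhood_counts, decreasing = TRUE))

kept_neighborhoods <- names(neighborhood_counts)[neighborhood_counts >= min_listings]
estimation_filtered <- estimation[estimation$neighborhood %in% kept_neighborhoods, ]

removed <- nrow(estimation) - nrow(estimation_filtered)
cat(sprintf("Removed %d listings (%.2f%%)\n", removed, 100 * removed / nrow(estimation)))
```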
We built three regression models: first a linear model, then a log-linear model and finally a log-linear model with interactions. We iterated over the variables in the models to increase R-squared and to reduce MAPE when running the models on the test data. We also tried to optimize the models with stepwise selection by AIC; however, in most cases this stopped after the first step and did not improve the models much.
Before fitting the models, we excluded the listings with 0 reviews, as their prices might not have been tested by the market and would therefore be unreliable for the price estimation.
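The filtering and fitting procedure described above can be sketched as follows. The data is synthetic and generated from invented coefficients, so the output will not match the reported summaries; the use of `step()` with backward elimination is an assumption about how the AIC search was run, since the text only names "the AIC method".

```r
# Sketch: drop unreviewed listings, fit a log-linear model, run AIC selection.
set.seed(11)
n <- 400
listings <- data.frame(
  room_type    = factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE)),
  reviews      = rpois(n, 5),
  accommodates = sample(1:6, n, replace = TRUE),
  bedrooms     = sample(1:3, n, replace = TRUE)
)
# Synthetic prices with a log-linear structure (coefficients are invented)
listings$price <- exp(
  4.4 - 0.3 * (listings$room_type == "Private room") +
    0.37 * log(listings$accommodates) + 0.25 * log(listings$bedrooms) +
    rnorm(n, sd = 0.3)
)

# Keep only listings with at least one review
PricingData.estimation.non0 <- listings[listings$reviews > 0, ]

log_linear <- lm(
  log(price) ~ room_type + reviews + log(accommodates) + log(bedrooms),
  data = PricingData.estimation.non0
)

# Backward stepwise selection by AIC; trace = 0 suppresses the step log
selected <- step(log_linear, direction = "backward", trace = 0)

print(formula(selected))
print(summary(selected)$adj.r.squared)
```

If `step()` returns the formula it started with, no single term could be dropped without raising the AIC, which is the situation the text describes.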
The linear regression model consistently gave an R-squared less than 0.5.
Most variables seemed significant: room_type, neighborhood, reviews, overall_satisfaction, accommodates and bedrooms.
Call:
lm(formula = scale(price) ~ room_type + neighborhood + reviews +
overall_satisfaction + accommodates + bedrooms, data = PricingData.estimation.non0)
Residuals:
Min 1Q Median 3Q Max
-4.4940 -0.3912 -0.0820 0.2650 8.7611
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.5691983 0.0339631 -46.203 < 2e-16 ***
room_typePrivate room -0.4072260 0.0182220 -22.348 < 2e-16 ***
neighborhoodBuitenveldert / Zuidas 0.0274332 0.0619424 0.443 0.65786
neighborhoodCentrum Oost 0.7348485 0.0341592 21.512 < 2e-16 ***
neighborhoodCentrum West 0.8958976 0.0329601 27.181 < 2e-16 ***
neighborhoodDe Baarsjes / Oud West 0.3228804 0.0307401 10.504 < 2e-16 ***
neighborhoodDe Pijp / Rivierenbuurt 0.4163369 0.0323015 12.889 < 2e-16 ***
neighborhoodGeuzenveld / Slotermeer -0.0891600 0.0710054 -1.256 0.20925
neighborhoodIjburg / Eiland Zeeburg -0.0202940 0.0556883 -0.364 0.71555
neighborhoodNoord-West / Noord-Midden 0.4117225 0.0356388 11.553 < 2e-16 ***
neighborhoodNoord Oost -0.2160796 0.0662973 -3.259 0.00112 **
neighborhoodNoord West -0.2861028 0.0604739 -4.731 2.26e-06 ***
neighborhoodOostelijk Havengebied / Indische Buurt 0.0917014 0.0390737 2.347 0.01895 *
neighborhoodOsdorp -0.0347492 0.0730297 -0.476 0.63421
neighborhoodOud Noord -0.0010251 0.0488631 -0.021 0.98326
neighborhoodOud Oost 0.2295885 0.0366897 6.258 4.03e-10 ***
neighborhoodSlotervaart -0.0041252 0.0537983 -0.077 0.93888
neighborhoodWatergraafsmeer 0.0676739 0.0506158 1.337 0.18124
neighborhoodWesterpark 0.3238754 0.0351195 9.222 < 2e-16 ***
reviews -0.0006414 0.0001978 -3.242 0.00119 **
overall_satisfaction 0.0115545 0.0035883 3.220 0.00128 **
accommodates 0.2551153 0.0075291 33.884 < 2e-16 ***
bedrooms 0.3417779 0.0124274 27.502 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7273 on 12955 degrees of freedom
Multiple R-squared: 0.4719, Adjusted R-squared: 0.4711
F-statistic: 526.3 on 22 and 12955 DF, p-value: < 2.2e-16
The log-linear regression model consistently gave an R-squared between 0.5 and 0.55.
Percentage changes in price were driven by the following variables: room_type, neighborhood, reviews, overall_satisfaction, log(accommodates) and log(bedrooms). I.e., we assumed that there is a marginally decreasing impact of additional guests or bedrooms on the price.
Call:
lm(formula = log(price) ~ room_type + neighborhood + overall_satisfaction +
reviews + log(accommodates) + log(bedrooms), data = PricingData.estimation.non0)
Residuals:
Min 1Q Median 3Q Max
-2.4107 -0.1902 -0.0066 0.1848 2.0021
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.401e+00 1.568e-02 280.733 < 2e-16 ***
room_typePrivate room -3.075e-01 7.939e-03 -38.733 < 2e-16 ***
neighborhoodBuitenveldert / Zuidas 3.002e-02 2.658e-02 1.130 0.258671
neighborhoodCentrum Oost 3.784e-01 1.466e-02 25.810 < 2e-16 ***
neighborhoodCentrum West 4.451e-01 1.415e-02 31.466 < 2e-16 ***
neighborhoodDe Baarsjes / Oud West 1.801e-01 1.319e-02 13.657 < 2e-16 ***
neighborhoodDe Pijp / Rivierenbuurt 2.279e-01 1.386e-02 16.440 < 2e-16 ***
neighborhoodGeuzenveld / Slotermeer -8.705e-02 3.047e-02 -2.857 0.004279 **
neighborhoodIjburg / Eiland Zeeburg 4.605e-02 2.386e-02 1.930 0.053615 .
neighborhoodNoord-West / Noord-Midden 2.261e-01 1.529e-02 14.792 < 2e-16 ***
neighborhoodNoord Oost -9.166e-02 2.844e-02 -3.223 0.001272 **
neighborhoodNoord West -1.388e-01 2.593e-02 -5.352 8.83e-08 ***
neighborhoodOostelijk Havengebied / Indische Buurt 5.328e-02 1.677e-02 3.178 0.001488 **
neighborhoodOsdorp -8.129e-02 3.134e-02 -2.594 0.009493 **
neighborhoodOud Noord 3.268e-02 2.093e-02 1.561 0.118472
neighborhoodOud Oost 1.406e-01 1.574e-02 8.927 < 2e-16 ***
neighborhoodSlotervaart -1.645e-02 2.309e-02 -0.712 0.476332
neighborhoodWatergraafsmeer 5.926e-02 2.172e-02 2.729 0.006366 **
neighborhoodWesterpark 1.847e-01 1.507e-02 12.256 < 2e-16 ***
overall_satisfaction 7.050e-03 1.540e-03 4.578 4.74e-06 ***
reviews -2.838e-04 8.492e-05 -3.343 0.000832 ***
log(accommodates) 3.709e-01 1.087e-02 34.118 < 2e-16 ***
log(bedrooms) 2.569e-01 9.965e-03 25.776 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3121 on 12955 degrees of freedom
Multiple R-squared: 0.5322, Adjusted R-squared: 0.5314
F-statistic: 670 on 22 and 12955 DF, p-value: < 2.2e-16
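Because the dependent variable is log(price), the coefficients in the summary above translate into relative price changes. The sketch below works through two of them using the standard conversions for log-linear models; the coefficient values are copied from the output above, and the variable names are illustrative.

```r
# Sketch: turning log-linear coefficients into percentage effects.
# Coefficients copied from the log-linear summary above.
coef_private_room     <- -0.3075  # room_typePrivate room
coef_log_accommodates <-  0.3709  # log(accommodates)

# Dummy variable in a log(price) model: relative change = exp(beta) - 1
private_room_effect <- exp(coef_private_room) - 1

# Log-log term: multiplying the regressor by k multiplies the price by k^beta
doubling_accommodates_effect <- 2^coef_log_accommodates - 1

cat(sprintf("Private room vs. baseline room type: %.1f%%\n",
            100 * private_room_effect))
cat(sprintf("Doubling accommodates: %+.1f%%\n",
            100 * doubling_accommodates_effect))
```

Holding the other variables fixed, a private room is thus priced roughly 26% below the baseline room type, and doubling the number of guests goes along with a price roughly 29% higher, which reflects the assumed diminishing effect of additional guests.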
The log-linear regression model with interactions also consistently gave an R-squared between 0.5 and 0.55, a slightly better model than the log-linear in most cases.
Percentage changes in price were driven by the following variables: room_type, neighborhood, reviews, overall_satisfaction, log(accommodates), log(bedrooms) and log(accommodates):bedrooms. I.e., we assumed that the price-effect of additional guests depends on the number of bedrooms: staying with 3 other people in the same bedroom is a different experience than 4 guests having a bedroom each.
Call:
lm(formula = log(price) ~ room_type + neighborhood + overall_satisfaction +
reviews + log(accommodates) * bedrooms + log(bedrooms), data = PricingData.estimation.non0)
Residuals:
Min 1Q Median 3Q Max
-2.41025 -0.19177 -0.00554 0.18485 2.00717
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.459e+00 4.297e-02 103.771 < 2e-16 ***
room_typePrivate room -3.139e-01 7.940e-03 -39.530 < 2e-16 ***
neighborhoodBuitenveldert / Zuidas 2.877e-02 2.649e-02 1.086 0.277394
neighborhoodCentrum Oost 3.769e-01 1.461e-02 25.802 < 2e-16 ***
neighborhoodCentrum West 4.429e-01 1.410e-02 31.417 < 2e-16 ***
neighborhoodDe Baarsjes / Oud West 1.785e-01 1.314e-02 13.581 < 2e-16 ***
neighborhoodDe Pijp / Rivierenbuurt 2.268e-01 1.381e-02 16.424 < 2e-16 ***
neighborhoodGeuzenveld / Slotermeer -9.001e-02 3.036e-02 -2.965 0.003033 **
neighborhoodIjburg / Eiland Zeeburg 3.572e-02 2.382e-02 1.499 0.133774
neighborhoodNoord-West / Noord-Midden 2.220e-01 1.524e-02 14.566 < 2e-16 ***
neighborhoodNoord Oost -9.436e-02 2.837e-02 -3.327 0.000881 ***
neighborhoodNoord West -1.441e-01 2.587e-02 -5.571 2.59e-08 ***
neighborhoodOostelijk Havengebied / Indische Buurt 5.158e-02 1.671e-02 3.087 0.002025 **
neighborhoodOsdorp -8.039e-02 3.123e-02 -2.574 0.010056 *
neighborhoodOud Noord 1.753e-02 2.091e-02 0.838 0.401999
neighborhoodOud Oost 1.394e-01 1.569e-02 8.888 < 2e-16 ***
neighborhoodSlotervaart -1.485e-02 2.301e-02 -0.645 0.518793
neighborhoodWatergraafsmeer 5.806e-02 2.166e-02 2.681 0.007343 **
neighborhoodWesterpark 1.816e-01 1.502e-02 12.093 < 2e-16 ***
overall_satisfaction 6.792e-03 1.535e-03 4.425 9.73e-06 ***
reviews -2.870e-04 8.462e-05 -3.392 0.000695 ***
log(accommodates) 2.977e-01 2.005e-02 14.849 < 2e-16 ***
bedrooms -3.120e-02 3.596e-02 -0.868 0.385663
log(bedrooms) 1.841e-01 4.262e-02 4.319 1.58e-05 ***
log(accommodates):bedrooms 5.117e-02 1.118e-02 4.576 4.78e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.311 on 12953 degrees of freedom
Multiple R-squared: 0.5357, Adjusted R-squared: 0.5348
F-statistic: 622.6 on 24 and 12953 DF, p-value: < 2.2e-16
Start: AIC=-30293.53
log(price) ~ room_type + neighborhood + overall_satisfaction +
reviews + log(accommodates) * bedrooms + log(bedrooms)
Df Sum of Sq RSS AIC
<none> 1252.5 -30294
- reviews 1 1.113 1253.7 -30284
- log(bedrooms) 1 1.804 1254.3 -30277
- overall_satisfaction 1 1.893 1254.4 -30276
- log(accommodates):bedrooms 1 2.025 1254.6 -30275
- room_type 1 151.108 1403.7 -28817
- neighborhood 17 264.132 1516.7 -27844
In this step, we ran the models on the testing data, and finally the log-linear-interaction model on the validation data.
The Mean Absolute Percentage Error (MAPE) of a model gives an idea of how well it can forecast; lower is better. As a rule of thumb, a MAPE of 20% or lower indicates a good model, and 10% or lower an excellent one.
By this measure, our log-linear models (with and without interactions) are close to good, with a MAPE between 20% and 25%.
The Mean Absolute Percentage Error for the linear model is 100.19%.
The Mean Absolute Percentage Error for the log-linear model is 23.53%.
The Mean Absolute Percentage Error for the log-linear model with interactions on the test data is 23.35%.
On the validation data, its MAPE is 24.25%.
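The MAPE figures above average the absolute prediction error relative to the actual price. A minimal sketch of that computation follows; the helper `mape()` and the synthetic data are assumptions for illustration. For a model fitted on log(price), the predictions are exponentiated so that the error is measured on the price scale.

```r
# Sketch: Mean Absolute Percentage Error for a log-linear price model.
mape <- function(actual, predicted) {
  # Mean of |actual - predicted| / |actual|, expressed in percent
  100 * mean(abs((actual - predicted) / actual))
}

set.seed(5)
make_listings <- function(n) {
  # Synthetic listings with a log-linear price structure (invented coefficients)
  accommodates <- sample(1:6, n, replace = TRUE)
  data.frame(
    accommodates = accommodates,
    price = exp(4.4 + 0.4 * log(accommodates) + rnorm(n, sd = 0.3))
  )
}
estimation <- make_listings(300)
test       <- make_listings(100)

model <- lm(log(price) ~ log(accommodates), data = estimation)

# The model predicts log(price), so back-transform before comparing
predicted_price <- exp(predict(model, newdata = test))

cat(sprintf("MAPE on the test data: %.2f%%\n", mape(test$price, predicted_price)))
```

MAPE is undefined when an actual value is 0, which is not an issue for listing prices but matters if the measure is applied to a transformed target.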
Looking at the results, our model is able to predict the value of an apartment as a short-term AirBnB rental with an adjusted R-squared of 0.535.
While iterating on our model and tweaking the analysis process, we chose to segment and remove part of the data. For instance, we considered the influence of “reviews” on price: places that have never been booked will be priced less accurately than “mature” properties on the market. At the same time, we tried to minimize the impact of the data exclusion on our total number of data points.
Our purpose in this exercise was to predict the price of an accommodation for a night in Amsterdam through AirBnB. The exercise was meant to entice private owners to list their property on the platform, given our routine’s ability to predict the price the listing would sell for.
Going through the process, we saw shortcomings in the data: we had to filter it and noted some inconsistencies. For example, a rating of “0” could mean either a review score of 0 or the absence of a review (which is usually the case for new properties). As such, we chose to consider only the review ratings above a certain cutoff, in order to minimize the impact on the model accuracy.
As a group, we were able to identify some variables not included in the data which could have been significant: for example, the “premium-ness” of the lodging (equivalent to a hotel’s number of stars) would have a strong impact on price. Other elements were also overlooked in our data, such as the size (Sq-Ft) of the lodging.
On this point, we had difficulty obtaining a high level of accuracy in our predictions: our final adjusted R-squared stayed below the 60% level, at 0.535 for the model reported above.
Our Mean Absolute Percentage Error (23.35%) is approaching the 20% level that indicates a good model, which is encouraging in terms of the quality of our output.
Future models would require a better dataset to provide useful predictions. As a preliminary result, however, we can propose the models shown above.