The Business Context

AirBnB was founded in 2008 by Brian Chesky, Joe Gebbia, and Nathan Blecharczyk as AirBed & Breakfast, an online marketplace and hospitality service for short-term lodging. In recent years, the share of professional hospitality providers has increased significantly and is now crowding out private providers, threatening AirBnB’s value proposition of unique design and a personal touch. In this context, the marketing department wants to run a campaign to attract more private providers. To support it, they asked the analytics department to build a tool that shows potential landlords how much money they could earn by listing their apartments on AirBnB.

Amsterdam was chosen as the pilot because of the “AirBnB friendly” policy of the local regulators and the high number of short-term visitors. The proposed solution, however, is designed to be city-independent, so the process was built to be easily replicable, using R Markdown (.Rmd) files and GitHub.

The Data

(Data source: http://tomslee.net/airbnb-data-collection-get-the-data. We acknowledge the following: All material is copyright Tom Slee, licensed under a Creative Commons Attribution-NonCommercial 2.5 Canada License.)

The data was collected from the official AirBnB website by Tom Slee and is provided as datasets for a large number of cities at different points in time. The dataset considered here contains 18723 entries with 15 independent variables.

Name	Description
room_id	A unique number identifying an AirBnB listing. The listing has a URL on the AirBnB web site of http://airbnb.com/rooms/room_id
host_id	A unique number identifying an AirBnB host. The host's page has a URL on the AirBnB web site of http://airbnb.com/users/show/host_id
room_type	One of Entire home/apt, Private room, Shared room
neighborhood	A subregion of the city or search area for which the survey is carried out (within this dataset: Bijlmer Centrum, Bijlmer Oost, Bos en Lommer, Buitenveldert / Zuidas, Centrum Oost, Centrum West, De Aker / Nieuw Sloten, De Baarsjes / Oud West, De Pijp / Rivierenbuurt, Gaasperdam / Driemond, Geuzenveld / Slotermeer, Ijburg / Eiland Zeeburg, Noord-West / Noord-Midden, Noord Oost, Noord West, Oostelijk Havengebied / Indische Buurt, Osdorp, Oud Noord, Oud Oost, Slotervaart, Watergraafsmeer, Westerpark, Westpoort)
reviews	The number of reviews that a listing has received. As 70% of visits end up with a review, the number of reviews can be used to estimate the number of visits. Note that such an estimate will not be reliable for an individual listing, but over a city as a whole it should be a useful metric of traffic
overall_satisfaction	The average rating (out of five) that the listing has received from those visitors who left a review
accommodates	The number of guests a listing can accommodate
bedrooms	The number of bedrooms a listing offers
minstay	The minimum stay for a visit, as posted by the host
latitude and longitude	The latitude and longitude of the listing as posted on the AirBnB site
last_modified	The date and time that the values were read from the AirBnB web site
price	The price (in USD) for a night stay

Let’s look at the data for a few AirBnB listings. This is what the first 8 of the 18723 rows look like (transposed, for convenience):

	01	02	03	04	05	06	07	08
room_id	10176931	8935871	14011697	6137978	18630616	5790170	934060	19590049
survey_id	1476	1476	1476	1476	1476	1476	1476	1476
host_id	49180562	46718394	10346595	8685430	70191803	29968916	5037506	132687356
room_type	Shared room	Shared room	Shared room	Shared room	Shared room	Shared room	Shared room	Shared room
country
city	Amsterdam	Amsterdam	Amsterdam	Amsterdam	Amsterdam	Amsterdam	Amsterdam	Amsterdam
borough
neighborhood	De Pijp / Rivierenbuurt	Centrum West	Watergraafsmeer	Centrum West	De Baarsjes / Oud West	De Pijp / Rivierenbuurt	Oostelijk Havengebied / Indische Buurt	Westerpark
reviews	7	45	1	7	1	184	67	2
overall_satisfaction	4.5	4.5	0.0	5.0	0.0	4.5	5.0	0.0
accommodates	2	4	3	4	2	2	16	2
bedrooms	1	1	1	1	1	1	1	1
bathrooms
latitude	52.35621	52.37852	52.33881	52.37632	52.37038	52.34226	52.37755	52.37521
longitude	4.887491	4.896120	4.943592	4.890028	4.852873	4.897126	4.930418	4.866117

Overview Process Steps

Step 1 - Prepare and split the data: At the end of this step, three cleaned-up datasets should be ready: one set for estimation, a second set for validation and a last set for testing.

Step 2 - Exploratory Data Analysis: This step builds a feel for the available data. Scatterplots, boxplots and correlation matrices provide insights that help to build a better regression model.

Step 3 - Building a Regression Model: To create the actual model, a suitable algorithm and respective parameters need to be chosen. Steps 3 and 4 are part of an iterative approach that will improve the outcome over time, as the parameters get tweaked.

Step 4 - Validate Prediction Quality: Different methods can be used to determine how well the model predicts prices for a different set of listings (i.e. the validation dataset).

Only after following these steps can the resulting model be used to predict the prices for new listings or the testing dataset. Let’s follow these steps.

Step 1: Prepare and split the data

Shared Rooms

Because only a minor share of all listings are shared rooms (0.34% of all listings in this dataset), the team decided to remove them from the data. This reduces the dataset to 18660 listings.

Invalid Number of Bedrooms

In addition, the dataset contains listings for which the number of bedrooms is set to 0 (1154 or 6.18%). These data points are also removed from the dataset.
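The two cleaning rules above can be sketched as a single filter. This is a minimal illustration in Python rather than the R used for the report; the sample rows and the helper name clean are hypothetical, and only the column names room_type and bedrooms come from the dataset.

```python
# Hypothetical sample rows; only the fields needed for cleaning are shown.
listings = [
    {"room_id": 1, "room_type": "Entire home/apt", "bedrooms": 2},
    {"room_id": 2, "room_type": "Shared room", "bedrooms": 1},
    {"room_id": 3, "room_type": "Private room", "bedrooms": 0},
    {"room_id": 4, "room_type": "Private room", "bedrooms": 1},
]


def clean(rows):
    """Drop shared rooms and listings that report zero bedrooms."""
    return [
        row
        for row in rows
        if row["room_type"] != "Shared room" and row["bedrooms"] > 0
    ]


cleaned = clean(listings)
print([row["room_id"] for row in cleaned])
```

Only listings 1 and 4 survive both rules in this toy sample.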

K-Fold

To validate and test the result of our regression, we split the available data into three subsets: estimation data (90% of the data in our case), validation data (5% of the data) and test data (the remaining 5% of the data). Despite the section title, this is a single three-way split rather than a full k-fold cross-validation. In a more thorough analysis, different or multiple models could be prepared based on the estimation set and an average of these could be used for prediction.

In our case we use 15755 observations in the estimation data, 875 in the validation data, and 876 in the test data.
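A 90/5/5 split of this kind could be sketched as follows. The function name split_indices, the shuffling and the fixed seed are assumptions made for illustration; the report does not state how rows were assigned to the three sets.

```python
import random


def split_indices(n, estimation_share=0.90, validation_share=0.05, seed=42):
    """Shuffle row indices and cut them into estimation, validation and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumption: random assignment
    n_estimation = int(n * estimation_share)
    n_validation = int(n * validation_share)
    estimation = indices[:n_estimation]
    validation = indices[n_estimation:n_estimation + n_validation]
    test = indices[n_estimation + n_validation:]  # whatever remains
    return estimation, validation, test


# 18660 listings minus the 1154 with zero bedrooms leaves 17506 rows.
estimation, validation, test = split_indices(18660 - 1154)
print(len(estimation), len(validation), len(test))  # 15755 875 876
```

Truncating the 90% and 5% shares and giving the remainder to the test set reproduces the set sizes reported above.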

Step 2: Exploratory Data Analysis

We will now examine the data a bit more thoroughly.

Dependent variable distribution

The first high-level analysis looks at the distribution of the dependent variable, price, across the listings. The following chart shows a histogram of all prices in the estimation data:

If the data contains outliers (defined as listings with prices >= 1000 and visualized in red above, i.e. all listings to the right of the vertical line), we exclude these extraordinarily expensive listings. This changes the histogram to the following:

We can see a high concentration of accommodation prices between 100 and 200 USD.
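The outlier rule and the binning behind such a histogram can be sketched as below. The sample prices and the 100 USD bin width are hypothetical; only the cutoff of 1000 comes from the text.

```python
from collections import Counter

PRICE_CUTOFF = 1000  # listings priced at or above this are treated as outliers


def drop_outliers(prices, cutoff=PRICE_CUTOFF):
    """Keep only prices strictly below the cutoff."""
    return [p for p in prices if p < cutoff]


def histogram(prices, bin_width=100):
    """Count prices per bin; each key is the lower edge of its bin."""
    return Counter((int(p) // bin_width) * bin_width for p in prices)


prices = [85, 120, 150, 199, 240, 1000, 4500]  # hypothetical nightly prices
kept = drop_outliers(prices)
counts = histogram(kept)
for lower_edge in sorted(counts):
    print(f"{lower_edge}-{lower_edge + 99}: {counts[lower_edge]}")
```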

Correlation matrix

A correlation matrix is a table showing correlation coefficients between sets of variables. This allows us to identify pairs with higher correlations:

	room_type	neighborhood	reviews	overall_satisfaction	accommodates	bedrooms	price	latitude	longitude
room_type	1.00	-0.03	0.29	0.05	-0.27	-0.27	-0.33	-0.01	0.00
neighborhood	-0.03	1.00	-0.05	-0.02	0.03	0.05	-0.09	0.12	0.18
reviews	0.29	-0.05	1.00	0.31	-0.06	-0.11	-0.09	0.05	-0.01
overall_satisfaction	0.05	-0.02	0.31	1.00	-0.05	-0.10	-0.04	0.04	-0.02
accommodates	-0.27	0.03	-0.06	-0.05	1.00	0.75	0.56	0.00	0.10
bedrooms	-0.27	0.05	-0.11	-0.10	0.75	1.00	0.53	-0.01	0.11
price	-0.33	-0.09	-0.09	-0.04	0.56	0.53	1.00	0.01	0.04
latitude	-0.01	0.12	0.05	0.04	0.00	-0.01	0.01	1.00	-0.10
longitude	0.00	0.18	-0.01	-0.02	0.10	0.11	0.04	-0.10	1.00

A couple of observations stand out: overall_satisfaction correlates moderately with reviews (0.31), and bedrooms correlates strongly with accommodates (0.75). Neither is surprising; both confirm intuition. Furthermore, price, the dependent variable, correlates more strongly with accommodates (0.56) and bedrooms (0.53) than with any other variable. This reflects the unsurprising relationship between the price of an accommodation and the number of guests or bedrooms it offers.
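Each entry of the matrix above is a Pearson correlation coefficient. A minimal sketch of the calculation for one pair of columns, using hypothetical values rather than the actual dataset, could look like this:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical values for two columns of the dataset.
accommodates = [2, 4, 3, 4, 2, 6]
bedrooms = [1, 2, 1, 2, 1, 3]
print(round(pearson(accommodates, bedrooms), 2))
```

Note that the coefficient is undefined when one of the columns is constant, since its variance is then zero.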

Boxplots for Numerical Variables

Box plots are a simple visualization tool for assessing the discriminatory power of the independent variables. A box plot visually indicates simple summary statistics of a variable (e.g. mean, median, top and bottom quantiles, min, max, etc.). For example, consider the box plots of the numerical variables in our estimation data.

The chart helps us to get a first understanding of the numerical variables.
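The summary statistics behind a box plot can be computed directly. The sketch below uses the reviews values of the first eight rows shown earlier; the helper name box_plot_summary and the choice of the "inclusive" quartile method are assumptions, and plotting libraries may use slightly different quartile conventions.

```python
from statistics import mean, quantiles


def box_plot_summary(values):
    """Return the statistics a box plot typically displays."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": mean(values),
    }


reviews = [7, 45, 1, 7, 1, 184, 67, 2]  # reviews of the first 8 rows shown earlier
print(box_plot_summary(reviews))
```

For these eight values the median (7) lies far below the mean (39.25), which is the kind of skew a box plot makes visible at a glance.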

Scatterplots

Scatter plots place data points on a horizontal and a vertical axis to show how strongly one variable is related to another. In this case, we plotted Price vs. Reviews (which is the main focus of our research).

The next scatterplot shows the price versus the neighborhoods. We can see that some neighborhoods have more properties available than others, and we can observe the disparity in pricing.

Next, we show the price versus the satisfaction. Again, we can see that some satisfaction scores occur more often than others, and that higher prices tend to go along with higher satisfaction levels.

Lastly, we look at the price versus the number of bedrooms. In isolation, this does not provide much information.

Neighborhoods

We exclude data from neighborhoods that are represented by fewer than 100 listings. Below is the list of neighborhoods and their incidence in our data pool.

This cutoff affects a total of 290 listings in our estimation dataset (1.84%).
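The neighborhood cutoff can be sketched as a count followed by a filter. The sample rows and the helper name drop_small_neighborhoods are hypothetical; the threshold of 100 listings comes from the text.

```python
from collections import Counter


def drop_small_neighborhoods(rows, min_listings=100):
    """Keep only rows whose neighborhood has at least min_listings listings."""
    counts = Counter(row["neighborhood"] for row in rows)
    return [row for row in rows if counts[row["neighborhood"]] >= min_listings]


# Hypothetical sample: one well-represented and one sparse neighborhood.
rows = (
    [{"neighborhood": "Centrum West"}] * 120
    + [{"neighborhood": "Westpoort"}] * 40
)
kept_rows = drop_small_neighborhoods(rows)
print(len(rows) - len(kept_rows), "listings removed")
```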

Step 3: Building a Regression Model

We built three regression models: first a linear model, then a log-linear model and finally a log-linear model with interactions. We iterated over the variables in the models to increase R-squared and to reduce MAPE when running the models on the test data. We also tried to optimize the models with stepwise AIC selection; in most cases, however, the procedure stopped after the first step without removing a variable, so it did not improve the models much.

Before fitting the models, we excluded the listings with 0 reviews, as their prices might not have been tested by the market and could therefore distort the price estimation.

Linear Regression

The linear regression model consistently gave an R-squared less than 0.5.

Most variables seemed significant: room_type, neighborhood, reviews, overall_satisfaction, accommodates and bedrooms.


Call:
lm(formula = scale(price) ~ room_type + neighborhood + reviews + 
    overall_satisfaction + accommodates + bedrooms, data = PricingData.estimation.non0)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4940 -0.3912 -0.0820  0.2650  8.7611 

Coefficients:
                                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                        -1.5691983  0.0339631 -46.203  < 2e-16 ***
room_typePrivate room                              -0.4072260  0.0182220 -22.348  < 2e-16 ***
neighborhoodBuitenveldert / Zuidas                  0.0274332  0.0619424   0.443  0.65786    
neighborhoodCentrum Oost                            0.7348485  0.0341592  21.512  < 2e-16 ***
neighborhoodCentrum West                            0.8958976  0.0329601  27.181  < 2e-16 ***
neighborhoodDe Baarsjes / Oud West                  0.3228804  0.0307401  10.504  < 2e-16 ***
neighborhoodDe Pijp / Rivierenbuurt                 0.4163369  0.0323015  12.889  < 2e-16 ***
neighborhoodGeuzenveld / Slotermeer                -0.0891600  0.0710054  -1.256  0.20925    
neighborhoodIjburg / Eiland Zeeburg                -0.0202940  0.0556883  -0.364  0.71555    
neighborhoodNoord-West / Noord-Midden               0.4117225  0.0356388  11.553  < 2e-16 ***
neighborhoodNoord Oost                             -0.2160796  0.0662973  -3.259  0.00112 ** 
neighborhoodNoord West                             -0.2861028  0.0604739  -4.731 2.26e-06 ***
neighborhoodOostelijk Havengebied / Indische Buurt  0.0917014  0.0390737   2.347  0.01895 *  
neighborhoodOsdorp                                 -0.0347492  0.0730297  -0.476  0.63421    
neighborhoodOud Noord                              -0.0010251  0.0488631  -0.021  0.98326    
neighborhoodOud Oost                                0.2295885  0.0366897   6.258 4.03e-10 ***
neighborhoodSlotervaart                            -0.0041252  0.0537983  -0.077  0.93888    
neighborhoodWatergraafsmeer                         0.0676739  0.0506158   1.337  0.18124    
neighborhoodWesterpark                              0.3238754  0.0351195   9.222  < 2e-16 ***
reviews                                            -0.0006414  0.0001978  -3.242  0.00119 ** 
overall_satisfaction                                0.0115545  0.0035883   3.220  0.00128 ** 
accommodates                                        0.2551153  0.0075291  33.884  < 2e-16 ***
bedrooms                                            0.3417779  0.0124274  27.502  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7273 on 12955 degrees of freedom
Multiple R-squared:  0.4719,    Adjusted R-squared:  0.4711 
F-statistic: 526.3 on 22 and 12955 DF,  p-value: < 2.2e-16
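As a cross-check of the summary above, the adjusted R-squared can be recomputed from the multiple R-squared, the number of observations and the number of predictors. The sketch below assumes the usual formula for a model with an intercept; the number of observations is derived from the reported 12955 residual degrees of freedom plus the 23 estimated coefficients.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


n_predictors = 22                 # slope coefficients in the summary above
n_obs = 12955 + n_predictors + 1  # residual df + slopes + intercept = 12978
print(round(adjusted_r_squared(0.4719, n_obs, n_predictors), 4))
```

The result agrees with the reported 0.4711 up to the rounding of the printed R-squared. With close to 13000 observations and only 22 predictors, the adjustment is negligible.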

Log-Linear Regression

The log-linear regression model consistently gave an R-squared between 0.5 and 0.55.

Percentage changes in price were driven by the following variables: room_type, neighborhood, reviews, overall_satisfaction, log(accommodates) and log(bedrooms). That is, we assumed that each additional guest or bedroom has a diminishing marginal impact on the price.


Call:
lm(formula = log(price) ~ room_type + neighborhood + overall_satisfaction + 
    reviews + log(accommodates) + log(bedrooms), data = PricingData.estimation.non0)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4107 -0.1902 -0.0066  0.1848  2.0021 

Coefficients:
                                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                         4.401e+00  1.568e-02 280.733  < 2e-16 ***
room_typePrivate room                              -3.075e-01  7.939e-03 -38.733  < 2e-16 ***
neighborhoodBuitenveldert / Zuidas                  3.002e-02  2.658e-02   1.130 0.258671    
neighborhoodCentrum Oost                            3.784e-01  1.466e-02  25.810  < 2e-16 ***
neighborhoodCentrum West                            4.451e-01  1.415e-02  31.466  < 2e-16 ***
neighborhoodDe Baarsjes / Oud West                  1.801e-01  1.319e-02  13.657  < 2e-16 ***
neighborhoodDe Pijp / Rivierenbuurt                 2.279e-01  1.386e-02  16.440  < 2e-16 ***
neighborhoodGeuzenveld / Slotermeer                -8.705e-02  3.047e-02  -2.857 0.004279 ** 
neighborhoodIjburg / Eiland Zeeburg                 4.605e-02  2.386e-02   1.930 0.053615 .  
neighborhoodNoord-West / Noord-Midden               2.261e-01  1.529e-02  14.792  < 2e-16 ***
neighborhoodNoord Oost                             -9.166e-02  2.844e-02  -3.223 0.001272 ** 
neighborhoodNoord West                             -1.388e-01  2.593e-02  -5.352 8.83e-08 ***
neighborhoodOostelijk Havengebied / Indische Buurt  5.328e-02  1.677e-02   3.178 0.001488 ** 
neighborhoodOsdorp                                 -8.129e-02  3.134e-02  -2.594 0.009493 ** 
neighborhoodOud Noord                               3.268e-02  2.093e-02   1.561 0.118472    
neighborhoodOud Oost                                1.406e-01  1.574e-02   8.927  < 2e-16 ***
neighborhoodSlotervaart                            -1.645e-02  2.309e-02  -0.712 0.476332    
neighborhoodWatergraafsmeer                         5.926e-02  2.172e-02   2.729 0.006366 ** 
neighborhoodWesterpark                              1.847e-01  1.507e-02  12.256  < 2e-16 ***
overall_satisfaction                                7.050e-03  1.540e-03   4.578 4.74e-06 ***
reviews                                            -2.838e-04  8.492e-05  -3.343 0.000832 ***
log(accommodates)                                   3.709e-01  1.087e-02  34.118  < 2e-16 ***
log(bedrooms)                                       2.569e-01  9.965e-03  25.776  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3121 on 12955 degrees of freedom
Multiple R-squared:  0.5322,    Adjusted R-squared:  0.5314 
F-statistic:   670 on 22 and 12955 DF,  p-value: < 2.2e-16
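Because the dependent variable is log(price), the coefficients above translate into percentage effects. The sketch below shows the standard conversions for a dummy variable and for a logged regressor; the function names are hypothetical, and the two coefficients are copied from the summary above.

```python
from math import exp


def percent_effect_of_dummy(coefficient):
    """Percentage price change when a 0/1 variable switches from 0 to 1."""
    return (exp(coefficient) - 1) * 100


def percent_effect_of_scaling(elasticity, factor):
    """Percentage price change when a logged regressor is multiplied by factor."""
    return (factor ** elasticity - 1) * 100


# Coefficients taken from the log-linear summary above.
private_room = percent_effect_of_dummy(-0.3075)
double_guests = percent_effect_of_scaling(0.3709, 2)
print(f"Private room vs. baseline room type: {private_room:.1f}%")
print(f"Doubling accommodates: {double_guests:+.1f}%")
```

Under this model, a private room is priced roughly 26% below the baseline room type, and doubling the number of guests raises the price by roughly 29%, all else being equal.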

Log-Linear Regression with Interactions

The log-linear regression model with interactions also consistently gave an R-squared between 0.5 and 0.55, a slightly better model than the log-linear in most cases.

Percentage changes in price were driven by the following variables: room_type, neighborhood, reviews, overall_satisfaction, log(accommodates), log(bedrooms) and log(accommodates):bedrooms. That is, we assumed that the price effect of additional guests depends on the number of bedrooms: sharing one bedroom with 3 other people is a different experience from 4 guests having a bedroom each.


Call:
lm(formula = log(price) ~ room_type + neighborhood + overall_satisfaction + 
    reviews + log(accommodates) * bedrooms + log(bedrooms), data = PricingData.estimation.non0)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.41025 -0.19177 -0.00554  0.18485  2.00717 

Coefficients:
                                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                         4.459e+00  4.297e-02 103.771  < 2e-16 ***
room_typePrivate room                              -3.139e-01  7.940e-03 -39.530  < 2e-16 ***
neighborhoodBuitenveldert / Zuidas                  2.877e-02  2.649e-02   1.086 0.277394    
neighborhoodCentrum Oost                            3.769e-01  1.461e-02  25.802  < 2e-16 ***
neighborhoodCentrum West                            4.429e-01  1.410e-02  31.417  < 2e-16 ***
neighborhoodDe Baarsjes / Oud West                  1.785e-01  1.314e-02  13.581  < 2e-16 ***
neighborhoodDe Pijp / Rivierenbuurt                 2.268e-01  1.381e-02  16.424  < 2e-16 ***
neighborhoodGeuzenveld / Slotermeer                -9.001e-02  3.036e-02  -2.965 0.003033 ** 
neighborhoodIjburg / Eiland Zeeburg                 3.572e-02  2.382e-02   1.499 0.133774    
neighborhoodNoord-West / Noord-Midden               2.220e-01  1.524e-02  14.566  < 2e-16 ***
neighborhoodNoord Oost                             -9.436e-02  2.837e-02  -3.327 0.000881 ***
neighborhoodNoord West                             -1.441e-01  2.587e-02  -5.571 2.59e-08 ***
neighborhoodOostelijk Havengebied / Indische Buurt  5.158e-02  1.671e-02   3.087 0.002025 ** 
neighborhoodOsdorp                                 -8.039e-02  3.123e-02  -2.574 0.010056 *  
neighborhoodOud Noord                               1.753e-02  2.091e-02   0.838 0.401999    
neighborhoodOud Oost                                1.394e-01  1.569e-02   8.888  < 2e-16 ***
neighborhoodSlotervaart                            -1.485e-02  2.301e-02  -0.645 0.518793    
neighborhoodWatergraafsmeer                         5.806e-02  2.166e-02   2.681 0.007343 ** 
neighborhoodWesterpark                              1.816e-01  1.502e-02  12.093  < 2e-16 ***
overall_satisfaction                                6.792e-03  1.535e-03   4.425 9.73e-06 ***
reviews                                            -2.870e-04  8.462e-05  -3.392 0.000695 ***
log(accommodates)                                   2.977e-01  2.005e-02  14.849  < 2e-16 ***
bedrooms                                           -3.120e-02  3.596e-02  -0.868 0.385663    
log(bedrooms)                                       1.841e-01  4.262e-02   4.319 1.58e-05 ***
log(accommodates):bedrooms                          5.117e-02  1.118e-02   4.576 4.78e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.311 on 12953 degrees of freedom
Multiple R-squared:  0.5357,    Adjusted R-squared:  0.5348 
F-statistic: 622.6 on 24 and 12953 DF,  p-value: < 2.2e-16
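To illustrate how the fitted model could feed a pricing tool, the sketch below plugs the coefficients from the summary above into the regression equation for one hypothetical listing. The function name, the listing itself and the subset of neighborhoods are assumptions; exponentiating the predicted log(price) is the simplest back-transformation and ignores any retransformation correction.

```python
from math import exp, log

# Coefficients copied from the summary above (log-linear model with interactions).
COEF = {
    "intercept": 4.459,
    "private_room": -0.3139,
    "overall_satisfaction": 0.006792,
    "reviews": -0.000287,
    "log_accommodates": 0.2977,
    "bedrooms": -0.0312,
    "log_bedrooms": 0.1841,
    "log_accommodates_x_bedrooms": 0.05117,
}
# A few neighborhood effects; the remaining ones can be added from the table.
NEIGHBORHOOD = {"Centrum West": 0.4429, "Centrum Oost": 0.3769, "Westerpark": 0.1816}


def predict_price(neighborhood, private_room, satisfaction, reviews,
                  accommodates, bedrooms):
    """Predicted nightly price in USD for a single listing."""
    log_price = (
        COEF["intercept"]
        + NEIGHBORHOOD[neighborhood]
        + COEF["private_room"] * (1 if private_room else 0)
        + COEF["overall_satisfaction"] * satisfaction
        + COEF["reviews"] * reviews
        + COEF["log_accommodates"] * log(accommodates)
        + COEF["bedrooms"] * bedrooms
        + COEF["log_bedrooms"] * log(bedrooms)
        + COEF["log_accommodates_x_bedrooms"] * log(accommodates) * bedrooms
    )
    return exp(log_price)


# Hypothetical listing: entire home in Centrum West, 2 guests, 1 bedroom.
price = predict_price("Centrum West", False, 4.5, 10, 2, 1)
print(round(price))
```

For this hypothetical listing the model predicts about 171 USD per night; the same listing offered as a private room comes out lower because of the negative room_type coefficient.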

Start:  AIC=-30293.53
log(price) ~ room_type + neighborhood + overall_satisfaction + 
    reviews + log(accommodates) * bedrooms + log(bedrooms)

                             Df Sum of Sq    RSS    AIC
<none>                                    1252.5 -30294
- reviews                     1     1.113 1253.7 -30284
- log(bedrooms)               1     1.804 1254.3 -30277
- overall_satisfaction        1     1.893 1254.4 -30276
- log(accommodates):bedrooms  1     2.025 1254.6 -30275
- room_type                   1   151.108 1403.7 -28817
- neighborhood               17   264.132 1516.7 -27844
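The AIC values in the table above can be reproduced, up to rounding, from the residual sum of squares. The sketch assumes the form n * log(RSS / n) + 2 * k, which is the quantity R's step() reports for linear models, with k the number of estimated coefficients.

```python
from math import log


def aic_from_rss(rss, n_obs, n_params):
    """AIC of a linear model, up to an additive constant."""
    return n_obs * log(rss / n_obs) + 2 * n_params


n_obs = 12953 + 25  # residual degrees of freedom + 25 estimated coefficients
aic_full = aic_from_rss(1252.5, n_obs, 25)           # row "<none>"
aic_without_reviews = aic_from_rss(1253.7, n_obs, 24)  # row "- reviews"
print(round(aic_full), round(aic_without_reviews))
```

Both values land within one unit of the table; the small gap comes from RSS being printed to one decimal. Since dropping any term raises the AIC, the stepwise search keeps the full model, which is why it stops after the first step.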


Step 4: Validate Prediction Quality

In this step, we ran the models on the testing data, and finally the log-linear-interaction model on the validation data.

The Mean Absolute Percentage Error (MAPE) of a model indicates how well it can forecast; lower is better. A figure of 20% or lower indicates a good model, and 10% or lower an excellent one.

Judged by MAPE, our log-linear models (with and without interactions) are close to good, with a MAPE between 20% and 25%.
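For reference, MAPE is the average of the absolute prediction errors, each expressed as a percentage of the actual value. A minimal sketch with hypothetical prices:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)


# Hypothetical actual and predicted nightly prices in USD.
actual = [100, 200, 150]
predicted = [120, 180, 150]
print(f"{mape(actual, predicted):.1f}%")
```

The three errors here are 20%, 10% and 0%, giving a MAPE of 10%. For a model fitted on log(price), the predictions have to be transformed back to prices before the errors are computed.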

Linear Regression

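The Mean Absolute Percentage Error for this model is 100.19%. Note that this model was fitted to scale(price), so its predictions are in standardized units; if they were compared with unscaled prices, a MAPE close to 100% would be expected, so this figure is probably not comparable with the two below.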

Log-Linear Regression

The Mean Absolute Percentage Error for this model is 23.53%.

Log-Linear Regression with Interactions

The Mean Absolute Percentage Error for this model on the test data is 23.35%.

The MAPE on the validation data is 24.25%.

Results Analysis

Looking at the results, our model is able to predict the price an apartment can achieve through short-term AirBnB rentals with an adjusted R-squared of 0.535.

While iterating on our model and tweaking the analysis process, we chose to segment the data and remove part of it. For instance, we considered the influence of “reviews” on price: places that have never been booked will be priced less accurately than “mature” properties on the market. With this in mind, we tried to minimize the impact of the data exclusion on our total number of data points.

Conclusion

Our purpose in this exercise was to predict the price of an accommodation for a night in Amsterdam through AirBnB. The exercise was meant to entice private owners to list their property on the platform, given our routine’s ability to predict the price the listing would sell for.

Going through the process, we found shortcomings in the data: we had to filter it and noted some inconsistencies. For example, a rating of “0” could mean either a review score of 0 or the absence of a review (which is usually the case for new properties). As such, we chose to consider only the review ratings above a certain cutoff, in order to minimize the impact on the model accuracy.

As a group, we were able to identify some variables not included in the data which could have been significant: for example, the “premium-ness” of the lodging (equivalent to a hotel’s number of stars) would have a strong impact on price. Other elements were also missing from our data, such as the size (square footage) of the lodging.

So far, we have had difficulty achieving a high level of accuracy in our predictions: our final adjusted R-squared settled at about 0.53, so the model explains only a little over half of the variance in log(price).

Our Mean Absolute Percentage Error (23.35%) is approaching the 20% threshold for a good model, which is encouraging in terms of the quality of our output.

Future models would require a better dataset to provide useful predictions. As a preliminary result, however, we can propose the above shown models.

AirBnB Pricing Tool

Team R

February 10, 2018
