We would like to understand who would be the most successful participant in speed dating, and what the key drivers are behind people’s selection decisions. In addition, we aim to assess the impact of segmentation on the effectiveness of classification methods (i.e., their hit ratios) and to apply machine learning in practice (via the random forests method).
This is a follow-up to the segmentation analysis conducted earlier. Below are 5 sample observations (shown as columns) out of the total of 4,299:
 | Obs 1 | Obs 2 | Obs 3 | Obs 4 | Obs 5 |
---|---|---|---|---|---|
attr_o | 6 | 6 | 7 | 8 | 6 |
sinc_o | 7 | 5 | 7 | 8 | 6 |
intel_o | 8 | 10 | 7 | 9 | 7 |
fun_o | 7 | 6 | 9 | 8 | 7 |
amb_o | 7 | 6 | 9 | 8 | 8 |
shar_o | 5 | 5 | 9 | 9 | 7 |
field_cd | 1 | 1 | 1 | 1 | 1 |
race | 2 | 2 | 2 | 2 | 2 |
goal | 1 | 1 | 1 | 1 | 1 |
date | 5 | 5 | 5 | 5 | 5 |
go_out | 1 | 1 | 1 | 1 | 1 |
career_c | 1 | 1 | 1 | 1 | 1 |
sports | 1 | 1 | 1 | 1 | 1 |
tvsports | 1 | 1 | 1 | 1 | 1 |
exercise | 6 | 6 | 6 | 6 | 6 |
dining | 7 | 7 | 7 | 7 | 7 |
museums | 6 | 6 | 6 | 6 | 6 |
art | 7 | 7 | 7 | 7 | 7 |
hiking | 7 | 7 | 7 | 7 | 7 |
gaming | 5 | 5 | 5 | 5 | 5 |
clubbing | 7 | 7 | 7 | 7 | 7 |
reading | 7 | 7 | 7 | 7 | 7 |
tv | 7 | 7 | 7 | 7 | 7 |
theater | 9 | 9 | 9 | 9 | 9 |
movies | 7 | 7 | 7 | 7 | 7 |
concerts | 8 | 8 | 8 | 8 | 8 |
music | 7 | 7 | 7 | 7 | 7 |
shopping | 1 | 1 | 1 | 1 | 1 |
yoga | 8 | 8 | 8 | 8 | 8 |
We followed the classification approach explained in class, adding one machine learning method (random forests). Let’s follow these steps.
We split the data into three samples: estimation_data (80% of the data in our case), validation_data (10% of the data), and test_data (the remaining 10%).
In our case we use 3439 observations in the estimation data, 430 in the validation data, and 430 in the test data.
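The analysis itself appears to have been done in R; as an illustrative sketch only, the split above can be reproduced in Python (the helper name `split_data` and the seed are our own assumptions):

```python
import random

def split_data(rows, seed=42):
    """Shuffle and split observations into 80% estimation,
    10% validation, and 10% test samples."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    n = len(rows)
    n_est = round(0.8 * n)
    n_val = round(0.1 * n)
    estimation = rows[:n_est]
    validation = rows[n_est:n_est + n_val]
    test = rows[n_est + n_val:]
    return estimation, validation, test

# With 4,299 observations this yields 3439 / 430 / 430,
# matching the sample sizes used in the analysis.
est, val, test = split_data(range(4299))
```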
Our dependent variable is dec_o, which states whether a given subject was selected by their partner. The numbers of 1s and 0s in our estimation sample are as follows:
 | Class 1 | Class 0 |
---|---|---|
# of Observations | 1490 | 1949 |
while in the validation sample they are:
 | Class 1 | Class 0 |
---|---|---|
# of Observations | 203 | 227 |
Below are the statistics of our independent variables across the two classes. First, class 1, “selected”:
 | min | 25th percentile | median | mean | 75th percentile | max | std |
---|---|---|---|---|---|---|---|
attr_o | 1 | 6 | 7 | 7.27 | 8 | 10 | 1.53 |
sinc_o | 0 | 7 | 8 | 7.65 | 9 | 10 | 1.46 |
intel_o | 3 | 7 | 8 | 7.73 | 9 | 10 | 1.29 |
fun_o | 0 | 6 | 7 | 7.26 | 8 | 10 | 1.54 |
amb_o | 2 | 6 | 7 | 7.11 | 8 | 10 | 1.56 |
shar_o | 0 | 5 | 7 | 6.48 | 8 | 10 | 1.81 |
field_cd | 1 | 3 | 8 | 7.21 | 10 | 17 | 3.90 |
race | 1 | 2 | 2 | 2.69 | 4 | 6 | 1.23 |
goal | 1 | 1 | 2 | 2.08 | 2 | 6 | 1.38 |
date | 1 | 4 | 5 | 4.89 | 6 | 7 | 1.46 |
go_out | 1 | 1 | 2 | 2.02 | 3 | 7 | 1.00 |
career_c | 1 | 2 | 5 | 5.03 | 7 | 17 | 3.22 |
sports | 1 | 5 | 7 | 6.50 | 9 | 10 | 2.62 |
tvsports | 1 | 2 | 4 | 4.47 | 7 | 10 | 2.76 |
exercise | 1 | 5 | 7 | 6.31 | 8 | 10 | 2.50 |
dining | 3 | 7 | 8 | 7.77 | 9 | 10 | 1.75 |
museums | 1 | 6 | 7 | 6.95 | 8 | 10 | 2.02 |
art | 1 | 5 | 7 | 6.70 | 8 | 10 | 2.28 |
hiking | 0 | 4 | 6 | 5.81 | 8 | 10 | 2.53 |
gaming | 1 | 1 | 4 | 3.73 | 5 | 10 | 2.36 |
clubbing | 1 | 4 | 6 | 5.78 | 8 | 10 | 2.51 |
reading | 1 | 7 | 8 | 7.68 | 9 | 10 | 1.89 |
tv | 1 | 3 | 5 | 5.09 | 7 | 10 | 2.52 |
theater | 1 | 5 | 7 | 6.80 | 9 | 10 | 2.27 |
movies | 2 | 7 | 8 | 7.92 | 9 | 10 | 1.69 |
concerts | 1 | 6 | 7 | 6.92 | 9 | 10 | 2.12 |
music | 1 | 7 | 8 | 7.88 | 9 | 10 | 1.76 |
shopping | 1 | 4 | 6 | 5.67 | 8 | 10 | 2.61 |
yoga | 1 | 2 | 4 | 4.39 | 7 | 10 | 2.70 |
and class 0, “not selected”:
 | min | 25th percentile | median | mean | 75th percentile | max | std |
---|---|---|---|---|---|---|---|
attr_o | 0 | 4 | 6 | 5.37 | 7 | 10 | 1.77 |
sinc_o | 0 | 6 | 7 | 6.85 | 8 | 10 | 1.86 |
intel_o | 0 | 6 | 7 | 7.06 | 8 | 10 | 1.60 |
fun_o | 0 | 5 | 6 | 5.65 | 7 | 10 | 1.96 |
amb_o | 0 | 5 | 7 | 6.50 | 8 | 10 | 1.83 |
shar_o | 0 | 3 | 5 | 4.74 | 6 | 10 | 2.05 |
field_cd | 1 | 5 | 8 | 7.24 | 10 | 17 | 3.66 |
race | 1 | 2 | 2 | 2.75 | 4 | 6 | 1.20 |
goal | 1 | 1 | 2 | 2.26 | 3 | 6 | 1.51 |
date | 1 | 4 | 5 | 5.06 | 6 | 7 | 1.40 |
go_out | 1 | 1 | 2 | 2.23 | 3 | 7 | 1.18 |
career_c | 1 | 2 | 5 | 4.94 | 7 | 17 | 3.20 |
sports | 1 | 4 | 7 | 6.29 | 9 | 10 | 2.69 |
tvsports | 1 | 2 | 4 | 4.55 | 7 | 10 | 2.87 |
exercise | 1 | 4 | 6 | 6.00 | 8 | 10 | 2.53 |
dining | 1 | 7 | 8 | 7.66 | 9 | 10 | 1.85 |
museums | 1 | 5 | 7 | 6.90 | 8 | 10 | 2.10 |
art | 1 | 5 | 7 | 6.62 | 8 | 10 | 2.25 |
hiking | 0 | 3 | 6 | 5.69 | 8 | 10 | 2.68 |
gaming | 1 | 1 | 4 | 3.86 | 6 | 10 | 2.47 |
clubbing | 1 | 4 | 6 | 5.46 | 7 | 10 | 2.37 |
reading | 1 | 7 | 8 | 7.71 | 9 | 10 | 1.98 |
tv | 1 | 3 | 6 | 5.27 | 7 | 10 | 2.43 |
theater | 1 | 5 | 7 | 6.90 | 9 | 10 | 2.09 |
movies | 2 | 7 | 8 | 8.00 | 9 | 10 | 1.56 |
concerts | 1 | 6 | 7 | 6.82 | 8 | 10 | 2.15 |
music | 1 | 7 | 8 | 7.74 | 9 | 10 | 1.78 |
shopping | 1 | 3 | 5 | 5.42 | 8 | 10 | 2.64 |
yoga | 1 | 2 | 4 | 4.33 | 7 | 10 | 2.73 |
A simple visualization of the values is presented below using box plots, which indicate the summary statistics of an independent variable (median, top and bottom quartiles, min, max, etc.). For example, for class 0
and class 1:
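The statistics a box plot displays are the five-number summary just described. A minimal sketch of computing them with Python's standard library (the helper name `box_stats` is ours, not part of the original analysis):

```python
import statistics

def box_stats(values):
    """Five-number summary underlying a box plot,
    plus the mean and sample standard deviation."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles
    return {
        "min": min(values), "q1": q1, "median": q2,
        "q3": q3, "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
    }

# Example with made-up ratings on the 0-10 scale used in the data:
print(box_stats([4, 6, 7, 7, 8, 9, 10]))
```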
For our assignment, we used three classification methods: logistic regression, classification and regression trees (CART), and a machine learning method (random forests).
Running a basic CART model with complexity parameter cp = 0.01 leads to the following tree, where the key decision criteria are explained by the following table.
Attribute | Name |
---|---|
IV1 | attr_o |
IV4 | fun_o |
IV6 | shar_o |
IV5 | amb_o |
IV2 | sinc_o |
IV3 | intel_o |
IV27 | music |
For example, this is how the tree looks if we set cp = 0.005:
Attribute | Name |
---|---|
IV1 | attr_o |
IV4 | fun_o |
IV6 | shar_o |
IV5 | amb_o |
IV2 | sinc_o |
IV3 | intel_o |
IV21 | clubbing |
IV16 | dining |
IV27 | music |
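CART grows such trees by repeatedly choosing the split that most reduces class impurity, which is presumably why attr_o sits at the top of both trees. A minimal sketch of how one split threshold would be chosen on a single variable using Gini impurity (toy data and helper names are ours, not the assignment's actual R/rpart code):

```python
def gini(labels):
    """Gini impurity of a set of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Find the threshold on one variable that minimizes the
    weighted Gini impurity of the two resulting child nodes."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Toy example: higher attractiveness ratings tend to be selected.
attr = [3, 4, 5, 6, 7, 8, 9, 10]
dec  = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(attr, dec)
print(threshold, impurity)  # threshold 7 splits the toy data perfectly
```

A real CART implementation repeats this search over every variable at every node, then prunes the tree back using the complexity parameter cp.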
Below we present the probability that each validation observation belongs to class 1. For the first few validation observations, using the first CART above, the probabilities are:
 | Actual Class | Probability of Class 1 |
---|---|---|
Obs 1 | 1 | 0.61 |
Obs 2 | 1 | 0.22 |
Obs 3 | 0 | 0.22 |
Obs 4 | 0 | 0.22 |
Obs 5 | 1 | 0.22 |
Logistic regression is a method similar to linear regression, except that the dependent variable can be discrete (e.g. 0 or 1). It estimates the coefficients of a linear model in the selected independent variables while optimizing a classification criterion. For example, these are the logistic regression parameters for our data:
 | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | -5.1 | 0.5 | -9.8 | 0.0 |
attr_o | 0.6 | 0.0 | 17.5 | 0.0 |
sinc_o | -0.1 | 0.0 | -1.7 | 0.1 |
intel_o | 0.0 | 0.0 | 0.2 | 0.8 |
fun_o | 0.2 | 0.0 | 7.1 | 0.0 |
amb_o | -0.2 | 0.0 | -4.7 | 0.0 |
shar_o | 0.3 | 0.0 | 10.1 | 0.0 |
field_cd | 0.0 | 0.0 | -1.4 | 0.2 |
race | 0.1 | 0.0 | 1.7 | 0.1 |
goal | 0.0 | 0.0 | -1.4 | 0.2 |
date | 0.0 | 0.0 | 0.5 | 0.6 |
go_out | -0.1 | 0.0 | -1.3 | 0.2 |
career_c | 0.0 | 0.0 | -0.2 | 0.9 |
sports | 0.0 | 0.0 | 0.8 | 0.4 |
tvsports | 0.0 | 0.0 | -1.7 | 0.1 |
exercise | 0.0 | 0.0 | 1.0 | 0.3 |
dining | 0.0 | 0.0 | -0.9 | 0.4 |
museums | 0.0 | 0.0 | -0.4 | 0.7 |
art | 0.0 | 0.0 | 0.8 | 0.4 |
hiking | 0.0 | 0.0 | 1.1 | 0.3 |
gaming | 0.0 | 0.0 | -1.5 | 0.1 |
clubbing | 0.0 | 0.0 | -1.0 | 0.3 |
reading | 0.0 | 0.0 | 0.0 | 1.0 |
tv | 0.0 | 0.0 | -0.5 | 0.6 |
theater | 0.0 | 0.0 | -1.4 | 0.2 |
movies | 0.0 | 0.0 | -0.2 | 0.9 |
concerts | 0.0 | 0.0 | 0.2 | 0.8 |
music | 0.0 | 0.0 | -0.2 | 0.9 |
shopping | 0.0 | 0.0 | 2.2 | 0.0 |
yoga | 0.0 | 0.0 | -1.9 | 0.1 |
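The fitted model converts a linear score into a probability via the logistic function, P(dec_o = 1) = 1 / (1 + e^(−score)). As an illustration only, a sketch using just the coefficients from the table above that round to a nonzero value (the example partner profile is made up):

```python
import math

# Rounded coefficients from the regression table above; variables
# whose estimates round to 0.0 are omitted for brevity.
COEF = {
    "attr_o": 0.6, "sinc_o": -0.1, "fun_o": 0.2,
    "amb_o": -0.2, "shar_o": 0.3, "race": 0.1, "go_out": -0.1,
}
INTERCEPT = -5.1

def prob_selected(profile):
    """P(dec_o = 1) under the (rounded) logistic model."""
    score = INTERCEPT + sum(COEF[k] * v for k, v in profile.items())
    return 1 / (1 + math.exp(-score))

# Hypothetical partner: attractive, fun, shares interests.
p = prob_selected({"attr_o": 8, "sinc_o": 7, "fun_o": 8,
                   "amb_o": 7, "shar_o": 8, "race": 2, "go_out": 2})
print(round(p, 3))  # prints 0.832
```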
Random forests is the last method we used. Below is an overview of the key success factors in speed-dating decisions. For example, it shows that the important factors revolve around being attractive, sharing hobbies, or simply being fun.
The table below shows the key drivers of the classification according to each of the methods used.
 | CART 1 | CART 2 | Logistic Regr. | Random Forests (mean decrease in accuracy) |
---|---|---|---|---|
attr_o | 1.00 | 1.00 | 1.00 | 1.00 |
sinc_o | -0.19 | -0.18 | -0.10 | 0.21 |
intel_o | 0.19 | 0.18 | 0.01 | 0.13 |
fun_o | 0.45 | 0.44 | 0.41 | 0.54 |
amb_o | -0.20 | -0.19 | -0.27 | 0.08 |
shar_o | 0.42 | 0.41 | 0.58 | 0.65 |
field_cd | 0.00 | 0.00 | -0.08 | 0.15 |
race | 0.00 | 0.00 | 0.10 | 0.12 |
goal | 0.00 | 0.00 | -0.08 | 0.12 |
date | 0.00 | 0.00 | 0.03 | 0.22 |
go_out | 0.00 | 0.00 | -0.07 | 0.15 |
career_c | 0.00 | 0.00 | -0.01 | 0.14 |
sports | 0.00 | 0.00 | 0.05 | 0.16 |
tvsports | 0.00 | 0.00 | -0.10 | 0.15 |
exercise | 0.00 | 0.00 | 0.06 | 0.20 |
dining | 0.00 | 0.00 | -0.05 | 0.12 |
museums | 0.00 | 0.00 | -0.02 | 0.17 |
art | 0.00 | 0.00 | 0.05 | 0.18 |
hiking | 0.00 | 0.00 | 0.06 | 0.21 |
gaming | 0.00 | 0.00 | -0.09 | 0.18 |
clubbing | 0.00 | -0.01 | -0.06 | 0.14 |
reading | 0.00 | 0.00 | 0.00 | 0.15 |
tv | 0.00 | 0.00 | -0.03 | 0.13 |
theater | 0.00 | 0.00 | -0.08 | 0.15 |
movies | 0.00 | 0.00 | -0.01 | 0.16 |
concerts | 0.00 | 0.00 | 0.01 | 0.18 |
music | 0.00 | 0.00 | -0.01 | 0.16 |
shopping | 0.00 | 0.00 | 0.13 | 0.21 |
yoga | 0.00 | 0.00 | -0.11 | 0.13 |
In general, we do not see significant differences across the methods, which makes sense: all four agree that attr_o, shar_o, and fun_o are the strongest drivers.
Below is the percentage of observations correctly classified (the predicted class is the same as the actual class, using a 50% probability threshold) in the validation data:
 | Hit Ratio |
---|---|
First CART | 68.14 |
Second CART | 68.60 |
Logistic Regression | 69.77 |
Random Forests | 68.60 |
while for the estimation data the hit ratios are:
 | Hit Ratio |
---|---|
First CART | 75.02 |
Second CART | 75.49 |
Logistic Regression | 75.11 |
Random Forests | 99.77 |
For our validation data, if we simply classified all individuals into the largest class, we would get a hit ratio of 52.79% without doing any work. As can be seen above, all of the methods used beat this baseline. Note also that the near-perfect estimation hit ratio of random forests, combined with a much lower validation hit ratio, indicates overfitting to the estimation sample.
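This baseline follows directly from the validation class counts reported earlier (227 of the 430 observations are in class 0):

```python
# Validation-sample class counts from the table above.
n_class1, n_class0 = 203, 227
total = n_class1 + n_class0

# Classifying everyone into the larger class (here, class 0):
baseline_hit_ratio = 100 * max(n_class1, n_class0) / total
print(round(baseline_hit_ratio, 2))  # prints 52.79
```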
The confusion matrix shows, for each class, the number (or percentage) of observations that are correctly classified. For example, for the method with the highest validation hit ratio above (logistic regression, among the two CART models, logistic regression, and random forests), the confusion matrix for the validation data, in percent, is:
 | Predicted 1 | Predicted 0 |
---|---|---|
Actual 1 | 74.38 | 25.62 |
Actual 0 | 34.36 | 65.64 |
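A sketch of how such a row-percentage confusion matrix is computed from actual classes and thresholded predictions (toy data; the helper name is ours):

```python
from collections import Counter

def confusion_percentages(actual, predicted):
    """Row-normalized confusion matrix: for each actual class,
    the percentage of observations predicted as 1 and as 0."""
    counts = Counter(zip(actual, predicted))
    table = {}
    for cls in (1, 0):
        row_total = counts[(cls, 1)] + counts[(cls, 0)]
        table[cls] = {
            "pred_1": 100 * counts[(cls, 1)] / row_total,
            "pred_0": 100 * counts[(cls, 0)] / row_total,
        }
    return table

# Toy example: 4 actual positives (3 caught), 4 actual negatives (2 caught).
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0]
print(confusion_percentages(actual, predicted))
```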
The ROC curves for the validation data for all four methods are below:
The Lift curves for the validation data for our four classifiers are the following:
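Each point on a ROC curve is a (false-positive rate, true-positive rate) pair at some probability threshold, and the area under the curve equals the probability that a randomly chosen class-1 observation is scored higher than a randomly chosen class-0 one. That equivalence gives a compact way to summarize such curves in one number (helper name and scores are illustrative, not from the original analysis):

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) score pairs
    ranked correctly, counting ties as half a win."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Perfectly separated scores give AUC = 1.0; pure chance gives 0.5.
print(auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # prints 1.0
```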
Below are the hit ratios for all four methods on the test dataset:
 | Hit Ratio |
---|---|
First CART | 76.28 |
Second CART | 75.81 |
Logistic Regression | 77.44 |
Random Forests | 75.58 |
The confusion matrix (in percent) on the test data, for the model with the best validation hit ratio above (logistic regression):
 | Predicted 1 | Predicted 0 |
---|---|---|
Actual 1 | 66 | 34 |
Actual 0 | 14 | 86 |
ROC curves for the test data:
Lift Curves for the test data:
One of the objectives of this assignment was to compare classification effectiveness on the general population versus the segments. However, as can be seen below, we did not achieve substantial improvements with segmented data.
General population
Method | Hit ratio |
---|---|
First CART | 76.28 |
Second CART | 75.81 |
Logistic Regression | 77.44 |
Random Forests | 75.58 |
Segment #1
Method | Hit ratio |
---|---|
First CART | 75.00 |
Second CART | 75.00 |
Logistic Regression | 74.07 |
Random Forests | 74.07 |
Segment #2
Method | Hit ratio |
---|---|
First CART | 71.43 |
Second CART | 80.22 |
Logistic Regression | 63.74 |
Random Forests | 70.33 |
Segment #3
Method | Hit ratio |
---|---|
First CART | 72.55 |
Second CART | 70.59 |
Logistic Regression | 71.57 |
Random Forests | 77.45 |
Segment #4
Method | Hit ratio |
---|---|
First CART | 74.26 |
Second CART | 70.59 |
Logistic Regression | 72.06 |
Random Forests | 75.74 |