Classification using the speed dating dataset

The “Business Decision”

We would like to understand who is most likely to be selected in speed dating, as well as which key drivers affect people's decisions when choosing a partner. In addition, we aim to assess the impact of segmentation on the effectiveness of the classification methods (i.e., their hit ratios) and to apply machine learning in practice (through the Random Forests method).

The Data

This is a follow-up to the segmentation analysis conducted earlier. Below are 5 sample rows (shown as columns) out of the total of 4299:

Variable Obs 1 Obs 2 Obs 3 Obs 4 Obs 5
attr_o 6 6 7 8 6
sinc_o 7 5 7 8 6
intel_o 8 10 7 9 7
fun_o 7 6 9 8 7
amb_o 7 6 9 8 8
shar_o 5 5 9 9 7
field_cd 1 1 1 1 1
race 2 2 2 2 2
goal 1 1 1 1 1
date 5 5 5 5 5
go_out 1 1 1 1 1
career_c 1 1 1 1 1
sports 1 1 1 1 1
tvsports 1 1 1 1 1
exercise 6 6 6 6 6
dining 7 7 7 7 7
museums 6 6 6 6 6
art 7 7 7 7 7
hiking 7 7 7 7 7
gaming 5 5 5 5 5
clubbing 7 7 7 7 7
reading 7 7 7 7 7
tv 7 7 7 7 7
theater 9 9 9 9 9
movies 7 7 7 7 7
concerts 8 8 8 8 8
music 7 7 7 7 7
shopping 1 1 1 1 1
yoga 8 8 8 8 8

A Process for Classification

Classification in 6 steps

We followed the approach below to carry out the classification (as explained in class), adding a machine learning method (Random Forests):

  1. Create an estimation sample and two validation samples by splitting the data into three groups. Steps 2-5 below will then be performed only on the estimation and the first validation data. You should only do step 6 once on the second validation data, also called test data, and report/use the performance on that (second validation) data only to make final business decisions.
  2. Set up the dependent variable (as a categorical 0-1 variable; multi-class classification is also feasible, and similar, but we do not explore it in this note).
  3. Make a preliminary assessment of the relative importance of the explanatory variables using visualization tools and simple descriptive statistics.
  4. Estimate the classification model using the estimation data, and interpret the results.
  5. Assess the accuracy of classification in the first validation sample, possibly repeating steps 2-5 a few times in different ways to increase performance.
  6. Finally, assess the accuracy of classification in the second validation sample. You should eventually use/report all relevant performance measures/plots on this second validation sample only.

Let’s follow these steps.

Step 1: Split the data

We split the data into three samples: estimation_data (80% of the data in our case), validation_data (10% of the data), and test_data (the remaining 10% of the data).

In our case we use 3439 observations in the estimation data, 430 in the validation data, and 430 in the test data.
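
A minimal sketch of this split in R (the data frame name dating_data and the random seed are our assumptions, not from the original analysis):

```r
set.seed(1)  # for reproducibility
n   <- nrow(dating_data)   # 4299 observations
ids <- sample(seq_len(n))  # shuffle the row indices

estimation_data <- dating_data[ids[1:floor(0.8 * n)], ]
validation_data <- dating_data[ids[(floor(0.8 * n) + 1):floor(0.9 * n)], ]
test_data       <- dating_data[ids[(floor(0.9 * n) + 1):n], ]
```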

Step 2: Choose dependent variable

Our dependent variable is dec_o. It states whether the given subject was selected by their partner (1 = selected, 0 = not selected). The number of 0's and 1's in our estimation sample is as follows.

Class 1 Class 0
# of Observations 1490 1949

while in the validation sample they are:

Class 1 Class 0
# of Observations 203 227
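
These counts can be reproduced with a simple tabulation (a sketch, assuming dec_o is coded 0/1 in the samples above):

```r
table(estimation_data$dec_o)  # class counts in the estimation sample
table(validation_data$dec_o)  # class counts in the validation sample
```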

Step 3: Simple Analysis

Below are the descriptive statistics of our independent variables for each of the two classes. First, class 1 ("selected"):

min 25th pct median mean 75th pct max std
attr_o 1 6 7 7.27 8 10 1.53
sinc_o 0 7 8 7.65 9 10 1.46
intel_o 3 7 8 7.73 9 10 1.29
fun_o 0 6 7 7.26 8 10 1.54
amb_o 2 6 7 7.11 8 10 1.56
shar_o 0 5 7 6.48 8 10 1.81
field_cd 1 3 8 7.21 10 17 3.90
race 1 2 2 2.69 4 6 1.23
goal 1 1 2 2.08 2 6 1.38
date 1 4 5 4.89 6 7 1.46
go_out 1 1 2 2.02 3 7 1.00
career_c 1 2 5 5.03 7 17 3.22
sports 1 5 7 6.50 9 10 2.62
tvsports 1 2 4 4.47 7 10 2.76
exercise 1 5 7 6.31 8 10 2.50
dining 3 7 8 7.77 9 10 1.75
museums 1 6 7 6.95 8 10 2.02
art 1 5 7 6.70 8 10 2.28
hiking 0 4 6 5.81 8 10 2.53
gaming 1 1 4 3.73 5 10 2.36
clubbing 1 4 6 5.78 8 10 2.51
reading 1 7 8 7.68 9 10 1.89
tv 1 3 5 5.09 7 10 2.52
theater 1 5 7 6.80 9 10 2.27
movies 2 7 8 7.92 9 10 1.69
concerts 1 6 7 6.92 9 10 2.12
music 1 7 8 7.88 9 10 1.76
shopping 1 4 6 5.67 8 10 2.61
yoga 1 2 4 4.39 7 10 2.70

and class 0, “not selected”:

min 25th pct median mean 75th pct max std
attr_o 0 4 6 5.37 7 10 1.77
sinc_o 0 6 7 6.85 8 10 1.86
intel_o 0 6 7 7.06 8 10 1.60
fun_o 0 5 6 5.65 7 10 1.96
amb_o 0 5 7 6.50 8 10 1.83
shar_o 0 3 5 4.74 6 10 2.05
field_cd 1 5 8 7.24 10 17 3.66
race 1 2 2 2.75 4 6 1.20
goal 1 1 2 2.26 3 6 1.51
date 1 4 5 5.06 6 7 1.40
go_out 1 1 2 2.23 3 7 1.18
career_c 1 2 5 4.94 7 17 3.20
sports 1 4 7 6.29 9 10 2.69
tvsports 1 2 4 4.55 7 10 2.87
exercise 1 4 6 6.00 8 10 2.53
dining 1 7 8 7.66 9 10 1.85
museums 1 5 7 6.90 8 10 2.10
art 1 5 7 6.62 8 10 2.25
hiking 0 3 6 5.69 8 10 2.68
gaming 1 1 4 3.86 6 10 2.47
clubbing 1 4 6 5.46 7 10 2.37
reading 1 7 8 7.71 9 10 1.98
tv 1 3 6 5.27 7 10 2.43
theater 1 5 7 6.90 9 10 2.09
movies 2 7 8 8.00 9 10 1.56
concerts 1 6 7 6.82 8 10 2.15
music 1 7 8 7.74 9 10 1.78
shopping 1 3 5 5.42 8 10 2.64
yoga 1 2 4 4.33 7 10 2.73

A simple visualization of the values using box plots is presented below. Box plots visually summarize the distribution of an independent variable (median, top and bottom quartiles, min, max, etc.). The first set of plots is for class 0, the second for class 1.
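
A sketch of how such per-class statistics and box plots can be produced in R (our reconstruction; ivs collects the independent variable names):

```r
ivs <- setdiff(names(estimation_data), "dec_o")  # independent variables

# Per-class means and standard deviations (summary() adds the quartiles)
sapply(estimation_data[estimation_data$dec_o == 1, ivs], mean)
sapply(estimation_data[estimation_data$dec_o == 1, ivs], sd)

# Box plot of one independent variable by class, e.g. attractiveness
boxplot(attr_o ~ dec_o, data = estimation_data,
        names = c("class 0", "class 1"), ylab = "attr_o")
```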

Step 4: Classification and Interpretation

For our assignment, we used three classification methods: logistic regression, classification and regression trees (CART), and machine learning (i.e., random forests).

Running a basic CART model with complexity parameter cp = 0.01 leads to the tree shown below.

The key decision criteria in the tree correspond to the following attributes:

Attribute Name
IV1 attr_o
IV4 fun_o
IV6 shar_o
IV5 amb_o
IV2 sinc_o
IV3 intel_o
IV27 music

For example, if we instead set cp = 0.005, the (larger) tree uses the following attributes:

Attribute Name
IV1 attr_o
IV4 fun_o
IV6 shar_o
IV5 amb_o
IV2 sinc_o
IV3 intel_o
IV21 clubbing
IV16 dining
IV27 music
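
Both trees can be fitted with the rpart package; a sketch (formula and object names are ours):

```r
library(rpart)

# First tree: complexity parameter cp = 0.01
cart1 <- rpart(as.factor(dec_o) ~ ., data = estimation_data,
               method = "class", control = rpart.control(cp = 0.01))

# Second, larger tree: cp = 0.005
cart2 <- rpart(as.factor(dec_o) ~ ., data = estimation_data,
               method = "class", control = rpart.control(cp = 0.005))

plot(cart1); text(cart1)  # draw the tree and its splitting rules
```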

Below are the estimated probabilities of belonging to class 1 for the first few validation observations, using the first CART model above:

Actual Class Probability of Class 1
Obs 1 1 0.61
Obs 2 1 0.22
Obs 3 0 0.22
Obs 4 0 0.22
Obs 5 1 0.22
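
These probabilities are obtained by applying the fitted tree to the validation data (a sketch, using cart1 from above):

```r
# Estimated probability of class 1 for each validation observation
probs_cart1 <- predict(cart1, newdata = validation_data, type = "prob")[, "1"]
head(data.frame(actual = validation_data$dec_o,
                prob_class1 = round(probs_cart1, 2)), 5)
```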

Logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g., 0 or 1). It estimates the coefficients of a linear model in the selected independent variables while optimizing a classification criterion. These are the estimated logistic regression parameters for our data:

Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.1 0.5 -9.8 0.0
attr_o 0.6 0.0 17.5 0.0
sinc_o -0.1 0.0 -1.7 0.1
intel_o 0.0 0.0 0.2 0.8
fun_o 0.2 0.0 7.1 0.0
amb_o -0.2 0.0 -4.7 0.0
shar_o 0.3 0.0 10.1 0.0
field_cd 0.0 0.0 -1.4 0.2
race 0.1 0.0 1.7 0.1
goal 0.0 0.0 -1.4 0.2
date 0.0 0.0 0.5 0.6
go_out -0.1 0.0 -1.3 0.2
career_c 0.0 0.0 -0.2 0.9
sports 0.0 0.0 0.8 0.4
tvsports 0.0 0.0 -1.7 0.1
exercise 0.0 0.0 1.0 0.3
dining 0.0 0.0 -0.9 0.4
museums 0.0 0.0 -0.4 0.7
art 0.0 0.0 0.8 0.4
hiking 0.0 0.0 1.1 0.3
gaming 0.0 0.0 -1.5 0.1
clubbing 0.0 0.0 -1.0 0.3
reading 0.0 0.0 0.0 1.0
tv 0.0 0.0 -0.5 0.6
theater 0.0 0.0 -1.4 0.2
movies 0.0 0.0 -0.2 0.9
concerts 0.0 0.0 0.2 0.8
music 0.0 0.0 -0.2 0.9
shopping 0.0 0.0 2.2 0.0
yoga 0.0 0.0 -1.9 0.1
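
The table above is the standard summary of a fitted glm; a sketch of the estimation call (object names are ours):

```r
# Logistic regression of the selection decision on all independent variables
logreg <- glm(dec_o ~ ., family = binomial(link = "logit"),
              data = estimation_data)
summary(logreg)  # estimates, standard errors, z values, p-values
```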

Random forests is the last method we used. Below is an overview of the key success factors in speed dating decisions according to this method: the important factors revolve around being attractive, sharing hobbies and interests, or simply being fun.
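
A sketch using the randomForest package (the number of trees is our assumption):

```r
library(randomForest)

# importance = TRUE records the mean decrease in accuracy per variable,
# which is the importance measure reported in the comparison table below
forest <- randomForest(as.factor(dec_o) ~ ., data = estimation_data,
                       ntree = 500, importance = TRUE)
varImpPlot(forest)  # overview of the key success factors
```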

The table below shows the key drivers of the classification according to each of the methods used. Importance values are scaled so that the top driver, attr_o, equals 1.00.

CART 1 CART 2 Logistic Regr. Random Forests (mean decrease in accuracy)
attr_o 1.00 1.00 1.00 1.00
sinc_o -0.19 -0.18 -0.10 0.21
intel_o 0.19 0.18 0.01 0.13
fun_o 0.45 0.44 0.41 0.54
amb_o -0.20 -0.19 -0.27 0.08
shar_o 0.42 0.41 0.58 0.65
field_cd 0.00 0.00 -0.08 0.15
race 0.00 0.00 0.10 0.12
goal 0.00 0.00 -0.08 0.12
date 0.00 0.00 0.03 0.22
go_out 0.00 0.00 -0.07 0.15
career_c 0.00 0.00 -0.01 0.14
sports 0.00 0.00 0.05 0.16
tvsports 0.00 0.00 -0.10 0.15
exercise 0.00 0.00 0.06 0.20
dining 0.00 0.00 -0.05 0.12
museums 0.00 0.00 -0.02 0.17
art 0.00 0.00 0.05 0.18
hiking 0.00 0.00 0.06 0.21
gaming 0.00 0.00 -0.09 0.18
clubbing 0.00 -0.01 -0.06 0.14
reading 0.00 0.00 0.00 0.15
tv 0.00 0.00 -0.03 0.13
theater 0.00 0.00 -0.08 0.15
movies 0.00 0.00 -0.01 0.16
concerts 0.00 0.00 0.01 0.18
music 0.00 0.00 -0.01 0.16
shopping 0.00 0.00 0.13 0.21
yoga 0.00 0.00 -0.11 0.13

In general we do not see very significant differences across the methods used, which makes sense: all of them point to attractiveness, shared interests, and fun as the main drivers.

Step 5: Validation accuracy

1. Hit ratio

Below is the percentage of observations that were correctly classified (i.e., the predicted class, using a 50% probability threshold, matches the actual class) in the validation data:

Hit Ratio
First CART 68.14
Second CART 68.60
Logistic Regression 69.77
Random Forests 68.60

while for the estimation data the hit rates are:

Hit Ratio
First CART 75.02
Second CART 75.49
Logistic Regression 75.11
Random Forests 99.77

For our validation data, if we simply classified every individual into the largest class, we would get a hit ratio of 52.79% without doing any work.

As can be seen above, all four methods beat this baseline. Note also that Random Forests fits the estimation data almost perfectly (99.77%) while reaching only 68.60% on the validation data, a sign of overfitting and a reminder of why performance must be judged on held-out data.
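
A sketch of the hit ratio computation (shown for the logistic regression; logreg and validation_data as above):

```r
# Classify as 1 whenever the predicted probability of class 1 exceeds 50%
probs     <- predict(logreg, newdata = validation_data, type = "response")
predicted <- ifelse(probs > 0.5, 1, 0)

100 * mean(predicted == validation_data$dec_o)  # hit ratio, in percent
```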

2. Confusion matrix

The confusion matrix shows, for each class, the percentage of observations that are correctly (and incorrectly) classified. For the method with the highest validation hit rate above (logistic regression, among the two CART models and random forests), the confusion matrix for the validation data (in percent) is:

Predicted 1 Predicted 0
Actual 1 74.38 25.62
Actual 0 34.36 65.64
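
In R this is a two-way table of actual versus predicted classes, converted to row percentages (a sketch, reusing predicted from above):

```r
conf <- table(actual = validation_data$dec_o, predicted = predicted)
round(100 * prop.table(conf, margin = 1), 2)  # each row sums to 100
```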

3. ROC curve

The ROC curves for the validation data for all four methods are below:
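
One way to draw such a curve is with the pROC package (a sketch for a single classifier):

```r
library(pROC)

roc_logreg <- roc(response = validation_data$dec_o, predictor = probs)
plot(roc_logreg)  # ROC curve for the logistic regression
auc(roc_logreg)   # area under the curve
```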

4. Lift curve

The Lift curves for the validation data for our four classifiers are the following:
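
A lift curve can be computed by sorting observations from highest to lowest predicted probability and accumulating the share of actual class-1 members captured (a sketch, no extra packages needed):

```r
ord           <- order(probs, decreasing = TRUE)
actual_sorted <- validation_data$dec_o[ord]

frac_targeted <- seq_along(actual_sorted) / length(actual_sorted)
frac_captured <- cumsum(actual_sorted == 1) / sum(actual_sorted == 1)

plot(frac_targeted, frac_captured, type = "l",
     xlab = "Fraction of data targeted",
     ylab = "Fraction of class 1 captured")
abline(0, 1, lty = 2)  # baseline: random targeting
```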

Step 6: Test Accuracy

Below are the hit ratios for all four methods on the test data:

Hit Ratio
First CART 76.28
Second CART 75.81
Logistic Regression 77.44
Random Forests 75.58

The confusion matrix (in percent) for the model with the best validation hit ratio above (logistic regression), now on the test data:

Predicted 1 Predicted 0
Actual 1 66 34
Actual 0 14 86

ROC curves for the test data:

Lift Curves for the test data:

Effectiveness comparison

One of the objectives of this assignment was to compare classification effectiveness on the general population versus segmented data. However, as can be seen below, we did not achieve a substantial improvement with the segmented data.

General data

Method Hit ratio
First CART 76.28
Second CART 75.81
Logistic Regression 77.44
Random Forests 75.58

Segmented data

Segment #1

Method Hit ratio
First CART 75.00
Second CART 75.00
Logistic Regression 74.07
Random Forests 74.07

Segment #2

Method Hit ratio
First CART 71.43
Second CART 80.22
Logistic Regression 63.74
Random Forests 70.33

Segment #3

Method Hit ratio
First CART 72.55
Second CART 70.59
Logistic Regression 71.57
Random Forests 77.45

Segment #4

Method Hit ratio
First CART 74.26
Second CART 70.59
Logistic Regression 72.06
Random Forests 75.74