Classification using the speed dating dataset

The “Business Decision”

We would like to understand who is most likely to be selected in speed dating, as well as which key drivers affect people's decisions when choosing a partner. In addition, we aim to assess the impact of segmentation on the effectiveness of the classification methods (i.e., their hit ratios) and to apply machine learning in practice (through the Random Forests method).

The Data

This is a follow-up to the segmentation analysis conducted earlier. Below are 5 sample rows (shown as columns) out of the total of 4299:

Variable Obs 1 Obs 2 Obs 3 Obs 4 Obs 5
attr_o 6 6 7 8 6
sinc_o 7 5 7 8 6
intel_o 8 10 7 9 7
fun_o 7 6 9 8 7
amb_o 7 6 9 8 8
shar_o 5 5 9 9 7
field_cd 1 1 1 1 1
race 2 2 2 2 2
goal 1 1 1 1 1
date 5 5 5 5 5
go_out 1 1 1 1 1
career_c 1 1 1 1 1
sports 1 1 1 1 1
tvsports 1 1 1 1 1
exercise 6 6 6 6 6
dining 7 7 7 7 7
museums 6 6 6 6 6
art 7 7 7 7 7
hiking 7 7 7 7 7
gaming 5 5 5 5 5
clubbing 7 7 7 7 7
reading 7 7 7 7 7
tv 7 7 7 7 7
theater 9 9 9 9 9
movies 7 7 7 7 7
concerts 8 8 8 8 8
music 7 7 7 7 7
shopping 1 1 1 1 1
yoga 8 8 8 8 8

A Process for Classification

Classification in 6 steps

We followed the approach below to carry out the classification (as explained in class), adding a machine learning method (Random Forests):

  1. Create an estimation sample and two validation samples by splitting the data into three groups. Steps 2-5 below will then be performed only on the estimation and the first validation data. You should only do step 6 once on the second validation data, also called test data, and report/use the performance on that (second validation) data only to make final business decisions.
  2. Set up the dependent variable (as a categorical 0-1 variable; multi-class classification is also feasible, and similar, but we do not explore it in this note).
  3. Make a preliminary assessment of the relative importance of the explanatory variables using visualization tools and simple descriptive statistics.
  4. Estimate the classification model using the estimation data, and interpret the results.
  5. Assess the accuracy of classification in the first validation sample, possibly repeating steps 2-5 a few times in different ways to increase performance.
  6. Finally, assess the accuracy of classification in the second validation sample. You should eventually use/report all relevant performance measures/plots on this second validation sample only.

Let’s follow these steps.

Step 1: Split the data

We split the data into three samples: estimation_data (80% of the data in our case), validation_data (10% of the data), and test_data (the remaining 10% of the data).

In our case we use 3439 observations in the estimation data, 430 in the validation data, and 430 in the test data.
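
A minimal sketch of this split in R (the data frame name dating_data and the random seed are our assumptions, not from the original analysis):

```r
set.seed(1)  # for reproducibility
n   <- nrow(dating_data)   # 4299 observations
ids <- sample(seq_len(n))  # shuffle the row indices

estimation_data <- dating_data[ids[1:floor(0.8 * n)], ]
validation_data <- dating_data[ids[(floor(0.8 * n) + 1):floor(0.9 * n)], ]
test_data       <- dating_data[ids[(floor(0.9 * n) + 1):n], ]
```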

Step 2: Choose dependent variable

Our dependent variable is dec_o. It states whether the given subject was selected by their partner (1 = selected, 0 = not selected). The number of 0's and 1's in our estimation sample is as follows.

Class 1 Class 0
# of Observations 1490 1949

while in the validation sample they are:

Class 1 Class 0
# of Observations 203 227
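
These counts can be reproduced with a simple tabulation (a sketch, assuming dec_o is coded 0/1 in the samples above):

```r
table(estimation_data$dec_o)  # class counts in the estimation sample
table(validation_data$dec_o)  # class counts in the validation sample
```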

Step 3: Simple Analysis

Below are the descriptive statistics of our independent variables for each of the two classes. First, class 1 ("selected"):

min 25th pct median mean 75th pct max std
attr_o 1 6 7 7.27 8 10 1.53
sinc_o 0 7 8 7.65 9 10 1.46
intel_o 3 7 8 7.73 9 10 1.29
fun_o 0 6 7 7.26 8 10 1.54
amb_o 2 6 7 7.11 8 10 1.56
shar_o 0 5 7 6.48 8 10 1.81
field_cd 1 3 8 7.21 10 17 3.90
race 1 2 2 2.69 4 6 1.23
goal 1 1 2 2.08 2 6 1.38
date 1 4 5 4.89 6 7 1.46
go_out 1 1 2 2.02 3 7 1.00
career_c 1 2 5 5.03 7 17 3.22
sports 1 5 7 6.50 9 10 2.62
tvsports 1 2 4 4.47 7 10 2.76
exercise 1 5 7 6.31 8 10 2.50
dining 3 7 8 7.77 9 10 1.75
museums 1 6 7 6.95 8 10 2.02
art 1 5 7 6.70 8 10 2.28
hiking 0 4 6 5.81 8 10 2.53
gaming 1 1 4 3.73 5 10 2.36
clubbing 1 4 6 5.78 8 10 2.51
reading 1 7 8 7.68 9 10 1.89
tv 1 3 5 5.09 7 10 2.52
theater 1 5 7 6.80 9 10 2.27
movies 2 7 8 7.92 9 10 1.69
concerts 1 6 7 6.92 9 10 2.12
music 1 7 8 7.88 9 10 1.76
shopping 1 4 6 5.67 8 10 2.61
yoga 1 2 4 4.39 7 10 2.70

and class 0, “not selected”:

min 25th pct median mean 75th pct max std
attr_o 0 4 6 5.37 7 10 1.77
sinc_o 0 6 7 6.85 8 10 1.86
intel_o 0 6 7 7.06 8 10 1.60
fun_o 0 5 6 5.65 7 10 1.96
amb_o 0 5 7 6.50 8 10 1.83
shar_o 0 3 5 4.74 6 10 2.05
field_cd 1 5 8 7.24 10 17 3.66
race 1 2 2 2.75 4 6 1.20
goal 1 1 2 2.26 3 6 1.51
date 1 4 5 5.06 6 7 1.40
go_out 1 1 2 2.23 3 7 1.18
career_c 1 2 5 4.94 7 17 3.20
sports 1 4 7 6.29 9 10 2.69
tvsports 1 2 4 4.55 7 10 2.87
exercise 1 4 6 6.00 8 10 2.53
dining 1 7 8 7.66 9 10 1.85
museums 1 5 7 6.90 8 10 2.10
art 1 5 7 6.62 8 10 2.25
hiking 0 3 6 5.69 8 10 2.68
gaming 1 1 4 3.86 6 10 2.47
clubbing 1 4 6 5.46 7 10 2.37
reading 1 7 8 7.71 9 10 1.98
tv 1 3 6 5.27 7 10 2.43
theater 1 5 7 6.90 9 10 2.09
movies 2 7 8 8.00 9 10 1.56
concerts 1 6 7 6.82 8 10 2.15
music 1 7 8 7.74 9 10 1.78
shopping 1 3 5 5.42 8 10 2.64
yoga 1 2 4 4.33 7 10 2.73

A simple visualization of the values using box plots is presented below. Box plots visually summarize the distribution of an independent variable (median, top and bottom quartiles, min, max, etc.). The first set of plots is for class 0, the second for class 1.
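
A sketch of how such per-class statistics and box plots can be produced in R (our reconstruction; ivs collects the independent variable names):

```r
ivs <- setdiff(names(estimation_data), "dec_o")  # independent variables

# Per-class means and standard deviations (summary() adds the quartiles)
sapply(estimation_data[estimation_data$dec_o == 1, ivs], mean)
sapply(estimation_data[estimation_data$dec_o == 1, ivs], sd)

# Box plot of one independent variable by class, e.g. attractiveness
boxplot(attr_o ~ dec_o, data = estimation_data,
        names = c("class 0", "class 1"), ylab = "attr_o")
```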

Step 4: Classification and Interpretation

For our assignment, we used three classification methods: logistic regression, classification and regression trees (CART), and machine learning (i.e., random forests).

Running a basic CART model with complexity parameter cp = 0.01 leads to the tree shown below.

The key decision criteria in the tree correspond to the following attributes:

Attribute Name
IV1 attr_o
IV4 fun_o
IV6 shar_o
IV5 amb_o
IV2 sinc_o
IV3 intel_o
IV27 music

For example, if we instead set cp = 0.005, the (larger) tree uses the following attributes:

Attribute Name
IV1 attr_o
IV4 fun_o
IV6 shar_o
IV5 amb_o
IV2 sinc_o
IV3 intel_o
IV21 clubbing
IV16 dining
IV27 music
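
Both trees can be fitted with the rpart package; a sketch (formula and object names are ours):

```r
library(rpart)

# First tree: complexity parameter cp = 0.01
cart1 <- rpart(as.factor(dec_o) ~ ., data = estimation_data,
               method = "class", control = rpart.control(cp = 0.01))

# Second, larger tree: cp = 0.005
cart2 <- rpart(as.factor(dec_o) ~ ., data = estimation_data,
               method = "class", control = rpart.control(cp = 0.005))

plot(cart1); text(cart1)  # draw the tree and its splitting rules
```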

Below are the estimated probabilities of belonging to class 1 for the first few validation observations, using the first CART model above:

Actual Class Probability of Class 1
Obs 1 1 0.61
Obs 2 1 0.22
Obs 3 0 0.22
Obs 4 0 0.22
Obs 5 1 0.22
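
These probabilities are obtained by applying the fitted tree to the validation data (a sketch, using cart1 from above):

```r
# Estimated probability of class 1 for each validation observation
probs_cart1 <- predict(cart1, newdata = validation_data, type = "prob")[, "1"]
head(data.frame(actual = validation_data$dec_o,
                prob_class1 = round(probs_cart1, 2)), 5)
```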

Logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g., 0 or 1). It estimates the coefficients of a linear model in the selected independent variables while optimizing a classification criterion. These are the estimated logistic regression parameters for our data:

Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.1 0.5 -9.8 0.0
attr_o 0.6 0.0 17.5 0.0
sinc_o -0.1 0.0 -1.7 0.1
intel_o 0.0 0.0 0.2 0.8
fun_o 0.2 0.0 7.1 0.0
amb_o -0.2 0.0 -4.7 0.0
shar_o 0.3 0.0 10.1 0.0
field_cd 0.0 0.0 -1.4 0.2
race 0.1 0.0 1.7 0.1
goal 0.0 0.0 -1.4 0.2
date 0.0 0.0 0.5 0.6
go_out -0.1 0.0 -1.3 0.2
career_c 0.0 0.0 -0.2 0.9
sports 0.0 0.0 0.8 0.4
tvsports 0.0 0.0 -1.7 0.1
exercise 0.0 0.0 1.0 0.3
dining 0.0 0.0 -0.9 0.4
museums 0.0 0.0 -0.4 0.7
art 0.0 0.0 0.8 0.4
hiking 0.0 0.0 1.1 0.3
gaming 0.0 0.0 -1.5 0.1
clubbing 0.0 0.0 -1.0 0.3
reading 0.0 0.0 0.0 1.0
tv 0.0 0.0 -0.5 0.6
theater 0.0 0.0 -1.4 0.2
movies 0.0 0.0 -0.2 0.9
concerts 0.0 0.0 0.2 0.8
music 0.0 0.0 -0.2 0.9
shopping 0.0 0.0 2.2 0.0
yoga 0.0 0.0 -1.9 0.1
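
The table above is the standard summary of a fitted glm; a sketch of the estimation call (object names are ours):

```r
# Logistic regression of the selection decision on all independent variables
logreg <- glm(dec_o ~ ., family = binomial(link = "logit"),
              data = estimation_data)
summary(logreg)  # estimates, standard errors, z values, p-values
```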

Random forests is the last method we used. Below is an overview of the key success factors in speed dating decisions according to this method: the important factors revolve around being attractive, sharing hobbies and interests, or simply being fun.
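
A sketch using the randomForest package (the number of trees is our assumption):

```r
library(randomForest)

# importance = TRUE records the mean decrease in accuracy per variable,
# which is the importance measure reported in the comparison table below
forest <- randomForest(as.factor(dec_o) ~ ., data = estimation_data,
                       ntree = 500, importance = TRUE)
varImpPlot(forest)  # overview of the key success factors
```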

The table below shows the key drivers of the classification according to each of the methods used. Importance values are scaled so that the top driver, attr_o, equals 1.00.

CART 1 CART 2 Logistic Regr. Random Forests (mean decrease in accuracy)
attr_o 1.00 1.00 1.00 1.00
sinc_o -0.19 -0.18 -0.10 0.21
intel_o 0.19 0.18 0.01 0.13
fun_o 0.45 0.44 0.41 0.54
amb_o -0.20 -0.19 -0.27 0.08
shar_o 0.42 0.41 0.58 0.65
field_cd 0.00 0.00 -0.08 0.15
race 0.00 0.00 0.10 0.12
goal 0.00 0.00 -0.08 0.12
date 0.00 0.00 0.03 0.22
go_out 0.00 0.00 -0.07 0.15
career_c 0.00 0.00 -0.01 0.14
sports 0.00 0.00 0.05 0.16
tvsports 0.00 0.00 -0.10 0.15
exercise 0.00 0.00 0.06 0.20
dining 0.00 0.00 -0.05 0.12
museums 0.00 0.00 -0.02 0.17
art 0.00 0.00 0.05 0.18
hiking 0.00 0.00 0.06 0.21
gaming 0.00 0.00 -0.09 0.18
clubbing 0.00 -0.01 -0.06 0.14
reading 0.00 0.00 0.00 0.15
tv 0.00 0.00 -0.03 0.13
theater 0.00 0.00 -0.08 0.15
movies 0.00 0.00 -0.01 0.16
concerts 0.00 0.00 0.01 0.18
music 0.00 0.00 -0.01 0.16
shopping 0.00 0.00 0.13 0.21
yoga 0.00 0.00 -0.11 0.13

In general we do not see very significant differences across the methods used, which makes sense: all of them point to attractiveness, shared interests, and fun as the main drivers.

Step 5: Validation accuracy

1. Hit ratio

Below is the percentage of observations that were correctly classified (i.e., the predicted class, using a 50% probability threshold, matches the actual class) in the validation data:

Hit Ratio
First CART 68.14
Second CART 68.60
Logistic Regression 69.77
Random Forests 68.60

while for the estimation data the hit rates are:

Hit Ratio
First CART 75.02
Second CART 75.49
Logistic Regression 75.11
Random Forests 99.77

For our validation data, if we simply classified every individual into the largest class, we would get a hit ratio of 52.79% without doing any work.

As can be seen above, all four methods beat this baseline. Note also that Random Forests fits the estimation data almost perfectly (99.77%) while reaching only 68.60% on the validation data, a sign of overfitting and a reminder of why performance must be judged on held-out data.
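
A sketch of the hit ratio computation (shown for the logistic regression; logreg and validation_data as above):

```r
# Classify as 1 whenever the predicted probability of class 1 exceeds 50%
probs     <- predict(logreg, newdata = validation_data, type = "response")
predicted <- ifelse(probs > 0.5, 1, 0)

100 * mean(predicted == validation_data$dec_o)  # hit ratio, in percent
```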

2. Confusion matrix

The confusion matrix shows, for each class, the percentage of observations that are correctly (and incorrectly) classified. For the method with the highest validation hit rate above (logistic regression, among the two CART models and random forests), the confusion matrix for the validation data (in percent) is:

Predicted 1 Predicted 0
Actual 1 74.38 25.62
Actual 0 34.36 65.64
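
In R this is a two-way table of actual versus predicted classes, converted to row percentages (a sketch, reusing predicted from above):

```r
conf <- table(actual = validation_data$dec_o, predicted = predicted)
round(100 * prop.table(conf, margin = 1), 2)  # each row sums to 100
```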

3. ROC curve

The ROC curves for the validation data for all four methods are below:
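
One way to draw such a curve is with the pROC package (a sketch for a single classifier):

```r
library(pROC)

roc_logreg <- roc(response = validation_data$dec_o, predictor = probs)
plot(roc_logreg)  # ROC curve for the logistic regression
auc(roc_logreg)   # area under the curve
```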

4. Lift curve

The Lift curves for the validation data for our four classifiers are the following:
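
A lift curve can be computed by sorting observations from highest to lowest predicted probability and accumulating the share of actual class-1 members captured (a sketch, no extra packages needed):

```r
ord           <- order(probs, decreasing = TRUE)
actual_sorted <- validation_data$dec_o[ord]

frac_targeted <- seq_along(actual_sorted) / length(actual_sorted)
frac_captured <- cumsum(actual_sorted == 1) / sum(actual_sorted == 1)

plot(frac_targeted, frac_captured, type = "l",
     xlab = "Fraction of data targeted",
     ylab = "Fraction of class 1 captured")
abline(0, 1, lty = 2)  # baseline: random targeting
```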

Step 6: Test Accuracy

Below are the hit ratios for all four methods on the test data:

Hit Ratio
First CART 76.28
Second CART 75.81
Logistic Regression 77.44
Random Forests 75.58

The confusion matrix (in percent) for the model with the best validation hit ratio above (logistic regression), now on the test data:

Predicted 1 Predicted 0
Actual 1 66 34
Actual 0 14 86

ROC curves for the test data:

Lift Curves for the test data:

Effectiveness comparison

One of the objectives of this assignment was to compare classification effectiveness on the general population versus segmented data. However, as can be seen below, we did not achieve a substantial improvement with the segmented data.

General data

Method Hit ratio
First CART 76.28
Second CART 75.81
Logistic Regression 77.44
Random Forests 75.58

Segmented data

Segment #1

Method Hit ratio
First CART 75.00
Second CART 75.00
Logistic Regression 74.07
Random Forests 74.07

Segment #2

Method Hit ratio
First CART 71.43
Second CART 80.22
Logistic Regression 63.74
Random Forests 70.33

Segment #3

Method Hit ratio
First CART 72.55
Second CART 70.59
Logistic Regression 71.57
Random Forests 77.45

Segment #4

Method Hit ratio
First CART 74.26
Second CART 70.59
Logistic Regression 72.06
Random Forests 75.74