The Business Context

A multinational chip manufacturer is concerned that its retention rate is lower than the industry average. The company is losing its best employees and wants to ensure that they stay. It knows that if it pays a bonus of USD 15k to an employee who wants to leave, the employee will stay. However, the cost of hiring, training, and paying the sign-on bonus for a new employee is USD 50k. The HR department is trying to minimize this cost.


The Data

(Data source: https://www.kaggle.com/ludobenistant/hr-analytics/data)

The multinational has gathered information on 14999 employees. The dataset contains 9 explanatory variables (satisfaction score, last evaluation score, number of projects the employee has worked on, average monthly hours, time spent in the company, occurrence of a workplace accident, whether the employee was promoted in the last five years, department, and salary range) as well as the outcome: did the employee leave or not?

Name                     Description
satisfaction_level       Level of satisfaction (0-1)
last_evaluation          Score of the last performance evaluation (0-1)
number_project           Number of projects the employee has worked on
average_montly_hours     Average monthly hours at the workplace
time_spend_company       Number of years spent in the company
Work_accident            Whether the employee had a workplace accident (1 or 0)
left                     Whether the employee left the company or not (1 or 0); this is the outcome (factor)
promotion_last_5years    Whether the employee was promoted in the last five years (1 or 0)
sales                    Department in which the employee works (shown as department in the preview below)
salary                   Relative level of salary (low / medium / high)

Let’s look at the data for a few employees. This is how the first 10 of the 14999 rows look (transposed, for convenience):

01 02 03 04 05 06 07 08 09 10
satisfaction_level 0.38 0.80 0.11 0.72 0.37 0.41 0.10 0.92 0.89 0.42
last_evaluation 0.53 0.86 0.88 0.87 0.52 0.50 0.77 0.85 1.00 0.53
number_project 2.00 5.00 7.00 5.00 2.00 2.00 6.00 5.00 5.00 2.00
average_montly_hours 157.00 262.00 272.00 223.00 159.00 153.00 247.00 259.00 224.00 142.00
time_spend_company 3.00 6.00 4.00 5.00 3.00 3.00 4.00 5.00 5.00 3.00
Work_accident 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
promotion_last_5years 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
department 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
salary 2.00 3.00 3.00 2.00 2.00 2.00 2.00 2.00 2.00 2.00

A Process for Classification

We will use the following process:

  1. Create an estimation sample and two validation samples by splitting the data into three groups. Steps 2-5 below will then be performed only on the estimation data and the first validation data. Step 6 will be done only once on the second validation data, also called the test data, and we will only report/use the performance on that (second validation) data to make the final business decisions.
  2. Set up the dependent variable (as a categorical 0-1 variable; multi-class classification is also feasible, and similar, but we do not explore it in this note).
  3. Make a preliminary assessment of the relative importance of the explanatory variables using visualization tools and simple descriptive statistics.
  4. Estimate the classification model using the estimation data, and interpret the results.
  5. Assess the accuracy of classification in the first validation sample, possibly repeating steps 2-5 a few times changing the classifier in different ways to increase performance.
  6. Finally, assess the accuracy of classification in the second validation sample. We will eventually use and report all relevant performance measures and plots on this second validation sample only.

Let’s follow these steps.

Step 1: Split the data

For this project we split the data 65/25/10: 65% of the data is used for estimation (training), 25% for validation, and 10% for testing.

In our case this gives 9749 observations in the estimation data, 3750 in the validation data, and 1500 in the test data.
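
A minimal sketch of such a split in R is shown below. The file name (HR_comma_sep.csv), the random seed, and the recoding of the two text columns into numeric codes are assumptions made only to keep the sketch self-contained; they are not necessarily the exact choices behind the numbers reported in this note.

```r
# Read the Kaggle file (file name assumed) and recode the two text columns
# as numeric codes, matching the data preview shown above
hr_data <- read.csv("HR_comma_sep.csv", stringsAsFactors = FALSE)
hr_data$department <- as.numeric(factor(hr_data$sales))
hr_data$sales      <- NULL
hr_data$salary     <- as.numeric(factor(hr_data$salary))

# 65/25/10 split into estimation, validation and test samples
set.seed(42)                                 # so the split is reproducible
n   <- nrow(hr_data)                         # 14999 employees
idx <- sample(seq_len(n))                    # shuffled row indices

estimation_data <- hr_data[idx[1:round(0.65 * n)], ]
validation_data <- hr_data[idx[(round(0.65 * n) + 1):round(0.90 * n)], ]
test_data       <- hr_data[idx[(round(0.90 * n) + 1):n], ]

c(nrow(estimation_data), nrow(validation_data), nrow(test_data))   # 9749 3750 1500
```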

Step 2: Set up the dependent variable

First, make sure the dependent variable is set up as a categorical 0-1 variable. In our example, we use the employee leaving (or not leaving) as the dependent variable.

In our data the number of 0/1’s in our estimation sample is as follows:

Class 1 Class 0
# of Observations 2324 7425

while in the validation sample they are:

Class 1 Class 0
# of Observations 875 2875
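
Continuing the sketch above, the dependent variable and the class counts can be checked as follows (the outcome column is named left, as in the data description; counts will differ slightly for a different random split).

```r
# The dependent variable: whether the employee left (1) or stayed (0).
# It is already coded as 0/1 in the data, so we only verify and tabulate it.
stopifnot(all(estimation_data$left %in% c(0, 1)))

table(estimation_data$left)   # class 0 vs class 1 counts in the estimation sample
table(validation_data$left)   # class 0 vs class 1 counts in the validation sample
```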

Step 3: Simple Analysis

In this section we run a simple statistical and visual exploration of the data in order to understand how the two classes differ along the independent variables.

We start with simple summary statistics for class 1 (“left”):

min  25th pct  median  mean  75th pct  max  std
satisfaction_level 0.09 0.11 0.41 0.44 0.73 0.92 0.26
last_evaluation 0.45 0.52 0.79 0.72 0.90 1.00 0.20
number_project 2.00 2.00 4.00 3.88 6.00 7.00 1.82
average_montly_hours 126.00 147.00 226.00 208.56 262.00 310.00 61.21
time_spend_company 2.00 3.00 4.00 3.89 5.00 6.00 0.98
Work_accident 0.00 0.00 0.00 0.05 0.00 1.00 0.21
promotion_last_5years 0.00 0.00 0.00 0.00 0.00 1.00 0.07
department 1.00 5.00 8.00 7.01 9.00 10.00 2.81
salary 1.00 2.00 2.00 2.35 3.00 3.00 0.52

and class 0 (“did not leave”):

min  25th pct  median  mean  75th pct  max  std
satisfaction_level 0.12 0.53 0.69 0.66 0.84 1 0.22
last_evaluation 0.36 0.58 0.71 0.71 0.85 1 0.16
number_project 2.00 3.00 4.00 3.80 4.00 6 0.98
average_montly_hours 96.00 162.00 198.00 198.91 238.00 287 45.88
time_spend_company 2.00 2.00 3.00 3.39 4.00 10 1.57
Work_accident 0.00 0.00 0.00 0.18 0.00 1 0.38
promotion_last_5years 0.00 0.00 0.00 0.03 0.00 1 0.16
department 1.00 5.00 8.00 6.88 9.00 10 2.75
salary 1.00 2.00 2.00 2.35 3.00 3 0.66

Just looking at these simple summaries we already see some differences between the two sets of employees. Our initial hypothesis is that employees who are less satisfied and who work longer hours tend to be the ones who leave. In the next few sections we apply statistical rigor to make the classification.
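
These class-conditional summaries can be computed along the following lines (a base-R sketch using the objects from the split above):

```r
# Summary statistics of each variable, separately for leavers (left == 1)
# and stayers (left == 0) in the estimation sample
stat_vars <- setdiff(names(estimation_data), "left")

describe_class <- function(data, class_value) {
  columns <- data[data$left == class_value, stat_vars]
  t(sapply(columns, function(x)
    c(min = min(x), p25 = quantile(x, 0.25, names = FALSE), median = median(x),
      mean = mean(x), p75 = quantile(x, 0.75, names = FALSE), max = max(x),
      std = sd(x))))
}

round(describe_class(estimation_data, 1), 2)   # class 1 ("left")
round(describe_class(estimation_data, 0), 2)   # class 0 ("did not leave")
```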

We use box plots below for a simple visual exploration of the independent variables and their relation to the dependent variable. Variable 1 in this case is the satisfaction level and Variable 2 is the employee’s last evaluation; the box plots are shown separately for class 1 (“left”) and class 0 (“did not leave”).

Step 4: Classification and Interpretation

Since we have identified the independent and the dependent variables, we will now explore methods to classify the data.

In this case we start with Logistic Regression and CART.

Logistic Regression: Logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g., 0 or 1). It estimates the coefficients of a linear model in the selected independent variables while optimizing a classification criterion (the likelihood). These are the estimated logistic regression parameters for our data:

Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.2 0.2 -0.9 0.4
satisfaction_level -4.1 0.1 -34.4 0.0
last_evaluation 0.9 0.2 4.8 0.0
number_project -0.3 0.0 -12.6 0.0
average_montly_hours 0.0 0.0 7.5 0.0
time_spend_company 0.2 0.0 13.2 0.0
Work_accident -1.5 0.1 -13.9 0.0
promotion_last_5years -2.1 0.3 -6.2 0.0
department 0.0 0.0 2.9 0.0
salary 0.0 0.0 0.5 0.7

The estimated probability that a validation observation belongs to class 1 (e.g., the estimated probability that the employee leaves) for the first few validation observations, using the logistic regression above, is:

Actual Class Predicted Class Probability of Class 1
Obs 1 0 1 0.75
Obs 2 0 0 0.18
Obs 3 0 0 0.24
Obs 4 0 0 0.13
Obs 5 0 0 0.24
Obs 6 0 0 0.23
Obs 7 0 0 0.09
Obs 8 1 1 0.39
Obs 9 1 1 0.27
Obs 10 1 1 0.41

The default decision is to classify each observation into the class with the highest estimated probability, but one can change this choice, as we discuss below.
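
A sketch of how this logistic regression can be estimated and used to score the validation data with base R’s glm (object names continue from the split sketch above; the 0.24 threshold is the one used later in this note):

```r
# Logistic regression of leaving (1) vs staying (0) on all explanatory variables
logreg_model <- glm(left ~ satisfaction_level + last_evaluation + number_project +
                      average_montly_hours + time_spend_company + Work_accident +
                      promotion_last_5years + department + salary,
                    data = estimation_data, family = binomial)

summary(logreg_model)   # coefficient estimates, standard errors, z values, p-values

# Estimated probability of class 1 (leaving) for each validation observation
logreg_prob  <- predict(logreg_model, newdata = validation_data, type = "response")

# Classify as class 1 when the probability exceeds the chosen threshold
logreg_class <- ifelse(logreg_prob > 0.24, 1, 0)

head(data.frame(actual      = validation_data$left,
                predicted   = logreg_class,
                probability = round(logreg_prob, 2)), 10)
```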

Running a basic CART model with complexity control cp = 0.026 leads to the following tree and complexity parameter plot:

Based on the complexity parameter plot, we chose cp = 0.02, as this is approximately where the relative error crosses the threshold line.
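
A corresponding sketch with the rpart package (the exact control settings are assumptions; the chosen cp is the one discussed above):

```r
library(rpart)

# CART for leaving vs staying; method = "class" returns class probabilities
cart_model <- rpart(factor(left) ~ satisfaction_level + last_evaluation + number_project +
                      average_montly_hours + time_spend_company + Work_accident +
                      promotion_last_5years + department + salary,
                    data = estimation_data, method = "class", cp = 0.02)

printcp(cart_model)                  # complexity parameter table
plotcp(cart_model)                   # relative error vs cp, used to choose cp
plot(cart_model, uniform = TRUE)     # the tree itself
text(cart_model, use.n = TRUE)

# Estimated probability of class 1 (leaving) for each validation observation
cart_prob  <- predict(cart_model, newdata = validation_data, type = "prob")[, "1"]
cart_class <- ifelse(cart_prob > 0.24, 1, 0)
```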

In our case, the probability that a validation observation belongs to class 1 (i.e., an employee’s likelihood of leaving), for the first few validation observations and using the CART model above, is:

Actual Class Predicted Class Probability of Class 1
Obs 1 0 0 0.07
Obs 2 0 0 0.01
Obs 3 0 0 0.01
Obs 4 0 0 0.01
Obs 5 0 0 0.01
Obs 6 0 0 0.01
Obs 7 0 0 0.01
Obs 8 1 1 0.95
Obs 9 1 1 0.86
Obs 10 1 1 0.95

The table above assumes that the probability threshold for classifying an observation as “class 1” is 0.24.

In our case, we can see the relative importance of the independent variables using the variable.importance field of the CART tree (see help(rpart.object) in R) or the z-scores from the logistic regression output. For easier comparison, we scale all values to lie between -1 and 1 (the scaling is done separately for each method; note that CART does not provide the sign of the “coefficients”). From this table we can see the key drivers of the classification according to each of the methods used here.

Logistic Regression CART
satisfaction_level -1.00 -1.00
last_evaluation 0.14 0.47
number_project -0.37 -0.51
average_montly_hours 0.22 0.50
time_spend_company 0.38 0.36
Work_accident -0.40 -0.02
promotion_last_5years -0.18 0.00
department 0.08 0.00
salary 0.01 0.00

Both Logistic Regression and CART agree that satisfaction level is the key driver for an employee leaving. However, among the other variables, there seems to be a disagreement between the models, especially around work accident and average monthly hours.
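
A sketch of how such a table can be assembled: each method’s values are scaled by their maximum absolute value (attaching signs to the CART importances, as the table above appears to do, is left out here).

```r
# Relative importance of the explanatory variables, scaled to at most 1 in absolute value
logreg_z      <- summary(logreg_model)$coefficients[-1, "z value"]   # drop the intercept
logreg_scaled <- logreg_z / max(abs(logreg_z))

cart_importance <- cart_model$variable.importance                    # see help(rpart.object)
cart_scaled     <- cart_importance / max(cart_importance)

round(logreg_scaled, 2)   # signed, since logistic regression coefficients have signs
round(cart_scaled, 2)     # unsigned; variables not used by the tree are simply absent
```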

Step 5: Validation accuracy

For different choices of the probability threshold, we measured a number of classification performance metrics, which are outlined next.

1. Hit ratio

This is the percentage of observations that have been correctly classified (i.e., the predicted class and the actual class are the same). We simply count the number of validation observations that are correctly classified and divide by the total number of validation observations, using the CART model and the logistic regression above. For a probability threshold of 24%, the hit ratios are:

Hit Ratio
Logistic Regression 73.2
CART 96.8

For the estimation data, the hit rates are:

Hit Ratio
Logistic Regression 72.8
CART 96.5

We now benchmark the hit ratio of our classification models against the Maximum Chance Criterion, which measures the proportion of the largest class. For our validation data the largest group is employees who do not leave (2875 out of 3750 employees). Clearly, if we classified all individuals into the largest group, we would get a hit ratio of 76.67% without doing any work. In our case, the hit ratio of the logistic regression is similar to the Maximum Chance Criterion, whereas the CART model performs better than the Maximum Chance Criterion.
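
The hit ratios and the Maximum Chance Criterion can be computed as follows (continuing from the scored validation data above):

```r
# Hit ratio = percentage of validation observations whose predicted class equals the actual class
hit_ratio <- function(actual, predicted) 100 * mean(actual == predicted)

hit_ratio(validation_data$left, logreg_class)   # logistic regression, threshold 0.24
hit_ratio(validation_data$left, cart_class)     # CART, threshold 0.24

# Maximum Chance Criterion: always predict the larger class ("did not leave")
100 * max(table(validation_data$left)) / nrow(validation_data)
```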

2. Confusion matrix

The confusion matrix shows, for each class, the percentage of observations that are correctly and incorrectly classified. For the method with the highest hit ratio in the validation data (the CART model), and for a probability threshold of 24%, the confusion matrix on the validation data is (in percent):

Predicted 1 (Left) Predicted 0 (Did Not Leave)
Actual 1 (Left) 92.46 7.54
Actual 0 (Did Not Leave) 1.88 98.12
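
The confusion matrix above can be reproduced along these lines:

```r
# Confusion matrix for CART on the validation data at the 24% threshold,
# expressed as percentages within each actual class (rows sum to 100)
confusion <- table(actual = validation_data$left, predicted = cart_class)
round(100 * prop.table(confusion, margin = 1), 2)
```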

3. ROC curve

The ROC curves for the validation data for the logistic regression as well as the CART above are as follows:
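
The ROC curves can be produced, for example, with the pROC package (one option among several):

```r
library(pROC)

# ROC curves on the validation data; roc() takes the actual classes and the
# estimated probabilities of class 1
roc_logreg <- roc(validation_data$left, logreg_prob)
roc_cart   <- roc(validation_data$left, cart_prob)

plot(roc_logreg, col = "blue")
plot(roc_cart,   col = "red", add = TRUE)

auc(roc_logreg)   # area under the curve, logistic regression
auc(roc_cart)     # area under the curve, CART
```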

4. Gains chart

In the employee attrition case we are studying, the gains charts for the validation data for our two classifiers are the following:
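
A base-R sketch of the cumulative gains calculation: rank the validation observations by estimated probability of leaving and track the share of actual leavers captured as we move down the ranked list.

```r
# Cumulative gains: fraction of all leavers captured among the top-ranked employees
gains_curve <- function(actual, prob) {
  ranked <- actual[order(prob, decreasing = TRUE)]
  cumsum(ranked) / sum(ranked)
}

selected <- seq_len(nrow(validation_data)) / nrow(validation_data)   # fraction selected

plot(selected, gains_curve(validation_data$left, logreg_prob), type = "l", col = "blue",
     xlab = "Fraction of employees selected", ylab = "Fraction of leavers captured")
lines(selected, gains_curve(validation_data$left, cart_prob), col = "red")
abline(0, 1, lty = 2)   # baseline: selecting employees at random
```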

5. Profit curve

Total Estimated Profit =
    (% of 1’s correctly predicted) x (value of capturing a 1)
  + (% of 0’s correctly predicted) x (value of capturing a 0)
  + (% of 1’s incorrectly predicted as 0) x (cost of missing a 1)
  + (% of 0’s incorrectly predicted as 1) x (cost of missing a 0)

Calculating the expected profit requires we have an estimate of the four costs/values: the value of capturing a 1 or a 0, and the cost of misclassifying a 1 into a 0 or vice versa.

Given the values and costs of correct classifications and misclassifications, we can plot the total estimated profit (or loss) as we change the percentage of cases we select, i.e., the probability threshold of the classifier, like we did for the ROC and the gains chart.

In our employee attrition case, we consider the following business profit and loss to the company for correctly classified and misclassified employees:

Predict 1 (Left) Predict 0 (Did Not Leave)
Actual 1 (Left) 35000 -50000
Actual 0 (Did Not Leave) -15000 0

Based on these profit and cost estimates, the profit curves for the validation data for the two classifiers are:
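
A sketch of the profit-curve calculation using the payoff matrix above (the value of capturing a 0 is zero, so that term drops out):

```r
# Total estimated profit on the validation data for a given probability threshold,
# using the payoff matrix above (+35k, -50k, -15k, 0)
profit_at <- function(actual, prob, threshold) {
  predicted <- ifelse(prob > threshold, 1, 0)
  sum(35000 * (actual == 1 & predicted == 1) -
      50000 * (actual == 1 & predicted == 0) -
      15000 * (actual == 0 & predicted == 1))
}

thresholds <- seq(0, 1, by = 0.01)

plot(thresholds,
     sapply(thresholds, profit_at, actual = validation_data$left, prob = cart_prob),
     type = "l", col = "red",
     xlab = "Probability threshold", ylab = "Total estimated profit (USD)")
lines(thresholds,
      sapply(thresholds, profit_at, actual = validation_data$left, prob = logreg_prob),
      col = "blue")
```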

Step 6: Test Accuracy

Having iterated steps 2-5, we are now satisfied with the performance of our selected model on the validation data. In this final step, the performance analysis outlined in step 5 is repeated on the test sample. This is the performance that best mimics what one should expect in practice upon deployment of the classification solution.

Let’s see how the hit ratio, confusion matrix, ROC curve, gains chart, and profit curve look for our test data. For the hit ratio and the confusion matrix we use 24% as the probability threshold for classification.

Hit Ratio
Logistic Regression 72.4
CART 95.6

The confusion matrix for the model with the best hit ratio on the test data (the CART model) is (in percent):

Predicted 1 (Left) Predicted 0 (Did Not Leave)
Actual 1 (Left) 89.25 10.75
Actual 0 (Did Not Leave) 2.30 97.70

ROC curves for the test data:

Gains chart for the test data:

Finally, the profit curves for the test data, using the same profit/cost estimates as above, are: