A multinational chip manufacturer is concerned that its retention rate is lower than the industry average. It is losing its best employees and wants to ensure that these employees stay. The company knows that if it pays a bonus of USD 15k to an employee who wants to leave, the employee will stay. However, the cost of hiring, training, and paying the sign-on bonus for a new employee is USD 50k. The HR department is trying to minimize this cost.
(Data source: https://www.kaggle.com/ludobenistant/hr-analytics/data)
The multinational has gathered information on 14999 employees. The dataset contains nine independent variables, including satisfaction score, last evaluation score, number of projects the employee has worked on, average monthly hours, time spent at the company, occurrence of a workplace accident, promotions in the last five years, department, and salary range, as well as the outcome: did the employee leave or not?
Name | Description |
---|---|
satisfaction_level | Level of satisfaction (0-1) |
last_evaluation | Score on the last performance evaluation (0-1) |
number_project | Number of projects completed while at work |
average_montly_hours | Average monthly hours at workplace |
time_spend_company | Number of years spent in the company |
Work_accident | Whether the employee had a workplace accident |
left | Whether the employee left the workplace or not (1 or 0) Factor |
promotion_last_5years | Whether the employee was promoted in the last five years |
sales | Department in which the employee works |
salary | Relative level of salary (low/medium/high) |
Let’s look into the data for a few employees. Here is how the first 10 of the 14999 rows look (transposed, for convenience):
01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|
satisfaction_level | 0.38 | 0.80 | 0.11 | 0.72 | 0.37 | 0.41 | 0.10 | 0.92 | 0.89 | 0.42 |
last_evaluation | 0.53 | 0.86 | 0.88 | 0.87 | 0.52 | 0.50 | 0.77 | 0.85 | 1.00 | 0.53 |
number_project | 2.00 | 5.00 | 7.00 | 5.00 | 2.00 | 2.00 | 6.00 | 5.00 | 5.00 | 2.00 |
average_montly_hours | 157.00 | 262.00 | 272.00 | 223.00 | 159.00 | 153.00 | 247.00 | 259.00 | 224.00 | 142.00 |
time_spend_company | 3.00 | 6.00 | 4.00 | 5.00 | 3.00 | 3.00 | 4.00 | 5.00 | 5.00 | 3.00 |
Work_accident | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
department | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 | 8.00 |
salary | 2.00 | 3.00 | 3.00 | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 | 2.00 |
We will use the standard classification process; let’s follow these steps.
For this project we split the data 70/20/10: roughly 70% of the data is used for training, 20% for validation, and 10% for testing.
In our case we use 9749 observations in the estimation data, 3750 in the validation data, and 1500 in the test data.
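Such a split can be sketched as follows; this is a minimal illustration in Python (the function name, seed, and plain index shuffle are our own assumptions, not taken from the original analysis):

```python
import random

def split_indices(n_rows, fracs=(0.70, 0.20, 0.10), seed=1):
    # Shuffle the row indices, then cut them into estimation,
    # validation, and test index sets according to `fracs`.
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_est = round(fracs[0] * n_rows)
    n_val = round(fracs[1] * n_rows)
    return idx[:n_est], idx[n_est:n_est + n_val], idx[n_est + n_val:]

est, val, test = split_indices(14999)
```

Note that an exact 70/20/10 cut of 14999 rows gives 10499/3000/1500 observations; the estimation and validation counts used in this project correspond to a slightly different cut.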
First, make sure the dependent variable is set up as a categorical 0-1 variable. In our example, we use the employee leaving (or not leaving) as the dependent variable.
In our data the number of 0/1’s in our estimation sample is as follows:
Class 1 | Class 0 | |
---|---|---|
# of Observations | 2324 | 7425 |
while in the validation sample they are:
Class 1 | Class 0 | |
---|---|---|
# of Observations | 875 | 2875 |
In this section we run a simple statistical and visual exploration of the data in order to understand how the classes differ along the independent variables.
Starting with a simple statistical exploration below, class 1 (“left”):
min | 25 percent | median | mean | 75 percent | max | std | |
---|---|---|---|---|---|---|---|
satisfaction_level | 0.09 | 0.11 | 0.41 | 0.44 | 0.73 | 0.92 | 0.26 |
last_evaluation | 0.45 | 0.52 | 0.79 | 0.72 | 0.90 | 1.00 | 0.20 |
number_project | 2.00 | 2.00 | 4.00 | 3.88 | 6.00 | 7.00 | 1.82 |
average_montly_hours | 126.00 | 147.00 | 226.00 | 208.56 | 262.00 | 310.00 | 61.21 |
time_spend_company | 2.00 | 3.00 | 4.00 | 3.89 | 5.00 | 6.00 | 0.98 |
Work_accident | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1.00 | 0.21 |
promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.07 |
department | 1.00 | 5.00 | 8.00 | 7.01 | 9.00 | 10.00 | 2.81 |
salary | 1.00 | 2.00 | 2.00 | 2.35 | 3.00 | 3.00 | 0.52 |
and class 0 (“did not leave”):
min | 25 percent | median | mean | 75 percent | max | std | |
---|---|---|---|---|---|---|---|
satisfaction_level | 0.12 | 0.53 | 0.69 | 0.66 | 0.84 | 1 | 0.22 |
last_evaluation | 0.36 | 0.58 | 0.71 | 0.71 | 0.85 | 1 | 0.16 |
number_project | 2.00 | 3.00 | 4.00 | 3.80 | 4.00 | 6 | 0.98 |
average_montly_hours | 96.00 | 162.00 | 198.00 | 198.91 | 238.00 | 287 | 45.88 |
time_spend_company | 2.00 | 2.00 | 3.00 | 3.39 | 4.00 | 10 | 1.57 |
Work_accident | 0.00 | 0.00 | 0.00 | 0.18 | 0.00 | 1 | 0.38 |
promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 1 | 0.16 |
department | 1.00 | 5.00 | 8.00 | 6.88 | 9.00 | 10 | 2.75 |
salary | 1.00 | 2.00 | 2.00 | 2.35 | 3.00 | 3 | 0.66 |
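The per-class summary tables above can be reproduced by computing descriptive statistics separately within each class. A minimal sketch in Python (the sample values are taken from the transposed preview above, purely for illustration):

```python
import statistics

def summarize(values):
    # Descriptive statistics for one variable within one class.
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": round(statistics.fmean(values), 2),
        "max": max(values),
        "std": round(statistics.stdev(values), 2),
    }

# satisfaction_level for five of the previewed employees:
summarize([0.38, 0.11, 0.37, 0.41, 0.10])
```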
Even this simple analysis reveals differences between the two sets of employees. Our initial hypothesis is that employees who are less satisfied and who work longer hours tend to be the ones who leave. In the next few sections we will apply statistical rigor to make the classification.
We use the box plots below for a simple visual exploration of the independent variables and their relation to the dependent variable. Variable 1 in this case is the satisfaction level and Variable 2 is the employee’s last evaluation. First, class 1:
and class 0:
Since we have identified the independent and the dependent variables, we will now explore methods to classify the data.
In this case we start with Logistic Regression and CART.
Logistic Regression: Logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g., 0 or 1). It estimates the coefficients of a linear model in the selected independent variables while optimizing a classification criterion. For example, these are the estimated logistic regression parameters for our data:
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -0.2 | 0.2 | -0.9 | 0.4 |
satisfaction_level | -4.1 | 0.1 | -34.4 | 0.0 |
last_evaluation | 0.9 | 0.2 | 4.8 | 0.0 |
number_project | -0.3 | 0.0 | -12.6 | 0.0 |
average_montly_hours | 0.0 | 0.0 | 7.5 | 0.0 |
time_spend_company | 0.2 | 0.0 | 13.2 | 0.0 |
Work_accident | -1.5 | 0.1 | -13.9 | 0.0 |
promotion_last_5years | -2.1 | 0.3 | -6.2 | 0.0 |
department | 0.0 | 0.0 | 2.9 | 0.0 |
salary | 0.0 | 0.0 | 0.5 | 0.7 |
The estimated probability that a validation observation belongs to class 1 (e.g., the estimated probability that the employee leaves) for the first few validation observations, using the logistic regression above, is:
Actual Class | Predicted Class | Probability of Class 1 | |
---|---|---|---|
Obs 1 | 0 | 1 | 0.75 |
Obs 2 | 0 | 0 | 0.18 |
Obs 3 | 0 | 0 | 0.24 |
Obs 4 | 0 | 0 | 0.13 |
Obs 5 | 0 | 0 | 0.24 |
Obs 6 | 0 | 0 | 0.23 |
Obs 7 | 0 | 0 | 0.09 |
Obs 8 | 1 | 1 | 0.39 |
Obs 9 | 1 | 1 | 0.27 |
Obs 10 | 1 | 1 | 0.41 |
The default decision is to classify each observation into the class with the highest estimated probability (i.e., using a 50% probability threshold), but one can change this choice, as we discuss below; the predictions in the table above use the 24% threshold discussed later.
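To make the scoring concrete, here is a minimal sketch of how a fitted logistic regression turns a feature vector into a class-1 probability and then a class label. The example uses only the satisfaction_level coefficient (-4.1) and intercept (-0.2) from the table as a toy one-variable model, so its probabilities are illustrative, not the full model's:

```python
import math

def logit_prob(x, coefs, intercept):
    # P(class 1 | x) = 1 / (1 + exp(-(b0 + b . x)))
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, threshold=0.5):
    # Predict class 1 when the estimated probability
    # exceeds the chosen threshold.
    return 1 if p > threshold else 0

# Toy one-variable model: satisfaction_level = 0.5
p = logit_prob([0.5], [-4.1], -0.2)
classify(p, threshold=0.24)
```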
Running a basic CART model with complexity control cp = 0.026 leads to the following tree and complexity-parameter plot:
Based on the complexity-parameter plot we chose cp around 0.02, as this is roughly where the relative error crosses the threshold line.
In our case, the estimated probability that a validation observation belongs to class 1 (i.e., an employee’s likelihood of leaving) for the first few validation observations, using the CART model above, is:
Actual Class | Predicted Class | Probability of Class 1 | |
---|---|---|---|
Obs 1 | 0 | 0 | 0.07 |
Obs 2 | 0 | 0 | 0.01 |
Obs 3 | 0 | 0 | 0.01 |
Obs 4 | 0 | 0 | 0.01 |
Obs 5 | 0 | 0 | 0.01 |
Obs 6 | 0 | 0 | 0.01 |
Obs 7 | 0 | 0 | 0.01 |
Obs 8 | 1 | 1 | 0.95 |
Obs 9 | 1 | 1 | 0.86 |
Obs 10 | 1 | 1 | 0.95 |
The table above assumes that the probability threshold for considering an observation as “class 1” is 0.24.
In our case, we can see the relative importance of the independent variables using the `variable.importance` attribute of the CART trees (see `help(rpart.object)` in R) or the z-scores from the output of the logistic regression. For easier visualization, we scale all values between -1 and 1 (the scaling is done for each method separately; note that CART does not provide the sign of the “coefficients”). From this table we can see the key drivers of the classification according to each of the methods used here.
Logistic Regression | CART 1 | |
---|---|---|
satisfaction_level | -1.00 | -1.00 |
last_evaluation | 0.14 | 0.47 |
number_project | -0.37 | -0.51 |
average_montly_hours | 0.22 | 0.50 |
time_spend_company | 0.38 | 0.36 |
Work_accident | -0.40 | -0.02 |
promotion_last_5years | -0.18 | 0.00 |
department | 0.08 | 0.00 |
salary | 0.01 | 0.00 |
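The scaling behind this table simply divides each method's raw importance values by the largest absolute value, so the dominant variable maps to -1 or +1. Applying this to the z-values from the logistic regression output reproduces the Logistic Regression column:

```python
def scale_to_unit(values):
    # Divide by the largest absolute value so the most
    # important variable maps to -1 or +1.
    m = max(abs(v) for v in values)
    return [round(v / m, 2) for v in values]

# z-values from the logistic regression table, in variable order:
z = [-34.4, 4.8, -12.6, 7.5, 13.2, -13.9, -6.2, 2.9, 0.5]
scale_to_unit(z)
# -> [-1.0, 0.14, -0.37, 0.22, 0.38, -0.4, -0.18, 0.08, 0.01]
```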
Both Logistic Regression and CART agree that satisfaction level is the key driver for an employee leaving. However, among the other variables, there seems to be a disagreement between the models, especially around work accident and average monthly hours.
For different choices of the probability threshold, we measured a number of classification performance metrics, which are outlined next.
This is the percentage of observations that have been correctly classified (i.e., the predicted class and the actual class are the same). We simply count the number of validation observations correctly classified and divide by the total number of validation observations, using the CART and logistic regression models above. For a probability threshold of 24%, the hit ratios are:
Hit Ratio | |
---|---|
Logistic Regression | 73.2 |
CART | 96.8 |
For the estimation data, the hit ratios are:
Hit Ratio | |
---|---|
Logistic Regression | 72.81 |
CART | 96.47 |
We now compare the hit ratio of the classification models against the Maximum Chance Criterion, which measures the proportion of the largest class. For our validation data the largest group is employees who do not leave (2875 out of 3750 employees). Clearly, if we classified all individuals into the largest group, we would get a hit ratio of 76.67% without doing any work. In our case, the hit ratio of Logistic Regression is close to the Maximum Chance Criterion, whereas the CART model performs substantially better than it.
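Both quantities are straightforward to compute; a minimal sketch with toy label vectors for illustration:

```python
def hit_ratio(actual, predicted):
    # Percentage of observations whose predicted class
    # matches the actual class.
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)

def max_chance_criterion(actual):
    # Hit ratio achieved by always predicting the largest class.
    share_1 = sum(actual) / len(actual)
    return 100.0 * max(share_1, 1 - share_1)

actual    = [0, 0, 0, 1, 1]
predicted = [0, 1, 0, 1, 1]
hit_ratio(actual, predicted)     # 4 of 5 correct -> 80.0
max_chance_criterion(actual)     # majority class is 0 -> 60.0
```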
The confusion matrix shows, for each class, the number (or percentage) of observations that are correctly classified. For the model with the highest hit ratio on the validation data (the CART model), and for a probability threshold of 24%, the confusion matrix on the validation data is:
Predicted 1 (Left) | Predicted 0 (Did Not Leave) | |
---|---|---|
Actual 1 (Left) | 92.46 | 7.54 |
Actual 0 (Did Not Leave) | 1.88 | 98.12 |
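The row percentages above come from counting the four prediction outcomes and normalizing each actual class separately. A minimal sketch (the helper name and toy labels are illustrative):

```python
def confusion_matrix_pct(actual, predicted):
    # Count the four (actual, predicted) outcomes, then express
    # each row as a percentage of its actual class.
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    rows = {}
    for cls in (1, 0):
        total = counts[(cls, 1)] + counts[(cls, 0)]
        rows[cls] = (round(100.0 * counts[(cls, 1)] / total, 2),
                     round(100.0 * counts[(cls, 0)] / total, 2))
    return rows  # {actual_class: (% predicted 1, % predicted 0)}

confusion_matrix_pct([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
```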
The ROC curves for the validation data for the logistic regression as well as the CART above are as follows:
In the employee attrition case we are studying, the gains charts for the validation data for our classifiers are the following:
Total Estimated Profit = (% of 1’s correctly predicted) × (value of capturing a 1) + (% of 0’s correctly predicted) × (value of capturing a 0) + (% of 1’s incorrectly predicted as 0) × (cost of missing a 1) + (% of 0’s incorrectly predicted as 1) × (cost of missing a 0)
Calculating the expected profit requires an estimate of the four values/costs: the value of capturing a 1 or a 0, and the cost of misclassifying a 1 as a 0 or vice versa.
Given the values and costs of correct classifications and misclassifications, we can plot the total estimated profit (or loss) as we change the percentage of cases we select, i.e., the probability threshold of the classifier, as we did for the ROC and the gains chart.
In our employee attrition case, we consider the following business profit and loss to the company for correctly classified and misclassified employees:
Predict 1 (Left) | Predict 0 (Did Not Leave) | |
---|---|---|
Actual 1 (Left) | 35000 | -50000 |
Actual 0 (Did Not Leave) | -15000 | 0 |
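Given this payoff matrix, the total estimated profit of a classifier is simply the sum of payoffs over all (actual, predicted) pairs. A minimal sketch (the helper name and toy labels are illustrative):

```python
# Payoff per (actual, predicted) pair, in USD: retaining a correctly
# flagged leaver saves the 50k replacement cost minus the 15k bonus
# (= 35k); a missed leaver costs 50k; a bonus paid to a stayer
# costs 15k.
PAYOFF = {(1, 1): 35_000, (1, 0): -50_000,
          (0, 1): -15_000, (0, 0): 0}

def total_profit(actual, predicted):
    return sum(PAYOFF[(a, p)] for a, p in zip(actual, predicted))

# Two leavers caught, one missed, one false alarm:
total_profit([1, 1, 1, 0], [1, 1, 0, 1])  # 35k + 35k - 50k - 15k = 5000
```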
Based on these profit and cost estimates, the profit curves for the validation data for the classifiers are:
Having iterated steps 2-5, we are now satisfied with the performance of our selected model on the validation data. In this step, the performance analysis outlined in step 5 is repeated on the test sample. This is the performance that best mimics what one should expect in practice upon deployment of the classification solution.
Let’s see how the hit ratio, confusion matrix, ROC curve, gains chart, and profit curve look for our test data. For the hit ratio and the confusion matrix we again use 24% as the probability threshold for classification.
Hit Ratio | |
---|---|
Logistic Regression | 72.4 |
CART | 95.6 |
The confusion matrix for the model with the best hit ratio on the test data (the CART model) is:
Predicted 1 (Left) | Predicted 0 (Did Not Leave) | |
---|---|---|
Actual 1 (Left) | 89.25 | 10.75 |
Actual 0 (Did Not Leave) | 2.30 | 97.70 |
ROC curves for the test data:
Gains chart for the test data:
Finally, the profit curves for the test data, using the same profit/cost estimates as above: