Business Problem

Attrition is a problem that impacts all businesses, irrespective of geography, industry and company size. Employee attrition leads to significant costs for a business, including the costs of business disruption and of hiring and training new staff. As such, there is great business interest in understanding the drivers of attrition and in minimizing it.

In this context, the use of classification models to predict whether an employee is likely to quit could greatly increase HR’s ability to intervene in time and remedy the situation to prevent attrition. While such a model can be run routinely to identify the employees most likely to quit, the key driver of success is the human element: reaching out to the employee, understanding their current situation, and acting on the controllable factors that can prevent their attrition.

This data set presents an employee survey from IBM, indicating whether each employee has attrited or not. The data set contains 1,470 entries. Given its limited size, the model should only be expected to provide a modest improvement in the identification of attrition compared with a random allocation of attrition probabilities.

While some level of attrition in a company is inevitable, minimizing it and being prepared for the cases that cannot be helped will significantly improve the operations of most businesses. As a future development, with a sufficiently large data set, the model could be used to run a segmentation of employees and develop “at risk” categories. This could generate new insights for the business on what drives attrition, insights that cannot be obtained merely through informational interviews with employees.


## 'data.frame':    1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : int  1 0 1 0 0 0 0 0 0 0 ...
##  $ BusinessTravel          : int  3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : int  3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : int  2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : int  1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : int  8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : int  3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : int  2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

The Data

(Data source: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset.)

IBM has gathered information on employee satisfaction, income, seniority and some demographics for 1,470 employees. To work with a numeric matrix structure, we recoded the categorical variables as integers, as described in the table below.


Name Description
AGE Numerical value
ATTRITION Employee leaving the company (0=NO, 1=YES)
BUSINESS TRAVEL (1=NO TRAVEL, 2=TRAVEL FREQUENTLY, 3=TRAVEL RARELY)
DAILY RATE Numerical value - salary level
DEPARTMENT (1=HR, 2=R&D, 3=SALES)
DISTANCE FROM HOME Numerical value - distance from work to home
EDUCATION Numerical value
EDUCATION FIELD (1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL)
EMPLOYEE COUNT Numerical value
EMPLOYEE NUMBER Numerical value - employee ID
ENVIRONMENT SATISFACTION Numerical value - satisfaction with the environment
GENDER (1=FEMALE, 2=MALE)
HOURLY RATE Numerical value - hourly salary
JOB INVOLVEMENT Numerical value - job involvement
JOB LEVEL Numerical value - level of job
JOB ROLE (1=HC REP, 2=HR, 3=LAB TECHNICIAN, 4=MANAGER, 5=MANAGING DIRECTOR, 6=RESEARCH DIRECTOR, 7=RESEARCH SCIENTIST, 8=SALES EXECUTIVE, 9=SALES REPRESENTATIVE)
JOB SATISFACTION Numerical value - satisfaction with the job
MARITAL STATUS (1=DIVORCED, 2=MARRIED, 3=SINGLE)
MONTHLY INCOME Numerical value - monthly salary
MONTHLY RATE Numerical value - monthly rate
NUM COMPANIES WORKED Numerical value - number of companies worked at
OVER 18 (1=YES, 2=NO)
OVERTIME (1=NO, 2=YES)
PERCENT SALARY HIKE Numerical value - percentage increase in salary
PERFORMANCE RATING Numerical value - performance rating
RELATIONSHIP SATISFACTION Numerical value - relationship satisfaction
STANDARD HOURS Numerical value - standard hours
STOCK OPTION LEVEL Numerical value - stock options
TOTAL WORKING YEARS Numerical value - total years worked
TRAINING TIMES LAST YEAR Numerical value - number of trainings attended last year
WORK LIFE BALANCE Numerical value - time split between work and outside
YEARS AT COMPANY Numerical value - total number of years at the company
YEARS IN CURRENT ROLE Numerical value - years in current role
YEARS SINCE LAST PROMOTION Numerical value - years since last promotion
YEARS WITH CURRENT MANAGER Numerical value - years spent with current manager
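As a rough sketch of how the raw Kaggle file can be turned into this all-integer structure (the file name and the exact recoding are assumptions, not necessarily the code used in the original analysis), each factor column can simply be replaced by its integer code:

```r
# Sketch only: read the Kaggle file and replace every factor column by its
# integer code, reproducing the all-integer structure shown in str() above.
# The file name is the Kaggle default; the original analysis may differ.
raw <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv", stringsAsFactors = TRUE)

ibm <- raw
factor_cols <- sapply(ibm, is.factor)
ibm[factor_cols] <- lapply(ibm[factor_cols], as.integer)  # e.g. Gender: Female/Male -> 1/2

str(ibm)  # should resemble the structure printed above
```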

The Solution - Methodology

We plan to run a logistic regression model and CART models to estimate the probability that a given employee falls into the attrition class, and hence is at high risk of leaving the company. We will then test different parameters and probability thresholds using confusion matrices, the area under the ROC curve (AUC) and the Gini coefficient, to determine which of the models is the best predictor and to recommend its use in practice.

Let’s look at the data for a few employees. This is what the first 10 of the 1,470 rows look like (transposed, for convenience):

                               01     02     03     04     05     06     07     08     09     10
Age                            41     49     37     33     27     32     59     30     38     36
BusinessTravel                  3      2      3      2      3      2      3      3      2      3
EducationField                  2      2      5      2      4      2      4      2      2      4
EnvironmentSatisfaction         2      3      4      4      1      4      3      4      4      3
Gender                          1      2      2      1      2      2      1      2      2      2
HourlyRate                     94     61     92     56     40     79     81     67     44     94
JobInvolvement                  3      2      2      3      3      3      4      3      2      3
JobLevel                        2      2      1      1      1      1      1      1      3      2
JobRole                         8      7      3      7      3      3      3      3      5      1
JobSatisfaction                 4      2      3      3      2      4      1      3      3      3
MaritalStatus                   3      2      3      2      2      3      2      1      3      2
MonthlyIncome                5993   5130   2090   2909   3468   3068   2670   2693   9526   5237
MonthlyRate                 19479  24907   2396  23159  16632  11864   9964  13335   8787  16577
NumCompaniesWorked              8      1      6      1      9      0      4      1      0      6
OverTime                        2      1      2      2      1      1      2      1      1      1
PercentSalaryHike              11     23     15     11     12     13     20     22     21     13
PerformanceRating               1      2      1      1      1      1      2      2      2      1
RelationshipSatisfaction        1      4      2      3      4      3      1      2      2      2
StockOptionLevel                1      2      1      1      2      1      4      2      1      3
TotalWorkingYears               8     10      7      8      6      8     12      1     10     17
TrainingTimesLastYear           0      3      3      3      3      2      3      2      2      3
WorkLifeBalance                 1      3      3      3      3      2      2      3      3      2
YearsAtCompany                  6     10      0      8      2      7      1      1      9      7
YearsInCurrentRole              4      7      0      7      2      7      0      0      7      7
YearsSinceLastPromotion         0      1      0      3      2      3      0      0      1      7
YearsWithCurrManager            5      7      0      0      2      6      0      0      8      7

The Process for Classification

  1. Create an estimation sample and two validation samples by splitting the data into three groups.
  2. Set up the dependent variable, employee attrition (as a categorical 0/1 variable).
  3. Estimate the classification model using the estimation data, and interpret the results.
  4. Assess the accuracy of classification in the first validation sample, possibly repeating steps 2-4 a few times, changing the classifier in different ways to increase performance.
  5. Finally, assess the accuracy of classification in the second validation sample. You should eventually use and report all relevant performance measures and plots on this second validation sample only.

Step 1: Split the data

We split the data into an estimation sample and two validation samples using a randomized splitting technique. The second validation sample mimics out-of-sample data, and the performance on this set is a better approximation of the performance one should expect in practice from the selected classification method. The split used is 80% estimation, 10% validation and 10% test data. In general these proportions depend on the number of observations - when there is a lot of data, one may keep only a few hundred observations for the validation and test sets and use the rest for estimation.
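A minimal sketch of such a randomized 80/10/10 split (the seed and object names are illustrative, not the exact code used):

```r
# Randomized 80% / 10% / 10% split into estimation, validation and test samples.
# 'ibm' is the numeric data frame sketched above; the seed is arbitrary.
set.seed(42)

n   <- nrow(ibm)
idx <- sample(n)                      # random permutation of row indices

n_est <- floor(0.80 * n)
n_val <- floor(0.10 * n)

estimation <- ibm[idx[1:n_est], ]
validation <- ibm[idx[(n_est + 1):(n_est + n_val)], ]
test       <- ibm[idx[(n_est + n_val + 1):n], ]

c(estimation = nrow(estimation), validation = nrow(validation), test = nrow(test))
```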


Step 2: Set up the dependent variable

The original data file did not encode attrition as a categorical variable, so we recoded the “Attrition” column to 0/1 values.
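A sketch of the recoding, assuming the integer coding produced by the conversion sketch above (1 = “No”, 2 = “Yes”):

```r
# The factor-to-integer conversion sketched above codes Attrition as
# 1 = "No", 2 = "Yes" (alphabetical levels); shift it to a 0/1 indicator.
recode01 <- function(d) { d$Attrition <- d$Attrition - 1; d }

estimation <- recode01(estimation)
validation <- recode01(validation)
test       <- recode01(test)

table(estimation$Attrition)  # compare with the estimation counts below
table(validation$Attrition)  # compare with the validation counts below
```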

In our estimation sample the number of 1’s (attrition) and 0’s (no attrition) is as follows:

                 Attrition   No Attrition
# of Employees         186            990

while in the validation sample they are:

                 Attrition   No Attrition
# of Employees          25            122

Step 3: Simple Analysis

We first tabulate simple summary statistics for the employees who attrited:

min 25 percent median mean 75 percent max std
Age 18 28.00 32.0 34.11 40.00 58 9.98
BusinessTravel 1 2.00 3.0 2.61 3.00 3 0.59
EducationField 1 2.00 3.0 3.33 4.00 6 1.41
EnvironmentSatisfaction 1 1.00 3.0 2.45 3.75 4 1.17
Gender 1 1.00 2.0 1.61 2.00 2 0.49
HourlyRate 31 50.00 65.5 65.29 83.75 100 20.38
JobInvolvement 1 2.00 3.0 2.50 3.00 4 0.77
JobLevel 1 1.00 1.0 1.63 2.00 5 0.95
JobRole 1 3.00 7.0 5.71 8.00 9 2.63
JobSatisfaction 1 1.00 3.0 2.48 3.00 4 1.10
MaritalStatus 1 2.00 2.5 2.35 3.00 3 0.72
MonthlyIncome 1081 2363.00 3090.5 4805.40 6098.75 19859 3681.86
MonthlyRate 2326 8885.25 14465.0 14343.95 20700.75 26956 7067.61
NumCompaniesWorked 0 1.00 1.0 2.96 5.00 9 2.70
OverTime 1 1.00 2.0 1.55 2.00 2 0.50
PercentSalaryHike 11 12.00 14.0 14.82 17.00 25 3.60
PerformanceRating 1 1.00 1.0 1.12 1.00 2 0.32
RelationshipSatisfaction 1 1.00 3.0 2.56 4.00 4 1.14
StockOptionLevel 1 1.00 1.0 1.54 2.00 4 0.86
TotalWorkingYears 0 3.00 7.0 8.28 10.00 40 7.51
TrainingTimesLastYear 0 2.00 2.5 2.62 3.00 6 1.25
WorkLifeBalance 1 2.00 3.0 2.68 3.00 4 0.80
YearsAtCompany 0 1.00 3.0 5.08 7.00 40 6.19
YearsInCurrentRole 0 0.00 2.0 2.90 4.00 15 3.14
YearsSinceLastPromotion 0 0.00 1.0 1.93 2.00 15 3.03
YearsWithCurrManager 0 0.00 2.0 2.83 5.00 14 3.16

And not attrited:

min 25 percent median mean 75 percent max std
Age 18 31.00 36.0 37.61 43 60 8.95
BusinessTravel 1 2.00 3.0 2.60 3 3 0.68
EducationField 1 2.00 3.0 3.23 4 6 1.33
EnvironmentSatisfaction 1 2.00 3.0 2.75 4 4 1.08
Gender 1 1.00 2.0 1.61 2 2 0.49
HourlyRate 30 48.00 66.0 66.14 83 100 20.52
JobInvolvement 1 2.00 3.0 2.75 3 4 0.69
JobLevel 1 1.00 2.0 2.12 3 5 1.12
JobRole 1 3.00 6.0 5.34 7 9 2.44
JobSatisfaction 1 2.00 3.0 2.81 4 4 1.10
MaritalStatus 1 2.00 2.0 2.05 3 3 0.73
MonthlyIncome 1051 3067.25 5090.0 6710.29 8599 19999 4804.32
MonthlyRate 2094 7744.00 13961.5 14141.13 20364 26997 7077.85
NumCompaniesWorked 0 1.00 2.0 2.66 4 9 2.47
OverTime 1 1.00 1.0 1.23 1 2 0.42
PercentSalaryHike 11 12.00 14.0 15.33 18 25 3.67
PerformanceRating 1 1.00 1.0 1.16 1 2 0.37
RelationshipSatisfaction 1 2.00 3.0 2.75 4 4 1.07
StockOptionLevel 1 1.00 2.0 1.84 2 4 0.85
TotalWorkingYears 0 6.00 10.0 11.86 16 38 7.74
TrainingTimesLastYear 0 2.00 3.0 2.83 3 6 1.28
WorkLifeBalance 1 2.00 3.0 2.79 3 4 0.67
YearsAtCompany 0 3.00 5.0 7.32 10 37 6.15
YearsInCurrentRole 0 2.00 3.0 4.42 7 18 3.69
YearsSinceLastPromotion 0 0.00 1.0 2.19 3 15 3.21
YearsWithCurrManager 0 2.00 3.0 4.30 7 17 3.60
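These group-wise summaries can be reproduced with a short sketch along the following lines (which columns were retained and how some of them were recoded are assumptions; the original tables may have been produced differently):

```r
# Summary statistics (min, quartiles, mean, max, sd) for one group of the
# estimation sample; call once with attrition_value = 1 and once with 0.
group_summary <- function(data, attrition_value) {
  sub <- data[data$Attrition == attrition_value,
              setdiff(names(data), "Attrition")]
  t(sapply(sub, function(x) c(min      = min(x),
                              "25 pct" = unname(quantile(x, 0.25)),
                              median   = median(x),
                              mean     = round(mean(x), 2),
                              "75 pct" = unname(quantile(x, 0.75)),
                              max      = max(x),
                              std      = round(sd(x), 2))))
}

group_summary(estimation, 1)  # attrited employees
group_summary(estimation, 0)  # employees who stayed
```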

Step 4: Classification and Interpretation

We use a number of classification methods to develop a model that discriminates between the two classes.

In this report we consider two: logistic regression, and classification and regression trees (CART).

Logistic Regression: logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g., 0 or 1). Logistic regression estimates the coefficients of a linear model in the selected independent variables while optimizing a classification criterion. For example, these are the logistic regression parameters estimated on our data:

Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7 1.3 1.3 0.2
Age 0.0 0.0 -1.3 0.2
BusinessTravel 0.0 0.1 -0.1 0.9
EducationField 0.0 0.1 0.5 0.6
EnvironmentSatisfaction -0.4 0.1 -4.3 0.0
Gender 0.2 0.2 1.1 0.3
HourlyRate 0.0 0.0 -0.5 0.6
JobInvolvement -0.5 0.1 -4.1 0.0
JobLevel -0.1 0.3 -0.2 0.8
JobRole 0.0 0.0 0.1 0.9
JobSatisfaction -0.4 0.1 -4.3 0.0
MaritalStatus 0.5 0.2 3.0 0.0
MonthlyIncome 0.0 0.0 -0.8 0.4
MonthlyRate 0.0 0.0 0.3 0.7
NumCompaniesWorked 0.2 0.0 4.1 0.0
OverTime 1.8 0.2 9.2 0.0
PercentSalaryHike 0.0 0.0 -0.8 0.4
PerformanceRating -0.2 0.4 -0.4 0.7
RelationshipSatisfaction -0.3 0.1 -3.3 0.0
StockOptionLevel -0.2 0.2 -1.3 0.2
TotalWorkingYears -0.1 0.0 -2.4 0.0
TrainingTimesLastYear -0.1 0.1 -1.7 0.1
WorkLifeBalance -0.2 0.1 -1.6 0.1
YearsAtCompany 0.1 0.0 2.1 0.0
YearsInCurrentRole -0.1 0.0 -3.0 0.0
YearsSinceLastPromotion 0.2 0.0 3.8 0.0
YearsWithCurrManager -0.1 0.1 -2.2 0.0

Given a set of independent variables, the output of the estimated logistic regression (the sum of the products of the independent variables with the corresponding regression coefficients) can be used to assess the probability that an observation belongs to one of the classes. Specifically, the regression output can be transformed into a probability of belonging to, say, class 1 for each observation. The estimated probability that a validation observation belongs to class 1 (e.g., the estimated probability that the employee attrites) for the first few validation observations, using the logistic regression above, is:

Actual Class Predicted Class Probability of Class 1
Obs 1 0 0 0.01
Obs 2 0 0 0.01
Obs 3 0 0 0.02
Obs 4 0 0 0.09
Obs 5 0 0 0.00
Obs 6 1 0 0.24
Obs 7 0 0 0.01
Obs 8 0 0 0.02
Obs 9 1 0 0.22
Obs 10 0 0 0.09

The default decision is to classify each observation into the group with the highest probability.
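A minimal sketch of how such a model can be estimated and used to score the validation sample with glm (the exact formula, any preprocessing, and which columns are kept in the estimation sample are assumptions):

```r
# Estimate the logistic regression on the estimation sample.
# Assumes 'estimation' holds Attrition plus the independent variables used here.
logreg <- glm(Attrition ~ ., data = estimation, family = binomial(link = "logit"))

summary(logreg)  # coefficient table analogous to the one shown above

# Probability of class 1 (attrition) for the validation sample: the linear
# predictor x'b is mapped to a probability through 1 / (1 + exp(-x'b)).
prob_logreg <- predict(logreg, newdata = validation, type = "response")

# Classify with a chosen probability threshold (0.45 is used later in this report).
pred_logreg <- ifelse(prob_logreg > 0.45, 1, 0)

head(data.frame(actual      = validation$Attrition,
                predicted   = pred_logreg,
                probability = round(prob_logreg, 2)), 10)
```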

CART

CART is a widely used classification method largely because the estimated classification models are easy to interpret. This classification tool iteratively “splits” the data using the most discriminatory independent variable at each step, building a “tree” - as shown below - on the way. The CART methods limit the size of the tree using various statistical techniques in order to avoid overfitting the data. For example, using the rpart and rpart.control functions in R, we can limit the size of the tree by selecting the functions’ complexity control parameter cp.
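For example, the two trees discussed below (cp = 0.002 and cp = 2e-04) can be grown with calls along these lines (a sketch; only the cp values are taken from the text, everything else is illustrative):

```r
library(rpart)
library(rpart.plot)  # used here only to draw the tree; any plotting routine works

# First CART: complexity parameter cp = 0.002, as in the text below.
cart1 <- rpart(Attrition ~ ., data = estimation, method = "class",
               control = rpart.control(cp = 0.002))

# Second, larger CART: cp = 2e-04.
cart2 <- rpart(Attrition ~ ., data = estimation, method = "class",
               control = rpart.control(cp = 2e-04))

rpart.plot(cart1)  # tree figure analogous to the one shown below

# Leaf-based probability of class 1 (attrition) for the validation sample.
prob_cart1 <- predict(cart1, newdata = validation, type = "prob")[, "1"]
```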

Running a basic CART model with complexity control cp = 0.002 leads to the following tree (NOTE: for better readability of the tree figures below, we rename the independent variables as IV1 to IV26 when using CART):

The leaves of the tree indicate the number of estimation data observations that “reach that leaf” that belong to each class. A perfect classification would only have data from one class in each of the tree leaves. However, such a perfect classification of the estimation data would most likely not be able to classify well out-of-sample data due to overfitting of the estimation data.

One can estimate larger trees by changing the tree’s complexity control parameter (in this case the rpart.control argument cp). For example, this is how the tree looks if we set cp = 2e-04:

One can also use the percentage of data in each leaf of the tree to get an estimate of the probability that an observation (e.g., an employee) belongs to a given class. The purity of the leaf can indicate the probability that an observation that “reaches that leaf” belongs to a class. In our case, the probability that our validation data belong to class 1 (i.e., an employee’s likelihood of attrition) for the first few validation observations, using the first CART above, is:

Actual Class Predicted Class Probability of Class 1
Obs 1 0 0 0.07
Obs 2 0 0 0.07
Obs 3 0 0 0.07
Obs 4 0 0 0.07
Obs 5 0 0 0.07
Obs 6 1 0 0.07
Obs 7 0 0 0.07
Obs 8 0 0 0.12
Obs 9 1 0 0.07
Obs 10 0 0 0.07

The table above assumes that the probability threshold for classifying an observation as “class 1” is 0.45. In practice we need to select this probability threshold: it is an important choice that we discuss below.

                          Logistic Regression   CART 1   CART 2
Age                                     -0.14    -0.20    -0.27
BusinessTravel                          -0.01    -0.02    -0.01
EducationField                           0.05     0.04     0.07
EnvironmentSatisfaction                 -0.47    -0.11    -0.15
Gender                                   0.12     0.00     0.00
HourlyRate                              -0.05    -0.34    -0.38
JobInvolvement                          -0.45    -0.08    -0.10
JobLevel                                -0.02    -0.15    -0.17
JobRole                                  0.01     0.35     0.43
JobSatisfaction                         -0.47    -0.04    -0.04
MaritalStatus                            0.33     0.16     0.15
MonthlyIncome                           -0.09    -1.00    -1.00
MonthlyRate                              0.03     0.07     0.09
NumCompaniesWorked                       0.45     0.16     0.14
OverTime                                 1.00     0.59     0.54
PercentSalaryHike                       -0.09    -0.17    -0.19
PerformanceRating                       -0.04    -0.02    -0.02
RelationshipSatisfaction                -0.36    -0.03    -0.04
StockOptionLevel                        -0.14    -0.23    -0.22
TotalWorkingYears                       -0.26    -0.50    -0.49
TrainingTimesLastYear                   -0.18    -0.01    -0.01
WorkLifeBalance                         -0.17    -0.16    -0.14
YearsAtCompany                           0.23     0.25     0.26
YearsInCurrentRole                      -0.33    -0.16    -0.18
YearsSinceLastPromotion                  0.41     0.00     0.08
YearsWithCurrManager                    -0.24    -0.17    -0.18

Step 5: Validation accuracy

Using the predicted class probabilities of the validation data, as outlined above, we can generate some measures of classification performance.

1. Hit ratio

This is the percentage of the observations that have been correctly classified (i.e., the predicted class and the actual class are the same). These are as follows for probability threshold 45%:

Hit Ratio
Logistic Regression 82.99320
First CART 82.31293
Second CART 82.99320

For the estimation data, the hit rates are:

Hit Ratio
Logistic Regression 87.50000
First CART 89.37075
Second CART 89.54082

2. Confusion matrix

The confusion matrix shows, for each class, the number (or percentage) of observations that are correctly classified for that class. For example, for the method above with the highest hit rate in the validation data (among the logistic regression and the two CART models), and for a probability threshold of 45%, the confusion matrix for the validation data (in row percentages) is:

Predicted 1 (Attrition) Predicted 0 (No Attrition)
Actual 1 (Attrition) 32.00 68.00
Actual 0 (No Attrition) 6.56 93.44
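Both measures can be computed directly from the predicted class probabilities, for example (a sketch using the illustrative objects from the earlier snippets):

```r
# Hit ratio and confusion matrix on the validation sample at a 45% threshold.
threshold  <- 0.45
pred_class <- ifelse(prob_logreg > threshold, 1, 0)

# Hit ratio: share of validation observations classified correctly.
100 * mean(pred_class == validation$Attrition)

# Confusion matrix as row percentages (actual class in rows, predicted in columns).
conf <- table(Actual = validation$Attrition, Predicted = pred_class)
round(100 * prop.table(conf, margin = 1), 2)
```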

3. ROC curve

The ROC curves for the validation data for the logistic regression as well as both the CARTs above are as follows:
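The curves and the corresponding AUC (and the Gini coefficient, Gini = 2 x AUC - 1, mentioned in the methodology) can be produced with, for instance, the pROC package (a sketch; the original figures may have been generated with a different package):

```r
library(pROC)  # one of several packages that can draw ROC curves

roc_logreg <- roc(validation$Attrition, prob_logreg)
roc_cart1  <- roc(validation$Attrition, prob_cart1)

plot(roc_logreg, col = "blue")
lines(roc_cart1, col = "red")

auc(roc_logreg)          # area under the curve
2 * auc(roc_logreg) - 1  # Gini coefficient
```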

4. Gains chart

The gains charts for the validation data for our three classifiers are the following:
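A gains chart can be sketched by ranking the validation observations by predicted probability and plotting the cumulative share of actual attriters captured, for example:

```r
# Gains chart: rank validation employees by predicted probability of attrition
# and plot the cumulative percentage of actual attriters captured.
ord           <- order(prob_logreg, decreasing = TRUE)
actual_sorted <- validation$Attrition[ord]

x <- 100 * seq_along(actual_sorted) / length(actual_sorted)  # % of employees contacted
y <- 100 * cumsum(actual_sorted) / sum(actual_sorted)         # % of attriters captured

plot(x, y, type = "l",
     xlab = "% of employees (ranked by predicted risk)",
     ylab = "% of attriters captured")
abline(0, 1, lty = 2)  # baseline: random ordering
```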


Step 6. Test Accuracy

Having iterated steps 2-5 until we were satisfied with the performance of our selected model on the validation data, in this step the performance analysis outlined in step 5 is repeated using the test sample.

Let’s see how the hit ratio, confusion matrix, ROC curve and gains chart look for our test data. For the hit ratio and the confusion matrix we use 45% as the probability threshold for classification.

Hit Ratio
Logistic Regression 89.11565
First CART 85.03401
Second CART 82.31293

The confusion matrix (in row percentages) for the model with the best hit ratio on the test data is:

Predicted 1 (Attrition) Predicted 0 (No Attrition)
Actual 1 (Attrition) 42.31 57.69
Actual 0 (No Attrition) 0.83 99.17

ROC curves for the test data:

Gains chart for the test data:


Step 7. Data Analysis

After running the models multiple times and iterating to find the best parameter values, we reached the following conclusions:

  • The models are biased towards predicting non-attrition.
  • There is a tension between the probability threshold and the number of employees who are accurately identified as potential leavers. A high probability threshold results in many attriters being missed. Since the business priority is to predict attrition well, rather than non-attrition, a lower probability threshold is chosen.
  • The confusion matrix shows that, of all the people who are going to leave the company, our algorithm identifies about 42% accurately. While not ideal, this is a large improvement over random selection, which would identify only about 16% (the base attrition rate). On the other hand, wrongly flagging employees who are not going to leave carries a cost, resulting in inefficiencies in resource allocation.
  • Logistic regression is the best model, as it consistently achieves a higher area under the curve and a better confusion matrix.