Attrition is a problem that impacts all businesses, irrespective of geography, industry, and company size. Employee attrition leads to significant costs for a business, including the cost of business disruption and of hiring and training new staff. As such, there is great business interest in understanding the drivers of staff attrition and in minimizing it.
In this context, a classification model that predicts whether an employee is likely to quit could greatly increase HR's ability to intervene on time and remedy the situation to prevent attrition. While such a model can be run routinely to identify the employees most likely to quit, the key driver of success is the human element: reaching out to the employee, understanding their current situation, and acting on the controllable factors that can prevent their attrition.
This data set presents an employee survey from IBM, indicating whether there is attrition or not. The data set contains approximately 1500 entries. Given its limited size, the model should only be expected to provide a modest improvement in the identification of attrition versus a random allocation of attrition probability.
While some level of attrition in a company is inevitable, minimizing it and being prepared for the cases that cannot be helped will significantly improve the operations of most businesses. As a future development, a sufficiently large data set could be used to run a segmentation of employees and develop "at risk" categories. This could generate new insights for the business on what drives attrition, insights that cannot be obtained from informational interviews with employees alone.
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : int 1 0 1 0 0 0 0 0 0 0 ...
## $ BusinessTravel : int 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : int 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : int 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : int 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : int 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : int 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : int 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
(Data source: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset.)
IBM has gathered information on employee satisfaction, income, seniority, and some demographics for 1470 employees. To use a matrix structure, we recoded the variables as described in the following table:
Name | Description |
---|---|
AGE | Numerical Value |
ATTRITION | Employee leaving the company (0=no, 1=yes) |
BUSINESS TRAVEL | (1=No Travel, 2=Travel Frequently, 3=Travel Rarely) |
DAILY RATE | Numerical Value - Salary Level |
DEPARTMENT | (1=HR, 2=R&D, 3=Sales) |
DISTANCE FROM HOME | Numerical Value - THE DISTANCE FROM WORK TO HOME |
EDUCATION | Numerical Value |
EDUCATION FIELD | (1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL) |
EMPLOYEE COUNT | Numerical Value |
EMPLOYEE NUMBER | Numerical Value - EMPLOYEE ID |
ENVIRONMENT SATISFACTION | Numerical Value - SATISFACTION WITH THE ENVIRONMENT |
GENDER | (1=FEMALE, 2=MALE) |
HOURLY RATE | Numerical Value - HOURLY SALARY |
JOB INVOLVEMENT | Numerical Value - JOB INVOLVEMENT |
JOB LEVEL | Numerical Value - LEVEL OF JOB |
JOB ROLE | (1=HC REP, 2=HR, 3=LAB TECHNICIAN, 4=MANAGER, 5=MANUFACTURING DIRECTOR, 6=RESEARCH DIRECTOR, 7=RESEARCH SCIENTIST, 8=SALES EXECUTIVE, 9=SALES REPRESENTATIVE) |
JOB SATISFACTION | Numerical Value - SATISFACTION WITH THE JOB |
MARITAL STATUS | (1=DIVORCED, 2=MARRIED, 3=SINGLE) |
MONTHLY INCOME | Numerical Value - MONTHLY SALARY |
MONTHLY RATE | Numerical Value - MONTHLY RATE |
NUM COMPANIES WORKED | Numerical Value - NO. OF COMPANIES WORKED AT |
OVER 18 | (1=YES, 2=NO) |
OVERTIME | (1=NO, 2=YES) |
PERCENT SALARY HIKE | Numerical Value - PERCENTAGE INCREASE IN SALARY |
PERFORMANCE RATING | Numerical Value - PERFORMANCE RATING |
RELATIONSHIP SATISFACTION | Numerical Value - RELATIONSHIP SATISFACTION |
STANDARD HOURS | Numerical Value - STANDARD HOURS |
STOCK OPTIONS LEVEL | Numerical Value - STOCK OPTIONS |
TOTAL WORKING YEARS | Numerical Value - TOTAL YEARS WORKED |
TRAINING TIMES LAST YEAR | Numerical Value - NO. OF TRAININGS LAST YEAR |
WORK LIFE BALANCE | Numerical Value - TIME SPENT BETWEEN WORK AND OUTSIDE |
YEARS AT COMPANY | Numerical Value - TOTAL NUMBER OF YEARS AT THE COMPANY |
YEARS IN CURRENT ROLE | Numerical Value - YEARS IN CURRENT ROLE |
YEARS SINCE LAST PROMOTION | Numerical Value - YEARS SINCE LAST PROMOTION |
YEARS WITH CURRENT MANAGER | Numerical Value - YEARS SPENT WITH CURRENT MANAGER |
We plan to run a logistic regression model and two CART models to estimate the probability that a given employee falls into the Attrition class and is therefore at high risk of leaving the company. We will then test different parameters and probability thresholds using confusion matrices, the area under the ROC curve (AUC), and the Gini coefficient to determine which of the three models is the best predictor, and will recommend its use in practice.
Let's look into the data for a few employees. This is how the first 10 of the 1470 rows look (transposed, for convenience):
(Transposed table of the first 10 observations, columns 01-10; the first ten values of each variable also appear in the listing above.)
We split the data into an estimation sample and two validation samples using a randomized splitting technique. The second validation sample mimics out-of-sample data, and performance on this set better approximates the performance one should expect in practice from the selected classification method. The split used is 80% estimation, 10% validation, and 10% test data. These proportions can depend on the number of observations: when there is a lot of data, one may keep only a few hundred observations for the validation and test sets and use the rest for estimation.
In the original data file, "Attrition" was not coded as a numeric variable, so we changed the column to 0 and 1 values.
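For concreteness, here is a minimal R sketch of the recoding and the randomized 80/10/10 split. The file name, the data frame name `hr`, and the seed are our own assumptions, not taken from the report:

```r
# Assumed file name from the Kaggle page; adjust to your local copy
hr <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv", stringsAsFactors = TRUE)

# Recode the "Yes"/"No" Attrition column to 1/0
hr$Attrition <- ifelse(hr$Attrition == "Yes", 1, 0)

# Convert the remaining text columns to the integer codes shown in the
# table above (alphabetical factor order matches the listed codings)
for (v in c("BusinessTravel", "Department", "EducationField", "Gender",
            "JobRole", "MaritalStatus", "OverTime")) {
  hr[[v]] <- as.integer(hr[[v]])
}

# Randomized 80/10/10 split into estimation, validation, and test samples
set.seed(1)                      # arbitrary seed for reproducibility
n   <- nrow(hr)
idx <- sample(seq_len(n))        # random permutation of the row indices
estimation <- hr[idx[1:round(0.8 * n)], ]
validation <- hr[idx[(round(0.8 * n) + 1):round(0.9 * n)], ]
test       <- hr[idx[(round(0.9 * n) + 1):n], ]
```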
In our data the number of 0/1’s in our estimation sample is as follows:
 | Attrition | No Attrition |
---|---|---|
# of Employees | 186 | 990 |
while in the validation sample they are:
(table of Attrition / No Attrition counts for the validation sample)
We produce a simple table to summarize the data for the employees who attrited:
 | min | 25 percent | median | mean | 75 percent | max | std |
---|---|---|---|---|---|---|---|
Age | 18 | 28.00 | 32.0 | 34.11 | 40.00 | 58 | 9.98 |
BusinessTravel | 1 | 2.00 | 3.0 | 2.61 | 3.00 | 3 | 0.59 |
EducationField | 1 | 2.00 | 3.0 | 3.33 | 4.00 | 6 | 1.41 |
EnvironmentSatisfaction | 1 | 1.00 | 3.0 | 2.45 | 3.75 | 4 | 1.17 |
Gender | 1 | 1.00 | 2.0 | 1.61 | 2.00 | 2 | 0.49 |
HourlyRate | 31 | 50.00 | 65.5 | 65.29 | 83.75 | 100 | 20.38 |
JobInvolvement | 1 | 2.00 | 3.0 | 2.50 | 3.00 | 4 | 0.77 |
JobLevel | 1 | 1.00 | 1.0 | 1.63 | 2.00 | 5 | 0.95 |
JobRole | 1 | 3.00 | 7.0 | 5.71 | 8.00 | 9 | 2.63 |
JobSatisfaction | 1 | 1.00 | 3.0 | 2.48 | 3.00 | 4 | 1.10 |
MaritalStatus | 1 | 2.00 | 2.5 | 2.35 | 3.00 | 3 | 0.72 |
MonthlyIncome | 1081 | 2363.00 | 3090.5 | 4805.40 | 6098.75 | 19859 | 3681.86 |
MonthlyRate | 2326 | 8885.25 | 14465.0 | 14343.95 | 20700.75 | 26956 | 7067.61 |
NumCompaniesWorked | 0 | 1.00 | 1.0 | 2.96 | 5.00 | 9 | 2.70 |
OverTime | 1 | 1.00 | 2.0 | 1.55 | 2.00 | 2 | 0.50 |
PercentSalaryHike | 11 | 12.00 | 14.0 | 14.82 | 17.00 | 25 | 3.60 |
PerformanceRating | 1 | 1.00 | 1.0 | 1.12 | 1.00 | 2 | 0.32 |
RelationshipSatisfaction | 1 | 1.00 | 3.0 | 2.56 | 4.00 | 4 | 1.14 |
StockOptionLevel | 1 | 1.00 | 1.0 | 1.54 | 2.00 | 4 | 0.86 |
TotalWorkingYears | 0 | 3.00 | 7.0 | 8.28 | 10.00 | 40 | 7.51 |
TrainingTimesLastYear | 0 | 2.00 | 2.5 | 2.62 | 3.00 | 6 | 1.25 |
WorkLifeBalance | 1 | 2.00 | 3.0 | 2.68 | 3.00 | 4 | 0.80 |
YearsAtCompany | 0 | 1.00 | 3.0 | 5.08 | 7.00 | 40 | 6.19 |
YearsInCurrentRole | 0 | 0.00 | 2.0 | 2.90 | 4.00 | 15 | 3.14 |
YearsSinceLastPromotion | 0 | 0.00 | 1.0 | 1.93 | 2.00 | 15 | 3.03 |
YearsWithCurrManager | 0 | 0.00 | 2.0 | 2.83 | 5.00 | 14 | 3.16 |
And not attrited:
(summary-statistics table for the non-attrited employees, same columns as above)
Given these data, we use a number of classification methods to develop a model that discriminates between the different classes. In this paper we consider two: logistic regression, and classification and regression trees (CART).
Logistic Regression: Logistic regression is a method similar to linear regression, except that the dependent variable is discrete (e.g., 0 or 1). Logistic regression estimates the coefficients of a linear model in the selected independent variables while optimizing a classification criterion. For example, these are the estimated logistic regression parameters for our data (a code sketch follows the table):
 | Estimate | Std. Error | z value | Pr(>|z|) |
---|---|---|---|---|
(Intercept) | 1.7 | 1.3 | 1.3 | 0.2 |
Age | 0.0 | 0.0 | -1.3 | 0.2 |
BusinessTravel | 0.0 | 0.1 | -0.1 | 0.9 |
EducationField | 0.0 | 0.1 | 0.5 | 0.6 |
EnvironmentSatisfaction | -0.4 | 0.1 | -4.3 | 0.0 |
Gender | 0.2 | 0.2 | 1.1 | 0.3 |
HourlyRate | 0.0 | 0.0 | -0.5 | 0.6 |
JobInvolvement | -0.5 | 0.1 | -4.1 | 0.0 |
JobLevel | -0.1 | 0.3 | -0.2 | 0.8 |
JobRole | 0.0 | 0.0 | 0.1 | 0.9 |
JobSatisfaction | -0.4 | 0.1 | -4.3 | 0.0 |
MaritalStatus | 0.5 | 0.2 | 3.0 | 0.0 |
MonthlyIncome | 0.0 | 0.0 | -0.8 | 0.4 |
MonthlyRate | 0.0 | 0.0 | 0.3 | 0.7 |
NumCompaniesWorked | 0.2 | 0.0 | 4.1 | 0.0 |
OverTime | 1.8 | 0.2 | 9.2 | 0.0 |
PercentSalaryHike | 0.0 | 0.0 | -0.8 | 0.4 |
PerformanceRating | -0.2 | 0.4 | -0.4 | 0.7 |
RelationshipSatisfaction | -0.3 | 0.1 | -3.3 | 0.0 |
StockOptionLevel | -0.2 | 0.2 | -1.3 | 0.2 |
TotalWorkingYears | -0.1 | 0.0 | -2.4 | 0.0 |
TrainingTimesLastYear | -0.1 | 0.1 | -1.7 | 0.1 |
WorkLifeBalance | -0.2 | 0.1 | -1.6 | 0.1 |
YearsAtCompany | 0.1 | 0.0 | 2.1 | 0.0 |
YearsInCurrentRole | -0.1 | 0.0 | -3.0 | 0.0 |
YearsSinceLastPromotion | 0.2 | 0.0 | 3.8 | 0.0 |
YearsWithCurrManager | -0.1 | 0.1 | -2.2 | 0.0 |
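A minimal sketch of how such a fit could be produced with `glm`, assuming the 26 predictors listed in the table and the estimation sample from the earlier split:

```r
# The 26 independent variables shown in the coefficient table above
independent_vars <- c(
  "Age", "BusinessTravel", "EducationField", "EnvironmentSatisfaction",
  "Gender", "HourlyRate", "JobInvolvement", "JobLevel", "JobRole",
  "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "MonthlyRate",
  "NumCompaniesWorked", "OverTime", "PercentSalaryHike", "PerformanceRating",
  "RelationshipSatisfaction", "StockOptionLevel", "TotalWorkingYears",
  "TrainingTimesLastYear", "WorkLifeBalance", "YearsAtCompany",
  "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")

# Fit the logistic regression on the estimation sample
logreg_formula <- as.formula(paste("Attrition ~",
                                   paste(independent_vars, collapse = " + ")))
logreg <- glm(logreg_formula, data = estimation, family = binomial)
round(summary(logreg)$coefficients, 1)  # coefficient table as above
```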
Given a set of independent variables, the output of the estimated logistic regression (the sum of the products of the independent variables with the corresponding regression coefficients) can be used to assess the probability that an observation belongs to one of the classes. Specifically, the regression output can be transformed into a probability of belonging to, say, class 1 for each observation. The estimated probability that a validation observation belongs to class 1 (i.e., the estimated probability that the employee attrites), for the first few validation observations using the logistic regression above, is:
 | Actual Class | Predicted Class | Probability of Class 1 |
---|---|---|---|
Obs 1 | 0 | 0 | 0.01 |
Obs 2 | 0 | 0 | 0.01 |
Obs 3 | 0 | 0 | 0.02 |
Obs 4 | 0 | 0 | 0.09 |
Obs 5 | 0 | 0 | 0.00 |
Obs 6 | 1 | 0 | 0.24 |
Obs 7 | 0 | 0 | 0.01 |
Obs 8 | 0 | 0 | 0.02 |
Obs 9 | 1 | 0 | 0.22 |
Obs 10 | 0 | 0 | 0.09 |
The default decision is to classify each observation in the group with the highest probability.
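A sketch of how these validation probabilities and default-rule classes could be obtained (object names carried over from the earlier sketches):

```r
# Estimated probability of class 1 (attrition) for the validation sample
prob_logreg <- predict(logreg, newdata = validation, type = "response")

# Default decision rule: assign the class with the highest probability,
# i.e., classify as 1 when the probability exceeds 0.5
class_logreg <- as.integer(prob_logreg > 0.5)
head(data.frame(Actual      = validation$Attrition,
                Predicted   = class_logreg,
                Probability = round(prob_logreg, 2)), 10)
```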
CART
CART is a widely used classification method largely because the estimated classification models are easy to interpret. This classification tool iteratively “splits” the data using the most discriminatory independent variable at each step, building a “tree” - as shown below - on the way. The CART methods limit the size of the tree using various statistical techniques in order to avoid overfitting the data. For example, using the rpart and rpart.control functions in R, we can limit the size of the tree by selecting the functions’ complexity control parameter cp.
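A sketch of the two CART fits used below, reusing the same formula as the logistic regression; the control settings other than `cp` are left at their rpart defaults (an assumption on our part):

```r
library(rpart)

# First CART: complexity parameter cp = 0.002
cart1 <- rpart(logreg_formula, data = estimation, method = "class",
               control = rpart.control(cp = 0.002))

# Second, larger CART: cp = 2e-04 (a smaller cp allows more splits)
cart2 <- rpart(logreg_formula, data = estimation, method = "class",
               control = rpart.control(cp = 2e-04))

plot(cart1); text(cart1)  # basic tree plot of the first CART
```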
Running a basic CART model with complexity control cp=0.002 leads to the following tree (NOTE: for better readability of the tree figures below, we will rename the independent variables as IV1 to IV26 when using CART):
The leaves of the tree indicate the number of estimation data observations that “reach that leaf” that belong to each class. A perfect classification would only have data from one class in each of the tree leaves. However, such a perfect classification of the estimation data would most likely not be able to classify well out-of-sample data due to overfitting of the estimation data.
One can estimate larger trees by changing the tree's complexity control parameter (in this case the rpart.control argument cp). For example, this is what the tree looks like if we set cp=2e-04:
One can also use the percentage of data in each leaf of the tree to estimate the probability that an observation (e.g., an employee) belongs to a given class. The purity of the leaf indicates the probability that an observation that "reaches that leaf" belongs to a class. In our case, the probability that our validation data belong to class 1 (i.e., an employee's likelihood of attrition) for the first few validation observations, using the first CART above, is (see the code sketch after the table):
 | Actual Class | Predicted Class | Probability of Class 1 |
---|---|---|---|
Obs 1 | 0 | 0 | 0.07 |
Obs 2 | 0 | 0 | 0.07 |
Obs 3 | 0 | 0 | 0.07 |
Obs 4 | 0 | 0 | 0.07 |
Obs 5 | 0 | 0 | 0.07 |
Obs 6 | 1 | 0 | 0.07 |
Obs 7 | 0 | 0 | 0.07 |
Obs 8 | 0 | 0 | 0.12 |
Obs 9 | 1 | 0 | 0.07 |
Obs 10 | 0 | 0 | 0.07 |
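A sketch of how these leaf-based probabilities could be extracted from the fitted tree:

```r
# predict() on an rpart classification tree returns one probability column
# per class; column "1" is the share of attriters in the observation's leaf
prob_cart1 <- predict(cart1, newdata = validation, type = "prob")[, "1"]
head(round(prob_cart1, 2), 10)
```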
The table above assumes that the probability threshold for considering an observation as "class 1" is 0.45. In practice we need to select the probability threshold: this is an important choice that we discuss below.
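For example, the predicted classes in the table above follow from the 45% threshold like this:

```r
# Classify as "class 1" (attrition) when the leaf probability exceeds 45%
threshold   <- 0.45
class_cart1 <- as.integer(prob_cart1 > threshold)
```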
Using the predicted class probabilities of the validation data, as outlined above, we can generate some measures of classification performance.
The hit ratio is the percentage of observations that are correctly classified (i.e., the predicted class and the actual class are the same). For a probability threshold of 45%, the validation-data hit ratios are as follows (see the code sketch after the two tables):
 | Hit Ratio (%) |
---|---|
Logistic Regression | 82.99320 |
First CART | 82.31293 |
Second CART | 82.99320 |
For the estimation data, the hit rates are:
 | Hit Ratio (%) |
---|---|
Logistic Regression | 87.50000 |
First CART | 89.37075 |
Second CART | 89.54082 |
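A sketch of the hit ratio computation (shown here for the validation data at the 45% threshold):

```r
# Hit ratio: percentage of observations whose predicted class matches
# the actual class
hit_ratio <- function(actual, prob, threshold = 0.45) {
  100 * mean(actual == as.integer(prob > threshold))
}
hit_ratio(validation$Attrition, prob_logreg)  # logistic regression
hit_ratio(validation$Attrition, prob_cart1)   # first CART
```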
The confusion matrix shows, for each class, the number (or percentage) of observations that are correctly classified for that class. For example, for the method with the highest hit rate in the validation data (among the logistic regression and the two CART models), and for a probability threshold of 45%, the confusion matrix for the validation data, in row percentages, is (a code sketch follows the table):
 | Predicted 1 (Attrition) | Predicted 0 (No Attrition) |
---|---|---|
Actual 1 (Attrition) | 32.00 | 68.00 |
Actual 0 (No Attrition) | 6.56 | 93.44 |
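A sketch of the row-percentage confusion matrix for the logistic regression on the validation data:

```r
pred_class <- as.integer(prob_logreg > 0.45)
conf <- table(Actual = validation$Attrition, Predicted = pred_class)
round(100 * prop.table(conf, margin = 1), 2)  # each row sums to 100%
```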
The ROC curves for the validation data for the logistic regression and both CART models are as follows (a code sketch is given below):
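A sketch of how the ROC curves and the associated AUC and Gini coefficient could be computed, here with the pROC package (one option among several):

```r
library(pROC)

roc_logreg <- roc(validation$Attrition, prob_logreg, quiet = TRUE)
roc_cart1  <- roc(validation$Attrition, prob_cart1,  quiet = TRUE)

plot(roc_logreg)                      # ROC curve for the logistic regression
lines(roc_cart1, col = "red")         # overlay the first CART

auc(roc_logreg)                       # area under the curve
2 * as.numeric(auc(roc_logreg)) - 1   # Gini coefficient = 2 * AUC - 1
```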
The gains charts for the validation data for our three classifiers are the following:
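A minimal gains chart sketch: sort employees by predicted probability and plot the cumulative share of actual attriters captured as more employees are targeted:

```r
ord   <- order(prob_logreg, decreasing = TRUE)
gains <- cumsum(validation$Attrition[ord]) / sum(validation$Attrition)
plot(seq_along(gains) / length(gains), gains, type = "l",
     xlab = "Fraction of employees targeted",
     ylab = "Fraction of attriters captured")
abline(0, 1, lty = 2)  # baseline: random targeting
```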
Having iterated steps 2-5 until we are satisfied with the performance of our selected model on the validation data, we now repeat the performance analysis outlined in step 5 on the test sample.
Let's see how the hit ratio, confusion matrix, ROC curve, gains chart, and profit curve look for our test data. For the hit ratio and the confusion matrix we use 45% as the probability threshold for classification.
 | Hit Ratio (%) |
---|---|
Logistic Regression | 89.11565 |
First CART | 85.03401 |
Second CART | 82.31293 |
The confusion matrix for the model with the best test data hit ratio above:
 | Predicted 1 (Attrition) | Predicted 0 (No Attrition) |
---|---|---|
Actual 1 (Attrition) | 42.31 | 57.69 |
Actual 0 (No Attrition) | 0.83 | 99.17 |
ROC curves for the test data:
Gains chart for the test data:
After running the models multiple times and iterating to find the best parameter values, we came to some conclusions: