Goal
We analyze HR data covering a range of employee metrics. Using a sample data set from a large American corporation, we propose a model for one of the most pressing business problems today: employee turnover. Our primary goal is therefore to understand the drivers of turnover - why do people leave the company?
We load the data from Kaggle.com regarding HR analytics in a major US company:
https://www.kaggle.com/ludobenistant/hr-analytics
We picked the underlying data set for the following reasons: it has a considerable number of observations (14,998) and a manageable number of factors (19). This eases the process of dimensionality reduction while providing a sufficient basis for a meaningful segmentation. The factors include a complete set of characteristics: demographics, performance levels, and on/off "dummy" variables. We are therefore able to study the interactions between different kinds of factors.
The factors included in the dataset are:
* Satisfaction level (0-1)
* Last evaluation (0-1) - The score received during the last performance review
* Number of projects in which the employee was engaged
* Average monthly hours worked
* Time spent in the company in years
* Work accident (0 or 1) - Whether the person has had a work-related accident
* Promotion in the last 5 years (0 or 1)
* Department in which the person works
* Salary (low – medium – high)
* Whether the employee has left (0 or 1)
The Department factor contains several text values (e.g. sales, hr, accounting). We created an additional column for each value and populated it with 1 when that value appeared in the original column and with 0 otherwise, thereby transforming the text variable into a set of dummy variables. Given that most of the factors are numerical, we use basic descriptive statistics as a starting point and then scale (standardize) the data.
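A minimal sketch of this transformation (assuming the raw Kaggle file has been read into a data frame called HR_raw and that, as in the original CSV, the department is stored in a column called sales; these names are assumptions):
# Turn the text-valued department column into one 0/1 dummy column per department
HR_raw <- read.csv("HR_comma_sep.csv", stringsAsFactors = FALSE)  # file name as on Kaggle
for (d in unique(HR_raw$sales)) {
    HR_raw[, d] <- as.integer(HR_raw$sales == d)  # 1 if the employee is in department d, else 0
}
# Salary can be recoded from text to integers 0/1/2 in the same spirit
HR_raw$salary <- match(HR_raw$salary, c("low", "medium", "high")) - 1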
Below is a sample of the data for 15 employees. On the right are the department columns that we created to work with the data more easily:
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Salary | Sales | Accounting | Hr | product_mng | technical | support | IT | RandD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.41 | 0.50 | 2 | 153 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.10 | 0.77 | 6 | 247 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.92 | 0.85 | 5 | 259 | 5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.89 | 1.00 | 5 | 224 | 5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.45 | 0.54 | 2 | 135 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.11 | 0.81 | 6 | 305 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.84 | 0.92 | 4 | 234 | 5 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.41 | 0.55 | 2 | 148 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0.36 | 0.56 | 2 | 137 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We start by calculating the descriptive statistics of the (unscaled) factors:
min | 25 percent | median | mean | 75 percent | max | std | |
---|---|---|---|---|---|---|---|
satisfaction_level | 0.09 | 0.44 | 0.64 | 0.61 | 0.82 | 1 | 0.25 |
last_evaluation | 0.36 | 0.56 | 0.72 | 0.72 | 0.87 | 1 | 0.17 |
number_project | 2.00 | 3.00 | 4.00 | 3.80 | 5.00 | 7 | 1.23 |
average_montly_hours | 96.00 | 156.00 | 200.00 | 201.05 | 245.00 | 310 | 49.94 |
time_spend_company | 2.00 | 3.00 | 3.00 | 3.50 | 4.00 | 10 | 1.46 |
Work_accident | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 | 1 | 0.35 |
left | 0.00 | 0.00 | 0.00 | 0.24 | 0.00 | 1 | 0.43 |
promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 1 | 0.14 |
Salary | 0.00 | 0.00 | 1.00 | 0.59 | 1.00 | 2 | 0.64 |
Sales | 0.00 | 0.00 | 0.00 | 0.28 | 1.00 | 1 | 0.45 |
Accounting | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
Hr | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
product_mng | 0.00 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.24 |
technical | 0.00 | 0.00 | 0.00 | 0.18 | 0.00 | 1 | 0.39 |
support | 0.00 | 0.00 | 0.00 | 0.15 | 0.00 | 1 | 0.36 |
IT | 0.00 | 0.00 | 0.00 | 0.08 | 0.00 | 1 | 0.27 |
RandD | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
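For reference, a table like the one above can be produced in a few lines (a sketch; ProjectData here stands for the cleaned, fully numeric data frame used later in the report):
# Descriptive statistics per factor: min, quartiles, median, mean, max and standard deviation
descriptives <- t(sapply(ProjectData, function(x) c(
    min = min(x),
    q25 = unname(quantile(x, 0.25)),
    median = median(x),
    mean = mean(x),
    q75 = unname(quantile(x, 0.75)),
    max = max(x),
    std = sd(x))))
round(descriptives, 2)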
We now scale (standardize) the data, excluding the dummy variables, to make the factors more comparable and to ease the analysis.
The summary statistics of the scaled dataset (excluding the department dummy variables) are:
min | 25 percent | median | mean | 75 percent | max | std | |
---|---|---|---|---|---|---|---|
satisfaction_level | -2.10 | -0.70 | 0.11 | 0 | 0.83 | 1.56 | 1 |
last_evaluation | -2.08 | -0.91 | 0.02 | 0 | 0.90 | 1.66 | 1 |
number_project | -1.46 | -0.65 | 0.16 | 0 | 0.97 | 2.59 | 1 |
average_montly_hours | -2.10 | -0.90 | -0.02 | 0 | 0.88 | 2.18 | 1 |
time_spend_company | -1.03 | -0.34 | -0.34 | 0 | 0.34 | 4.45 | 1 |
Work_accident | -0.41 | -0.41 | -0.41 | 0 | -0.41 | 2.43 | 1 |
left | -0.56 | -0.56 | -0.56 | 0 | -0.56 | 1.79 | 1 |
promotion_last_5years | -0.15 | -0.15 | -0.15 | 0 | -0.15 | 6.78 | 1 |
Salary | -0.93 | -0.93 | 0.64 | 0 | 0.64 | 2.21 | 1 |
Let’s see how these are correlated. The correlation matrix is as follows:
satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Salary | Sales | Accounting | Hr | product_mng | technical | support | IT | RandD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
satisfaction_level | 1.00 | 0.11 | -0.14 | -0.02 | -0.10 | 0.06 | -0.39 | 0.03 | 0.05 | 0.00 | -0.03 | -0.01 | 0.01 | -0.01 | 0.01 | 0.01 | 0.01 |
last_evaluation | 0.11 | 1.00 | 0.35 | 0.34 | 0.13 | -0.01 | 0.01 | -0.01 | -0.01 | -0.02 | 0.00 | -0.01 | 0.00 | 0.01 | 0.02 | 0.00 | -0.01 |
number_project | -0.14 | 0.35 | 1.00 | 0.42 | 0.20 | 0.00 | 0.02 | -0.01 | 0.00 | -0.01 | 0.00 | -0.03 | 0.00 | 0.03 | 0.00 | 0.00 | 0.01 |
average_montly_hours | -0.02 | 0.34 | 0.42 | 1.00 | 0.13 | -0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | -0.01 | 0.01 | 0.00 | 0.01 | 0.00 |
time_spend_company | -0.10 | 0.13 | 0.20 | 0.13 | 1.00 | 0.00 | 0.14 | 0.07 | 0.05 | 0.02 | 0.00 | -0.02 | 0.00 | -0.03 | -0.03 | -0.01 | -0.02 |
Work_accident | 0.06 | -0.01 | 0.00 | -0.01 | 0.00 | 1.00 | -0.15 | 0.04 | 0.01 | 0.00 | -0.01 | -0.02 | 0.00 | -0.01 | 0.01 | -0.01 | 0.02 |
left | -0.39 | 0.01 | 0.02 | 0.07 | 0.14 | -0.15 | 1.00 | -0.06 | -0.16 | 0.01 | 0.02 | 0.03 | -0.01 | 0.02 | 0.01 | -0.01 | -0.05 |
promotion_last_5years | 0.03 | -0.01 | -0.01 | 0.00 | 0.07 | 0.04 | -0.06 | 1.00 | 0.10 | 0.01 | 0.00 | 0.00 | -0.04 | -0.04 | -0.04 | -0.04 | 0.02 |
Salary | 0.05 | -0.01 | 0.00 | 0.00 | 0.05 | 0.01 | -0.16 | 0.10 | 1.00 | -0.04 | 0.01 | 0.00 | -0.01 | -0.02 | -0.03 | -0.01 | 0.00 |
Sales | 0.00 | -0.02 | -0.01 | 0.00 | 0.02 | 0.00 | 0.01 | 0.01 | -0.04 | 1.00 | -0.14 | -0.14 | -0.16 | -0.29 | -0.26 | -0.18 | -0.15 |
Accounting | -0.03 | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | 0.02 | 0.00 | 0.01 | -0.14 | 1.00 | -0.05 | -0.06 | -0.11 | -0.10 | -0.07 | -0.05 |
Hr | -0.01 | -0.01 | -0.03 | -0.01 | -0.02 | -0.02 | 0.03 | 0.00 | 0.00 | -0.14 | -0.05 | 1.00 | -0.06 | -0.11 | -0.10 | -0.07 | -0.05 |
product_mng | 0.01 | 0.00 | 0.00 | -0.01 | 0.00 | 0.00 | -0.01 | -0.04 | -0.01 | -0.16 | -0.06 | -0.06 | 1.00 | -0.12 | -0.11 | -0.08 | -0.06 |
technical | -0.01 | 0.01 | 0.03 | 0.01 | -0.03 | -0.01 | 0.02 | -0.04 | -0.02 | -0.29 | -0.11 | -0.11 | -0.12 | 1.00 | -0.20 | -0.14 | -0.11 |
support | 0.01 | 0.02 | 0.00 | 0.00 | -0.03 | 0.01 | 0.01 | -0.04 | -0.03 | -0.26 | -0.10 | -0.10 | -0.11 | -0.20 | 1.00 | -0.12 | -0.10 |
IT | 0.01 | 0.00 | 0.00 | 0.01 | -0.01 | -0.01 | -0.01 | -0.04 | -0.01 | -0.18 | -0.07 | -0.07 | -0.08 | -0.14 | -0.12 | 1.00 | -0.07 |
RandD | 0.01 | -0.01 | 0.01 | 0.00 | -0.02 | 0.02 | -0.05 | 0.02 | 0.00 | -0.15 | -0.05 | -0.05 | -0.06 | -0.11 | -0.10 | -0.07 | 1.00 |
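The matrix above is simply the pairwise Pearson correlation of all columns, e.g. (same assumption about the ProjectData data frame as before):
# Pairwise Pearson correlations of all factors, rounded to two decimals
round(cor(data.matrix(ProjectData)), 2)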
Looking at the correlation matrix, we already develop some expectations for the employee segmentation analysis:
* Satisfaction level, last evaluation, number of projects, average monthly hours and time spent in the company each show a relatively high positive or negative correlation with at least one other factor. We therefore expect these factors to play a major role in the segmentation of employees.
* Work accidents, salary and promotion do not show any strong correlation with other factors. We cannot yet tell whether they matter to employees, but they will probably not be driving factors for splitting employees into segments.
* The departments likewise show very low correlations with the remaining factors. We therefore do not expect them to be useful for profiling, but this will be checked in a separate step.
* Left (employee turnover) is strongly negatively correlated with satisfaction level (-0.39). At the current stage, we therefore expect satisfaction level to be one of the drivers of employee turnover.
For dimensionality reduction we followed the process highlighted in class.
Based on the cumulative variance explained, we decided to group the factors into three components; three components explain more than 50% of the variance (see the table and chart below).
# Columns used
factor_attributes_used = c(1:17)
# Factor Selection Criteria, Choices: 'eigenvalue', 'variance', 'manual'
factor_selectionciterion = "manual"
# Please ENTER the desired minimum variance explained
minimum_variance_explained = 40 # between 1 and 100
# Please ENTER the number of factors to use
manual_numb_factors_used = 3
# Please ENTER the rotation eventually used (e.g. 'none', 'varimax',
# 'quartimax', 'promax', 'oblimin', 'simplimax', and 'cluster' - see
# help(principal)). Default is 'varimax'
rotation_used = "varimax"
For dimensionality reduction and employee segmentation we use the following eight factors: satisfaction level, last evaluation, number of projects, average monthly hours, time spent in the company, work accidents, salary, and promotion.
ProjectData_Clustering <- ProjectData2_scaled[, c(1:8)]
factor_attributes_used <- intersect(factor_attributes_used, 1:ncol(ProjectData_Clustering))
ProjectDataFactor <- ProjectData_Clustering[, factor_attributes_used]
ProjectDataFactor <- ProjectData_Clustering <- data.matrix(ProjectDataFactor)
# `PCA` function (from the FactoMineR package; loaded here in case the course setup has not already done so)
library(FactoMineR)
Variance_Explained_Table_results <- PCA(ProjectDataFactor, graph = FALSE)
Variance_Explained_Table <- Variance_Explained_Table_results$eig
Variance_Explained_Table_copy <- Variance_Explained_Table
rownames(Variance_Explained_Table) <- paste("Component", 1:nrow(Variance_Explained_Table),
sep = " ")
colnames(Variance_Explained_Table) <- c("Eigenvalue", "Pct of explained variance",
"Cumulative pct of explained variance")
iprint.df(round(Variance_Explained_Table, 2))
Eigenvalue | Pct of explained variance | Cumulative pct of explained variance | |
---|---|---|---|
Component 1 | 1.83 | 22.88 | 22.88 |
Component 2 | 1.16 | 14.46 | 37.33 |
Component 3 | 1.11 | 13.91 | 51.24 |
Component 4 | 0.98 | 12.30 | 63.54 |
Component 5 | 0.89 | 11.18 | 74.72 |
Component 6 | 0.84 | 10.53 | 85.25 |
Component 7 | 0.63 | 7.90 | 93.15 |
Component 8 | 0.55 | 6.85 | 100.00 |
eigenvalues <- Variance_Explained_Table[, "Eigenvalue"]
df <- cbind(as.data.frame(eigenvalues), c(1:length(eigenvalues)), rep(1, length(eigenvalues)))
colnames(df) <- c("eigenvalues", "components", "abline")
iplot.df(melt(df, id = "components"))
if (factor_selectionciterion == "eigenvalue") factors_selected = sum(Variance_Explained_Table_copy[,
1] >= 1)
if (factor_selectionciterion == "variance") factors_selected = 1:head(which(Variance_Explained_Table_copy[,
"cumulative percentage of variance"] >= minimum_variance_explained), 1)
if (factor_selectionciterion == "manual") factors_selected = manual_numb_factors_used
library(psych)  # principal() comes from the psych package
Rotated_Results <- principal(ProjectDataFactor, nfactors = max(factors_selected),
    rotate = rotation_used, score = TRUE)
Rotated_Factors <- round(Rotated_Results$loadings, 2)
Rotated_Factors <- as.data.frame(unclass(Rotated_Factors))
colnames(Rotated_Factors) <- paste("Comp.", 1:ncol(Rotated_Factors), sep = "")
sorted_rows <- sort(Rotated_Factors[, 1], decreasing = TRUE, index.return = TRUE)$ix
Rotated_Factors <- Rotated_Factors[sorted_rows, ]
iprint.df(Rotated_Factors, scale = TRUE)
Comp.1 | Comp.2 | Comp.3 | |
---|---|---|---|
average_montly_hours | 0.76 | -0.01 | -0.02 |
number_project | 0.75 | -0.24 | 0.03 |
last_evaluation | 0.74 | 0.24 | -0.04 |
time_spend_company | 0.33 | -0.40 | 0.41 |
satisfaction_level | 0.03 | 0.86 | 0.01 |
Work_accident | 0.01 | 0.34 | 0.22 |
Salary | -0.03 | 0.12 | 0.65 |
promotion_last_5years | -0.04 | 0.05 | 0.71 |
From the above exercise, we have established three key components, as well as the attributes that best associate with these components. We can infer from the data that each of the components refers to the following:
Comp.1: Care about the work
Comp.2: Satisfaction with the company
Comp.3: Driven by external incentives
Using these components, we can now begin our segmentation.
# Select Attributes to use
segmentation_attributes_used = c(1, 4, 8)  # satisfaction_level, average_montly_hours, promotion_last_5years
# original raw attributes
profile_attributes_used = c(1:17)
# Please ENTER the number of clusters to eventually use for this report
numb_clusters_used = 3  # chosen after trials with 5 and 4 clusters (see below)
# Please enter the method to use for the segmentation:
profile_with = "kmeans"
# Please ENTER the distance metric eventually used for the clustering in
# case of hierarchical clustering (e.g. 'euclidean', 'maximum', 'manhattan',
# 'canberra', 'binary' or 'minkowski' - see help(dist)). DEFAULT is 'euclidean'
# distance_used = 'euclidean'
# Please ENTER the kmeans clustering method to use (options are:
# 'Hartigan-Wong', 'Lloyd', 'Forgy', 'MacQueen'). DEFAULT is 'Lloyd'
kmeans_method = "Lloyd"
First, we selected our segmentation attributes by taking the attribute that resonates most with each of the three components: average_montly_hours (table column 4) for component 1, satisfaction_level (table column 1) for component 2, and promotion_last_5years (table column 8) for component 3.
Next, for profile_attributes_used, we included all the columns (the eight factors stated above) in the data set, per standard procedure.
For the number of segments, we ran several trials to find segments that are clearly distinct from each other. We started with 5 and noticed that, while four segments were highly differentiated, one segment had no attributes that it strongly represented. We then tried 4 segments and still observed some errant clusters. With 3 segments we finally obtained unique, highly differentiated segments, so this is the number we used for our segmentation.
Finally, we used the k-means (Lloyd) segmentation method.
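As a quantitative companion to these trials, one can also compare the total within-cluster sum of squares for several candidate numbers of clusters (a sketch; this check was not part of the original report code):
# Compare candidate numbers of clusters via total within-cluster sum of squares
set.seed(1)  # k-means starting points are random; fix the seed for reproducibility
within_ss <- sapply(2:6, function(k) kmeans(ProjectData_Clustering, centers = k,
    iter.max = 2000, algorithm = kmeans_method)$tot.withinss)
names(within_ss) <- paste("k =", 2:6)
round(within_ss, 1)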
segmentation_attributes_used <- intersect(segmentation_attributes_used, 1:ncol(ProjectData_Clustering))
profile_attributes_used <- intersect(profile_attributes_used, 1:ncol(ProjectData_Clustering))
ProjectData_segment <- ProjectData_Clustering[, segmentation_attributes_used]
ProjectData_profile <- ProjectData_Clustering[, profile_attributes_used]
ProjectData_scaled <- apply(ProjectData_Clustering, 2, function(r) if (sd(r) !=
0) (r - mean(r))/sd(r) else 0 * r)
This intermediary step assigns the variables and tables used in the exercises below.
kmeans_clusters <- kmeans(ProjectData_Clustering, centers = numb_clusters_used,
iter.max = 2000, algorithm = kmeans_method)
ProjectData_with_kmeans_membership <- cbind(1:length(kmeans_clusters$cluster),
kmeans_clusters$cluster)
colnames(ProjectData_with_kmeans_membership) <- c("Empl.Nr", "Cluster_Membership")
iprint.df(round(head(ProjectData_with_kmeans_membership, max_data_report), 2))
Empl.Nr | Cluster_Membership |
---|---|
1 | 3 |
2 | 2 |
3 | 2 |
4 | 2 |
5 | 3 |
6 | 3 |
7 | 2 |
8 | 2 |
9 | 2 |
10 | 3 |
11 | 3 |
12 | 2 |
13 | 2 |
14 | 3 |
15 | 3 |
Please note that in the following table the first five factors are standardized, the sixth and seventh factors are binary (0 or 1), and the eighth factor (salary) is coded with the integers 0, 1 and 2.
cluster_memberships_kmeans <- kmeans_clusters$cluster
cluster_ids_kmeans <- unique(cluster_memberships_kmeans)
if (FALSE) {
if (profile_with == "hclust") {
cluster_memberships <- cluster_memberships_hclust
cluster_ids <- cluster_ids_hclust
}
if (profile_with == "kmeans") {
cluster_memberships <- cluster_memberships_kmeans
cluster_ids <- cluster_ids_kmeans
}
}
cluster_memberships <- cluster_memberships_kmeans
cluster_ids <- cluster_ids_kmeans
NewData = matrix(cluster_memberships, ncol = 1)
population_average = matrix(apply(ProjectData_profile, 2, mean), ncol = 1)
colnames(population_average) <- "Population"
Cluster_Profile_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_profile[(cluster_memberships ==
i), ], 2, mean))
if (ncol(ProjectData_profile) < 2) Cluster_Profile_mean = t(Cluster_Profile_mean)
colnames(Cluster_Profile_mean) <- paste("Seg.", 1:length(cluster_ids), sep = "")
colnames(Cluster_Profile_mean)[3] <- "Unchallenged"
colnames(Cluster_Profile_mean)[1] <- "FreshAndHappy"
colnames(Cluster_Profile_mean)[2] <- "Overworked"
cluster.profile <- cbind(population_average, Cluster_Profile_mean)
iprint.df(round(cluster.profile, 2))
Population | FreshAndHappy | Overworked | Unchallenged | |
---|---|---|---|---|
satisfaction_level | 0.00 | 0.57 | -0.65 | -0.52 |
last_evaluation | 0.00 | 0.23 | 0.64 | -1.00 |
number_project | 0.00 | -0.01 | 0.97 | -0.84 |
average_montly_hours | 0.00 | 0.14 | 0.76 | -0.95 |
time_spend_company | 0.00 | -0.44 | 1.17 | -0.20 |
Work_accident | 0.14 | 0.17 | 0.11 | 0.13 |
promotion_last_5years | 0.02 | 0.02 | 0.02 | 0.02 |
Salary | 0.59 | 0.64 | 0.56 | 0.53 |
ProjectData_scaled_profile = ProjectData_scaled[, profile_attributes_used, drop = F]
Cluster_Profile_standar_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_scaled_profile[(cluster_memberships ==
i), , drop = F], 2, mean))
if (ncol(ProjectData_scaled_profile) < 2) Cluster_Profile_standar_mean = t(Cluster_Profile_standar_mean)
colnames(Cluster_Profile_standar_mean) <- paste("Seg ", 1:length(cluster_ids),
sep = "")
colnames(Cluster_Profile_standar_mean)[3] <- "Unchallenged"
colnames(Cluster_Profile_standar_mean)[1] <- "FreshAndHappy"
colnames(Cluster_Profile_standar_mean)[2] <- "Overworked"
iplot.df(melt(cbind.data.frame(idx = as.numeric(1:nrow(Cluster_Profile_standar_mean)),
Cluster_Profile_standar_mean), id = "idx"), xlab = "Profiling variables (standardized)",
ylab = "Mean of cluster")
We observed before that the correlations between the other variables and the department in which a person works were close to 0. However, we wanted to cross-check whether the segments can be profiled by department.
The first table shows the number of people of each segment employed in each department. The second table shows that, despite differing in size, the departments are represented almost equally across the segments in percentage terms: every department has roughly the same share of people in each segment. These shares are almost suspiciously close, which may be a red flag about the quality (potential lack of randomness) of the data set.
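The tables of this step are built column by column in the code below; for reference, the same counts and shares can also be obtained more compactly (a sketch, assuming the department dummy columns of ProjectData and the kmeans_clusters object computed above; the segment labels follow the numbering used in the profile tables):
# Counts of employees per department within each segment
dept_cols <- c("Sales", "Accounting", "Hr", "product_mng", "technical", "support", "IT", "RandD")
dept_counts <- t(sapply(sort(unique(kmeans_clusters$cluster)), function(cl)
    colSums(ProjectData[kmeans_clusters$cluster == cl, dept_cols])))
rownames(dept_counts) <- c("FreshAndHappy", "Overworked", "Unchallenged")  # segments 1, 2, 3
# Shares of each department within each segment, in percent of the segment size
segment_sizes <- as.vector(table(kmeans_clusters$cluster))
round(100 * dept_counts / segment_sizes, 2)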
ProjectData_Departments <- ProjectData[, c(7, 10:17)]
ProjectData_Departments <- cbind(Empl.Nr = c(1:nrow(ProjectData_Departments)),
ProjectData_Departments)
ProjectData_Departments <- merge(x = ProjectData_Departments, y = ProjectData_with_kmeans_membership,
by = "Empl.Nr", all.x = TRUE)
resultstable = NULL
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters))
newtable = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtable)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$Sales == 1))
newtablesales = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablesales)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$Accounting == 1))
newtableaccounting = cbind(unique(ProjectData_Departments$Cluster_Membership),
resultstable)
colnames(newtableaccounting)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$Hr == 1))
newtablehr = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablehr)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$product_mng == 1))
newtablepm = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablepm)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$technical == 1))
newtabletech = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtabletech)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$support == 1))
newtablesupport = cbind(unique(ProjectData_Departments$Cluster_Membership),
resultstable)
colnames(newtablesupport)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$IT == 1))
newtableit = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtableit)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$RandD == 1))
newtablerand = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablerand)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$Sales == 0 & ProjectData_Departments$Accounting ==
0 & ProjectData_Departments$Hr == 0 & ProjectData_Departments$product_mng ==
0 & ProjectData_Departments$technical == 0 & ProjectData_Departments$support ==
0 & ProjectData_Departments$IT == 0 & ProjectData_Departments$RandD == 0))
newtablemktandmgmt = cbind(unique(ProjectData_Departments$Cluster_Membership),
resultstable)
colnames(newtablemktandmgmt)[1] <- "Cluster"
ProjectData_SegmentStats <- merge(x = newtable, y = newtablesales, by = "Cluster",
all.x = TRUE)
colnames(ProjectData_SegmentStats)[2] <- "Total"
colnames(ProjectData_SegmentStats)[3] <- "Sales"
ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtableaccounting,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[4] <- "Accounting"
ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablehr,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[5] <- "Hr"
ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablepm,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[6] <- "product_mng"
ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtabletech,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[7] <- "technical"
ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablesupport,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[8] <- "support"
ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtableit,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[9] <- "IT"
ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablerand,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[10] <- "RandD"
ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablemktandmgmt,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[11] <- "Marketing/Management"
colnames(ProjectData_SegmentStats)[1] <- "Segment"
rownames(ProjectData_SegmentStats)[1] <- "FreshAndHappy"
rownames(ProjectData_SegmentStats)[2] <- "Overworked"
rownames(ProjectData_SegmentStats)[3] <- "Unchallenged"
iprint.df(round(ProjectData_SegmentStats, 1))
Segment | Total | Sales | Accounting | Hr | product_mng | technical | support | IT | RandD | Marketing/Management | |
---|---|---|---|---|---|---|---|---|---|---|---|
FreshAndHappy | 1 | 7532 | 2018 | 356 | 377 | 445 | 1386 | 1160 | 637 | 420 | 733 |
Overworked | 2 | 3514 | 988 | 195 | 150 | 208 | 628 | 485 | 271 | 176 | 413 |
Unchallenged | 3 | 3953 | 1133 | 216 | 212 | 249 | 706 | 584 | 319 | 191 | 343 |
salespercent = c(1, 2, 3)
accountingpercent = c(1, 2, 3)
hrpercent = c(1, 2, 3)
pmpercent = c(1, 2, 3)
techpercent = c(1, 2, 3)
supportpercent = c(1, 2, 3)
itpercent = c(1, 2, 3)
randdpercent = c(1, 2, 3)
mktgmgtpercent = c(1, 2, 3)
newintermediatetable = cbind(ProjectData_SegmentStats, salespercent, accountingpercent,
hrpercent, pmpercent, techpercent, supportpercent, itpercent, randdpercent,
mktgmgtpercent)
newintermediatetable[, "salespercent"] = round(newintermediatetable[, "Sales"]/newintermediatetable[,
"Total"], 4) * 100
newintermediatetable[, "accountingpercent"] = round(newintermediatetable[, "Accounting"]/newintermediatetable[,
"Total"], 4) * 100
newintermediatetable[, "hrpercent"] = round(newintermediatetable[, "Hr"]/newintermediatetable[,
"Total"], 4) * 100
newintermediatetable[, "pmpercent"] = round(newintermediatetable[, "product_mng"]/newintermediatetable[,
"Total"], 4) * 100
newintermediatetable[, "techpercent"] = round(newintermediatetable[, "technical"]/newintermediatetable[,
"Total"], 4) * 100
newintermediatetable[, "supportpercent"] = round(newintermediatetable[, "support"]/newintermediatetable[,
"Total"], 4) * 100
newintermediatetable[, "itpercent"] = round(newintermediatetable[, "IT"]/newintermediatetable[,
"Total"], 4) * 100
newintermediatetable[, "randdpercent"] = round(newintermediatetable[, "RandD"]/newintermediatetable[,
"Total"], 4) * 100
newintermediatetable[, "mktgmgtpercent"] = round(newintermediatetable[, "Marketing/Management"]/newintermediatetable[,
"Total"], 4) * 100
ProjectData_SegmentPcts = newintermediatetable[, c(1, 2, 12:20)]
iprint.df(round(ProjectData_SegmentPcts, 2))
Segment | Total | salespercent | accountingpercent | hrpercent | pmpercent | techpercent | supportpercent | itpercent | randdpercent | mktgmgtpercent | |
---|---|---|---|---|---|---|---|---|---|---|---|
FreshAndHappy | 1 | 7532 | 26.79 | 4.73 | 5.01 | 5.91 | 18.40 | 15.40 | 8.46 | 5.58 | 9.73 |
Overworked | 2 | 3514 | 28.12 | 5.55 | 4.27 | 5.92 | 17.87 | 13.80 | 7.71 | 5.01 | 11.75 |
Unchallenged | 3 | 3953 | 28.66 | 5.46 | 5.36 | 6.30 | 17.86 | 14.77 | 8.07 | 4.83 | 8.68 |
resultstable = NULL
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters))
newtable = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtable)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$Sales == 1 & ProjectData_Departments$left ==
1))
newtablesales = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablesales)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$Accounting == 1 & ProjectData_Departments$left ==
1))
newtableaccounting = cbind(unique(ProjectData_Departments$Cluster_Membership),
resultstable)
colnames(newtableaccounting)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$Hr == 1 & ProjectData_Departments$left ==
1))
newtablehr = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablehr)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$product_mng == 1 & ProjectData_Departments$left ==
1))
newtablepm = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablepm)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$technical == 1 & ProjectData_Departments$left ==
1))
newtabletech = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtabletech)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$support == 1 & ProjectData_Departments$left ==
1))
newtablesupport = cbind(unique(ProjectData_Departments$Cluster_Membership),
resultstable)
colnames(newtablesupport)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$IT == 1 & ProjectData_Departments$left ==
1))
newtableit = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtableit)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$RandD == 1 & ProjectData_Departments$left ==
1))
newtablerand = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablerand)[1] <- "Cluster"
resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership ==
NrClusters & ProjectData_Departments$Sales == 0 & ProjectData_Departments$Accounting ==
0 & ProjectData_Departments$Hr == 0 & ProjectData_Departments$product_mng ==
0 & ProjectData_Departments$technical == 0 & ProjectData_Departments$support ==
0 & ProjectData_Departments$IT == 0 & ProjectData_Departments$RandD == 0 &
ProjectData_Departments$left == 1))
newtablemktandmgmt = cbind(unique(ProjectData_Departments$Cluster_Membership),
resultstable)
colnames(newtablemktandmgmt)[1] <- "Cluster"
ProjectData_SegmentChurnStats <- merge(x = newtable, y = newtablesales, by = "Cluster",
all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[2] <- "Total"
colnames(ProjectData_SegmentChurnStats)[3] <- "Sales"
ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtableaccounting,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[4] <- "Accounting"
ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablehr,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[5] <- "Hr"
ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablepm,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[6] <- "product_mng"
ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtabletech,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[7] <- "technical"
ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablesupport,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[8] <- "support"
ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtableit,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[9] <- "IT"
ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablerand,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[10] <- "RandD"
ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablemktandmgmt,
by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[11] <- "Marketing/Management"
colnames(ProjectData_SegmentChurnStats)[1] <- "Segment"
rownames(ProjectData_SegmentChurnStats)[1] <- "FreshAndHappy"
rownames(ProjectData_SegmentChurnStats)[2] <- "Overworked"
rownames(ProjectData_SegmentChurnStats)[3] <- "Unchallenged"
iprint.df(round(ProjectData_SegmentChurnStats, 1))
Segment | Total | Sales | Accounting | Hr | product_mng | technical | support | IT | RandD | Marketing/Management | |
---|---|---|---|---|---|---|---|---|---|---|---|
FreshAndHappy | 1 | 7532 | 59 | 6 | 18 | 4 | 44 | 25 | 12 | 1 | 7 |
Overworked | 2 | 3514 | 479 | 103 | 88 | 108 | 381 | 290 | 148 | 71 | 145 |
Unchallenged | 3 | 3953 | 475 | 95 | 109 | 86 | 272 | 240 | 113 | 49 | 143 |
newintermediatetable = cbind(ProjectData_SegmentStats, ProjectData_SegmentChurnStats[,
3:11])
newintermediatetable[, "salespercent"] = round(newintermediatetable[, 12]/newintermediatetable[,
3], 4)
newintermediatetable[, "accountingpercent"] = round(newintermediatetable[, 13]/newintermediatetable[,
4], 4)
newintermediatetable[, "hrpercent"] = round(newintermediatetable[, 14]/newintermediatetable[,
5], 4)
newintermediatetable[, "pmpercent"] = round(newintermediatetable[, 15]/newintermediatetable[,
6], 4)
newintermediatetable[, "techpercent"] = round(newintermediatetable[, 16]/newintermediatetable[,
7], 4)
newintermediatetable[, "supportpercent"] = round(newintermediatetable[, 17]/newintermediatetable[,
8], 4)
newintermediatetable[, "itpercent"] = round(newintermediatetable[, 18]/newintermediatetable[,
9], 4)
newintermediatetable[, "randdpercent"] = round(newintermediatetable[, 19]/newintermediatetable[,
10], 4)
newintermediatetable[, "mktgmgtpercent"] = round(newintermediatetable[, 20]/newintermediatetable[,
11], 4)
ProjectData_SegmentChurnPcts = newintermediatetable[, c(21:29)] * 100
iprint.df(round(ProjectData_SegmentChurnPcts, 2))
salespercent | accountingpercent | hrpercent | pmpercent | techpercent | supportpercent | itpercent | randdpercent | mktgmgtpercent | |
---|---|---|---|---|---|---|---|---|---|
FreshAndHappy | 2.92 | 1.69 | 4.77 | 0.90 | 3.17 | 2.16 | 1.88 | 0.24 | 0.95 |
Overworked | 48.48 | 52.82 | 58.67 | 51.92 | 60.67 | 59.79 | 54.61 | 40.34 | 35.11 |
Unchallenged | 41.92 | 43.98 | 51.42 | 34.54 | 38.53 | 41.10 | 35.42 | 25.65 | 41.69 |
ProjectData_SegmentSummary = ProjectData_SegmentStats[, c(1:2)]
ProjectData_SegmentSummary[, "%ofCompany"] = ProjectData_SegmentSummary[, "Total"]/sum(ProjectData_SegmentSummary[,
"Total"]) * 100
ProjectData_SegmentSummary[1, "Churned"] = sum(ProjectData_SegmentChurnStats[1,
c(3:11)])
ProjectData_SegmentSummary[2, "Churned"] = sum(ProjectData_SegmentChurnStats[2,
c(3:11)])
ProjectData_SegmentSummary[3, "Churned"] = sum(ProjectData_SegmentChurnStats[3,
c(3:11)])
ProjectData_SegmentSummary[, "Churn%ofTotal"] = ProjectData_SegmentSummary[,
"Churned"]/ProjectData_SegmentSummary[, "Total"] * 100
iprint.df(round(ProjectData_SegmentSummary, 2))
Segment | Total | %ofCompany | Churned | Churn%ofTotal | |
---|---|---|---|---|---|
FreshAndHappy | 1 | 7532 | 50.22 | 176 | 2.34 |
Overworked | 2 | 3514 | 23.43 | 1813 | 51.59 |
Unchallenged | 3 | 3953 | 26.36 | 1582 | 40.02 |
We used the classification code from the course website, adjusted for our data set and our problem.
CART 1, CART 2 and the logistic regression all return a variable importance of -1.00 for the first independent variable in the set (satisfaction level in our base case); we tried several different independent variables with the same result.
The confusion matrix (validation data) shows a huge Type 1 error (99.58%: people who stayed although we predicted them to leave) but a relatively small Type 2 error (8.43%). Increasing the probability threshold does not decrease the Type 1 error, which is counterintuitive.
The test-accuracy confusion matrix gives essentially the same result.
We assumed that one reason could be that some of the logistic regression coefficients are not significant; however, excluding them leaves the results largely unchanged. The logistic regression still produces a lot of mistakes.
Unfortunately, we were not able to resolve this issue in the given time. We nevertheless left the analysis in this report and are keen to get your feedback on it.
ProjectDataLeft = ProjectData[, 7]
ProjectDataClass <- cbind(ProjectData2_scaled, ProjectDataLeft)
dependent_variable = 17
independent_variables = c(1:16)
Probability_Threshold = 90 # between 1 and 99%
estimation_data_percent = 80
validation_data_percent = 10
random_sampling = 0
CART_cp = 0.01
min_segment = 100
We will use two classification trees and logistic regression.
CART_control = 0.001
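The estimation itself relies on helper code from the course repository; the models can also be fit directly with rpart and glm, roughly as follows (a sketch: the split and parameters follow the values set above, but the object names CART_tree, CART_tree2 and logreg_solution are our own assumptions):
library(rpart)  # classification trees; glm() for the logistic regression is in base R
# Sequential 80/10/10 split into estimation, validation and test data (random_sampling = 0 above)
n <- nrow(ProjectDataClass)
estimation_ids <- 1:round(estimation_data_percent/100 * n)
validation_ids <- (max(estimation_ids) + 1):round((estimation_data_percent + validation_data_percent)/100 * n)
test_ids <- (max(validation_ids) + 1):n
estimation_data <- data.frame(ProjectDataClass)[estimation_ids, ]
validation_data <- data.frame(ProjectDataClass)[validation_ids, ]
test_data <- data.frame(ProjectDataClass)[test_ids, ]
# Model formula: the dependent variable (column 17, 'left') against the 16 independent variables
dependent_name <- colnames(estimation_data)[dependent_variable]
formula_used <- as.formula(paste(dependent_name, "~",
    paste(colnames(estimation_data)[independent_variables], collapse = " + ")))
# "Small" tree (cp = CART_cp), larger tree (cp = CART_control) and logistic regression
CART_tree <- rpart(formula_used, data = estimation_data, method = "class",
    control = rpart.control(cp = CART_cp, minbucket = min_segment))
CART_tree2 <- rpart(formula_used, data = estimation_data, method = "class",
    control = rpart.control(cp = CART_control, minbucket = min_segment))
logreg_solution <- glm(formula_used, family = binomial(link = "logit"), data = estimation_data)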
Based on our "small tree" we can make several observations regarding the churn of our employees. If a person has a very low satisfaction level and a high number of projects, he or she will most likely leave. If a person has a high level of satisfaction and has spent little time at the company, he or she is unlikely to leave.
Let's look at a larger tree by changing the tree's complexity control parameter. For example, this is how the tree looks if we set cp = 0.001:
The purity of a leaf indicates the probability that an observation which "reaches that leaf" belongs to a class. In our case, the probability of the employee leaving for the first few validation data observations, using the first CART above, is:
Actual Class | Probability of Class 1 | |
---|---|---|
Obs 1 | 0 | 0.01 |
Obs 2 | 1 | 0.95 |
Obs 3 | 1 | 0.94 |
Obs 4 | 1 | 1.00 |
Obs 5 | 1 | 0.94 |
Obs 6 | 1 | 0.95 |
Obs 7 | 1 | 0.95 |
Obs 8 | 1 | 1.00 |
Obs 9 | 1 | 0.94 |
Obs 10 | 1 | 0.94 |
Obs 11 | 1 | 0.95 |
Obs 12 | 1 | 0.95 |
Obs 13 | 1 | 1.00 |
Obs 14 | 1 | 0.94 |
Obs 15 | 1 | 0.95 |
We also perform a classification analysis using logistic regression. These are the estimated logistic regression parameters for our data:
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -1.5 | 0.1 | -14.7 | 0.0 |
satisfaction_level | -1.0 | 0.0 | -33.2 | 0.0 |
last_evaluation | 0.1 | 0.0 | 3.3 | 0.0 |
number_project | -0.4 | 0.0 | -11.1 | 0.0 |
average_montly_hours | 0.2 | 0.0 | 6.6 | 0.0 |
time_spend_company | 0.4 | 0.0 | 15.1 | 0.0 |
Work_accident | -1.4 | 0.1 | -13.0 | 0.0 |
promotion_last_5years | -1.5 | 0.4 | -3.9 | 0.0 |
Salary | -0.6 | 0.0 | -13.2 | 0.0 |
Sales | 0.1 | 0.1 | 1.3 | 0.2 |
Accounting | 0.1 | 0.2 | 0.5 | 0.6 |
Hr | 0.2 | 0.2 | 1.5 | 0.1 |
product_mng | 0.0 | 0.1 | 0.3 | 0.8 |
technical | 0.2 | 0.1 | 1.4 | 0.2 |
support | 0.2 | 0.1 | 1.7 | 0.1 |
IT | 0.1 | 0.1 | 0.5 | 0.6 |
RandD | -0.3 | 0.2 | -1.7 | 0.1 |
The probability our validation data belong to class 1 (the employee leaves) for the first few validation data observations, using the logistic regression above, is:
Actual Class | Probability of Class 1 | |
---|---|---|
Obs 1 | 0 | 0.22 |
Obs 2 | 1 | 0.42 |
Obs 3 | 1 | 0.12 |
Obs 4 | 1 | 0.42 |
Obs 5 | 1 | 0.19 |
Obs 6 | 1 | 0.43 |
Obs 7 | 1 | 0.38 |
Obs 8 | 1 | 0.61 |
Obs 9 | 1 | 0.11 |
Obs 10 | 1 | 0.11 |
Obs 11 | 1 | 0.37 |
Obs 12 | 1 | 0.33 |
Obs 13 | 1 | 0.67 |
Obs 14 | 1 | 0.17 |
Obs 15 | 1 | 0.38 |
In our case, we can see the relative importance of the independent variables using the variable.importance of the CART trees (see help(rpart.object) in R) or the z-scores from the output of the logistic regression. For easier visualization, we scale all values between -1 and 1. From this table we can see the key drivers of the classification according to each of the methods used here: last evaluation, average monthly hours, and time spent at the company.
CART 1 | CART 2 | Logistic Regr. | |
---|---|---|---|
satisfaction_level | -1.00 | -1.00 | -1.00 |
last_evaluation | 0.39 | 0.39 | 0.10 |
number_project | -0.47 | -0.47 | -0.33 |
average_montly_hours | 0.44 | 0.48 | 0.20 |
time_spend_company | 0.32 | 0.34 | 0.45 |
Work_accident | -0.02 | -0.02 | -0.39 |
promotion_last_5years | 0.00 | 0.00 | -0.12 |
Salary | 0.00 | 0.00 | -0.40 |
Sales | 0.00 | 0.00 | 0.04 |
Accounting | 0.00 | 0.00 | 0.02 |
Hr | 0.00 | 0.00 | 0.05 |
product_mng | 0.00 | 0.00 | 0.01 |
technical | 0.00 | 0.00 | 0.04 |
support | 0.00 | 0.00 | 0.05 |
IT | 0.00 | 0.00 | 0.02 |
RandD | 0.00 | 0.00 | -0.05 |
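A sketch of how a table like the one above can be assembled from the fitted models (object names as in the fitting sketch earlier; note that rpart's variable.importance is unsigned, whereas the table above also carries the sign of each variable's relationship with 'left'):
# Rescale each importance measure so that the largest magnitude becomes 1
get_importance <- function(imp, all_vars) {
    out <- setNames(rep(0, length(all_vars)), all_vars)  # variables not used by the tree get 0
    out[names(imp)] <- imp
    out / max(abs(out))
}
all_vars <- colnames(estimation_data)[independent_variables]
imp_cart1 <- get_importance(CART_tree$variable.importance, all_vars)
imp_cart2 <- get_importance(CART_tree2$variable.importance, all_vars)
z_values <- summary(logreg_solution)$coefficients[-1, "z value"]  # drop the intercept
imp_logit <- z_values / max(abs(z_values))
round(cbind(`CART 1` = imp_cart1, `CART 2` = imp_cart2, `Logistic Regr.` = imp_logit[all_vars]), 2)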
We would suggest selecting a relatively high probability threshold (>90%), because it is important for us to focus only on those employees who are almost certain to leave. Imagine that we wanted to grant additional benefits only to the employees who are most likely to leave, disregarding those who are merely somewhat likely to leave.
We now measure other classification performance metrics for different choices of the probability threshold.
We calculate the percentage of observations that have been correctly classified (the hit ratio). For a probability threshold of 90%, the hit ratios for the validation data are:
Hit Ratio | |
---|---|
First CART | 94.80000 |
Second CART | 95.26667 |
Logistic Regression | 47.80000 |
while for the estimation data the hit rates are:
Hit Ratio | |
---|---|
First CART | 97.90816 |
Second CART | 98.39987 |
Logistic Regression | 83.33194 |
We see that for the estimation data the hit rates are much higher than for the validation data.
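For reference, the hit ratio is simply the share of correctly classified observations at the chosen probability threshold (a sketch, using the logistic regression and validation data objects assumed in the fitting sketch above):
# Hit ratio of the logistic regression on the validation data, in percent
probs_validation <- predict(logreg_solution, newdata = validation_data, type = "response")
predicted_class <- as.integer(probs_validation > Probability_Threshold/100)
actual_class <- validation_data[, dependent_variable]
round(100 * mean(predicted_class == actual_class), 2)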
To benchmark the performance of a classification model we use the Maximum Chance Criterion, which measures the proportion of the largest class. For our validation data, the group of people who leave contains 717 of the 1,500 observations; without doing any discriminant analysis, classifying all individuals into this group would already give a hit rate of 47.8%.
The confusion matrix shows, for each class, the number (or percentage) of observations that are correctly classified for that class. For the method with the highest hit rate in the validation data (among the logistic regression and the two CART models), the confusion matrix for the validation data is:
Predicted 1 | Predicted 0 | |
---|---|---|
Actual 1 | 91.32 | 8.68 |
Actual 0 | 99.58 | 0.42 |
The Type 1 error seems extremely high. Even when we reran the model with adjusted variables and features, the Type 1 error remained substantial. This makes us believe that the model is not working properly, and hence we cannot trust the results of this analysis.
The ROC curves for the validation data for both the CARTs above as well as the logistic regression are as follows:
The best point on the ROC curve is at around an 80-90% true positive rate.
The Lift curves for the validation data for our three classifiers are the following:
The lift curve shows that the results of the analysis do not change much beyond 50% of the validation data.
The lift and ROC curve results do not seem accurate. This supports our belief that the model does not work correctly and that the results should not be taken into consideration.
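The curves themselves were produced with the course plotting helpers; an equivalent sketch with the ROCR package (an assumption; the report may use different plotting code) is:
library(ROCR)  # ROC and lift curves from predicted probabilities
pred_obj <- prediction(probs_validation, actual_class)  # objects from the hit-ratio sketch above
plot(performance(pred_obj, "tpr", "fpr"), main = "ROC curve (validation data)")
abline(0, 1, lty = 2)  # reference line for a random classifier
plot(performance(pred_obj, "lift", "rpp"), main = "Lift curve (validation data)")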
We now perform the analysis on the test sample, assuming that the data used for this performance analysis is representative.
As can be observed in this section, the results for the test data are in line with those for the validation data. This confirms our belief that there is a problem with the CART and logistic regression models and that their results should not be used at this point.
This is what the confusion matrix, ROC curve, lift curve and profit curve look like for our test data:
Hit Ratio | |
---|---|
First CART | 94.73333 |
Second CART | 95.80000 |
Logistic Regression | 47.46667 |
The Confusion Matrix for the model with the best validation data hit ratio above:
Predicted 1 | Predicted 0 | |
---|---|---|
Actual 1 | 92.13 | 7.87 |
Actual 0 | 99.86 | 0.14 |
ROC curves for the test data:
Lift Curves for the test data:
We entered this analysis with the goal of understanding what the main drivers for employee churn are. Starting from the raw GOLDHR data set, we first conducted descriptive statistics and created a general correlation matrix to get a better understanding of the data that we were dealing with.
Once we were comfortable with the data, we discovered the most important components of this data (i.e., satisfaction level, amount of work, tenure), and reduced the dimensions of the data set to these important components.
From there we arrived at three distinct segments: Fresh and Happy, who make up about half of the population and are satisfied with their current standing; Overworked employees, who have too many projects and are unhappy about it; and Unchallenged employees, whose workload is disproportionately light and who are unhappy about it. The latter two segments account for the vast majority of churned employees.
In sum, we conclude that the most significant factor driving employee churn is whether the person is happy in the company. Though this may seem obvious, what we found striking about our analysis was how little the other factors contributed to churn relative to employee satisfaction. For example, at first blush we hypothesized that salary might be a major factor for leaving and that employees could be segmented by the importance they attach to salary. However, salary turned out to be roughly equally important across the different segments of employees. Furthermore, we discovered that an employee's satisfaction has very little to do with the department he or she belongs to.
In this light, we believe the next step for this case is clear. We already know that satisfaction drives retention, so it makes sense to understand in more detail what drives satisfaction. To do so, we would recommend re-running a similar analysis, this time omitting "left" from the study and instead using "satisfaction_level" as the new dependent variable. We already have some intuition about what the results may show, as the correlation matrix above indicates that projects, tenure and hours are among the factors most correlated with satisfaction. Still, it would behoove us to confirm this hypothesis by properly going through the same process we conducted above.
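As a starting point for that follow-up, a sketch of a simple first look (a linear regression of the scaled satisfaction level on the remaining factors; this is only an illustration of the proposed next step, not part of the current analysis, and it assumes ProjectData2_scaled contains a column named satisfaction_level):
# Regress satisfaction on the remaining factors as a first look at its drivers
satisfaction_model <- lm(satisfaction_level ~ ., data = data.frame(ProjectData2_scaled))
round(summary(satisfaction_model)$coefficients, 2)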
Note: we are not summarizing the results of the classification at this point because of the issues mentioned above.