Objectives of the project

Goal

We decided to analyze HR data covering various employee-related metrics. We took a sample data set from a large American corporation and use it to propose a model for one of the most pressing business problems today: employee turnover. Our primary goal is therefore to understand the drivers of turnover - why do people leave the company?

Data set


Load the data from Kaggle.com regarding HR analytics in a major US company
https://www.kaggle.com/ludobenistant/hr-analytics

We picked this data set for the following reasons: it has a considerable number of observations (14,998) and a manageable number of factors (19). This eases the process of dimensionality reduction while providing a sufficient basis for a meaningful segmentation. The factors cover a broad set of characteristics: demographics, performance levels, and on/off “dummy” variables. We will therefore be able to study the interactions between different kinds of factors.

The factors included in the dataset are:
* Satisfaction level (0-1)
* Last evaluation (0-1) - the score received during the last performance review
* Number of projects in which the employee was engaged
* Average monthly hours worked
* Time spent in the company in years
* Work accident (0 or 1) - whether the person had a work-related accident
* Promotion in the last 5 years (0 or 1)
* Department in which the person works
* Salary (low – medium – high)
* Whether the employee has left (0 or 1)

The Department factor contained several text values (e.g. sales, hr, accounting). We created an additional column for each value and populated it with 1 when that value appeared in the original column and with 0 otherwise. In this way, we transformed the text variable into dummy variables. Given that most of the factors are numerical, we use basic descriptive statistics as a starting point and then scale (standardize) the data.
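
As an illustration, the dummy-coding step could be done as in the sketch below (the file name and the raw text column names Department and Salary are assumptions; the resulting table corresponds to the ProjectData object used later in this report, possibly with columns reordered).

hr_raw <- read.csv("HR_data.csv", stringsAsFactors = FALSE)  # illustrative file name

# One 0/1 column per department value; "- 1" drops the intercept so every
# department level gets its own dummy column.
department_dummies <- model.matrix(~Department - 1, data = hr_raw)
colnames(department_dummies) <- sub("^Department", "", colnames(department_dummies))

# Recode the salary text levels as integers: low = 0, medium = 1, high = 2.
hr_raw$Salary <- match(hr_raw$Salary, c("low", "medium", "high")) - 1

hr_dummified <- cbind(hr_raw[, setdiff(names(hr_raw), "Department")], department_dummies)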

Below is a sample of the data for 15 employees. The columns on the right are the department dummy variables we created:

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Salary Sales Accounting Hr product_mng technical support IT RandD
0.38 0.53 2 157 3 0 1 0 0 1 0 0 0 0 0 0 0
0.80 0.86 5 262 6 0 1 0 1 1 0 0 0 0 0 0 0
0.11 0.88 7 272 4 0 1 0 1 1 0 0 0 0 0 0 0
0.72 0.87 5 223 5 0 1 0 0 1 0 0 0 0 0 0 0
0.37 0.52 2 159 3 0 1 0 0 1 0 0 0 0 0 0 0
0.41 0.50 2 153 3 0 1 0 0 1 0 0 0 0 0 0 0
0.10 0.77 6 247 4 0 1 0 0 1 0 0 0 0 0 0 0
0.92 0.85 5 259 5 0 1 0 0 1 0 0 0 0 0 0 0
0.89 1.00 5 224 5 0 1 0 0 1 0 0 0 0 0 0 0
0.42 0.53 2 142 3 0 1 0 0 1 0 0 0 0 0 0 0
0.45 0.54 2 135 3 0 1 0 0 1 0 0 0 0 0 0 0
0.11 0.81 6 305 4 0 1 0 0 1 0 0 0 0 0 0 0
0.84 0.92 4 234 5 0 1 0 0 1 0 0 0 0 0 0 0
0.41 0.55 2 148 3 0 1 0 0 1 0 0 0 0 0 0 0
0.36 0.56 2 137 3 0 1 0 0 1 0 0 0 0 0 0 0

Descriptive statistics

We start by calculating the descriptive statistics of the raw (unscaled) factors, including the department dummies:

min 25th pct median mean 75th pct max std
satisfaction_level 0.09 0.44 0.64 0.61 0.82 1 0.25
last_evaluation 0.36 0.56 0.72 0.72 0.87 1 0.17
number_project 2.00 3.00 4.00 3.80 5.00 7 1.23
average_montly_hours 96.00 156.00 200.00 201.05 245.00 310 49.94
time_spend_company 2.00 3.00 3.00 3.50 4.00 10 1.46
Work_accident 0.00 0.00 0.00 0.14 0.00 1 0.35
left 0.00 0.00 0.00 0.24 0.00 1 0.43
promotion_last_5years 0.00 0.00 0.00 0.02 0.00 1 0.14
Salary 0.00 0.00 1.00 0.59 1.00 2 0.64
Sales 0.00 0.00 0.00 0.28 1.00 1 0.45
Accounting 0.00 0.00 0.00 0.05 0.00 1 0.22
Hr 0.00 0.00 0.00 0.05 0.00 1 0.22
product_mng 0.00 0.00 0.00 0.06 0.00 1 0.24
technical 0.00 0.00 0.00 0.18 0.00 1 0.39
support 0.00 0.00 0.00 0.15 0.00 1 0.36
IT 0.00 0.00 0.00 0.08 0.00 1 0.27
RandD 0.00 0.00 0.00 0.05 0.00 1 0.22

We now standardize the data (excluding the department dummy variables) to make the factors more comparable and to ease the analysis.
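
A sketch of this standardization step, mirroring the scaling pattern used later in this report (ProjectData2 is an assumed name for the unscaled non-dummy factors; the result corresponds to ProjectData2_scaled):

# Standardize each factor to mean 0 and standard deviation 1.
ProjectData2_scaled <- apply(ProjectData2, 2, function(x) if (sd(x) != 0) (x - 
    mean(x))/sd(x) else 0 * x)

summary(ProjectData2_scaled)                 # min, quartiles, median, mean, max per factor
round(apply(ProjectData2_scaled, 2, sd), 2)  # standard deviations (1 by construction)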

The summary statistics of the scaled dataset (excluding the department dummies) are:

min 25th pct median mean 75th pct max std
satisfaction_level -2.10 -0.70 0.11 0 0.83 1.56 1
last_evaluation -2.08 -0.91 0.02 0 0.90 1.66 1
number_project -1.46 -0.65 0.16 0 0.97 2.59 1
average_montly_hours -2.10 -0.90 -0.02 0 0.88 2.18 1
time_spend_company -1.03 -0.34 -0.34 0 0.34 4.45 1
Work_accident -0.41 -0.41 -0.41 0 -0.41 2.43 1
left -0.56 -0.56 -0.56 0 -0.56 1.79 1
promotion_last_5years -0.15 -0.15 -0.15 0 -0.15 6.78 1
Salary -0.93 -0.93 0.64 0 0.64 2.21 1

Let’s see how these are correlated. The correlation matrix is as follows:

satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Salary Sales Accounting Hr product_mng technical support IT RandD
satisfaction_level 1.00 0.11 -0.14 -0.02 -0.10 0.06 -0.39 0.03 0.05 0.00 -0.03 -0.01 0.01 -0.01 0.01 0.01 0.01
last_evaluation 0.11 1.00 0.35 0.34 0.13 -0.01 0.01 -0.01 -0.01 -0.02 0.00 -0.01 0.00 0.01 0.02 0.00 -0.01
number_project -0.14 0.35 1.00 0.42 0.20 0.00 0.02 -0.01 0.00 -0.01 0.00 -0.03 0.00 0.03 0.00 0.00 0.01
average_montly_hours -0.02 0.34 0.42 1.00 0.13 -0.01 0.07 0.00 0.00 0.00 0.00 -0.01 -0.01 0.01 0.00 0.01 0.00
time_spend_company -0.10 0.13 0.20 0.13 1.00 0.00 0.14 0.07 0.05 0.02 0.00 -0.02 0.00 -0.03 -0.03 -0.01 -0.02
Work_accident 0.06 -0.01 0.00 -0.01 0.00 1.00 -0.15 0.04 0.01 0.00 -0.01 -0.02 0.00 -0.01 0.01 -0.01 0.02
left -0.39 0.01 0.02 0.07 0.14 -0.15 1.00 -0.06 -0.16 0.01 0.02 0.03 -0.01 0.02 0.01 -0.01 -0.05
promotion_last_5years 0.03 -0.01 -0.01 0.00 0.07 0.04 -0.06 1.00 0.10 0.01 0.00 0.00 -0.04 -0.04 -0.04 -0.04 0.02
Salary 0.05 -0.01 0.00 0.00 0.05 0.01 -0.16 0.10 1.00 -0.04 0.01 0.00 -0.01 -0.02 -0.03 -0.01 0.00
Sales 0.00 -0.02 -0.01 0.00 0.02 0.00 0.01 0.01 -0.04 1.00 -0.14 -0.14 -0.16 -0.29 -0.26 -0.18 -0.15
Accounting -0.03 0.00 0.00 0.00 0.00 -0.01 0.02 0.00 0.01 -0.14 1.00 -0.05 -0.06 -0.11 -0.10 -0.07 -0.05
Hr -0.01 -0.01 -0.03 -0.01 -0.02 -0.02 0.03 0.00 0.00 -0.14 -0.05 1.00 -0.06 -0.11 -0.10 -0.07 -0.05
product_mng 0.01 0.00 0.00 -0.01 0.00 0.00 -0.01 -0.04 -0.01 -0.16 -0.06 -0.06 1.00 -0.12 -0.11 -0.08 -0.06
technical -0.01 0.01 0.03 0.01 -0.03 -0.01 0.02 -0.04 -0.02 -0.29 -0.11 -0.11 -0.12 1.00 -0.20 -0.14 -0.11
support 0.01 0.02 0.00 0.00 -0.03 0.01 0.01 -0.04 -0.03 -0.26 -0.10 -0.10 -0.11 -0.20 1.00 -0.12 -0.10
IT 0.01 0.00 0.00 0.01 -0.01 -0.01 -0.01 -0.04 -0.01 -0.18 -0.07 -0.07 -0.08 -0.14 -0.12 1.00 -0.07
RandD 0.01 -0.01 0.01 0.00 -0.02 0.02 -0.05 0.02 0.00 -0.15 -0.05 -0.05 -0.06 -0.11 -0.10 -0.07 1.00

Looking at the correlation matrix, we can already form some expectations for the employee segmentation analysis:
* Satisfaction level, last evaluation, number of projects, average monthly hours and time spent in the company each show a relatively high positive or negative correlation with at least one other factor. Consequently, we expect these factors to play a major role in the segmentation of employees.
* Work accidents, salary and promotion do not show any strong correlation with the other factors. We cannot yet tell whether they matter to employees, but they will probably not be driving factors for splitting employees into segments.
* The departments likewise show very low correlations with the remaining factors. We therefore do not expect them to be useful for profiling later on, but this will be analyzed in a separate step.
* Left (employee turnover) is strongly negatively correlated with satisfaction level (-0.39). Consequently, at this stage, we expect satisfaction level to be one of the drivers of employee turnover.

Dimensionality reduction


For dimensionality reduction we followed the process highlighted in class.

Based on the cumulative variance explained, we decided to group the factors into three components. Three components explain more than 50% of the variance (see the charts below).

Inputs

# Columns used
factor_attributes_used = c(1:17)

# Factor selection criterion. Choices: 'eigenvalue', 'variance', 'manual'
factor_selectionciterion = "manual"

# Please ENTER the desired minimum variance explained
minimum_variance_explained = 40  # between 1 and 100

# Please ENTER the number of factors to use
manual_numb_factors_used = 3

# Please ENTER the rotation eventually used (e.g. 'none', 'varimax',
# 'quartimax', 'promax', 'oblimin', 'simplimax', and 'cluster' - see
# help(principal)). Default is 'varimax'
rotation_used = "varimax"

Define Database for Clustering and Segmentation

For dimensionality reduction and employee segmentation we use the following eight factors: satisfaction level, last evaluation, number of projects, average monthly hours, time spent in the company, work accident, salary, and promotion.

ProjectData_Clustering <- ProjectData2_scaled[, c(1:8)]

Save the Inputs in Variables

factor_attributes_used <- intersect(factor_attributes_used, 1:ncol(ProjectData_Clustering))
ProjectDataFactor <- ProjectData_Clustering[, factor_attributes_used]
ProjectDataFactor <- ProjectData_Clustering <- data.matrix(ProjectDataFactor)

Eigenvalue and explained variance

# `PCA` function
Variance_Explained_Table_results <- PCA(ProjectDataFactor, graph = FALSE)
Variance_Explained_Table <- Variance_Explained_Table_results$eig
Variance_Explained_Table_copy <- Variance_Explained_Table

rownames(Variance_Explained_Table) <- paste("Component", 1:nrow(Variance_Explained_Table), 
    sep = " ")
colnames(Variance_Explained_Table) <- c("Eigenvalue", "Pct of explained variance", 
    "Cumulative pct of explained variance")
iprint.df(round(Variance_Explained_Table, 2))
Eigenvalue Pct of explained variance Cumulative pct of explained variance
Component 1 1.83 22.88 22.88
Component 2 1.16 14.46 37.33
Component 3 1.11 13.91 51.24
Component 4 0.98 12.30 63.54
Component 5 0.89 11.18 74.72
Component 6 0.84 10.53 85.25
Component 7 0.63 7.90 93.15
Component 8 0.55 6.85 100.00
eigenvalues <- Variance_Explained_Table[, "Eigenvalue"]
df <- cbind(as.data.frame(eigenvalues), c(1:length(eigenvalues)), rep(1, length(eigenvalues)))
colnames(df) <- c("eigenvalues", "components", "abline")
iplot.df(melt(df, id = "components"))

See top factors and rotate

if (factor_selectionciterion == "eigenvalue") factors_selected = sum(Variance_Explained_Table_copy[, 
    1] >= 1)
if (factor_selectionciterion == "variance") factors_selected = 1:head(which(Variance_Explained_Table_copy[, 
    "cumulative percentage of variance"] >= minimum_variance_explained), 1)
if (factor_selectionciterion == "manual") factors_selected = manual_numb_factors_used
Rotated_Results <- principal(ProjectDataFactor, nfactors = max(factors_selected), 
    rotate = rotation_used, score = TRUE)
Rotated_Factors <- round(Rotated_Results$loadings, 2)
Rotated_Factors <- as.data.frame(unclass(Rotated_Factors))
colnames(Rotated_Factors) <- paste("Comp.", 1:ncol(Rotated_Factors), sep = "")

sorted_rows <- sort(Rotated_Factors[, 1], decreasing = TRUE, index.return = TRUE)$ix
Rotated_Factors <- Rotated_Factors[sorted_rows, ]

iprint.df(Rotated_Factors, scale = TRUE)
Comp.1 Comp.2 Comp.3
average_montly_hours 0.76 -0.01 -0.02
number_project 0.75 -0.24 0.03
last_evaluation 0.74 0.24 -0.04
time_spend_company 0.33 -0.40 0.41
satisfaction_level 0.03 0.86 0.01
Work_accident 0.01 0.34 0.22
Salary -0.03 0.12 0.65
promotion_last_5years -0.04 0.05 0.71

Segmentation


From the above exercise, we have established three key components, as well as the attributes that best associate with these components. We can infer from the data that each of the components refers to the following:

Comp.1: Care about the work
Comp.2: Satisfaction with the company
Comp.3: Driven by external incentives

Using these components, we can now begin our segmentation.

Set Input Variables

# Select Attributes to use
segmentation_attributes_used = c(1, 4, 8)

# original raw attributes
profile_attributes_used = c(1:17)

# Please ENTER the number of clusters to eventually use for this report
numb_clusters_used = 3

# Please enter the method to use for the segmentation:
profile_with = "kmeans"


# Please ENTER the distance metric eventually used for the clustering in
# case of hierarchical clustering (e.g. 'euclidean', 'maximum', 'manhattan',
# 'canberra', 'binary' or 'minkowski' - see help(dist)). DEFAULT is 'euclidean'
# distance_used = 'euclidean'


# Please ENTER the kmeans clustering method to use (options are:
# 'Hartigan-Wong', 'Lloyd', 'Forgy', 'MacQueen').  DEFAULT is 'Lloyd'
kmeans_method = "Lloyd"


First, we selected our segmentation attributes by taking, for each of the three components, the attribute that loads most strongly on it: average_montly_hours (table column 4) for component 1, satisfaction_level (table column 1) for component 2, and promotion_last_5years (table column 8) for component 3.

Next, for profile_attributes_used, we included all the columns (our eight factors stated above) in the data set, per standard procedure.

For the number of segments, we ran several trials to find a number of clusters that were clearly distinct from each other. We started with 5, but noticed that while four of the segments were highly differentiated, one segment was not strongly characterized by any attribute. We then tried 4 and still observed some poorly separated clusters. With 3 we finally obtained unique, highly differentiated segments, so this is the number we used for our segmentation.
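
As a numerical cross-check of this choice, one could also compare the total within-cluster sum of squares for a few candidate numbers of clusters, reusing the scaled clustering data and the k-means settings from below (a sketch, not part of our original trials):

# Total within-cluster sum of squares for k = 2..6; lower is tighter, and a
# visible "elbow" suggests a reasonable number of segments.
set.seed(42)
within_ss <- sapply(2:6, function(k) kmeans(ProjectData_Clustering, centers = k, 
    iter.max = 2000, algorithm = "Lloyd")$tot.withinss)
names(within_ss) <- paste("k =", 2:6)
round(within_ss, 1)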

Finally, we used k-means clustering with the Lloyd algorithm.

Save selected criteria

segmentation_attributes_used <- intersect(segmentation_attributes_used, 1:ncol(ProjectData_Clustering))
profile_attributes_used <- intersect(profile_attributes_used, 1:ncol(ProjectData_Clustering))

ProjectData_segment <- ProjectData_Clustering[, segmentation_attributes_used]
ProjectData_profile <- ProjectData_Clustering[, profile_attributes_used]

ProjectData_scaled <- apply(ProjectData_Clustering, 2, function(r) if (sd(r) != 
    0) (r - mean(r))/sd(r) else 0 * r)


This intermediate step assigns the variables and tables used in the exercises below.

Apply kmeans to calculate clusters/segments

kmeans_clusters <- kmeans(ProjectData_Clustering, centers = numb_clusters_used, 
    iter.max = 2000, algorithm = kmeans_method)

ProjectData_with_kmeans_membership <- cbind(1:length(kmeans_clusters$cluster), 
    kmeans_clusters$cluster)
colnames(ProjectData_with_kmeans_membership) <- c("Empl.Nr", "Cluster_Membership")

iprint.df(round(head(ProjectData_with_kmeans_membership, max_data_report), 2))

Empl.Nr Cluster_Membership
1 3
2 2
3 2
4 2
5 3
6 3
7 2
8 2
9 2
10 3
11 3
12 2
13 2
14 3
15 3

Using the above algorithm, we assigned each employee to one of the three segments. The table shows a sample of employees; the memberships appear reasonably spread across the segments.
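
To check this beyond the first 15 employees, the full membership vector can simply be tabulated (a quick sketch using the kmeans_clusters object from above):

# Segment sizes and shares over all employees.
table(kmeans_clusters$cluster)
round(100 * prop.table(table(kmeans_clusters$cluster)), 1)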

Segment comparison to mean

Please note that in the following table the first five factors are standardized (mean 0, standard deviation 1), the sixth and seventh factors are binary, and the eighth factor is coded as 0, 1, or 2.

cluster_memberships_kmeans <- kmeans_clusters$cluster
cluster_ids_kmeans <- unique(cluster_memberships_kmeans)

if (FALSE) {
    if (profile_with == "hclust") {
        cluster_memberships <- cluster_memberships_hclust
        cluster_ids <- cluster_ids_hclust
    }
    if (profile_with == "kmeans") {
        cluster_memberships <- cluster_memberships_kmeans
        cluster_ids <- cluster_ids_kmeans
    }
}

cluster_memberships <- cluster_memberships_kmeans
cluster_ids <- cluster_ids_kmeans

NewData = matrix(cluster_memberships, ncol = 1)

population_average = matrix(apply(ProjectData_profile, 2, mean), ncol = 1)
colnames(population_average) <- "Population"
Cluster_Profile_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_profile[(cluster_memberships == 
    i), ], 2, mean))
if (ncol(ProjectData_profile) < 2) Cluster_Profile_mean = t(Cluster_Profile_mean)
colnames(Cluster_Profile_mean) <- paste("Seg.", 1:length(cluster_ids), sep = "")
colnames(Cluster_Profile_mean)[3] <- "Unchallenged"
colnames(Cluster_Profile_mean)[1] <- "FreshAndHappy"
colnames(Cluster_Profile_mean)[2] <- "Overworked"
cluster.profile <- cbind(population_average, Cluster_Profile_mean)

iprint.df(round(cluster.profile, 2))
Population FreshAndHappy Overworked Unchallenged
satisfaction_level 0.00 0.57 -0.65 -0.52
last_evaluation 0.00 0.23 0.64 -1.00
number_project 0.00 -0.01 0.97 -0.84
average_montly_hours 0.00 0.14 0.76 -0.95
time_spend_company 0.00 -0.44 1.17 -0.20
Work_accident 0.14 0.17 0.11 0.13
promotion_last_5years 0.02 0.02 0.02 0.02
Salary 0.59 0.64 0.56 0.53

Visualization of Segments

ProjectData_scaled_profile = ProjectData_scaled[, profile_attributes_used, drop = F]

Cluster_Profile_standar_mean <- sapply(sort(cluster_ids), function(i) apply(ProjectData_scaled_profile[(cluster_memberships == 
    i), , drop = F], 2, mean))
if (ncol(ProjectData_scaled_profile) < 2) Cluster_Profile_standar_mean = t(Cluster_Profile_standar_mean)
colnames(Cluster_Profile_standar_mean) <- paste("Seg ", 1:length(cluster_ids), 
    sep = "")
colnames(Cluster_Profile_standar_mean)[3] <- "Unchallenged"
colnames(Cluster_Profile_standar_mean)[1] <- "FreshAndHappy"
colnames(Cluster_Profile_standar_mean)[2] <- "Overworked"
iplot.df(melt(cbind.data.frame(idx = as.numeric(1:nrow(Cluster_Profile_standar_mean)), 
    Cluster_Profile_standar_mean), id = "idx"), xlab = "Profiling variables (standardized)", 
    ylab = "Mean of cluster")


Here we see in greater detail how each segment of employees scores on our key attributes, which gives us more information about the nature of each segment profile. Based on the table and graph above, we interpret the segments as follows:
* The “Fresh&Happy” segment has very high satisfaction levels; its employees are relatively new to the company and have a slightly higher salary than the average.
* The “Unchallenged” segment is characterized by few projects, few monthly hours, and low evaluations. These employees generally have low satisfaction levels.
* The “Overworked” segment is characterized by long hours and many projects. These employees generally have high evaluations and have already spent significant time in the company. Their satisfaction level is low.

Our best guess, based on the data and our interpretation, is that low satisfaction is triggered by being either unchallenged or overworked. The next step in the analysis could be to check whether this hypothesis is correct by running a regression of satisfaction level on the other factors.
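
A sketch of what that next step could look like, using the scaled data already available (this is a suggestion, not an analysis we ran for this report):

# Regress satisfaction level on the remaining seven factors to test whether low
# satisfaction is associated with being overworked or unchallenged.
satisfaction_fit <- lm(satisfaction_level ~ last_evaluation + number_project + 
    average_montly_hours + time_spend_company + Work_accident + promotion_last_5years + 
    Salary, data = as.data.frame(ProjectData_Clustering))
summary(satisfaction_fit)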

As already anticipated in the correlation analysis, we now see that salary, promotion and work accident cannot be used as attributes for segmentation, since they do not differentiate employees across segments. All three segments show values very close to the average of the total population.

Profiling

Create Department Table

We observed earlier that the correlations between the other variables and the department in which a person works were close to 0. Nevertheless, we wanted to cross-check whether the segments can be profiled by department.
The first table shows the number of people from each segment employed in each department. The second table shows that, despite differing in size, the departments are represented almost equally across segments in percentage terms: every department has roughly the same share of each segment. These shares are almost suspiciously close, which may be a red flag about the quality (a potential lack of randomness) of the data set.
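
The chunk below builds these tables one department at a time; an equivalent, more compact sketch (assuming ProjectData_Departments has been assembled with the cluster memberships, as in that chunk) would be:

# Count employees per segment and department, then express the counts as a
# percentage of each segment's total size.
dept_cols <- c("Sales", "Accounting", "Hr", "product_mng", "technical", "support", 
    "IT", "RandD")
dept_counts <- sapply(dept_cols, function(d) tapply(ProjectData_Departments[[d]], 
    ProjectData_Departments$Cluster_Membership, sum))
seg_totals <- as.numeric(table(ProjectData_Departments$Cluster_Membership))
dept_counts                             # headcount per segment and department
round(100 * dept_counts/seg_totals, 2)  # share of each segment, in percent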

ProjectData_Departments <- ProjectData[, c(7, 10:17)]

ProjectData_Departments <- cbind(Empl.Nr = c(1:nrow(ProjectData_Departments)), 
    ProjectData_Departments)
ProjectData_Departments <- merge(x = ProjectData_Departments, y = ProjectData_with_kmeans_membership, 
    by = "Empl.Nr", all.x = TRUE)
resultstable = NULL

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters))

newtable = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtable)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$Sales == 1))

newtablesales = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablesales)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$Accounting == 1))

newtableaccounting = cbind(unique(ProjectData_Departments$Cluster_Membership), 
    resultstable)
colnames(newtableaccounting)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$Hr == 1))

newtablehr = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablehr)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$product_mng == 1))

newtablepm = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablepm)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$technical == 1))

newtabletech = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtabletech)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$support == 1))

newtablesupport = cbind(unique(ProjectData_Departments$Cluster_Membership), 
    resultstable)
colnames(newtablesupport)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$IT == 1))

newtableit = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtableit)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$RandD == 1))

newtablerand = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablerand)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$Sales == 0 & ProjectData_Departments$Accounting == 
    0 & ProjectData_Departments$Hr == 0 & ProjectData_Departments$product_mng == 
    0 & ProjectData_Departments$technical == 0 & ProjectData_Departments$support == 
    0 & ProjectData_Departments$IT == 0 & ProjectData_Departments$RandD == 0))

newtablemktandmgmt = cbind(unique(ProjectData_Departments$Cluster_Membership), 
    resultstable)
colnames(newtablemktandmgmt)[1] <- "Cluster"

ProjectData_SegmentStats <- merge(x = newtable, y = newtablesales, by = "Cluster", 
    all.x = TRUE)
colnames(ProjectData_SegmentStats)[2] <- "Total"
colnames(ProjectData_SegmentStats)[3] <- "Sales"

ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtableaccounting, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[4] <- "Accounting"

ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablehr, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[5] <- "Hr"

ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablepm, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[6] <- "product_mng"

ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtabletech, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[7] <- "technical"

ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablesupport, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[8] <- "support"

ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtableit, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[9] <- "IT"

ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablerand, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[10] <- "RandD"

ProjectData_SegmentStats <- merge(x = ProjectData_SegmentStats, y = newtablemktandmgmt, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentStats)[11] <- "Marketing/Management"
colnames(ProjectData_SegmentStats)[1] <- "Segment"
rownames(ProjectData_SegmentStats)[1] <- "FreshAndHappy"
rownames(ProjectData_SegmentStats)[2] <- "Overworked"
rownames(ProjectData_SegmentStats)[3] <- "Unchallenged"
iprint.df(round(ProjectData_SegmentStats, 1))
Segment Total Sales Accounting Hr product_mng technical support IT RandD Marketing/Management
FreshAndHappy 1 7532 2018 356 377 445 1386 1160 637 420 733
Overworked 2 3514 988 195 150 208 628 485 271 176 413
Unchallenged 3 3953 1133 216 212 249 706 584 319 191 343
salespercent = c(1, 2, 3)
accountingpercent = c(1, 2, 3)
hrpercent = c(1, 2, 3)
pmpercent = c(1, 2, 3)
techpercent = c(1, 2, 3)
supportpercent = c(1, 2, 3)
itpercent = c(1, 2, 3)
randdpercent = c(1, 2, 3)
mktgmgtpercent = c(1, 2, 3)

newintermediatetable = cbind(ProjectData_SegmentStats, salespercent, accountingpercent, 
    hrpercent, pmpercent, techpercent, supportpercent, itpercent, randdpercent, 
    mktgmgtpercent)

newintermediatetable[, "salespercent"] = round(newintermediatetable[, "Sales"]/newintermediatetable[, 
    "Total"], 4) * 100

newintermediatetable[, "accountingpercent"] = round(newintermediatetable[, "Accounting"]/newintermediatetable[, 
    "Total"], 4) * 100

newintermediatetable[, "hrpercent"] = round(newintermediatetable[, "Hr"]/newintermediatetable[, 
    "Total"], 4) * 100

newintermediatetable[, "pmpercent"] = round(newintermediatetable[, "product_mng"]/newintermediatetable[, 
    "Total"], 4) * 100

newintermediatetable[, "techpercent"] = round(newintermediatetable[, "technical"]/newintermediatetable[, 
    "Total"], 4) * 100

newintermediatetable[, "supportpercent"] = round(newintermediatetable[, "support"]/newintermediatetable[, 
    "Total"], 4) * 100

newintermediatetable[, "itpercent"] = round(newintermediatetable[, "IT"]/newintermediatetable[, 
    "Total"], 4) * 100

newintermediatetable[, "randdpercent"] = round(newintermediatetable[, "RandD"]/newintermediatetable[, 
    "Total"], 4) * 100

newintermediatetable[, "mktgmgtpercent"] = round(newintermediatetable[, "Marketing/Management"]/newintermediatetable[, 
    "Total"], 4) * 100

ProjectData_SegmentPcts = newintermediatetable[, c(1, 2, 12:20)]

iprint.df(round(ProjectData_SegmentPcts, 2))

Segment Total salespercent accountingpercent hrpercent pmpercent techpercent supportpercent itpercent randdpercent mktgmgtpercent
FreshAndHappy 1 7532 26.79 4.73 5.01 5.91 18.40 15.40 8.46 5.58 9.73
Overworked 2 3514 28.12 5.55 4.27 5.92 17.87 13.80 7.71 5.01 11.75
Unchallenged 3 3953 28.66 5.46 5.36 6.30 17.86 14.77 8.07 4.83 8.68

The next two tables restrict the counts in each segment to the employees who left the company. We see that the largest segment, the “Fresh&Happy” segment, shows a very low turnover rate, while the turnover rate is high for the other two segments. (Note that the “Fresh&Happy” figure could be partially biased: these employees are relatively new to the company and have not had much time to leave yet.) The “Overworked” segment shows the highest turnover rate.

resultstable = NULL

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters))

newtable = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtable)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$Sales == 1 & ProjectData_Departments$left == 
    1))

newtablesales = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablesales)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$Accounting == 1 & ProjectData_Departments$left == 
    1))

newtableaccounting = cbind(unique(ProjectData_Departments$Cluster_Membership), 
    resultstable)
colnames(newtableaccounting)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$Hr == 1 & ProjectData_Departments$left == 
    1))

newtablehr = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablehr)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$product_mng == 1 & ProjectData_Departments$left == 
    1))

newtablepm = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablepm)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$technical == 1 & ProjectData_Departments$left == 
    1))

newtabletech = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtabletech)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$support == 1 & ProjectData_Departments$left == 
    1))

newtablesupport = cbind(unique(ProjectData_Departments$Cluster_Membership), 
    resultstable)
colnames(newtablesupport)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$IT == 1 & ProjectData_Departments$left == 
    1))

newtableit = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtableit)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$RandD == 1 & ProjectData_Departments$left == 
    1))

newtablerand = cbind(unique(ProjectData_Departments$Cluster_Membership), resultstable)
colnames(newtablerand)[1] <- "Cluster"

resultstable = sapply(unique(ProjectData_Departments$Cluster_Membership), function(NrClusters) sum(ProjectData_Departments$Cluster_Membership == 
    NrClusters & ProjectData_Departments$Sales == 0 & ProjectData_Departments$Accounting == 
    0 & ProjectData_Departments$Hr == 0 & ProjectData_Departments$product_mng == 
    0 & ProjectData_Departments$technical == 0 & ProjectData_Departments$support == 
    0 & ProjectData_Departments$IT == 0 & ProjectData_Departments$RandD == 0 & 
    ProjectData_Departments$left == 1))

newtablemktandmgmt = cbind(unique(ProjectData_Departments$Cluster_Membership), 
    resultstable)
colnames(newtablemktandmgmt)[1] <- "Cluster"

ProjectData_SegmentChurnStats <- merge(x = newtable, y = newtablesales, by = "Cluster", 
    all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[2] <- "Total"
colnames(ProjectData_SegmentChurnStats)[3] <- "Sales"

ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtableaccounting, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[4] <- "Accounting"

ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablehr, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[5] <- "Hr"

ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablepm, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[6] <- "product_mng"

ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtabletech, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[7] <- "technical"

ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablesupport, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[8] <- "support"

ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtableit, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[9] <- "IT"

ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablerand, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[10] <- "RandD"

ProjectData_SegmentChurnStats <- merge(x = ProjectData_SegmentChurnStats, y = newtablemktandmgmt, 
    by = "Cluster", all.x = TRUE)
colnames(ProjectData_SegmentChurnStats)[11] <- "Marketing/Management"
colnames(ProjectData_SegmentChurnStats)[1] <- "Segment"
rownames(ProjectData_SegmentChurnStats)[1] <- "FreshAndHappy"
rownames(ProjectData_SegmentChurnStats)[2] <- "Overworked"
rownames(ProjectData_SegmentChurnStats)[3] <- "Unchallenged"
iprint.df(round(ProjectData_SegmentChurnStats, 1))
Segment Total Sales Accounting Hr product_mng technical support IT RandD Marketing/Management
FreshAndHappy 1 7532 59 6 18 4 44 25 12 1 7
Overworked 2 3514 479 103 88 108 381 290 148 71 145
Unchallenged 3 3953 475 95 109 86 272 240 113 49 143
newintermediatetable = cbind(ProjectData_SegmentStats, ProjectData_SegmentChurnStats[, 
    3:11])

newintermediatetable[, "salespercent"] = round(newintermediatetable[, 12]/newintermediatetable[, 
    3], 4)

newintermediatetable[, "accountingpercent"] = round(newintermediatetable[, 13]/newintermediatetable[, 
    4], 4)

newintermediatetable[, "hrpercent"] = round(newintermediatetable[, 14]/newintermediatetable[, 
    5], 4)

newintermediatetable[, "pmpercent"] = round(newintermediatetable[, 15]/newintermediatetable[, 
    6], 4)

newintermediatetable[, "techpercent"] = round(newintermediatetable[, 16]/newintermediatetable[, 
    7], 4)

newintermediatetable[, "supportpercent"] = round(newintermediatetable[, 17]/newintermediatetable[, 
    8], 4)

newintermediatetable[, "itpercent"] = round(newintermediatetable[, 18]/newintermediatetable[, 
    9], 4)

newintermediatetable[, "randdpercent"] = round(newintermediatetable[, 19]/newintermediatetable[, 
    10], 4)

newintermediatetable[, "mktgmgtpercent"] = round(newintermediatetable[, 20]/newintermediatetable[, 
    11], 4)

ProjectData_SegmentChurnPcts = newintermediatetable[, c(21:29)] * 100

iprint.df(round(ProjectData_SegmentChurnPcts, 2))

salespercent accountingpercent hrpercent pmpercent techpercent supportpercent itpercent randdpercent mktgmgtpercent
FreshAndHappy 2.92 1.69 4.77 0.90 3.17 2.16 1.88 0.24 0.95
Overworked 48.48 52.82 58.67 51.92 60.67 59.79 54.61 40.34 35.11
Unchallenged 41.92 43.98 51.42 34.54 38.53 41.10 35.42 25.65 41.69

The summary table below shows, for each segment, the total number of employees grouped into that segment, the segment size as a share of all employees in the company (in the database), the number of people who left the company, and the churn rate, calculated as the number of people who left divided by the total number of people in the segment.

As mentioned above, when implementing measures to decrease the company’s churn rate, one should not focus too much on the “Fresh&Happy” segment, but should be very concerned about the other two segments, which show high churn rates.

ProjectData_SegmentSummary = ProjectData_SegmentStats[, c(1:2)]
ProjectData_SegmentSummary[, "%ofCompany"] = ProjectData_SegmentSummary[, "Total"]/sum(ProjectData_SegmentSummary[, 
    "Total"]) * 100
ProjectData_SegmentSummary[1, "Churned"] = sum(ProjectData_SegmentChurnStats[1, 
    c(3:11)])
ProjectData_SegmentSummary[2, "Churned"] = sum(ProjectData_SegmentChurnStats[2, 
    c(3:11)])
ProjectData_SegmentSummary[3, "Churned"] = sum(ProjectData_SegmentChurnStats[3, 
    c(3:11)])
ProjectData_SegmentSummary[, "Churn%ofTotal"] = ProjectData_SegmentSummary[, 
    "Churned"]/ProjectData_SegmentSummary[, "Total"] * 100
iprint.df(round(ProjectData_SegmentSummary, 2))
Segment Total %ofCompany Churned Churn%ofTotal
FreshAndHappy 1 7532 50.22 176 2.34
Overworked 2 3514 23.43 1813 51.59
Unchallenged 3 3953 26.36 1582 40.02

Classification

Note on classification analysis below:

We used the code from the course website, adjusted for our data set and our problem.

CART 1, CART 2 and the logistic regression all return a variable importance of -1.00 for the first independent variable in the set (satisfaction level in our base case); this persisted when we tried multiple independent variables.

The validation confusion matrix shows a huge Type 1 error (99.58%; employees who stay although we predicted them to leave) but a relatively small Type 2 error (8.43%). Increasing the probability threshold does not decrease the Type 1 error, which is counterintuitive.
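
A sketch of how such a threshold check can be done (prob_leave and actual_left are illustrative names for the predicted probability of class 1 and the actual 0/1 outcome on the validation data):

# Row-percentage confusion matrices for several probability thresholds.
for (thr in c(0.5, 0.7, 0.9)) {
    predicted <- factor(prob_leave > thr, levels = c(TRUE, FALSE), labels = c("Predicted 1", 
        "Predicted 0"))
    cat("Threshold:", thr, "\n")
    print(round(100 * prop.table(table(Actual = actual_left, predicted), margin = 1), 2))
}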

The test-accuracy confusion matrix gives essentially the same picture.

We suspected that one reason could be that some of the logistic regression coefficients are not significant. However, excluding them leaves the results largely unchanged, and the logistic regression still produces many misclassifications.

Unfortunately, we were not able to resolve this issue in the time available. We nevertheless kept the analysis in this report and would welcome feedback on it.

ProjectDataLeft = ProjectData[, 7]
ProjectDataClass <- cbind(ProjectData2_scaled, ProjectDataLeft)
dependent_variable = 17


independent_variables = c(1:16)

Probability_Threshold = 90  # between 1 and 99%

estimation_data_percent = 80
validation_data_percent = 10

random_sampling = 0

CART_cp = 0.01

min_segment = 100

We will use two classification trees and logistic regression.

CART_control = 0.001

Based on our “small tree” we can make several observations regarding the churn of our employees. If a person has a very low satisfaction level and a high number of projects, he or she will most likely leave. If a person has a high level of satisfaction and has spent little time at the company, he or she is unlikely to leave.

Let’s look at a larger tree by changing the tree’s complexity control parameter. For example, this is how the tree would look if we set cp = 0.001:
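
A sketch of how the two trees can be fitted with these complexity parameters (estimation_data and validation_data are illustrative names for the estimation and validation samples produced by the course template; the dependent 0/1 column is left):

library(rpart)

# Treat the 0/1 outcome as a class label.
estimation_data$left <- factor(estimation_data$left)

# Small and large classification trees: cp controls how aggressively the tree is
# pruned, minbucket matches the min_segment input above.
CART_small <- rpart(left ~ ., data = estimation_data, method = "class", control = rpart.control(cp = 0.01, 
    minbucket = 100))
CART_large <- rpart(left ~ ., data = estimation_data, method = "class", control = rpart.control(cp = 0.001, 
    minbucket = 100))

# Predicted probability of leaving (class "1") on the validation sample.
prob_leave_cart <- predict(CART_small, newdata = validation_data)[, "1"]
head(round(prob_leave_cart, 2), 15)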

The purity of a leaf indicates the probability that an observation which “reaches that leaf” belongs to a given class. In our case, the probability of leaving for the first few validation observations, using the first CART above, is:

Actual Class Probability of Class 1
Obs 1 0 0.01
Obs 2 1 0.95
Obs 3 1 0.94
Obs 4 1 1.00
Obs 5 1 0.94
Obs 6 1 0.95
Obs 7 1 0.95
Obs 8 1 1.00
Obs 9 1 0.94
Obs 10 1 0.94
Obs 11 1 0.95
Obs 12 1 0.95
Obs 13 1 1.00
Obs 14 1 0.94
Obs 15 1 0.95

We also perform the classification using logistic regression. These are the logistic regression parameters for our data:

Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5 0.1 -14.7 0.0
satisfaction_level -1.0 0.0 -33.2 0.0
last_evaluation 0.1 0.0 3.3 0.0
number_project -0.4 0.0 -11.1 0.0
average_montly_hours 0.2 0.0 6.6 0.0
time_spend_company 0.4 0.0 15.1 0.0
Work_accident -1.4 0.1 -13.0 0.0
promotion_last_5years -1.5 0.4 -3.9 0.0
Salary -0.6 0.0 -13.2 0.0
Sales 0.1 0.1 1.3 0.2
Accounting 0.1 0.2 0.5 0.6
Hr 0.2 0.2 1.5 0.1
product_mng 0.0 0.1 0.3 0.8
technical 0.2 0.1 1.4 0.2
support 0.2 0.1 1.7 0.1
IT 0.1 0.1 0.5 0.6
RandD -0.3 0.2 -1.7 0.1
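
For reference, a sketch of how coefficients like these and the class-1 probabilities below are obtained (using the same illustrative estimation_data and validation_data objects as in the CART sketch):

# Logistic regression on the estimation sample.
logreg <- glm(left ~ ., data = estimation_data, family = binomial(link = "logit"))
round(summary(logreg)$coefficients, 1)

# Predicted probability of leaving for the validation sample.
prob_leave_logreg <- predict(logreg, newdata = validation_data, type = "response")
head(round(prob_leave_logreg, 2), 15)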

The probability of belonging to class 1 (the employee leaves) for the first few validation observations, using the logistic regression above, is:

Actual Class Probability of Class 1
Obs 1 0 0.22
Obs 2 1 0.42
Obs 3 1 0.12
Obs 4 1 0.42
Obs 5 1 0.19
Obs 6 1 0.43
Obs 7 1 0.38
Obs 8 1 0.61
Obs 9 1 0.11
Obs 10 1 0.11
Obs 11 1 0.37
Obs 12 1 0.33
Obs 13 1 0.67
Obs 14 1 0.17
Obs 15 1 0.38

In our case, we can see the relative importance of the independent variables using the variable.importance of the CART trees (see help(rpart.object) in R) or the z-scores from the logistic regression output. For easier visualization, we scale all values to between -1 and 1. From this table we can see the key drivers of the classification according to each of the methods used here: last evaluation, average monthly hours, and time spent in the company.

CART 1 CART 2 Logistic Regr.
satisfaction_level -1.00 -1.00 -1.00
last_evaluation 0.39 0.39 0.10
number_project -0.47 -0.47 -0.33
average_montly_hours 0.44 0.48 0.20
time_spend_company 0.32 0.34 0.45
Work_accident -0.02 -0.02 -0.39
promotion_last_5years 0.00 0.00 -0.12
Salary 0.00 0.00 -0.40
Sales 0.00 0.00 0.04
Accounting 0.00 0.00 0.02
Hr 0.00 0.00 0.05
product_mng 0.00 0.00 0.01
technical 0.00 0.00 0.04
support 0.00 0.00 0.05
IT 0.00 0.00 0.02
RandD 0.00 0.00 -0.05
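
A sketch of how the raw measures behind this table can be extracted and rescaled so that the largest absolute value per method equals 1 (the signed values in the table above additionally carry the direction of each effect, which this sketch omits; the model objects come from the earlier sketches):

scale_to_unit <- function(x) round(x/max(abs(x)), 2)

scale_to_unit(CART_small$variable.importance)               # CART importances (magnitudes only)
scale_to_unit(summary(logreg)$coefficients[-1, "z value"])  # logistic regression z-values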

Validation accuracy

We would suggest selecting a relatively high probability threshold (>90%), because we want to focus only on the employees who are almost certain to leave. Imagine that we wanted to grant additional benefits only to the employees most likely to leave, disregarding those who are only somewhat likely to do so.

We now measure other classification performance metrics for different choices of the probability threshold.

1. Hit ratio

We calculate the percentage of observations that are correctly classified. For a probability threshold of 90%, the hit ratios for the validation data are:

Hit Ratio
First CART 94.80000
Second CART 95.26667
Logistic Regression 47.80000

while for the estimation data the hit rates are:

Hit Ratio
First CART 97.90816
Second CART 98.39987
Logistic Regression 83.33194

We see that for the estimation data the hit rates are much higher than for the validation data.

To benchmark the performance of a classification model we use the Maximum Chance Criterion, which measures the proportion of the largest class. In our validation data, 717 out of 1,500 people leave, so even without any discriminant analysis, classifying all individuals into that group would give a hit rate of 47.8%.
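
The baseline hit rate from assigning everyone to a single class is simply that class’s share of the validation sample (a sketch, with actual_left again standing in for the actual 0/1 outcomes):

# Shares of stayers (0) and leavers (1); classifying all observations into one
# class yields that class's share as the hit rate.
round(100 * prop.table(table(actual_left)), 2)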

2. Confusion matrix

The confusion matrix shows, for each actual class, the percentage of observations assigned to each predicted class. For example, for the method with the highest validation hit rate above (among the logistic regression and the two CART models), the confusion matrix for the validation data is:

Predicted 1 Predicted 0
Actual 1 91.32 8.68
Actual 0 99.58 0.42

The Type 1 error is extremely high. Even when we reran the model with adjusted variables and features, the Type 1 error remained substantial. This makes us believe that the model is not working properly, and hence we cannot trust the results of this analysis.

3. ROC curve

The ROC curves for the validation data for both the CARTs above as well as the logistic regression are as follows:

The best operating point on the ROC curve lies at around an 80-90% true positive rate.

4. Lift curve

The Lift curves for the validation data for our three classifiers are the following:

The lift curve shows that the results change little beyond roughly 50% of the validation data.

The lift and ROC curve results do not look plausible. This supports our belief that the model is not working correctly, and thus the results should not be taken into consideration.

Test Accuracy

We now perform the analysis with the test sample. We assume that the data used for this performance analysis is representative.

As can be observed in this section, the results for the test data are in line with those for the validation data. This confirms our belief that there is a problem with the CART and logistic regression models and that the results therefore should not be used at this point.

This is how the Confusion Matrix, ROC Curve, Lift Curve, and Profit Curve look for our test data:

Hit Ratio
First CART 94.73333
Second CART 95.80000
Logistic Regression 47.46667

The Confusion Matrix for the model with the best validation data hit ratio above:

Predicted 1 Predicted 0
Actual 1 92.13 7.87
Actual 0 99.86 0.14

ROC curves for the test data:

Lift Curves for the test data:

Executive summary and final conclusions

We entered this analysis with the goal of understanding the main drivers of employee churn. Starting from the raw GOLDHR data set, we first computed descriptive statistics and a correlation matrix to get a better understanding of the data we were dealing with.

Once we were comfortable with the data, we discovered the most important components of this data (i.e., satisfaction level, amount of work, tenure), and reduced the dimensions of the data set to these important components.

From there we arrived at three distinct segments: Fresh and Happy, who make up about half of the population and are satisfied with their current standing; Overworked employees, who have too many projects and are unhappy about it; and Unchallenged employees, whose workload is disproportionately light and who are unhappy about it. The latter two segments account for the vast majority of churned employees.

In sum, we conclude that the most significant factor driving employee churn is whether the person is happy in the company. Though this may seem obvious, what we found striking was how little the other factors contributed to churn relative to employee satisfaction. For example, at first blush we hypothesized that salary might be a major reason for leaving and that employees could be segmented by the importance they place on salary. However, salary turned out to be distributed very similarly across the different segments of employees. Furthermore, we found that an employee’s satisfaction has very little to do with the department he or she belongs to.

In this light, we believe the next step for this case is clear. We already know that satisfaction drives retention, so it makes sense to understand in more detail what drives satisfaction. To do so, we would recommend re-running a similar analysis, this time omitting “left” from the study and instead using “satisfaction_level” as the new dependent variable. We already have some expectations about what the results may show, as the correlation matrix above indicates that number of projects, tenure, and hours are among the factors most correlated with satisfaction. Still, it would be prudent to confirm this hypothesis by properly going through the same process we conducted above.

Note. We’re not summarizing the results of the classification at this point due to the issues mentioned above.
