Human Resources Analytics

How do we retain our best and most experienced employees?

Our dataset includes 14,999 observations, each row representing a single employee.

The dataset contains the following 10 variables for each employee:
- Employee satisfaction level
- Last evaluation score
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Department
- Salary
- Whether the employee has left

Project Objectives

1) Assess the relationships between the 10 variables and identify which variables are most significant in describing the dataset
2) Understand who the employees that have left are
3) Focus the analysis on the most valuable employees who have left
4) Develop a predictive model to assess the likelihood of an employee leaving

The report is divided as follows:

Step 1) Data quality check
Step 2) Data Visualization
Step 3) Principal Component Analysis
Step 4) Further comparative analysis on employees that left
Step 5) Prediction Model
Step 6) Conclusion

Step 1: Data quality check

First, we perform a basic statistical analysis and identify the type of each variable.

Looking at the data, we can make the following assumptions about the nature of the 10 variables:

satisfaction_level: A numeric indicator filled out by the employee, ranging from 0 to 1
last_evaluation: A numeric indicator filled in by the employee's manager, ranging from 0 to 1
number_project: An integer that indicates the number of projects the employee has worked on
average_montly_hours: The average number of hours the employee works per month (the column name is spelled this way in the dataset)
time_spend_company: An integer value indicating the years of service
Work_accident: A dummy variable indicating whether (1) or not (0) the employee had a work accident
left: A dummy variable: left (1), did not leave (0)
promotion_last_5years: A dummy variable: promoted (1), not promoted (0)
sales: A categorical variable giving the department in which the employee works (sales, technical, support, IT, product_mng, marketing, and others)
salary: A 3-level categorical variable (low, medium, high)

Data quality report

First, we verify that there are no missing values (the check itself is not reported here), and then we compute basic summary statistics for the dataset.

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923
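The report does not show the code behind this check. A minimal sketch of how such a check could look in base R follows; the data frame `hr` below is a made-up six-row stand-in (an assumption for illustration, not the real data), and only a few of the ten columns are included.

```r
# Toy stand-in for the HR data: these rows are invented for illustration only.
hr <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41),
  last_evaluation    = c(0.53, 0.86, 0.88, 0.87, 0.52, 0.50),
  left               = c(1L, 1L, 1L, 0L, 0L, 0L),
  salary             = factor(c("low", "medium", "medium", "low", "low", "high"))
)

# Count missing values per column; all zeros means there is no missing data.
missing_per_column <- colSums(is.na(hr))
print(missing_per_column)

# Per-column summary statistics (quartiles for numerics, counts for factors).
print(summary(hr))

# The mean of a 0/1 dummy is the share of ones, here the share of leavers.
attrition_rate <- mean(hr$left)
print(attrition_rate)
```

Because `left` is coded 0/1, its mean in the summary table can be read directly as the proportion of employees who left.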

From the basic statistical analysis we can see that overall satisfaction in the company is at a low-to-medium level (mean 0.61, median 0.64) and that approximately 24% of the employees have left.

This brings us to the next step: we would like to get a clearer picture of which employees have decided to leave.

Step 2: Data Visualization

We start by looking more closely at the subset composed only of the employees that have left. In particular, we analyse how these employees are distributed across the variables.

We restrict the dataset to this subset, which contains 3,571 observations:

## [1] 3571
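A subset of this kind can be obtained with a logical filter on the `left` dummy. The sketch below assumes a data frame named `hr` (a hypothetical name) and uses a handful of invented rows rather than the real data.

```r
# Toy stand-in for the HR data: these rows are invented for illustration only.
hr <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37),
  left               = c(1L, 0L, 1L, 0L, 1L)
)

# Keep only the rows where the employee has left (left == 1).
leavers <- hr[hr$left == 1, ]

# Number of observations in the subset.
n_leavers <- nrow(leavers)
print(n_leavers)
```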

We first plot the most intuitive variables, which could provide initial insights into why people leave: satisfaction level, last evaluation and average monthly hours.
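The plotting code is not included in the report. One way to draw three such histograms side by side in base R is sketched below; the `leavers` data frame is a made-up stand-in, and the bin widths are an assumption rather than the ones used for the original figures.

```r
# Toy stand-in for the subset of leavers: values invented for illustration only.
leavers <- data.frame(
  satisfaction_level   = c(0.10, 0.11, 0.38, 0.41, 0.80, 0.85),
  last_evaluation      = c(0.88, 0.77, 0.53, 0.50, 0.86, 0.92),
  average_montly_hours = c(272, 247, 157, 153, 262, 259)
)

# Three panels in one row; keep the old settings so they can be restored.
old_par <- par(mfrow = c(1, 3))

# Both 0-1 scores share the same bins (width 0.1, an assumed choice).
h_sat <- hist(leavers$satisfaction_level, breaks = seq(0, 1, by = 0.1),
              main = "Satisfaction level", xlab = "satisfaction_level")
h_eval <- hist(leavers$last_evaluation, breaks = seq(0, 1, by = 0.1),
               main = "Last evaluation", xlab = "last_evaluation")
h_hours <- hist(leavers$average_montly_hours,
                main = "Average monthly hours", xlab = "average_montly_hours")

# Restore the previous graphics settings.
par(old_par)
```

`hist()` returns the bin counts invisibly, so the objects `h_sat`, `h_eval` and `h_hours` can also be inspected numerically when the peaks need to be quantified.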

From the previous histograms we can make the following preliminary observations:

- None of the distributions appears normal; instead, we see peaks toward the ends of the histograms.
- Regarding satisfaction level, the distribution of employees who left is quite polarized: most have either low (< 0.5) or high (> 0.7) satisfaction.
- Regarding evaluation, those who leave seem to be either very good (> 0.9) or average.
- Employees who leave seem to work either a lot (> 250 hours) or below average (< 150 hours).

We then look at the distribution in the categorical variables:

From a first analysis of these last three histograms we can make the following observations:

- The frequency of work accidents by itself does not tell us much.
- Employees that left seem to have generally low salaries.
- Employees that left come mainly from the sales, support and technical departments.
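The counts behind observations like these can be tabulated with `table()`. The sketch below uses invented rows in a data frame named `hr` (a hypothetical name). It also computes a leave rate per department, which is an addition not present in the original analysis: raw counts among leavers partly reflect how large each department is, so a rate may be a useful complement.

```r
# Toy stand-in for the HR data: these rows are invented for illustration only.
hr <- data.frame(
  left          = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L),
  Work_accident = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L),
  salary        = factor(c("low", "low", "medium", "low",
                           "medium", "high", "low", "medium"),
                         levels = c("low", "medium", "high")),
  sales         = c("sales", "technical", "sales", "support",
                    "sales", "IT", "sales", "technical")
)

leavers <- hr[hr$left == 1, ]

# Counts among leavers for each categorical variable.
accident_counts <- table(leavers$Work_accident)
salary_counts   <- table(leavers$salary)
dept_counts     <- sort(table(leavers$sales), decreasing = TRUE)
print(accident_counts)
print(salary_counts)
print(dept_counts)

# Share of employees who left within each department (mean of the 0/1 dummy).
leave_rate_by_dept <- tapply(hr$left, hr$sales, mean)
print(leave_rate_by_dept)
```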

Following this preliminary analysis, we would like to focus on the employees that we consider most valuable but that are leaving. We set the criteria using the median values, choosing those who have worked for the company for more than 3 years, have a last evaluation score above 0.72, and have worked on more than 4 projects.

This group is composed of 3,556 people (23.7% of employees):

## [1] 3556
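The selection rule described above can be written as a single logical filter. In the sketch below, the thresholds are the ones quoted in the text; combining them with a logical AND and using strict inequalities is an assumption, since the original code is not shown. The data frame `hr` holds invented rows only.

```r
# Toy stand-in for the HR data: these rows are invented for illustration only.
hr <- data.frame(
  last_evaluation    = c(0.53, 0.86, 0.88, 0.87, 0.72, 0.95),
  number_project     = c(2L, 5L, 7L, 4L, 5L, 6L),
  time_spend_company = c(3L, 6L, 4L, 5L, 4L, 3L)
)

# Thresholds quoted in the text (the medians of the three variables).
min_years      <- 3
min_evaluation <- 0.72
min_projects   <- 4

# Assumed rule: all three conditions must hold, each as a strict inequality.
is_valuable <- hr$time_spend_company > min_years &
  hr$last_evaluation > min_evaluation &
  hr$number_project > min_projects
valuable <- hr[is_valuable, ]

# Size of the group and its share of all employees.
n_valuable     <- nrow(valuable)
share_valuable <- n_valuable / nrow(hr)
print(n_valuable)
print(share_valuable)
```

Switching any `>` to `>=` changes the group size noticeably when many employees sit exactly on a median, so the choice of comparison is worth stating alongside the count.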

As part of the preliminary analysis, we then perform an initial correlation analysis of the numerical variables for the following groups of employees: 1) all employees, 2) the valuable employees identified above, 3) the employees that left.
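A correlation analysis over several groups can be organised by building the groups first and applying `cor()` to each. The sketch below uses invented rows, assumes Pearson correlation (the default of `cor()`), and reuses the assumed strict-inequality rule for the valuable group; none of this is confirmed by the report.

```r
# Toy stand-in for the HR data: these rows are invented for illustration only.
hr <- data.frame(
  satisfaction_level   = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92, 0.66, 0.58),
  last_evaluation      = c(0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77, 0.85, 0.60, 0.95),
  number_project       = c(2L, 5L, 7L, 5L, 2L, 2L, 6L, 5L, 3L, 6L),
  average_montly_hours = c(157, 262, 272, 223, 159, 153, 247, 259, 180, 240),
  time_spend_company   = c(3L, 6L, 4L, 5L, 3L, 3L, 4L, 5L, 2L, 7L),
  left                 = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L)
)

# Numerical variables to correlate.
num_vars <- c("satisfaction_level", "last_evaluation", "number_project",
              "average_montly_hours", "time_spend_company")

# The three groups named in the text (valuable rule is an assumption).
is_valuable <- hr$time_spend_company > 3 &
  hr$last_evaluation > 0.72 &
  hr$number_project > 4
groups <- list(
  all      = hr,
  valuable = hr[is_valuable, ],
  left     = hr[hr$left == 1, ]
)

# One Pearson correlation matrix per group, rounded for readability.
cor_by_group <- lapply(groups, function(d) round(cor(d[, num_vars]), 2))
print(cor_by_group$all)
print(cor_by_group$valuable)
print(cor_by_group$left)
```

Keeping the matrices in one named list makes it straightforward to compare the same pair of variables across the three groups.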