Our dataset includes 14,999 observations, with each row representing a single employee.

The dataset contains the following 10 variables for each row:

- Employee satisfaction level

- Last evaluation score

- Number of projects

- Average monthly hours

- Time spent at the company

- Whether they have had a work accident

- Whether they have had a promotion in the last 5 years

- Department

- Salary

- Whether the employee has left

1) Assess the relationships between the 10 variables and identify which variables are significant in describing the dataset

2) Understand who the employees that have left are

3) Focus the analysis on the most valuable employees who have left

4) Develop a predictive model to assess the likelihood of an employee leaving

Step 1) Data quality check

Step 2) Basic Data Visualisation

Step 3) Principal Component Analysis

Step 4) Further comparative analysis on employees that left

Step 5) Prediction Model

Step 6) Conclusion

First, we will perform basic statistical analysis to understand the type of each variable.

Looking at the data, we can make the following assumptions about the nature of the 10 variables:

satisfaction_level: A numeric indicator filled out by the employee ranging from 0 to 1

last_evaluation: A numeric indicator filled in by the employee's manager ranging from 0 to 1

number_project: An integer that indicates the number of projects the employee has worked on

average_montly_hours (spelled this way in the dataset): The average number of hours the employee works per month

time_spend_company: An integer value indicating the years of service

Work_accident: A dummy variable indicating whether (1) or not (0) the employee had a work accident

left: A dummy variable: left (1), did not leave (0)

promotion_last_5years: A dummy variable: promoted (1), not promoted (0)

sales: A categorical variable giving the department in which the employee works (sales, technical, support, IT, product_mng, marketing, other)

salary: A 3-level categorical variable (low, medium, high)

First of all, we check that there are no missing data using the function is.na(myData) (this check is not reported in the code), and then we compute basic summary statistics for the dataset.

```
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0
##
##  time_spend_company Work_accident         left
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000
##  Median : 3.000     Median :0.0000   Median :0.0000
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000
##
##  promotion_last_5years         sales         salary
##  Min.   :0.00000       sales      :4140   high  :1237
##  1st Qu.:0.00000       technical  :2720   low   :7316
##  Median :0.00000       support    :2229   medium:6446
##  Mean   :0.02127       IT         :1227
##  3rd Qu.:0.00000       product_mng: 902
##  Max.   :1.00000       marketing  : 858
##                        (Other)    :2923
```
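The missing-data check and summary described above could be sketched roughly as follows. Since the original data file is not shown here, the `myData` below is a hypothetical four-row stand-in that only borrows the column names from the summary output; the values are made up for illustration.

```r
# Hypothetical stand-in for myData: made-up values, real column names.
myData <- data.frame(
  satisfaction_level    = c(0.38, 0.80, 0.11, 0.72),
  last_evaluation       = c(0.53, 0.86, 0.88, 0.87),
  number_project        = c(2L, 5L, 7L, 5L),
  average_montly_hours  = c(157L, 262L, 272L, 223L),
  time_spend_company    = c(3L, 6L, 4L, 5L),
  Work_accident         = c(0L, 0L, 0L, 0L),
  left                  = c(1L, 1L, 1L, 0L),
  promotion_last_5years = c(0L, 0L, 0L, 0L),
  sales                 = factor(c("sales", "sales", "technical", "IT")),
  salary                = factor(c("low", "medium", "medium", "low"))
)

# Count missing values overall and per column; zeros everywhere mean
# there is nothing to impute or drop.
n_missing <- sum(is.na(myData))
missing_by_column <- colSums(is.na(myData))
print(n_missing)
print(missing_by_column)

# Basic summary statistics for every column (quartiles for numeric
# columns, level counts for factors).
print(summary(myData))
```

Using `colSums(is.na(...))` rather than a single total makes it easier to see which column is affected if a later version of the data does contain gaps.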

From the basic summary statistics we can see that overall employee satisfaction is moderate, with a mean of about 0.61 (median 0.64), and that approximately 24% of the employees have left.

This brings us to the next step: we would like to get a better picture of who the employees that decided to leave are.

We will start by looking more closely at a subset composed only of the employees that have left. In particular, we will analyse the distribution of these employees across the variables.

We subset the data to the employees who left, which gives 3,571 observations:

`## [1] 3571`
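A minimal sketch of this subsetting step, assuming the leavers are selected with `left == 1`. The data frame and the name `leavers` are hypothetical stand-ins, since the original code is not shown.

```r
# Hypothetical stand-in data: five employees, three of whom left.
myData <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.66),
  left               = c(1L, 1L, 1L, 0L, 0L)
)

# Keep only the rows for employees who left.
leavers <- subset(myData, left == 1)

# Number of leavers and their share of the whole workforce.
n_leavers <- nrow(leavers)
share_left <- mean(myData$left)
print(n_leavers)
print(share_left)
```

On the full dataset the same check ties the two reported figures together: a mean of 0.2381 for `left` across 14,999 rows corresponds to roughly 3,571 leavers.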

We first plot the most intuitive variables, which could provide initial insights into why people leave: Satisfaction Level, Last Evaluation and Average monthly hours.
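The three histograms could be produced along these lines with base R graphics. This is only a sketch: the `leavers` data frame below is randomly generated for illustration and does not reproduce the real distributions, and the bin settings are assumptions.

```r
# Synthetic stand-in for the subset of employees who left.
set.seed(42)
leavers <- data.frame(
  satisfaction_level   = runif(200, min = 0.09, max = 1),
  last_evaluation      = runif(200, min = 0.36, max = 1),
  average_montly_hours = sample(96:310, 200, replace = TRUE)
)

# Draw the three histograms side by side.
old_par <- par(mfrow = c(1, 3))
h_sat <- hist(leavers$satisfaction_level, breaks = seq(0, 1, by = 0.1),
              main = "Satisfaction level", xlab = "satisfaction_level")
h_eval <- hist(leavers$last_evaluation, breaks = seq(0, 1, by = 0.1),
               main = "Last evaluation", xlab = "last_evaluation")
h_hours <- hist(leavers$average_montly_hours, breaks = 10,
                main = "Average monthly hours", xlab = "average_montly_hours")
par(old_par)  # restore the previous plotting layout
```

`hist()` also returns the bin counts invisibly, so objects such as `h_sat$counts` can be inspected directly when the shape of a distribution needs to be checked numerically.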

From the previous histograms we can make the following preliminary observations:

- None of the distributions looks normal; instead, we see peaks towards the ends of the histograms.

- Regarding satisfaction level, the distribution of employees that are leaving is quite polarized: employees who left mostly have either low (< 0.5) or high (> 0.7) satisfaction levels.

- Regarding employee evaluation, those that leave seem to be either really good (> 0.9) or average.

- Employees that leave seem to work either a lot (> 250 hours) or below average (< 150 hours).

We then look at the distribution in the categorical variables:

From a first analysis of these last three histograms we can make the following observations:

- The frequency of work accidents by itself does not tell us much.

- Employees that left seem to have generally low salaries.

- Employees that left come mainly from the sales, support and technical departments.

Following the preliminary analysis, we would like to focus our analysis on the employees that we consider most valuable but that are leaving. We set the criteria by looking at the median values, choosing employees who have worked for the company for more than 3 years, have a good last evaluation score (> 0.72), and have worked on more than 4 projects.

This group is composed of 3,556 people (23.7% of employees):

`## [1] 3556`
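The selection criteria above might be expressed as follows. This is a sketch under stated assumptions: the three conditions are combined with a logical AND, no filter on `left` is applied, and the thresholds are the medians from the summary output (3 years, 0.72, 4 projects). The source does not show the actual filter, and the data and the name `valuable` are hypothetical.

```r
# Hypothetical stand-in data with made-up values.
myData <- data.frame(
  last_evaluation    = c(0.53, 0.86, 0.88, 0.87, 0.95, 0.72),
  number_project     = c(2L, 5L, 7L, 5L, 4L, 6L),
  time_spend_company = c(3L, 6L, 4L, 5L, 5L, 4L),
  left               = c(1L, 1L, 1L, 0L, 1L, 0L)
)

# Assumption: all three criteria must hold at once, each strictly above
# the median reported in the summary statistics.
valuable <- subset(
  myData,
  time_spend_company > 3 & last_evaluation > 0.72 & number_project > 4
)

# Size of the group and its share of all employees, in percent.
n_valuable <- nrow(valuable)
pct_valuable <- round(100 * n_valuable / nrow(myData), 1)
print(n_valuable)
print(pct_valuable)
```

Because the comparisons are strict, an employee sitting exactly at a median (for example a last evaluation of 0.72) is excluded; switching to `>=` would widen the group.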

As part of the preliminary analysis, we then perform an initial correlation analysis of the numerical variables for the following groups of employees: 1) all the employees, 2) the valuable employees identified before, and 3) the employees that left.
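One way to compute the three correlation matrices is sketched below, using Pearson correlation via base R's `cor()`. The data are randomly generated, so the resulting coefficients are meaningless. The helper `cor_matrix`, the choice of five numeric columns, and the definition of the valuable group are assumptions made for illustration.

```r
# Synthetic stand-in data; real column names, random values.
set.seed(1)
n <- 200
myData <- data.frame(
  satisfaction_level   = runif(n, min = 0.09, max = 1),
  last_evaluation      = runif(n, min = 0.36, max = 1),
  number_project       = sample(2:7, n, replace = TRUE),
  average_montly_hours = sample(96:310, n, replace = TRUE),
  time_spend_company   = sample(2:10, n, replace = TRUE),
  left                 = rbinom(n, size = 1, prob = 0.24)
)

numeric_cols <- c("satisfaction_level", "last_evaluation", "number_project",
                  "average_montly_hours", "time_spend_company")

# Hypothetical helper: Pearson correlation matrix of the numeric columns,
# rounded to two decimals for readability.
cor_matrix <- function(df) round(cor(df[, numeric_cols]), 2)

# 1) all employees, 2) the valuable group, 3) employees that left.
cor_all <- cor_matrix(myData)
cor_valuable <- cor_matrix(subset(
  myData,
  time_spend_company > 3 & last_evaluation > 0.72 & number_project > 4
))
cor_leavers <- cor_matrix(subset(myData, left == 1))

print(cor_all)
```

The `left` column is left out of the matrices because it is constant within the leavers group, which would make its correlations undefined there.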