-
Notifications
You must be signed in to change notification settings - Fork 0
/
Analysis_of_Employee_Attrition_and_Income.Rmd
2425 lines (1950 loc) · 109 KB
/
Analysis_of_Employee_Attrition_and_Income.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Analysis of Employee Attrition and Income"
author: "Kristin Henderson"
date: "Spring 2024"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Introduction
<br>
To address the challenge of retaining talented employees, the leadership team of Frito Lay have engaged DDS Analytics to conduct an in-depth analysis aimed at uncovering the underlying factors behind employee turnover within their organization. This study, leveraging data from 870 employees across 36 distinct variables, is designed to yield actionable insights to inform decision-making processes. By employing a combination of visualizations, statistical methodologies, and the creation of predictive models, the objective is to provide strategic guidance and practical tools to enhance future strategies and organizational policies.
<br>
<br>
**Presentation**
<br>
View the video presentation of this analysis: https://youtu.be/aY4CYfuHOf4.
<br>
<br>
**R Shiny App**
<br>
Try an interactive app of the data: https://kdhenderson.shinyapps.io/Employee_Attrition_and_Income/.
<br>
<br>
**This imports necessary libraries.**
<br>
```{r libraries, results='hide', message=FALSE}
library(RCurl)
library(tidyverse)
library(ggplot2)
library(DataExplorer)
library(Hmisc) #rcorr(), generates a correlation matrix
library(corrplot) #to plot the correlation matrix
library(gridExtra) #to arrange plots in a grid
library(RColorBrewer)
library(caret) #ConfusionMatrix()
library(e1071) #naiveBayes()
library(class) #knn
library(combinat)
library(patchwork)
library(GGally)
library(kableExtra) #kable() tables
library(pander) #pander() tables
library(dplyr) #loading this after Hmisc because of conflict in summarize? can use summarise instead
```
<br>
### Objective 1: Identify the top three factors that lead to attrition.
<br>
**I import the primary data set from the cloud as `cs2`. Then, I convert the numerical variables that are Likert scale responses to factors, creating a new data frame, `cs2_conv`.**
<br>
```{r importData, comment = ""}
# import data: CaseStudy2-data.csv from AWS S3 msdsds6306 bucket
cs2 = read.table(textConnection(getURL(
"https://s3.us-east-2.amazonaws.com/msdsds6306/CaseStudy2-data.csv"
)), sep=",", header=TRUE, stringsAsFactors = TRUE)
# save a copy of the data
# write.csv(cs2, file = "data/CaseStudy2-data.csv", row.names = FALSE)
# get a sense of the data
str(cs2)
df_summary = summary(cs2)
df_summary
# convert numerical ranking variables to factors
numVars_to_fact = c("Education", "EnvironmentSatisfaction",
"JobInvolvement", "JobLevel", "JobSatisfaction",
"PerformanceRating", "RelationshipSatisfaction",
"StockOptionLevel", "WorkLifeBalance")
cs2_conv = cs2 %>%
mutate(across(all_of(numVars_to_fact), as.factor))
# check the conversion
# str(cs2_conv)
# get the proportion of attrition
summary(cs2_conv$Attrition)
NoAttProb = summary(cs2_conv$Attrition)["No"] / sum(summary(cs2_conv$Attrition))
NoAttProb
AttProb = summary(cs2_conv$Attrition)["Yes"] / sum(summary(cs2_conv$Attrition))
AttProb
```
<br>
<br>
**For the exploratory data analysis (EDA), first I visualize the data. I loop through and plot nearly all the numerical variables in histograms and frequency polygons by density grouped by attrition category. I also loop through each categorical variable, find the proportion of each level within each attrition category and plot them as bar charts.**
<br>
```{r univariatePlots, message = FALSE}
# plot the distributions of numerical variables with respect to attrition group
# select all variables except ID, EmployeeCount (all=1), and StandardHours (all=80))
cs2_conv_subset = cs2_conv %>% select(-ID, -EmployeeCount, -StandardHours)
# check the modification
# str(cs2_conv_subset)
# filter the numerical variables
numVars = sapply(cs2_conv_subset, is.numeric)
cs2_numerical = cs2_conv_subset[, numVars] #this does not include integers that were converted to factors
# check the result
# str(cs2_numerical)
# function to calculate binwidth (by Freedman-Diaconis rule, Bin Width=2*IQR/(cuberoot(n)))
# and flexibly set a binwidth for each variable
calculate_binwidth = function(data) {
IQR = quantile(data, 0.75) - quantile(data, 0.25)
n = length(data)
bw = 2 * IQR / (n^(1/3))
return(bw)
}
# iterate over numerical variables and create plots
dist_plots = list()
for (col in names(cs2_numerical)) {
# calculate binwidth for current variable
bw = calculate_binwidth(cs2_conv[[col]])
# create a histogram with density not frequency for each variable
hist_plot = cs2_conv %>%
ggplot(aes(x = .data[[col]], y = after_stat(density), fill = Attrition)) +
geom_histogram(binwidth = bw) +
facet_wrap(~Attrition)
labs(title = paste("Histogram of", col))
# save the histogram plot
# ggsave(filename = paste0("plots/EDA/", col, "_histogram.png"), plot = hist_plot, device = "png")
# create a frequency polygon plot for each variable
freqpoly_plot = cs2_conv %>%
ggplot(aes(x = .data[[col]], y = after_stat(density), color = Attrition)) +
geom_freqpoly(binwidth = bw) +
labs(title = paste("Frequency Polygon of", col))
# save the frequency polygon plot
# ggsave(filename = paste0("plots/EDA/", col, "_freqpoly.png"), plot = freqpoly_plot, device = "png")
dist_plots[[col]] = list(hist_plot, freqpoly_plot)
}
# display plots
for (col in names(cs2_numerical)) {
print(dist_plots[[col]][[1]]) #displays histogram
print(dist_plots[[col]][[2]]) #displays frequency polygon
}
# make bar charts to visualize categorical variables with respect to attrition group
# filter categorical variables (factors)
cs2_categorical = cs2_conv %>%
select(where(is.factor) & -Over18) #Over18(all=Y)
# check the result
# str(cs2_categorical)
# iterate over categorical variables and create bar plots
bar_plots = list()
for (col in names(cs2_categorical)) {
if (col != "Attrition") { #check if the column is not "Attrition"
# calculate proportions of each category within each level of Attrition
prop_cat_data = cs2_categorical %>%
group_by(Attrition = cs2_conv$Attrition, .data[[col]]) %>%
dplyr::summarize(count = n(), .groups = "drop") %>% #this gets rid of the warning message about grouping
group_by(Attrition) %>%
mutate(proportion = count / sum(count))
bar_plot = prop_cat_data %>%
ggplot(aes(x = .data[[col]], y = proportion, fill = Attrition)) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = paste("Bar Chart of", col)) +
coord_flip()
# save the bar plot
# ggsave(filename = paste0("plots/EDA/", col, "_barplot.png"), plot = bar_plot, device = "png")
bar_plots[[col]] = bar_plot
}
}
# display plots if they exist (there is no plot for attrition)
for (col in names(cs2_categorical)) {
if (!is.null(bar_plots[[col]]))
print(bar_plots[[col]])
}
```
<br>
<br>
Many of the numerical features have right-skewed distributions, e.g. salary, and the variables that are related to years working.
<br>
With the plots above, I look for features that may have higher attrition.
<br>
*Numerical* features with higher attrition may include the following.
* age (under ~32)
* distance from home (above ~13)
* employee number (There are three peaks.)
* num companies worked (higher above ~5, a bit higher ~1)
* percent salary hike?
* total working years (higher below 8 and bump at ~40 for retirement?)
* training time?
* years at company (higher below ~2 and bump at ~40 for retirement?)
* years in current role (higher below ~2)
* years with current manager (higher below ~1)
* hourly rate? (seems flat depending on how you bin it)
* daily rate (low-mid, ~250-650)
* monthly rate? (more or less flat depending on how you bin it)
* monthly income (much higher below ~3000)
<br>
*Categorical* features with higher attrition may include the following.
* business travel (higher with frequent)
* department (higher in sales)
* education (slightly higher with lower education, 1-3)
* education field (higher in tech degree and marketing)
* environmental satisfaction (higher in 1, lower in 2-4)
* gender (slightly higher in males)
* job involvement (higher attrition 1-2, lower 3-4)
* job level (much higher in 1, lower in 2-5)
* job role (higher in sales rep, technician and scientist)
* job satisfaction (higher in 1-3, lower in 4)
* marital status (much higher in single)
* over time (much higher with yes)
* relationship satisfaction (higher in 1, lower in 3)
* stock option level (higher with 0)
* work life balance (higher in 1)
<br>
Something odd, performance rating doesn't seem to have much effect, but attrition is higher in 4.
<br>
These factors could be correlated with age and higher skill: age, education, environmental satisfaction(?), job involvement(?), job level, job satisfaction(?), marital status, monthly income, overtime, stock option level, total working years, years at company, years in current role, years with current manager.
<br>
<br>
**The next step of my EDA is to check for missing values before proceeding with further analysis.**
<br>
```{r missingValues, comment = ""}
# are there missing values
missing_count = sum(is.na(cs2_conv) | cs2_conv == "")
writeLines(sprintf("There are %d missing values in the dataset.", missing_count))
```
<br>
The data set appears to be complete.
<br>
<br>
**This function generates a report to facilitate EDA. It is good to know about and look over for ideas, though perhaps not as useful as my step-by-step analysis.**
<br>
```{r DataExplorer, eval = FALSE}
create_report(cs2_conv, y = "Attrition")
```
<br>
<br>
**For the numerical variables, I generate a matrix of correlation coefficients and plot them. There is also a function to generate a table of correlation and p-values, but only the first few rows are displayed for the sake of space. I find the matrix to be sufficient.**
<br>
```{r correlationMatrix, collapse = TRUE}
# compute correlation matrix and p-values
cor_mat = rcorr(as.matrix(cs2_numerical))
# This function makes a table of each variable combination with the correlation coefficient and p-value
# ++++++++++++++++++++++++++++
# flattenCorrMatrix
# ++++++++++++++++++++++++++++
# cormat : matrix of the correlation coefficients
# pmat : matrix of the correlation p-values
flattenCorrMatrix = function(cormat, pmat) {
ut <- upper.tri(cormat)
data.frame(
row = rownames(cormat)[row(cormat)[ut]],
column = rownames(cormat)[col(cormat)[ut]],
cor =(cormat)[ut],
p = pmat[ut]
)
}
# call the function with extracted correlation and p-values
corrTable = flattenCorrMatrix(cor_mat$r, cor_mat$P)
kable(head(corrTable), caption = "correlation coefficients") %>% kable_styling(full_width = FALSE, position = "left")
# plot the correlations using corrplot
corrplot(cor_mat$r, type = "upper", order = "hclust", insig = "blank")
# If I include the p.mat argument I get an error.
# I think because pairs of the same variable create an NA.
```
<br>
From this matrix, there is evidence of correlation between some numerical variables, including total working years with these below:
* years at company
* years in current role
* years with current manager
* (years since last promotion)
* monthly income
<br>
**Next in the EDA, I look at each numerical variable for statistically significant evidence that the mean value of that variable differs between the populations of employees that stay or go. I perform t-tests for each numerical feature with the null hypothesis that the means of the attrition groups are equal.**
<br>
```{r prelim_TTests, comment = ""}
# extract numerical column names from cs2_numerical
numerical_columns = colnames(cs2_numerical)
# initialize an empty list to store t-test results
t_test_results = list()
# initialize an empty dataframe to store summary statistics
summary_table = data.frame(variable = character(),
mean_Attrition_Yes = numeric(),
mean_Attrition_No = numeric(),
p_value = numeric(),
stringsAsFactors = FALSE)
# initialize empty lists to store variable names with p-value < 0.05 or < 0.001
significant_variables = list() #p-value < 0.05
super_sig_variables = list() # p-value < 0.001
# loop through each numerical variable
for (col in numerical_columns) {
# get data for the current column
data_col = cs2_conv[[col]]
# run a t-test
t_test_result = t.test(data_col ~ Attrition, data = cs2_conv)
# store t-test result in the list
t_test_results[[col]] = t_test_result
# extract relevant information and add to summary table
summary_table = summary_table %>%
add_row(variable = col,
mean_Attrition_Yes = mean(data_col[cs2_conv$Attrition == "Yes"], na.rm = TRUE),
mean_Attrition_No = mean(data_col[cs2_conv$Attrition == "No"], na.rm = TRUE),
p_value = t_test_result$p.value)
# check if p-value is less than 0.05
if (t_test_result$p.value < 0.05) {
significant_variables[[col]] = t_test_result
}
# check if p-value is less than 0.001
if (t_test_result$p.value < 0.001) {
super_sig_variables[[col]] = t_test_result
}
}
# print summary table
kable(summary_table, caption = "t-test results") %>% kable_styling(full_width = FALSE, position = "left")
# print list of variables with p-value < 0.05
significant_variable_names = names(significant_variables)
cat("\nVariables with p-value < 0.05:\n", paste(significant_variable_names, collapse = ", "))
# print list of variables with p-value < 0.001
super_sig_variable_names = names(super_sig_variables)
cat("\nVariables with p-value < 0.001:\n", paste(super_sig_variable_names, collapse = ", "))
# print full t-test results for variables with p-value < 0.001
cat("\nT-test results for variables with p-value < 0.001:\n")
for (col in super_sig_variable_names) {
cat("Variable:", col, "\n")
print(super_sig_variables[[col]])
cat("\n")
}
```
<br>
This provides additional evidence of numerical variables that likely influence attrition to narrow the pool of important features in the data set.
<br>
Variables with p-value < 0.05: Age, DistanceFromHome, MonthlyIncome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager
(The last 5 are highly correlated as I saw above.)
<br>
Variables with p-value < 0.001: Age, MonthlyIncome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager
<br>
<br>
**To gather statistical evidence of categorical variables associated with attrition, I perform chi-square tests for each categorical feature with the null hypothesis that the proportion of attrition is the same across all categories (feature levels).**
<br>
```{r prelim_ChiSqTests, warning = FALSE, comment = ""}
# extract categorical column names from cs2_categorical, excluding Attrition
categorical_columns = colnames(cs2_categorical)
categorical_columns = categorical_columns[categorical_columns != "Attrition"]
# initialize an empty list to store chi-square test results
chisq_test_results = list()
# initialize an empty dataframe to store summary statistics
summary_table2 = data.frame(variable = character(),
p_value = numeric(),
stringsAsFactors = FALSE)
# initialize empty lists to store variable names with p-value < 0.05 or < 0.001
significant_variables2 = list()
super_sig_variables2 = list()
# loop through each categorical variable
for (col in categorical_columns) {
# create a contingency table for the chi-square test
contingency_table = table(cs2_conv[[col]], cs2_conv$Attrition)
# print(contingency_table)
# run a chi-square test
chisq_test_result = chisq.test(contingency_table)
# store chi-square test result in the list
chisq_test_results[[col]] = chisq_test_result
# extract relevant information and add to summary table
summary_table2 = summary_table2 %>%
add_row(variable = col,
p_value = chisq_test_result$p.value)
# check if p-value is less than 0.05
if (chisq_test_result$p.value < 0.05) {
significant_variables2[[col]] = chisq_test_result
}
# check if p-value is less than 0.001
if (chisq_test_result$p.value < 0.001) {
super_sig_variables2[[col]] = chisq_test_result
}
}
# print summary table
kable(summary_table2, caption = "chi-square test results") %>% kable_styling(full_width = FALSE, position = "left")
# print list of variables with p-value < 0.05
significant_variable_names2 = names(significant_variables2)
cat("\nVariables with p-value < 0.05:\n", paste(significant_variable_names2, collapse = ", "))
# print list of variables with p-value < 0.001
super_sig_variable_names2 = names(super_sig_variables2)
cat("\nVariables with p-value < 0.001:\n", paste(super_sig_variable_names2, collapse = ", "))
# print full chi-square test results for variables with p-value < 0.001
cat("\nChi-square test results for variables with p-value < 0.001:\n")
for (col in super_sig_variable_names2) {
cat("Variable:", col, "\n")
print(super_sig_variables2[[col]])
cat("\n")
}
```
<br>
Although, the assumptions of a chi-square test may be violated here, a deeper knowledge of the test would be helpful, and use of alternative tests might be more appropriate, this at least provides additional evidence of categorical variables that likely influence attrition, facilitating a narrower pool of variables to consider.
<br>
Variables with p-value < 0.05: BusinessTravel, Department, EnvironmentSatisfaction, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, OverTime, StockOptionLevel, WorkLifeBalance
<br>
Variables with p-value < 0.001: JobInvolvement, JobLevel, JobRole, MaritalStatus, OverTime, StockOptionLevel
<br>
<br>
**To look at co-variation among numerical variables that seem likely to influence attrition, I make scatter plots of all combinations of those with highly significant t-test results.**
<br>
```{r covariation_num, message = FALSE}
# look at co-variation among numerical variables
# function to create scatter plots for each pair of variables
make_scatterplot = function(var1, var2) {
cs2_conv %>%
ggplot(aes(x = .data[[var1]], y = .data[[var2]], color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") +
geom_smooth() +
labs(title = paste(var1, "vs", var2))
}
# loop through each variable
for (i in 1:(length(super_sig_variable_names) - 1)) {
var1 = super_sig_variable_names[i]
# plot var1 with every other variable that comes after it in the list
for (j in (i + 1):length(super_sig_variable_names)) {
var2 = super_sig_variable_names[j]
scatter = make_scatterplot(var1, var2)
# display and save the individual plots
print(scatter) # using print() so it's easier to find where I call the plot
filename = paste0("plots/EDA/scatter_", var1, "_vs_", var2, ".png")
# ggsave(filename, plot = scatter, device = "png")
}
}
```
<br>
<br>
**To do the same for combinations of numerical and categorical variables, I make boxplots of all combinations of those with highly significant t-test and chi-square test results.**
<br>
```{r covariation_c_n}
# look at co-variation between numerical and categorical variables
# function to create boxplots for each pair of variables
make_boxplot = function(catvar, numvar) {
cs2_conv %>% group_by(Attrition) %>%
ggplot(aes(x = .data[[catvar]], y = .data[[numvar]], fill = Attrition)) +
geom_boxplot() +
coord_flip() +
labs(title = paste(catvar, "vs", numvar))
}
# loop through each variable
for (i in 1:(length(super_sig_variable_names2))) {
catvar = super_sig_variable_names2[i]
# plot catvar with every numvar
for (j in 1:length(super_sig_variable_names)) {
numvar = super_sig_variable_names[j]
boxplt = make_boxplot(catvar, numvar)
# display and save the individual plots
# I tried to display them in a grid in the interest of space, but they are just too small.
print(boxplt)
filename = paste0("plots/EDA/boxplt_", catvar, "_vs_", numvar, ".png")
# ggsave(filename, plot = boxplt, device = "png")
}
}
```
<br>
<br>
**To do the same for combinations of categorical variables, I make bar charts of all combinations of those with highly significant chi-square test results.**
<br>
```{r covariation_cat, message = FALSE}
# look at co-variation among categorical variables
# It probably is easier to see relationships if I were to calculate proportions a different way to normalize the data.
# Currently, proportions are calculated like this. For each attrition group, and each level of variable 1, the count of each level of variable 2 (the fill variable) is divided by the sum of the total count for all levels of that variable (within the attrition and variable 1 level).
make_barchart = function(var1, var2) {
cs2_conv %>%
group_by(Attrition, .data[[var1]], .data[[var2]]) %>%
dplyr::summarize(n = n(), .groups = 'drop') %>% #count occurrences within each group
group_by(Attrition, .data[[var1]]) %>% # group by Attrition and var1
mutate(prop = n / sum(n)) %>% #calculate proportion within each group
ggplot(aes(y = .data[[var1]], x = prop, fill = .data[[var2]])) + #use prop as fill
geom_bar(stat = "identity", position = position_dodge2(preserve = "single")) +
facet_wrap(~Attrition, labeller = labeller(Attrition = c("No" = "No Attrition", "Yes" = "Yes Attrition"))) +
labs(title = paste(var1, "vs", var2),
y = var1,
x = paste0("Proportion of ", var2, " within Attrition and ", var1, " level"),
fill = var2) +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set1")
}
# loop through each variable
for (i in 1:(length(super_sig_variable_names2) - 1)) {
var1 = super_sig_variable_names2[i]
for (j in (i + 1):length(super_sig_variable_names2)) {
var2 = super_sig_variable_names2[j]
barplt = make_barchart(var1, var2)
# display and save the individual plots
print(barplt)
filename = paste0("plots/EDA/barplt_", var1, "_vs_", var2, ".png")
# ggsave(filename, plot = barplt, device = "png")
}
}
```
<br>
<br>
**Observations from co-variation plots**
<br>
**Job level**
* Trends:
- There is a positive trend between job level and several other variables: age, monthly income, total working years, years at company, years in current role and years with current manager.
* Differences:
- Job level 4: age and monthly income look like they may contribute to attrition specifically in this job level
- Job level 5: years at company and years in current role may contribute to attrition in this job level. (This suggests perhaps retirement is a factor, although total working years doesn’t show the same variation.)
**Marital Status**
* Trends:
- Singles that leave are lower in median age, have fewer total working years, years at company and years in current role than other groups.
- Otherwise, age is similar in marital status.
- All marital status levels have higher attrition with overtime, though this trend seems more pronounced in singles.
* Observations:
- Singles fall entirely within stock option level 0.
**Overtime and Stock Option**
* Trends:
- Across age, monthly income, total working years, years at company, years in current role and years with current manager, trends between attrition categories appear similar in overtime and stock option levels.
- Stock option 2 may have a different trend with median age and years in current role being higher among those that leave.
* Observations:
- There are interesting trends in overtime and stock options between job roles.
**Monthly income**
* Trends:
- Monthly income is positively associated with total working years with little difference in attrition groups.
- There a positive association between monthly income and age.
* Differences:
- There seems to be higher attrition in older employees with lower incomes.
- Higher income earners tend to leave if they have been with the company > 10 years or in their current role or with their current manager > ~7 years.
Some additional ideas:
* It might be a good idea to derive a variable, percent of working years at a company from total working years and years at company. This may offset the confound of age.
<br>
<br>
**As there is high correlation between total working years, years at company, years in current role and years with current manager, I want to see if deriving one variable from two of these, might help offset the correlation with age and extract possibly more meaningful information with the use of fewer variables.**
<br>
```{r derivedVariables, message = FALSE, comment = ""}
cs2_conv2 = cs2_conv %>%
mutate(
PercentWrkYrsAtCompany = ifelse(TotalWorkingYears == 0, 0,
round(YearsAtCompany / TotalWorkingYears * 100, 2)),
PercentYrs_wManager = ifelse(TotalWorkingYears == 0, 0,
round(YearsWithCurrManager / TotalWorkingYears * 100, 2)),
PercentYrs_inRole = ifelse(TotalWorkingYears == 0, 0,
round(YearsInCurrentRole / TotalWorkingYears * 100, 2)),
PercentYrs_inRoleAtComp = ifelse(YearsAtCompany == 0, 0,
round(YearsInCurrentRole / YearsAtCompany * 100, 2))
)
# str(cs2_conv2)
# summary(cs2_conv2)
# plot these derived variables with other variables that seem relevant and alone to see the distributions
cs2_conv2 %>% ggplot(aes(x = Age, y = PercentWrkYrsAtCompany, color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = MonthlyIncome, y = PercentWrkYrsAtCompany, color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = Age, y = PercentYrs_inRoleAtComp, color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = MonthlyIncome, y = PercentYrs_inRoleAtComp, color = Attrition)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = MaritalStatus, y = PercentWrkYrsAtCompany, fill = Attrition)) +
geom_boxplot(position = "dodge")
cs2_conv2 %>% ggplot(aes(x = JobLevel, y = PercentWrkYrsAtCompany, fill = Attrition)) +
geom_boxplot(position = "dodge")
cs2_conv2 %>% ggplot(aes(x = MaritalStatus, y = PercentYrs_inRoleAtComp, fill = Attrition)) +
geom_boxplot(position = "dodge")
cs2_conv2 %>% ggplot(aes(x = JobLevel, y = PercentYrs_inRoleAtComp, fill = Attrition)) +
geom_boxplot(position = "dodge")
cs2_conv2 %>% ggplot(aes(x = PercentWrkYrsAtCompany, after_stat(density), fill = Attrition)) +
geom_histogram(binwidth = 10) + facet_wrap(~Attrition)
cs2_conv2 %>% ggplot(aes(x = PercentYrs_inRoleAtComp, y = after_stat(density), fill = Attrition)) +
geom_histogram(binwidth = 10) + facet_wrap(~Attrition)
# run t-tests to check for possible differences
t.test(PercentWrkYrsAtCompany ~ Attrition, data = cs2_conv2)
t.test(PercentYrs_wManager ~ Attrition, data = cs2_conv2)
t.test(PercentYrs_inRole ~ Attrition, data = cs2_conv2)
t.test(PercentYrs_inRoleAtComp ~ Attrition, data = cs2_conv2)
```
<br>
Based on the plots and t-test results, percent years in role at company (derived from years in role divided by years at company), seems like the most potentially useful of these derived variables.
<br>
<br>
**I further explore the usefulness of the percent years in role at company by plotting combinations of percent years in role at company, monthly income, age, and job level.**
<br>
```{r message = FALSE, collapse = TRUE}
cs2_conv2 %>% ggplot(aes(x = PercentYrs_inRoleAtComp, y = MonthlyIncome, color = JobLevel)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = PercentYrs_inRoleAtComp, y = Age, color = JobLevel)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
cs2_conv2 %>% ggplot(aes(x = MonthlyIncome, y = Age, color = JobLevel)) +
geom_point(alpha = 0.5, position = "jitter") + geom_smooth()
```
<br>
Monthly income is a good predictor of job level (or vice versa) no matter the percent years in role at the company. Age is not as good (though there is some predictability). There is a much wider distribution of ages the lower the income, particularly under 5000.
<br>
Job level seems like a powerful predictor.
<br>
<br>
**So far, the visual and statistical evidence and my best guess for top features leading to attrition are Job Level, Marital Status, Monthly Income and Overtime. To confirm this and/or narrow my list to 3, I want to use the list of 12 features that seem like they have the greatest differences between attrition categories. I generate all the unique combinations of 3 variables. Then using the loop in this code, I use each combination to build a Naive Bayes model to predict attrition, and rank them by how well each combination performed.**
<br>
```{r modelTop3}
modelIdx = c(2,15,16,17,19,20,24,29,30,33,34,36) #indexes of predicted top features
# these are the indexes for Age, JobInvolvement, JobLevel, JobRole, MaritalStatus, MonthlyIncome, OverTime, StockOptionLevel, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager
# initialize an empty data frame to store results
results = data.frame(Index1 = integer(), Index2 = integer(), Index3 = integer(), Sensitivity = numeric(), Specificity = numeric(), Accuracy = numeric(), stringsAsFactors = FALSE)
# number of different splits
iters = 50
# 70-30 train/test split
splitPerc = 0.7
# generate all combinations of 3 indexes
index_combinations = combinat::combn(modelIdx, 3)
# loop through each combination of 3 indexes
for (i in 1:ncol(index_combinations)) {
index_set = index_combinations[, i]
# create holders for the sensitivity, specificity, and accuracy for each iteration
masterSens = numeric(iters)
masterSpec = numeric(iters)
masterAcc = numeric(iters)
# iterate over the specified number of iterations
for (j in 1:iters) {
set.seed(j)
# sample 70% for training
trainIdx = sample(1:dim(cs2_conv2)[1], round(splitPerc * dim(cs2_conv2)[1]))
# select the rows for training
AttritionTrn = cs2_conv2[trainIdx, ]
# select the remaining rows for testing
AttritionTst = cs2_conv2[-trainIdx, ]
# temporarily set the current indexes for modeling
temp_modelIdx = index_set
# train the model
modelNB = naiveBayes(AttritionTrn[, temp_modelIdx], AttritionTrn$Attrition, laplace = 1)
# predict using the model
preds = predict(modelNB, AttritionTst[, temp_modelIdx])
# make a confusion matrix
CM = confusionMatrix(table(preds, AttritionTst$Attrition))
# store sensitivity, specificity, and accuracy
masterSens[j] = CM$byClass["Sensitivity"]
masterSpec[j] = CM$byClass["Specificity"]
masterAcc[j] = CM$overall["Accuracy"]
}
# calculate average sensitivity, specificity, and accuracy
avg_sensitivity = mean(masterSens)
avg_specificity = mean(masterSpec)
avg_accuracy = mean(masterAcc)
# store the results including the three indexes
results = rbind(results, list(Index1 = index_set[1], Index2 = index_set[2], Index3 = index_set[3], Sensitivity = avg_sensitivity, Specificity = avg_specificity, Accuracy = avg_accuracy))
}
# output the results
results = results %>% arrange(-Specificity)
kable(head(results, 10), caption = "3 variable combos to predict attrition") %>% kable_styling(full_width = FALSE, position = "left")
```
<br>
The top three features that predict attrition are within the group that I initially highlighted. They are Job Level, Monthly Income and Overtime. Stock Option Level and Marital Status are also strong predictors.
<br>
Obtaining marital status data may pose challenges due to potential privacy considerations, unlike the other predictors which are likely readily available through employment records.
<br>
<br>
**This provides an overview of the first step of my analysis process focusing on visual evidence.**
I started the analysis by plotting each variable separated by attrition group to compare their populations. Visually, the many features on this list seemed to show differences in these populations, but the four starred variables seemed particularly important.
<br>
*Important features predicted by visual evidence:*
Age
Business Travel
Distance from home
Department
Education Field
Environmental Satisfaction
Job Involvement
Job Level *
Job Role
Job Satisfaction
Marital Status *
Monthly Income *
Number Companies Worked For
Overtime *
Relationship Satisfaction
Stock Option Level
Total Working Years
Work Life Balance
Years At Company
Years In Current Role
Years With Current Manager
<br>
**To illustrate an example of positive and negative visual evidence, I plot attrition by job level and gender.**
<br>
```{r top3plots_1, message = FALSE, collapse = TRUE}
# plot positive predictors example: Job Level for presentation
# also plot negative predictor: Gender
# I revert to using my original converted dataset `cs2_conv`.
# plot positive predictor example: job level and attrition
# calculation of proportion of employees: employees in job level attrition yes / total employees in job level, to normalize each job level
jobLevel_bar = cs2_conv %>%
group_by(JobLevel, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes") %>%
ggplot(aes(y = JobLevel, x = prop, fill = JobLevel)) +
geom_bar(stat = "identity") +
labs(y = "Job Level", x = "Proportion of Attrition",
fill = "Job Level",
title = "Proportion of Attrition within Each Job Level",
caption = "Normalized by Total Employees per Job Level") +
theme_bw() +
scale_fill_brewer(palette = "Set1")
print(jobLevel_bar)
# ggsave(jobLevel_bar, filename = "plots/top3_jobLevel_bar.png")
# plot negative predictor example: gender and attrition
genderBar = cs2_conv %>%
group_by(Gender, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes") %>%
ggplot(aes(y = Gender, x = prop)) +
geom_bar(stat = "identity", fill = "#377EB8") + #fill color = Attrition group blue
labs(y = "Gender", x = "Proportion of Attrition",
fill = "Gender",
title = "Proportion of Attrition within Each Gender",
caption = "Normalized by Total Employees per Gender") +
theme_bw() +
theme(text = element_text(size = 12))
genderBar
# ggsave(genderBar, filename = "plots/top3negExample_genderBar.png")
```
<br>
Job level 1 has a much higher percent attrition than any other job level. Over a quarter of the employees in job level 1 leave the company. This is at least 13% higher than any other job level. Also of note, only 5% of employees in job level 4 leave.
<br>
On the other hand, gender doesn't contribute to attrition. There is very little difference in attrition rates between males and females (17% and 15%).
<br>
**This provides an overview of the other steps in my analysis process focusing on statistical evidence and modeling.**
I tested for statistically significant differences in the attrition groups for each variable that was highlighted by visual evidence. For the numerical values, I used a t-test to predict the likelihood that the means of attrition groups are equal. For the categorical variables, I used a chi-square test to predict the likelihood that proportion of attrition is the same across categories of a variable.
<br>
*These variables, including the four starred previously, had significant differences below the alpha 0.05 level.*
Age
Job Involvement
Job Level *
Job Role
Marital Status *
Monthly Income *
Overtime *
Stock Option Level
Total Working Years
Years At Company
Years In Current Role
Years With Current Manager
<br>
From this list, I identified a subset of three variables that achieved the highest specificity when used in a predictive model for attrition. In the context of the model, specificity refers to the percentage of instances where attrition is correctly predicted.
<br>
The subset of three top factors that lead to attrition are Job Level, Monthly Income and Overtime.
<br>
```{r top3plots_2, message = FALSE, collapse = TRUE}
# plot the other positive predictors for presentation: Monthly Income, Job Level and Overtime
# plot monthly income and attrition with a boxplot and two histograms
incomeBox = cs2_conv %>%
ggplot(aes(x = MonthlyIncome, y = Attrition, fill = Attrition)) +
geom_boxplot() +
ylab("Attrition") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_blank()) +
scale_fill_brewer(palette = "Set1")
yAtt_hist = cs2_conv %>% filter(Attrition == "Yes") %>%
ggplot(aes(x = MonthlyIncome)) +
geom_histogram(binwidth = 1000, fill = "#377EB8") +
labs(title = "Distribution of Monthly Income by Employee Attrition",
y = "Yes") +
theme_bw() +
theme(legend.position = "none",
axis.title.x = element_blank())
nAtt_hist = cs2_conv %>% filter(Attrition == "No") %>%
ggplot(aes(x = MonthlyIncome)) +
geom_histogram(binwidth = 1000, fill = "#E41A1C") +
labs(#title = "Employee Attrition and Monthly Income",
y = "No") +
theme_bw() +
theme(legend.position = "none")
incomePlt = yAtt_hist / incomeBox / nAtt_hist
print(incomePlt)
# ggsave(incomePlt, filename = "plots/top3_incomeDist.png")
# make co-variation plots split by attrition groups
# plot positive predictor: job level and monthly income
income_jobLevel_box = ggplot(cs2_conv, aes(x = JobLevel, y = MonthlyIncome, fill = Attrition)) +
geom_boxplot() +
labs(x = "Job Level", y = "Monthly Income", fill = "Attrition") +
ggtitle("Monthly Income by Job Level and Attrition") +
theme_bw() +
scale_fill_brewer(palette = "Set1")
income_jobLevel_box
# ggsave(income_jobLevel_box, filename = "plots/top3_income_jobLevel_box.png")
# plot positive predictor: job level and attrition, with NO overtime
# employees NOT working overtime
jobLevel_OTno_bar = cs2_conv %>%
group_by(OverTime, JobLevel, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes" & OverTime == "No") %>%
ggplot(aes(y = JobLevel, x = prop, fill = JobLevel)) +
geom_bar(stat = "identity", position = "dodge") +
labs(y = "Job Level", x = "Proportion of Attrition",
fill = "Job Level",
subtitle = "Proportion of Attrition within Each Job Level",
title = "Employees with No Overtime Hours",
caption = "Normalized by Total Employees per Job Level") +
xlim(0.00, 0.55) +
theme_bw() +
theme(text = element_text(size = 12)) +
scale_fill_brewer(palette = "Set1")
print(jobLevel_OTno_bar)
# ggsave(jobLevel_OTno_bar, filename = "plots/top3_jobLevel_OTno_bar.png")
# plot positive predictor: job level and attrition, with overtime
# employees working at least an hour of overtime
jobLevel_OTyes_bar = cs2_conv %>%
group_by(OverTime, JobLevel, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes" & OverTime == "Yes") %>%
ggplot(aes(y = JobLevel, x = prop, fill = JobLevel)) +
geom_bar(stat = "identity", position = "dodge") +
labs(y = "Job Level", x = "Proportion of Attrition",
fill = "Job Level",
subtitle = "Proportion of Attrition within Each Job Level",
title = "Employees Working Overtime",
caption = "Normalized by Total Employees per Job Level") +
xlim(0.00, 0.55) +
theme_bw() +
theme(text = element_text(size = 12)) +
scale_fill_brewer(palette = "Set1")
print(jobLevel_OTyes_bar)
# ggsave(jobLevel_OTyes_bar, filename = "plots/top3_jobLevel_OTyes_bar.png")
# find 25th, 50th and 75th percentile of monthly incomes for each attrition group and job level
# is there a stay/go cut off for job level 4
mid50_income = cs2_conv %>%
group_by(Attrition, JobLevel) %>%
dplyr::summarise(q25 = quantile(MonthlyIncome, 0.25),
median = median(MonthlyIncome),
q75 = quantile(MonthlyIncome, 0.75))
kable(mid50_income, caption = "monthly income q25, median, q75 for each job level and attrition group") %>% kable_styling(full_width = FALSE, position = "left")
```
<br>
The plot of distributions of monthly income reveals a smaller income range and median income among employees that leave (in blue). The high-income outliers are likely due to factors like retirement.
<br>
The boxplot of job levels and income illustrates the positive relationship between these variables. It reveals a large income gap between employees that stay and those that leave in job level 4. Job levels 1 and 3 have a small trend of attrition with lower median incomes.
<br>
Attrition rates are low for all job levels in the graph of employees that do not work overtime, under 15% for job level 1 and under 10% for the others. However, the story changes for employees working at least an hour of overtime. Over 50% of employees in job level 1 working overtime choose to leave. This pattern holds true across all job levels except for job level 4, with attrition rates more than doubling among overtime workers.
<br>
```{r top3plots_appendix, message = FALSE, collapse = TRUE}
# extra plots for the appendix
# plot positive predictor example: overtime and attrition
# calculation of proportion of employees: employees in overtime group & attrition yes / total employees in overtime group, to normalize each overtime group
overtime_bar = cs2_conv %>%
group_by(OverTime, Attrition) %>%
dplyr::summarize(n = n()) %>%
mutate(prop = n / sum(n)) %>%
filter(Attrition == "Yes") %>%
ggplot(aes(y = OverTime, x = prop)) +
geom_bar(stat = "identity", fill = "#377EB8") +
labs(y = "Overtime", x = "Proportion of Attrition",
fill = "Overtime",
title = "Proportion of Attrition for Employees With and Without Overtime",
caption = "Normalized by Total Employees per OT Group") +
theme_bw() +
scale_fill_brewer(palette = "Set1")
print(overtime_bar)
# ggsave(overtime_bar, filename = "plots/appendix/top3_overtime_bar.png")
# plot positive predictor: monthly income and overtime
income_OT_box = ggplot(cs2_conv, aes(x = OverTime, y = MonthlyIncome, fill = Attrition)) +
geom_boxplot() +
labs(x = "Overtime", y = "Monthly Income", fill = "Attrition") +
ggtitle("Monthly Income by Overtime and Attrition") +
theme_bw() +
scale_fill_brewer(palette = "Set1")
print(income_OT_box)
# ggsave(income_OT_box, filename = "plots/appendix/top3_income_OT_box.png")
```