Predictive Analytics Tutorial: Part 2


Welcome Back!

This is part 2 of this 4-part tutorial series on predictive analytics.  The goal of this tutorial is to provide an in-depth example of using predictive analytic techniques that can be replicated to solve your business use case.  In my previous blog post, I covered the first two phases of performing predictive modeling: Define and Set Up.  In this tutorial, I will be covering the third phase of predictive modeling: Exploratory Data Analysis or EDA.


By the end of the last tutorial, we had understood the problem, set up our Data Science Experience (DSX) environment, imported the data and very briefly explored it.  To continue, you can follow along with the steps below by copying and pasting the code into the Data Science Experience notebook cells (as shown in the previous blog) and hitting the "run cell" button.  Or you can download the full code from my GitHub repository.


Getting Familiar


Log back into Data Science Experience - Find your project and open the notebook.  Click the little pencil button icon in the upper right hand side of the notebook to edit.  This will allow you to modify and execute notebook cells.


Re-run all of your cells from the last tutorial - At a minimum, every time you open up your R environment (local or DSX), you need to reload your data and reload all required packages.  For the purposes of this tutorial, please also rename your data frame as you did in the last tutorial.  Remember, to run your code, select the cell and hit the "run cell" button that looks like a right-facing arrow.


*Tip: In DSX there is a shortcut to allow you to rerun all cells again, or all cells above or below the selected cell.  This is an awesome time saver.  


Baseline our goals - It's important to remind ourselves of our goals from part 1.  In short, we are trying to create a predictive model to estimate "average_monthly_hours".  This will allow us to estimate how many hours our current employee base is likely to work. 

Install and Load Packages - R packages contain a grouping of R data functions and code that can be used to perform your analysis. We need to install and load them in DSX so that we can call upon them later.  As per the previous tutorial, enter the following code into a new cell, highlight the cell and hit the "run cell"  button. 

#install packages - do this one time
install.packages(c("data.table", "corrplot", "ggplot2", "gcookbook", "hexbin"))

You may get a warning note that the package was installed into a particular directory as 'lib' is unspecified.  This is completely fine.  It still installs successfully and you will not have any problems loading it in the next step.  After you successfully install all the packages, you may want to comment out the install code by putting a "#" in front of each line.  This will let you rerun all code at a later date without having to install all packages again.  As done previously, enter the following code into a new cell, highlight the cell and hit the "run cell" button.

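# Load the relevant libraries - do this every time
library(data.table) # fread and setnames
library(corrplot)   # correlation plots
library(ggplot2)    # qplot and ggplot
library(gcookbook)

hr <- fread('')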


Exploratory Data Analysis

Understand the correlation between columns - When exploring a new data set with the purpose of predictive modelling it can be beneficial to create a correlation table.  Examining correlation allows us to understand how the variables (or columns) are related.  The correlation value for two columns can be between -1 and +1.  The closer the value is to either -1 or +1, the stronger the correlation or association between the columns, while a value near 0 means there is little linear relationship.  If the correlation value is positive, it means that when one column gets bigger the other column tends to get bigger.  When the correlation value is negative, it means that when one column gets bigger the other column tends to get smaller.
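To make the sign convention concrete, here is a minimal sketch on made-up vectors.  The names x, yUp and yDown are purely illustrative and are not part of the HR data.

```r
# Toy vectors, invented purely for illustration
x     <- c(1, 2, 3, 4, 5)
yUp   <- c(2, 4, 6, 8, 10)   # grows as x grows
yDown <- c(10, 8, 6, 4, 2)   # shrinks as x grows

corUp   <- cor(x, yUp)    # perfectly positive relationship
corDown <- cor(x, yDown)  # perfectly negative relationship
print(c(corUp, corDown))
```

Real columns rarely line up this perfectly, so expect values somewhere in between for the HR data.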

#Check Correlations of numeric columns (the first eight columns)
corMatrix <- cor(as.data.frame(hr)[, 1:8], use="complete.obs", method="pearson")
#round to two decimals
round(corMatrix, 2)

From the above matrix we can see that the columns "last_evaluation" and "number_project" have the highest correlation values with the column we are trying to predict: "average_monthly_hours".  Both correlations are positive, which means that as the values in the "last_evaluation" and "number_project" columns get larger, the values in "average_monthly_hours" tend to get larger too.  In other words, employees who worked on more projects or received a higher evaluation are likely to work more hours.
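When there are many columns, it can be easier to pull out just the correlations with the target and sort them.  The sketch below does this on simulated data, not the real HR file: the toy data frame and its values are made up, and the hours column is deliberately built to depend on the number of projects.

```r
# Simulated stand-in for the HR data; these values are made up
set.seed(42)
n <- 200
toy <- data.frame(number_project     = sample(2:7, n, replace = TRUE),
                  satisfaction_level = runif(n))
# Hours are built to depend on project count, so a positive correlation is expected
toy$average_monthly_hours <- 120 + 20 * toy$number_project + rnorm(n, sd = 15)

toyCor <- cor(toy, use = "complete.obs", method = "pearson")

# Correlation of every other column with the target, strongest first
target     <- "average_monthly_hours"
withTarget <- toyCor[rownames(toyCor) != target, target]
ranked     <- withTarget[order(abs(withTarget), decreasing = TRUE)]
round(ranked, 2)
```

Sorting on the absolute value keeps strong negative correlations near the top as well, since they can be just as useful for prediction.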

Create a better visualization for the column correlation - The above table is great, but we should try to visualize this in a way that makes the relationships pop.  This is particularly important when you have a lot of variables (or columns).   The corrplot package has a wide range of options for plotting correlation matrices.  Try out all of the below options and pick the ones you like the best.

#correlation matrices
corrplot(corMatrix, method="circle")
corrplot(corMatrix, method="square")
corrplot(corMatrix, method="number")
corrplot(corMatrix, method="shade")
#show only the upper triangle
corrplot(corMatrix, type="upper")
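The options can also be combined in a single call.  As a sketch, the example below uses the mtcars data that ships with R as a stand-in for the HR data (demoCor is an illustrative name), and groups related variables together with hierarchical clustering.

```r
library(corrplot)

# mtcars ships with R; it stands in for the HR data here
demoCor <- cor(mtcars[, 1:7])

# Circles in the upper triangle, variables grouped by hierarchical
# clustering, and black text labels
corrplot(demoCor, method = "circle", type = "upper",
         order = "hclust", tl.col = "black")
```

Reordering the variables can make blocks of related columns easier to spot than the original column order.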

Rename the columns to look more consumable in the graphs - All of the above graphs are great, but before we move on, let's rename the columns so that they show up more compactly in our visualizations.

#rename columns
colNames <- c("satLevel", "lastEval", "numProj", "avgHrs", "timeCpny", "wrkAcdnt", "left", "fiveYrPrmo", "job", "salary")
setnames(hr, colNames)
#attach allows us to reference the renamed columns by their name
attach(hr)
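Renaming by position relies on the new names being listed in exactly the same order as the columns in the file.  A more defensive sketch is to map each old name to its new name explicitly.  The two-column toyHr table and its values below are made up for illustration.

```r
library(data.table)

# Two-column toy table; the long names mimic the style of the raw data
toyHr <- data.table(satisfaction_level = c(0.38, 0.80),
                    last_evaluation    = c(0.53, 0.86))

# Map each old name to its new name explicitly instead of relying on position
setnames(toyHr,
         old = c("satisfaction_level", "last_evaluation"),
         new = c("satLevel", "lastEval"))
names(toyHr)
```

With this form, setnames complains if an old name is not found, which catches a column order you did not expect.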

View a histogram of the average number of hours worked per week - Histograms are great for showing the distribution of values for a particular variable.  They are especially useful prep work for understanding what transformations need to take place before creating the model.  In the example depicted here we are viewing the histogram for average weekly hours, which we approximate by dividing the average monthly hours by 4.  In this graph we can easily see that most employees work between 30 and 70 hours per week.

# Run a histogram for all numeric variables to understand distribution
hist(avgHrs/4, main="Distribution of Average Hours per Week", xlab="Avg Hours", breaks=7, col="lightblue")
hist(satLevel, main="Distribution of Satisfaction Level", xlab="Satisfaction Level", breaks=7, col="lightblue")
hist(lastEval, main="Distribution of Last Evaluations", xlab="Last Eval", breaks=7, col="lightblue")
hist(numProj, main="Distribution of Number of Projects", xlab="Number of Projects", breaks=7, col="lightblue")
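One thing worth knowing is that a single number passed to breaks is only a suggestion, and R picks "pretty" bin edges close to it.  To see the bins R actually used, you can ask hist for the numbers without drawing the plot.  The sketch below uses simulated hours (toyHrs is a made-up stand-in for avgHrs).

```r
# Simulated monthly hours; made-up numbers standing in for avgHrs
set.seed(1)
toyHrs <- runif(500, min = 96, max = 310)

# plot = FALSE returns the bins without drawing anything
h <- hist(toyHrs / 4, breaks = 7, plot = FALSE)
h$breaks   # bin edges R actually chose
h$counts   # number of employees in each bin
```

If you need exact bin edges, you can pass a vector such as seq(20, 80, by = 10) to breaks instead of a single number.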

Create some new variables to better display the graph output. For example, I would like to plot the average weekly hours by employee retention.  The problem is that the current values for this variable are 0 (employee did not leave) and 1 (employee left).   To make this easier for our graph readers, we will make a new variable that uses text instead of these binary values. 

hr$leftFactor <- factor(left,levels=c(0,1),
                     labels=c("Did Not Leave Company","Left Company")) 

hr$promoFactor <- factor(fiveYrPrmo,levels=c(0,1),
                     labels=c("Did Not Get Promoted","Did Get Promoted")) 

hr$wrkAcdntFactor <- factor(wrkAcdnt,levels=c(0,1),
                      labels=c("No Accident","Accident")) 
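It is worth a quick check that the 0 and 1 values were mapped to the labels you intended.  Here is a minimal sketch on a made-up vector (toyLeft is illustrative and not taken from the HR data).

```r
# Made-up 0/1 flags in the style of the left column
toyLeft <- c(0, 1, 1, 0, 0)

toyLeftFactor <- factor(toyLeft, levels = c(0, 1),
                        labels = c("Did Not Leave Company", "Left Company"))

# Counts per label: a quick check that 0 and 1 were mapped as intended
table(toyLeftFactor)
```

The labels are matched to the levels by position, so the first label goes with 0 and the second with 1.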



View a density plot showing the average hours per week by salary category - Density plots can be a great alternative to the histogram.  This particular chart allows us to see the density of weekly hours worked by salary level.  This chart does not support the theory that employees at higher salary levels work more hours.  At best, I believe that the high salary level shows a more evenly spread range of hours worked, compared with the distribution peaks at the low (30-40 hr) and high (65-70 hr) ends in the low and medium salary categories.

#density plot 
qplot(avgHrs/4, data=hr, geom="density", fill=salary, alpha=I(.5), 
      main="Avg Weekly Hours by Salary Category", xlab="Average Weekly Hours", 
      ylab="Density")

View a density plot showing the average hours per week by employee retention - Switching it up, we view a density plot of hours worked by retained vs non-retained employees.  From this chart, it appears that the underutilized and overworked employees are the ones most likely to leave.  

qplot(avgHrs/4, data=hr, geom="density", fill=leftFactor, alpha=I(.5), 
      main="Average Weekly Hours and Employee Retention", xlab="Average Weekly Hours", 
      ylab="Density")
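Recent ggplot2 releases mark qplot() as deprecated, although it still runs with a warning.  If you prefer the full ggplot() syntax, a roughly equivalent density plot might look like the sketch below.  The toyHr data frame is simulated so that the example runs on its own; with the real data you would pass hr instead.

```r
library(ggplot2)

# Made-up stand-in for hr, using the renamed columns from this tutorial
set.seed(7)
toyHr <- data.frame(
  avgHrs     = c(rnorm(100, mean = 200, sd = 20), rnorm(100, mean = 150, sd = 40)),
  leftFactor = factor(rep(c("Did Not Leave Company", "Left Company"), each = 100))
)

# One semi-transparent density curve per retention group
densityPlot <- ggplot(toyHr, aes(x = avgHrs / 4, fill = leftFactor)) +
  geom_density(alpha = 0.5) +
  labs(title = "Average Weekly Hours and Employee Retention",
       x = "Average Weekly Hours", y = "Density")
densityPlot
```

The alpha value plays the same role as alpha=I(.5) in the qplot version: it lets you see where the two groups overlap.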



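Create a box plot to show the quartile distribution of average hours by job type - Box plots show the quartile distribution of a particular variable.  In a typical box plot you can see the minimum (0%), first quartile (25%), median (50%), third quartile (75%) and maximum (100%) values.  You can also configure them to show outliers, which I find very useful.  By default, R's boxplot draws points that sit more than 1.5 times the interquartile range beyond the box as outliers, so the whiskers stop at the most extreme values that are not outliers.  Note that this plot uses avgHrs directly, so the values are monthly rather than weekly hours.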

boxplot(avgHrs~job,data=hr, main="HR Data",
        xlab="Job Title", ylab="Avg Hours", col="lightblue") 
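If you want the numbers behind the boxes, boxplot can return them without drawing anything.  The sketch below uses a tiny made-up data frame (toyBox) with one deliberately extreme value so that an outlier shows up.

```r
# Made-up hours for two job types; 400 is a deliberately extreme value
toyBox <- data.frame(
  avgHrs = c(150, 160, 170, 180, 400, 140, 150, 160, 170, 180),
  job    = rep(c("sales", "technical"), each = 5)
)

# plot = FALSE returns the numbers behind the boxes instead of drawing them
b <- boxplot(avgHrs ~ job, data = toyBox, plot = FALSE)
b$stats  # one column per job: lower whisker, Q1, median, Q3, upper whisker
b$out    # values beyond the whiskers, drawn as outlier points
```

Here the upper whisker for the sales group stops at 180 rather than 400, because 400 is reported separately as an outlier.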

Next, create a violin plot to visualize the same variables.  Violin plots are a good combination of a density plot and box plot.  They show the full range of values and are often more aesthetically pleasing.  

#violin plot
hrBox <-ggplot(hr, aes(y=avgHrs, x=job)) 
hrBox + geom_violin(trim=FALSE, fill="lightblue") 
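If you miss the quartile markers from the box plot, one option is to draw a narrow box plot inside each violin.  This is a sketch on simulated data (toyJobs is made up), not something the charts above depend on.

```r
library(ggplot2)

# Made-up hours for two job types
set.seed(3)
toyJobs <- data.frame(
  avgHrs = c(rnorm(80, mean = 200, sd = 30), rnorm(80, mean = 180, sd = 50)),
  job    = rep(c("sales", "technical"), each = 80)
)

# Violin for the overall shape, with a slim box plot on top for the quartiles
violinBox <- ggplot(toyJobs, aes(x = job, y = avgHrs)) +
  geom_violin(trim = FALSE, fill = "lightblue") +
  geom_boxplot(width = 0.1, outlier.shape = NA)  # hide the outlier dots
violinBox
```

Keeping the box narrow stops it from hiding the shape of the violin behind it.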

Plot a chart with many dimensions - At times you will need to plot a number of variables at the same time to truly visualize the combined influence they may have on the variable which you are trying to predict.  This chart is able to display 6 variables at once: average weekly hours, time at company, employee retention (leftFactor), salary, number of projects and promotion.

#many dimension charts
qplot(avgHrs/4, timeCpny, data=hr, shape=leftFactor, color=salary, facets=numProj~promoFactor, size=I(3),
      xlab="average hours per week", ylab="time at company") 

One thing to take note of in this chart is that none of the employees with 7 or more projects were promoted in the last 5 years.  Further, they all left the company.  This is an indicator that hard-working employees may be burning out and feeling underappreciated.  It is something we should explore further.
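An observation like this is easy to double check with a quick count.  The sketch below shows the idea on a handful of made-up rows.  With the real data you would run the same subset and table calls on hr.

```r
# Made-up rows; in the tutorial these columns live in hr
toyHr <- data.frame(
  numProj     = c(3, 4, 7, 7, 5, 7),
  leftFactor  = c("Did Not Leave Company", "Did Not Leave Company", "Left Company",
                  "Left Company", "Did Not Leave Company", "Left Company"),
  promoFactor = c("Did Not Get Promoted", "Did Get Promoted", "Did Not Get Promoted",
                  "Did Not Get Promoted", "Did Not Get Promoted", "Did Not Get Promoted")
)

# Keep only the heavy-workload employees, then count each combination
heavy       <- subset(toyHr, numProj >= 7)
heavyCounts <- table(heavy$leftFactor, heavy$promoFactor)
heavyCounts
```

If every count lands in a single cell, as it does for this toy data, the pattern seen in the chart holds for the whole group.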


Find clusters of employees when considering two variables in a scatter plot - A good scatter plot can help map out the relationship between two columns (variables).  With scatter plots you sometimes end up with high-density plots.  These are simply charts that have a lot of dots on them.  In these cases, it can be beneficial to make the dots transparent.  This means that when there are multiple dots in the same spot, you will be able to infer density from the darkness of the color.  In this chart we are mapping out the employee satisfaction level by the average hours worked per week.  This chart exposes a number of clusters of employees.  The most interesting in my opinion are those with very low satisfaction ratings (0.1 or less) and very high hours worked (60+).  We will explore this further later on.

hrScat <- ggplot(hr, aes(x=avgHrs/4, y=satLevel))
hrScat + geom_point()
#make the points more transparent so that it's less intense
hrScat + geom_point(alpha=.01)
#bin the points into rectangles and color each one by its count
hrScat + stat_bin2d()
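Once a cluster catches your eye, a logical filter tells you how many employees are actually in it.  This sketch uses a few made-up rows (toyHr) and the same thresholds mentioned above: a satisfaction level of 0.1 or less and 60 or more hours per week.

```r
# Made-up satisfaction scores and monthly hours
toyHr <- data.frame(
  satLevel = c(0.09, 0.10, 0.45, 0.80, 0.11, 0.09),
  avgHrs   = c(280, 250, 150, 200, 300, 160)
)

# Low satisfaction (0.1 or less) combined with 60+ hours per week
inCluster <- toyHr$satLevel <= 0.1 & toyHr$avgHrs / 4 >= 60
sum(inCluster)   # number of employees in the cluster
mean(inCluster)  # share of all employees
```

The same logical vector can be used to pull out the matching rows, for example toyHr[inCluster, ], when you want to look at them more closely.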

Execute the same chart as above but with a hexagon function - While the transparency trick is very neat, the hexagon function is made to represent density within a scatter plot.  In this example, we try a few color scales to get maximum effect.

hrScat + stat_binhex()
hrScat + stat_binhex() + scale_fill_gradient(low="lightblue", high="red")

Try a different variable combination - Examine employees' last evaluation against their satisfaction level.  It is very important to note here that we found another anomaly cluster: those with very high evaluations (0.8 or higher) and very low satisfaction levels (0.1 or lower).  We need to investigate this further during the next tutorial.

LEvSL <-ggplot(hr, aes(x=satLevel, y=lastEval))
LEvSL + geom_point()
LEvSL + stat_binhex() + scale_fill_gradient(low="lightblue", high="red")


While this was quite a lot of exploration for one session, you will likely do even more when doing your own predictive analysis.  Any tool you are familiar with will do the job.  The most important thing is that you learn about your data set.  In parts 3 and 4 we will be using the knowledge we have gained to transform the data and create the predictive models!

Thank you for reading.  Please comment below if you enjoyed this blog, have questions or would like to see something different in the future.

Written by Laura Ellis

