Explore Your Dataset in R
For anyone who works with data, one of the most exciting activities is exploring a fresh new dataset. You’re looking to understand what variables you have, how many records the data set contains, how many values are missing, how the variables are structured, how they relate to each other and more. While there is a ton you can do to get up and running, I want to show you a few simple commands that give you a fast overview of the data set you are working with.
Simple Exploratory Data Analysis (EDA)
Set Up R
When it comes to setting up the R working environment, we have a couple of options open to us. We can use something like RStudio for local analytics on our personal computer. Or we can use a free, hosted, multi-language collaboration environment like Watson Studio. If you'd like to get started with R in IBM Watson Studio, please have a look at the tutorial I wrote.
Download the data set
Before we get rolling with the EDA, we want to download our data set. For this example, we are going to use the dataset produced by my recent science, technology, engineering, art and math (STEAM) project.
# Load the readr library to bring in the dataset
library(readr)

# Download the data set
df <- read_csv('https://raw.githubusercontent.com/lgellis/STEM/master/DATA-ART-1/Data/FinalData.csv', col_names = TRUE)
Now that we have the data set loaded, it’s time to run some very simple commands to preview the data set and its structure.
To begin, we are going to run the head function, which allows us to see the first 6 rows by default. We are going to override the default and ask to preview the first 10 rows.
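The call itself is short. Here is a minimal sketch of it, assuming R's built-in airquality data as a stand-in for our data frame; the name demo_df is hypothetical and only used for illustration. With the data set from this post you would pass df instead.

```r
# Stand-in data (an assumption for this sketch): R's built-in airquality data set
demo_df <- airquality

# head() returns the first 6 rows by default; the second argument overrides that
first_rows <- head(demo_df, 10)
print(first_rows)
```

The second argument can be any row count, so head(df, 3) would give an even shorter preview.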
dim and Glimpse
Next, we will run the dim function which displays the dimensions of the table. The output takes the form of row, column.
And then we run the glimpse function from the dplyr package. This will display a vertical preview of the dataset. It allows us to easily preview data type and sample data.
# Display the dimensions of the table as rows, columns
dim(df)

# Display the type and a preview of all columns as a row so that it's very easy to take in
library(dplyr)
glimpse(df)
We then run the summary function to show each column, its data type and a few other attributes which are especially useful for numeric columns. For every numeric column, it also displays the min, 1st quartile, median, mean, 3rd quartile and max values.
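As a rough sketch of what that looks like, here is summary applied to R's built-in airquality data, which I am assuming as a stand-in (demo_df is a hypothetical name). For the data set in this post, the equivalent call would be summary(df).

```r
# Stand-in data (an assumption for this sketch): R's built-in airquality data set
demo_df <- airquality

# summary() returns one column of statistics per variable; numeric variables get
# min, quartiles, median, mean and max, plus an NA count when values are missing
col_summary <- summary(demo_df)
print(col_summary)
```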
Next we run the skim function from the skimr package. The skim function is a good addition to the summary function. It displays most of the numerical attributes from summary, but it also displays missing values, more quantile information and an inline histogram for each variable!
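A minimal sketch of the skim call follows, again assuming R's built-in airquality data as a stand-in (demo_df is a hypothetical name) and assuming the skimr package is already installed. The exact columns in the output depend on your skimr version. With the data from this post you would call skim(df).

```r
# Assumes skimr is installed, e.g. via install.packages("skimr")
library(skimr)

# Stand-in data (an assumption for this sketch): R's built-in airquality data set
demo_df <- airquality

# skim() produces one row per variable, including missing-value counts
skimmed <- skim(demo_df)
print(skimmed)
```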
The vis_dat() function of the visdat package is a great way to visualize the data types and missing data within a data frame. This function was suggested by Indrajeet Patil, who created the excellent R package ggstatsplot, which easily plots beautiful data visualizations with inline statistical details. I highly encourage you to check it out.
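# Install visdat from GitHub using devtools
install.packages("devtools")
library(devtools)
devtools::install_github("ropensci/visdat")

# Visualize missing values, then data types, for the whole data frame
library(visdat)
vis_miss(df)
vis_dat(df)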
create_report in DataExplorer
And finally the pièce de résistance, the main attraction and the reason I wrote this blog: the create_report function in the DataExplorer package. This awesome one-line function will pull a full data profile of your data frame. It will produce an HTML file with the basic statistics, structure, missing data, distribution visualizations, correlation matrix and principal component analysis for your data frame! I recently learned about this function in a workshop given by Steph Locke and hosted by R-Ladies Austin. This function is a game changer!
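Here is a minimal sketch of that one-liner. It assumes the DataExplorer package is installed along with what it needs to render a report (rmarkdown and pandoc), and it uses R's built-in airquality data as a stand-in (demo_df is a hypothetical name). The output_file and output_dir arguments are optional. I set them here only so the sketch writes to a temporary folder. With the data from this post, create_report(df) is all you need.

```r
# Assumes DataExplorer is installed, e.g. via install.packages("DataExplorer")
library(DataExplorer)

# Stand-in data (an assumption for this sketch): R's built-in airquality data set
demo_df <- airquality

# Render the full profiling report as an HTML file in a temporary folder
report_dir <- tempdir()
create_report(demo_df, output_file = "report.html", output_dir = report_dir)

# Path to the rendered report, ready to open in a browser
report_path <- file.path(report_dir, "report.html")
print(report_path)
```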
Thanks for reading along while we explored some simple EDA in R. Please share your thoughts and creations with me on Twitter.