Exploratory Data Analysis

Content for week of Monday, February 15, 2021–Thursday, February 18, 2021

Lesson Objectives

  • Import (customized) data from flat, rectangular data files using readr
  • Identify and fix common data import challenges using readr
  • Apply different strategies for Exploratory Data Analysis (EDA)
    • Graphical Summaries
    • Numerical Summaries with and without tables
  • Develop ideas for further analysis

Loading data

So far, we’ve been using packages and data() to load new data into our working session. But what if I want to load some other data that I have? Go ahead and download the following files and let’s work on how to load them into R.

avengers.csv (source)
crf33.dta (source)
steak-risk-survey.xlsx (source)

Creating tibbles

As you have learned, a tibble and a data frame are almost the same thing, except that tibbles have much nicer output and are easier to work with. From now on, know that anytime I say data frame or tibble, I really mean a rectangular data structure with rows and columns, usually stored as a tibble.

How can I create a tibble from scratch? There are two main ways: with tibble() and tribble().

tibble() builds a data frame column by column, while tribble() builds it row by row. Many new learners find tribble() more intuitive.

library(tidyverse)

cats_tibble <- tibble(coat = c("calico", "black", "tabby"),
                      weight = c(2.1, 5.0, 3.2),
                      likes_string = c(1, 0, 1))
cats_tibble
## # A tibble: 3 x 3
##   coat   weight likes_string
##   <chr>   <dbl>        <dbl>
## 1 calico    2.1            1
## 2 black     5              0
## 3 tabby     3.2            1
cats_tribble <- tribble(
  ~coat, ~weight, ~likes_string,
  "calico", 2.1, 1,
  "black", 5.0, 0,
  "tabby", 3.2, 1
)
cats_tribble
## # A tibble: 3 x 3
##   coat   weight likes_string
##   <chr>   <dbl>        <dbl>
## 1 calico    2.1            1
## 2 black     5              0
## 3 tabby     3.2            1
#these are equivalent
cats_tibble == cats_tribble
##      coat weight likes_string
## [1,] TRUE   TRUE         TRUE
## [2,] TRUE   TRUE         TRUE
## [3,] TRUE   TRUE         TRUE

However, in normal practice, creating data from scratch is rare. More often than not, we will load our data from a premade source.

Loading .csv and .tsv

  • A lot of datasets are available as flat files, in comma-separated or tab-separated formats, with the column or variable names in the first row of data.

  • The different file types are usually identifiable by their extension.

  • The extension “.csv” stands for “comma-separated values”: each column is separated by a comma, and each row is on a new line.

  • readr is a tidyverse package for importing data from a variety of file formats into tibbles, faster (roughly 10x) and more reliably than base R’s read.csv()

    • readr converts flat files into data frames (tibbles)
    • readr does NOT convert character vectors to factors automatically (unlike read.csv() in R versions before 4.0, which did so by default)
    • readr functions usually give more informative error messages than base R functions like read.csv()

Use read_csv() to read a CSV (comma-separated values) file into R

avengers <- read_csv(file = here::here("static","data","avengers.csv"))

If the .csv file is online and you know the URL, you can pass the URL as the file argument.

avengers <- 
      read_csv(file ="https://raw.githubusercontent.com/fivethirtyeight/data/master/avengers/avengers.csv")

glimpse(avengers)
## Rows: 173
## Columns: 21
## $ URL                           <chr> "http://marvel.wikia.com/Henry_Pym_(E...
## $ `Name/Alias`                  <chr> "Henry Jonathan \"Hank\" Pym", "Janet...
## $ Appearances                   <dbl> 1269, 1165, 3068, 2089, 2402, 612, 34...
## $ `Current?`                    <chr> "YES", "YES", "YES", "YES", "YES", "Y...
## $ Gender                        <chr> "MALE", "FEMALE", "MALE", "MALE", "MA...
## $ `Probationary Introl`         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Full/Reserve Avengers Intro` <chr> "Sep-63", "Sep-63", "Sep-63", "Sep-63...
## $ Year                          <dbl> 1963, 1963, 1963, 1963, 1963, 1963, 1...
## $ `Years since joining`         <dbl> 52, 52, 52, 52, 52, 52, 51, 50, 50, 5...
## $ Honorary                      <chr> "Full", "Full", "Full", "Full", "Full...
## $ Death1                        <chr> "YES", "YES", "YES", "YES", "YES", "N...
## $ Return1                       <chr> "NO", "YES", "YES", "YES", "YES", NA,...
## $ Death2                        <chr> NA, NA, NA, NA, "YES", NA, NA, "YES",...
## $ Return2                       <chr> NA, NA, NA, NA, "NO", NA, NA, "YES", ...
## $ Death3                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Return3                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Death4                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Return4                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Death5                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Return5                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Notes                         <chr> "Merged with Ultron in Rage of Ultron...

readr tells you how it converted the columns, a step known as parsing the data. You can see the exact specification it used with spec().

spec(avengers)
## cols(
##   URL = col_character(),
##   `Name/Alias` = col_character(),
##   Appearances = col_double(),
##   `Current?` = col_character(),
##   Gender = col_character(),
##   `Probationary Introl` = col_character(),
##   `Full/Reserve Avengers Intro` = col_character(),
##   Year = col_double(),
##   `Years since joining` = col_double(),
##   Honorary = col_character(),
##   Death1 = col_character(),
##   Return1 = col_character(),
##   Death2 = col_character(),
##   Return2 = col_character(),
##   Death3 = col_character(),
##   Return3 = col_character(),
##   Death4 = col_character(),
##   Return4 = col_character(),
##   Death5 = col_character(),
##   Return5 = col_character(),
##   Notes = col_character()
## )

readr looks at roughly the first 1000 rows of each column to guess the column type. If a guess is incorrect, you can use the optional col_types = argument of read_csv() to specify the column types yourself.
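
For example, here is a minimal sketch of overriding a couple of the guessed types with cols(), reusing the avengers.csv file from above (only the listed columns are overridden; the rest are still guessed):

avengers <- read_csv(
  file = here::here("static", "data", "avengers.csv"),
  col_types = cols(
    Appearances = col_integer(),  # parse as integer instead of double
    Year        = col_integer(),
    .default    = col_guess()     # let readr guess every other column
  )
)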

Loading other data types

haven for Stata

Sometimes you’ll come across data from SPSS, Stata, or SAS and need to import it into R. The haven package comes installed with the tidyverse and enables clean import of these proprietary statistical file formats. Although it is installed, it is not attached by library(tidyverse), so you must load it yourself when you want to use it.

library(haven)
crf33 <- read_dta(file = here::here("static","data","crf33.dta"))

#While glimpse doesn't look any different
glimpse(crf33)
## Rows: 45
## Columns: 4
## $ a     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2...
## $ b     <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2...
## $ y     <dbl> 24, 33, 37, 29, 42, 44, 36, 25, 27, 43, 38, 29, 28, 47, 48, 3...
## $ order <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1...
#but str() will uncover these special attribute labels
str(crf33)
## tibble [45 x 4] (S3: tbl_df/tbl/data.frame)
##  $ a    : num [1:45] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ b    : num [1:45] 1 1 1 1 1 2 2 2 2 2 ...
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ y    : num [1:45] 24 33 37 29 42 44 36 25 27 43 ...
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ order: num [1:45] 1 2 3 4 5 1 2 3 4 5 ...
##   ..- attr(*, "format.stata")= chr "%8.0g"
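
If those Stata format attributes ever get in your way, haven also provides helpers to strip them. A minimal sketch, assuming the crf33 object loaded above:

# Drop the format.stata attributes; the data values are unchanged
crf33_clean <- zap_formats(crf33)
str(crf33_clean)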

readxl for loading Excel data

Often Not a Good Idea to Import Directly from Excel

  • RStudio is getting better at importing data from Excel, as are other packages
  • You may be tempted to import data directly from Excel, but it is generally not recommended
  • Excel is designed for human data entry and analysis, not for efficient data management
  • There is potential for errors or excess time spent adjusting the data in Excel
  • People tend to color-code information in Excel or be inconsistent in their formatting
  • Instead, export the data from the Excel worksheet as a .csv. Then read the .csv file into R.
  • Edit the data so the information is encoded by a new variable.

library(readxl)
steak <- read_excel(path = here::here("static","data","steak-risk-survey.xlsx"))
glimpse(steak)
## Rows: 550
## Columns: 15
## $ RespondentID           <dbl> 3237565956, 3234982343, 3234973379, 32349723...
## $ hypothetical_scenario  <chr> "Lottery B", "Lottery A", "Lottery A", "Lott...
## $ smoke_cigarettes       <chr> NA, "No", "No", "Yes", "No", "No", "No", "No...
## $ drink_alcohol          <chr> NA, "Yes", "Yes", "Yes", "Yes", "No", "Yes",...
## $ gamble                 <chr> NA, "No", "Yes", "Yes", "No", "No", "No", "Y...
## $ skydiving              <chr> NA, "No", "No", "No", "No", "No", "No", "Yes...
## $ drive_above_speedlimit <chr> NA, "No", "Yes", "Yes", "Yes", "Yes", "Yes",...
## $ cheated_partner        <chr> NA, "No", "Yes", "Yes", "Yes", "No", "Yes", ...
## $ eat_steak              <chr> NA, "Yes", "Yes", "Yes", "Yes", "Yes", "No",...
## $ steak_preference       <chr> NA, "Medium rare", "Rare", "Medium", "Medium...
## $ gender                 <chr> NA, "Male", "Male", "Male", "Male", "Male", ...
## $ age                    <chr> NA, "> 60", "> 60", "> 60", "> 60", "18-29",...
## $ income                 <chr> NA, "$50,000 - $99,999", "$150,000+", "$50,0...
## $ education              <chr> NA, "Some college or Associate degree", "Gra...
## $ census_region          <chr> NA, "East North Central", "South Atlantic", ...

Again, be very careful when importing Excel data. So often there’s a stray value in cell DM1334 that will throw off how the file is read in, or two rows used for column names.
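
If you must read directly from Excel, read_excel() has arguments such as range, skip, and na that can guard against some of these problems. A minimal sketch (the cell range here just covers the 15 columns shown above):

steak_subset <- read_excel(
  path  = here::here("static", "data", "steak-risk-survey.xlsx"),
  range = cell_cols("A:O"),  # only read the columns you expect
  na    = c("", "NA")        # treat blank cells as missing
)

# If the sheet had an extra title row above the column names,
# skip = 1 would drop it instead.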

HW & Lab Preview

For your homework this week, you’ll take a deep dive into what data are available for your final project. I’m having you browse several different repositories that I think contain fun and useful data so you may become inspired for your final project. You will be sharing these datasets with your classmates in a “data speed-dating” activity in the lab on Thursday, so do find a few you like.

EDA

Introduction

  • Once you have loaded your data and checked it for completeness and consistency, you want to begin to look at it.
  • You may have some initial questions or hypotheses about your topic of interest
  • EDA is a process for exploring your data to assess initial hypotheses and generate or uncover new ones
  • You have to be careful about “data snooping” from a statistical perspective
  • It helps to follow a general strategy for EDA

General Strategies

  • Plot the distribution of every variable. (A short code sketch of a few of these strategies follows this list.)

  • Look for symmetry, skewness, modality, etc.

  • Plot the bi-variate distribution of every pair of variables (to find which variables are associated).

  • Again, look for patterns and relationships, skewness, curvature, modality, gaps, discontinuities, etc.

  • Color-code by variables to see if relationships appear more clearly.

  • Calculate lots of numerical summary statistics.

  • Look at “missingness.”

  • Look at extreme values for potential “outliers” and patterns

  • EDA is about curiosity.

  • Ask many questions, use many plots, investigate many aspects of your data.

  • This will let you home in on the few interesting questions you want to pursue more deeply.

  • Keep track of what you are doing in your .Rmd file’s text chunks and code chunks so you can protect yourself from becoming a victim of data snooping, i.e., cherry-picking only the “good” results.
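
For example, here is a minimal ggplot2 sketch of a few of these strategies, using the avengers data loaded earlier (the variable names come from that file):

# Distribution of a single numeric variable
ggplot(avengers, aes(x = Appearances)) +
  geom_histogram(bins = 30)

# Bivariate relationship, color-coded by a third variable
ggplot(avengers, aes(x = Year, y = Appearances, color = Gender)) +
  geom_point(alpha = 0.5)

# A few quick numerical summaries by group
avengers %>%
  group_by(Gender) %>%
  summarize(n = n(),
            median_appearances = median(Appearances),
            missing_names = sum(is.na(`Name/Alias`)))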

Automated EDA

skimr

skimr provides a frictionless approach to summary statistics, conforming to the principle of least surprise: it displays statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.

# install.packages("skimr")
library(skimr)
skimr::skim(avengers)
Name avengers
Number of rows 173
Number of columns 21
_______________________
Column type frequency:
character 18
numeric 3
________________________
Group variables None

Table 1: Data summary

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
URL 0 1.00 36 67 0 173 0
Name/Alias 10 0.94 4 35 0 162 0
Current? 0 1.00 2 3 0 2 0
Gender 0 1.00 4 6 0 2 0
Probationary Introl 158 0.09 6 6 0 12 0
Full/Reserve Avengers Intro 14 0.92 5 6 0 93 0
Honorary 0 1.00 4 12 0 4 0
Death1 0 1.00 2 3 0 2 0
Return1 104 0.40 2 3 0 2 0
Death2 156 0.10 2 3 0 2 0
Return2 157 0.09 2 3 0 2 0
Death3 171 0.01 3 3 0 1 0
Return3 171 0.01 2 3 0 2 0
Death4 172 0.01 3 3 0 1 0
Return4 172 0.01 3 3 0 1 0
Death5 172 0.01 3 3 0 1 0
Return5 172 0.01 3 3 0 1 0
Notes 98 0.43 21 255 0 71 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Appearances 0 1 414.05 677.99 2 58 132 491 4333 ▇▁▁▁▁
Year 0 1 1988.45 30.37 1900 1979 1996 2010 2015 ▁▁▁▅▇
Years since joining 0 1 26.55 30.37 0 5 19 36 115 ▇▅▁▁▁
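
skim() also works inside a pipe, including on grouped data. For example, with the diamonds data that comes with ggplot2:
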
diamonds %>% 
  group_by(cut) %>% 
  skimr::skim()
Name Piped data
Number of rows 53940
Number of columns 10
_______________________
Column type frequency:
factor 2
numeric 7
________________________
Group variables cut

Table 1: Data summary

Variable type: factor

skim_variable cut n_missing complete_rate ordered n_unique top_counts
color Fair 0 1 TRUE 7 G: 314, F: 312, H: 303, E: 224
color Good 0 1 TRUE 7 E: 933, F: 909, G: 871, H: 702
color Very Good 0 1 TRUE 7 E: 2400, G: 2299, F: 2164, H: 1824
color Premium 0 1 TRUE 7 G: 2924, H: 2360, E: 2337, F: 2331
color Ideal 0 1 TRUE 7 G: 4884, E: 3903, F: 3826, H: 3115
clarity Fair 0 1 TRUE 8 SI2: 466, SI1: 408, VS2: 261, I1: 210
clarity Good 0 1 TRUE 8 SI1: 1560, SI2: 1081, VS2: 978, VS1: 648
clarity Very Good 0 1 TRUE 8 SI1: 3240, VS2: 2591, SI2: 2100, VS1: 1775
clarity Premium 0 1 TRUE 8 SI1: 3575, VS2: 3357, SI2: 2949, VS1: 1989
clarity Ideal 0 1 TRUE 8 VS2: 5071, SI1: 4282, VS1: 3589, VVS: 2606

Variable type: numeric

skim_variable cut n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
carat Fair 0 1 1.05 0.52 0.22 0.70 1.00 1.20 5.01 ▇▂▁▁▁
carat Good 0 1 0.85 0.45 0.23 0.50 0.82 1.01 3.01 ▇▆▂▁▁
carat Very Good 0 1 0.81 0.46 0.20 0.41 0.71 1.02 4.00 ▇▃▁▁▁
carat Premium 0 1 0.89 0.52 0.20 0.41 0.86 1.20 4.01 ▇▆▁▁▁
carat Ideal 0 1 0.70 0.43 0.20 0.35 0.54 1.01 3.50 ▇▂▁▁▁
depth Fair 0 1 64.04 3.64 43.00 64.40 65.00 65.90 79.00 ▁▁▃▇▁
depth Good 0 1 62.37 2.17 54.30 61.30 63.40 63.80 67.00 ▁▂▂▇▁
depth Very Good 0 1 61.82 1.38 56.80 60.90 62.10 62.90 64.90 ▁▂▅▇▂
depth Premium 0 1 61.26 1.16 58.00 60.50 61.40 62.20 63.00 ▁▃▆▇▇
depth Ideal 0 1 61.71 0.72 43.00 61.30 61.80 62.20 66.70 ▁▁▁▇▆
table Fair 0 1 59.05 3.95 49.00 56.00 58.00 61.00 95.00 ▇▆▁▁▁
table Good 0 1 58.69 2.85 51.00 56.00 58.00 61.00 66.00 ▁▇▇▅▂
table Very Good 0 1 57.96 2.12 44.00 56.00 58.00 59.00 66.00 ▁▁▆▇▁
table Premium 0 1 58.75 1.48 51.00 58.00 59.00 60.00 62.00 ▁▁▁▇▃
table Ideal 0 1 55.95 1.25 43.00 55.00 56.00 57.00 63.00 ▁▁▅▇▁
price Fair 0 1 4358.76 3560.39 337.00 2050.25 3282.00 5205.50 18574.00 ▇▃▁▁▁
price Good 0 1 3928.86 3681.59 327.00 1145.00 3050.50 5028.00 18788.00 ▇▃▁▁▁
price Very Good 0 1 3981.76 3935.86 336.00 912.00 2648.00 5372.75 18818.00 ▇▃▁▁▁
price Premium 0 1 4584.26 4349.20 326.00 1046.00 3185.00 6296.00 18823.00 ▇▃▁▁▁
price Ideal 0 1 3457.54 3808.40 326.00 878.00 1810.00 4678.50 18806.00 ▇▂▁▁▁
x Fair 0 1 6.25 0.96 0.00 5.63 6.18 6.70 10.74 ▁▁▇▃▁
x Good 0 1 5.84 1.06 0.00 5.02 5.98 6.42 9.44 ▁▁▆▇▁
x Very Good 0 1 5.74 1.10 0.00 4.75 5.74 6.47 10.01 ▁▁▇▆▁
x Premium 0 1 5.97 1.19 0.00 4.80 6.11 6.80 10.14 ▁▁▇▇▁
x Ideal 0 1 5.51 1.06 0.00 4.54 5.25 6.44 9.65 ▁▁▇▃▁
y Fair 0 1 6.18 0.96 0.00 5.57 6.10 6.64 10.54 ▁▁▇▃▁
y Good 0 1 5.85 1.05 0.00 5.02 5.99 6.44 9.38 ▁▁▆▇▁
y Very Good 0 1 5.77 1.10 0.00 4.77 5.77 6.51 9.94 ▁▁▇▆▁
y Premium 0 1 5.94 1.26 0.00 4.79 6.06 6.76 58.90 ▇▁▁▁▁
y Ideal 0 1 5.52 1.07 0.00 4.55 5.26 6.44 31.80 ▇▃▁▁▁
z Fair 0 1 3.98 0.65 0.00 3.61 3.97 4.28 6.98 ▁▁▇▃▁
z Good 0 1 3.64 0.65 0.00 3.07 3.70 4.03 5.79 ▁▁▆▇▁
z Very Good 0 1 3.56 0.73 0.00 2.95 3.56 4.02 31.80 ▇▁▁▁▁
z Premium 0 1 3.65 0.73 0.00 2.94 3.72 4.16 8.06 ▁▅▇▁▁
z Ideal 0 1 3.40 0.66 0.00 2.80 3.23 3.98 6.03 ▁▁▇▃▁

visdat

visdat helps you visualise a data frame and “get a look at the data” by displaying the variable classes as a plot with vis_dat(), and by giving a brief look at missing-data patterns with vis_miss().

# install.packages("visdat")
library(visdat)

#overall viz
vis_dat(steak)

vis_dat(avengers)

#missing viz
vis_miss(avengers, 
         cluster = TRUE)

#comparison viz
vis_compare(cats_tibble, cats_tribble)

Why EDA

Why do we want to visualize our data?

  • All of the graphs below are based on data with nearly identical summary statistics!

We can see this explicitly with the “Datasaurus Dozen” from the datasauRus package. The thirteen datasets have nearly identical summary statistics, but the actual data look vastly different from each other. [1]

library(datasauRus)
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x    = mean(x),
    mean_y    = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y  = cor(x, y)
  )
## # A tibble: 13 x 6
##    dataset    mean_x mean_y std_dev_x std_dev_y corr_x_y
##  * <chr>       <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
##  1 away         54.3   47.8      16.8      26.9  -0.0641
##  2 bullseye     54.3   47.8      16.8      26.9  -0.0686
##  3 circle       54.3   47.8      16.8      26.9  -0.0683
##  4 dino         54.3   47.8      16.8      26.9  -0.0645
##  5 dots         54.3   47.8      16.8      26.9  -0.0603
##  6 h_lines      54.3   47.8      16.8      26.9  -0.0617
##  7 high_lines   54.3   47.8      16.8      26.9  -0.0685
##  8 slant_down   54.3   47.8      16.8      26.9  -0.0690
##  9 slant_up     54.3   47.8      16.8      26.9  -0.0686
## 10 star         54.3   47.8      16.8      26.9  -0.0630
## 11 v_lines      54.3   47.8      16.8      26.9  -0.0694
## 12 wide_lines   54.3   47.8      16.8      26.9  -0.0666
## 13 x_shape      54.3   47.8      16.8      26.9  -0.0656
ggplot(datasaurus_dozen, 
       aes(x=x, y=y, color=dataset))+
  geom_point(size = .5)+
  theme_void()+
  theme(legend.position = "none")+
  facet_wrap(~dataset, ncol=3)

Data Export

write_csv(), write_csv2(), and write_tsv()

  • You can write comma-separated and tab-separated files using write_csv(), write_csv2(), and write_tsv().
steak %>% 
  filter(hypothetical_scenario == "Lottery B") %>% 
  write_csv("../output/nyc_region_data.csv")
  • The defaults are usually fine.

Reading/Writing R Objects with readRDS() and saveRDS()

  • You can save and reload arbitrary R objects (data frames, matrices, lists, vectors) using readRDS() and saveRDS().
  • .Rds files are compressed data files, so they enable much faster loading and more compact storage.
  • These are what usually go into the /data folder, as opposed to the /data_raw folder.
steak %>% 
  filter(hypothetical_scenario == "Lottery B") %>% 
  saveRDS("../output/nyc_region_data.rds")
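
To load the saved object back into a later session, use readRDS(). A minimal sketch (the object name here is just illustrative):

lottery_b <- readRDS("../output/nyc_region_data.rds")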

References


  1. “The Datasaurus Data Package,” accessed September 16, 2019, https://cran.r-project.org/web/packages/datasauRus/vignettes/Datasaurus.html.