Factors and Dates

Content for week of Monday, March 29, 2021–Thursday, April 1, 2021

Learning Outcomes

  • Use functions from the {forcats} package to manipulate factors in R.
    • Describe the difference between a factor and a string.
    • Convert between strings and factors.
    • Reorder and rename factors.
    • Change how character strings are handled in a data frame.
  • Manipulate dates and times using the lubridate package.
    • Use lubridate to parse different date formats.
    • create date objects from multiple inputs.
    • calculate intervals and durations between dates and times.

Factors: where they fit in

R has a special data class, called factor, to deal with categorical data that you may encounter when creating plots or doing statistical analyses. Factors are very useful and actually contribute to making R particularly well suited to working with data. So we are going to spend a little time introducing them.

Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.

Factors are the variable type that useRs love to hate. It is how we store truly categorical information in R. The values a factor can take on are called the levels. For example, the levels of the factor continent in Gapminder are are “Africa”, “Americas”, etc. and this is what’s usually presented to your eyeballs by R. In general, the levels are friendly human-readable character strings, like “male/female” and “control/treated”.

Factors are particularly useful when making plots or running statistical models. Unfortunately, they can also be very tricky to work with, because they are secretly numbers behind the scenes. Working with factors in base R can lead to errors that are almost impossible for human analysts to catch, but there is a tidyverse package that makes it much easier to work with factors, and prevents many common mistakes. It is called forcats (an anagram of the word “factors” and also because it is a package for working with categorical variables).

The forcats package

forcats is a core package in the tidyverse. It is installed via install.packages("tidyverse") and attached via library(tidyverse). You can always load it individually via library(forcats). Main functions start with fct_. There really is no coherent family of base functions that forcats replaces – that’s why it’s such a welcome addition.

Let’s load the forcats package so we can use the functions it comes with

library(tidyverse)
library(forcats)

And let’s load the data we’ll be using today, gapminder

library(gapminder)

Creating a factor:

Once created, factors can only contain a pre-defined set of values, known as levels. By default, base R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

factor(c("sad", "happy", "happy", "sad"))
## [1] sad   happy happy sad  
## Levels: happy sad

R will assign 1 to the level "happy" and 2 to the level "sad" (because h comes before s in the alphabet, even though the first element in this vector is"sad").

In R’s memory, factors are represented by integers (1, 2), but are more informative than integers because factors are self describing: "happy", "sad" is more descriptive than 1, and 2. Which one is “sad”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels.

However, the default ordering of levels in base R is less than ideal, because it depends on the language you have set for your R session, and can lead to un-reproducble code.

In the forcats package, there is a function that makes a factor but creates the levels in the order they appear.

feeling <- as_factor(c("sad", "happy", "happy", "sad"))
feeling
## [1] sad   happy happy sad  
## Levels: sad happy

You can see the levels and their order by using the function levels() and you can find the number of levels using nlevels():

levels(feeling)
## [1] "sad"   "happy"
nlevels(feeling)
## [1] 2

Let’s move onto our data frame now to investigate the continent factor.

Factor inspection

Get to know your factor before you start touching it! It’s polite. Let’s use gapminder$continent as our example.

str(gapminder$continent)
##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
nlevels(gapminder$continent)
## [1] 5
class(gapminder$continent)
## [1] "factor"

To get a frequency table as a tibble, from a tibble, use dplyr::count(). To get similar from a free-range factor, use forcats::fct_count().

gapminder %>% 
  count(continent)
## # A tibble: 5 x 2
##   continent     n
## * <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24
fct_count(gapminder$continent)
## # A tibble: 5 x 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24

This may feel familiar - you’ve worked with factors before!

reordering factor levels

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis.

By default, factor levels are ordered alphabetically. Which might as well be random, when you think about it! It is preferable to order the levels according to some principle:

  • Frequency. Make the most common level the first and so on.
  • Another variable. Order factor levels according to a summary statistic for another variable. Example: order Gapminder countries by life expectancy.

In forcats, one way to reorder our levels in the continent vector would be manually, using fct_relevel:

levels(gapminder$continent) # current order
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
# Reorder by population:
gapminder <- gapminder %>% 
  mutate(continent_reordered = fct_relevel(continent, 
                                           "Asia",
                                           "Africa",
                                           "Americas",
                                           "Europe",
                                           "Oceania"))


levels(gapminder$continent_reordered) # after re-ordering
## [1] "Asia"     "Africa"   "Americas" "Europe"   "Oceania"

The fct_relevel function allows you to move any number of levels to any location. If you re-specify the entire list of levels, it will re-order the whole list. But, if you just specify one level (like we did here) that level gets moved to the front of the list.

# Reorder by population:
gapminder <- gapminder %>% 
  mutate(continent_reordered2 = fct_relevel(continent, "Asia"))

levels(gapminder$continent_reordered2) # after re-ordering
## [1] "Asia"     "Africa"   "Americas" "Europe"   "Oceania"

Another way to re-order your factor levels is by frequency, so the most common factor levels come first, and the less common come later. (This is often useful for plotting!) In this case, it is the frequency of how often each level occurs in the variable, as seen in fct_count(gapminder$continent)

levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
gapminder <- gapminder %>% 
  mutate(continent_infreq = fct_infreq(continent, ordered = TRUE))

levels(gapminder$continent_infreq) # after re-ordering
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"

What if we want to order the continent factor based on the values of another varialbe? This other variable is usually quantitative and you will order the factor according to a grouped summary. The factor is the grouping variable and the default summarizing function is median() but you can specify something else.

head(levels(gapminder$country))
## [1] "Afghanistan" "Albania"     "Algeria"     "Angola"      "Argentina"  
## [6] "Australia"
## order countries by median life expectancy
gapminder <- gapminder %>% 
  mutate(country_med_lifexp = fct_reorder(country, lifeExp))
head(levels(gapminder$country_med_lifexp))
## [1] "Sierra Leone"  "Guinea-Bissau" "Afghanistan"   "Angola"       
## [5] "Somalia"       "Guinea"
## order according to max population instead of median life expectancy
gapminder <- gapminder %>% 
  mutate(country_min_pop = fct_reorder(country, pop, .fun = max))
head(levels(gapminder$country_min_pop))
## [1] "Sao Tome and Principe" "Iceland"               "Djibouti"             
## [4] "Equatorial Guinea"     "Bahrain"               "Comoros"
## backwards!
gapminder <- gapminder %>% 
  mutate(country_min_pop = fct_reorder(country, pop, .fun = max, .desc = TRUE))
head(levels(gapminder$country_min_pop))
## [1] "China"         "India"         "United States" "Indonesia"    
## [5] "Brazil"        "Pakistan"

Example of why we reorder factor levels: often makes plots much better! When a factor is mapped to x or y, it should almost always be reordered by the quantitative variable you are mapping to the other one.

Compare the interpretability of these two plots of life expectancy in Asian countries in 2007. The only difference is the order of the country factor. Which one do you find easier to learn from?

gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia")
ggplot(gap_asia_2007, aes(x = lifeExp, y = country)) + geom_point()
ggplot(gap_asia_2007, aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
  geom_point()

Use fct_reorder2() when you have a line chart of a quantitative x against another quantitative y and your factor provides the color. This way the legend appears in some order as the data! Contrast the legend on the left with the one on the right.

renaming factor levels

forcats makes easy to rename factor levels. Let’s say we made a mistake and need to recode “Oceania” to “Australia”. We’d use the fct_recode function to do this.

levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
gapminder <- gapminder %>% 
  mutate(continent_recode = fct_recode(continent, Australia = "Oceania"))

levels(gapminder$continent_recode)
## [1] "Africa"    "Americas"  "Asia"      "Europe"    "Australia"

There are many other forcat packages for very specific uses - like making an “other” factor for rare occurrences with fct_lump(). Explore the cheatsheet so you know what is available!

The lubridate Package

The lubridate package has many functions to simplify working with dates and times. The lubridate package does not come loaded as part of the tidyverse so you need to load it separately. Install the package if needed (using the console).

library(tidyverse)
library(lubridate)

Three main classes for date/time data

  1. Date for just the date.
  2. POSIXct for both the date and the time (with Time Zone).
    • “POSIXct” stands for “Portable Operating System Interface - Calendar Time”. It is a part of a standardized system (based on UNIX) of representing time across many computing computing platforms.
  3. hms from the hms R package for just the time. “hms” stands for “hours, minutes, and seconds.”

today() gives the current date in the Date class.

today()
## [1] "2021-03-27"
class(today())
## [1] "Date"

now() gives the current date-time in the POSIXct class.

now()
## [1] "2021-03-27 17:06:46 MST"
class(now())
## [1] "POSIXct" "POSIXt"

Parsing Dates and Times Using lubridate

lubridate has many "helper" functions which parse dates/times more automatically. - The helper *function name specifies the order of the components*: year, month, day, hours, minutes, and seconds. The help page forymd` shows multiple functions to parse dates with different sequences of year, month and day,

Only the order of year, month, and day matters

ymd(c("2011/01-10", "2011-01/10", "20110110"))
## [1] "2011-01-10" "2011-01-10" "2011-01-10"
mdy(c("01/10/2011", "01 adsl; 10 df 2011", "January 10, 2011"))
## [1] "2011-01-10" "2011-01-10" "2011-01-10"

For times, only the order of hours, minutes, and seconds matter

hms(c("10:40:10", "10 40 10"))
## [1] "10H 40M 10S" "10H 40M 10S"

Let’s parse the following date-times.

t1 <- "05/26/2004 UTC 11:11:11.444"
t2 <- "26 2004 05 UTC 11/11/11.444"
mdy_hms(t1)
## [1] "2004-05-26 11:11:11 UTC"
## No dym_hms() function is defined, so need to use parse_datetime()
parse_date_time(t2, "d y m H M S")
## [1] "2004-05-26 11:11:11 UTC"

Now let’s use the appropriate lubridate function to parse the following dates:

d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014 
mdy(d1)
## [1] "2010-01-01"
ymd(d2)
## [1] "2015-03-07"
dmy(d3)
## [1] "2017-06-06"
mdy(d4)
## [1] "2015-08-19" "2015-07-01"
mdy(d5)
## [1] "2014-12-30"

Creating Date-time values from individual components

Use make_date() or make_datetime() to create dates and date-times if you have a vector of years, months, days, hours, minutes, or seconds.

make_date(year = 1981, month = 6, day = 25)
## [1] "1981-06-25"
make_datetime(year = 1972, month = 2, day = 22, hour = 10, min = 9, sec = 01)
## [1] "1972-02-22 10:09:01 UTC"

You can see variables for the year, month, day, hour, and minute of the scheduled departure time with nycflights13 example:

library(nycflights13)
data("flights")

flights <- flights %>%
   mutate(datetime = make_datetime(year   = year, 
                                  month  = month, 
                                  day    = day,
                                  hour   = hour,
                                  min = minute))

select(flights, datetime)
## # A tibble: 336,776 x 1
##    datetime           
##    <dttm>             
##  1 2013-01-01 05:15:00
##  2 2013-01-01 05:29:00
##  3 2013-01-01 05:40:00
##  4 2013-01-01 05:45:00
##  5 2013-01-01 06:00:00
##  6 2013-01-01 05:58:00
##  7 2013-01-01 06:00:00
##  8 2013-01-01 06:00:00
##  9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # ... with 336,766 more rows

Having it in the date-time format makes it easier to plot.

ggplot(flights, aes(x = datetime)) +
  geom_freqpoly(bins = 365)

It also makes it easier to filter by date usingas_date() and ymd

flights %>%
  filter(as_date(datetime) == ymd(20130704)) %>%
  ggplot(aes(x = datetime)) +
  geom_freqpoly(binwidth = 600)

Extracting Components of a date-time

To extract the component of a date-time, use one of the following:

  • year() extracts the year
  • month() extracts the month
  • week() extracts the week
  • mday() extracts the day of the month (1, 2, 3, …)
  • wday() extracts the day of the week (Saturday, Sunday, Monday …)
  • yday() extracts the day of the year (1, 2, 3, …)
  • hour() extracts the hour
  • minute() extract the minute
  • second() extracts the second
ddat <- mdy_hms("01/02/1970 03:51:44")
ddat
## [1] "1970-01-02 03:51:44 UTC"
year(ddat)
## [1] 1970
month(ddat, label = TRUE)
## [1] Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
week(ddat)
## [1] 1
mday(ddat)
## [1] 2
wday(ddat, label = TRUE)
## [1] Fri
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
yday(ddat)
## [1] 2
hour(ddat)
## [1] 3
minute(ddat)
## [1] 51
second(ddat)
## [1] 44

Doing Math with Time

Humans manipulate “clock time” with the use of policies such as Daylight Savings Time which creates irregularities in the “physical time”. lubridate provides three classes of time spans to facilitate math with dates and date-times. + Periods: track changes in “clock time”, and ignore irregularities in “physical time”. + Durations: track the passage of “physical time”, which deviates from “clock time” when irregularities occur. + Intervals: represent specific spans of the timeline, bounded by start and end date-times. We won’t cover this in this lesson, but you can learn more with ?interval-class

Periods

Make a period with the name of a time unit pluralized, e.g.

p <- months(3) + days(12)
p
## [1] "3m 12d 0H 0M 0S"
str(p)
## Formal class 'Period' [package "lubridate"] with 6 slots
##   ..@ .Data : num 0
##   ..@ year  : num 0
##   ..@ month : num 3
##   ..@ day   : num 12
##   ..@ hour  : num 0
##   ..@ minute: num 0

You can read more about periods with

help("Period-class")

Durations

Durations are stored as seconds, the only time unit with a consistent length. Add or subtract durations to model physical processes, like travel or lifespan. You can create durations from years with dyears(), from days with ddays(), etc…

dyears(1)
## [1] "31557600s (~1 years)"
ddays(1)
## [1] "86400s (~1 days)"
dhours(1)
## [1] "3600s (~1 hours)"

You can read about durations using

help("Duration-class")

You can also use duration(quantity, units = ...) to create a duration or as.duration() to coerce an object to a duration. Example: We can find out the exact age for the first single released by K-pop legends, BTS using durations

d1 <- mdy("June 13, 2013")
d2 <- today()
str(d1)
##  Date[1:1], format: "2013-06-13"
d2-d1
## Time difference of 2844 days
as.duration(d2 - d1)
## [1] "245721600s (~7.79 years)"

Resources