Working with strings and `stringr`

Content for week of Monday, March 22, 2021–Thursday, March 25, 2021

Learning Outcomes

  • Manipulate strings with the stringr package.
  • Employ regular expressions (regex) to manipulate strings.

Intro to strings and character vectors

  • In R, strings (also called “characters”) are created and displayed within quotes:
x <- "I am a string!"
x
## [1] "I am a string!"
class(x)
## [1] "character"
  • Anything within quotes is a string, even numbers!
y <- "3"
class(y)
## [1] "character"
  • You can have a vector of strings.
z <- c("I", "am", "a", "string", "vector")
z[2:3]
## [1] "am" "a"
  • The backslash "\" means what is after the backslash is special in some way.
  • For example, if you want to put a quotation mark in a string, you can “escape” the quotation mark with a backslash.
a <- "As Michael Scott said, \"I'm not superstitious, but I am a little stitious.\""
a
## [1] "As Michael Scott said, \"I'm not superstitious, but I am a little stitious.\""
  • cat() will print out the string itself.
  • print() will print out the printed representation of the string (with quotes, backslashes and all).
print(x)
## [1] "I am a string!"
cat(x)
## I am a string!
print(y)
## [1] "3"
cat(y)
## 3
print(z)
## [1] "I"      "am"     "a"      "string" "vector"
cat(z)
## I am a string vector
print(a)
## [1] "As Michael Scott said, \"I'm not superstitious, but I am a little stitious.\""
cat(a)
## As Michael Scott said, "I'm not superstitious, but I am a little stitious."
  • "\n" represents a new line
n <- "I'm not superstitious,\nbut I am a little stitious."
cat(n)
## I'm not superstitious,
## but I am a little stitious.
# what happens if we put spaces around \n? 
  • "\t" represents a tab
t <- "I'm not superstitious,\tbut I am a little stitious."
cat(t)
## I'm not superstitious,   but I am a little stitious.
  • You can add any Unicode character with a \u followed by the hexadecimal unicode representation of that character.
  • Be careful about whether knitr will accept it though!
mu <- "\u00b5"
cat(mu)
## µ
cat(c("(","\u2310","\u25A0","\u2022","\u25A0",")"))
## ( ⌐ ■ • ■ )
# http://pages.ucsd.edu/~dkjordan/resources/unicodemaker.html

stringr

  • The stringr package has functions that make manipulating strings easier and more user-friendly than base R’s grep() and gsub().
  • stringr is part of the tidyverse so you do not have to load it separately.
library(tidyverse)

All of stringr’s functions begin with “str_”, so in RStudio you can press tab after typing “str_” and a list of possible string manipulation functions will pop up.

For example, use str_length() to get the number of characters in a string.

beyonce <- "I am Beyoncé, always."
str_length(beyonce)
## [1] 21

Spaces and punctuation marks count as characters. What about escaped characters? The backslash itself does not count, but the character it escapes does.

str_length("I am Beyoncé, \nalways.")
## [1] 22
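As a quick check of how escapes are counted, here is a small sketch with a few made-up strings (the vector name `escapes` is just for illustration). Each escape sequence is stored as a single character, because the backslash is only part of how we type the string.

```r
library(stringr)

# Each element contains exactly one escape sequence
escapes <- c("a\tb",     # tab between two letters: 3 characters
             "\"",       # an escaped quotation mark: 1 character
             "\\",       # an escaped backslash: 1 character
             "\u00b5")   # a Unicode escape: 1 character

n_chars <- str_length(escapes)
n_chars
```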

Combining Strings with str_c()

str_c() combines two or more strings into one:

x <- "Would I rather be feared or loved?"
y <- "Easy. Both."
z <- "I want people to be afraid of how much they love me."

str_c(x, y, z)
## [1] "Would I rather be feared or loved?Easy. Both.I want people to be afraid of how much they love me."

The default is to separate strings by nothing, but you can use sep to change the separator.

str_c(x, y, z,  sep = " ")
## [1] "Would I rather be feared or loved? Easy. Both. I want people to be afraid of how much they love me."

Just like c(), str_c() can take multiple arguments.

str_c("Where", "are", "the", "turtles?!", sep = " ")
## [1] "Where are the turtles?!"
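str_c() is also vectorized, which the examples above do not show. The sketch below uses made-up vectors `first` and `last` to illustrate element-wise combination, and the collapse argument, which squashes the resulting vector into a single string.

```r
library(stringr)

first <- c("Jim", "Pam")
last  <- c("Halpert", "Beesly")

# Element-wise: one output string per pair of inputs
full <- str_c(first, last, sep = " ")
full

# collapse joins the elements of the result into ONE string
roster <- str_c(first, last, sep = " ", collapse = "; ")
roster
```

sep goes between the arguments, while collapse goes between the elements of the result.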

Subsetting substrings with str_sub()

str_sub() extracts a substring between the location of two characters.

bankrupt <- "I… Declare… Bankruptcy!"
str_sub(bankrupt, start = 4, end = 10)
## [1] "Declare"

You often want to use str_sub() when the data is highly structured, so the piece you need always sits at the same position.

phone <- "800-800-8553"
#first three
str_sub(phone, end = 3)
## [1] "800"
# last four
str_sub(phone, start = -4)
## [1] "8553"
  • Replace substrings with assignment
str_sub(bankrupt, start = 1, end = 1) <- "We"
bankrupt
## [1] "We… Declare… Bankruptcy!"
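Because str_sub() is vectorized, the same positions can be pulled out of every element of a character vector at once. A minimal sketch, assuming a made-up vector of phone numbers that all share the same layout:

```r
library(stringr)

# Hypothetical phone numbers, all formatted as ###-###-####
numbers <- c("800-800-8553", "555-123-4567")

area_code <- str_sub(numbers, start = 1, end = 3)
exchange  <- str_sub(numbers, start = 5, end = 7)
# Negative positions count backwards from the end of each string
last_four <- str_sub(numbers, start = -4, end = -1)

area_code
exchange
last_four
```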

Replacing words with str_replace()

If I want to replace a specific pattern of text with another pattern, str_replace() or str_replace_all() are very useful.

str_replace(bankrupt, "We", "I")
## [1] "I… Declare… Bankruptcy!"

Back with our phone number example, you’ll see there’s a difference between str_replace() and str_replace_all(). The first replaces only the first instance of the pattern; the second replaces every instance.

phone
## [1] "800-800-8553"
str_replace(phone, "800-", "")
## [1] "800-8553"
str_replace_all(phone, "800-", "")
## [1] "8553"

If I only want to change the second instance of “800-”, I’ll need to use a more complicated pattern match. This requires a regular expression.
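As a preview, here is one possible way to do it. This is only a sketch: it uses a capture group (parentheses) and a backreference ("\\1"), and the replacement text "XXX-" is just a placeholder I picked for illustration.

```r
library(stringr)

phone <- "800-800-8553"

# ^        : the match must start at the beginning of the string
# (800-)   : capture the FIRST "800-" so it can be put back
# 800-     : the SECOND "800-", which is the part we want to change
# "\\1"    : in the replacement, stands for whatever the group captured
second_changed <- str_replace(phone, "^(800-)800-", "\\1XXX-")
second_changed
```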

Regular Expressions

Intro

  • Regular expressions (regex or regexp) are a syntax for pattern matching in strings.

  • Regex structure is used in many different computer languages

  • str_replace() and str_replace_all() search for a pattern as defined by the regex and then replace it (all) with another string.

  • Wherever there is a pattern argument in a stringr function, you can use regex (to extract strings, get a logical if there is a match, etc…).

  • Regex includes special characters, e.g., “.” and “*”. These must be escaped using “\\” if you want to match their literal value.
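To make the escaping rule concrete, here is a small sketch with a made-up vector `files`. It also shows stringr's fixed() helper, which treats the whole pattern as literal text so that nothing needs escaping.

```r
library(stringr)

files <- c("notes.txt", "notes_txt", "a*b")

# Unescaped "." matches ANY character, so "notes_txt" matches too
any_char <- str_detect(files, "notes.txt")

# "\\." matches only a literal period
literal_dot <- str_detect(files, "notes\\.txt")

# fixed() turns off regex interpretation for the whole pattern
literal_star <- str_detect(files, fixed("a*b"))

any_char
literal_dot
literal_star
```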

Finding pattern matches with str_view() and str_view_all()

  • Basic usage: find exact match of a string pattern
fruit <- c("Apple", "Strawberry", "Banana", "Pear", "Blackberry", "*berry")
str_view(fruit, "an")
str_view_all(fruit, "an")
  • A period “.” matches any character.
  • A [:alpha:] matches any alphabetical character.
str_view_all(fruit, ".berry")
str_view_all(fruit, "[:alpha:]berry")
  • You can “escape” a symbol by putting two backslashes (“\\”) in front of it to match it literally. If you don’t, the asterisk in this case will be interpreted as a regular expression operator, not a symbol.
# str_view_all(fruit, "*berry")
str_view_all(fruit, "\\*berry")

Exercise: Use one function call to replace "love" and "loved" with "X" in the following.

love <- "Would I rather be feared or loved? Easy. Both. I want people to be afraid of how much they love me."
str_view_all(love, "love[:alpha:]*")
str_replace_all(love, "love[:alpha:]*", "X")
## [1] "Would I rather be feared or X? Easy. Both. I want people to be afraid of how much they X me."

Anchoring the regex so search starts at the beginning or at the end of a string

  • You can anchor a regex pattern so that it only matches at the start or at the end of a string.

  • ^ matches the start of the string, so the pattern after it must appear at the very beginning.

  • $ matches the end of the string, so the pattern before it must appear at the very end.

str_view(fruit, "^B")
str_view(fruit, "a$")

Exercise: Use str_replace() to replace all four letter words beginning with an "a" with "foo" in the following list:

x <- c("apple", "barn", "ape", "cart", "alas", "pain", "ally")
str_replace(x, "^a...$", "foo")
## [1] "apple" "barn"  "ape"   "cart"  "foo"   "pain"  "foo"

Special Characters

There are a lot of regular expression character classes in R, and I don’t expect you to memorize them all; I often have the cheatsheet open next to me while working. You should, however, be able to recognize these important ones:

type this:            to mean this:
\\n                   new line
\\s or [:space:]      any whitespace
\\d or [:digit:]      any digit
\\w                   any word character (letter, digit, or underscore)
[:alpha:]             any letter
[:punct:]             any punctuation
.                     any character except a new line
  • We’ll use this character vector for practice:
phones <- c("Abba: 555-1234", "Anna: 555-0987", "Andy: 555-7654")
  • \\d: matches any digit
str_view(phones, "\\d\\d\\d-\\d\\d\\d\\d")
  • \\s: matches any white space (e.g. space, tab, newline).
str_view(phones, "\\s")
  • [abc]: matches a, b, or c.
str_view(phones, "A[bn][bn]a")
  • [^abc]: matches anything except a, b, or c.
  • Note this is a different use of ^ since it is inside the [ ]
str_view(phones, "A[^b]")
  • abc|xyz: matches either abc or xyz. This is called alternation
  • You can use parentheses to control where the alternation occurs.
  • a(bc|xy)z matches either abcz or axyz.
str_view(phones, "An(na|dy)")
  • To ignore case, place a (?i) before the regex.
str_view("AB", "ab")
str_view("AB", "(?i)ab")
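Since str_view() only displays matches, here is a short sketch that uses the same kinds of patterns with str_replace() so the results can be stored and inspected. The replacement strings "XXX-XXXX" and "__" are placeholders chosen for illustration.

```r
library(stringr)

phones <- c("Abba: 555-1234", "Anna: 555-0987", "Andy: 555-7654")

# \\d matches one digit, so this pattern masks the whole phone number
masked <- str_replace(phones, "\\d\\d\\d-\\d\\d\\d\\d", "XXX-XXXX")
masked

# [^b] matches any character except "b", so "Abba" is left untouched
not_b <- str_replace(phones, "A[^b]", "__")
not_b
```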

Repetition using ?, +, *, {n}, {n,}, {0,m}, {n,m}

  • Can match a pattern multiple times in a row:
  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more
x <- c("A", "AA", "AAA", "AAAA", "B", "BB")
str_view_all(x, "^A?")
str_view_all(x, "^A+")
str_view_all(x, "^A*")
  • A more realistic example:
str_view_all("color and colour", "colou?r")
  • Control exactly how many repetitions allowed in a match:

  • {n}: exactly n.

  • {n,}: n or more.

  • {0,m}: at most m.

  • {n,m}: between n and m.

str_view_all(x, "A{2}")
str_view_all(x, "A{2,}")
str_view_all(x, "A{0,2}")
str_view_all(x, "A{3,4}")
  • Regex is “greedy” and will automatically match the longest string possible.
str_view("AAAA", "A*")
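Greedy matching can be switched off by adding a ? after a repetition operator, which makes it “lazy” (match as little as possible). Lazy quantifiers are not covered elsewhere in these notes, so treat the sketch below as an optional extra; the `html` string is a made-up example.

```r
library(stringr)

# Greedy: + grabs as many A's as it can
greedy <- str_extract("AAAA", "A+")
# Lazy: +? stops as soon as the pattern is satisfied
lazy <- str_extract("AAAA", "A+?")

html <- "<b>bold</b> and <i>italic</i>"
# Greedy .+ runs all the way to the LAST ">"
greedy_tag <- str_extract(html, "<.+>")
# Lazy .+? stops at the FIRST ">"
lazy_tag <- str_extract(html, "<.+?>")

greedy
lazy
greedy_tag
lazy_tag
```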

Exercise: Create regular expressions to find all words with the following patterns and replace the patterns with “X”:

  1. Start with three consonants. Test on
x1 <- c("string", "priority", "value", "distinction")
str_replace_all(x1, "^[^aeiouAEIOU]{3}", "X")
## [1] "Xing"        "priority"    "value"       "distinction"

There is a lot more to learn about regular expressions that we won’t cover here, like groups and lookarounds. Groups allow you to define which part of the expression you want to extract or replace, and lookarounds allow you to define what must follow or precede the expression without including it in the match. When you need to learn more, there are many tools online like https://regex101.com/ to help you learn. The one important thing to remember with online regular expression tools is that R needs an extra \ preceding each \ used in other coding languages (for example, \d is written "\\d" in R).
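For the curious, here is a brief sketch of what groups and lookarounds look like in stringr. It reuses the `phones` vector from earlier; the patterns are just one way to write these, not the only way.

```r
library(stringr)

phones <- c("Abba: 555-1234", "Anna: 555-0987", "Andy: 555-7654")

# Groups: str_match() returns a matrix with the full match in column 1
# and one extra column per parenthesized group
parts <- str_match(phones, "(\\w+): (\\d{3})-(\\d{4})")
who <- parts[, 2]

# Groups can be reused in a replacement with \\1, \\2, ...
swapped <- str_replace(phones, "(\\w+): (.*)", "\\2 (\\1)")

# Lookbehind: match digits only if they are preceded by a hyphen;
# the hyphen itself is not part of the match
last_four <- str_extract(phones, "(?<=-)\\d+")

who
swapped
last_four
```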

More stringr

There are a lot of functions to analyze, compare and adjust strings.

Changing Case

  • str_to_lower() and str_to_upper() convert all letters to lower or capital case.
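  • str_to_sentence() capitalizes the first letter of each sentence and converts everything else to lower case, including acronyms and the pronoun “I”.
  • str_to_title() capitalizes the first letter of every word and converts the remaining letters to lower case.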
cause <- "I have cause. It is beCAUSE I hate him."
str_to_lower(cause)
## [1] "i have cause. it is because i hate him."
str_to_upper(cause)
## [1] "I HAVE CAUSE. IT IS BECAUSE I HATE HIM."
str_to_sentence(cause)
## [1] "I have cause. It is because i hate him."
str_to_title(cause)
## [1] "I Have Cause. It Is Because I Hate Him."

Detecting matches

  • str_detect(): Returns TRUE if a regex pattern matches a string and FALSE if it does not. Very useful for filters.
## Get all John's and Joe's from the Lahman dataset
library(Lahman)
data("People")
People <- People %>% 
  as_tibble()
People %>%
  filter(str_detect(nameFirst, "^Jo(e|hn)$")) %>%
  select(nameFirst, nameLast) %>% 
  head()
## # A tibble: 6 x 2
##   nameFirst nameLast
##   <chr>     <chr>   
## 1 John      Abadie  
## 2 Joe       Abreu   
## 3 Joe       Adams   
## 4 Joe       Adcock  
## 5 Joe       Agler   
## 6 John      Ake
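To see why the anchors in that pattern matter, here is a sketch on a small made-up vector (no Lahman needed). Without ^ and $, any name that merely contains “Joe” or “John” would pass the filter.

```r
library(stringr)

first_names <- c("Joe", "John", "Joey", "Johnny", "Mary Jo")

# Anchored: the whole name must be exactly "Joe" or "John"
exact <- str_detect(first_names, "^Jo(e|hn)$")

# Unanchored: "Joe" or "John" may appear anywhere in the name
loose <- str_detect(first_names, "Jo(e|hn)")

exact
loose

# A logical vector can be used directly to subset
matched <- first_names[exact]
matched
```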

Counting Matches

  • str_count(): Counts the occurrence of a match within a string.
  • It counts non-overlapping matches, so each character can belong to at most one match.
str_view_all(c("banana", "coco"), "[^aeiou][aeiou]")
str_count(c("banana", "coco"), "[^aeiou][aeiou]")
## [1] 3 2
str_view_all("abababa", "aba")
str_count("abababa", "aba")
## [1] 2

Extracting Matches

  • str_extract() returns the first match for pattern.
  • str_extract_all() returns all matches but as a list.
colorstr <- str_c("red", "blue", "yellow", "orange", "brown", sep = "|")
colorstr
## [1] "red|blue|yellow|orange|brown"
str_view_all("I like blue and brown and that's it", colorstr)
str_extract("I like blue and brown and that's it", colorstr)
## [1] "blue"
str_extract_all("I like blue and brown and that's it", colorstr)
## [[1]]
## [1] "blue"  "brown"
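One thing to watch for with a pattern like colorstr is that it will also match inside longer words. The sketch below shows one possible fix using the word boundary \\b (an addition of mine, not covered above), plus the simplify argument of str_extract_all(), which returns a matrix instead of a list. The example sentences are made up.

```r
library(stringr)

colorstr <- str_c("red", "blue", "yellow", "orange", "brown", sep = "|")
sentence <- "I am bored of blue"

# Matches the "red" hiding inside "bored"
inside_word <- str_extract(sentence, colorstr)

# Wrap the alternatives in \\b( ... )\\b to require whole words
bounded <- str_c("\\b(", colorstr, ")\\b")
whole_word <- str_extract(sentence, bounded)

# simplify = TRUE gives a character matrix, padded with ""
color_matrix <- str_extract_all(c("blue and brown", "red"), colorstr,
                                simplify = TRUE)

inside_word
whole_word
color_matrix
```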

Combining strings

paste & paste0

Base R provides paste() and paste0() to join strings together (often referred to as concatenation). However, these tools have limits and are less useful in data science than glue, which I will teach below.
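For reference, here is a minimal sketch of how the two base R functions behave, using a made-up vector `first`. paste() separates its arguments with a space by default, while paste0() uses no separator.

```r
first <- c("Jim", "Pam")

# paste(): arguments are joined with sep = " " by default
with_space <- paste(first, "works here")

# paste0(): same idea, but with no separator
no_space <- paste0(first, "!")

# collapse joins the elements of the result into one string
one_string <- paste(first, collapse = " and ")

# One limitation: missing values are silently turned into the text "NA"
with_na <- paste("Age:", NA)

with_space
no_space
one_string
with_na
```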

glue

# install.packages("glue")

library(glue)
## 
## Attaching package: 'glue'

## The following object is masked from 'package:dplyr':
## 
##     collapse
name <- "Fred"
age <- 50
anniversary <- as.Date("1991-10-12")

glue('My name is {name}, my age next year is {age + 1}, and my anniversary is {format(anniversary, "%A, %B %d, %Y")}.') 
## My name is Fred, my age next year is 51, and my anniversary is Saturday, October 12, 1991.
# closest equivalent in paste (note the lost commas and the extra space before the period):
paste('My name is', name, 'my age next year is', age + 1, 'and my anniversary is', format(anniversary, "%A, %B %d, %Y"), '.')
## [1] "My name is Fred my age next year is 51 and my anniversary is Saturday, October 12, 1991 ."

glue evaluates any R expression placed inside curly brackets {} and inserts the result into the string, so you can use variables, arithmetic, and function calls directly within the text.

Use within dataframes:

employees <- tibble::tribble(
     ~Name,              ~Job,  ~Descriptor,
     "Jim",           "sales",     "quirky",
     "Pam",       "reception",   "artistic",
  "Angela",      "accounting",     "strict",
  "Dwight",           "sales",  "eccentric",
    "Toby", "Human Resources", "monotonous"
  ) 


employees  %>%
  mutate(description = glue("{Name} works in {str_to_title(Job)}"))
## # A tibble: 5 x 4
##   Name   Job             Descriptor description                  
##   <chr>  <chr>           <chr>      <glue>                       
## 1 Jim    sales           quirky     Jim works in Sales           
## 2 Pam    reception       artistic   Pam works in Reception       
## 3 Angela accounting      strict     Angela works in Accounting   
## 4 Dwight sales           eccentric  Dwight works in Sales        
## 5 Toby   Human Resources monotonous Toby works in Human Resources

Use glue_data() to build one string per row of a data frame, referring to its columns by name:

employees %>% 
  glue_data("{Name} is {Descriptor} and works in {str_to_title(Job)} at Dunder Mifflin")
## Jim is quirky and works in Sales at Dunder Mifflin
## Pam is artistic and works in Reception at Dunder Mifflin
## Angela is strict and works in Accounting at Dunder Mifflin
## Dwight is eccentric and works in Sales at Dunder Mifflin
## Toby is monotonous and works in Human Resources at Dunder Mifflin
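Note that glue() and glue_data() return one string per element or row. If you want a single string instead, glue_collapse() is one option. The sketch below uses a made-up vector `staff`.

```r
library(glue)

staff <- c("Jim", "Pam", "Dwight")

# glue() is vectorized: one output string per element of staff
lines <- glue("{staff} works at Dunder Mifflin")
length(lines)

# glue_collapse() joins a vector into ONE string;
# last sets the separator used before the final element
everyone <- glue_collapse(staff, sep = ", ", last = " and ")
everyone
```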

More Resources