Working with strings and `stringr`
Learning Outcomes
- Manipulate strings with the stringr package.
- Employ regular expressions (regex) to manipulate strings.
Intro to strings and character vectors
- In R, strings (also called “characters”) are created and displayed within quotes:
x <- "I am a string!"
x
## [1] "I am a string!"
class(x)
## [1] "character"
- Anything within quotes is a string, even numbers!
y <- "3"
class(y)
## [1] "character"
- You can have a vector of strings.
z <- c("I", "am", "a", "string", "vector")
z[2:3]
## [1] "am" "a"
- The backslash "\" means what is after the backslash is special in some way.
  - For example, if you want to put a quotation mark in a string, you can “escape” the quotation mark with a backslash.
a <- "As Michael Scott said, \"I'm not superstitious, but I am a little stitious.\""
a
## [1] "As Michael Scott said, \"I'm not superstitious, but I am a little stitious.\""
- cat() will print out the string itself.
- print() will print out the printed representation of the string (with backslashes and all).
print(x)
## [1] "I am a string!"
cat(x)
## I am a string!
print(y)
## [1] "3"
cat(y)
## 3
print(z)
## [1] "I" "am" "a" "string" "vector"
cat(z)
## I am a string vector
print(a)
## [1] "As Michael Scott said, \"I'm not superstitious, but I am a little stitious.\""
cat(a)
## As Michael Scott said, "I'm not superstitious, but I am a little stitious."
- "\n" represents a new line.
n <- "I'm not superstitious,\nbut I am a little stitious."
cat(n)
## I'm not superstitious,
## but I am a little stitious.
# what happens if we put spaces around \n?
- "\t" represents a tab.
t <- "I'm not superstitious,\tbut I am a little stitious."
cat(t)
## I'm not superstitious, but I am a little stitious.
- You can add any Unicode character with a \u followed by the hexadecimal Unicode representation of that character.
  - Be careful about whether knitr will accept it though!
mu <- "\u00b5"
cat(mu)
## µ
cat(c("(","\u2310","\u25A0","\u2022","\u25A0",")"))
## ( ⌐ ■ • ■ )
# http://pages.ucsd.edu/~dkjordan/resources/unicodemaker.html
stringr
- The stringr package has functions to make manipulating strings easier (more user-friendly than base R’s grep() and gsub()).
- stringr is part of the tidyverse, so loading the tidyverse loads it; you do not have to load it separately.
library(tidyverse)
All of stringr’s functions begin with “str_”, so in RStudio you can
press Tab after typing “str_” and a list of possible string
manipulation functions will pop up.
For example, use str_length() to get the number of characters in a string.
beyonce <- "I am Beyoncé, always."
str_length(beyonce)
## [1] 21
What about spaces and punctuation marks? They count! What about escaped characters? The backslash “\” does not count, but the character it escapes does, so “\n” adds exactly one character.
str_length("I am Beyoncé, \nalways.")
## [1] 22
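As a quick check of how escapes are counted, here is a minimal sketch that applies str_length() to a few strings. The vector `escaped` and its contents are made up for illustration and are not part of the original notes.

```r
library(stringr)

# Hypothetical examples: each escape sequence stands for a single character.
escaped <- c("tab\there", "quote\"here", "plain")

# str_length() is vectorised, so it returns one count per string.
n_chars <- str_length(escaped)
n_chars
```

The tab and the escaped quotation mark each count once, just like the "\n" above.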
Combining Strings with str_c()
str_c() joins two or more strings into one:
x <- "Would I rather be feared or loved?"
y <- "Easy. Both."
z <- "I want people to be afraid of how much they love me."
str_c(x, y, z)
## [1] "Would I rather be feared or loved?Easy. Both.I want people to be afraid of how much they love me."
The default is to separate strings by nothing, but you can use sep to change the separator.
str_c(x, y, z, sep = " ")
## [1] "Would I rather be feared or loved? Easy. Both. I want people to be afraid of how much they love me."
Just like c(), str_c() can take multiple arguments.
str_c("Where", "are", "the", "turtles?!", sep = " ")
## [1] "Where are the turtles?!"
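str_c() also works element by element on vectors, and its collapse argument joins the result into a single string. The sketch below assumes two made-up vectors, `first` and `last`, purely for illustration.

```r
library(stringr)

# Hypothetical inputs for illustration.
first <- c("Jim", "Pam", "Dwight")
last  <- c("Halpert", "Beesly", "Schrute")

# Element-wise: one output string per pair of inputs.
full_names <- str_c(first, last, sep = " ")

# collapse joins the elements of a vector into a single string.
roll_call <- str_c(full_names, collapse = ", ")

full_names
roll_call
```

Use sep when you are gluing several arguments together and collapse when you want to flatten one vector.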
Subsetting substrings with str_sub()
str_sub() extracts the substring between two character positions, start and end (both inclusive).
bankrupt <- "I… Declare… Bankruptcy!"
str_sub(bankrupt, start = 4, end = 10)
## [1] "Declare"
You often want to use str_sub() when the data is highly structured:
phone <- "800-800-8553"
# first three
str_sub(phone, end = 3)
## [1] "800"
# last four
str_sub(phone, start = -4)
## [1] "8553"
- Replace substrings with assignment
str_sub(bankrupt, start = 1, end = 1) <- "We"
bankrupt
## [1] "We… Declare… Bankruptcy!"
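Because str_sub() is vectorised, the same positions can be applied to a whole column of structured values. This is a small sketch under the assumption that every value follows the same NNN-NNN-NNNN layout; `phones_raw` is a hypothetical vector.

```r
library(stringr)

# Hypothetical phone numbers that all share one fixed layout.
phones_raw <- c("800-800-8553", "212-555-0199")

# Positive positions count from the left, negative ones from the right.
area_code <- str_sub(phones_raw, start = 1, end = 3)
last_four <- str_sub(phones_raw, start = -4)

# The assignment form replaces the selected span in every element.
masked <- phones_raw
str_sub(masked, start = 1, end = 7) <- "XXX-XXX"

area_code
last_four
masked
```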
Replacing words with str_replace()
If I want to replace a specific pattern of text with another pattern, str_replace() or str_replace_all() are very useful.
str_replace(bankrupt, "We", "I")
## [1] "I… Declare… Bankruptcy!"
Back to our phone number example, you’ll see there’s a difference
between str_replace() and str_replace_all(). The first replaces only
the first instance; the second replaces every instance.
phone
## [1] "800-800-8553"
str_replace(phone, "800-", "")
## [1] "800-8553"
str_replace_all(phone, "800-", "")
## [1] "8553"
If I only want to change the second instance of “800-”, I’ll need to use a more complicated pattern match. This would require a regular expression.
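One possible way to touch only the second “800-” is sketched here. It uses a capture group and a backreference, which these notes do not otherwise cover, so treat it as a preview rather than the intended solution. The replacement text "XXX-" is an arbitrary choice for illustration.

```r
library(stringr)

phone <- "800-800-8553"

# ^(800-) captures the first "800-" at the start of the string.
# The second "800-" is matched but not captured.
# In the replacement, "\\1" puts the captured text back,
# so only the second instance is actually changed.
second_only <- str_replace(phone, "^(800-)800-", "\\1XXX-")
second_only
```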
Regular Expressions
Intro
- Regular expressions (regex or regexp) are a syntax for pattern matching in strings.
- Regex structure is used in many different computer languages.
- str_replace() and str_replace_all() search for a pattern as defined by the regex and then replace it (all) with another string.
- Wherever there is a pattern argument in a stringr function, you can use regex (to extract strings, get a logical if there is a match, etc.).
- Regex includes special characters, e.g., “.” and “*”. These must be escaped using “\\” if you want to match their normal value.
Finding pattern matches with str_view() and str_view_all()
- Basic usage: find exact match of a string pattern. (In stringr 1.5.0 and later, str_view() highlights every match and str_view_all() is deprecated.)
fruit <- c("Apple", "Strawberry", "Banana", "Pear", "Blackberry", "*berry")
str_view(fruit, "an")
str_view_all(fruit, "an")
- A period “.” matches any character.
- A [:alpha:] matches any alphabetical character.
str_view_all(fruit, ".berry")
str_view_all(fruit, "[:alpha:]berry")
- You can “escape” a symbol with two backslashes (for example “\\.” or “\\*”) to match it literally. If you don’t, the asterisk in this case will be interpreted as a regular expression command, not a symbol.
# str_view_all(fruit, "*berry")
str_view_all(fruit, "\\*berry")
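To see the effect of escaping without relying on the viewer, the sketch below compares an unescaped pattern, an escaped one, and stringr's fixed() helper, which treats the whole pattern as literal text. The vector `files` is hypothetical.

```r
library(stringr)

# Hypothetical strings for illustration.
files <- c("notes.txt", "notes_txt", "a*b", "ab")

# Unescaped: "." matches any character, so "notes_txt" also matches.
dot_any <- str_detect(files, "notes.txt")

# Escaped: "\\." matches only a literal period.
dot_literal <- str_detect(files, "notes\\.txt")

# fixed() switches regex off entirely, so "*" needs no escaping.
star_literal <- str_detect(files, fixed("a*b"))

dot_any
dot_literal
star_literal
```

fixed() is often the simpler choice when the pattern contains no regex features at all.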
Exercise: Use one function call to replace "love" and "loved" with "X" in the following.
love <- "Would I rather be feared or loved? Easy. Both. I want people to be afraid of how much they love me."
str_view_all(love, "love[:alpha:]*")
str_replace_all(love, "love[:alpha:]*", "X")
## [1] "Would I rather be feared or X? Easy. Both. I want people to be afraid of how much they X me."
Anchoring the regex so search starts at the beginning or at the end of a string
- You can anchor the regex pattern so that it only matches at the start or at the end of a string.
- ^ forces the match to begin at the start of the string.
- $ forces the match to finish at the end of the string.
str_view(fruit, "^B")
str_view(fruit, "a$")
Exercise: Use str_replace() to replace all four-letter words beginning with an "a" with "foo" in the following list:
x <- c("apple", "barn", "ape", "cart", "alas", "pain", "ally")
str_replace(x, "^a...$", "foo")
## [1] "apple" "barn" "ape" "cart" "foo" "pain" "foo"
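Anchors are easiest to check with functions that return values instead of a viewer. This sketch reuses the fruit vector from above with str_subset() and str_detect(); using both anchors together requires the whole string to match.

```r
library(stringr)

fruit <- c("Apple", "Strawberry", "Banana", "Pear", "Blackberry", "*berry")

# Keep only the strings that start with "B".
starts_with_b <- str_subset(fruit, "^B")

# Keep only the strings that end with "berry".
ends_with_berry <- str_subset(fruit, "berry$")

# With both anchors, nothing may come before or after the pattern.
exactly_pear <- str_detect(fruit, "^Pear$")

starts_with_b
ends_with_berry
exactly_pear
```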
Special Characters
There are a lot of regular expression character matches in R and I don’t expect you to memorize them all; I often have the cheatsheet open next to me while working. Some important ones you should, however, be able to recognize:
type this: | to mean this: |
---|---|
\\n | new line |
\\s or [:space:] | any whitespace |
\\d or [:digit:] | any digit |
\\w | any word character (letter, digit, or underscore) |
[:alpha:] | any letter |
[:punct:] | any punctuation |
. | any character except new line |
- We’ll use this character vector for practice:
phones <- c("Abba: 555-1234", "Anna: 555-0987", "Andy: 555-7654")
- \\d: matches any digit.
str_view(phones, "\\d\\d\\d-\\d\\d\\d\\d")
- \\s: matches any white space (e.g. space, tab, newline).
str_view(phones, "\\s")
- [abc]: matches a, b, or c.
str_view(phones, "A[bn][bn]a")
- [^abc]: matches anything except a, b, or c.
  - Note this is a different use of ^ since it is inside the [ ].
str_view(phones, "A[^b]")
- abc|xyz: matches either abc or xyz. This is called alternation.
  - You can use parentheses to control where the alternation occurs. a(bc|xy)z matches either abcz or axyz.
str_view(phones, "An(na|dy)")
- To ignore case, place a (?i) before the regex.
str_view("AB", "ab")
str_view("AB", "(?i)ab")
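The special characters above can be combined with str_extract() and str_detect() to pull values out of the phones vector. As an alternative to (?i), stringr's regex() helper takes an ignore_case argument; the sketch below uses it as one option, not as the approach these notes prescribe.

```r
library(stringr)

phones <- c("Abba: 555-1234", "Anna: 555-0987", "Andy: 555-7654")

# \\d matches one digit, so this pattern describes NNN-NNNN.
numbers <- str_extract(phones, "\\d\\d\\d-\\d\\d\\d\\d")

# Parentheses limit the alternation to "bb" or "nn";
# the outer | then offers "Andy" as a second option.
names_only <- str_extract(phones, "A(bb|nn)a|Andy")

# regex(..., ignore_case = TRUE) makes the match case-insensitive.
has_anna <- str_detect(phones, regex("ANNA", ignore_case = TRUE))

numbers
names_only
has_anna
```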
Repetition using ?, +, *, {n}, {n,}, {0,m}, {n,m}
- Can match a pattern multiple times in a row:
  - ?: 0 or 1
  - +: 1 or more
  - *: 0 or more
x <- c("A", "AA", "AAA", "AAAA", "B", "BB")
str_view_all(x, "^A?")
str_view_all(x, "^A+")
str_view_all(x, "^A*")
- A more realistic example:
str_view_all("color and colour", "colou?r")
- Control exactly how many repetitions are allowed in a match:
  - {n}: exactly n.
  - {n,}: n or more.
  - {0,m}: at most m.
  - {n,m}: between n and m.
str_view_all(x, "A{2}")
str_view_all(x, "A{2,}")
str_view_all(x, "A{0,2}")
str_view_all(x, "A{3,4}")
- Regex is “greedy” and will automatically match the longest string possible.
str_view("AAAA", "A*")
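Greediness can be checked directly with str_extract(). The sketch below also shows the lazy form, written by adding ? after a quantifier, which is not covered elsewhere in these notes and is included only as a contrast.

```r
library(stringr)

# Greedy: + takes as many A's as it can.
greedy <- str_extract("AAAA", "A+")

# Lazy: +? stops as soon as the pattern is satisfied.
lazy <- str_extract("AAAA", "A+?")

# Bounded: {2,3} is still greedy, so it takes the upper limit when available.
bounded <- str_extract("AAAA", "A{2,3}")

greedy
lazy
bounded
```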
Exercise: Create regular expressions to find all words with the following patterns and replace the patterns with “X”:
- Start with three consonants. Test on
x1 <- c("string", "priority", "value", "distinction")
str_replace_all(x1, "^[^aeiouAEIOU]{3}", "X")
## [1] "Xing" "priority" "value" "distinction"
There is a lot more to learn about regular expressions that we won’t cover here,
like groups and lookarounds. Groups allow you to define which part of the
expression you want to extract or replace, and lookarounds allow you to define
what follows or precedes the expression. When you need to learn more, there are
many tools online, like https://regex101.com/, to help
you learn. The only important thing to remember with online regular expression
tools is that R needs an extra \ preceding each \ used in other coding
languages.
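As an illustration of the extra backslash, suppose a pattern tested in an online tool reads \d+\.\d+ (a made-up example). In an ordinary R string every backslash is doubled. Raw strings, available from R 4.0.0, are another option that lets you paste the pattern unchanged.

```r
library(stringr)

# Pattern as it would be typed in an online tester: \d+\.\d+
# In a regular R string each backslash must be doubled.
pattern_escaped <- "\\d+\\.\\d+"

# Raw string syntax (R >= 4.0.0): backslashes are taken literally.
pattern_raw <- r"(\d+\.\d+)"

# Both spellings produce the same pattern text.
same_pattern <- identical(pattern_escaped, pattern_raw)

price <- str_extract("Price: 12.50 USD", pattern_escaped)

same_pattern
price
```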
More stringr
There are a lot of functions to analyze, compare and adjust strings.
Changing Case
- str_to_lower() and str_to_upper() convert all letters to lower or capital case.
- str_to_sentence() capitalizes the first letter of each sentence and lowercases everything else, including acronyms.
- str_to_title() converts the first letter of every word to capital case.
cause <- "I have cause. It is beCAUSE I hate him."
str_to_lower(cause)
## [1] "i have cause. it is because i hate him."
str_to_upper(cause)
## [1] "I HAVE CAUSE. IT IS BECAUSE I HATE HIM."
str_to_sentence(cause)
## [1] "I have cause. It is because i hate him."
str_to_title(cause)
## [1] "I Have Cause. It Is Because I Hate Him."
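A common use of the case functions is to normalize text before comparing or counting it. The sketch below assumes a hypothetical vector `jobs` in which the same value was typed with inconsistent capitalization.

```r
library(stringr)

# Hypothetical, inconsistently capitalized values.
jobs <- c("SALES", "Sales", "sales", "Reception")

# Lowercasing first makes the three spellings of "sales" identical.
distinct_jobs <- unique(str_to_lower(jobs))

# The same trick gives a case-insensitive comparison of two strings.
same_word <- str_to_lower("beCAUSE") == str_to_lower("Because")

distinct_jobs
same_word
```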
Detecting matches
- str_detect(): Returns TRUE if a regex pattern matches a string and FALSE if it does not. Very useful for filters.
## Get all Johns and Joes from the Lahman dataset
library(Lahman)
data("People")
People <- People %>%
  as_tibble()
People %>%
  filter(str_detect(nameFirst, "^Jo(e|hn)$")) %>%
  select(nameFirst, nameLast) %>%
  head()
## # A tibble: 6 x 2
## nameFirst nameLast
## <chr> <chr>
## 1 John Abadie
## 2 Joe Abreu
## 3 Joe Adams
## 4 Joe Adcock
## 5 Joe Agler
## 6 John Ake
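The same pattern can be tried on a small vector to see exactly which names pass the anchored match. `first_names` is a made-up vector; it includes an NA to show that str_detect() returns NA for missing values, which filter() then drops.

```r
library(stringr)

# Hypothetical first names, including a missing value.
first_names <- c("John", "Joe", "Joey", "Johnny", "Jon", NA)

# Anchors on both sides mean only "Joe" or "John" exactly can match.
is_match <- str_detect(first_names, "^Jo(e|hn)$")

# which() ignores NA, mirroring how filter() drops rows with NA conditions.
kept <- first_names[which(is_match)]

is_match
kept
```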
Counting Matches
- str_count(): Counts the occurrences of a match within a string.
  - It counts non-overlapping matches.
str_view_all(c("banana", "coco"), "[^aeiou][aeiou]")
str_count(c("banana", "coco"), "[^aeiou][aeiou]")
## [1] 3 2
str_view_all("abababa", "aba")
str_count("abababa", "aba")
## [1] 2
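Two more small counts, as a sketch: vowels per word, and a second non-overlapping case. The vector `words` is hypothetical.

```r
library(stringr)

# Hypothetical words for illustration.
words <- c("banana", "coco", "rhythm")

# One count per element; a string with no match gives 0.
vowels <- str_count(words, "[aeiou]")

# "aaaa" contains "aa" at positions 1-2 and 3-4;
# the overlapping match at positions 2-3 is not counted.
pairs <- str_count("aaaa", "aa")

vowels
pairs
```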
Extracting Matches
- str_extract() returns the first match for a pattern.
- str_extract_all() returns all matches, but as a list.
colorstr <- str_c("red", "blue", "yellow", "orange", "brown", sep = "|")
colorstr
## [1] "red|blue|yellow|orange|brown"
str_view_all("I like blue and brown and that's it", colorstr)
str_extract("I like blue and brown and that's it", colorstr)
## [1] "blue"
str_extract_all("I like blue and brown and that's it", colorstr)
## [[1]]
## [1] "blue" "brown"
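When the input is a vector, str_extract() returns NA for elements without a match, and str_extract_all() returns a list with one element per input. The sketch below uses a hypothetical vector `sentences` and base R's lengths() to count matches per element.

```r
library(stringr)

colorstr <- str_c("red", "blue", "yellow", "orange", "brown", sep = "|")

# Hypothetical sentences for illustration.
sentences <- c("I like blue and brown", "red only", "nothing to find")

# First match per element, NA where nothing matches.
first_color <- str_extract(sentences, colorstr)

# A list: one character vector of matches per element.
all_colors <- str_extract_all(sentences, colorstr)

# lengths() gives the number of matches found in each element.
n_colors <- lengths(all_colors)

first_color
n_colors
```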
Combining strings
paste & paste0
Base R provides us with a useful tool to collapse strings together (often referred to as concatenation). However, this tool has limits and is less useful in data science than glue, which I will teach below.
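For reference, here is a short sketch of the two base R functions. The strings are arbitrary examples; the only difference between the functions is that paste() separates its arguments with a space by default and paste0() uses no separator.

```r
# paste() inserts a space between arguments by default.
with_space <- paste("Dunder", "Mifflin")

# paste0() inserts nothing between arguments.
no_space <- paste0("Dunder", "Mifflin")

# sep changes the separator between arguments.
custom_sep <- paste("2005", "03", "24", sep = "-")

# collapse joins the elements of one vector into a single string.
one_string <- paste(c("Jim", "Pam"), collapse = " & ")

with_space
no_space
custom_sep
one_string
```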
glue
# install.packages("glue")
library(glue)
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
name <- "Fred"
age <- 50
anniversary <- as.Date("1991-10-12")
glue('My name is {name}, my age next year is {age + 1}, and my anniversary is {format(anniversary, "%A, %B %d, %Y")}.')
## My name is Fred, my age next year is 51, and my anniversary is Saturday, October 12, 1991.
#equivalent in paste:
paste('My name is', name, 'my age next year is', age + 1, 'and my anniversary is', format(anniversary, "%A, %B %d, %Y"), '.')
## [1] "My name is Fred my age next year is 51 and my anniversary is Saturday, October 12, 1991 ."
glue evaluates the R expressions placed inside curly brackets {} and inserts the results into the string.
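Two details are worth trying out, sketched here with a made-up vector `x`: glue() is vectorised over the values it interpolates, and doubled braces produce a literal brace.

```r
library(glue)

# Hypothetical input for illustration.
x <- 1:3

# Any R expression can go inside the braces; one string per element of x.
squares <- glue("{x} squared is {x^2}")

# Doubled braces are not evaluated; they print as single literal braces.
literal <- glue("Write {{x}} to insert the value of x")

as.character(squares)
as.character(literal)
```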
Use within dataframes:
employees <- tibble::tribble(
  ~Name, ~Job, ~Descriptor,
  "Jim", "sales", "quirky",
  "Pam", "reception", "artistic",
  "Angela", "accounting", "strict",
  "Dwight", "sales", "eccentric",
  "Toby", "Human Resources", "monotonous"
)
employees %>%
  mutate(description = glue("{Name} works in {str_to_title(Job)}"))
## # A tibble: 5 x 4
## Name Job Descriptor description
## <chr> <chr> <chr> <glue>
## 1 Jim sales quirky Jim works in Sales
## 2 Pam reception artistic Pam works in Reception
## 3 Angela accounting strict Angela works in Accounting
## 4 Dwight sales eccentric Dwight works in Sales
## 5 Toby Human Resources monotonous Toby works in Human Resources
Use glue_data() to turn each row of a data frame into one line of text output:
employees %>%
  glue_data("{Name} is {Descriptor} and works in {str_to_title(Job)} at Dunder Mifflin")
## Jim is quirky and works in Sales at Dunder Mifflin
## Pam is artistic and works in Reception at Dunder Mifflin
## Angela is strict and works in Accounting at Dunder Mifflin
## Dwight is eccentric and works in Sales at Dunder Mifflin
## Toby is monotonous and works in Human Resources at Dunder Mifflin
More Resources
- Chapter 14 of RDS.
- R Strings Cheatsheet
- R Regex Cheatsheet
- Stringr Overview
- glue Overview