Stringr and Regular Expressions
The goal of this tutorial is to introduce string manipulation and regular expression (wildcard) matching using the stringr package.
The material covered here is also explained in greater detail in Chapter 14 of the course textbook R for Data Science by Hadley Wickham and Garrett Grolemund. I encourage you to use that book for clarification or greater depth. It is free online.
stringr is part of the tidyverse and is loaded when you load tidyverse. It provides a set of functions with consistent syntax for manipulating strings.
Prerequisites
To use the functions in this tutorial in your own R session you would
need to use library to load the tidyverse
package as shown below. I’ve pre-loaded it for this tutorial.
library(tidyverse)
Detect strings with str_detect()
The str_detect() function looks for the presence of a
search pattern in a string and returns TRUE if it detects the
pattern.
For example, consider the following 10 names:
## [1] "Appie" "Roxana" "Yvonne" "Vivien" "Dorris" "Concetta"
## [7] "Kathleen" "Azalee" "Faith" "Hughey"
Run the code below to search for names that contain “a”:
bnames10 %>% str_detect("a")
str_view() performs a similar task but creates a
visualization rather than returning output:
bnames10 %>% str_view("a")
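To make the return value concrete, here is a minimal sketch on a small made-up vector (snacks is hypothetical, not one of the tutorial's name lists). It shows that str_detect() returns one TRUE or FALSE per element, which you can then count with sum().

```r
library(stringr)

# Hypothetical example data, not part of the tutorial's datasets
snacks <- c("banana", "cherry", "kiwi")

# One logical value per element: does the string contain "an"?
has_an <- str_detect(snacks, "an")
has_an
## [1]  TRUE FALSE FALSE

# TRUE counts as 1, so sum() gives the number of matching strings
sum(has_an)
## [1] 1
```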
Question: Is str_detect() case sensitive?
Practice: Use str_detect() to detect names with “A”.
bnames10 %>%
Subset strings with str_subset()
If we want to only retain those strings that match our pattern, we
use str_subset().
Run the code below to retain names that contain “a”:
bnames10 %>% str_subset("a")
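One way to think about str_subset() is as a shortcut for indexing with str_detect(). The sketch below, on a made-up vector (snacks is hypothetical), checks that the two approaches give the same result.

```r
library(stringr)

# Hypothetical example data
snacks <- c("banana", "cherry", "kiwi", "mango")

# Keep only the strings that contain "an"
kept <- str_subset(snacks, "an")
kept
## [1] "banana" "mango"

# The same result, built by hand from str_detect()
same <- snacks[str_detect(snacks, "an")]
identical(kept, same)
## [1] TRUE
```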
Extract matches with str_extract()
If we want to extract the matching part of the string, we can use
str_extract():
bnames10 %>% str_extract("a")
Practice: Go ahead and try it, with the pattern of your choice.
This doesn’t seem very useful right now, but it will…
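As a small preview of why it becomes useful, here is a sketch that pulls the first run of digits out of each string. The rooms vector is made up for illustration, and the pattern uses a character set and a quantifier, both of which are explained later in this tutorial. Note that strings without a match come back as NA.

```r
library(stringr)

# Hypothetical example data
rooms <- c("room 12", "suite 7b", "lobby")

# Extract the first run of one or more digits from each string
digits <- str_extract(rooms, "[0-9]+")
digits
## [1] "12" "7"  NA
```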
Regular Expressions, part 1
Regular expressions are a powerful syntax for specifying wildcards and doing complex pattern matching. They are a component of most (or all) modern computer languages. We will only scratch the surface here, but this should be enough to do a fair number of basic tasks.
The wildcard character.
The . character matches any character, so if we want to
find names where there are two “a”s separated by any character, then:
(note we are using a list of 1000 names now)
bnames1000 %>% str_subset("a.a")
Regular expressions can get pretty confusing. By adding the
match = TRUE argument to str_view() we get a nice
visualization that is similar to str_subset():
bnames1000 %>% str_view("a.a", match = TRUE)
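If you want to see exactly which characters the wildcard consumed, combining str_subset() with str_extract() can help. This is a sketch on a made-up vector (words is hypothetical). Notice that "aa" does not match, because . must match exactly one character.

```r
library(stringr)

# Hypothetical example data
words <- c("banana", "data", "alpha", "aa")

# Which strings contain a, then any one character, then a?
matched <- str_subset(words, "a.a")
matched
## [1] "banana" "data"

# What did the pattern actually match in each string?
pieces <- str_extract(words, "a.a")
pieces
## [1] "ana" "ata" NA    NA
```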
Now you try it: use str_subset() or
str_view() on the bnames1000 list to show
names where there are two “i”s separated by two characters. You should
get 24 matches.
Escaping characters
But what if you want to actually find a period? Just using
str_detect() doesn’t work because . matches
any character, so every non-empty string matches:
test <- c("has.a.period", "has_no_period", "hasNoPunctuation")
test %>% str_detect(".")
There are two options:
- Use \\ to ‘escape’ it, that is, to remove its special meaning. Somewhat confusingly, we have to use \\ because we have to escape the \ itself.
test <- c("has.a.period", "has_no_period", "hasNoPunctuation")
test %>% str_detect("\\.")
- If you do not need to use a regular expression at all in your search
string, you can embed it within fixed(), which indicates that everything in the pattern should be used literally.
test <- c("has.a.period", "has_no_period", "hasNoPunctuation")
test %>% str_detect(fixed("."))
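The double backslash is easier to accept once you see what the string actually contains. The sketch below stores the pattern in a variable (pattern is just an illustrative name) and uses base R's nchar() and writeLines() to show that "\\." is only two characters, a backslash and a period, by the time the regular expression engine sees it.

```r
library(stringr)

# In R source code we type two backslashes...
pattern <- "\\."

# ...but the string itself holds only a backslash and a period
nchar(pattern)
## [1] 2
writeLines(pattern)
## \.

test <- c("has.a.period", "has_no_period", "hasNoPunctuation")

# Unescaped: . matches any character, so everything matches
any_char <- str_detect(test, ".")
any_char
## [1] TRUE TRUE TRUE

# Escaped or fixed: only a literal period matches
escaped <- str_detect(test, pattern)
literal <- str_detect(test, fixed("."))
escaped
## [1]  TRUE FALSE FALSE
identical(escaped, literal)
## [1] TRUE
```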
Specifying the number of matches.
The number of matches can be specified by appending special
characters. (This is taken from the R help file on
regexp.)
- * : the preceding item will be matched 0 or more times.
- ? : the preceding item will be matched 0 or 1 times.
- + : the preceding item will be matched 1 or more times.
- {n} : the preceding item is matched exactly n times.
- {n,} : the preceding item is matched n or more times.
- {n,m} : the preceding item is matched at least n times, but not more than m times.
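To compare the quantifiers side by side, here is a sketch that tests each one against the same made-up strings (x is hypothetical), which differ only in how many "a"s sit between "c" and "t".

```r
library(stringr)

# Hypothetical strings with 0, 1, 2, 3, and 4 "a"s between c and t
x <- c("ct", "cat", "caat", "caaat", "caaaat")

star     <- str_detect(x, "ca*t")      # 0 or more
question <- str_detect(x, "ca?t")      # 0 or 1
plus     <- str_detect(x, "ca+t")      # 1 or more
exactly2 <- str_detect(x, "ca{2}t")    # exactly 2
atleast2 <- str_detect(x, "ca{2,}t")   # 2 or more
from2to3 <- str_detect(x, "ca{2,3}t")  # between 2 and 3

star
## [1] TRUE TRUE TRUE TRUE TRUE
question
## [1]  TRUE  TRUE FALSE FALSE FALSE
plus
## [1] FALSE  TRUE  TRUE  TRUE  TRUE
exactly2
## [1] FALSE FALSE  TRUE FALSE FALSE
atleast2
## [1] FALSE FALSE  TRUE  TRUE  TRUE
from2to3
## [1] FALSE FALSE  TRUE  TRUE FALSE
```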
Practice: Start with the even larger
bnamesearly and find all names where there is an “e”
followed by 1 or 2 “z”s. Your results should look like this:
## [1] "Inez" "Hezekiah" "Jabez" "Dezzie" "Cortez" "Nezzie"
## [7] "Ebenezer" "Plez" "Onezia" "Hezzie"
bnamesearly %>% str_subset("")
Regular Expressions, part 2
Character sets
If we want to match any of a particular set of characters we can
include the group in [].
Practice: Experiment with [] to find
all names in the bnames100 list that start with a vowel.
People’s names start with capital letters (You should find 19):
bnames100 %>% str_subset("[]")
We can also use character sets to exclude a particular set of
characters by placing a ^ as the first character in our
brackets [^].
Practice: Experiment with [^] to find
all letters in the built-in letters vector that are not vowels. (You should
find 21):
letters %>% str_subset("[^]")
You can also specify ranges using “-” so that [0-5]
matches numbers from 0 to 5 and [j-n] matches j, k, l, m,
or n, [A-Z] is all capital letters, etc. There are also
some predefined classes but generally I do not recommend using them
because the definitions change depending on your locale.
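Here is a short sketch of ranges and negated sets on a made-up vector (codes is hypothetical). Keep in mind that a set matches a single character, so str_subset() keeps any string that contains at least one character from the set, or, for [^ ], at least one character outside it.

```r
library(stringr)

# Hypothetical example data
codes <- c("a1", "b7", "zz", "42")

# Contains at least one digit from 0 to 5
low_digit <- str_subset(codes, "[0-5]")
low_digit
## [1] "a1" "42"

# Contains at least one letter from j to z
late_letter <- str_subset(codes, "[j-z]")
late_letter
## [1] "zz"

# Contains at least one character that is NOT a digit
non_digit <- str_subset(codes, "[^0-9]")
non_digit
## [1] "a1" "b7" "zz"
```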
Or
Somewhat related is the | character that serves as an
or:
bnamesall %>% str_subset("Jack|Jill") %>% head(15)
Parentheses are used to indicate the boundaries of the or statement. Compare the following two code chunks:
No parentheses:
bnamesall %>% str_subset("Jack|Jilly")
With parentheses:
bnamesall %>% str_subset("(Jack|Jill)y")
Why are these last two different? The first returns all names that contain “Jack” or “Jilly”, whereas the second returns all names that contain “Jacky” or “Jilly”.
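The same effect shows up on a smaller scale. This sketch uses a made-up vector (words is hypothetical) to show how the parentheses limit what the | applies to.

```r
library(stringr)

# Hypothetical example data
words <- c("gray", "grey", "greedy", "gravy")

# Without parentheses: contains "gra" OR contains "ey"
no_parens <- str_subset(words, "gra|ey")
no_parens
## [1] "gray"  "grey"  "gravy"

# With parentheses: "gr", then "a" or "e", then "y"
with_parens <- str_subset(words, "gr(a|e)y")
with_parens
## [1] "gray" "grey"
```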
Practice: The object fruit contains 80
fruit names.
1) Find all of those that contain “berry” or “pepper”. (There will be 16)
fruit %>% str_subset("")
2) Find all fruit names that have a space “ ” preceding “berry” or “pepper”. (There will be 4)
fruit %>% str_subset("")
3) Find all fruit names that contain “berry” or “pepper” but that do NOT have a preceding space. (There will be 12)
fruit %>% str_subset("")
Regular Expressions, part 3
Earlier you were asked to find all names that start with a vowel.
This was relatively easy because the first letter of the
bnames list is capitalized. But what if you wanted all
fruit that started with a vowel? Luckily we can specify the beginning of
a string with ^ and the end of a string with
$.
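A quick sketch of the two anchors on a made-up vector (desserts is hypothetical). The same word is matched at the start, at the end, or as the entire string, depending on which anchors are used.

```r
library(stringr)

# Hypothetical example data
desserts <- c("apple pie", "pineapple", "apple")

# Starts with "apple"
starts <- str_subset(desserts, "^apple")
starts
## [1] "apple pie" "apple"

# Ends with "apple"
ends <- str_subset(desserts, "apple$")
ends
## [1] "pineapple" "apple"

# Is exactly "apple": anchored at both ends
whole <- str_subset(desserts, "^apple$")
whole
## [1] "apple"
```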
Practice: Use this information to find all fruit that start with a vowel: (There will be 8)
fruit %>% str_subset("")
Practice: Now try finding all fruit that start AND
end with a vowel. (There will be 4)
Hint: You will need to find some way to deal with the letters in the
middle of the words
fruit %>% str_subset("")
Practice: How about all fruit that start OR end with a vowel? (There will be 32)
fruit %>% str_subset("")
More str_ functions
Removing matches
str_remove() removes the first match of the pattern and
returns the rest of the string; str_remove_all() removes
all occurrences of it.
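The difference is easiest to see on a string with several matches. This is a sketch on a made-up vector (x is hypothetical); strings with no match are returned unchanged.

```r
library(stringr)

# Hypothetical example data
x <- c("a-b-c", "no dashes")

# Only the first "-" in each string is removed
first_only <- str_remove(x, "-")
first_only
## [1] "ab-c"      "no dashes"

# Every "-" is removed
all_gone <- str_remove_all(x, "-")
all_gone
## [1] "abc"       "no dashes"
```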
Try removing the first lowercase vowel from each item in
bnames10:
bnames10 %>%
Now try removing all lowercase vowels from bnames10:
bnames10 %>%
Replacing matches
str_replace() and str_replace_all() will
replace the match with a different string. If we want to capitalize
berry:
fruit %>% str_replace("berry", "BERRY")
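As with the remove functions, str_replace() changes only the first match in each string, while str_replace_all() changes every match. A minimal sketch on a made-up vector (x is hypothetical):

```r
library(stringr)

# Hypothetical example data
x <- c("banana", "kiwi")

# Replace only the first "a" in each string
first_only <- str_replace(x, "a", "A")
first_only
## [1] "bAnana" "kiwi"

# Replace every "a"
every_one <- str_replace_all(x, "a", "A")
every_one
## [1] "bAnAnA" "kiwi"
```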
knames contains names that start with “Kat”. Often there
are similar names that start with “C”. Use str_replace to convert these
knames to their “C” equivalent.
knames
## [1] "Katherine" "Katie" "Kate" "Kathryn" "Katharine" "Kathleen"
## [7] "Kathrine" "Kattie" "Katheryn" "Katy" "Kathryne" "Katharina"
## [13] "Katharyn" "Kathyrn" "Kathern" "Kathrina" "Katherina" "Katye"
## [19] "Katheryne" "Katherin" "Katrina" "Kathlyn"
knames %>%
Back references
You have already seen that parentheses can be used to clarify the order of search operations (e.g. “(Jack|Jill)y”). They can also define a group of matched characters that can be referred to subsequently in the search string or in a replace statement. For example, if we want to find all fruit that contain a repeated pair of characters:
fruit %>% str_view("(..)\\1", match = TRUE)
What is going on here? The .. matches any two
characters. Because .. is enclosed in parentheses, it
defines a group, and \\1 refers back to whatever that group matched.
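To see which characters a back reference ends up matching, str_extract() can be handy. This sketch uses a made-up vector (words is hypothetical) and compares a two-character group with a one-character group.

```r
library(stringr)

# Hypothetical example data
words <- c("banana", "coconut", "apple")

# Two characters followed by the same two characters again
pairs <- str_extract(words, "(..)\\1")
pairs
## [1] "anan" "coco" NA

# One character followed by the same character again
doubles <- str_extract(words, "(.)\\1")
doubles
## [1] NA   NA   "pp"
```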
Practice: Try it! Find all names in
bnames100 that begin and end with the same character. To
make it simpler we will convert to lowercase first. You may want to use
str_view() as you work on your pattern, but use
str_subset() when you are ready to submit your answer.
bnames100 %>% tolower() %>%
You can define multiple match groups with separate sets of
parentheses and back reference each one. The first group is
\\1, the second one is \\2, and so forth. So if
we wanted to swap the first and last letters of these words:
## [1] "deal" "dear" "dog" "no"
str_replace(mywords, "^(.)(.*)(.)$", "\\3\\2\\1")
Stop and Think: What does each term in the regular expression match? How does this reverse the first and last characters? How was the computer able to reverse the order of “no” despite it only being 2 characters?
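The same reordering idea works on other kinds of text. As a sketch, assuming dates written as year-month-day (the dates vector is made up), three groups can be captured and then written back in a different order with different separators.

```r
library(stringr)

# Hypothetical example data in year-month-day form
dates <- c("2024-01-15", "1999-12-31")

# Group 1 = year, group 2 = month, group 3 = day;
# the replacement writes them back as day/month/year
flipped <- str_replace(
  dates,
  "([0-9]{4})-([0-9]{2})-([0-9]{2})",
  "\\3/\\2/\\1"
)
flipped
## [1] "15/01/2024" "31/12/1999"
```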
Mutate
All of these functions can be used with mutate() to
transform columns of a dataframe or tibble. Consider this list of
names:
We can use mutate() to create a new column with just the
first name:
people %>%
mutate(first_name=str_remove(full_name, ".*, "))
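Here is a self-contained sketch of the same idea on a different, made-up tibble (files and its file column are hypothetical, not part of this tutorial's data). It adds two columns at once, one with str_extract() and one with str_remove().

```r
library(dplyr)
library(stringr)

# Hypothetical example data
files <- tibble(file = c("report_2021.csv", "summary_2022.csv"))

result <- files %>%
  mutate(
    # the first run of four digits in the file name
    year = str_extract(file, "[0-9]{4}"),
    # the file name with a trailing ".csv" removed
    stem = str_remove(file, "\\.csv$")
  )

result$year
## [1] "2021" "2022"
result$stem
## [1] "report_2021"  "summary_2022"
```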
Practice: Create a column with the last names.
people %>%
mutate(last_name=)
An alternative approach using mutate() and
str_replace():
people %>%
mutate(first_name=str_replace(full_name, ".*, (.*)", "\\1"))
Stop and Think: In your own words explain the regular expressions above and how they work.
Practice: Use mutate() and
str_replace() to create a column with last names.
people %>%
mutate(last_name=)
Practice: Use mutate() and
str_replace() to create a column with the first name
followed by the last name. Your result should look like this:
people %>%
mutate(first_last=)
Resources and final words
Regular expressions are very powerful. They also take a lot of practice before they become intuitive. The payoff is worth it.
A few helpful resources:
- I made a short video on regexplain, an RStudio plugin that will make visualizing regular expressions much easier. Watch it here
- Interactive visualization in R using regexplain
- stringr and regular expression cheatsheet