Stringr and Regular Expressions

The goal of this tutorial is to introduce string manipulation and regular expression (wildcard) matching using the stringr package.

The material covered here is also explained in greater detail in Chapter 14 of the course textbook R for Data Science by Hadley Wickham and Garrett Grolemund. I encourage you to use that book for clarification or greater depth. It is free online.

stringr is part of the tidyverse and is loaded when you load tidyverse. It provides a set of functions with consistent syntax for manipulating strings.

Prerequisites

To use the functions in this tutorial in your own R session, you need to load the tidyverse package with library(), as shown below. I’ve pre-loaded it for this tutorial.

library(tidyverse)

Detect strings with str_detect()

The str_detect() function looks for the presence of a search pattern in each string and returns TRUE where it detects the pattern and FALSE where it does not.

For example, consider the following 10 names:

##  [1] "Appie"    "Roxana"   "Yvonne"   "Vivien"   "Dorris"   "Concetta"
##  [7] "Kathleen" "Azalee"   "Faith"    "Hughey"

Run the code below to search for names that contain “a”:

bnames10 %>% str_detect("a")
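Because str_detect() returns a logical vector, its result can be counted or used for indexing. The following is a minimal sketch, assuming a small made-up vector called pets (it is not one of the tutorial's name lists):

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
pets <- c("cat", "dog", "parrot")

# One TRUE/FALSE per element
has_a <- str_detect(pets, "a")
has_a
# [1]  TRUE FALSE  TRUE

# TRUE counts as 1, so sum() gives the number of matching strings
n_with_a <- sum(has_a)
n_with_a
# [1] 2

# A logical vector can also be used to index the original vector
pets[has_a]
# [1] "cat"    "parrot"
```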

str_view() performs a similar task but creates a visualization rather than returning output:

bnames10 %>% str_view("a")

Question: Is str_detect() case sensitive?

Practice: Use str_detect() to detect names that contain “A”.

bnames10 %>%

.answer <- bnames10 %>% str_detect("A")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Subset strings with str_subset()

If we want to only retain those strings that match our pattern, we use str_subset().

Run the code below to retain names that contain “a”:

bnames10 %>% str_subset("a")
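One way to think about str_subset() is as indexing with the result of str_detect(). A short sketch on a made-up vector (pets is a hypothetical name used only for illustration):

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
pets <- c("cat", "dog", "parrot")

kept <- str_subset(pets, "a")
kept
# [1] "cat"    "parrot"

# Indexing with str_detect() gives the same result
same <- identical(kept, pets[str_detect(pets, "a")])
same
# [1] TRUE

# negate = TRUE keeps the strings that do NOT match
dropped <- str_subset(pets, "a", negate = TRUE)
dropped
# [1] "dog"
```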

Practice: Keep names that contain “th”.

.answer <- bnames10 %>% str_subset("th")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Extract matches with str_extract()

If we want to extract the matching part of the string, we can use str_extract():

bnames10 %>% str_extract("a")
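It may help to see what str_extract() returns when a string has no match, and how it differs from str_extract_all(). A small sketch on a made-up vector (pets is hypothetical):

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
pets <- c("cat", "dog", "parrot")

# First match in each string; NA where there is no match
first_a <- str_extract(pets, "a")
first_a
# [1] "a" NA  "a"

# str_extract_all() returns a list holding every match per string
all_r <- str_extract_all(pets, "r")
all_r[[3]]
# [1] "r" "r"
```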

Practice: Go ahead and try it, with the pattern of your choice.

This doesn’t seem very useful right now, but it will…

Regular Expressions, part 1

Regular expressions are a powerful syntax for specifying wildcards and doing complex pattern matching. They are a component of most (or all) modern computer languages. We will only scratch the surface here, but this should be enough to do a fair number of basic tasks.

The wildcard character

The . character matches any single character. To find names in which two “a”s are separated by exactly one character (note that we are now using a list of 1000 names):

bnames1000 %>% str_subset("a.a")

Regular expressions can get pretty confusing. By adding the match = TRUE argument to str_view(), we get a visualization that shows only the matching strings, similar to str_subset():

bnames1000 %>% str_view("a.a", match = TRUE)
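To see exactly which characters the wildcard pattern picks up, str_extract() can be combined with str_subset(). A sketch on a small made-up vector (words is a hypothetical name):

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
words <- c("banana", "data", "alpaca", "aardvark")

# Strings that contain "a", any one character, then "a"
matched <- str_subset(words, "a.a")
matched
# [1] "banana" "data"   "alpaca"

# The part of each string that the pattern matched
pieces <- str_extract(words, "a.a")
pieces
# [1] "ana" "ata" "aca" NA
```

"aardvark" does not match because none of its "a"s has another "a" exactly two positions later.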

Now you try it: use str_subset() or str_view() on the bnames1000 list to show names where there are two “i”s separated by two characters. You should get 24 matches.

Escaping characters

But what if you actually want to find a period? Just passing "." to str_detect() doesn’t work, because . matches any character:

test <- c("has.a.period", "has_no_period", "hasNoPunctuation")
test %>% str_detect(".")

There are two options:

  1. Use \\ to ‘escape’ it, that is, to remove its special meaning. Somewhat confusingly, we have to write \\ rather than \ because the backslash must itself be escaped inside an R string.
test <- c("has.a.period", "has_no_period", "hasNoPunctuation")
test %>% str_detect("\\.")
  2. If you do not need to use a regular expression at all in your search string, you can wrap it in fixed(), which indicates that everything in the pattern should be used literally.
test <- c("has.a.period", "has_no_period", "hasNoPunctuation")
test %>% str_detect(fixed("."))
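The double backslash is easier to follow once you see what the string actually contains. The sketch below reuses the test vector from above and compares the three patterns side by side:

```r
library(stringr)

test <- c("has.a.period", "has_no_period", "hasNoPunctuation")

# The R string "\\." holds two characters: a backslash and a period
writeLines("\\.")
# \.
n_chars <- nchar("\\.")
n_chars
# [1] 2

wild    <- str_detect(test, ".")          # wildcard: any character matches
escaped <- str_detect(test, "\\.")        # escaped: a literal period
literal <- str_detect(test, fixed("."))   # fixed(): a literal period

wild
# [1] TRUE TRUE TRUE
escaped
# [1]  TRUE FALSE FALSE
literal
# [1]  TRUE FALSE FALSE
```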

Specifying the number of matches

The number of matches can be specified by appending special characters (this is taken from the R help file on regexp):

  • * The preceding item will be matched zero or more times.
  • ? The preceding item will be matched zero or one times.
  • + The preceding item will be matched one or more times.
  • {n} The preceding item is matched exactly n times.
  • {n,} The preceding item is matched n or more times.
  • {n,m} The preceding item is matched at least n times, but not more than m times.
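The quantifiers listed above can be compared on one small vector. This is a sketch using a made-up vector x, in which the number of “b”s between “a” and “c” grows from zero to three:

```r
library(stringr)

# Hypothetical example vector: 0, 1, 2 and 3 "b"s between "a" and "c"
x <- c("ac", "abc", "abbc", "abbbc")

star     <- str_detect(x, "ab*c")      # zero or more "b"
question <- str_detect(x, "ab?c")      # zero or one "b"
plus     <- str_detect(x, "ab+c")      # one or more "b"
exactly2 <- str_detect(x, "ab{2}c")    # exactly two "b"
atleast2 <- str_detect(x, "ab{2,}c")   # two or more "b"
one_two  <- str_detect(x, "ab{1,2}c")  # one or two "b"

star
# [1] TRUE TRUE TRUE TRUE
question
# [1]  TRUE  TRUE FALSE FALSE
plus
# [1] FALSE  TRUE  TRUE  TRUE
exactly2
# [1] FALSE FALSE  TRUE FALSE
atleast2
# [1] FALSE FALSE  TRUE  TRUE
one_two
# [1] FALSE  TRUE  TRUE FALSE
```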

Practice: Start with the even larger bnamesearly and find all names where there is an “e” followed by 1 or 2 “z”s. Your results should look like this:

##  [1] "Inez"     "Hezekiah" "Jabez"    "Dezzie"   "Cortez"   "Nezzie"  
##  [7] "Ebenezer" "Plez"     "Onezia"   "Hezzie"

bnamesearly %>% str_subset("")

.answer <- bnamesearly %>% str_subset("ez{1,2}")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Regular Expressions, part 2

Character sets

If we want to match any of a particular set of characters we can include the group in [].

Practice: Experiment with [] to find all names in the bnames100 list that start with a vowel. People’s names start with capital letters (You should find 19):

bnames100 %>% str_subset("[]")

.answer <- bnames100 %>% str_subset("[AEIOU]")

grade_result(
  pass_if(~identical(.result,  .answer))
)

We can also use character sets to exclude a particular set of characters by placing a ^ as the first character in our brackets [^].

Practice: Experiment with [^] to find all letters in the built-in letters vector that are not vowels. (You should find 21):

letters %>% str_subset("[^]")

.answer <- letters %>% str_subset("[^aeiou]")

grade_result(
  pass_if(~identical(.result,  .answer))
)

You can also specify ranges using “-”: [0-5] matches any digit from 0 to 5, [j-n] matches j, k, l, m, or n, [A-Z] matches any capital letter, and so on. There are also some predefined classes, but I generally do not recommend them because their definitions change depending on your locale.
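Ranges can be tried out on a short vector of mixed letters and digits. This is a sketch using a made-up vector called codes:

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
codes <- c("a1", "b7", "k3", "Z9")

low_digit <- str_subset(codes, "[0-5]")      # contains a digit from 0 to 5
j_to_n    <- str_subset(codes, "[j-n]")      # contains j, k, l, m, or n
capital   <- str_subset(codes, "[A-Z]")      # contains a capital letter
other     <- str_subset(codes, "[^0-5a-z]")  # contains something that is neither

low_digit
# [1] "a1" "k3"
j_to_n
# [1] "k3"
capital
# [1] "Z9"
other
# [1] "b7" "Z9"
```

The last pattern combines two ranges with ^ inside one set of brackets, so it matches any character that is not a digit from 0 to 5 and not a lowercase letter.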

Or

Somewhat related is the | character, which serves as an “or”:

bnamesall %>% str_subset("Jack|Jill") %>% head(15)

Parentheses are used to indicate the boundaries of the or statement. Compare the following two code chunks:

No parentheses:
bnamesall %>% str_subset("Jack|Jilly") 
With parentheses:
bnamesall %>% str_subset("(Jack|Jill)y") 

Why are these last two different? The first returns all names that contain “Jack” or “Jilly”, whereas the second returns all names that contain “Jacky” or “Jilly”.
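The same comparison can be reproduced on a vector small enough to check by eye. This sketch assumes a made-up vector called jnames, not the bnamesall list:

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
jnames <- c("Jack", "Jacky", "Jill", "Jilly", "Jackson")

# Without parentheses: "Jack" OR "Jilly"
no_parens <- str_subset(jnames, "Jack|Jilly")
no_parens
# [1] "Jack"    "Jacky"   "Jilly"   "Jackson"

# With parentheses: ("Jack" OR "Jill") followed by "y"
with_parens <- str_subset(jnames, "(Jack|Jill)y")
with_parens
# [1] "Jacky" "Jilly"
```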

Practice: The object fruit contains 80 fruit names.

1) Find all of those that contain “berry” or “pepper”. (There will be 16)

fruit %>% str_subset("")

.answer <- fruit %>% str_subset("(berry|pepper)")

grade_result(
  pass_if(~identical(.result,  .answer))
)

2) Find all fruit names that have a space " " preceding “berry” or “pepper”. (There will be 4)

fruit %>% str_subset("")

.answer <- fruit %>% str_subset(" (berry|pepper)")

grade_result(
  pass_if(~identical(.result,  .answer))
)

3) Find all fruit names that contain “berry” or “pepper” but that do NOT have a preceding space. (There will be 12)

fruit %>% str_subset("")

.answer <- fruit %>% str_subset("[^ ](berry|pepper)")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Regular Expressions, part 3

Earlier you were asked to find all names that start with a vowel. This was relatively easy because the first letter of each name in the bnames lists is capitalized, so searching for a capital vowel was enough. But what if you wanted all fruit that start with a vowel? The fruit names are all lowercase, so a vowel could appear anywhere in the string. Luckily, we can specify the beginning of a string with ^ and the end of a string with $.
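As a quick illustration of the two anchors, here is a sketch on a made-up vector (words is a hypothetical name) that uses a consonant, so the practice exercises below are left for you:

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
words <- c("banana", "cabbage", "bob")

anywhere <- str_subset(words, "b")       # "b" anywhere in the string
at_start <- str_subset(words, "^b")      # "b" at the start
at_end   <- str_subset(words, "b$")      # "b" at the end
whole    <- str_subset(words, "^bob$")   # the whole string is exactly "bob"

anywhere
# [1] "banana"  "cabbage" "bob"
at_start
# [1] "banana" "bob"
at_end
# [1] "bob"
whole
# [1] "bob"
```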

Practice: Use this information to find all fruit that start with a vowel: (There will be 8)

fruit %>% str_subset("")

.answer <- fruit %>% str_subset("^[aeiou]")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Practice: Now try finding all fruit that start AND end with a vowel. (There will be 4)
Hint: You will need to find some way to deal with the letters in the middle of the words

fruit %>% str_subset("")

.answer <- fruit %>% str_subset("^[aeiou].*[aeiou]$")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Practice: How about all fruit that start OR end with a vowel? (There will be 32)

fruit %>% str_subset("")

.answer <- fruit %>% str_subset("^[aeiou]|[aeiou]$")

grade_result(
  pass_if(~identical(.result,  .answer))
)

More str_ functions

Removing matches

str_remove() removes the first match of the pattern and returns the rest of the string; str_remove_all() removes all occurrences of it.
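The difference between the two functions shows up when a string contains the pattern more than once. A sketch on a made-up vector (words is hypothetical):

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
words <- c("banana", "cherry")

# Only the first "an" is removed from each string
first_removed <- str_remove(words, "an")
first_removed
# [1] "bana"   "cherry"

# Every "an" is removed
all_removed <- str_remove_all(words, "an")
all_removed
# [1] "ba"     "cherry"
```

Strings without a match, such as "cherry", are returned unchanged.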

Practice: Try removing the first lowercase vowel from each item in bnames10.

bnames10 %>%

.answer <- bnames10 %>% str_remove("[aeiou]")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Practice: Now try removing all lowercase vowels from bnames10.

bnames10 %>%

.answer <- bnames10 %>% str_remove_all("[aeiou]")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Replacing matches

str_replace() and str_replace_all() will replace the match with a different string. If we want to capitalize berry:

fruit %>% str_replace("berry", "BERRY")
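As with str_remove(), the _all variant matters when the pattern occurs more than once. The sketch below uses a single made-up string, and also shows that str_replace_all() accepts a named vector of replacements:

```r
library(stringr)

# Hypothetical example string, not part of the tutorial data
word <- "banana"

# Only the first "a" is replaced
first_only <- str_replace(word, "a", "A")
first_only
# [1] "bAnana"

# Every "a" is replaced
every <- str_replace_all(word, "a", "A")
every
# [1] "bAnAnA"

# A named vector applies several pattern = replacement pairs
several <- str_replace_all(word, c("a" = "A", "n" = "N"))
several
# [1] "bANANA"
```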

knames contains names that start with “Kat”. Often there are similar names that start with “C”. Use str_replace() to convert these knames to their “C” equivalents.

knames
##  [1] "Katherine" "Katie"     "Kate"      "Kathryn"   "Katharine" "Kathleen" 
##  [7] "Kathrine"  "Kattie"    "Katheryn"  "Katy"      "Kathryne"  "Katharina"
## [13] "Katharyn"  "Kathyrn"   "Kathern"   "Kathrina"  "Katherina" "Katye"    
## [19] "Katheryne" "Katherin"  "Katrina"   "Kathlyn"

knames %>%

.answer <- knames %>% str_replace("K", "C")

grade_result(
  pass_if(~identical(.result,  .answer))
)

Back references

You have already seen that parentheses can be used to clarify the order of search operations (e.g. “(Jack|Jill)y”). They can also define a group of matched characters that can be referred to later in the search string or in a replace statement. For example, if we want to find all fruit names that contain a repeated pair of characters:

fruit %>% str_view("(..)\\1", match = TRUE)

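What is going on here? The .. matches any two characters. Because .. is enclosed in parentheses, it defines a group, and \\1 refers back to whatever that group matched. The pattern therefore matches any two characters that are immediately followed by the same two characters.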
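To see what the back reference actually captures, str_extract() can show the matched text. A sketch on a made-up vector (words is hypothetical, not the fruit list):

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
words <- c("banana", "coconut", "papaya", "kiwi")

# Two characters followed by the same two characters
repeated_pair <- str_extract(words, "(..)\\1")
repeated_pair
# [1] "anan" "coco" "papa" NA

# A one-character group finds doubled letters instead
doubled <- str_subset(c("apple", "banana", "cherry"), "(.)\\1")
doubled
# [1] "apple"  "cherry"
```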

Practice: Try it! Find all names in bnames100 that begin and end with the same character. To make it simpler we will convert to lowercase first. You may want to use str_view() as you work on your pattern, but use str_subset() when you are ready to submit your answer.

bnames100 %>% tolower() %>%

.answer <- bnames100 %>% tolower() %>% str_subset("^(.).*\\1$")
.wrong <- bnames100 %>% tolower() %>% str_subset("(.).*\\1")

grade_result(
  pass_if(~identical(.result,  .answer)),
  fail_if(~identical(.result, .wrong), "Nice try but maybe you forgot to specify the start and end of the string")
  )

You can define multiple match groups with separate sets of parentheses and back-reference each one: the first group is \\1, the second is \\2, and so forth. So if we wanted to swap the first and last letters of these words:

## [1] "deal" "dear" "dog"  "no"
str_replace(mywords, "^(.)(.*)(.)$", "\\3\\2\\1")

Stop and Think: What does each term in the regular expression match? How does this reverse the first and last characters? How was the computer able to reverse the order of “no” despite it only being 2 characters?
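Reordering numbered groups is useful well beyond swapping letters. As a further sketch, assuming a made-up vector of dates written as year-month-day (dates is a hypothetical name), three groups can be rearranged into day/month/year:

```r
library(stringr)

# Hypothetical example vector, not part of the tutorial data
dates <- c("2021-03-15", "1999-12-31")

# Group 1 = four-digit year, group 2 = month, group 3 = day
swapped <- str_replace(
  dates,
  "^([0-9]{4})-([0-9]{2})-([0-9]{2})$",
  "\\3/\\2/\\1"
)
swapped
# [1] "15/03/2021" "31/12/1999"
```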

Mutate

All of these functions can be used with mutate() to transform columns of a dataframe or tibble. Consider this list of names:

We can use mutate() to create a new column with just the first name:

people %>%
  mutate(first_name=str_remove(full_name, ".*, "))
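To try this outside the tutorial you need a data frame of your own. The sketch below assumes a made-up tibble, people_demo, with names written as “Last, First”. It is a stand-in for illustration only, not the tutorial's people data:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the tutorial's people data
people_demo <- tibble(
  full_name = c("Lovelace, Ada", "Turing, Alan", "Hopper, Grace")
)

result <- people_demo %>%
  mutate(
    # Remove everything up to and including the comma and space
    first_name = str_remove(full_name, ".*, "),
    # Extract the first character of the new column
    first_initial = str_extract(first_name, "^.")
  ) %>%
  # Keep only rows whose first name starts with "A"
  filter(str_detect(first_name, "^A"))

result$first_name
# [1] "Ada"  "Alan"
```

The stringr functions work inside filter() as well as mutate(), because each one takes a character vector and returns a vector of the same length.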

Practice: Create a column with the last names.

people %>%
  mutate(last_name=)

.answer <- people %>% mutate(last_name=str_remove(full_name, ", .*"))

grade_result(
  pass_if(~identical(.result,  .answer))
  )

An alternative approach using mutate() and str_replace():

people %>%
  mutate(first_name=str_replace(full_name, ".*, (.*)", "\\1"))

Stop and Think: In your own words explain the regular expressions above and how they work.

Practice: Use mutate() and str_replace() to create a column with last names.

people %>%
  mutate(last_name=)

Practice: Use mutate() and str_replace() to create a column with the first name followed by the last name. Your result should look like this:

people %>%
  mutate(first_last=)

.answer <- people %>% mutate(first_last=str_replace(full_name, "(.*), (.*)", "\\2 \\1"))

grade_result(
  pass_if(~identical(.result,  .answer))
  )

Resources and final words

Regular expressions are very powerful. They also take a lot of practice before they become intuitive. The payoff is worth it.

A few helpful resources:

  1. I made a short video on regexplain, an RStudio plugin that will make visualizing regular expressions much easier. Watch it here.
  2. Interactive visualization in R using regexplain
  3. stringr and regular expression cheatsheet