Introduction to Tidyverse

Welcome to the TidyVerse

The goal of this tutorial is to introduce the tidyverse a series of R packages that make data manipulation and processing easier. While most of what is covered here could be accomplished using base R functions, the tidy functions make things easier and more clear.

The material covered here is also explained in greater detail in the course textbook R for Data Science by Hadley Wickham and Garrett Grolemund. I encourage you to use that book for clarification or greater depth. It is free online.

In R there are many add on packages, a.k.a. libraries. Tidyverse is one such package. We use the library() command to load a library for use in our current R session. library() takes the package name as an argument.

pre-requisites

To use the functions in this tutorial in your own R session you would need to use library to load the tidyverse package as shown below. I’ve pre-loaded it for this tutorial.

library(tidyverse)

Tibbles

Optional reading: tibbles in R for Data Science

You were previously introduced to data.frame objects as spreadsheet-like objects that are useful for handling data in R. Tidyverse has a related data type, the tibble. Details of the differences aren’t important right now, but if you are already an R afficiando and are curious you can read about them in the tibbles chapter in R for Data Science

To import data into R as a tibble it is convenient to have it saved as a comma separated value (csv) file. For the rest of this tutorial we will work with a dataset of tomato measurements in the file Tomato.csv

We import the data using the read_csv() function. If you have used R before, please note that this is a different function than read.csv(). read_csv() created a tibble upon import instead of a data frame.

We will use data from a pilot experiment measuring the growth of various wild tomato accessions and species under different light conditions. To load this data into R you would need to run the code below, modifying the path as appropriate (you do not need to run or modify it for this tutorial).

tomato <- read_csv("www/Tomato.csv")

Now use the head function to take a look at the imported data

Explanation of column headings:

shelf, flat, col, row: information about where each plant was grown
acs: the accession number for that strain
trt: “H” is light with a high red:far-red ratio; simulated sun. “L” is light with a low red:far- red ratio; simulated plant shade.
days: days since germination
date: date of measurement
hyp: hypocotyl length (embryonic stem)
int1-int4: individual internode lengths. Internodes are stem segments
leafnum: which leaf was measured
petleng: petiole length
leafleng: leaf length
leafwid: leaf width
ndvi: a measure of plant vegetation density
lat: latitude of origin
lon: longitude of origin
alt: altitude of origin
species: species of plant
who: who measured the plants

Filter data with `filter()`

The tomato data set contains data from five species. We can see their names with:

unique(tomato$species)

Suppose that we only want to look at the data from one of these, S. pennellii. We can use the filter function to specify exactly that:

Spen.data <- filter(tomato,species == "S. pennellii")

The first argument is the data object that we want to work on, and the second argument is a logical statement describing how we want to filter.

How many rows were in the original data set and how many are in your filtered data set? Add additional lines to the code below to answer this question

Spen.data <- filter(tomato,species == "S. pennellii")

There were two researchers that measured the data. Perhaps we only trust the data gathered by one researcher. Find the names of the researchers and then filter to only keep the data measured by the second researcher in the list. How can you confirm that your filtering worked and that only one researcher is contained in the new data set?

You can provide multiple test arguments to filter() and all must evaluate to true for a row to be selected.

Try creating a new data set that consists of only S. chilense samples gathered from altitude greater than 2500 meters.

Want more info? See filter in R 4 Data Science.

Select columns with `select()`

The tomato data set has 23 data columns, and you may not care about all of them. We can use select() to select particular columns

select(tomato, species, hyp)

If we use a minus sign - in front of a variable name then select will select everything except for that variable name:

select(tomato, -shelf, -flat, -col, -row) #remove position information

Select can also be used to rearrange the column order. Combine with everything() so that you don’t have to type them all.

So, if we want to have the “who” column first, then…

select(tomato,who,everything())

Practice by using select to create a new tibble that only has the hypcotyl and internode data, along with species, accession, and trt.

Want more info? See select in R 4 Data Science.

Combine commands using pipes

You will remember the pipe | operator in Linux that takes the output from one command and “pipes” it into another command. The R Tidyverse has a similar operator, although it is written as %>%. (Technically this comes from the package magrittr (get the pun?))

Want more info? See pipes in R 4 Data Science.

So if we want to use filter to keep data collected by Dan and then select to only keep species, accession, who, trt, and leaf data we could do it like so:

Dan_leaf_data <- tomato %>% #start with the tomato data; 
                            #pipe it to the next command (filter).  
                            #final results will go into Dan_leaf_data
  filter(who=="Dan") %>%
  select(species, acs, who, trt, starts_with("leaf")) # note use of "starts_with!"

Dan_leaf_data

Note that we no longer have to give the “tomato” argument to filter or select, pipe is taking care of that first argument for us.

Now try it yourself, selecting the hypocotyl data measured by Pepe from the “H” trt.

Want more info? See pipes in R 4 Data Science.

Arrange data with `arrange()`

The arrange() function allows you to sort your data based on the column(s) of your choice:

if we want to sort by hypocotyl length:

tomato %>% arrange(hyp)

By default arrange will sort from lowest to highest. if you want to go from highest to lowest, then add desc

tomato %>% arrange(desc(hyp))

You can sort on multiple columns by providing multiple arugments to arrange. Try sorting by both accession (acs) and hyp. How does changing the order of the arguments to arrange affect the results? Try it both ways.

Now fill in the blanks the code below to create a new data set that contains data gathered by Pepe, sorted by altitude (alt) and containing the who, trt, species, acs, hyp, and alt columns.

Also you will need to fix an error in the code before this runs.

Pepe_long_hyps <- tomato %>%
  ______(who==Pepe) %>% 
  ______(who, trt, species, acs, hyp, alt) %>%
  ______(alt)

Want more info? See arrange in R 4 Data Science.

Create new columns with `mutate()`

If you want to add a new column to your data set, one way to do so is with mutate. For example, if we wanted a column that represented the leaf length to width ratio:

tomato_lw <- tomato %>%  mutate(lw=leafleng/leafwid)
tomato_lw %>% select(leafleng,leafwid,lw, everything())

Use mutate to make a new column that represents the total stem length of the plant (hypocotyl + all internodes).

Want more info? See mutate in R 4 Data Science.

Create summaries with `summarize()`

Tidyverse functions also provide an easy way to summarize your data.

We first tell R what groups we are interested in summarizing (using group_by(), and then how we want it summarized (summarize()).

For example, if we wanted the mean hypocotyl length of each species, then:

tomato %>%
  group_by(species) %>% #want to compute something for each species
  summarize(hyp.mean=mean(hyp)) #hyp.mean is what the new summary will be called

If we want to also have a summary column that tells us the number of measurements in each group we can use n()

tomato %>%
  group_by(species) %>% 
  summarize(hyp.mean=mean(hyp),
            n=n())

We can group by multiple factors:

tomato %>%
  group_by(species,trt) %>%
  summarize(hyp.mean=mean(hyp),
            n=n())

Also we can create even more summary columns, and, of course, the results can be stored in a new object

hyp.summary <- tomato %>%
  group_by(species,trt) %>%
  summarize(hyp.mean=mean(hyp), 
            hyp.sd=sd(hyp),
            n=n())
hyp.summary

We can take advantage of the fact that TRUE is equivalent to 1 and FALSE to 0 to quickly count the number if items that meet some criteria. If we want to know the number of hypocotyl measurements greater than 30mm for each species:

tomato %>%
  group_by(species) %>%
  summarize(`Total hyp measurements`=n(),
            `Hyp measurements > 30`=sum(hyp>30))

Your turn…

Calculate the median altitude of each species, then pipe the results into arrange to sort by altitude. Which species comes from the highest altitude?

Who worked harder in this experiment? Calculate the number of measurements made by each researcher for each species.

The column leafnum gives information on the leaf that was measured (leaf 1 is the oldest and is at the bottom of the plant). For each species and treatment combination, calculate the percentage of leaf measurements that come from leaf 6. _Self check: S. Chilense H should be 31.37255`

Use `slice()` to slice up your data

The slice() set of functions allows you to subset your data by ranked position. There are several related functions, here we will just work with slice_min() and slice_max().

slice_min() will order the rows by the column that you specify and keep the rows with the n smallest values.

To keep the rows with the 5 shortest hypocotyl measurements:

tomato %>%
  slice_min(order_by = hyp,
            n=5)

You can also combine with group_by() to return n rows per group.

Your turn: use group_by() and slice_max() to return the row with the longest hypocotyl for each species and trt combination. Combine this with another command that you learned earlier in this tutorial to only keep the columns “species”, “acs”, “trt” and “hyp”

The result should be a table of 10 rows and for columns. The first row should be S. chilense LA1958 H 65.32

End

Congratulations, you have reached the end of this tutorial.

Welcome to the TidyVerse

pre-requisites

Tibbles

Filter data with filter()

Select columns with select()

Combine commands using pipes

Arrange data with arrange()

Create new columns with mutate()

Create summaries with summarize()

Use slice() to slice up your data