Welcome to the TidyVerse
The goal of this tutorial is to introduce the tidyverse a series of R packages that make data manipulation and processing easier. While most of what is covered here could be accomplished using base R functions, the tidy functions make things easier and more clear.
The material covered here is also explained in greater detail in the course textbook R for Data Science by Hadley Wickham and Garrett Grolemund. I encourage you to use that book for clarification or greater depth. It is free online.
In R there are many add on packages, a.k.a. libraries
.
Tidyverse is one such package. We use the library()
command
to load a library for use in our current R session.
library()
takes the package name as an argument.
pre-requisites
To use the functions in this tutorial in your own R session you would
need to use library
to load the tidyverse
package as shown below. I’ve pre-loaded it for this tutorial.
library(tidyverse)
Tibbles
Optional reading: tibbles in R for Data Science
You were previously introduced to data.frame
objects as
spreadsheet-like objects that are useful for handling data in R.
Tidyverse has a related data type, the tibble
. Details of
the differences aren’t important right now, but if you are already an R
afficiando and are curious you can read about them in the tibbles chapter in R
for Data Science
To import data into R as a tibble it is convenient to have it saved
as a comma
separated value (csv) file. For the rest of this tutorial we will
work with a dataset of tomato measurements in the file
Tomato.csv
We import the data using the read_csv()
function. If you
have used R before, please note that this is a different function than
read.csv()
. read_csv()
created a tibble upon
import instead of a data frame.
We will use data from a pilot experiment measuring the growth of various wild tomato accessions and species under different light conditions. To load this data into R you would need to run the code below, modifying the path as appropriate (you do not need to run or modify it for this tutorial).
tomato <- read_csv("www/Tomato.csv")
Now use the head
function to take a look at the imported
data
Explanation of column headings:
- shelf, flat, col, row: information about where each plant was grown
- acs: the accession number for that strain
- trt: “H” is light with a high red:far-red ratio; simulated sun. “L” is light with a low red:far- red ratio; simulated plant shade.
- days: days since germination
- date: date of measurement
- hyp: hypocotyl length (embryonic stem)
- int1-int4: individual internode lengths. Internodes are stem segments
- leafnum: which leaf was measured
- petleng: petiole length
- leafleng: leaf length
- leafwid: leaf width
- ndvi: a measure of plant vegetation density
- lat: latitude of origin
- lon: longitude of origin
- alt: altitude of origin
- species: species of plant
- who: who measured the plants
Filter data with filter()
The tomato
data set contains data from five species. We
can see their names with:
unique(tomato$species)
Suppose that we only want to look at the data from one of these,
S. pennellii. We can use the filter
function to
specify exactly that:
Spen.data <- filter(tomato,species == "S. pennellii")
The first argument is the data object that we want to work on, and the second argument is a logical statement describing how we want to filter.
How many rows were in the original data set and how many are in your filtered data set? Add additional lines to the code below to answer this question
Spen.data <- filter(tomato,species == "S. pennellii")
There were two researchers that measured the data. Perhaps we only trust the data gathered by one researcher. Find the names of the researchers and then filter to only keep the data measured by the second researcher in the list. How can you confirm that your filtering worked and that only one researcher is contained in the new data set?
You can provide multiple test arguments to filter()
and
all must evaluate to true for a row to be selected.
Try creating a new data set that consists of only S. chilense samples gathered from altitude greater than 2500 meters.
Want more info? See filter in R 4 Data Science.
Select columns with select()
The tomato data set has 23 data columns, and you may not care about
all of them. We can use select()
to select particular
columns
select(tomato, species, hyp)
If we use a minus sign -
in front of a variable name
then select
will select everything except for that variable
name:
select(tomato, -shelf, -flat, -col, -row) #remove position information
Select can also be used to rearrange the column order. Combine with
everything()
so that you don’t have to type them all.
select(tomato,who,everything())
Practice by using select to create a new tibble that only has the hypcotyl and internode data, along with species, accession, and trt.
Want more info? See select in R 4 Data Science.
Combine commands using pipes
You will remember the pipe |
operator in Linux that
takes the output from one command and “pipes” it into another command.
The R Tidyverse has a similar operator, although it is written as
%>%
. (Technically this comes from the package
magrittr
(get the pun?))
Want more info? See pipes in R 4 Data Science.
So if we want to use filter to keep data collected by Dan and then select to only keep species, accession, who, trt, and leaf data we could do it like so:
Dan_leaf_data <- tomato %>% #start with the tomato data;
#pipe it to the next command (filter).
#final results will go into Dan_leaf_data
filter(who=="Dan") %>%
select(species, acs, who, trt, starts_with("leaf")) # note use of "starts_with!"
Dan_leaf_data
Note that we no longer have to give the “tomato” argument to filter or select, pipe is taking care of that first argument for us.
Now try it yourself, selecting the hypocotyl data measured by Pepe from the “H” trt.
Want more info? See pipes in R 4 Data Science.
Arrange data with arrange()
The arrange()
function allows you to sort your data
based on the column(s) of your choice:
tomato %>% arrange(hyp)
By default arrange
will sort from lowest to highest. if
you want to go from highest to lowest, then add desc
tomato %>% arrange(desc(hyp))
You can sort on multiple columns by providing multiple arugments to
arrange
. Try sorting by both accession (acs) and hyp. How
does changing the order of the arguments to arrange
affect
the results? Try it both ways.
Now fill in the blanks the code below to create a new data set that contains data gathered by Pepe, sorted by altitude (alt) and containing the who, trt, species, acs, hyp, and alt columns.
Also you will need to fix an error in the code before this runs.
Pepe_long_hyps <- tomato %>%
______(who==Pepe) %>%
______(who, trt, species, acs, hyp, alt) %>%
______(alt)
Want more info? See arrange in R 4 Data Science.
Create new columns with mutate()
If you want to add a new column to your data set, one way to do so is with mutate. For example, if we wanted a column that represented the leaf length to width ratio:
tomato_lw <- tomato %>% mutate(lw=leafleng/leafwid)
tomato_lw %>% select(leafleng,leafwid,lw, everything())
Use mutate to make a new column that represents the total stem length of the plant (hypocotyl + all internodes).
Want more info? See mutate in R 4 Data Science.
Create summaries with summarize()
Tidyverse functions also provide an easy way to summarize your data.
We first tell R what groups we are interested in summarizing
(using group_by()
, and then how we want it
summarized (summarize()
).
For example, if we wanted the mean hypocotyl length of each species, then:
tomato %>%
group_by(species) %>% #want to compute something for each species
summarize(hyp.mean=mean(hyp)) #hyp.mean is what the new summary will be called
If we want to also have a summary column that tells us the number of
measurements in each group we can use n()
tomato %>%
group_by(species) %>%
summarize(hyp.mean=mean(hyp),
n=n())
tomato %>%
group_by(species,trt) %>%
summarize(hyp.mean=mean(hyp),
n=n())
Also we can create even more summary columns, and, of course, the results can be stored in a new object
hyp.summary <- tomato %>%
group_by(species,trt) %>%
summarize(hyp.mean=mean(hyp),
hyp.sd=sd(hyp),
n=n())
hyp.summary
We can take advantage of the fact that TRUE is equivalent to 1 and FALSE to 0 to quickly count the number if items that meet some criteria. If we want to know the number of hypocotyl measurements greater than 30mm for each species:
tomato %>%
group_by(species) %>%
summarize(`Total hyp measurements`=n(),
`Hyp measurements > 30`=sum(hyp>30))
Your turn…
Calculate the median altitude of each species, then pipe the results into arrange to sort by altitude. Which species comes from the highest altitude?
Who worked harder in this experiment? Calculate the number of measurements made by each researcher for each species.
The column leafnum
gives information on the leaf that
was measured (leaf 1 is the oldest and is at the bottom of the plant).
For each species and treatment combination, calculate the percentage of
leaf measurements that come from leaf 6. _Self check: S. Chilense H
should be 31.37255`
Use slice()
to slice up your data
The slice()
set of functions allows you to subset your
data by ranked position. There are several related functions, here we
will just work with slice_min()
and
slice_max()
.
slice_min()
will order the rows by the column that you
specify and keep the rows with the n
smallest values.
tomato %>%
slice_min(order_by = hyp,
n=5)
You can also combine with group_by()
to return
n
rows per group.
Your turn: use group_by()
and slice_max()
to return the row with the longest hypocotyl for each species and trt
combination. Combine this with another command that you learned earlier
in this tutorial to only keep the columns “species”, “acs”, “trt” and
“hyp”
The result should be a table of 10 rows and for columns. The first
row should be S. chilense LA1958 H 65.32
End
Congratulations, you have reached the end of this tutorial.