To use the functions in this tutorial in your own R session you would need to use library
to load the tidyverse
package as shown below. I’ve pre-loaded it for this tutorial.
library(tidyverse)
Often we will have different data sets that need to be combined.
For example we have three different data sets on Star Wars characters:
head(starwars.biodata)
head(starwars.species)
head(starwars.home)
How do we combine these data sets? We CANNOT just cbind them together because the names are not in the same order and not all data sets are complete.
We want to join them by a common key. In this case we can use the character’s name as a key.
The dplyr
package in the tidyverse
provides us with a number of join functions, four of which will be discussed here.
We choose a function based on how we want non-matching keys to be handled.
In each of these functions the first two arguments are the tables that we want to join and the third argument species which column(s) contain the keys
inner_join()
only keeps rows where the key is present in both data sets
starwars.bio.species <- inner_join(starwars.biodata,starwars.species, by = "name")
starwars.bio.species
left_join()
keeps all rows from the left data set and keeps rows from the right data set where there is a key that matches the left. Rows with non-matching keys in the right data set are discarded.
starwars.bio.species <- left_join(starwars.biodata,starwars.species, by = "name")
starwars.bio.species
There is an analogous right_join
full_join()
keeps all rows from left or right, regardless of whether or not there is a matching key.
starwars.bio.species <- full_join(starwars.biodata,starwars.species, by = "name")
starwars.bio.species
What if the key columns have different names in the two data sets? you can specify the equivalency as shown here:
starwars.species
starwars.home #oops different columns names for the key!
starwars.species.home <- left_join(starwars.species, starwars.home, by=c("name"="givenName"))
starwars.species.home
We have three data sets containing information on Brassica rapa genes that have been BLASTED agains Arabidopsis thaliana to do a rough annotation.
gene.info
has information on the genomic location of the B. rapa gene.blast.info
has info on the best hit when the B. rapa gene was BLASTED again ArabidopsisAt.info
has information on Arabidopsis gene function.head
to take a look a each of the three data sets.
Create a new data frame that has information on the best BLAST hit for each Brassica gene. Keep all of the Brassica genes in gene.info
even if they don’t have information in the blast.info
data frame. Discard any blast.info
rows that are not matched in gene.info
.
Repeat what you did above, but now do an additional join to add in the A.t. gene info. Only keep rows where there is a matching entry in both data frame.