To use the functions in this tutorial in your own R session, you would need to load the tidyverse package with library(tidyverse). I’ve pre-loaded it for this tutorial.
Often we will have different data sets that need to be combined.
For example we have three different data sets on Star Wars characters:
head(starwars.biodata)
head(starwars.species)
head(starwars.home)
How do we combine these data sets? We CANNOT just cbind them together because the names are not in the same order and not all data sets are complete.
We want to join them by a common key. In this case we can use the character’s name as a key.
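To see why pasting columns together by position is risky, here is a minimal sketch using two small made-up tables (the names bio, species, glued, and rows_agree and all values are assumptions for illustration, not the tutorial's data sets). Both tables list the same three characters, just in a different order.

```r
# Toy data made up for illustration; not the tutorial's actual tables
bio <- data.frame(
  name = c("Luke Skywalker", "Chewbacca", "R2-D2"),
  height = c(172, 228, 96),
  stringsAsFactors = FALSE
)
species <- data.frame(
  name = c("R2-D2", "Luke Skywalker", "Chewbacca"),
  species = c("Droid", "Human", "Wookiee"),
  stringsAsFactors = FALSE
)

# cbind() pastes columns together by row position and never looks at the names
glued <- cbind(bio, species)
glued

# Compare the two name columns row by row: they disagree in every row,
# so each character has been paired with another character's species
rows_agree <- bio$name == species$name
rows_agree
```

No error or warning is raised here, which is what makes the mistake easy to miss. Matching rows on a key avoids it.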
The dplyr package in the tidyverse provides us with a number of join functions, four of which will be discussed here.
We choose a function based on how we want non-matching keys to be handled.
In each of these functions the first two arguments are the tables that we want to join, and the third argument specifies which column(s) contain the keys.
inner_join() only keeps rows where the key is present in both data sets
starwars.bio.species <- inner_join(starwars.biodata, starwars.species, by = "name")
starwars.bio.species
left_join() keeps all rows from the left data set and keeps rows from the right data set where there is a key that matches the left. Rows with non-matching keys in the right data set are discarded.
starwars.bio.species <- left_join(starwars.biodata, starwars.species, by = "name")
starwars.bio.species
There is an analogous right_join(), which keeps all rows from the right data set instead.
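As a rough sketch of how right_join() behaves, the example below uses two small made-up tables (bio, species, and bio.species.right are hypothetical names, and the values are toy data rather than the tutorial's data sets).

```r
library(dplyr)

# Toy data made up for illustration; not the tutorial's actual tables
bio <- data.frame(
  name = c("Luke Skywalker", "Chewbacca"),
  height = c(172, 228),
  stringsAsFactors = FALSE
)
species <- data.frame(
  name = c("Chewbacca", "R2-D2"),
  species = c("Wookiee", "Droid"),
  stringsAsFactors = FALSE
)

# Keep every row of the right table (species).
# Luke Skywalker has no match on the right, so that row is dropped.
# R2-D2 has no match on the left, so its height is filled with NA.
bio.species.right <- right_join(bio, species, by = "name")
bio.species.right
```

Swapping the first two arguments and calling left_join() should keep the same set of rows, with the columns in a different order.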
full_join() keeps all rows from left or right, regardless of whether or not there is a matching key.
starwars.bio.species <- full_join(starwars.biodata, starwars.species, by = "name")
starwars.bio.species
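One way to get a feel for the difference between the join functions is to count the rows each one returns on the same pair of tables. The sketch below uses made-up toy tables (bio, species, and row.counts are hypothetical names) in which two characters appear in both tables and each table has one character the other lacks.

```r
library(dplyr)

# Toy data made up for illustration; not the tutorial's actual tables
bio <- data.frame(
  name = c("Luke Skywalker", "Chewbacca", "Han Solo"),
  height = c(172, 228, 180),
  stringsAsFactors = FALSE
)
species <- data.frame(
  name = c("Luke Skywalker", "Chewbacca", "R2-D2"),
  species = c("Human", "Wookiee", "Droid"),
  stringsAsFactors = FALSE
)

# Number of rows returned by each join on the same two tables
row.counts <- c(
  inner = nrow(inner_join(bio, species, by = "name")), # keys in both: 2
  left  = nrow(left_join(bio, species, by = "name")),  # all of bio: 3
  right = nrow(right_join(bio, species, by = "name")), # all of species: 3
  full  = nrow(full_join(bio, species, by = "name"))   # every key once: 4
)
row.counts
```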
What if the key columns have different names in the two data sets? You can specify the equivalency as shown here:
starwars.species
starwars.home
# oops, different column names for the key!
starwars.species.home <- left_join(starwars.species, starwars.home, by = c("name" = "givenName"))
starwars.species.home
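The by argument can also name more than one key column, in which case rows match only when all of the listed columns agree. The sketch below assumes two made-up tables keyed by both character and film (outfits, sabers, and outfits.sabers are hypothetical names, and the values are toy data for illustration only).

```r
library(dplyr)

# Toy data made up for illustration; not the tutorial's actual tables
outfits <- data.frame(
  name = c("Luke Skywalker", "Luke Skywalker", "Han Solo"),
  film = c("A New Hope", "Return of the Jedi", "A New Hope"),
  outfit = c("farm clothes", "black tunic", "vest"),
  stringsAsFactors = FALSE
)
sabers <- data.frame(
  name = c("Luke Skywalker", "Luke Skywalker"),
  film = c("A New Hope", "Return of the Jedi"),
  lightsaber = c("blue", "green"),
  stringsAsFactors = FALSE
)

# A row matches only when BOTH name and film agree
outfits.sabers <- left_join(outfits, sabers, by = c("name", "film"))
outfits.sabers
```

Joining on name alone would pair every Luke Skywalker row in one table with every Luke Skywalker row in the other, which is usually not what is wanted when a single column does not uniquely identify a row.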
We have three data sets containing information on Brassica rapa genes that have been BLASTED against Arabidopsis thaliana to do a rough annotation.
gene.info has information on the genomic location of the B. rapa gene.
blast.info has information on the best hit when the B. rapa gene was BLASTED against Arabidopsis.
At.info has information on Arabidopsis gene function.
Use head to take a look at each of the three data sets.
Create a new data frame that has information on the best BLAST hit for each Brassica gene. Keep all of the Brassica genes in gene.info even if they don’t have information in the blast.info data frame. Discard any blast.info rows that are not matched in gene.info.
Repeat what you did above, but now do an additional join to add in the A. thaliana gene info. Only keep rows where there is a matching entry in both data frames.