Joins

pre-requisites

To use the functions in this tutorial in your own R session you would need to use library to load the tidyverse package as shown below. I’ve pre-loaded it for this tutorial.

library(tidyverse)

Joining data sets

Often we will have different data sets that need to be combined.

For example we have three different data sets on Star Wars characters:

head(starwars.biodata)
head(starwars.species)
head(starwars.home)

How do we combine these data sets? We CANNOT just cbind them together because the names are not in the same order and not all data sets are complete.

We want to join them by a common key. In this case we can use the character’s name as a key.

Type of joins

The dplyr package in the tidyverse provides us with a number of join functions, four of which will be discussed here.

We choose a function based on how we want non-matching keys to be handled.

In each of these functions the first two arguments are the tables that we want to join and the third argument species which column(s) contain the keys

Inner Join

inner_join() only keeps rows where the key is present in both data sets

starwars.bio.species <- inner_join(starwars.biodata,starwars.species, by = "name")
starwars.bio.species

Left Join

left_join() keeps all rows from the left data set and keeps rows from the right data set where there is a key that matches the left. Rows with non-matching keys in the right data set are discarded.

starwars.bio.species <- left_join(starwars.biodata,starwars.species, by = "name")
starwars.bio.species

There is an analogous right_join

Full join.

full_join() keeps all rows from left or right, regardless of whether or not there is a matching key.

starwars.bio.species <- full_join(starwars.biodata,starwars.species, by = "name")
starwars.bio.species

Specifying keys

What if the key columns have different names in the two data sets? you can specify the equivalency as shown here:

starwars.species
starwars.home #oops different columns names for the key!

starwars.species.home <- left_join(starwars.species, starwars.home, by=c("name"="givenName"))

starwars.species.home

Practice

We have three data sets containing information on Brassica rapa genes that have been BLASTED agains Arabidopsis thaliana to do a rough annotation.

  • gene.info has information on the genomic location of the B. rapa gene.
  • blast.info has info on the best hit when the B. rapa gene was BLASTED again Arabidopsis
  • At.info has information on Arabidopsis gene function.

Examine the data sets

First use head to take a look a each of the three data sets.

Practice 1

Create a new data frame that has information on the best BLAST hit for each Brassica gene. Keep all of the Brassica genes in gene.info even if they don’t have information in the blast.info data frame. Discard any blast.info rows that are not matched in gene.info.

Practice 2

Repeat what you did above, but now do an additional join to add in the A.t. gene info. Only keep rows where there is a matching entry in both data frame.

End

This has been a brief intro to joining. It is a pretty complex topic, for more info, see Relational Data in the R for Data Science book