ggplot tutorial

ggplot

ggplot intro

The ggplot2 package is a package that allows you to rapidly make impressive plots from complex data sets.

In addition to this tutorial there are several excellent resources available on the web:

pre-requisites

ggplot is part of the tidyverse, so to use these functions in your own R session you would need to load that library. I’ve pre-loaded it for this tutorial. Otherwise, you would need to enter

library(tidyverse)

(You can also just use library(ggplot2) if you don’t want/need the other tidyverse packages)

A first plot

We again work with the Tomato data set. It has been preloaded for this tutorial.

head(tomato)

To remind you, it has the following columns:

  • shelf, flat, col, row: information about where each plant was grown
  • acs: the accession number for that strain
  • trt: “H” is light with a high red:far-red ratio; simulated sun. “L” is light with a low red:far- red ratio; simulated plant shade.
  • days: days since germination
  • date: date of measurement
  • hyp: hypocotyl length (embryonic stem)
  • int1-int4: individual internode lengths. Internodes are stem segments
  • leafnum: which leaf was measured
  • petleng: petiole length
  • leafleng: leaf length
  • leafwid: leaf width
  • ndvi: a measure of plant vegetation density
  • lat: latitude of origin
  • lon: longitude of origin
  • alt: altitude of origin
  • species: species of plant
  • who: who measured the plants

Let us begun by asking if there is a relationship between the length of the petiole (leaf stem) and the leaf of the leaf blade:

ggplot(data=tomato, 
       mapping = aes(x=petleng,y=leafleng)) +
  geom_point()

Yes, it looks like there is.

Let’s look at each line of the code above:

  • ggplot(data=tomato, ggplot is the main function and initiates the plot. The data argument tells ggplot what data set to work on.
  • mapping = aes(x=petleng,y=leafleng) + Here we are telling ggplot which columns in our data set should be mapped to particular plot AESthetics. The argument is called mapping and the input to mapping is the aes() function. We use aes() to tell ggplot that petiole length should be mapped to the x axis and leaf length should be mapped to the y axis.
  • notice that the line that beings with mapping ends with a + . This tells R that we want to add to the plot created by the ggplot function.
  • geom_point(). Geoms (geometries) indicate what type of plot we want to make.

We will look at these functions in more detail as the tutorial continues…

Note that since we loaded tidyverse we could also pipe our data into ggplot:

tomato %>% ggplot(mapping = aes(x=petleng,y=leafleng)) +
  geom_point()

Aesthetics

In this section we will explore aesthetics. As noted above, aesthetics control the relationship between your data and plot elements.

color

One common aesthetic is the color aesthetic.

Change the code below so that color is mapped to the treatment (trt) column, therby re-creating the plot shown above:

tomato %>% ggplot(mapping = aes(x=petleng,y=leafleng)) +
  geom_point()

shape

The shape aesthetic controls the shape of the plotted points.

25 shapes are available:

Note: color of the fill for shapes 21-24 can be controlled with the fill aesthetics. the color of the rest of the shapes, as well as the border of shapes 21-24 is controlled with the color aesthetic

Create a plot of int3 vs int4 where color indicates trt, and shape indicates who measured the plant.

Your plot should look like this:

size

The size aesthetic controls the size of the plotted points.

To practice, create a plot of latitude vs longitude where altitude is indicated by the size of the point and species is indicated by color

setting plot characteristics without mapping

What if you want to change a plot characteristic but not have it mapped to a data column? You can do this by setting the characteristic in the geom call, but outside of the aes function:

tomato %>% ggplot(mapping = aes(x=petleng,y=leafleng)) +
  geom_point(color="skyblue")
The different plotting shapes are numbered from 0 to 25. Change the code below to use shape 2 (do not use quotes around the 2). Feel free to explore the other shapes.
tomato %>% ggplot(mapping = aes(x=petleng,y=leafleng)) +
  geom_point(color="skyblue")

more aesthetics

There are many more aesthetics available, depending on the geom used. Some of these will be introduced as you learn about additional geoms.

Geoms

Geoms control the type of plot that is made. You have already seen one geom, geom_point.

geom_smooth()

geom_smooth allows you to add trend lines to your plots, for example:

tomato %>% ggplot(aes(x=lon, y = lat)) + 
  geom_smooth()

But wait, what if you also want the original data points? We can add multiple geoms to a plot:

tomato %>% ggplot(aes(x=lon, y = lat)) + 
  geom_smooth() +
  geom_point()

By default geom_smooth fits a smoothed line to the data. But you can also show a best-fit, straight linear regression. To do this we tell geom_smooth to use the “lm” (linear model) function:

tomato %>% ggplot(aes(x=lon, y = lat)) + 
  geom_smooth(method="lm") +
  geom_point()

Make a scatter plot of int3 vs int4 as you have before, but this time add a trendline.

geom_histogram() and geom_density()

geom_histogram() creates histograms. For histograms, values for the y-axis are calculated for you, so we just provide a x aesthetic:

tomato %>% ggplot(aes(x=hyp)) +
  geom_histogram()

Histograms (and many other plots) can use the fill aesthetic to control the color used to fill the bars (or other shapes).

tomato %>% ggplot(aes(x=hyp)) +
  geom_histogram(fill="red")
Density plots are related to histograms. The relevant geom is geom_density(). Make a density plot below:

How would you describe the difference between a density plot and a histogram?

One nice thing about density plots is that we can compare the densities of different subsets of the data:

tomato %>% ggplot(aes(x=hyp, fill=trt)) +
  geom_density(alpha=.5)

What is alpha doing? Experiment with different values; the allowable range is 0 to 1.

tomato %>% ggplot(aes(x=hyp, fill=trt)) +
  geom_density(alpha=.5)

Alpha can be used in most geoms.

geom_boxplot() and geom_violin()

Boxplots and violin plots provide quick summaries of different classes of data. Suppose we want to examine hypocotyl length of each species. We can map hypocotyl length to the y-axis and species to the x-axis.

tomato %>% ggplot(aes(x=species, y=hyp)) +
  geom_boxplot()

In a boxplot the horizontal line represents the median. Look at the help for geom_boxplot to determine what other components represent:

A related geom is geom_violin()

Remake the above plot using geom_violin.

  • Do you like the box or violin plot better?
  • What does the width of the “violin” represent?
  • Which is more informative about the distribution of the data?

Test your skills

Make a boxplot showing hypocotyl length for the “H” and “L” treatments

more geom_boxplot()

If we add a color or fill aesthetic to a box or violin plot then we can start comparing multiple factors in our data.

Use the coding box below to re-create this plot:

Look at the plots to figure out the aesthetics: what is mapped to x, y, and fill? Once you know that you should be able to code it up!

What does this plot illustrate?

geom_col()

geom_col() allows you to make a classic bar chart, where the height of the bars corresponds to some value in the data. This works best for data summaries.

First let’s summarize our data:

sem <- function(x, na.rm=FALSE) {
  sd(x,na.rm=na.rm)/sqrt(length(na.omit(x)))
}

int3.mean.sem <- tomato %>%
  group_by(species, trt) %>%
  summarize(mean=mean(int3, na.rm=TRUE), sem=sem(int3, na.rm=TRUE))

int3.mean.sem
int3.mean.sem %>% ggplot(aes(x=species, y = mean, fill=trt)) +
  geom_col()

by default geom_col stacks the columns…perhaps not what we want. We can change that with position

int3.mean.sem %>% ggplot(aes(x=species, y = mean, fill=trt)) +
  geom_col(position="dodge")

what if we want to add error bars? we use geom_errorbar and the ymin and max aesthetics

int3.mean.sem %>% ggplot(aes(x=species, 
                             y = mean, 
                             fill=trt, 
                             ymax=mean+sem, 
                             ymin=mean-sem)) +
  geom_col(position="dodge") +
  geom_errorbar(position = position_dodge(width=0.9), width=.5)

Your turn…

Make a bar chart that shows average leaf length for each accession (acs) and trt combination.

geom_line

data appropriate for bar charts also can be plotted using lines:

int3.mean.sem %>% ggplot(aes(x=species,
                             y=mean,
                             color=trt,
                             group=trt,
                             shape=trt,
                             ymax=mean+sem,
                             ymin=mean-sem)) +
  geom_line() +
  geom_errorbar(width=.1) +
  geom_point()

It actually doesn’t make a lot of sense to plot this data that way. However, plotting each species’ reaction to the treatment would. Modify the above code to make this plot:

more geoms

There are several more geoms available. You can check the docs to see a listing.

Scales

ggplot does a nice job of automatically defining the scales, but what if you want something different? we add a call to a scale() function.

Consider this bar chart again:

What if we want the fill colors to be something else? We use scale_fill_manual()

int3.mean.sem %>% ggplot(aes(x=species, y = mean, fill=trt, ymax=mean+sem, ymin=mean-sem)) +
  geom_col(position="dodge") +
  geom_errorbar(position = position_dodge(width=0.9), width=.5) +
  scale_fill_manual(values = c("H"="darkblue","L"="red"))

Note: you can get a list of possible colors with colors()

Change the colors in the plot below so that Dan and Pepe are different from the defaults. You can choose colors of your liking:

tomato %>% ggplot(aes(x=species, y = hyp, fill=who)) +
  geom_violin()

Most aesthetics have similar scale commands that allow you to adjust how they are used. See scales

A particular useful one is scale_y_log10() that transforms the y-axis scale (there is an equivalent scale_x_log10()

Facets

You have seen that one way to split your data by categories is to map a categorical variable to an aesthetic. e.g. the code below separates the data into “H” and “L” treatments before making the density plot.

tomato %>% ggplot(aes(x=int3, fill=trt)) +
  geom_density(alpha=.5)

A second way to do this is to facet your data using facet_wrap() or facet_grid().

facet_wrap() uses a single variable for faceting and you can specify the number of rows or columns used in the layout.

tomato %>% ggplot(aes(x=int3)) +
  geom_density(fill="lightblue") +
  facet_wrap(~ trt)

Modify the code below so that the facets are arranged in columns instead of rows (hint, look at the help page for facet_wrap)

tomato %>% ggplot(aes(x=hyp)) +
  geom_density(fill="papayawhip") +
  facet_wrap(~ trt)

facet_grid() can use two variables to facet and uses those variable to specify the grid of rows and columns:

tomato %>% ggplot(aes(x=int3)) +
  geom_histogram(fill="lawngreen") +
  facet_grid(who ~ trt)

Practice by recreating the plot shown below:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Titles and Labels

The plot titles and labels are easily changed. Starting with this plot:

we can use ggtitle to add a main title to the plot, and xlab() and ylab() to change the axis labels.

tomato %>% ggplot(aes(x=species,y=int3,fill=trt)) +
  geom_boxplot() +
  ylab("Internode 3 (mm)")
Modify the code below to further update the plot to add a title and to change the x-axis label to “Tomato Species”
tomato %>% ggplot(aes(x=species,y=int3,fill=trt)) +
  geom_boxplot() +
  ylab("Internode 3 (mm)")

There are many more manipulations to labels, as detailed in “Titles”, “Axes” and “Legends” sections of the Cookbook for R.

Saving plots

If you want to save your plot to an external file you can use ggsave(). This will save the most recent plot to the path that you specify. R will figure out the appropriate file type from the file extension (pdf, png, jpg, tif). You can also specify height and width.

tomato %>% ggplot(aes(x=species,y=int3,fill=trt)) +
  geom_boxplot() +
  ylab("Internode 3 (mm)")
ggsave("~/Desktop/Internode3.pdf", height=6, width = 6)

End

This is the end of the tutorial.

As noted at the beginning, there are several sources for additional information, including: