Boxplots are often used to show data distributions, and ggplot2 is often used to visualize data. A question that comes up is: what exactly do the box plots represent? The ggplot2 box plots follow standard Tukey representations, and there are many references to this online and in standard statistical textbooks. The base R function to calculate the box plot limits is boxplot.stats. The help file for this function is very informative, but it’s often non-R users asking what exactly the plot means. Therefore, this blog post breaks down the calculations into (hopefully!) easy-to-follow chunks of code for you to make your own box plot legend if necessary. An additional goal here is to create boxplots that come close to USGS style. Features in this blog post take advantage of enhancements to ggplot2 in version 3.0.0 or later.
First, let’s get some data that might typically be plotted in a USGS report using a boxplot. Here we’ll use chloride data (parameter code “00940”) measured at a USGS station on the Fox River in Green Bay, WI (station ID “04085139”). We’ll use the package dataRetrieval to get the data (see this tutorial for more information on dataRetrieval), and plot a simple boxplot by month using ggplot2:
library(dataRetrieval)
library(ggplot2)
# Get chloride data using dataRetrieval:
chloride
Is that graph great? YES! And for presentations and/or journal publications, that graph might be appropriate. However, for an official USGS report, USGS employees need to get the graphics approved to ensure they follow specific style guidelines. The approving officer would probably come back from the review with the following comments:
| Reviewer’s comment | ggplot2 element to modify |
|---|---|
| Remove background color, grid lines | Adjust theme |
| Add horizontal bars to the upper and lower whiskers | Add stat_boxplot |
| Have tick marks go inside the plot | Adjust theme |
| Tick marks should be on both sides of the y axis | Add sec.axis to scale_y_continuous |
| Remove tick marks from discrete data | Adjust theme |
| y-axis needs to start exactly at 0 | Add expand_limits |
| y-axis labels need to be shown at 0 and at the upper scale | Add breaks and limits to scale_y_continuous |
| Add very specific legend | Create function ggplot_box_legend |
| Add the number of observations above each boxplot | Add custom stat_summary |
| Change text size | Adjust geom_text defaults |
| Change font (we’ll use “serif” in this blog, although that is not the official USGS font) | Adjust geom_text defaults |
As you can see, it will not be as simple as creating a single custom ggplot theme to comply with the requirements. However, we can string together ggplot commands in a list for easy re-use. This blog is not going to get you perfect compliance with the USGS standards, but it will get much closer. Also, while these style adjustments are tailored to USGS requirements, the process described here may be useful for other graphic guidelines as well.
So, let’s skip to the exciting conclusion and use some code that will be described later (ggplot_box_legend) to create the same plot, now closer to those USGS style requirements:
library(cowplot)
# NOTE! This is a preview of the FUTURE!
# We'll create the functions ggplot_box_legend and boxplot_framework
# later in this blog post.
# So....by the end of this post, you will be able to:
legend_plot
As can be seen in the code chunk, we are now using the function ggplot_box_legend to make a legend, boxplot_framework to accommodate all of the style requirements, and the cowplot package to plot them together.
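As a rough sketch of the cowplot step only, the snippet below places a boxplot next to a placeholder panel with plot_grid. The mtcars data, the placeholder panel, and the relative widths are illustrative assumptions; in the real workflow the second panel would come from ggplot_box_legend.

```r
library(ggplot2)
library(cowplot)

# Main panel: any ggplot boxplot (mtcars is only a stand-in data set).
main_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Placeholder for the legend panel (assumption: a real legend plot
# would be supplied here instead).
legend_placeholder <- ggplot() +
  theme_void() +
  labs(title = "Legend goes here")

# Put the two panels side by side; the widths are illustrative.
combined <- plot_grid(main_plot, legend_placeholder,
                      nrow = 1, rel_widths = c(0.6, 0.4))
```

Printing combined draws both panels in one figure.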
Let’s get our style requirements figured out. First, we can set some basic plot elements for a theme. We can start with theme_bw and add to that. Here we remove the grid, set the size of the title, bring the y-ticks inside the plotting area, and remove the x-ticks:
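A minimal sketch of such a theme might look like the following. The name theme_USGS_box matches the one used later in this post, but the specific sizes and margins are assumptions chosen for illustration, not official USGS values.

```r
library(ggplot2)

# Sketch of a USGS-like boxplot theme built on theme_bw.
# All numeric sizes and margins below are illustrative assumptions.
theme_USGS_box <- function(base_family = "serif", ...) {
  theme_bw(base_family = base_family, ...) +
    theme(
      panel.grid = element_blank(),           # remove grid lines
      plot.title = element_text(size = 8),    # smaller title
      axis.ticks.length = unit(-0.05, "in"),  # negative length: ticks point inward
      # extra margin keeps the labels clear of the inward ticks
      axis.text.y = element_text(margin = margin(r = 0.3, unit = "cm")),
      axis.text.x = element_text(margin = margin(t = 0.3, unit = "cm")),
      axis.ticks.x = element_blank()          # no ticks on the discrete axis
    )
}
```

The theme is added to a plot like any other, for example `ggplot(...) + geom_boxplot() + theme_USGS_box()`.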
Next, we can change the geom_text defaults to a smaller size and a serif font.
update_geom_defaults("text", list(size = 3, family = "serif"))
We also need to figure out what other ggplot2 functions need to be added. The basic ggplot code for the chloride plot would be:
Breaking that code down:
| ggplot2 element | Description |
|---|---|
| stat_boxplot(geom = "errorbar") | The “errorbars” are used to make the horizontal lines on the upper and lower whiskers. This needs to happen first so it is in the back of the plot. |
| stat_summary(fun.data = n_fun, geom = "text", hjust = 0.5) | The stat_summary function adds the number of observations above each boxplot, as text computed by the custom n_fun function. |
| expand_limits | This forces the plot to include 0. |
| theme_USGS_box | Theme created above to help with grid lines, tick marks, axis size/fonts, etc. |
| scale_y_continuous | A tricky part of the USGS requirements involves 4 parts: add ticks to the right side, have at least 4 “pretty” labels on the left axis, remove padding, and have the labels start and end at the beginning and end of the plot. Breaking that down further: |
| scale_y_continuous(sec.axis = dup_axis()) | Handy function to add tick marks to the right side of the graph. |
| scale_y_continuous(expand = expand_scale(mult = c(0, 0))) | Removes padding. (In ggplot2 3.3.0 and later, expand_scale is deprecated in favor of expansion.) |
| scale_y_continuous(breaks = pretty(c(0,70), n = 5)) | Makes pretty label breaks for a graph that goes from 0 to 70. Note that n is a target for pretty, not a guarantee, so the number of labels may differ from 5. |
| scale_y_continuous(limits = c(0,70)) | Ensures the graph goes from 0 to 70. |
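To see how the pieces in the table fit together, here is a hedged sketch on simulated data. The data frame chloride_sim and its values are made up (so no download is needed), the n_fun shown is one plausible way to place the counts, and `expand = c(0, 0)` is used as a version-independent way to remove padding.

```r
library(ggplot2)

# Simulated stand-in for the chloride data (values are made up).
set.seed(1)
chloride_sim <- data.frame(
  month = factor(rep(month.abb, each = 10), levels = month.abb),
  result_va = runif(120, min = 5, max = 60)
)

# Assumed helper: put the count near the top of a plot that runs 0 to 70.
n_fun <- function(x) {
  data.frame(y = 0.95 * 70, label = length(x))
}

p <- ggplot(chloride_sim, aes(x = month, y = result_va)) +
  stat_boxplot(geom = "errorbar", width = 0.6) +    # whisker caps, drawn first
  geom_boxplot(width = 0.6, fill = "lightgrey") +
  stat_summary(fun.data = n_fun, geom = "text", hjust = 0.5) +
  expand_limits(y = 0) +
  scale_y_continuous(
    sec.axis = dup_axis(labels = NULL, name = NULL), # ticks on the right side
    expand = c(0, 0),                                # no padding
    breaks = pretty(c(0, 70), n = 5),
    limits = c(0, 70)
  )
```

Adding a theme such as theme_USGS_box to p would complete the styling.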
Let’s look at a few other common boxplots to see if there are other ggplot2 elements that would be useful in a common boxplot_framework function.
For another example, we might need to make a boxplot with a logarithmic scale. This data is for phosphorus measurements on the Pheasant Branch Creek in Middleton, WI.
site 0]) pretty_logs pretty_range) log_index
What are the new features we have to consider for log scales?
| ggplot2 element | Description |
|---|---|
| scale_y_log10 | This is used instead of scale_y_continuous. |
| annotation_logticks(sides = c("rl")) | Adds nice log ticks to the right (“r”) and left (“l”) side. |
| prettyLogs | This function forces the y-axis breaks to be on every 10^x. This could be adjusted if a finer scale was needed. |
| fancyNumbers | This is a custom formatting function for the log axis. This function could be adjusted if other formatting was needed. |
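The two helpers described above could be approximated along these lines. The names pretty_log_breaks and power10_labels are hypothetical stand-ins, not the prettyLogs and fancyNumbers functions themselves, and the sketch assumes the data contain at least one positive value.

```r
# Hypothetical helper: breaks at every power of 10 spanning the positive data.
pretty_log_breaks <- function(x) {
  rng <- range(x[x > 0], na.rm = TRUE)
  10^(floor(log10(rng[1])):ceiling(log10(rng[2])))
}

# Hypothetical helper: labels such as 10^-2, returned as plotmath
# expressions so the exponent can be drawn as a superscript.
power10_labels <- function(breaks) {
  parse(text = paste0("10^", round(log10(breaks))))
}

breaks <- pretty_log_breaks(c(0.02, 0.5, 3))
labels <- power10_labels(breaks)
```

These could then be passed to a log scale, for example `scale_y_log10(breaks = breaks, labels = labels)`.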
We might also want to make grouped boxplots. In ggplot, it’s pretty easy to add a “fill” to the aes argument. Here we’ll plot temperature distributions at 4 USGS stations. We’ll group the measurements by a “daytime” and “nighttime” factor. Temperature might be a parameter that would not be required to start at 0.
library(dplyr) # Get water temperature data for a variety of USGS stations temp_q_data % renameNWISColumns() %>% mutate(hourOfDay = as.numeric(format(dateTime, "%H")), timeOfDay = case_when(hourOfDay 6 ~ "daytime", TRUE ~ "nighttime" # catchall )) n_fun
What are the new features we have to consider for grouped boxplots?
| ggplot2 element | Description |
|---|---|
| stat_summary(position) | We need to move the counts to above the boxplots. This is done by shifting them the same amount as the width. |
| stat_summary(aes(group=timeOfDay)) | We need to include how the boxplots are grouped. |
| scale_fill_discrete | We need to include a fill legend. |
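A small sketch of the grouped case, using simulated temperatures rather than real station data, shows how the counts can be dodged by the same width as the boxes. The data frame, the label height of 28, and the width of 0.8 are all assumptions for illustration.

```r
library(ggplot2)

# Simulated temperatures at two made-up sites, split by time of day.
set.seed(2)
temp_sim <- data.frame(
  site = rep(c("A", "B"), each = 40),
  timeOfDay = rep(c("daytime", "nighttime"), times = 40),
  temp = rnorm(80, mean = 15, sd = 3)
)

# Assumed helper: draw the count at a fixed height above the boxes.
n_fun <- function(x) data.frame(y = 28, label = length(x))

box_width <- 0.8
p <- ggplot(temp_sim, aes(x = site, y = temp, fill = timeOfDay)) +
  stat_boxplot(geom = "errorbar", width = box_width) +
  geom_boxplot(width = box_width) +
  # group the counts like the boxes and dodge them by the same width
  stat_summary(aes(group = timeOfDay), fun.data = n_fun, geom = "text",
               position = position_dodge(width = box_width)) +
  scale_fill_discrete(name = "Time of day")
```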
Additionally, the parameter name that comes back from dataRetrieval could use some formatting. The following function can fix that for both ggplot2 and base R graphics:
unescape_html ", fancy_chars, ""))) fancy_chars
We’ll use this function in the next section.
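For readers who want a self-contained illustration of the idea, here is a base R sketch. It assumes the parameter name contains numeric HTML entities such as `&#176;` for the degree sign; the function name decode_numeric_entities is hypothetical and this is not the unescape_html function above.

```r
# Hypothetical helper: replace numeric HTML entities (e.g. "&#176;")
# with the characters they encode, using base R only.
decode_numeric_entities <- function(str) {
  matches <- gregexpr("&#[0-9]+;", str)
  regmatches(str, matches) <- lapply(
    regmatches(str, matches),
    function(entities) {
      codes <- as.integer(gsub("[^0-9]", "", entities))  # keep the digits only
      vapply(codes, intToUtf8, character(1))             # code point -> character
    }
  )
  str
}

decoded <- decode_numeric_entities("Temperature, water, degrees Celsius (&#176;C)")
```

Strings without any entities are returned unchanged.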
Finally, we can bring all of those elements together into a single list for ggplot2 to use. While we’re at it, we can create a function that is flexible for both linear and logarithmic scales, as well as grouped boxplots. It’s a bit clunky because you need to specify the upper and lower limits of the plot. Many of the USGS style requirements depend on specific upper and lower limits, so I decided this was an acceptable solution for this blog post. There’s almost certainly a slicker way to do that, but for now, it works:
boxplot_framework 0]) pretty_logs pretty_range) log_index
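The general pattern, returning a list of layers that can be added to any plot with `+`, can be sketched in a much reduced form. The function name box_layers_sketch and its arguments are hypothetical, and it omits most of the styling that the full function handles.

```r
library(ggplot2)

# Hypothetical, reduced illustration of the list-of-layers pattern.
box_layers_sketch <- function(upper_limit, lower_limit = 0, logY = FALSE) {
  if (logY) {
    # a log scale cannot start at 0, so a positive lower limit is required
    stopifnot(lower_limit > 0)
    y_scale <- scale_y_log10(limits = c(lower_limit, upper_limit),
                             expand = c(0, 0))
  } else {
    y_scale <- scale_y_continuous(
      limits = c(lower_limit, upper_limit),
      breaks = pretty(c(lower_limit, upper_limit), n = 5),
      expand = c(0, 0),
      sec.axis = dup_axis(labels = NULL, name = NULL)
    )
  }
  # ggplot2 accepts a list of components on the right-hand side of `+`
  list(
    stat_boxplot(geom = "errorbar", width = 0.6),
    geom_boxplot(width = 0.6),
    y_scale,
    theme_bw()
  )
}

sim <- data.frame(group = rep(c("a", "b"), each = 20), value = rep(1:8, 5))
p <- ggplot(sim, aes(x = group, y = value)) +
  box_layers_sketch(upper_limit = 10)
```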
Let’s see if it works! Let’s build the last set of example figures using our new function boxplot_framework. I’m also going to use the ‘cowplot’ package to print them all together. I’ll also include the ggplot_box_legend which will be described in the next section.
A non-trivial requirement of the USGS boxplot style guidelines is to make a detailed, prescribed legend. In this section we’ll first verify that ggplot2 boxplots use the same definitions for the lines and dots, and then we’ll make a function that creates the prescribed legend. To start, let’s set up random data using the R function sample and then create a function to calculate each value.
Next, we’ll create a function that calculates the necessary values for the boxplots:
ggplot2_boxplot (quartiles - 1.5 * IQR)]) upper_dots (quartiles + 1.5*IQR)] lower_dots
What are those calculations?
- Quartiles are the 25th, 50th, and 75th percentiles; the 50th percentile is the median.
- The interquartile range (IQR) is the difference between the 75th and 25th percentiles.
- The upper whisker is the maximum value of the data that is within 1.5 times the interquartile range above the 75th percentile.
- The lower whisker is the minimum value of the data that is within 1.5 times the interquartile range below the 25th percentile.
- Outliers are any values more than 1.5 times the interquartile range above the 75th percentile, or more than 1.5 times the interquartile range below the 25th percentile.
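The definitions above can be sketched in base R as follows. The function name box_values and the small deterministic data set are assumptions for illustration (the post itself uses random data from sample); the whiskers and outliers are then compared with boxplot.stats.

```r
# Hypothetical helper that mirrors the bullet definitions above.
box_values <- function(x) {
  quartiles <- as.numeric(quantile(x, probs = c(0.25, 0.5, 0.75)))
  iqr <- quartiles[3] - quartiles[1]
  upper_fence <- quartiles[3] + 1.5 * iqr
  lower_fence <- quartiles[1] - 1.5 * iqr
  list(
    quartiles = quartiles,
    upper_whisker = max(x[x <= upper_fence]),  # largest value inside the fence
    lower_whisker = min(x[x >= lower_fence]),  # smallest value inside the fence
    upper_dots = x[x > upper_fence],           # high outliers
    lower_dots = x[x < lower_fence]            # low outliers
  )
}

# Deterministic example: 101 values with one low and one high outlier.
x <- c(-350, 1:99, 400)
vals <- box_values(x)
base_out <- boxplot.stats(x)

# Compare whiskers and outliers with base R.
whiskers_match <- vals$upper_whisker == base_out$stats[5] &&
  vals$lower_whisker == base_out$stats[1]
outliers_match <- setequal(c(vals$lower_dots, vals$upper_dots), base_out$out)
```

For this data both comparisons are TRUE. Note that boxplot.stats computes its hinges with fivenum, which can differ slightly from the default quantile quartiles for some sample sizes, so a sample size was chosen here where the two agree.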
Let’s check that the output matches boxplot.stats:
# Using base R: base_R_output
## [1] TRUE
# whiskers:
ggplot_output[["upper_whisker"]] == base_R_output[["stats"]][5]
## [1] TRUE
ggplot_output[["lower_whisker"]] == base_R_output[["stats"]][1]
## [1] TRUE
Let’s use this information to generate a legend, and make the code reusable by creating a standalone function that we used in earlier code (ggplot_box_legend). There is a lot of ggplot2 code to digest here. Most of it is style adjustments to approximate the USGS style guidelines for a boxplot legend.
ggplot_box_legend (quartiles - 1.5 * IQR)]) upper_dots (quartiles + 1.5*IQR)] lower_dots 1.5 times and"), vjust = 0.5) + geom_text(aes(x = 1.17, y = ggplot_output[["lower_dots"]], label = "
What’s nice about leaving this in the world of ggplot2 is that it is still possible to use other ggplot2 elements on the plot. For example, let’s add a reporting limit as horizontal lines to the phosphorus graph:
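As a hedged illustration of that idea, the sketch below adds a dashed horizontal line with geom_hline to a log-scale boxplot. The simulated data and the reporting limit of 0.05 are made-up values, not the actual phosphorus data or limit.

```r
library(ggplot2)

# Simulated stand-in for the phosphorus data (values are made up).
set.seed(3)
phos_sim <- data.frame(
  month = factor(rep(month.abb[1:4], each = 15), levels = month.abb[1:4]),
  result_va = exp(rnorm(60, mean = -2, sd = 1))
)

reporting_limit <- 0.05  # hypothetical reporting limit

p <- ggplot(phos_sim, aes(x = month, y = result_va)) +
  geom_boxplot() +
  scale_y_log10() +
  # an ordinary ggplot2 layer can be stacked on top of the boxplot
  geom_hline(yintercept = reporting_limit, linetype = "dashed", color = "red")
```

Several limits can be drawn at once by passing a vector to yintercept.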
I hope you liked my “deep dive” into ggplot2 boxplots. Many of the techniques here can be used to modify other ggplot2 plots.