Boxplots are often used to show data distributions, and ggplot2 is often used to visualize data. A question that comes up is: what exactly do the box plots represent? The ggplot2 box plots follow the standard Tukey representation, and there are many references for this online and in standard statistical textbooks. The base R function that calculates the box plot limits is boxplot.stats. The help file for this function is very informative, but it’s often non-R users asking what exactly the plot means. Therefore, this blog post breaks down the calculations into (hopefully!) easy-to-follow chunks of code so you can make your own box plot legend if necessary. An additional goal here is to create boxplots that come close to USGS style. Features in this blog post take advantage of enhancements to ggplot2 in version 3.0.0 or later.
First, let’s get some data that might typically be plotted in a USGS report using a boxplot. Here we’ll use chloride data (parameter code “00940”) measured at a USGS station on the Fox River in Green Bay, WI (station ID “04085139”). We’ll use the package dataRetrieval to get the data (see this tutorial for more information on dataRetrieval), and plot a simple boxplot by month using ggplot2:
library(dataRetrieval)
library(ggplot2)
# Get chloride data using dataRetrieval:
chloride
Is that graph great? YES! And for presentations and/or journal publications, that graph might be appropriate. However, for an official USGS report, USGS employees need to get the graphics approved to ensure they follow specific style guidelines. The approving officer would probably come back from the review with the following comments (each followed by the ggplot2 fix):
Remove background color, grid lines: adjust theme
Add horizontal bars to the upper and lower whiskers: add stat_boxplot
Have tick marks go inside the plot: adjust theme
Tick marks should be on both sides of the y axis: add sec.axis to scale_y_continuous
Remove tick marks from discrete data: adjust theme
y-axis needs to start exactly at 0: add expand_limits
y-axis labels need to be shown at 0 and at the upper scale: add breaks and limits to scale_y_continuous
Add very specific legend: create function ggplot_box_legend
Add the number of observations above each boxplot: add custom stat_summary
Change text size: adjust geom_text defaults
Change font (we’ll use “serif” in this blog, although that is not the official USGS font): adjust geom_text defaults
As you can see, it will not be as simple as creating a single custom ggplot theme to comply with the requirements. However, we can string together ggplot commands in a list for easy reuse. This blog is not going to get you perfect compliance with the USGS standards, but it will get much closer. Also, while these style adjustments are tailored to USGS requirements, the process described here may be useful for other graphic guidelines as well.
So, let’s skip to the exciting conclusion and use some code that will be described later (boxplot_framework and ggplot_box_legend) to create the same plot, now closer to those USGS style requirements:
library(cowplot)
# NOTE! This is a preview of the FUTURE!
# We'll create the functions ggplot_box_legend and boxplot_framework
# later in this blog post.
# So....by the end of this post, you will be able to:
legend_plot
As can be seen in the code chunk, we are now using a function ggplot_box_legend to make a legend, boxplot_framework to accommodate all of the style requirements, and the cowplot package to plot them together.
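As a rough illustration of that last step, here is a minimal sketch of placing two ggplot2 objects side by side with cowplot's plot_grid. The data and both plots are made-up placeholders (they are not the functions defined later in this post), and the relative widths are an arbitrary choice.

```r
library(ggplot2)
library(cowplot)

# Placeholder data (synthetic, not USGS data)
set.seed(42)
demo_df <- data.frame(
  group = rep(c("A", "B"), each = 30),
  value = c(rnorm(30, mean = 10), rnorm(30, mean = 12))
)

# Stand-ins for the main boxplot and the legend plot
main_plot <- ggplot(demo_df, aes(x = group, y = value)) +
  geom_boxplot()
legend_stand_in <- ggplot(demo_df[demo_df$group == "A", ],
                          aes(x = group, y = value)) +
  geom_boxplot() +
  theme_void()

# Put them side by side, giving the main plot more room
combined <- plot_grid(main_plot, legend_stand_in,
                      nrow = 1, rel_widths = c(0.6, 0.4))
```

Printing combined draws both panels in one figure.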
Let’s get our style requirements figured out. First, we can set some basic plot elements for a theme. We can start with theme_bw and add to that. Here we remove the grid, set the size of the title, bring the y-axis ticks inside the plotting area, and remove the x-axis ticks:
theme_USGS_box
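For readers who want to experiment, a theme along those lines might look something like the sketch below. The name theme_box_sketch and all of the sizes and margins are assumptions chosen for illustration, not the post's actual theme_USGS_box.

```r
library(ggplot2)

# Hypothetical helper: a guess at the kind of theme described above,
# not the post's actual theme_USGS_box
theme_box_sketch <- function(base_family = "serif", ...) {
  theme_bw(base_family = base_family, ...) +
    theme(
      panel.grid = element_blank(),            # remove grid lines
      plot.title = element_text(size = 8),     # smaller title
      axis.ticks.length = unit(-0.05, "in"),   # negative length: ticks point inward
      axis.text.x = element_text(margin = margin(t = 0.3, unit = "cm")),
      axis.text.y = element_text(margin = margin(r = 0.3, unit = "cm")),
      axis.ticks.x = element_blank()           # no ticks on the discrete axis
    )
}

# Example use on a built-in data set
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  theme_box_sketch()
```

Wrapping the theme in a function keeps the option of passing arguments such as base_size through to theme_bw.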
Next, we can change the defaults of the geom_text to a smaller size and font.
update_geom_defaults("text",
                     list(size = 3,
                          family = "serif"))
We also need to figure out what other ggplot2 functions need to be added. The basic ggplot code for the chloride plot would be:
n_fun
Breaking that code down:
stat_boxplot(geom = "errorbar"): The “errorbars” are used to make the horizontal lines on the upper and lower whiskers. This layer needs to be added first so that it is drawn behind the boxes.
geom_boxplot: Regular boxplot.
stat_summary(fun.data = n_fun, geom = "text", hjust = 0.5): The stat_summary function is very powerful for adding specific summary statistics to the plot. In this case, we are adding a geom_text that is calculated with our custom n_fun. That function returns the number of observations in each box and places the label at 95% of the hard-coded upper limit.
expand_limits: This forces the plot to include 0.
theme_USGS_box: Theme created above to help with grid lines, tick marks, axis size/fonts, etc.
scale_y_continuous: A tricky part of the USGS requirements involves four parts: add ticks to the right side, have at least 4 “pretty” labels on the left axis, remove padding, and have the labels start and end at the beginning and end of the plot. Breaking that down further:
scale_y_continuous(sec.axis = dup_axis()): Handy function to add tick marks to the right side of the graph.
scale_y_continuous(expand = expand_scale(mult = c(0, 0))): Remove padding. (In ggplot2 3.3.0 and later, expand_scale is deprecated in favor of expansion.)
scale_y_continuous(breaks = pretty(c(0, 70), n = 5)): Make pretty label breaks, aiming for about 5 labels when the graph goes from 0 to 70.
scale_y_continuous(limits = c(0, 70)): Ensure the graph goes from 0 to 70.
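The pieces listed above can be strung together roughly as follows. This is only a sketch under a few assumptions: the data are synthetic stand-ins for the chloride measurements, n_fun_sketch is a hypothetical version of the counting function, and expansion() (ggplot2 3.3.0 or later) is used in place of expand_scale().

```r
library(ggplot2)

# Synthetic stand-in for the chloride data (not real USGS values)
set.seed(1)
df <- data.frame(
  month = factor(rep(month.abb[1:3], each = 20), levels = month.abb[1:3]),
  value = runif(60, min = 5, max = 60)
)

upper_limit <- 70  # hard-coded upper limit of the y axis

# Hypothetical counting function: label position and count for one box
n_fun_sketch <- function(x) {
  data.frame(y = 0.95 * upper_limit, label = length(x))
}

p <- ggplot(df, aes(x = month, y = value)) +
  stat_boxplot(geom = "errorbar", width = 0.6) +   # whisker end caps, drawn first
  geom_boxplot() +
  stat_summary(fun.data = n_fun_sketch, geom = "text", hjust = 0.5) +
  expand_limits(y = 0) +
  scale_y_continuous(
    sec.axis = dup_axis(name = NULL, labels = NULL),  # ticks on the right side
    expand = expansion(mult = c(0, 0)),               # no padding
    breaks = pretty(c(0, upper_limit), n = 5),
    limits = c(0, upper_limit)
  )
```

Because fun.data returns both y and label, stat_summary has everything geom_text needs without any extra aesthetics.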
Let’s look at a few other common boxplots to see if there are other ggplot2 elements that would be useful in a common boxplot_framework function.
Logarithmic boxplot
For another example, we might need to make a boxplot with a logarithmic scale. These data are phosphorus measurements on the Pheasant Branch Creek in Middleton, WI.
site 0])
pretty_logs pretty_range[1])
log_index
What are the new features we have to consider for log scales?
stat_boxplot: The stat_boxplot function is the same, but our custom function to calculate counts needs to be adjusted so that the label position is in log units.
scale_y_log10: This is used instead of scale_y_continuous.
annotation_logticks(sides = c("rl")): Adds nice log ticks to the right (“r”) and left (“l”) sides.
prettyLogs: This function forces the y-axis breaks to be on every 10^x. This could be adjusted if a finer scale was needed.
fancyNumbers: This is a custom formatting function for the log axis. This function could be adjusted if other formatting was needed.
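To make those log-scale pieces concrete, here is a small sketch. The helpers log_breaks_sketch, plain_labels_sketch, and n_fun_log_sketch are hypothetical simplifications (they are not the post's prettyLogs, fancyNumbers, or n_fun), and the data are synthetic.

```r
library(ggplot2)

# Hypothetical helper: powers of 10 that bracket the positive data
log_breaks_sketch <- function(x) {
  rng <- range(x[x > 0], na.rm = TRUE)
  10^(floor(log10(rng[1])):ceiling(log10(rng[2])))
}

# Hypothetical helper: plain (non-scientific) labels without trailing zeros
plain_labels_sketch <- function(x) {
  vapply(x, function(v) format(v, scientific = FALSE, drop0trailing = TRUE),
         character(1))
}

# Synthetic stand-in for the phosphorus data
set.seed(7)
phos_df <- data.frame(
  month = factor(rep(month.abb[1:3], each = 25), levels = month.abb[1:3]),
  value = 10^runif(75, min = -1.5, max = 0.5)
)
brks <- log_breaks_sketch(phos_df$value)

# Stats run on log10-transformed values, so the label position is in log units
n_fun_log_sketch <- function(x) {
  data.frame(y = log10(0.9 * max(brks)), label = length(x))
}

p_log <- ggplot(phos_df, aes(x = month, y = value)) +
  stat_boxplot(geom = "errorbar", width = 0.6) +
  geom_boxplot() +
  stat_summary(fun.data = n_fun_log_sketch, geom = "text") +
  scale_y_log10(breaks = brks, labels = plain_labels_sketch,
                limits = range(brks)) +
  annotation_logticks(sides = "rl")
```

The key detail is the log10() in the counting function: the scale transformation is applied before the statistics are computed, so positions returned by the function must already be on the log scale.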
Grouped boxplots
We might also want to make grouped boxplots. In ggplot, it’s pretty easy to add a “fill” to the aes argument. Here we’ll plot temperature distributions at 4 USGS stations. We’ll group the measurements by a “daytime” and “nighttime” factor. Temperature might be a parameter that would not be required to start at 0.
library(dplyr)
# Get water temperature data for a variety of USGS stations
temp_q_data %
renameNWISColumns() %>%
mutate(hourOfDay = as.numeric(format(dateTime, "%H")),
timeOfDay = case_when(hourOfDay 6 ~ "daytime",
TRUE ~ "nighttime" # catchall
))
n_fun
What are the new features we have to consider for grouped boxplots?
stat_summary(position): We need to move the counts so they sit above the boxplots. This is done by dodging them by the same amount as the boxes.
stat_summary(aes(group = timeOfDay)): We need to include how the boxplots are grouped.
scale_fill_discrete: We need to include a fill legend.
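A stripped-down version of a grouped boxplot with dodged counts could look like the sketch below. The data are synthetic, the dodge width of 0.9 is an assumption, and n_fun_sketch is a hypothetical counting function rather than the one used in this post.

```r
library(ggplot2)

# Synthetic stand-in for the temperature data
set.seed(3)
temp_df <- data.frame(
  site = rep(c("Site 1", "Site 2"), each = 40),
  timeOfDay = rep(c("daytime", "nighttime"), times = 40),
  temp = runif(80, min = 5, max = 25)
)

upper_limit <- 30
dodge <- position_dodge(width = 0.9)  # shared by all three layers

# Hypothetical counting function
n_fun_sketch <- function(x) {
  data.frame(y = 0.95 * upper_limit, label = length(x))
}

p_grouped <- ggplot(temp_df, aes(x = site, y = temp, fill = timeOfDay)) +
  stat_boxplot(geom = "errorbar", position = dodge, width = 0.5) +
  geom_boxplot(position = dodge) +
  stat_summary(aes(group = timeOfDay), fun.data = n_fun_sketch,
               geom = "text", position = dodge) +
  scale_fill_discrete(name = "Time of day") +
  scale_y_continuous(limits = c(0, upper_limit))
```

Reusing one position object for the whisker caps, boxes, and labels is a simple way to keep all three aligned.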
Additionally, the parameter name that comes back from dataRetrieval could use some formatting. The following function can fix that for both ggplot2 and base R graphics:
unescape_html ", fancy_chars, "")))
fancy_chars
We’ll use this function in the next section.
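If you only need to handle decimal numeric entities such as "&#176;" (the degree sign), a base R alternative is sketched below. This is not the unescape_html function from this post; decode_entities_sketch is a hypothetical name, and the approach assumes the entities are all of the form "&#NNN;".

```r
# Hypothetical helper: replace decimal HTML entities with their characters
decode_entities_sketch <- function(str) {
  m <- gregexpr("&#[0-9]+;", str)
  regmatches(str, m) <- lapply(regmatches(str, m), function(ents) {
    codes <- as.integer(gsub("[^0-9]", "", ents))  # keep only the digits
    vapply(codes, intToUtf8, character(1))         # code point -> character
  })
  str
}

# Made-up example label
decode_entities_sketch("Water temperature, in &#176;C")
```

Strings without any entities pass through unchanged.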
Framework function
Finally, we can bring all of those elements together into a single list for ggplot2 to use. While we’re at it, we can create a function that is flexible for both linear and logarithmic scales, as well as grouped boxplots. It’s a bit clunky because you need to specify the upper and lower limits of the plot. Many of the USGS style requirements depend on specific upper and lower limits, so I decided this was an acceptable solution for this blog post. There’s almost certainly a slicker way to do that, but for now, it works:
boxplot_framework 0])
pretty_logs pretty_range[1])
log_index
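The core idea, returning a list of ggplot2 components that can be added to a plot with a single +, can be shown with a much smaller function. The sketch below is a hypothetical, linear-scale-only simplification (box_layers_sketch is not boxplot_framework), and it uses expansion(), which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Hypothetical, simplified stand-in for a framework function
box_layers_sketch <- function(upper_limit, lower_limit = 0) {
  n_fun <- function(x) {
    data.frame(y = 0.95 * upper_limit, label = length(x))
  }
  # A plain list of layers, a theme, and a scale can be added with one "+"
  list(
    stat_boxplot(geom = "errorbar", width = 0.6),
    geom_boxplot(),
    stat_summary(fun.data = n_fun, geom = "text", hjust = 0.5),
    theme_bw(),
    scale_y_continuous(
      sec.axis = dup_axis(name = NULL, labels = NULL),
      expand = expansion(mult = c(0, 0)),
      breaks = pretty(c(lower_limit, upper_limit), n = 5),
      limits = c(lower_limit, upper_limit)
    )
  )
}

# Example use on a built-in data set
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  box_layers_sketch(upper_limit = 40)
```

Because the counting function is defined inside the wrapper, it picks up upper_limit from the call, so nothing is hard-coded outside the function.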
Let’s see if it works! Let’s build the last set of example figures using our new function boxplot_framework. I’m also going to use the cowplot package to print them all together. I’ll also include the ggplot_box_legend, which will be described in the next section.
legend_plot
A nontrivial requirement of the USGS boxplot style guidelines is a detailed, prescribed legend. In this section we’ll first verify that ggplot2 boxplots use the same definitions for the lines and dots as boxplot.stats, and then we’ll make a function that creates the prescribed legend. To start, let’s set up random data using the R function sample and then create a function to calculate each value.
set.seed(100)
sample_df
Next, we’ll create a function that calculates the necessary values for the boxplots:
ggplot2_boxplot (quartiles[1]  1.5 * IQR)])
upper_dots (quartiles[3] + 1.5*IQR)]
lower_dots
What are those calculations?
 Quartiles are the 25th, 50th, and 75th percentiles; the 50th percentile is the median.
 The interquartile range (IQR) is the difference between the 75th and 25th percentiles.
 The upper whisker is the maximum value of the data that is within 1.5 times the interquartile range above the 75th percentile.
 The lower whisker is the minimum value of the data that is within 1.5 times the interquartile range below the 25th percentile.
 Outliers are any values more than 1.5 times the interquartile range above the 75th percentile, or more than 1.5 times the interquartile range below the 25th percentile.
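Those definitions can be worked through by hand on a tiny data set. The sketch below uses a made-up vector of nine values, and the variable names are my own choices for illustration.

```r
# Small made-up sample with one obvious outlier
x <- c(1, 3, 4, 5, 6, 7, 8, 9, 30)

# Quartiles: 25th, 50th (median), and 75th percentiles
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)  # 4, 6, 8
iqr <- quartiles[3] - quartiles[1]                                   # 4

# Fences sit 1.5 * IQR beyond the box
upper_fence <- quartiles[3] + 1.5 * iqr   # 14
lower_fence <- quartiles[1] - 1.5 * iqr   # -2

# Whiskers end at the most extreme data values inside the fences
upper_whisker <- max(x[x <= upper_fence])  # 9
lower_whisker <- min(x[x >= lower_fence])  # 1

# Anything outside the fences is drawn as a dot
outliers <- x[x > upper_fence | x < lower_fence]  # 30
```

One caveat: boxplot.stats computes the box edges as Tukey hinges (via fivenum) rather than with quantile. For this sample the two agree, but for some sample sizes they can differ slightly.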
Let’s check that the output matches boxplot.stats:
# Using base R:
base_R_output
## [1] TRUE
# whiskers:
ggplot_output[["upper_whisker"]] == base_R_output[["stats"]][5]
## [1] TRUE
ggplot_output[["lower_whisker"]] == base_R_output[["stats"]][1]
## [1] TRUE
Boxplot Legend
Let’s use this information to generate a legend, and make the code reusable by creating a standalone function that we used in earlier code (ggplot_box_legend). There is a lot of ggplot2 code to digest here. Most of it is style adjustments to approximate the USGS style guidelines for a boxplot legend.
ggplot_box_legend (quartiles[1]  1.5 * IQR)])
upper_dots (quartiles[3] + 1.5*IQR)]
lower_dots 1.5 times and"),
vjust = 0.5) +
geom_text(aes(x = 1.17,
y = ggplot_output[["lower_dots"]],
label = "
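The general technique, drawing a single boxplot and placing text labels at the computed values, can be shown in a much shorter form. The sketch below is a simplified illustration with made-up data, abbreviated label wording, and arbitrary positions; it is not the full ggplot_box_legend function and it does not label the outlier dots.

```r
library(ggplot2)

# Made-up data: 100 random values plus one high value shown as a dot
set.seed(100)
vals <- c(rnorm(100, mean = 50, sd = 10), 110)
box_df <- data.frame(x = "sample", y = vals)

q <- quantile(vals, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# One row per labelled feature, from top to bottom
label_df <- data.frame(
  y = c(max(vals[vals <= q[3] + 1.5 * iqr]), q[3], q[2], q[1],
        min(vals[vals >= q[1] - 1.5 * iqr])),
  label = c("Largest value within 1.5 times\ninterquartile range above\n75th percentile",
            "75th percentile",
            "50th percentile (median)",
            "25th percentile",
            "Smallest value within 1.5 times\ninterquartile range below\n25th percentile")
)

legend_sketch <- ggplot() +
  stat_boxplot(data = box_df, aes(x = x, y = y),
               geom = "errorbar", width = 0.3) +
  geom_boxplot(data = box_df, aes(x = x, y = y), width = 0.3) +
  # The single box sits at x = 1, so labels go just to its right
  geom_text(data = label_df, aes(x = 1.3, y = y, label = label),
            hjust = 0, size = 3) +
  coord_cartesian(xlim = c(0.6, 3)) +
  theme_void()
```

Keeping the label positions in a data frame makes it easy to nudge individual labels when they overlap.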
What’s nice about leaving this in the world of ggplot2 is that it is still possible to use other ggplot2 elements on the plot. For example, let’s add a reporting limit as a horizontal line to the phosphorus graph:
phos_plot_with_DL
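In its simplest form, adding a horizontal reference line is one extra layer. The sketch below uses synthetic data and a hypothetical reporting limit of 10; the line type and color are arbitrary choices.

```r
library(ggplot2)

# Synthetic data and a hypothetical reporting limit
set.seed(11)
demo_df <- data.frame(
  month = rep(c("Jan", "Feb"), each = 20),
  value = runif(40, min = 0, max = 50)
)
reporting_limit <- 10

base_plot <- ggplot(demo_df, aes(x = month, y = value)) +
  geom_boxplot()

# Any existing ggplot object can take the extra layer
plot_with_limit <- base_plot +
  geom_hline(yintercept = reporting_limit,
             linetype = "dashed", color = "red")
```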
I hope you liked my “deep dive” into ggplot2 boxplots. Many of the techniques here can be used to modify other ggplot2 plots.
Source link https://www.rbloggers.com/exploringggplot2boxplotsdefininglimitsandadjustingstyle/