## Introduction

*R* makes it easy to calculate various data descriptives, as covered in the dispersion tutorial; however, most people find those descriptives easier to understand when the data are presented graphically. Fortunately, *R* has a great graphic tool for visualizing data descriptives: the boxplot (sometimes called a “Box and Whisker” plot). A boxplot graphically illustrates Q1, the median, Q3, outlier boundaries, and outliers (if any are present).

## About Visualizations

This site includes several tutorials that are focused on data visualization because it is a critically important tool for analysis. Visualizations are useful in two different phases of the analysis process: exploration and explanation. In the exploration phase, researchers are looking for interesting relationships in the data. Those relationships are often difficult to detect in a table full of numbers, but a visualization can make them clear at a glance. As an example, here are two ways to look at the *Volume* vector in the *trees* data frame.

```
## [1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2 21.0 21.4 21.3
## [15] 19.1 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3 38.3 42.6 55.4 55.7 58.3
## [29] 51.5 51.0 77.0
```

The output above shows the measured volume for 31 black cherry trees. Researchers looking at these numbers alone would not be able to detect very much. However, a simple boxplot reveals a few interesting details, such as the presence of one upper outlier and that the data are positively skewed (the dark “median” line is low in the box).
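For readers who want to reproduce both views, the following is a minimal sketch using the built-in *trees* data frame. The title text and the variable name `volume_outliers` are illustrative choices, not taken from the original figure.

```r
# Print the raw Volume values, then draw a boxplot of the same vector.
print(trees$Volume)

boxplot(trees$Volume, main = "Black Cherry Tree Volume")

# boxplot.stats() reports the values that are drawn as outlier circles.
volume_outliers <- boxplot.stats(trees$Volume)$out
print(volume_outliers)
```

The single value stored in `volume_outliers` corresponds to the one circle drawn above the upper whisker.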

Visualizations like this make it easy to detect patterns that are not obvious from the data table, and researchers commonly use these types of visualizations in the *exploratory* phase of analysis. In the *explanatory* phase, where research findings are revealed to the general public, different visualizations that are easier to understand are more appropriate. Researchers must carefully consider the many types of visualizations and which are most useful for exploration or explanation to be certain that the visualizations help rather than hinder understanding.

## Boxplots

Following is the summary data for *hp* from the *mtcars* data frame along with the boxplot for that same data.

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    52.0    96.5   123.0   146.7   180.0   335.0
```

In the boxplot, the median is indicated by a dark line at 123, Q1 is 96.5 (the lower edge of the box) and Q3 is 180 (the upper edge of the box). The following equations show how the “whiskers” are calculated. They indicate the limits for outliers, so any data that lie outside those whiskers are outliers and are indicated by a small circle on the boxplot.

\[ \begin{aligned} IQR &= Q3 - Q1 = 180.0 - 96.5 = 83.5 \\ LowerBoundary &= Q1 - (1.5 * IQR) \\ LowerBoundary &= 96.5 - (1.5 * 83.5) \\ LowerBoundary &= 96.5 - 125.25 \\ LowerBoundary &= -28.75 \end{aligned} \]

Since the smallest value in the vector, 52, is larger than the calculated lower boundary, -28.75, the lower whisker is placed at 52.

\[ \begin{aligned} UpperBoundary &= Q3 + (1.5 * IQR) \\ UpperBoundary &= 180.0 + (1.5 * 83.5) \\ UpperBoundary &= 180.0 + 125.25 \\ UpperBoundary &= 305.25 \end{aligned} \]

Since the calculated upper boundary, 305.25, is smaller than the largest value in the vector, 335.0, the upper whisker is placed at the largest data value that is smaller than or equal to 305.25, or 264 for this data vector.

The circle above the boxplot represents an outlier, which is 335 in this vector. If the data are normally distributed then the whiskers will usually enclose all values in the vector and outliers will be rare.
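The boundary and whisker arithmetic can be checked directly in *R*. This is a minimal sketch that uses `quantile()` for the quartiles so the numbers match the `summary()` output; all variable names are illustrative.

```r
hp <- mtcars$hp

# Quartiles as reported by summary(); unname() drops the "25%"/"75%" labels.
q1 <- unname(quantile(hp, 0.25))
q3 <- unname(quantile(hp, 0.75))
iqr <- q3 - q1

lower_boundary <- q1 - 1.5 * iqr
upper_boundary <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme data values inside the boundaries.
lower_whisker <- min(hp[hp >= lower_boundary])
upper_whisker <- max(hp[hp <= upper_boundary])

# Anything beyond the boundaries is drawn as an outlier circle.
outliers <- hp[hp < lower_boundary | hp > upper_boundary]

print(c(lower_boundary = lower_boundary, upper_boundary = upper_boundary))
print(c(lower_whisker = lower_whisker, upper_whisker = upper_whisker))
print(outliers)
```

Note that `boxplot()` itself takes the box edges from `fivenum()`, whose "hinges" can differ slightly from the `quantile()` quartiles, so treat this as an illustration of the arithmetic rather than the function's exact internals.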

### Demonstration: Boxplots

The following script generates four different boxplots for four different variables in the *rock* data frame.

- Line 2: Print the summary information for the *area* vector.
- Line 3: Create the boxplot for *area*. Note that the boxplot includes a `main` attribute which adds a title above the boxplot.
- Lines 5-15: These are repetitions of Lines 1-3 for three other vectors in the *rock* data frame.
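The script itself is not reproduced on this page, so the following is a sketch of what such a script could look like, laid out so that its line numbers match the list above. The comment lines, the order of the variables, and the `main` titles are assumptions made for illustration.

```r
# Area
summary(rock$area)
boxplot(rock$area, main = "Area")

# Perimeter
summary(rock$peri)
boxplot(rock$peri, main = "Perimeter")

# Shape
summary(rock$shape)
boxplot(rock$shape, main = "Shape")

# Permeability
summary(rock$perm)
boxplot(rock$perm, main = "Permeability")
```

When the script is run interactively, each `summary()` call prints its result to the console and each `boxplot()` call draws a new plot.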

This demonstration contains four different examples of generating a boxplot.

The boxplots are displayed in the *Plots* tab, but because of the size of the interface those plots are “squished” and impossible to read. Click the double-headed arrow button on the *Plots* tab to open the graph in a larger window for evaluation and copying to a document. If the graphic does not open in a larger window then temporarily pause the browser’s pop-up blocker.

Given these four boxplots, *rock$peri* has the most symmetrical data since the box and whiskers are fairly equally distributed around the median, *rock$perm* is the most skewed since so much of the plot is above the median, and *rock$shape* has three upper outliers.

### Skill Check: Boxplot

Using the *iris* data frame, create a boxplot for *Sepal.Length*.

## Outliers

It is useful to consider data observations that are far outside the “normal” in any given vector. For example, imagine a neighborhood where the houses all cost about $150,000. Suppose someone wins a lottery and decides to build a $500,000 house in that same neighborhood. The value of that house would be an outlier in a vector that contains the house values in the neighborhood; that is, it would be outside the “average” house value. Outliers are important when discussing data since they tend to skew certain types of measures.

Statistically, outliers are defined as values that lie outside boundaries that are 1.5 times the Inter-Quartile Range (IQR) below the first quartile or above the third quartile. *R* includes a function that displays the values used to create a boxplot, including outliers, so the values of any outliers can be easily determined. As an example, consider the boxplot for *rock$shape* that was created above.

### Demonstration: Outliers

To determine what values *R* used to generate a boxplot, the command `boxplot.stats(rock$shape)` is executed.

$stats [1] 0.0903296 0.1621295 0.1988620 0.2626890 0.3412730

$n [1] 48

$conf [1] 0.1759291 0.2217949

$out [1] 0.438712 0.464125 0.420477

This output has four parts:

- **$stats** These are the locations for the five horizontal lines in the plot: the lower whisker is at 0.0903 on the y-axis, Q1 (the lower edge of the box) is at 0.1621 on the y-axis, the median (the heavy line in the middle of the box) is at 0.1988 on the y-axis, Q3 (the upper edge of the box) is at 0.2626 on the y-axis, and the upper whisker is at 0.3412 on the y-axis.
- **$n** is the number of observations in the vector, or 48 in this case since 48 rock samples were measured.
- **$conf** gives the lower and upper ends of the “notch” that marks an approximate 95% confidence interval around the median, but that statistic is not used in this lab.
- **$out** These are the values of the outliers, and the *rock$shape* vector has three: 0.438712, 0.464125, and 0.420477.

Thus, to find the outliers of a vector all that is needed is the `$out` part of the `boxplot.stats()` output. If there are no outliers then that part reports *numeric(0)* to indicate that there are zero outliers.
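Because `boxplot.stats()` returns a list, the outliers can be extracted on their own with the `$` operator. The sketch below shows this for *rock$shape* and for a small made-up vector that has no outliers; the variable names are illustrative.

```r
# Pull only the outlier values out of the boxplot.stats() result.
shape_outliers <- boxplot.stats(rock$shape)$out
print(shape_outliers)

# A vector with no outliers returns an empty numeric vector, numeric(0).
no_outliers <- boxplot.stats(c(1, 2, 3, 4, 5))$out
print(no_outliers)

# length() turns the result into a simple count of outliers.
print(length(shape_outliers))
```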

### Skill Check: Outliers

Using the *iris* data frame, use `boxplot.stats()` to see if *Sepal.Width* has any outliers.

## Grouped Boxplots

Boxplots become much more useful when more than one data item is plotted side-by-side for comparison. For example, the following boxplots are helpful in determining if there is a difference in automobile weight by the number of cylinders in the engine.

By comparing the three boxplots it is easy to see that the more cylinders an engine has, the more the automobile weighs, since the plots tend to be “higher” as the number of cylinders increases. Also notice that the plot for 8-cylinder cars does not have a visible upper whisker, since the whisker sits at exactly the same value as Q3, but it does include outliers. It is also interesting to note that the whiskers for the three plots overlap, indicating, for example, that some 4-cylinder cars are heavier than some 6-cylinder cars.
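A grouped boxplot like the one described can be sketched with the formula interface, assuming the figure was drawn from the *wt* and *cyl* columns of *mtcars*. The title and axis labels here are illustrative, and the result is stored in `bp` only so the plotted numbers can be inspected.

```r
# Weight (wt, in 1000 lbs) grouped by number of cylinders (cyl).
bp <- boxplot(wt ~ cyl,
              data = mtcars,
              main = "Weight By Cylinders",
              xlab = "Cylinders",
              ylab = "Weight (1000 lbs)")

# boxplot() invisibly returns the numbers it plotted.
print(bp$n)      # observations in each group
print(bp$out)    # outlier values
print(bp$group)  # which box each outlier belongs to
```

Each column of `bp$stats` holds the five values for one box (lower whisker, lower box edge, median, upper box edge, upper whisker), which makes it possible to confirm where a whisker coincides with the edge of its box.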

## Color

*R* permits designers to use a number of different color palettes to make a graph easier to understand. Here are the palettes available in base *R*, as used in this tutorial. *Note: R can use any of the millions of colors available on a computer screen, but the five fundamental palettes shown below are easy to use and will serve analysts well in many projects.*

Adding the “heat” color palette to the grouped boxplot generated earlier in this tutorial makes it easier to read.
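As a sketch, base *R* provides the palette functions `rainbow()`, `heat.colors()`, `terrain.colors()`, `topo.colors()`, and `cm.colors()`; it is an assumption that these are the five palettes the tutorial refers to. The grouped boxplot is assumed to be the weight-by-cylinders plot from *mtcars*, and the title is illustrative.

```r
# Each palette function returns n colors as hexadecimal strings.
print(rainbow(3))
print(heat.colors(3))
print(terrain.colors(3))
print(topo.colors(3))
print(cm.colors(3))

# Apply the "heat" palette to the weight-by-cylinders boxplot,
# one color per group.
boxplot(wt ~ cyl,
        data = mtcars,
        main = "Weight By Cylinders",
        col = heat.colors(3))
```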

When using color with graphics it is important for the researcher to keep two points in mind.

First, it is estimated that about 8% of males and 0.5% of females are unable to distinguish between two or more colors, a condition that is often called “color blindness.”

Second, if the research is ever printed in a black-and-white form then all color information is lost.

For these two reasons, it is probably best to not rely on color alone to provide information to the reader; rather, color should be used to enhance the understanding of a chart without being a sole source of information for that chart.

### Demonstration: Grouped Boxplots With Color

Grouped boxplots are easy to create with *R* and the following script generates three examples.

- Lines 2-5: This is one long command that is broken over several lines to make it easier to read. Note that *R* does not require any sort of special “line continuation” character at the end of each line. As long as the parenthesis opened after the `boxplot` keyword has not been closed, *R* will continue reading that command on the next line.
- Line 2: **Temp ~ Month** This tells *R* to calculate the boxplot for the *Temperature* variable but group those temperatures by *Month*. It is important to remember the order for these two variables. First is the continuous data that should be analyzed and second is the grouping variable.
- Line 3: **data = airquality** In previous tutorials, the data frame name was prepended to the variable name using the `$` operator. However, for simplicity, many *R* functions are designed to enter only the vector names and then specify the data frame later in the function. In this case, the *airquality* data frame is identified as the source for the two variables being plotted.
- Line 4: **main = “Temp By Month”** This is the main title for the boxplot and is automatically printed in large font above the boxplot. *R* has a number of other formatting options available and several will be covered in later labs.
- Line 5: **col = rainbow(8)** This sets the color palette to rainbow and instructs *R* to use eight colors from that palette. It is often useful to experiment with the number of colors requested from the palette since the colors selected will change depending on the number requested and some combinations may be more useful than others.
- Lines 7-11: These are similar to Lines 2-5 but use the *warpbreaks* data frame. Notice that selecting the “heat” color palette is slightly different than the rainbow palette used above.
- Lines 13-18: These lines are similar to Lines 7-11 but use the *chickwts* data frame. A new attribute was specified: *las = 2*. For this boxplot the groups are names of chicken feed rather than numbers or single letters. When those names are printed horizontally they “run into each other” and become unreadable. The *las = 2* specification turns those labels 90° so they do not interfere with each other.
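The script is not reproduced on this page, so the following is a sketch laid out to match the line numbers in the list above. Lines 2-5 follow the description directly; the grouping variable chosen for *warpbreaks*, the titles, and the palette sizes for the second and third plots are assumptions made for illustration.

```r
# Temperature by month
boxplot(Temp ~ Month,
        data = airquality,
        main = "Temp By Month",
        col = rainbow(8))

# Warp breaks by tension (assumed grouping)
boxplot(breaks ~ tension,
        data = warpbreaks,
        main = "Breaks By Tension",
        col = heat.colors(3))

# Chick weight by feed (assumed palette)
boxplot(weight ~ feed,
        data = chickwts,
        main = "Weight By Feed",
        col = terrain.colors(6),
        las = 2)
```

The heat palette is requested with `heat.colors()` rather than a function named after the palette alone, which is the small difference from `rainbow()` noted in the list.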

This demonstration contains three different examples of generating grouped boxplots with color.

### Skill Check: Grouped Boxplot With Color

Using the *morley* data frame, generate a grouped boxplot where *Speed* is grouped by *Expt*. Set the *main* title to “Morley Experiment” and the color to *rainbow(5)*.

## Next

This tutorial created a visualization for various central measures and measures of dispersion. The next tutorial is still being built and is not yet available.