## Introduction

Researchers often begin a project with a hypothesis and then gather data to see if that hypothesis is supported, in a process commonly called the *scientific method*. Ordinal data, and numeric data that are not normally distributed, are analyzed using nonparametric techniques, and two of the most commonly used tests are described in this lab. (Note: the concept of “hypothesis” is discussed in Parametric Hypothesis Testing.)

## Kruskal-Wallis H

This test is used to determine if there are any significant differences among three or more groups of data that are not normally distributed, often because the data are ordinal or skewed. Imagine that a researcher wanted to determine if there was a difference in smoking habits by age group. The subjects were interviewed and asked how many packs they smoked per week. These data were skewed to the right since a few subjects smoked heavily but most were non-smokers or only smoked a few packs per week. The subjects were also divided into age groups: <20, 20-29, 30-39, 40-49, >49. The researcher would then use a *Kruskal-Wallis H* test to see if there was a significant difference in smoking habits by age group since the dependent variable (packs smoked) was not normally distributed.

### Demonstration: Kruskal-Wallis H

The *R* `kruskal.test` function requires the two variables being compared to be input in the form of *y ~ x*, where *y* is the dependent variable (measured outcomes) and *x* is the independent variable (the groups used to divide the measured outcomes). The data source is also specified with a *data =* parameter. In the case of the smoking example mentioned in the previous section, the number of packs smoked would be the dependent variable, the measured outcome, while the age group would be the independent variable.
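The formula interface can be sketched for the smoking example as follows. The `smoking` data frame, its column names, and its values are hypothetical and simulated only for illustration; they are not real survey results.

```r
# Hypothetical data: simulated only to illustrate the y ~ x formula interface
set.seed(42)
smoking <- data.frame(
  age_group = factor(rep(c("<20", "20-29", "30-39", "40-49", ">49"), each = 20)),
  packs     = round(rexp(100, rate = 0.5))  # right-skewed counts of packs per week
)

# Dependent variable (packs) on the left, grouping variable (age_group) on the right
smoking_result <- kruskal.test(packs ~ age_group, data = smoking)
smoking_result
```

With five age groups, the test reports four degrees of freedom (one fewer than the number of groups).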

The following one-line script runs `kruskal.test` on the *airquality* data frame. The amount of ozone in the air was measured every day for several months. To see if there is any significant difference between the months, a *Kruskal-Wallis H* would be calculated where *Ozone* is the measured outcome and *Month* is the grouping variable.
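A minimal version of that call, assuming the built-in *airquality* data frame that ships with R, might look like this:

```r
# Ozone is the measured outcome, Month is the grouping variable
kw_result <- kruskal.test(Ozone ~ Month, data = airquality)
kw_result
```

Storing the result in a variable (here the illustrative name `kw_result`) is optional; calling `kruskal.test` directly prints the same report.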

The `kruskal.test` function returns a lot of information, most of which is beyond the scope of this lab:

```
Kruskal-Wallis rank sum test
data: Ozone by Month
Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06
```

While the `kruskal.test` function returns information that would be useful in a more thorough statistical analysis, this lab is only concerned with the p-value, 6.901e-06, which is found at the end of Line 3. Because this is less than 0.05 (5%) it would be considered a significant result. Thus, the null hypothesis (that there was no difference in Ozone by Month) would be rejected. Notice that this test does not indicate which month had the greatest ozone reading or which months differ from one another, just that there is a significant difference between the months.
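When only the p-value is needed, it can be pulled out of the result object rather than read from the printed report. The sketch below assumes the standard `p.value` element that R's test functions return; the variable names are illustrative.

```r
# Run the test and keep the result object
ozone_test <- kruskal.test(Ozone ~ Month, data = airquality)

# Extract the p-value as a plain number
ozone_p <- ozone_test$p.value

# Compare against the 5% significance level used in this lab
ozone_significant <- ozone_p < 0.05
ozone_significant
```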

### Skill Check: Kruskal-Wallis H

Using the *ChickWeight* data frame, calculate a *Kruskal-Wallis H* for the *weight* output when grouped by *Diet*.

## Mann-Whitney U

This test is used to determine if there are any significant differences between two groups of data that are not normally distributed, often because the data are ordinal or skewed. Imagine that a movie producer wanted to know if there was a difference in the way the audiences in two different cities responded to a movie. The null hypothesis ($H_0$) is “There is no difference in movie-goers’ opinions between these two cities.” The alternative hypothesis ($H_a$) is “Movie-goers’ opinions are significantly different by city.” As the audience members left the theater they would be asked to rate the movie on a scale of one to five stars. The ratings for the two cities would be collected and then a *Mann-Whitney U* test would be used to determine if the difference in ratings between the cities was significant.

### Demonstration: Mann-Whitney U

*R* uses the `wilcox.test` function for several different types of nonparametric tests. It will automatically compute a *Mann-Whitney U* test when the dependent variable is numeric and the independent variable is binary (that is, only two levels). The `wilcox.test` function requires the two variables being compared to be input in the form of *y ~ x*, where *y* is the dependent variable (measured outcomes) and *x* is the independent variable (the two groups used to divide the measured outcomes). The data source is also specified with a *data =* parameter. In the case of the movie analysis example mentioned in the previous section, the movie rating would be the dependent variable, the measured outcome, while the city would be the independent, or grouping, variable.
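The movie example could be set up along these lines. The `ratings` data frame, the city labels, and the star values are hypothetical and randomly generated for illustration only.

```r
# Hypothetical data: random star ratings from two made-up cities
set.seed(1)
ratings <- data.frame(
  city  = factor(rep(c("City A", "City B"), each = 30)),
  stars = sample(1:5, size = 60, replace = TRUE)
)

# Star ratings repeat heavily, so exact = FALSE requests the
# normal approximation directly instead of an exact p-value
movie_result <- wilcox.test(stars ~ city, data = ratings, exact = FALSE)
movie_result
```

Because *city* has exactly two levels, `wilcox.test` treats this as a two-sample (Mann-Whitney) comparison.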

The following one-line script generates a *Mann-Whitney U* from the *CO2* data frame. The amount of CO2 uptake of a group of plants was measured while chilled or not chilled. To see if there is any significant difference between the cold tolerance of those plants, a *Mann-Whitney U* test would be calculated where the *uptake* is the measured outcome and the *Treatment* is the grouping variable.
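A minimal version of that call, assuming the built-in *CO2* data frame that ships with R, might look like this:

```r
# uptake is the measured outcome, Treatment (nonchilled vs. chilled) is the grouping variable
mw_result <- wilcox.test(uptake ~ Treatment, data = CO2)
mw_result
```

The variable name `mw_result` is illustrative; calling `wilcox.test` directly prints the same report.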

The `wilcox.test` function returns a lot of information, most of which is beyond the scope of this lab:

```
Warning message: cannot compute exact p-value with ties
Wilcoxon rank sum test with continuity correction
data: uptake by Treatment
W = 1187.5, p-value = 0.006358
alternative hypothesis: true location shift is not equal to 0
```

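The warning message appears because some of the *uptake* values are repeated (“tied”), which is expected when many plants are measured. With ties present, *R* cannot compute an exact p-value and reports an estimate based on a normal approximation instead. The estimated p-value is adequate for most research projects.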
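If the warning is distracting, one option is to request the approximation explicitly with the `exact = FALSE` argument of `wilcox.test`. This is a sketch of that idea rather than a required step; the variable names are illustrative.

```r
# Default call: R tries for an exact p-value, warns about ties,
# and falls back to the normal approximation
default_test <- suppressWarnings(wilcox.test(uptake ~ Treatment, data = CO2))

# Asking for the approximation up front avoids the warning
approx_test <- wilcox.test(uptake ~ Treatment, data = CO2, exact = FALSE)

# Both calls report the same p-value
same_p <- isTRUE(all.equal(default_test$p.value, approx_test$p.value))
same_p
```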

While the `wilcox.test` function returns information that would be useful in a more thorough statistical analysis, this lab is only concerned with the p-value, 0.006358, which is found at the end of Line 4. Because this is less than 0.05 (5%) it would be considered a significant result, so the null hypothesis (that there was no difference in uptake by Treatment) would be rejected.

### Skill Check: Mann-Whitney U

Using the *mtcars* data frame, calculate a *Mann-Whitney U* for the *disp* output when grouped by *am*.

## Next

This tutorial explored hypothesis testing using *nonparametric* methods. Students who are interested in using *R* for their own research should begin to explore this important software on their own. It is available as a download without charge and can be used on *Windows*, *Mac*, or *Linux* systems. It is also recommended that RStudio, a free Integrated Development Environment (IDE), be used for *R* projects.