Categorical data items are normally reported in frequency tables and crosstabs, where the counts for a particular item are displayed. The only difference between these two types of tables is the number of dimensions they display: a frequency table displays a single variable, while a crosstab displays two variables. Both types of tables are commonly used to display polling data during the run-up to an election and would list things like the number of voters who would support some proposition (frequency table) or that same data broken out by party affiliation, sex, age, or some other category (crosstab). This lab explores both types of table.
A frequency table is a one-dimensional table that lists a count of the number of times that some categorical data item appears in a vector. As an example, consider the following table which lists the number of cars for each number of cylinders in the mtcars data frame.
##
##  4  6  8
## 11  7 14
This table shows that 11 cars in the data frame had 4 cylinders, 7 had 6 cylinders, and 14 had 8 cylinders.
Frequency tables are only useful for categorical data-type items. To illustrate why this is true, imagine creating a survey for all of the students at the University of Arizona and including “age” (continuous-type data) as one of the survey questions. A frequency table for the ages of the respondents could have more than 65 columns, since student ages would range from about 15 to more than 80 and each column would report the number of students of that particular age. While R could create a frequency table that large, it would have so many columns that it would be virtually unusable. Normally, if continuous-type data need to be displayed in a table, the data are grouped in some way, like ages 15-19, 20-24, etc., so there is a manageable number of group counts to display.
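The grouping idea described above can be sketched with the base R cut function. This is only an illustration: the ages vector is hypothetical (it is not survey data), and the 5-year bin width is an assumption chosen to match the 15-19, 20-24 example.

```r
# Hypothetical ages for illustration only (not real survey data)
ages <- c(17, 18, 18, 19, 20, 21, 21, 22, 24, 25, 31, 47)

# Group ages into 5-year bins: [15,20), [20,25), ... up to [45,50)
age_groups <- cut(ages, breaks = seq(15, 50, by = 5), right = FALSE)

# Frequency table of the grouped ages (one column per bin)
age_table <- table(age_groups)
print(age_table)
```

The table now has seven columns, one per age group, rather than one column for every distinct age.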
Demonstration: Frequency Tables
The following script creates a frequency table.
- Line 2: Create a simple frequency table listing the number of cars by the forward gears.
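The script itself is not reproduced in this text, so the following is a minimal sketch of what such a script might look like, assuming the built-in mtcars data frame. The variable name gear_table is illustrative only.

```r
# Frequency table: number of cars by forward gears
gear_table <- table(mtcars$gear)

# Display the counts for 3, 4, and 5 forward gears
print(gear_table)
```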
Skill Check: Frequency Tables
Using the chickwts data frame, create a frequency table for feed.
It is often useful to include the total number of items counted in a frequency table, and that is provided by the addmargins function. As an example, consider the following table, which is the same frequency table shown above but includes the total number of cars in the data frame.
##
##   4   6   8 Sum
##  11   7  14  32
The following script creates a frequency table with margins.
- Line 2: Create a frequency table with margins listing the number of cars by the forward gears.
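As a sketch of such a script (the original is not reproduced here), the frequency table can be wrapped in addmargins. The variable name gear_margins is an assumption made for illustration.

```r
# Frequency table with margins: cars by forward gears, plus a Sum column
gear_margins <- addmargins(table(mtcars$gear))

# The final column, labeled Sum, holds the total number of cars
print(gear_margins)
```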
Skill Check: Margins
Using the chickwts data frame, create a frequency table with margins for feed.
Proportion Tables (Proptables)
Occasionally, researchers prefer to present percentages rather than raw numbers since those are easier to quickly interpret. Here is a proportion table of the number of cylinders for cars in the mtcars data frame.
prop.table(table(mtcars$cyl)) * 100
##
##      4      6      8
## 34.375 21.875 43.750
Thus, about 44% of the cars have eight cylinders.
Demonstration: Proportion Tables (Proptables)
The following script creates a proportion table with the results presented in percent format and rounded to two places.
- Line 2: Create a simple proportion table listing the number of cars by the forward gears.
- Line 5: Multiply all values in the proportion table by 100 to make it a percentage.
This demonstration contains two different examples of generating a proportion table.
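The two examples might look something like the sketch below, which assumes the mtcars data frame and uses illustrative variable names. The first produces proportions between 0 and 1; the second multiplies by 100 and rounds to two places.

```r
# Example 1: proportion table of cars by forward gears
gear_props <- prop.table(table(mtcars$gear))

# Example 2: the same table as percentages, rounded to two places
gear_pct <- round(prop.table(table(mtcars$gear)) * 100, 2)

# Display both tables
print(gear_props)
print(gear_pct)
```

The proportions in the first table sum to 1, while the percentages in the second sum to approximately 100 (rounding can shift the total slightly).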
Skill Check: Proportion Tables (Proptables)
Using the chickwts data frame, create a proportion table for feed. The proportions should be multiplied by 100.
A crosstab (sometimes called a contingency table or pivot table) is a table of frequencies used to display the relationship between two nominal or ordinal variables. As an example of a crosstab, consider a table listing mtcars by the number of forward gears and the number of cylinders.
##     cyl
## gear  4  6  8
##    3  1  2 12
##    4  8  4  0
##    5  2  1  2
In this case, one car had a four cylinder engine and three forward gears while 12 cars had eight cylinders and three forward gears. By using a crosstab, a researcher can determine the frequency of some incident (number of cars) by two different criteria (gears and cylinders).
Here is a second example from the esoph data frame.
##        ncases
## agegp    0  1  2  3  4  5  6  8  9 17
##   25-34 14  1  0  0  0  0  0  0  0  0
##   35-44 10  2  2  1  0  0  0  0  0  0
##   45-54  3  2  2  2  3  2  2  0  0  0
##   55-64  0  0  2  4  3  2  2  1  2  0
##   65-74  1  4  2  2  2  2  1  0  0  1
##   75+    1  7  3  0  0  0  0  0  0  0
Notice that there were few cases of esophageal cancer among people under the age of 45 but the number of cases increased between the ages of 45 and 74, with a peak in the 55-64 age group.
The following script demonstrates how to create crosstabs.
- Line 2: The xtabs command (for “Cross Tabs”) creates a crosstab for the variables entered. It is important to notice the tilde character in this function. In many R commands the tilde is used to separate two parts of a formula. The part before the tilde is the data values to be acted upon and the part after it is the grouping vectors. In Line 2, there are no data values specified, so R will simply count the number of times the various groups show up. The “row” group is listed first and then the “column” group. For example, “gear 3 - cylinder 8” appears 12 times in the data frame.
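A minimal sketch of such a crosstab script follows. The original script is not reproduced here, so this version is an assumption: it passes data = mtcars rather than prefixing each variable with mtcars$, and the variable name gear_cyl is illustrative.

```r
# Crosstab of forward gears (rows) by cylinders (columns)
gear_cyl <- xtabs(~ gear + cyl, data = mtcars)
print(gear_cyl)

# Look up a single cell: 3 forward gears and 8 cylinders
print(gear_cyl["3", "8"])
```

Because nothing appears to the left of the tilde, each cell is simply a count of the cars in that gear and cylinder combination.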
Skill Check: Crosstabs
Using the esoph data frame, create a crosstabs table for tobgp (tobacco group) and alcgp (alcohol group).
Crosstabs can contain more than two dimensions. As an example, consider an experiment with pea plants where the amount of nitrogen, phosphorus, and potassium was varied to see what would happen to the crop yield. The npk data frame contains the result of that experiment and this crosstab displays that result in a multidimensional table. Notice that the row group is listed first, the column group is second, and the block group is third.
## , , K = 0
##
##    P
## N   0 1
##   0 3 3
##   1 3 3
##
## , , K = 1
##
##    P
## N   0 1
##   0 3 3
##   1 3 3
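This table shows counts rather than yields: each combination of nitrogen (N), phosphorus (P), and potassium (K) was applied to 3 plots. When yield is placed on the left side of the tilde, xtabs sums the yield within each cell instead; in that case, when nitrogen (N) is 0, phosphorus (P) is 0, and potassium (K) is 0, the total crop yield across those plots is 154.3 pounds.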
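The difference between counting plots and totaling yield can be sketched as follows, using the built-in npk data frame. The variable names are illustrative, and the second call relies on the documented behavior of xtabs, which sums the left-hand variable within each cell.

```r
# Counts only: nothing on the left side of the tilde
npk_counts <- xtabs(~ N + P + K, data = npk)

# Yield totals: xtabs sums the left-hand variable within each cell
npk_yield <- xtabs(yield ~ N + P + K, data = npk)

# Compare the N = 0, P = 0, K = 0 cell in each table
print(npk_counts["0", "0", "0"])
print(npk_yield["0", "0", "0"])
```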
As a second example, consider the mtcars data frame. A researcher wanted to know if there is any relationship between the number of forward gears, the number of cylinders in the engine (that is, the size of the engine), and whether the car had a manual or automatic transmission. Here is the crosstab that was created.
## , , am = 0
##
##    gear
## cyl  3  4  5
##   4  1  2  0
##   6  2  2  0
##   8 12  0  0
##
## , , am = 1
##
##    gear
## cyl  3  4  5
##   4  0  6  2
##   6  0  2  1
##   8  0  0  2
When am is 0 (which is the code for an automatic transmission) there were no cars with five forward gears and when am is 1 (manual transmission) there were no cars with three forward gears.
Demonstration: Multi-dimensional Crosstabs
The following script demonstrates how to create multi-dimensional crosstabs.
- Line 2: The xtabs command creates a crosstab for the variables entered. In this case, there is nothing on the left side of the tilde so R will return a count of the various categories. Since there are three grouping variables, gear, cyl, and am, R will count instances for all three variables. For example, “gear 4” and “cylinder 4” appear together 2 times when the transmission is 0 and 6 times when the transmission is 1. Note that a data = mtcars parameter is also specified so the word mtcars$ does not need to be used for each variable. This makes the formula easier to read.
- Line 5: This is just another example of multi-dimensional crosstab.
This demonstration contains two different examples of generating a multi-dimensional crosstab.
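One of the examples might be sketched as below; the original script is not reproduced here, so the exact form is an assumption and the variable name gear_cyl_am is illustrative. The three grouping variables are listed in row, column, block order.

```r
# Three-way crosstab: gears (rows) by cylinders (columns),
# with one block per transmission type (am)
gear_cyl_am <- xtabs(~ gear + cyl + am, data = mtcars)
print(gear_cyl_am)

# Cars with 4 gears and 4 cylinders, for automatic (0) and manual (1)
print(gear_cyl_am["4", "4", "0"])
print(gear_cyl_am["4", "4", "1"])
```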
Skill Check: Multi-dimensional Crosstabs
Using the esoph data frame, create a crosstabs table for tobgp (tobacco group), alcgp (alcohol group), and agegp (age group). To make the R command easier to read, specify the data frame as a “data = esoph” instead of using the “esoph$…” format.
Each of the crosstabs presented so far has displayed only counts of data; however, by using the aggregate function, crosstabs can also display calculated values, like the mean temperature for each summer month in the airquality data frame.
aggregate(Temp ~ Month, data = airquality, FUN = mean)
##   Month     Temp
## 1     5 65.54839
## 2     6 79.10000
## 3     7 83.90323
## 4     8 83.96774
## 5     9 76.90000
Though the above table has a number of calculated values the R command is only one line long.
As another example, the mean horsepower for automobiles can be calculated by the number of forward gears and engine cylinders from the mtcars data frame.
aggregate(hp ~ gear + cyl, data = mtcars, FUN = mean)
##   gear cyl       hp
## 1    3   4  97.0000
## 2    4   4  76.0000
## 3    5   4 102.0000
## 4    3   6 107.5000
## 5    4   6 116.5000
## 6    5   6 175.0000
## 7    3   8 194.1667
## 8    5   8 299.5000
Demonstration: Calculated Crosstab
The following script demonstrates how to create a calculated crosstab.
- Line 2: This line uses the aggregate command to calculate the mean displacement for each category of cylinders. Notice that this command has a tilde formula where displacement will be aggregated for each category of cylinders. The value to be calculated comes first (displacement) and the grouping variable comes second (cylinder). Next is the name of the dataset used. Finally, the statistical function is listed as FUN = mean. Note that the keyword FUN is in capital letters. This aggregate function determines that four-cylinder cars have a mean displacement of just over 105 cubic inches.
- Line 5: This line also calculates the mean displacement but groups the result by both number of cylinders and number of forward gears, creating a multi-dimensional calculated crosstab. It calculates that the mean displacement for four cylinder cars with three forward gears is just over 120 cubic inches.
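The two lines described above can be sketched as follows, assuming the displacement column disp in mtcars. The variable names are illustrative and the original script is not reproduced here.

```r
# Mean displacement for each number of cylinders
disp_by_cyl <- aggregate(disp ~ cyl, data = mtcars, FUN = mean)

# Mean displacement grouped by both cylinders and forward gears
disp_by_cyl_gear <- aggregate(disp ~ cyl + gear, data = mtcars, FUN = mean)

# Display both results
print(disp_by_cyl)
print(disp_by_cyl_gear)
```

Only combinations that actually occur in the data appear in the second result, so it has fewer rows than the nine possible cylinder and gear pairings.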
Skill Check: Calculated Crosstabs
Using the npk data frame, create a calculated crosstab for N (nitrogen), P (phosphate), and K (potassium). Aggregate the mean yield for all three vectors. Notice that the vector names are capital letters.
In order to facilitate understanding, R makes it easy for researchers to round the results of calculations to whatever level is desired. It is important to note that even if the displayed value is rounded, R still uses the full decimal number for calculations. The R command to round a number is round(), where the number to be rounded is listed first in the parentheses and the number of decimal places is listed second. So round(1.3498, 2) would be rounded to 1.35 and round(1.3498, 1) would be rounded to 1.3. The number to be rounded can be a calculated value rather than just a plain number.
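A short sketch of these calls follows. The variable names are illustrative, and the mean of mtcars$mpg is used only as a convenient example of a calculated value.

```r
x <- 1.3498

# Round to two decimal places and to one decimal place
two_places <- round(x, 2)   # 1.35
one_place <- round(x, 1)    # 1.3

# A calculated value can be rounded directly
mean_mpg <- round(mean(mtcars$mpg), 2)

# x itself is unchanged; round returns a new, rounded value
print(c(x, two_places, one_place, mean_mpg))
```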
As an example, here is the same aggregate command used in the previous example, but with the results rounded. Notice how the aggregate command is the same, but it is wrapped in a round function to round off the calculations. In the first example the result is rounded to three decimal places and in the second example to two decimal places.
This demonstration contains two different examples of rounding.
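The rounded versions might look like the sketch below, which assumes the same displacement examples as the previous demonstration and uses illustrative variable names. Wrapping the whole aggregate result in round works here because every column in the result is numeric.

```r
# Example 1: mean displacement by cylinders, rounded to three places
disp_r3 <- round(aggregate(disp ~ cyl, data = mtcars, FUN = mean), 3)

# Example 2: mean displacement by cylinders and gears, rounded to two places
disp_r2 <- round(aggregate(disp ~ cyl + gear, data = mtcars, FUN = mean), 2)

# Display both rounded results
print(disp_r3)
print(disp_r2)
```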
Skill Check: Rounding
Using the infert data frame, create a calculated crosstab for case, induced, and spontaneous. Aggregate the mean age for all three vectors. Round the results to two decimal places.