Welcome to R!
Statistical analysis is at the core of nearly all research projects, and researchers have a wide variety of statistical tools to choose from, such as SPSS and SAS. Unfortunately, many of these tools are expensive or difficult to master, so this lab manual introduces R, a powerful, open source statistical analysis program that is available free of charge. Before diving into a statistics package, though, one fundamental topic must be covered: data.
Types of Data
There are two main types of data and it is important to understand the difference between them since that determines appropriate analytical tests.
Continuous data are integer or decimal numbers and are typically used for counts or measures, like a person’s weight, a tree’s height, or a car’s speed. Continuous data are measured on scales that have equal divisions, so the difference between any two values can be calculated. Because statistics like means and standard deviations can be calculated for continuous data, they are analyzed using parametric tests.
Categorical data group observations into a limited number of categories; for example, type of pet (cat, dog, bird, etc.) or place of residence (Arizona, California, etc.). One common type of categorical data is generated with an “agree-disagree” type of scale, like “I enjoy reading: Strongly Agree : Agree : Neutral : Disagree : Strongly Disagree.” Because means and standard deviations are not meaningful for categorical data, they are analyzed using nonparametric tests.
The R Command Line
All R commands are entered from a “Command Line” environment. Many students find this a bit challenging at first, but once they learn some foundational concepts the command line becomes easy and fast to use. What follows is a line-by-line explanation of the R script in the box below.
Demonstration: The R Command Line
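A script matching the line-by-line notes that follow might look like the sketch below. The values assigned to MaxScore, MinScore, and TestScores, and the wording of the comments, are assumptions made for illustration; only the layout follows the notes.

```r
# Simple arithmetic at the command line
3 + 5
5 + 8 * 2

# Store values in variables (these scores are assumed for illustration)
MaxScore <- 100
MinScore <- 55
Range <- MaxScore - MinScore
Range

# Combine six numbers into one variable (assumed values)
TestScores <- c(88, 92, 75, 64, 97, 81)
TestScores
```

Note that line 3 evaluates to 21 rather than 26, because R performs multiplication before addition.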
Line 1: This is a comment that is used to record notes in a script. In R, all comments start with a hash-mark (#) and everything after that symbol is ignored. Comments are used frequently in scripts presented in this manual in order to explain what the script is doing. Good programmers comment liberally so team members can easily figure out what they did.
Line 2: Calculate the value of 3+5.
Line 3: Calculate the value of “5 + 8 * 2”.
Lines 6-7: These lines create two variables, MaxScore and MinScore, and then assign values to them. You should note two important things about these lines. First, the “assignment” operator is a less-than sign followed by a hyphen, making a left-pointing arrow like <-. That tells R to store the number on the right side of the arrow into the variable named on the left side. Second, keep in mind that capitalization matters in R. Thus, the variable named MaxScore would be different from a variable named maxscore. These lines only store values in variables, so nothing gets printed to the screen.
Line 8: The variable Range is filled with the result of subtracting MinScore from MaxScore.
Line 9: Entering a variable name, like Range, on a line by itself causes the value stored in that variable to be displayed.
Line 12: In R, a list of numbers can be stored in a single variable by using the “combine” function, which is written as a c followed by the numbers, separated by commas, inside parentheses. This line creates a variable called TestScores and then stores a list of six numbers in that variable.
Line 13: The contents of the variable TestScores are printed to the screen.
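The point about capitalization in the notes for lines 6-7 can be checked directly. In this short sketch the two values are assumed, chosen only to show that R treats the two names as separate variables.

```r
# Assumed values, used only to show that the two names are separate variables
MaxScore <- 100
maxscore <- 10

MaxScore              # displays the value stored under the capitalized name
maxscore              # displays the value stored under the lowercase name
MaxScore == maxscore  # compares the two values
```

Because the names differ in capitalization, each one keeps its own value, and the comparison on the last line reports FALSE.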
Skill Check: The R Command Line
Now is the time to try some of the command line skills demonstrated above. In the following R codebox, calculate these values.
Set the variable M equal to 15 + ( 37 * 2 )
Set the variable N equal to 24 + ( 2 ^ 6 )
Set the variable P equal to 15 - ( sqrt(9) )
In the second line, the ^ symbol means “raise to the power of,” so that line reads “24 plus the value of 2 raised to the 6th power.”
In the third line, the sqrt() function calculates the square root of the number in the parentheses, or the square root of 9 in this example.
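Before working through the skill check, it may help to try the ^ operator and the sqrt() function on other numbers. The sketch below uses values and variable names (A, B, C) chosen purely for illustration; they are not the skill check answers.

```r
# Illustrative values only; these are not the skill check answers
A <- 2 ^ 3            # 2 raised to the 3rd power
B <- sqrt(16)         # square root of 16
C <- 10 + ( 2 ^ 3 )   # the power is calculated first, then added to 10

A
B
C
```

Entering A, B, and C on lines by themselves displays 8, 4, and 18.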
Data Frames
A data frame is a table of data, such as the data generated during a research project, in which each row is one observation and each column is one variable. An example data frame that is easy to understand would be a spreadsheet that contains the times recorded for a race. R comes configured with 103 built-in data frames used for training, and the R script below is an introduction to one of the data frames used in several of the labs in this manual: mtcars.
Demonstration: Data Frames
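A script consistent with the line-by-line notes that follow might look like this sketch. The wording of the comment on line 1 is an assumption; the commands on lines 2-5 follow from the notes, and mtcars is included with R, so nothing needs to be loaded first.

```r
# Explore the built-in mtcars data frame
mtcars
str(mtcars)
max(mtcars$mpg)
min(mtcars$mpg)
```

For the standard mtcars data, the last two lines display 33.9 and 10.4.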
Line 2: Entering the name of the data frame, mtcars, on a line by itself causes R to print the contents of the entire data frame to the screen. Since mtcars is rather small it is fine to print it to the screen, but some data frames have hundreds of lines, and that may cause the screen to “scroll” for some time before the end of the data frame is reached.
Line 3: This prints the structure of the mtcars data frame. The result shows that this is a data.frame with 32 observations (one for each car in the dataset) of 11 variables (things like mpg). The structure command also displays the type of data in each variable; in this case, all 11 variables are of the “num” (numeric) type. The str function is frequently used to better understand a data frame.
Line 4: This line prints the maximum mpg value for the mtcars data frame. Note that the specific variable desired is indicated by both the data frame name and the variable, separated by a dollar sign, like mtcars$mpg on this line.
Line 5: This prints the minimum mpg value.
If some of the lines in the result are too long to fit on one row in the R Console they will wrap around.
Skill Check: Data Frames
In the following R codebox, explore the airquality data frame.
Determine the structure of the airquality data frame
Set MaxWind to the maximum value of Wind
Set MaxTemp to the maximum value of Temp
Set MinOzone to the minimum value of Ozone
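One caveat applies to the last task: the Ozone variable in airquality contains missing values, which R records as NA. The sketch below shows how min() behaves in that case. It relies on the na.rm argument, which is part of base R's min() and max() but is not covered elsewhere in this tutorial.

```r
# airquality ships with R; its Ozone column contains missing values (NA)
sum(is.na(airquality$Ozone))   # counts how many Ozone values are missing

# With missing values present, min() returns NA
min(airquality$Ozone)

# na.rm = TRUE tells min() to drop the missing values first
MinOzone <- min(airquality$Ozone, na.rm = TRUE)
MinOzone
```

Wind and Temp have no missing values, so max() works on them without the extra argument.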
This tutorial introduced R, basic arithmetic calculations, variables, and data frames. The next tutorial explores several commonly used measures of central tendency and how to calculate them.