One of the main preoccupations of data researchers is finding some sort of correlation, or relationship, between two or more attributes of a data frame. The calculated correlation is a number between -1.0 and +1.0: values close to 0.0 indicate a very weak relationship, while values near either extreme indicate a strong one. The type of correlation that can be calculated depends on the type of data being compared, since the process for categorical data is fundamentally different from that for continuous data. R provides several useful methods for calculating correlations, and researchers need to be familiar with each of them.
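As a minimal sketch of how the choice of method matters, base R's `cor()` accepts a `method` argument of `"pearson"` (the default), `"spearman"`, or `"kendall"`. The vectors below are made up for illustration and are not taken from any real data set. Purely nominal categories call for other tools that are not shown here.

```r
# Toy vectors (made up for illustration): y grows with x, but not linearly.
x <- 1:10
y <- x^3

# Pearson (the default) measures linear association between continuous values.
r_pearson <- cor(x, y, method = "pearson")

# Spearman and Kendall work on ranks, so they suit ordinal data and
# relationships that are monotonic but not linear.
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
```

Because `y` always rises when `x` rises, the two rank-based methods report a perfect correlation, while Pearson reports a value somewhat below 1 because the relationship is curved.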
For this blog post, the Cochise College IPEDS Peer data frame will be used. That data frame was first seen in Introduction to Our IPEDS Peers. It includes 113 attributes for 29 colleges, and it is natural to wonder if any of those attributes are related to each other. For example, it is reasonable to assume that the revenue coming into a college is correlated with the expense of operating the college; that is, as one increases the other should also increase. In fact, the correlation between those two attributes is 0.963, which is extremely high.
However, there may be many other unexpected correlations lurking in that data frame. It would be possible to calculate the correlation between each pair of attributes one at a time, but that would be very time-consuming (and, frankly, mind-numbing). Fortunately, R creates a correlation matrix when more than two variables are compared at one time. As an example, I extracted information about the price of attending each of the colleges in the peer group and compared those prices to the enrollments to try to determine if there was a relationship between the price of attending and the number of students enrolling.
|Abbreviation|Description|
|---|---|
|Pri1|Price for In-State Students Living Off-Campus Living Alone|
|Pri2|Price for In-State Students Living Off-Campus Living With Family|
|FT|Number of Full-Time Students|
|PT|Number of Part-Time Students|
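A correlation matrix of this kind can be sketched with base R's `cor()`, which returns a matrix when it is given a data frame of numeric columns. The `peers` data frame below is a synthetic stand-in filled with random numbers, so its correlations will not match the figures quoted in this post; only the column abbreviations are borrowed from the table above.

```r
# Synthetic stand-in for the peer data frame: 29 rows of made-up values.
# Column names mirror the abbreviations in the table above.
set.seed(42)
n <- 29
peers <- data.frame(
  Pri1 = round(rnorm(n, mean = 18000, sd = 2500)),
  Pri2 = round(rnorm(n, mean = 9000,  sd = 1500)),
  FT   = round(runif(n, min = 500,  max = 5000)),
  PT   = round(runif(n, min = 1000, max = 9000))
)

# Passing a whole data frame of numeric columns to cor() returns a matrix
# with one row and one column per attribute.
cor_matrix <- cor(peers, use = "pairwise.complete.obs")
print(round(cor_matrix, 3))
```

The `use = "pairwise.complete.obs"` argument is a design choice for data with missing values: each pair of columns is compared using only the rows where both values are present.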
From the above matrix, I can tell that there is a very strong correlation between enrollment and the number of part-time students (0.968), which suggests that much of the enrollment is driven by part-time students. I also notice a moderate negative correlation between tuition and the number of part-time students (-0.326). This indicates that colleges with higher tuition tend to have fewer part-time students.
All of this is fine and easy enough to do for just the six attributes in the above table, but what if I wanted to analyze more attributes, maybe a dozen at one time? The table full of numbers would quickly become overwhelming, but R provides a great tool for visually evaluating a correlation matrix: the correlogram. The following illustration is a correlogram for 21 enrollment-related attributes from the peers’ data frame. The attribute names are abbreviated along the main diagonal, and the upper and lower triangles present the correlations using two different methods.
In the lower triangle, the correlations are shown using a color code. The brighter the blue, the greater the positive correlation; the brighter the red, the greater the negative correlation. I can then look for bright colors to quickly isolate correlations of interest. To read the correlogram using the colored squares in the lower triangle, follow a square's column up to the diagonal to find one of the factors being correlated, and then follow its row right to the diagonal to find the other factor. The same type of process can be used with the numeric data in the upper triangle by following a row across to the left and a column down to find the two factors that are correlated for that number. For example, near the top of the diagonal, the value 0.44 is found. This is a reasonably high positive correlation, so it is shown in bright blue. From that number, the factor to the left is “Ttl” and the factor down is “FT”, so this is the correlation between total enrollment and the number of full-time students. This, by the way, is also what was found in the correlation matrix presented above.
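A correlogram with this layout (shaded squares below the diagonal, numbers above it, names on the diagonal) can be sketched with the `corrgram` package from CRAN. This post does not say which package produced the illustration, so treating it as `corrgram` is an assumption. The `enroll` data frame is synthetic, and its column names are borrowed from the abbreviations mentioned in the text.

```r
# Synthetic enrollment-style data (made-up values, illustrative column names).
set.seed(1)
n  <- 29
FT <- round(runif(n, 500, 5000))
PT <- round(runif(n, 1000, 9000))
enroll <- data.frame(
  Ttl   = FT + PT,                        # total enrollment
  FT    = FT,                             # full-time students
  PT    = PT,                             # part-time students
  NoDeg = round(PT * runif(n, 0.2, 0.6))  # students pursuing no degree
)

# corrgram is a CRAN package, not part of base R, so check for it first.
if (requireNamespace("corrgram", quietly = TRUE)) {
  corrgram::corrgram(
    enroll,
    lower.panel = corrgram::panel.shade,  # shaded squares: blue positive, red negative
    upper.panel = corrgram::panel.cor,    # numeric correlation values
    text.panel  = corrgram::panel.txt,    # attribute names on the diagonal
    main = "Correlogram of synthetic enrollment data"
  )
} else {
  message("Install the corrgram package to draw the plot.")
}
```

Other packages can draw similar plots; the panel functions chosen here are simply the ones that most closely resemble the layout described above.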
Using a correlogram makes it easy to quickly find correlations of interest in even a fairly large data frame. As just one last example, I noticed that the number of students pursuing no degree (abbreviated as “NoDeg” in the diagonal) correlates very strongly with the number of part-time students, suggesting that there is some sort of relationship between attending part-time and not declaring a major.
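As a complement to scanning the plot by eye, the strongest pairs can also be pulled out of a correlation matrix with a few lines of base R. This is only a sketch: the `wide` data frame is synthetic, its column names are placeholders, and the 0.8 cutoff is an arbitrary choice made for illustration.

```r
# Synthetic data (made-up values) standing in for a wider data frame.
set.seed(7)
n <- 29
a <- rnorm(n)
wide <- data.frame(
  A = a,
  B = a + rnorm(n, sd = 0.1),   # built to track A closely
  C = rnorm(n),                 # unrelated noise
  D = -a + rnorm(n, sd = 0.1)   # built to move opposite to A
)

cm <- cor(wide)

# Keep each pair once by looking only at the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_list <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx],
  stringsAsFactors = FALSE
)

# Arbitrary cutoff: report pairs with |r| of at least 0.8, strongest first.
strong <- pair_list[abs(pair_list$r) >= 0.8, ]
strong <- strong[order(-abs(strong$r)), ]
print(strong)
```

Filtering on the absolute value keeps strong negative correlations as well as strong positive ones, which matches looking for both bright blue and bright red squares.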
Here is an example correlogram with financial data from the peer institutions.
It does not take long to detect a strong correlation between revenue (“Rev”) and expenses (“Exp”), but that would be expected. As one quick analysis from this correlogram, I noticed that there is a negative correlation between tuition (“Tui”) and the percentage of revenue generated from local taxes (“Loc”). While a correlation does not imply any sort of causal direction, I would guess that as local taxes decrease, for whatever reason, the college must increase tuition to make up the difference. More research would be needed to verify that guess.
Correlation is a very important tool for researchers and R provides a number of ways to calculate and interpret correlations. This post has listed a few that I tend to use, but there are many more (several that I’m sure I’ve never even heard of).