When working with a data frame that contains a number of variables, it is helpful to look at combinations of those variables to determine whether any relationships are evident. One way to do that is to create a correlogram (see the post for August 31), but another interesting method is to create a scatterplot matrix (sometimes called a SPLOM), and this post illustrates that technique.
For this exploration I chose to use the IPEDS data frame for two reasons. First, I’m familiar with that data frame and, second, it has a lot of data that can be analyzed as a good example. To make a nice SPLOM that would be easy to read, I wanted four or five variables that would reasonably be related and that I could pair, so I chose the various categories of salary as a percent of core expenses. Following are the five variables that I compared.
- Core: Salaries and wages for core expenses as a percent of total core expenses
- Inst: Salaries and wages for instruction as a percent of total expenses for instruction
- AcadSup: Salaries and wages for academic support as a percent of total expenses for academic support
- StuSvc: Salaries and wages for student services as a percent of total expenses for student services
- InstSup: Salaries and wages for institutional support as a percent of total expenses for institutional support
It may be helpful to look at the raw data for these five variables.
Here is the SPLOM for these five variables.
A bit of an explanation is in order.
The variables tested are listed along the diagonal, along with the following information about each variable.
- Histogram. This plot consists of rectangles whose heights indicate the number of values that fall into each bin. As an example, the histogram in the InstSup (“Institutional Support”) box is divided into five bins that indicate the various percentages colleges allocate for Institutional Support salaries. The tallest rectangle, between 30 and 40 percent, shows that most colleges in the sample allocate between 30 and 40 percent of their Institutional Support expenses to salaries. Note: the y-axis scale is intended to be used for the scatter plots and is not meaningful for the histogram.
- Density Plot. The black line is the density, or data distribution, smoothed out over the entire range of the variable. This is useful for visualizing the distribution where the histogram may be too coarse. The density plot for the “Core” variable shows an almost perfect normal distribution, which is not evident from the histogram.
- Rug Plot. These are the tiny tick marks along the bottom of the histogram, and they show where the 29 colleges lie along the histogram. Using the InstSup values, the rug plot makes it clear that while most of the colleges are in the 30-40 percent range, there is an outlier at both the lower and upper end of the range. Note: there is not a tick mark for each of the 29 colleges in the data frame, since some tick marks indicate two or more colleges.
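As a rough sketch of what sits behind those three diagonal elements, here is some plain Python that bins values for a histogram, builds a Gaussian kernel density estimate, and collapses tied values into rug ticks. This is not the code that produced the plot in this post. The sample values are made up for illustration (they are not the IPEDS numbers), and the function names and the bandwidth are my own assumptions.

```python
import math

# Made-up percentages for illustration only; these are NOT the IPEDS values.
sample = [12, 31, 33, 33, 35, 36, 38, 38, 39, 58]

def histogram_counts(values, lo, hi, n_bins):
    """Count the values falling into each of n_bins equal-width bins on [lo, hi].

    Assumes every value lies within [lo, hi].
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # A value exactly equal to hi is placed in the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts

def gaussian_kde(values, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

def rug_positions(values):
    """Distinct tick positions: tied values collapse into a single tick."""
    return sorted(set(values))

counts = histogram_counts(sample, lo=10, hi=60, n_bins=5)
density = gaussian_kde(sample, bandwidth=3.0)  # bandwidth chosen arbitrarily
ticks = rug_positions(sample)
print(counts)      # [1, 0, 8, 0, 1]
print(len(ticks))  # 8 ticks for 10 values
```

With these made-up values the 30-40 bin dominates, the two extreme values sit alone in their bins, and the ten values produce only eight ticks because two pairs are tied.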
The boxes in the upper-right corner display the correlation coefficient for the two variables that intersect in that box. For example, the very top-right corner, with a coefficient of 0.16, is the correlation between Core (Salaries and wages for core expenses as a percent of total core expenses) and InstSup (Salaries and wages for institutional support as a percent of total institutional support expenses).
The boxes in the lower-left corner display a scatter plot of the two variables at that intersection. For example, consider the box in the bottom-left corner of the SPLOM.
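For readers who want to see where such a coefficient comes from, the following is a minimal sketch of the Pearson correlation, computed once for every pair of variables above the diagonal. I am assuming the coefficients in the plot are Pearson correlations. The column names and values below are invented for the example and have nothing to do with the IPEDS data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def upper_triangle(columns):
    """Correlation for each pair of columns, one entry per box above the diagonal."""
    names = list(columns)
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Hypothetical columns, made up for illustration.
columns = {
    "v1": [1, 2, 3, 4, 5],
    "v2": [2, 4, 6, 8, 10],  # an exact multiple of v1
    "v3": [5, 3, 4, 1, 2],
}
pairs = upper_triangle(columns)
for pair, r in pairs.items():
    print(pair, round(r, 2))
```

Three variables give three boxes above the diagonal. The five variables in this post give ten.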
- Scatterplot. The dots are a scatter plot of Core (Salaries and wages for core expenses as a percent of total core expenses) along the x-axis and InstSup (Salaries and wages for institutional support as a percent of total institutional support expenses) along the y-axis.
- Means. The red dot is the location of the means for the two variables.
- Correlation Circle. The circle indicates the strength of the correlation. A perfectly circular trace would indicate a correlation of zero, while a straight line would be a correlation of +/- 1.0. Thus, the correlation circle in the lower-left corner is nearly perfectly round, which matches the weak correlation of 0.16. Compare that to the correlation circle for AcadSup and StuSvc, which is more elliptical and has a positive slope, graphically illustrating the correlation coefficient of 0.43.
- LOESS Line. The red line is the LOESS line, which is a regression line that uses local weighting to produce a smooth curve for the relationship between the variables, as opposed to a straight regression line. As an example, while the correlation is very weak, the LOESS line for the box in the lower-left corner (Core vs. InstSup) suggests that colleges spending the smallest percentage on Core salaries also spend the smallest percentage on InstSup salaries and, more generally, that colleges at an extreme of either variable tend to be at the extreme of the other. The LOESS line for AcadSup vs. StuSvc is almost straight and in a positive direction, which would be expected with a correlation of 0.43 for those two variables. Another easy-to-spot positive LOESS line is that for Inst vs. AcadSup, where the correlation is 0.39. Finally, notice the LOESS line for Inst vs. StuSvc. That correlation circle is fairly elliptical and corresponds to a coefficient of 0.36. The LOESS line within that circle is linear with a positive slope. Notice, though, the large outliers for both Inst and StuSvc, which would not be evident from the coefficient alone.
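Two of the overlays described above can be sketched in a few lines of plain Python. The first function uses one common construction of a correlation ellipse for standardized variables, where the semi-axes along the two diagonals are proportional to sqrt(1 + r) and sqrt(1 - r). The plotting package behind the figure may scale its ellipse differently, so treat this as an assumption. The second is a bare-bones LOESS: a weighted straight-line fit around each point using tricube weights, with no robustness iterations. The function names, the `span` default, and the demo data are all mine.

```python
import math

def ellipse_axes(r):
    """Semi-axis lengths along the +45 and -45 degree diagonals (standardized data)."""
    return math.sqrt(1 + r), math.sqrt(1 - r)

def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 once |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def loess(xs, ys, span=0.75):
    """Smoothed y for each x from a locally weighted linear fit (simplified LOESS)."""
    n = len(xs)
    k = min(n, max(2, int(round(span * n))))  # neighbours used for each local fit
    fitted = []
    for x0 in xs:
        # Bandwidth is the distance to the k-th nearest point (guard against zero).
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        w = [tricube((x - x0) / h) for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def axis_ratio(r):
    """Minor-to-major axis ratio: 1.0 is a circle, 0.0 is a straight line."""
    major, minor = ellipse_axes(abs(r))
    return minor / major

# Coefficients quoted in the post.
print(round(axis_ratio(0.16), 2), round(axis_ratio(0.43), 2))

# Made-up, exactly linear demo data: a local linear fit should reproduce it.
demo_x = list(range(10))
demo_y = [2 * x + 1 for x in demo_x]
smooth = loess(demo_x, demo_y)
```

Under this construction a correlation of 0.16 gives an axis ratio of roughly 0.85 (close to round), while 0.43 gives roughly 0.63 (visibly elliptical), which is consistent with the description above.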
A SPLOM is a great tool for initial data analysis: in a very compact display, it provides a lot of information to help an analyst find relationships of interest.