This post uses the Cochise College IPEDS Peer data frame, which was first seen in Introduction to IPEDS Peers. That data frame includes 113 attributes for 29 colleges, and it is natural to wonder whether any of those attributes are related closely enough to be used for predictions. The relationships between selected attributes were explored in About Correlation, where two correlograms were generated to find highly correlated attributes. This post uses several attributes from the IPEDS data as factors in a regression analysis to see whether a model can be developed that predicts the value of one attribute when given the value of another.

Here is a correlogram of the race/ethnicity attributes along with the tuition charged.

This correlogram has some interesting correlations, but I suspect that those are a product of nothing more than geography. Community colleges tend to attract local students, so the student demographics tend to mirror the local population. Thus, there is a strong negative correlation between black and white students and between Hispanic and white students. I suspect that this does not indicate one race refusing to go to college with another but, rather, that the communities where the colleges are located are somewhat polarized.

For this analysis, I wanted to determine if tuition has an influence on the ethnic makeup of the student body. At just a quick glance, I can see that there is a moderate positive correlation between tuition and white students and a moderate negative correlation between tuition and Hispanic students.

Two simple linear models will be developed with R and then those models will be used to predict the percent of Hispanic and White students when given a specific tuition.

The general regression formula is \(Y = \alpha X + \beta\), where the output (\(Y\)) is determined by two parameters (\(\alpha\) and \(\beta\)) along with the input (\(X\)). As an example, if \(\alpha\) is 2 and \(\beta\) is 1, then an input (\(X\)) of 1 would yield an output (\(Y\)) of 3. When graphed, \(X\) is the independent variable, \(Y\) is the dependent variable, \(\alpha\) is the slope of the regression line, and \(\beta\) is its y-intercept.
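As a quick check of that arithmetic, the formula can be written as a small R function. This is only a sketch; the function name predict_y is made up for illustration and is not part of the original analysis.

```r
# Simple linear regression formula: Y = alpha * X + beta
# (alpha is the slope and beta is the y-intercept, matching the notation above)
predict_y <- function(x, alpha, beta) {
  alpha * x + beta
}

# The example from the text: alpha = 2, beta = 1, X = 1
predict_y(1, alpha = 2, beta = 1)
#> [1] 3

# R arithmetic is vectorized, so several inputs can be evaluated at once
predict_y(c(0, 1, 2), alpha = 2, beta = 1)
#> [1] 1 3 5
```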

To create the regression model, I focused first on the relationship between the percentage of Hispanic students (the dependent variable) and tuition (the independent variable). R has a linear model function, lm(Hispanic ~ Tuition), that was used to determine the values of \(\alpha\) and \(\beta\). Plugging those values into the regression formula yielded \(Y = -0.011 X + 58.13\). To calculate the predicted percent of Hispanic students for a tuition of $3,000, I used that number for \(X\) and solved the equation, which comes out to 25.21%. (The rounded coefficients shown here give \(-0.011 \times 3000 + 58.13 = 25.13\), so the small difference presumably comes from rounding the slope.)
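The fit-and-predict steps might look something like the sketch below. The real IPEDS peer data is not reproduced here, so the data frame is SYNTHETIC, generated loosely around the equation above, and the names peers, Tuition, and Hispanic are assumptions. The coefficients it prints will not match the ones reported in this post.

```r
# SYNTHETIC stand-in for the 29 peer colleges; these are not real IPEDS values.
# The column names Tuition and Hispanic are assumed for illustration.
set.seed(42)
peers <- data.frame(Tuition = round(runif(29, min = 1500, max = 4000)))
peers$Hispanic <- 58.13 - 0.011 * peers$Tuition + rnorm(29, sd = 5)

# Fit the simple linear model: Hispanic as a function of Tuition
model <- lm(Hispanic ~ Tuition, data = peers)

# coef() returns the y-intercept (beta) first, then the slope (alpha)
coef(model)

# Predict the percent of Hispanic students at a tuition of $3,000
predicted <- predict(model, newdata = data.frame(Tuition = 3000))
predicted

# The same prediction by hand: alpha * X + beta
by_hand <- coef(model)[["Tuition"]] * 3000 + coef(model)[["(Intercept)"]]
by_hand
```

Using predict() rather than the printed coefficients avoids the small rounding error that creeps in when the equation is solved by hand.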

It may be easier to visualize this relationship with a scatter plot that shows the relationship between those two variables along with a line of best fit for the data.

In the above plot, the colleges in the IPEDS peer group are indicated by the black dots, and the blue line is the line of best fit. Because the correlation is negative, the blue line angles downward. The gray zone is the 95% confidence interval for the line of best fit.
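A plot of this kind can be sketched in base R as shown below. This is not necessarily how the original figure was produced (the plotting library is not stated), and the data is again SYNTHETIC with assumed column names. The gray band comes from predict() with interval = "confidence", which returns the fitted value plus the lower and upper bounds of the confidence interval.

```r
# SYNTHETIC stand-in data; not the real IPEDS values
set.seed(42)
peers <- data.frame(Tuition = round(runif(29, min = 1500, max = 4000)))
peers$Hispanic <- 58.13 - 0.011 * peers$Tuition + rnorm(29, sd = 5)

model <- lm(Hispanic ~ Tuition, data = peers)

# Evaluate the fitted line and its 95% confidence interval on a grid of tuitions
grid <- data.frame(Tuition = seq(min(peers$Tuition), max(peers$Tuition),
                                 length.out = 100))
band <- predict(model, newdata = grid, interval = "confidence", level = 0.95)

# Set up the axes first, then draw the band, the line, and the points on top
plot(peers$Tuition, peers$Hispanic, type = "n",
     xlab = "Tuition ($)", ylab = "Hispanic students (%)")
polygon(c(grid$Tuition, rev(grid$Tuition)),
        c(band[, "lwr"], rev(band[, "upr"])),
        col = adjustcolor("gray", alpha.f = 0.5), border = NA)
lines(grid$Tuition, band[, "fit"], col = "blue", lwd = 2)
points(peers$Tuition, peers$Hispanic, pch = 19)
```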

Next, I focused on the relationship between the percentage of White students (the dependent variable) and tuition (the independent variable). Here is a scatter plot that shows that relationship along with a line of best fit for the data.

Using the same procedure developed above, the predicted percent of White students for a tuition of $3,000 comes out to 47.05%. While this is an interesting exercise, any sort of cause-and-effect discussion should be avoided. As the percentage of White students increases, is there pressure to increase tuition (perhaps due to services demanded by the student body)? Or does higher tuition tend to deter students of color? Of course, there are many other factors not considered in this simple analysis, like the geographic community where the college is located or the availability of financial aid.

Multiple Regression

It is possible to have more than one variable influence the output, and in that case a multiple regression is used for predictions. For this part of the post, I decided to use three income streams to predict the core revenue available. The three streams are the percent of the core revenue provided by tuition, by local funding, and by state funding. This is a simplified view of revenue that ignores sources like grants, but it is adequate for this analysis.

The first step was to construct a correlogram to get a sense of the relationship between these factors. According to this chart, local revenue has a moderate correlation with total revenue, but neither state funding nor tuition has much of a correlation.
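The numbers behind a correlogram are pairwise correlation coefficients, which base R computes with cor(). The sketch below uses SYNTHETIC data with assumed column names (Revenue, Tuition, State, Local), so its values say nothing about the real peer group. The package used to draw the original chart is not stated, so only the underlying matrix is shown.

```r
# SYNTHETIC stand-in data; not the real IPEDS values.
# Tuition, State, and Local are percentages of core revenue (assumed names).
set.seed(1)
n <- 29
peers <- data.frame(
  Tuition = runif(n, min = 10, max = 30),
  State   = runif(n, min = 10, max = 30),
  Local   = runif(n, min = 10, max = 30)
)
# A made-up relationship in which only Local drives Revenue
peers$Revenue <- 2e6 + 1.3e6 * peers$Local + rnorm(n, sd = 8e6)

# Pairwise Pearson correlations between all four columns
cor_matrix <- cor(peers)
round(cor_matrix, 2)

# Just the correlations of each income stream with Revenue
round(cor_matrix["Revenue", c("Tuition", "State", "Local")], 2)
```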

The general multiple regression formula is \(Y = \alpha_1X_1 + \alpha_2X_2 + \alpha_3X_3 + \beta\), where the output \(Y\) is determined by the model parameters \(\alpha_1\), \(\alpha_2\), \(\alpha_3\), and \(\beta\) and the inputs \(X_1\), \(X_2\), and \(X_3\). For the revenue regression, there are three input variables (tuition, state, local) that are used to predict the core revenue. In R, the formula is lm(Revenue ~ Tuition+State+Local).
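A minimal sketch of that call is shown below. The data frame is SYNTHETIC, and the column names are assumed to match the formula in the text, so the fitted coefficients are illustrative only.

```r
# SYNTHETIC stand-in data; not the real IPEDS values
set.seed(1)
n <- 29
peers <- data.frame(
  Tuition = runif(n, min = 10, max = 30),
  State   = runif(n, min = 10, max = 30),
  Local   = runif(n, min = 10, max = 30)
)
# A made-up linear relationship plus noise
peers$Revenue <- 2e6 + 1.0e6 * peers$Tuition + 1.0e6 * peers$State +
  1.3e6 * peers$Local + rnorm(n, sd = 5e6)

# Multiple regression: Revenue predicted from the three income streams
model <- lm(Revenue ~ Tuition + State + Local, data = peers)

# Coefficients come back in formula order:
# (Intercept) = beta, then Tuition, State, Local = alpha_1, alpha_2, alpha_3
coef(model)

# Predict revenue for one hypothetical college
predict(model, newdata = data.frame(Tuition = 20, State = 30, Local = 25))
```

Calling summary(model) on the result prints standard errors and p-values for each coefficient, which is a quick way to see which inputs carry real weight.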

This is the regression formula that was generated from the IPEDS peer data: \(Y = 962285X_1 + 956919X_2 + 1312015X_3 + 1953830\). The linear model was used to create the following scatter plot.
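To see how the formula is used, it can be evaluated for a single set of inputs. The sketch below takes the coefficients as reported above and assumes that \(X_1\), \(X_2\), and \(X_3\) are tuition, state, and local, following the order of the terms in the lm() call. The input values are HYPOTHETICAL and are assumed to be expressed in percentage points.

```r
# Coefficients as reported in the text; the mapping to Tuition, State, and
# Local is an assumption based on the order of terms in the lm() formula
coefs <- c(Tuition = 962285, State = 956919, Local = 1312015)
intercept <- 1953830

# HYPOTHETICAL inputs, assumed to be percentage points of core revenue
inputs <- c(Tuition = 20, State = 30, Local = 40)

# Y = alpha_1 * X_1 + alpha_2 * X_2 + alpha_3 * X_3 + beta
predicted_revenue <- sum(coefs * inputs[names(coefs)]) + intercept
predicted_revenue
#> [1] 102387700
```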

In the above plot, the tuition percentages are in blue, the state percentages are in gold, and the local percentages are in red. Notice that the local percentages seem to make the greatest difference, since that line is steeper than the other two. The state percentages seem to be the least important, since that line is nearly flat.