Capstone Project

Introduction

The College Scorecard is a service meant to help prospective students make their college decision. Whether by comparing size, popular majors, or costs against the national average, the site’s goal is to help the user find a good fit. For my Springboard Capstone Project, I used the dataset made available by the College Scorecard to try to find additional ways to help users in their decision.

The Data

Available in a .zip file from the link above, the data is split into 19 .csv files - one for each academic year from 1996-’97 to 2014-’15. Each file contains 1,744 recorded data points (columns) and about 7,500 schools (rows). My first step was to add a column indicating the academic year for which the data was collected and then merge the 19 files into one large .csv, which I called fulldata.csv. From this 2 GB file, I would create subsets for plotting and studying trends. Next, I had to refer to the data dictionary to understand the column names and some of the placeholder values used. Once I merged the files and had a basic understanding of what data was available, I began to create some plots and documented my initial findings in my data story. As a note, my analysis covers US schools only. The dataset includes records from US territories as well, but I have filtered them out when creating subsets for plotting/modeling.
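As a rough sketch of that merge step (the file names, placeholder strings, and use of readr/dplyr here are my assumptions for illustration, not a record of the exact script):

# Hypothetical merge sketch: assumes the yearly files keep the
# "MERGED<year>_<yy>_PP.csv" names from the College Scorecard download.
library(readr)
library(dplyr)

years <- 1996:2014
files <- sprintf("MERGED%d_%02d_PP.csv", years, (years + 1) %% 100)

fulldata <- bind_rows(lapply(seq_along(files), function(i) {
  df <- read_csv(files[i],
                 col_types = cols(.default = "c"),         # read as text, coerce later
                 na = c("", "NULL", "PrivacySuppressed"))  # assumed placeholder values
  df$DATAYEAR <- years[i]                                  # tag the academic year
  df
}))

write_csv(fulldata, "fulldata.csv")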

Deliverables

My first item will be using linear regression models to determine which variables are the most influential on the cost of a school, and to potentially predict what the cost of a school may be in the future. This would be useful in two ways. It could help students who are on the fence about going to college immediately after high school by revealing a likely decline in cost. If a student sees that their ideal school is likely to be cheaper in two years, they may make the decision to find entry-level work or go traveling before going to college. It could also help the schools by indicating areas of their budget that are influencing the cost of attendance. Of course, the goals of each school are different and they may not be interested in reducing cost, but if a school is experiencing a decline in applications, cost of attendance is likely to be something they look at.

The second item is a recommendation tool that works similarly to the clustering methods used by media services such as Netflix, YouTube, and iTunes. By clustering on attributes like location, cost, and SAT scores, a student can find options that are close to their top choice. Mentality is a very important part of finding success at school, and feeling out of place in freshman year can be quite discouraging. If their top choice is a far reach school, or it is too expensive or too far away, then finding alternatives would hopefully help them to be satisfied in their decision.

My final item will be a basic UI that allows the user to explore the dataset on their own. Although the data will need to be curated and shaved down to a size that a standard internet browser can handle, my hope is that it will provide some transparency between students and universities. As an example, in my data story, I explored SAT averages and admission rates. Even though one could reasonably guess how they are related (higher SAT scores ~ lower admission rates), being able to plot the data and draw a conclusion from a graph is much more convincing. By putting this power in the hands of students and their families, they should be able to make much more educated decisions.

Data Preparation

From fulldata.csv, I created and saved a subset of the columns that would make repeated data manipulation more manageable. I named it subdata and it contains various school identification data (name, ID, location, level of institution), admission data (student demographics, SAT scores), education data (majors, completion rates), and financial data (cost, aid, debt, repayment).

From this subset, I can pick the variables for the cost modeling, as well as the variables for clustering - both tools will get their own purpose-built subsets so that the data stays organized and is easier to load in the future. A combination of the original source data and my curated data will go into the UI plotting tool, keeping in mind that the dataset should stay a reasonable size but also have enough depth for the user to gain whatever insights they can.

Cost Prediction

Using the DATAYEAR column, I created the training file from years 1996 to 2013 and the test file from just the 2014 data. Since the cost is known for 2014, we can compare afterwards to see how good the model is beyond the \(R^2\) score alone.

From subdata, I created a subset of up to 14 columns that I felt would be related to the attendance cost of a school. This included columns like “CONTROL” (whether the school is public or private), “UGDS” (the undergraduate student population), “PFTFAC” (percentage of faculty that are full time), and of course “COSTT4_A” (cost of attendance for a 4-year academic year school). Upon checking the summary() of the training set, I noticed that the MAX of “UGDS” was unnaturally high. This was because there were a few records of the University of Phoenix’s online campus, which can have about 200,000 students. Since this was skewing the data, I removed its related ID from the training and test sets, which was just a handful of entries. I also removed all records in the training set that did not have a cost entry, since they would not contribute to the model. This filtering had an interesting byproduct - apparently years 2008 and prior did not have cost records, so removing the NA entries also happened to filter the dataset to years 2009 and beyond. After taking care to turn the appropriate columns into the numeric data type, I moved on to make small experimental models.
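A condensed sketch of the split and cleaning steps (the subset name costdata, the filtering by institution name rather than by its ID, and the exact column list are stand-ins for illustration):

# Train on 1996-2013, hold out 2014; "costdata" stands in for the cost subset.
train <- subset(costdata, DATAYEAR <= 2013)
test  <- subset(costdata, DATAYEAR == 2014)

# Drop the University of Phoenix online-campus records that skew UGDS
# (filtering by name here; the original step removed the school's ID).
train <- train[!grepl("University of Phoenix", train$INSTNM, fixed = TRUE), ]
test  <- test[!grepl("University of Phoenix", test$INSTNM,  fixed = TRUE), ]

# Rows with no recorded cost cannot contribute to the model.
train <- train[!is.na(train$COSTT4_A), ]

# Make sure the modeling columns are numeric.
num_cols <- c("COSTT4_A", "SAT_AVG", "UGDS", "AVGFACSAL",
              "PFTFAC", "DEBT_MDN", "FAMINC")
train[num_cols] <- lapply(train[num_cols], as.numeric)
test[num_cols]  <- lapply(test[num_cols],  as.numeric)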

The first model I tried was Cost by Undergrad Pop., Average Faculty Salary, and % of Faculty that are Full-Time. However, this only gave an \(R^2\) of 0.2198, which is not that great. Still, the model summary suggests that these three variables are significant (using asterisk notation), so it is a step in the right direction. The second model I tried used 12 independent variables to model Cost. It gave an \(R^2\) of 0.8253, but showed that three of the variables had little to no significance. After removing these, this was the final model I came to:

Cost by Datayear, Highest degree offered, Pub/Priv, Avg SAT, Undergrad Pop., Avg Faculty Salary, % of Faculty that are Full-Time, Median Debt, and Avg Family Income

Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.510417e+06 1.012741e+05 -14.914149 0e+00
DATAYEAR 7.357168e+02 5.040102e+01 14.597261 0e+00
HIGHDEG 7.879712e+02 1.276996e+02 6.170508 0e+00
CONTROL 1.522268e+04 1.605895e+02 94.792511 0e+00
SAT_AVG 1.140344e+01 8.111172e-01 14.058929 0e+00
UGDS -1.843476e-01 1.155470e-02 -15.954390 0e+00
AVGFACSAL 1.709129e+00 4.681430e-02 36.508723 0e+00
PFTFAC -1.351732e+03 2.673499e+02 -5.056040 4e-07
DEBT_MDN 2.557939e-01 1.992430e-02 12.838282 0e+00
FAMINC 1.043032e-01 4.617600e-03 22.588185 0e+00

With an \(R^2\) of 0.825, this model is almost as good as the second model, but now all of the variables used are significant. In other words, the variables that likely have the most influence on the cost of attendance of a school are the ones in this model. This is saved as cost.mod and will be used to predict on the test set.
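A minimal sketch of that fit and the prediction step, using the cleaned train/test sets sketched above:

# Final cost model: all remaining variables are significant.
cost.mod <- lm(COSTT4_A ~ DATAYEAR + HIGHDEG + CONTROL + SAT_AVG + UGDS +
                 AVGFACSAL + PFTFAC + DEBT_MDN + FAMINC,
               data = train)
summary(cost.mod)$r.squared          # ~0.825 on the training data

# Predict 2014 costs and record how far off each prediction is.
test$PREDICT <- predict(cost.mod, newdata = test)
test$DIFF    <- test$PREDICT - test$COSTT4_A   # positive = predicted too high
summary(test)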

Summary of test data after predictions

##      OPEID6         INSTNM             DATAYEAR        REGION     
##  Min.   : 1002   Length:3888        Min.   :2014   Min.   :0.000  
##  1st Qu.: 2470   Class :character   1st Qu.:2014   1st Qu.:3.000  
##  Median : 3810   Mode  :character   Median :2014   Median :5.000  
##  Mean   : 9543                      Mean   :2014   Mean   :4.408  
##  3rd Qu.:10727                      3rd Qu.:2014   3rd Qu.:6.000  
##  Max.   :42345                      Max.   :2014   Max.   :8.000  
##                                                                   
##     HIGHDEG        CONTROL         SAT_AVG          UGDS      
##  Min.   :0.00   Min.   :1.000   Min.   : 720   Min.   :    0  
##  1st Qu.:2.00   1st Qu.:1.000   1st Qu.: 973   1st Qu.:  446  
##  Median :3.00   Median :2.000   Median :1039   Median : 1486  
##  Mean   :2.97   Mean   :1.833   Mean   :1059   Mean   : 3893  
##  3rd Qu.:4.00   3rd Qu.:2.000   3rd Qu.:1120   3rd Qu.: 4294  
##  Max.   :4.00   Max.   :3.000   Max.   :1545   Max.   :77657  
##                                 NA's   :2595   NA's   :1      
##    AVGFACSAL         PFTFAC          DEBT_MDN         FAMINC      
##  Min.   :  332   Min.   :0.0000   Min.   : 1354   Min.   :     0  
##  1st Qu.: 4914   1st Qu.:0.2958   1st Qu.: 8414   1st Qu.: 24289  
##  Median : 6102   Median :0.5296   Median :12500   Median : 35595  
##  Mean   : 6351   Mean   :0.5630   Mean   :12954   Mean   : 43701  
##  3rd Qu.: 7553   3rd Qu.:0.8564   3rd Qu.:17500   3rd Qu.: 58456  
##  Max.   :20650   Max.   :1.0000   Max.   :37500   Max.   :152100  
##  NA's   :190     NA's   :520      NA's   :232     NA's   :15      
##     COSTT4_A        PREDICT           DIFF         
##  Min.   : 5536   Min.   : 9691   Min.   :-15605.9  
##  1st Qu.:14634   1st Qu.:24010   1st Qu.: -2884.8  
##  Median :23200   Median :34152   Median :   568.9  
##  Mean   :25242   Mean   :33199   Mean   :   560.2  
##  3rd Qu.:31789   3rd Qu.:40676   3rd Qu.:  3819.4  
##  Max.   :79212   Max.   :65985   Max.   : 26866.1  
##                  NA's   :2608    NA's   :2608

The predict() function returns the predicted values, which I saved in a column named “PREDICT” and, by taking the difference between the prediction and the actual cost, we can see whether the prediction was too high or too low. This summary tells us that the extreme cases were -$15,605 and $26,866 off the mark. What is more disappointing, however, is the number of NAs in the prediction output. To see where the model is falling short, I can check two plots. The first plot overlays the prediction values (in black) over the training data. This helps to see if the predicted values stayed in the same range and may indicate another reason why some predictions weren’t made. The second plot overlays the prediction values over the actual 2014 values, and they are coloured green if within $4000 of the actual cost and red if the prediction is further off. If the plot is mostly green, that would be a good sign.
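The second plot could be drawn along these lines (a ggplot2 sketch; using OPEID6 as the school index and this exact styling are assumptions on my part):

library(ggplot2)

# Overlay predictions on the actual 2014 costs, coloured by whether the
# prediction landed within $4,000 of the true value.
ggplot(test, aes(x = OPEID6)) +
  geom_point(aes(y = COSTT4_A), colour = "grey70", size = 1) +
  geom_point(aes(y = PREDICT, colour = abs(DIFF) <= 4000), size = 1) +
  scale_colour_manual(values = c("TRUE" = "green3", "FALSE" = "red"),
                      name = "Within $4,000") +
  labs(x = "School ID (OPEID6)", y = "Cost of attendance ($)")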

According to these plots, the predictions were mostly successful for schools in the 0 ~ 3000 range for the school ID, while predictions beyond that are very sporadic. Looking at the dataset and sorting by the ID shows that there are many NAs in the independent variables, causing the prediction to fail for those cases. This is unfortunate, since this causes about 2,600 failed predictions despite a fairly strong model. This leaves us with a few different options. We could remove the independent variable(s) with the most NAs from the model, which risks weakening the model but will yield more successful predictions. Another option could be to replace NAs with another value, such as the mean for that column. This doesn’t hurt the model at all, but the predictions for the schools with approximated values have a chance to be completely off.

The independent variable with significantly more NAs than the others is the SAT average. We can create a new model that does not include this variable and, it turns out, removing it reveals “PFTFAC” (percentage of faculty that are full time) to be an insignificant variable as well. With both removed, the new model has an \(R^2\) of 0.771, which is not bad at all.

Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.144029e+06 6.731323e+04 -16.99561 0
DATAYEAR 5.619118e+02 3.347814e+01 16.78444 0
HIGHDEG 1.017738e+03 5.860958e+01 17.36470 0
CONTROL 8.353832e+03 7.064582e+01 118.24949 0
UGDS -1.754977e-01 8.042900e-03 -21.82009 0
AVGFACSAL 1.217497e+00 3.119080e-02 39.03382 0
DEBT_MDN 2.815654e-01 1.408230e-02 19.99431 0
FAMINC 2.133767e-01 2.913400e-03 73.23871 0
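For reference, the refit is just the earlier model with the two variables dropped (a sketch):

# Drop SAT_AVG (too many NAs) and PFTFAC (no longer significant).
cost.mod2 <- update(cost.mod, . ~ . - SAT_AVG - PFTFAC)
summary(cost.mod2)$r.squared         # ~0.771

# Predict again with the reduced model.
test$PREDICT2 <- predict(cost.mod2, newdata = test)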

The summary() of the test set with the new predictions also shows that there are only about 400 NAs in the prediction column this time, which is a massive improvement. We can compare the performance of this model to the other option, where NAs would have been replaced by the column mean. Is it worth it to have a slightly weaker model for more accurate predictions, or will replacing the NAs solve the problem? Ideally, the colour coding of the plot will indicate the better model.
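The mean-imputation alternative could look like this (a sketch: the test predictors are filled with the training-set means before reusing the original, stronger model):

# Replace missing predictor values with the training-set column means,
# then predict with the original model.
impute_cols <- c("SAT_AVG", "UGDS", "AVGFACSAL", "PFTFAC", "DEBT_MDN", "FAMINC")

test_filled <- test
for (col in impute_cols) {
  col_mean <- mean(train[[col]], na.rm = TRUE)
  test_filled[[col]][is.na(test_filled[[col]])] <- col_mean
}

test_filled$PREDICT <- predict(cost.mod, newdata = test_filled)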

It truly looks as though the weaker model with a few NAs remaining is the better one for prediction. The amount of red in the NA-replacement model shows that many predictions were quite off, and that forcing the model to make predictions when there was not enough data was the wrong decision. Giving up the two variables and using a slightly weaker model, while keeping the original data, produces predictions that stay within the range of the actual 2014 data for the most part.

If we plot this new model, we can also use the residuals to judge how well it works.
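In R these are the standard lm diagnostic plots:

# Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.
par(mfrow = c(2, 2))
plot(cost.mod2)
par(mfrow = c(1, 1))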

The Residuals vs Fitted plot stands out right away, showing that the residuals are generally scattered evenly around zero. The Normal Q-Q plot also follows the normal line for the most part, aside from the very bottom and very top of the plot. It seems like this model is fairly strong, and the \(R^2\) and the residual plots reinforce it.

After revisions, the current conclusion is that Cost is strongly related to the school’s highest provided degree, whether it is public or private, the student population size, the average faculty salary, the median student debt, and the average family income. It is unlikely that the school will change between public/private, or that it will stop providing higher degrees, but there is a chance that the school might look into its admission rates. According to the coefficients table, school population is negatively correlated to attendance cost, which means that school prices go up as student population sizes go down; i.e., fewer people go to the more expensive schools, and the schools that accept the most students can afford to have slightly lower attendance costs since they make it back through sheer numbers.

It should also be mentioned that the $4,000 “acceptable” window I applied to the prediction values was chosen arbitrarily. This window could change depending on the student’s budget - someone who is more well-off could widen the window and someone with a tighter budget could shrink it. Despite removing some of the variables that made what seemed to be a stronger model, the final model we reached seems rather good at predicting cost when all data points are present.

School Recommendation

From subdata, I created another subset for clustering. Since the plan is to have a way to recommend schools to a student, I filtered down to just the school data from 2014 (meaning the 2014-’15 school year). Since this is the most recent year of data available in the dataset, it felt the most appropriate to use when trying to classify them. With some trial and error, I eventually decided on 9 variables to use, aside from the school name column. At this point, I made two subsets - clustdata for performing data manipulation, and a copy called schoolCluster2014 which would remain as is until the end when we have clusters to assign.

Proper data cleaning was important here, since calculating Euclidean distance is the first step to building the clustering model. I needed to scale() the numeric variables, so that variables with different ranges do not skew the model. However, before scaling, there were a few changes I had to make to the data.

For population size, family income, and age of entry, I replaced NAs with the mean of the column. The two cost columns were different - one column shows the cost of academic year schools and the other is for program year schools, meaning if one column had an entry, the other would be NA. In this case, it wouldn’t make sense to replace NAs with the mean, so I replaced them with 0 instead. Before scaling, I also recorded the mean() and sd() of each numeric column in a spreadsheet, since I will need them later to change the scaled values back to ones we can make sense of for analysis. With the NAs filled and those values noted down, we can run scale(clustdata[2:6]), where the 2nd to 6th columns are the numeric variables.
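A sketch of that preparation, assuming the five numeric columns are UGDS, FAMINC, AGE_ENTRY, COSTT4_A, and COSTT4_P (the data-dictionary names used elsewhere in this report):

# Fill NAs: column mean for population, family income, and age of entry;
# zero for the two mutually exclusive cost columns.
for (col in c("UGDS", "FAMINC", "AGE_ENTRY")) {
  clustdata[[col]][is.na(clustdata[[col]])] <- mean(clustdata[[col]], na.rm = TRUE)
}
clustdata$COSTT4_A[is.na(clustdata$COSTT4_A)] <- 0   # academic-year cost
clustdata$COSTT4_P[is.na(clustdata$COSTT4_P)] <- 0   # program-year cost

# Record the unscaled means and standard deviations before scaling,
# so the scaled values can be converted back later.
col_means <- sapply(clustdata[2:6], mean)
col_sds   <- sapply(clustdata[2:6], sd)

clustdata[2:6] <- scale(clustdata[2:6])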

I also transformed the categorical variables into their own binary columns - instead of checking the row for the value in the cell, the value is in the column name and the cell is a 1 or 0. This meant adding 51 columns for the states and DC, 3 columns for public or private (for-profit and non-profit), and 12 columns for different locale types (city, town, etc). After adding the new binary columns, the original columns were removed.
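One way to build those indicator columns (a sketch; STABBR, CONTROL, and LOCALE are my assumed names for the state, ownership, and locale columns, and friendlier column names such as public or city_large would come from renaming the levels afterwards):

# Expand a categorical column into one 0/1 column per level, then drop the original.
to_binary <- function(df, col) {
  f <- factor(df[[col]])
  m <- sapply(levels(f), function(lv) as.integer(f == lv))
  colnames(m) <- paste(col, levels(f), sep = "_")
  cbind(df[, names(df) != col, drop = FALSE], as.data.frame(m))
}

clustdata <- to_binary(clustdata, "STABBR")   # 51 state/DC columns
clustdata <- to_binary(clustdata, "CONTROL")  # public / private non-profit / for-profit
clustdata <- to_binary(clustdata, "LOCALE")   # 12 locale types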

After these changes, we can run dist(clustdata[2:72]) to get the Euclidean distances for the schools. With the distances, we can then create a cluster object, which can be used to plot a dendrogram. This will give us an idea of how many clusters we should create from the dataset.
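In R terms (the Ward linkage choice here is my assumption; the clustering method is not stated above):

# Euclidean distances over the scaled numeric and binary columns,
# then hierarchical clustering and a dendrogram.
distances <- dist(clustdata[2:72], method = "euclidean")
hc        <- hclust(distances, method = "ward.D")
plot(hc, labels = FALSE)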

Using a horizontal line, we can cut the dendrogram to help us decide how many clusters to use. If we cut higher up, we would not have enough clusters to give meaningful recommendations. The red boxes here show a spot that crosses the dendrogram 8 times.
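Cutting at that height corresponds to asking for 8 clusters:

# Draw the red boxes around the 8 clusters and assign each school to one.
rect.hclust(hc, k = 8, border = "red")
clusterGroups <- cutree(hc, k = 8)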

The next step involved a little bit of manual work outside of R. I used tapply() to find the mean of each variable within each cluster, copy/pasted the values into a text editor, and made a .csv manually. This file shows us what values are most common in each cluster, allowing us to come up with classifications. For the numeric variables, the value here is the mean (of the scaled value) in that cluster - for example, telling us the mean cost is high or the mean population is low. And for the binary columns, the value is just \(\frac{N}{n_k}\), where N is the number of “1” values in the cluster and \(n_k\) is the number of schools in that cluster - this gives us an easy percentage.
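A slightly more automated version of that step (a sketch): applying tapply() to every column at once produces the same table of per-cluster means.

# Mean of every (scaled or binary) variable within each of the 8 clusters.
cluster_means <- sapply(clustdata[2:72],
                        function(v) tapply(v, clusterGroups, mean))
cluster_means <- t(cluster_means)    # rows = variables, columns = clusters
write.csv(round(cluster_means, 7), "clusterMeans.csv")   # placeholder file name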

Cluster Means

name 1 2 3 4 5 6 7 8
UGDS 0.4516796 -0.2067874 3.7970170 0.1853507 -0.1563221 -0.3591402 -0.4336518 -0.2843682
COSTT4_A -0.0125438 -0.1699634 0.3727111 2.4307564 1.0558038 0.8350934 -0.8419533 -0.7801741
COSTT4_P -0.6554611 -0.6600259 -0.6600259 -0.6600259 -0.6589946 -0.6600259 1.3833504 0.0428761
FAMINC -0.1378020 -0.0299488 0.7190665 2.6949241 0.9855728 -0.5555464 -0.4865289 -0.4964065
AGE_ENTRY -0.3509990 0.3126336 -0.8724844 -1.5848950 -0.9838640 0.8368266 0.1085506 1.0393187
public 0.9743191 0.0016779 0.9663300 0.0000000 0.1203611 0.0000000 0.0009843 0.3038627
private_nonprof 0.0116732 0.9966443 0.0168350 1.0000000 0.7392177 0.0000000 0.0004921 0.1493562
private_forprof 0.0140078 0.0016779 0.0168350 0.0000000 0.1404213 1.0000000 0.9985236 0.5467811
city_large 0.0933852 0.2969799 0.3097643 0.2165775 0.2427282 0.3103015 0.2322835 0.1690987
city_mid 0.0980545 0.1442953 0.1178451 0.1737968 0.0651956 0.1608040 0.1318898 0.0669528
city_small 0.1237354 0.1325503 0.2154882 0.1417112 0.1273822 0.1155779 0.1471457 0.0660944
suburb_large 0.1245136 0.2416107 0.2020202 0.2887701 0.1935807 0.3542714 0.3400591 0.2085837
suburb_mid 0.0428016 0.0067114 0.0202020 0.0294118 0.0270812 0.0213568 0.0364173 0.0137339
suburb_small 0.0210117 0.0083893 0.0202020 0.0080214 0.0260782 0.0000000 0.0211614 0.0188841
town_fringe 0.0256809 0.0083893 0.0202020 0.0347594 0.0320963 0.0025126 0.0093504 0.0103004
town_distant 0.1035020 0.0201342 0.0404040 0.0454546 0.1404213 0.0100502 0.0349409 0.0523605
town_remote 0.1143969 0.0167785 0.0538720 0.0320856 0.0441324 0.0100502 0.0324803 0.0266094
rural_fringe 0.1564202 0.0486577 0.0000000 0.0133690 0.0391174 0.0125628 0.0127953 0.0600858
rural_distant 0.0404669 0.0184564 0.0000000 0.0160428 0.0250752 0.0025126 0.0004921 0.0188841
rural_remote 0.0233463 0.0167785 0.0000000 0.0000000 0.0130391 0.0000000 0.0004921 0.0068670

Since the first five rows are still scaled, it is hard to make classifications without context. That is why I kept the mean() and sd() of the unscaled values beforehand. To convert a scaled value \(z\) back to its original units, given the column’s original mean \(m\) and standard deviation \(sd\), we just use

\[ x = (sd\times z) + m \]

Now we can understand the numeric values since they are in their original units. Using Google Sheets, I did some colour coding to see which values here stood out. Some clusters are a bit more distinct than others, but 8 seems to have been a good choice. Yellow cells are close to the original mean(), green cells are above, and red cells are below. Although the colouring is a bit arbitrary from my end, the values are fortunately different enough to make unique clusters. I also checked the count of each of the clusters to see how many schools were placed into them, which could also help determine if the classifications we decide on make sense.

## clusterGroups
##    1    2    3    4    5    6    7    8 
## 1285  596  297  374  997  796 2032 1165

Based on all of these factors, I came up with the following classifications:

  1. Public, Medium Pop, Mid Cost (A), Mid Family Income, Mid 20s, Varying Locales
  2. Private Non-Prof, Low Pop, Mid Cost (A), Mid Family Income, Mid 20s, Large City/Suburb
  3. Public, High Pop, Mid Cost (A), High Family Income, Low 20s, Large City/Suburb
  4. Private Non-Prof, Medium Pop, High Cost (A), High Family Income, Low 20s, Large City/Suburb
  5. Private Non-Prof, Low Pop, High Cost (A), High Family Income, Low 20s, Large City/Suburb
  6. Private For-Prof, Low Pop, High Cost (A), Low Family Income, Upper 20s, Large City/Suburb
  7. Private For-Prof, Low Pop, High Cost (P), Low Family Income, Mid 20s, Large City/Suburb
  8. Public/Private For-Prof, Low Pop, Mid Cost (P), Low Family Income, Upper 20s, Large City/Suburb

There are some interesting take-aways from this already. It is pretty unsurprising that the two largest clusters are cluster 1, where student population, cost, and family income are all around average, and cluster 7, where students are joining for-profit schools to try to gain new skills in their mid-20s since their incomes are on the lower side. Cluster 1 is also not dominated by any one locale, though small towns and cities are the most common. Clusters 3, 4, and 5 also seem to indicate that the students going to schools in large locales are younger and have higher family incomes than the others, but still have different preferences on population and cost of their school.

Finally, we can append the clusters to the end of the original data. Earlier in the process I made a table called schoolCluster2014, to which I can now assign clusters. Now if a student has a school in mind, for example “Drexel University”, they can do a lookup for that school and see what cluster it is in and what its row number is. For Drexel, the row number is 3214 and it is in cluster 4. This row number can be used to do a lookup in a nearest neighbor matrix, which can show us the nearest 10 row names.
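A sketch of that lookup, building the neighbour list directly from the distance matrix (the helper function name is mine; row numbers follow the schoolCluster2014 table):

# Convert the dist object to a full matrix and pull the 10 closest schools.
dist_mat <- as.matrix(distances)

nearest_schools <- function(row_id, n = 10) {
  d   <- dist_mat[row_id, ]
  ids <- order(d)[2:(n + 1)]            # skip the school itself (distance 0)
  data.frame(INSTNM  = schoolCluster2014$INSTNM[ids],
             CLUSTER = clusterGroups[ids],
             row.names = NULL)
}

drexel_row <- which(schoolCluster2014$INSTNM == "Drexel University")[1]  # row 3214
nearest_schools(drexel_row)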

Drexel University’s Nearest Neighbors

## # A tibble: 10 x 2
##                               INSTNM CLUSTER
##                                <chr>   <int>
##  1 University of Southern California       4
##  2      George Washington University       4
##  3                 DePaul University       4
##  4         Loyola University Chicago       4
##  5                 Boston University       4
##  6           Northeastern University       4
##  7                Cornell University       4
##  8     St John's University-New York       4
##  9               Syracuse University       4
## 10        University of Pennsylvania       4

Just from the names alone and the little that I know about some of those schools, we can see that they are all big schools in big cities, just like Drexel University. Additionally, with an interactive UI, a student could simply take a look at all of cluster 4 and sort by population, cost, and compare schools based on their own preferences and needs. This brings us to the next and final tool.

College Scorecard Sandbox

While it feels great to have built models that seem to get the job done, I still may not meet all the needs of the students. To account for this, I would like to also provide the tools for analysis to the students themselves, using R Shiny. This allows an R developer to design a UI and set the various buttons and elements to trigger R scripts, essentially allowing a layman to perform R analysis without needing to know R.

The College Sandbox is the UI that I have created to serve this purpose. For it to work, I need to create any subsets that a user on the site may need. Currently, I have included a smaller version of subdata for custom plotting, as well as schoolCluster2014 for cluster analysis. Since a scatterplot will not always be the best visualization for the selected X and Y axes, there will soon be options to plot as a bar or pie chart. To try a combination that works, enter “SAT_AVG” for the X-axis and “ADM_RATE” for the Y-axis, and press Update. This plot can answer the question: “What does my SAT score need to be to get into the school(s) I’m considering?” They would notice right away that the higher the SAT scores go, the lower the admission rates get, indicating that higher SAT scores are needed for the more exclusive schools. Though this is probably already widely known, seeing some actual numbers can help the student plan ahead and set realistic goals.
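A bare-bones sketch of what the plotting tab boils down to in Shiny (the data file name and exact widget layout here are placeholders, not the deployed app):

library(shiny)
library(ggplot2)

plotdata <- read.csv("subdata_small.csv")   # placeholder for the curated subset

ui <- fluidPage(
  selectInput("xvar", "X-axis", choices = names(plotdata), selected = "SAT_AVG"),
  selectInput("yvar", "Y-axis", choices = names(plotdata), selected = "ADM_RATE"),
  actionButton("update", "Update"),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    input$update   # take a dependency on the Update button
    isolate(
      ggplot(plotdata, aes(x = .data[[input$xvar]], y = .data[[input$yvar]])) +
        geom_point(alpha = 0.4)
    )
  })
}

shinyApp(ui, server)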

A feature I would possibly like to introduce in the future is the ability to log in and save schools, plots, and data tables to a profile. I will have to learn more about server space and what steps I would need to take so that the site could handle a growing list of stored items. I also want the plots to be interactive, so that the user can zoom in or filter in real time rather than by pressing a button. The more steps I can take to ensure that the user experience is simple and intuitive, the more clarity the students will achieve from this tool.

Conclusion

From 19 individual .csv files, to one large 2 GB file, to dozens of smaller data subsets (both raw and transformed), there is a lot to be discovered within the data provided by The College Scorecard. Since there is so much available, machine learning methods like linear regression and clustering were fairly successful at adding more insights to what is available on their website. And now that we have these working baseline models, we could also experiment with adding other variables and seeing where there is room for improvements.

Linear regression showed us that a school’s cost of attendance is linked to its highest provided degree, whether it is public or private, the student population size, the average faculty salary, the median student debt, and the average family income. While it may not be possible or reasonable for a school to change some of these factors, how many students they accept and how much aid they are willing to give are both areas they could look into if they are willing. Students can also use this model to see if it is worth waiting a year to go to college, in case their top choice seems like it will drop in cost by then.

Clustering gave us a unique look at the spread of characteristics across the schools in 2014. With 8 uniquely defined clusters, finding similar schools is now an easy task. Students with far reach schools could benefit from this by having backups that they would still be satisfied with attending, since they would be so similar. Even students without a particular school in mind could get some ideas on where to apply, as long as they have some preferences on the characteristics of the schools. Those students could just search and sort through the cluster that matches the type of school they are interested in.

And lastly, I’ve provided an interactive UI that will help give students and their families the analytical tools they need to answer their own questions and try to learn more. The UI has an area to make custom plots out of a huge selection of variables, kept to a reasonable size since it is stored on a server, and also has the ability to filter and sort through the cluster data.

If you have any questions, or you would like to see more detail in how I prepared the data and plots, you can refer to the appendix. Or if you are interested in helping me improve our understanding of higher education and how to help students find a path to success, please feel free to email me. Many thanks to The College Scorecard for making this data available, and thank you very much for reading.