Data Story

Introduction

The dataset I will be exploring is from The College Scorecard, a service meant to help prospective students with their college decision. By recording data regarding admissions, education, and financial activity (to name a few), a new student can try to find a best-fit school based on previous paths of success for students that match their demographic, field of interest, socio-economic status, and so on. In this project, I will be looking at standardized test scoring and popular majors. Do high SAT scorers only go to the very exclusive schools? Where are those schools? What degrees are popular there? Which schools award the most degrees in what I’m interested in? These are the sort of questions that prospective students are asking when considering their college options.

Preparing the Data

The download contains 19 .csv files - one for each academic year from 1996-’97 to 2014-’15, with approximately 7,500 schools (rows) per file and 1,744 recorded data points (columns). I began by creating a merged .csv from these 19 files, so that the data could be easily found for plotting. Before combining the .csv files, I had to add a new column that would indicate the data collection year (taken from the filename).
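The merge step can be sketched roughly as follows. This is a minimal Python illustration, not the code behind this project: the file-name pattern (something like MERGED1996_97_PP.csv), the helper names, and the choice to store the starting year of each academic year in a DATAYEAR column are all assumptions.

```python
import csv
import os
import re


def year_from_filename(path):
    """Return the starting academic year found in a file name, e.g. 1996.

    Assumes names contain a pattern like '1996_97' (hypothetical convention).
    """
    match = re.search(r"(\d{4})_\d{2}", os.path.basename(path))
    if match is None:
        raise ValueError(f"no academic year found in {path!r}")
    return int(match.group(1))


def merge_yearly_files(paths, out_path):
    """Concatenate yearly CSVs into one file with a leading DATAYEAR column."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = None
        for path in sorted(paths):
            year = year_from_filename(path)
            with open(path, newline="", encoding="utf-8-sig") as src:
                reader = csv.DictReader(src)
                if reader.fieldnames is None:
                    continue  # skip empty files
                if writer is None:
                    # Assumes every yearly file shares the first file's header.
                    fields = ["DATAYEAR"] + list(reader.fieldnames)
                    writer = csv.DictWriter(
                        out, fieldnames=fields, extrasaction="ignore"
                    )
                    writer.writeheader()
                for row in reader:
                    row["DATAYEAR"] = year
                    writer.writerow(row)


# Example call (paths are placeholders):
# merge_yearly_files(["MERGED1996_97_PP.csv", "MERGED1997_98_PP.csv"], "fulldata.csv")
```

Rows are streamed one at a time, so the merged file never has to fit in memory.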

This process gives me a 2 GB file that contains all of the data, called fulldata.csv. The next step is to read the data dictionary to decipher the column names and try to find what is needed for plotting.

Data Analysis

This dataset holds a wide variety of information. It has school identification data (name, ID, location, level of institution), admission data (student demographics, SAT scores), education data (majors, completion rates), financial data (cost, aid, debt, repayment), and more. To begin plotting, it is a good idea to create subsets of fulldata.csv that contain just the fields necessary. This makes data manipulation and running calculations easier and safer. Although the dataset has records for US territories as well as the 50 states, I will be filtering out the territories for the purposes of this study.
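As a rough illustration of the subsetting idea, here is a small Python sketch (not the code used for this project). The set of non-state codes and the function name are assumptions; STABBR is the state abbreviation column.

```python
import csv

# Assumed STABBR codes for territories and other non-state jurisdictions.
NON_STATE_CODES = {"AS", "FM", "GU", "MH", "MP", "PR", "PW", "VI"}


def subset_rows(rows, columns, exclude=NON_STATE_CODES):
    """Yield dicts restricted to `columns`, skipping excluded STABBR codes."""
    for row in rows:
        if row["STABBR"] in exclude:
            continue
        yield {name: row[name] for name in columns}


# Example use, streaming from the merged file (column choice is illustrative):
# with open("fulldata.csv", newline="", encoding="utf-8") as f:
#     small = list(subset_rows(csv.DictReader(f), ["INSTNM", "STABBR", "DATAYEAR"]))
```

Keeping only the needed columns makes each subset small enough to work with comfortably, even though the full file is around 2 GB.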

Admission Plot

Since the dataset contains admission records, we can compare the admission rate of a school to the average SAT score (or equivalent) of students admitted. Intuition would suggest that higher scorers are able to gain admission to more exclusive schools. By plotting these data points, we can see if that intuition is correct, close, or completely wrong.

As we can see, schools with lower admission rates generally have very high average SAT scores. The center of the distribution suggests that students with low-to-mid SAT scores can also be accepted to schools with low admission rates, possibly because of extracurriculars or personal recommendations. The top-middle of the plot is a very dense cluster, which is reasonable since students with mid-range SAT scores apply to many schools with high admission rates. This is roughly what one might expect of this data. For a closer look, we can also see which states contain the higher-performing, more exclusive schools by subsetting our dataset to records with average SAT scores > 1250 and admission rates < 0.25.
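The threshold filter could look something like the Python sketch below. It is illustrative only: the column names SAT_AVG and ADM_RATE are assumed here, and treating unparseable cells (blank or "NULL") as missing is a choice made for this sketch.

```python
def to_float(value):
    """Convert a CSV cell to float; return None for blanks, 'NULL', etc."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def is_selective(row, min_sat=1250, max_adm_rate=0.25):
    """True when a record has SAT average > min_sat and admission rate < max_adm_rate."""
    sat = to_float(row.get("SAT_AVG"))    # assumed column name
    rate = to_float(row.get("ADM_RATE"))  # assumed column name
    if sat is None or rate is None:
        return False
    return sat > min_sat and rate < max_adm_rate


# Example: selective = [row for row in rows if is_selective(row)]
```

Both comparisons are strict, matching the "> 1250" and "< 0.25" cut-offs, and records missing either value are dropped rather than guessed.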

The data is much less dense now, but some of the colours are a little too similar to each other. To know which states are shown here, we can run a summary() on the state abbreviation (STABBR).

## CA CO CT DC IL IN LA MA MD ME MN MO NC NH NJ NY PA RI TN TX VA VT 
## 86  2 16 14  9  5  1 70  5 12  2 14 14 13 14 50 30 14  6 14  7 12
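A table like this is a count of rows per state abbreviation. For readers not using summary(), a rough Python equivalent might look like the sketch below (the function name is hypothetical).

```python
from collections import Counter


def count_by_state(rows):
    """Count rows per STABBR value, returned in alphabetical order."""
    counts = Counter(row["STABBR"] for row in rows)
    return dict(sorted(counts.items()))
```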

It seems like California (Orange Red), Massachusetts (Green), and New York (Navy Blue) are some of the more recurring data points. This makes sense, since each of these states is home to multiple prestigious schools, not just one or two. The plots also increase in density as time goes on, which seems to indicate that students (and, as a result, entire schools) are becoming smarter on average. Not only that, more Orange Red, Green, and Navy Blue points show up over the years, indicating that more schools from those states are joining the upper ranks. Of course, intelligence is measured by more than just standardized testing, but this is a pretty uplifting result.

Majors Plot

This dataset also contains the percentage of degrees awarded for certain majors at each school. By taking averages, can we see if students from some states favour certain subjects? As a mathematician, I have a personal interest in seeing trends for Maths degrees. Unlike the last plot, this one will require some calculations.

After noticing that many PCIP values were recorded as “1” (meaning 100% of degrees were awarded in that major at that school in that year), it was clear that taking an average of this would cause some problems - we would end up with greater-than-100% averages in many cases. This presented issues when plotting because the data range would be inconsistent and a few outliers could ruin the plot even with lots of good data available. To account for this, I filtered the data a few different ways. I removed records where UGDS = “NULL”, meaning the enrollment size was not recorded. I also removed ICLEVEL = 3 schools, which are “less than 2-year” schools. Finally, I removed records with PREDDEG = 0, schools with “not classified” as their predominant degree type. This shrank the number of observations from 130,000 to 84,000, but the quality of the data was much better for taking averages and sums.
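The three filters can be expressed as a single predicate. The Python sketch below is only an illustration of the rules described above, assuming every value arrives as a string read from the CSV; the function name is hypothetical.

```python
def keep_for_averaging(row):
    """Apply the three cleaning rules: UGDS present, ICLEVEL != 3, PREDDEG != 0."""
    if row.get("UGDS") in (None, "", "NULL"):
        return False  # enrollment size not recorded
    if row.get("ICLEVEL") == "3":
        return False  # "less than 2-year" schools
    if row.get("PREDDEG") == "0":
        return False  # predominant degree "not classified"
    return True


# Example: cleaned = [row for row in rows if keep_for_averaging(row)]
```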

To get an average of degrees awarded for majors at the state level, I’ll be using (0.25)*UGDS, where UGDS is the undergraduate enrollment size, as an approximation for the number of students in the graduating class. Then I will multiply that number by the PCIP fields, which hold the fraction of degrees awarded in each major, to estimate the number of degrees awarded. After grouping by STATE and DATAYEAR and summing both the graduating class sizes and the degrees awarded, we can divide the second sum by the first to find a new percentage for each major: PCIP_AVG.
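The calculation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's actual code: the function and parameter names are hypothetical, and records whose UGDS or PCIP value cannot be parsed are skipped from both sums, which is a choice made for this sketch.

```python
from collections import defaultdict

GRAD_SHARE = 0.25  # assumed fraction of enrollment that graduates each year


def pcip_average(rows, pcip="PCIP27", keys=("STATE", "DATAYEAR")):
    """Return {group: estimated share of degrees awarded in the `pcip` major}."""
    grads = defaultdict(float)    # estimated graduating class size per group
    degrees = defaultdict(float)  # estimated degrees in the major per group
    for row in rows:
        try:
            class_size = GRAD_SHARE * float(row[UGDS_COLUMN])
            share = float(row[pcip])
            group = tuple(row[k] for k in keys)
        except (KeyError, TypeError, ValueError):
            continue  # skip records with missing or non-numeric values
        grads[group] += class_size
        degrees[group] += class_size * share
    return {g: degrees[g] / grads[g] for g in grads if grads[g] > 0}


UGDS_COLUMN = "UGDS"  # undergraduate enrollment size
```

Because the 0.25 factor appears in both sums, it cancels in the ratio, so the result is an enrollment-weighted mean of the PCIP values. The same function can be pointed at another PCIP column, or grouped by DATAYEAR alone, by changing the arguments.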

Now we are able to see, approximately, how many students from the collective graduating class of a state get a degree in mathematics. To my disappointment, Maths representation is so low that the Y-scale only goes from 0% to 1.5%. Still, we can see where it tends to be a more popular degree. Places like California, Connecticut, Massachusetts, and Pennsylvania are pretty consistently higher than the other states, while Florida and Arizona sit near the bottom. Vermont has reached the highest, though, and it has been steadily growing (like Rhode Island). Other states like Wyoming, Arkansas, and Montana have had a lot of variation over the years.

For reference, we can take a look at the national averages across the years for Maths degrees. To make the comparison easier, we should give this plot the same scale as the plot above, so we can see visually which states are above or below the national average.

The national averages range from 0.53% to 0.70%, so any states that consistently award more than ~0.62% could be considered above average. The states we identified earlier - California, Connecticut, Massachusetts, and Pennsylvania - are definitely above average. From this perspective, we can also see that Colorado is slightly above average as well. A lot of states sit near the average, which makes sense, and the states we noticed were low - Arizona and Florida - are well below average.

It is also very easy to recreate this plot for another major. To check trends for History degrees, PCIP27 just needs to be changed to PCIP54.

Immediately, this plot looks quite different. The automatic Y-scale goes up to 3% in this plot, meaning the overall distribution is higher than in the Maths plot. Massachusetts, Vermont, and Connecticut still seem to be higher than others in this plot, and Florida and Arizona are near the bottom again. Let’s take a look at the national averages for History too, making sure to match the scale.

These averages range from 0.96% to 1.25%, so the states that consistently award more than ~1.11% could be considered above average. California does not stick out as much this time around, but Connecticut and Massachusetts definitely do. Even Alaska, Maine, New Hampshire, and Vermont are above average in History. Again, Arizona and Florida are well below average, joined this time by North Dakota, New Mexico, and Nevada.

Does the overall higher distribution mean History is more popular than Maths? It’s possible, but we have to remember that we did use an approximation to plot this.

Conclusion

After preparing the data and trying a few initial plots, my approach feels good but has some room for improvement. The plots look nice and are legible, but it is possible to draw the wrong conclusions due to approximations. I could also clean up sections of my code, which are rather inefficient and not that newcomer-friendly if I were to try to collaborate. I would eventually like to be able to hover over points to see plot details, and adding some general interactivity would make this dataset much more useful. I will be looking into Shiny and how to incorporate that into my next projects. I also want to expand my research into other areas like cost, debt, and repayment. Since my goal is to ultimately have some sort of web service or app that helps in giving school suggestions, the visualizations I provide will have to present the information in such a way that a layperson will be able to make reasonable conclusions from what they see. This means cleanliness of the data and clarity in the plotting are a must. For those interested in seeing my process in detail and the code behind the plots, please refer to my appendix for more information.