Background
As a recent Grinnell College alum, I’ve closely observed and been impacted by significant shifts in the academic landscape. When I graduated, the acceptance rate at Grinnell had plummeted by 15% from the time I entered, paralleled by a sharp rise in tuition fees. This pattern wasn’t unique to my alma mater; friends from various colleges echoed similar experiences.
This got me thinking: Is this a widespread trend across U.S. colleges? My theory was twofold. First, the advent of online applications might have made it easier to apply to multiple colleges, thereby increasing the applicant pool and reducing acceptance rates. Second, an article from the Migration Policy Institute highlighted a doubling in the number of international students in the U.S. from 2000 to 2020 (from 500k to 1 million), potentially intensifying competition. I was also curious about tuition fee trends from 2001 to 2022. My aim here is to unravel these patterns through data visualization. For the following analysis, all images, unless otherwise noted, are by the author!
Dataset
The dataset I used covers U.S. colleges from 2001 to 2022, including institution type, yearly acceptance rates, state location, and tuition fees. Sourced from the College Scorecard, the original dataset was vast, with over 3,000 columns and 10,000 rows. I selected the pertinent columns for a focused analysis, resulting in a refined dataset available on Kaggle. To ensure relevance and completeness, I concentrated on 4-year colleges featured in the U.S. News college rankings.
Change in Acceptance Rates Over the Years
Let’s dive into the evolution of college acceptance rates over the past two decades. Initially, I suspected that I would observe a steady decline. Figure 1 illustrates this trajectory from 2001 to 2022. A consistent drop is evident until 2008, followed by fluctuations leading up to a notable increase around 2020–2021, likely a repercussion of the COVID-19 pandemic influencing gap year decisions and enrollment strategies.
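import pandas as pd
import matplotlib.pyplot as plt

# df_ranked: the ranked-college subset of the dataset, loaded beforehand
avg_acp_ranked = df_ranked.groupby("year")["ADM_RATE_ALL"].mean().reset_index()
plt.figure(figsize=(10, 6)) # Set the figure size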
plt.plot(avg_acp_ranked['year'], avg_acp_ranked['ADM_RATE_ALL'], marker='o', linestyle='-', color='b', label='Acceptance Rate')
plt.title('Average Acceptance Rate Over the Years') # Set the title
plt.xlabel('Year') # Label for the x-axis
plt.ylabel('Average Acceptance Rate') # Label for the y-axis
plt.grid(True) # Show grid
# Show a legend
plt.legend()
# Display the plot
plt.show()
However, the overall drop wasn’t as steep as my experience at Grinnell suggested. In contrast, when we zoom into the acceptance rates of more prestigious universities (Figure 2), a steady decline becomes apparent. This led me to categorize colleges into three groups based on their 2001 admission rates (the most competitive 10%, the rest of the top 50%, and all others) and analyze the trends within these segments.
pres_colleges = ["Princeton University", "Massachusetts Institute of Technology", "Yale University", "Harvard University", "Stanford University"]
pres_df = df[df['INSTNM'].isin(pres_colleges)]
pivot_pres = pres_df.pivot_table(index="INSTNM", columns="year", values="ADM_RATE_ALL")
pivot_pres.T.plot(linestyle='-')
plt.title('Change in Acceptance Rate Over the Years')
plt.xlabel('Year')
plt.ylabel('Acceptance Rate')
plt.legend(title='Colleges')
plt.show()
Figure 3 unveils some surprising insights. Except for the least competitive 50%, colleges have generally seen an increase in acceptance rates since 2001. The fluctuations post-2008 across all but the top 10% of colleges could be attributed to economic factors like the recession. Notably, competitive colleges didn’t experience the pandemic-induced spike in acceptance rates seen elsewhere.
top_10_threshold_ranked = df_ranked[df_ranked["year"] == 2001]["ADM_RATE_ALL"].quantile(0.1)
top_50_threshold_ranked = df_ranked[df_ranked["year"] == 2001]["ADM_RATE_ALL"].quantile(0.5)
top_10 = df_ranked[(df_ranked["year"]==2001) & (df_ranked["ADM_RATE_ALL"] <= top_10_threshold_ranked)]["UNITID"]
top_50 = df_ranked[(df_ranked["year"]==2001) & (df_ranked["ADM_RATE_ALL"] > top_10_threshold_ranked) & (df_ranked["ADM_RATE_ALL"] <= top_50_threshold_ranked)]["UNITID"]
others = df_ranked[(df_ranked["year"]==2001) & (df_ranked["ADM_RATE_ALL"] > top_50_threshold_ranked)]["UNITID"]
top_10_df = df_ranked[df_ranked["UNITID"].isin(top_10)]
top50_df = df_ranked[df_ranked["UNITID"].isin(top_50)]
others_df = df_ranked[df_ranked["UNITID"].isin(others)]
avg_acp_top10 = top_10_df.groupby("year")["ADM_RATE_ALL"].mean().reset_index()
avg_acp_others = others_df.groupby("year")["ADM_RATE_ALL"].mean().reset_index()
avg_acp_top50 = top50_df.groupby("year")["ADM_RATE_ALL"].mean().reset_index()
plt.figure(figsize=(10, 6)) # Set the figure size
plt.plot(avg_acp_top10['year'], avg_acp_top10['ADM_RATE_ALL'], marker='o', linestyle='-', color='g', label='Top 10%')
plt.plot(avg_acp_top50['year'], avg_acp_top50['ADM_RATE_ALL'], marker='o', linestyle='-', color='b', label='Top 50%')
plt.plot(avg_acp_others['year'], avg_acp_others['ADM_RATE_ALL'], marker='o', linestyle='-', color='r', label='Others')
plt.title('Average Acceptance Rate Over the Years') # Set the title
plt.xlabel('Year') # Label for the x-axis
plt.ylabel('Average Acceptance Rate') # Label for the y-axis
# Show a legend
plt.legend()
# Display the plot
plt.show()
One finding particularly intrigued me: when considering the top 10% of colleges, their acceptance rates hadn’t decreased notably over the years. This led me to question whether the shift in competitiveness was widespread or if it was a case of some colleges becoming significantly harder or easier to get into. The steady decrease in acceptance rates at prestigious institutions (shown in Figure 2) hinted at the latter.
To get a clearer picture, I visualized the changes in college competitiveness from 2001 to 2022. Figure 4 reveals a surprising trend: about half of the colleges actually became less competitive, contrary to my initial expectations.
pivot_pres_ranked = df_ranked.pivot_table(index="INSTNM", columns="year", values="ADM_RATE_ALL")
pivot_pres_ranked_down = pivot_pres_ranked[pivot_pres_ranked[2001] >= pivot_pres_ranked[2022]]
len(pivot_pres_ranked_down)
pivot_pres_ranked_up = pivot_pres_ranked[pivot_pres_ranked[2001] < pivot_pres_ranked[2022]]
len(pivot_pres_ranked_up)
categories = ["Up", "Down"]
values = [len(pivot_pres_ranked_up), len(pivot_pres_ranked_down)]
plt.figure(figsize=(8, 6))
plt.bar(categories, values, width=0.4, align='center', color=["blue", "red"])
plt.xlabel('Change in acceptance rate')
plt.ylabel('# of colleges')
plt.title('Change in acceptance rate from 2001 to 2022')
# Show the chart
plt.tight_layout()
plt.show()
This prompted me to explore possible factors influencing these shifts. My hypothesis, reinforced by Figure 2, was that already selective colleges became even more so over time. Figure 5 compares acceptance rates in 2001 and 2022.
The 45-degree line delineates colleges that became more or less competitive. Those below the line saw reduced acceptance rates. A noticeable cluster in the lower-left quadrant represents selective colleges that became increasingly exclusive. This trend is underscored by the observation that colleges with initially low acceptance rates (left side of the plot) tend to fall below this dividing line, while those on the right are more evenly distributed.
Furthermore, it’s interesting to note that since 2001, the most selective colleges are predominantly private. To test whether the changes in acceptance rates differed significantly between the top and bottom 50 percentile colleges, I conducted an independent t-test (Null hypothesis: θ_top = θ_bottom). The results showed a statistically significant difference.
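The snippet below is only a rough sketch of how such a test could be set up, not necessarily the exact procedure behind the result above. It assumes SciPy's `ttest_ind`, defines each college's change as its 2022 rate minus its 2001 rate, and splits colleges at the median 2001 rate. The `rates` frame holds synthetic numbers for illustration; with the real data, `pivot_pres_ranked[[2001, 2022]]` would take its place. Welch's variant (`equal_var=False`) is an assumption on my part.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for pivot_pres_ranked[[2001, 2022]] (made-up numbers).
rng = np.random.default_rng(0)
rate_2001 = rng.uniform(0.1, 0.95, size=200)
# Assumption for the demo: selective colleges drift down, the rest barely move.
shift = np.where(rate_2001 < np.median(rate_2001), -0.10, 0.0)
rate_2022 = np.clip(rate_2001 + shift + rng.normal(0, 0.05, size=200), 0.01, 1.0)
rates = pd.DataFrame({2001: rate_2001, 2022: rate_2022}).dropna()

# Change in acceptance rate per college (negative = became more selective).
change = rates[2022] - rates[2001]

# Split at the median 2001 rate: "top" = the more selective half.
is_top = rates[2001] <= rates[2001].median()

# Welch's independent two-sample t-test on the changes of the two halves.
t_stat, p_value = stats.ttest_ind(change[is_top], change[~is_top], equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value here only says the average change differs between the two halves; it says nothing about why.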
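import numpy as np
import seaborn as sns
from matplotlib.patches import Ellipse

# Attach each college's attributes to its 2001 and 2022 rates; keep one row per
# college so every point is drawn once.
college_info = df_ranked[["REGION", "INSTNM", "UNIVERSITY", "CONTROL"]].drop_duplicates("INSTNM")
pivot_region = pd.merge(pivot_pres_ranked[[2001, 2022]], college_info, on="INSTNM", how="right")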
plt.figure(figsize=(8, 8))
sns.scatterplot(data=pivot_region, x=2001, y=2022, hue='CONTROL', palette='Set1', legend='full')
plt.xlabel('Acceptance rate for 2001')
plt.ylabel('Acceptance rate for 2022')
plt.title('Change in acceptance rate')
x_line = np.linspace(0, pivot_region[2001].max(), 100) # X-values for the line (.max() skips missing rates)
y_line = x_line # Y-values for the line (slope = 1)
plt.plot(x_line, y_line, label='45-Degree Line', color='black', linestyle='--')
# Define ellipse parameters (center, width, height, angle)
ellipse_center = (0.25, 0.1) # Center of the ellipse
ellipse_width = 0.4 # Width of the ellipse
ellipse_height = 0.2 # Height of the ellipse
ellipse_angle = 45 # Rotation angle in degrees
# Create an Ellipse patch
ellipse = Ellipse(
    xy=ellipse_center,
    width=ellipse_width,
    height=ellipse_height,
    angle=ellipse_angle,
    edgecolor='b', # Edge color of the ellipse
    facecolor='none', # No fill color (transparent)
    linewidth=2 # Line width of the ellipse border
)
# Add the ellipse to the current axes
plt.gca().add_patch(ellipse)
plt.legend()
plt.gca().set_aspect('equal')
plt.show()
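One way to put a number on the visual impression that initially selective colleges tend to fall below the 45-degree line is to bucket colleges by their 2001 rate and compute, per bucket, the share whose 2022 rate is lower. The following is a minimal sketch on made-up data: `toy` is a hypothetical stand-in for `pivot_pres_ranked[[2001, 2022]]`, and the bin edges are arbitrary choices.

```python
import pandas as pd

# Made-up rates for illustration only; not taken from the dataset.
toy = pd.DataFrame(
    {
        2001: [0.15, 0.20, 0.30, 0.45, 0.60, 0.70, 0.80, 0.90],
        2022: [0.05, 0.10, 0.35, 0.30, 0.65, 0.55, 0.85, 0.80],
    }
)

# A college is below the 45-degree line when its 2022 rate is lower than its 2001 rate.
toy["below_line"] = toy[2022] < toy[2001]

# Bucket colleges by 2001 selectivity (bin edges are an arbitrary choice).
toy["bucket_2001"] = pd.cut(toy[2001], bins=[0, 0.25, 0.5, 0.75, 1.0])

# Share of colleges below the line within each bucket.
share_below = toy.groupby("bucket_2001", observed=True)["below_line"].mean()
print(share_below)
```

In this toy data every college in the most selective bucket sits below the line, while the other buckets are split evenly, which is the pattern the scatter plot suggests.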
Another aspect that piqued my curiosity was regional differences. Figure 6 lists the top 5 colleges with the most significant decrease in acceptance rates (calculated by dividing the 2001 acceptance rate by the 2022 rate, so a larger ratio means a steeper drop).
It was astonishing to see how high the acceptance rate for the University of Chicago was two decades ago — half of the applicants were admitted then!
This also helped me understand my initial bias towards a general decrease in acceptance rates; notably, Grinnell College, my alma mater, is among these top 5 with a significant drop in acceptance rate.
Interestingly, three of the top five colleges are located in the Midwest. My theory is that with the advent of the internet, these institutions, not as historically renowned as those on the West and East Coasts, have gained more visibility both domestically and internationally.
pivot_pres_ranked["diff"] = pivot_pres_ranked[2001] / pivot_pres_ranked[2022]
tmp = pivot_pres_ranked.reset_index()
tmp = tmp.merge(df_ranked[df_ranked["year"]==2022][["INSTNM", "STABBR", "CITY"]],on="INSTNM")
tmp.sort_values(by="diff",ascending=False)[["INSTNM", "diff", "STABBR", "CITY"]].head(5)
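To make the ratio concrete: a value of 5 means the 2001 acceptance rate was five times the 2022 rate. The sketch below uses hypothetical colleges and rates (none taken from the dataset) and also computes the absolute drop in percentage points, an alternative measure that can rank colleges differently.

```python
import pandas as pd

# Hypothetical colleges and rates, for illustration only.
toy = pd.DataFrame(
    {2001: [0.50, 0.80, 0.20], 2022: [0.10, 0.40, 0.05]},
    index=["College A", "College B", "College C"],
)

# Same ratio as in the code above: how many times higher the 2001 rate was.
toy["ratio"] = toy[2001] / toy[2022]
# Absolute drop in percentage points, as an alternative measure.
toy["drop_pp"] = (toy[2001] - toy[2022]) * 100

ranked_by_ratio = toy.sort_values("ratio", ascending=False)
print(ranked_by_ratio)
```

Here College C ranks second by ratio even though its absolute drop is the smallest of the three, so the ratio favors colleges that were already selective.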