I would like to acknowledge FREOPP’s Preston Cooper for producing this article, and to cite his article’s table, which I used extensively for the income projections of college degrees. I would implore any user of this website to read his article: it is extraordinarily insightful and was my blueprint for this project. I would also like to thank Douglas A. Webber, the first author to piece everything together, who laid out the entire methodology in his paper on the Lifetime Earnings Premia of Different Majors here
Secondly, I would like to cite the American Community Survey by the Census Bureau for information needed to create the dataset of 3.3 million Americans who have only high school degrees.
Lastly, I would like to cite the National Longitudinal Survey of Youth 1997 by the US Bureau of Labor Statistics, which I used to calculate a per-user adjustment factor that controls for a multitude of variables that heavily impact lifetime income.
The method was heavily inspired by Preston Cooper's paper here and by another paper by Douglas A. Webber here. I reckon the main differences are a newer dataset, different variables, and the interactive component.
It is very difficult to figure out the value of a college degree because those who go to college differ so much from those who do not, so a raw comparison of incomes mixes the effect of the degree with the effect of who chooses to get one. For example, most people who are capable of becoming astronauts do not end up as plumbers, but what if someone with the potential to be an astronaut became a plumber? Would they be one of the best plumbers to ever exist?
The Data :
One of the preconditions of accepting federal student loans or other federal student aid is that the US government ends up tracking future income. So there is a vast amount of publicly available data on student outcomes, called the College Scorecard, which covers the first 5 years of income after graduation. The second source is the American Community Survey (ACS): the US government keeps extremely detailed information, such as education, employment, and race, on the roughly 1% of the population that ends up being surveyed by the Census Bureau. The third source is the NLSY97, a long-running survey that has followed about 8,000 American youths from 1997 to now.
Income Projection :
From there we can extrapolate the income of graduates by combining the College Scorecard with the ACS, under the assumption that the income of graduates of a specific college and major can be tied to the general income of Americans in that field (the College Scorecard only accounts for the first 5 years of income). For example, Computer Scientists from Boston University might graduate at the 90th percentile of all Computer Scientists with a bachelor’s degree in the US, and we account for a bit of regression to the mean. This part simply uses data from existing datasets, as we lack enough data to estimate the granular details of how income projections tie to GPA or class rank at each college.
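As a rough illustration of the idea above, here is a minimal sketch in Python. It assumes the projection works by finding a graduate's percentile among early-career earners in the field, pulling that percentile part of the way toward the median, and reading off the income at that percentile among later-career earners. The function names, the `shrink` parameter, and every number are made up for illustration; this is not the exact code or data behind the calculator.

```python
import math
from bisect import bisect_right

def percentile_rank(value, sorted_incomes):
    """Share of the reference population earning at or below `value` (0 to 1)."""
    return bisect_right(sorted_incomes, value) / len(sorted_incomes)

def shrink_toward_median(pct, shrink):
    """Pull a percentile toward 0.5. shrink=0 keeps it; shrink=1 returns 0.5."""
    return 0.5 + (1.0 - shrink) * (pct - 0.5)

def income_at_percentile(pct, sorted_incomes):
    """Nearest-rank lookup of the income at percentile `pct`."""
    rank = math.ceil(pct * len(sorted_incomes))
    index = min(len(sorted_incomes) - 1, max(0, rank - 1))
    return sorted_incomes[index]

def project_income(early_income, early_field, later_field, shrink):
    """Map an early-career income to a later-career income in the same field."""
    pct = percentile_rank(early_income, early_field)
    return income_at_percentile(shrink_toward_median(pct, shrink), later_field)

# Made-up incomes for one field, sorted ascending (illustration only).
early_field = [30_000, 35_000, 40_000, 45_000, 50_000,
               55_000, 60_000, 70_000, 80_000, 100_000]
later_field = [40_000, 48_000, 55_000, 62_000, 70_000,
               78_000, 88_000, 100_000, 120_000, 160_000]

# A program whose graduates earn 80,000 early on sits at the 90th percentile.
projected = project_income(80_000, early_field, later_field, shrink=0.2)
print(projected)
```

A larger `shrink` assumes more regression to the mean, so a program's early advantage (or disadvantage) fades faster in the projection.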
Counter-Factual Income :
We use the ACS to create the income timeline the individual would have had if they had not gone to college. We train a regression with 2nd-degree interaction terms on the existing ACS dataset (roughly 3 million rows) to estimate their future income without college: log(Income) ~ Profession + State + Gender + Race + Age + 2nd-degree interactions, which account for motherhood and other factors.
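To make the regression above more concrete, here is a small self-contained sketch in Python of a log-income regression with 2nd-degree interaction terms. It is a toy under stated assumptions: it uses only two numeric features (age in decades past 18 and a female indicator) instead of the full Profession + State + Gender + Race + Age set, it fits by plain ordinary least squares, and the data are synthetic and noise-free. The helper names and coefficients are hypothetical.

```python
import math
from itertools import combinations

def with_interactions(features):
    """Return [1, x_1..x_k] plus every pairwise product x_i * x_j for i < j."""
    pairs = [a * b for a, b in combinations(features, 2)]
    return [1.0] + list(features) + pairs

def solve(a, b):
    """Solve a @ beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up coefficients: intercept, age, female, age x female (illustration only).
true_beta = [10.0, 0.2, -0.1, -0.05]

# Synthetic people: (age in decades past 18, female indicator).
people = [(age, female) for female in (0.0, 1.0) for age in (0.0, 1.0, 2.0)]
rows = [with_interactions(p) for p in people]
log_income = [sum(b * v for b, v in zip(true_beta, r)) for r in rows]

beta = fit_ols(rows, log_income)

# Predicted counter-factual income for a hypothetical woman aged 33.
new_row = with_interactions((1.5, 1.0))
predicted_income = math.exp(sum(b * v for b, v in zip(beta, new_row)))
print(round(predicted_income))
```

The interaction term lets the age slope differ by group, which is how a model of this shape can pick up effects such as motherhood that a purely additive model would miss. A fit on the real ACS data would need one-hot encoded categories and a proper statistics library rather than this hand-rolled solver.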
Controlling for the Confounding Variables :
We then essentially run two regressions. The first is a simple regression, log(Income) ~ Profession + State + Gender + Race + Age, and the second is log(Income) ~ Profession + State + Gender + Race + Age + the massive questionnaire. The small regression serves as a baseline central estimate that uses only the features that also exist in the ACS (Profession + State + Gender + Race + Age), so it is the denominator. The more complicated regression is the numerator: dividing its prediction by the baseline shows how much the individual over- or underperforms the average person within their cohort once we account for their individualized features. That ratio is then applied to the counter-factual income, so that we get a modified counter-factual income that accounts for more detailed personality traits.
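The ratio step can be sketched in a few lines of Python. Because both regressions predict log(Income), the ratio of predicted incomes is the exponential of the difference between the two predictions. The coefficients, the `test_score_z` questionnaire item, and the income figure below are hypothetical placeholders, not values from the actual models.

```python
import math

def predict_log_income(coefs, features):
    """Linear predictor: intercept plus the sum of coefficient * feature value."""
    return coefs["intercept"] + sum(
        coefs[name] * value for name, value in features.items()
    )

def adjustment_ratio(base_coefs, full_coefs, base_features, full_features):
    """Full-model predicted income divided by baseline-model predicted income."""
    base = predict_log_income(base_coefs, base_features)
    full = predict_log_income(full_coefs, full_features)
    return math.exp(full - base)

# Hypothetical fitted coefficients (illustration only).
base_coefs = {"intercept": 10.0, "age": 0.02}
full_coefs = {"intercept": 9.9, "age": 0.02, "test_score_z": 0.1}

# A hypothetical user: 10 years past age 18, two standard deviations
# above average on a made-up questionnaire item.
base_features = {"age": 10.0}
full_features = {"age": 10.0, "test_score_z": 2.0}

ratio = adjustment_ratio(base_coefs, full_coefs, base_features, full_features)

# Apply the ratio to a made-up ACS-based counter-factual income.
counterfactual_income = 40_000.0
adjusted_counterfactual = counterfactual_income * ratio
print(round(ratio, 3), round(adjusted_counterfactual))
```

A ratio above 1 means the questionnaire answers suggest the user would have out-earned their cohort even without college, which raises the counter-factual income and so lowers the estimated payoff of the degree.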
Then we simply account for cost of capital, total debt, payback rate, etc. to create a calculator.
I totally understand that this relies on many assumptions, such as the sample being representative of the population, and has potential problems with overfitting.
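One way such a calculator could combine these pieces is a net present value of the yearly income gap minus loan payments. The Python sketch below assumes level annual loan payments that start right after graduation, no interest accruing during school, and a single discount rate standing in for the cost of capital. All names and numbers are illustrative assumptions rather than the site's actual formula.

```python
def net_present_value(cash_flows, discount_rate):
    """Discount yearly cash flows back to year 0 (first flow is undiscounted)."""
    return sum(cf / (1.0 + discount_rate) ** t for t, cf in enumerate(cash_flows))

def loan_payment(principal, annual_rate, years):
    """Level annual payment that fully repays the loan over `years`."""
    if annual_rate == 0:
        return principal / years
    return principal * annual_rate / (1.0 - (1.0 + annual_rate) ** -years)

def degree_npv(degree_income, counterfactual_income, total_debt,
               school_years, loan_rate, payback_years, cost_of_capital):
    """NPV of (income with degree - income without) minus loan payments."""
    payment = loan_payment(total_debt, loan_rate, payback_years)
    flows = []
    for year, (with_degree, without) in enumerate(
        zip(degree_income, counterfactual_income)
    ):
        repaying = school_years <= year < school_years + payback_years
        flows.append(with_degree - without - (payment if repaying else 0.0))
    return net_present_value(flows, cost_of_capital)

# Toy example in thousands of dollars: 2 years of school, 2 years of work.
npv = degree_npv(
    degree_income=[0, 0, 60, 60],
    counterfactual_income=[30, 30, 30, 30],
    total_debt=20,
    school_years=2,
    loan_rate=0.0,
    payback_years=2,
    cost_of_capital=0.0,
)
print(npv)
```

In this toy case the degree has not yet paid off after four years, because two years of forgone income and the debt outweigh two years of higher pay. Extending the income lists over a full career is what lets the premium accumulate.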
Made by Howell Lu