Personal Finance With Data Science: Part 1

Personal finance often feels hopelessly stuck between the personal and the financial. Take buying a house, for example. From a strictly financial point of view, the right answer is to maximize return: make as much money from the house as possible for the least investment. From a strictly personal point of view, most people want to live in a part of town they like, where they can see their kids playing in the front yard and have their mother-in-law visit, rate of return be damned. The real question most consumers are asking when buying a house is usually “Can I buy the house I want without having to sacrifice too much?” In other words, can you have the house of your dreams while still being able to retire and not worry about money every month? Answering this question involves blending the financial and the personal.

Fortunately, the analytics brand of data science is especially well suited to these nebulous types of problems. Analytics teams usually answer questions like “Which home page maximizes signups while staying true to our brand?” that require a decent amount of number crunching but a lot of bigger-picture thinking too. This partially explains the explosion in popularity of Bayesian methods: this brand of statistics lets you factor in your own subjective assessments and account for the uncertainty inherent in forecasting anything.

If we want to figure out how signing on for a 30-year mortgage instead of a 15-year one, contributing more to your 401k, rage-quitting your job, etc. will impact your life, we first need to get the basics down: how much money can you expect to earn, and how much do you expect to spend? We’ll start simple, then fold in life events and decisions in subsequent posts. Let’s start with figuring out income.

Income

Sally needs an estimate of how much money she is likely to make until she retires. One approach would be to assume that each year’s raise is drawn from a normal distribution centered on some mean wage increase, say 2%, that doesn’t change with age. (I am going to ignore inflation because these numbers are already relative; we’ll fold it in at the last possible moment.)

normalPriorRaise.jpg
cumulativeNormalRaises.jpg
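To make the constant-raise model concrete, here is a minimal sketch of it using only the Python standard library. The 2% mean raise comes from the text above; the $60,000 starting salary, the 2% standard deviation, the 30-year horizon and the function names are all assumptions picked for illustration, not values from the post's own analysis.

```python
import random
import statistics


def simulate_salary_paths(start_salary, years, mean_raise=0.02, sd_raise=0.02,
                          n_paths=2000, seed=0):
    """Return n_paths salary trajectories, each of length years + 1.

    Every year the salary is multiplied by (1 + raise), where the raise is
    drawn from Normal(mean_raise, sd_raise) regardless of age.
    """
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        salary = float(start_salary)
        path = [salary]
        for _ in range(years):
            salary *= 1 + rng.gauss(mean_raise, sd_raise)
            path.append(salary)
        paths.append(path)
    return paths


def spread_by_year(paths):
    """Standard deviation of salary across all paths, one value per year."""
    return [statistics.pstdev(year_values) for year_values in zip(*paths)]


# Assumed inputs: a $60,000 starting salary and 30 years until retirement.
paths = simulate_salary_paths(60_000, 30)
spread = spread_by_year(paths)
median_final = statistics.median(path[-1] for path in paths)
print(f"median salary at retirement: {median_final:,.0f}")
print(f"spread after 1 year: {spread[1]:,.0f}, after 30 years: {spread[-1]:,.0f}")
```

Because each raise multiplies the previous salary, the spread across simulated careers keeps widening as the years go by.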

This obviously seems overly aggressive: raises keep compounding at the same rate right up to retirement. As workers approach middle age, it makes more sense to forgo some salary for stability and security. Also, your skills don’t improve as quickly as they once did. To better fit income, we need some empirical evidence of what the trajectory for white-collar workers looks like over a career.

lifetimeSalary.jpg

These data from PayScale show the general trend of incomes for college-educated professionals as they age. They are missing sample sizes, variance, and factors like family size, race, industry, and geographic location, but they give us a pretty solid idea. If we center our distribution from earlier on the observed values from this data set over time, we get a very reasonable-looking trend that gets less certain the further out we go.

Expenses

Expenses are harder. Not only do they vary much more than one’s salary, but we also tend to spend money irrationally, and life events like having children or a medical emergency can drastically alter expenses for years to come. Moreover, the Internet is not a ton of help here. The Bureau of Labor Statistics has super comprehensive data, but it’s difficult to slice for our purposes: it has data by income and by age, but not both. This is especially troubling since the data are reported as means instead of medians and the numbers are nationwide, so everything from Nob Hill, San Francisco to Twin Falls, Idaho falls under the same number. Also (and this is a nit) the format is insanely difficult to read, let alone parse. I’ve made a more sensible version of the previously linked data here.

categoricalExpenses.jpg

It’s not all bad: the best part about these data is that they’re broken out by category. Later, when we go to evaluate mortgage payments or buying a car, we can adjust this number much more accurately. With some handy data transformations (see more in this notebook) we get a rate of change for expenses that will do for now. Notice the error doesn’t balloon like it did for income. This makes sense because expenses don’t compound like income does (no one negotiates a 3% increase in spending at restaurants for the next fiscal year).
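The difference between compounding and non-compounding error can be illustrated with a small simulation. This is only a sketch under assumed numbers (2% growth, 2% noise, 30 years, levels normalized to 1.0), and treating each year's expenses as a fixed trend plus fresh noise is my simplification of the idea, not necessarily what the notebook does.

```python
import random
import statistics


def relative_spread_by_year(years=30, n_paths=2000, growth=0.02,
                            noise_sd=0.02, seed=1):
    """Compare year-by-year relative spread (std / mean) for two models.

    compounding: each year's level is last year's level times (1 + random growth).
    independent: each year's level is a fixed trend times fresh noise, so an
                 unusual year does not carry over into the next one.
    """
    rng = random.Random(seed)
    compounding = [[] for _ in range(years)]
    independent = [[] for _ in range(years)]
    for _ in range(n_paths):
        level = 1.0
        for year in range(years):
            level *= 1 + rng.gauss(growth, noise_sd)
            compounding[year].append(level)
            trend = (1 + growth) ** (year + 1)
            independent[year].append(trend * (1 + rng.gauss(0, noise_sd)))

    def cv(values):
        # Coefficient of variation: spread relative to the typical level.
        return statistics.pstdev(values) / statistics.mean(values)

    return [cv(v) for v in compounding], [cv(v) for v in independent]


income_like, expense_like = relative_spread_by_year()
print(f"compounding: year 1 {income_like[0]:.3f}, year 30 {income_like[-1]:.3f}")
print(f"independent: year 1 {expense_like[0]:.3f}, year 30 {expense_like[-1]:.3f}")
```

In the compounding series every surprise is baked into all later years, so the relative spread grows over time. In the independent series it stays roughly flat.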

Taxes and Income Withholding

taxRate.jpg

Taxes seem easy given that the IRS provides a table with exactly what you owe, but forecasting this rate is a nightmare. The tax code changes constantly and varies by locality, state, marital status, etc. This is supposed to be an estimate, not a tax policy paper, so I’m just going to assume 25%. This obviously changes a ton if income is drastically higher or lower, but for most middle-class people, lumping in Social Security, income tax, Medicare, etc., this will have to do.

I’m also removing 401k contributions. This is technically your money that you’re saving, but in reality it’s going somewhere you can’t easily access until retirement, so for things like deciding whether buying a car will wreck your finances, it’s effectively gone.
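As a rough sketch, the withholding step boils down to one line of arithmetic. The 25% flat rate is the assumption stated above. The 6% contribution rate, the $80,000 salary and the function name are hypothetical, and taking both as simple fractions of gross pay ignores how 401k contributions actually interact with the different taxes.

```python
def spendable_income(gross, tax_rate=0.25, retirement_rate=0.06):
    """Income left after a flat tax and 401k contributions, both taken from gross.

    retirement_rate is a hypothetical default; the post does not state one.
    """
    if not 0 <= tax_rate + retirement_rate <= 1:
        raise ValueError("tax_rate + retirement_rate must be between 0 and 1")
    return gross * (1 - tax_rate - retirement_rate)


# Illustrative only: an $80,000 salary with the assumed rates above.
take_home = spendable_income(80_000)
print(f"spendable income: {take_home:,.0f}")
```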

A Baseline for Household Finances

savings.jpg

If we put all these together we get something like this. This seems like way too much money being saved. But after reading articles like this and this, I’m thinking maybe not. Moreover, these leftover dollars have to pull a lot of weight: saving for college, retirement, etc. In future posts we’ll take a look at how they’re impacted by emergencies, major purchases, life events and job changes.
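Putting the pieces together is a per-year subtraction followed by a running total. The sketch below assumes you already have projected income and expenses for each year. The two-year inputs and the 5% contribution rate are made up for illustration, and the function name is hypothetical.

```python
from itertools import accumulate


def baseline_savings(incomes, expenses, tax_rate=0.25, retirement_rate=0.05):
    """Return (yearly leftover, cumulative leftover) for matched yearly series.

    Leftover is gross income minus flat taxes, minus 401k contributions,
    minus expenses. retirement_rate is an assumed value.
    """
    if len(incomes) != len(expenses):
        raise ValueError("incomes and expenses must cover the same years")
    keep = 1 - tax_rate - retirement_rate
    leftover = [income * keep - expense
                for income, expense in zip(incomes, expenses)]
    return leftover, list(accumulate(leftover))


# Made-up two-year example; not taken from the post's data.
leftover, running_total = baseline_savings(
    incomes=[100_000, 102_000], expenses=[50_000, 51_000]
)
for year, (saved, total) in enumerate(zip(leftover, running_total), start=1):
    print(f"year {year}: saved {saved:,.0f}, cumulative {total:,.0f}")
```

Swapping the fixed lists for simulated income and expense paths would give a distribution of savings rather than a single line.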

For Karl Popper fans and actual statisticians, a lot of what was done here is heretical. Below is a partial list of assumptions made during this process.

  • No unemployment

  • No major changes to lifestyle or spending

  • Income data are for college-educated professionals and are presumably based on historical observations. We’re assuming this trend will hold for another generation

  • Expenses are for nationwide averages, most likely heavily influenced by higher earners and those with more stable lifestyles

  • We just made up the taxes

  • We assume some pretty questionable standard deviations for these distributions. Moreover, we assume a constant standard deviation across years, which is also questionable.

What’s the point of doing this with so much wrong? What we’re really doing here is taking the subjective assessments we’d make absent data and projecting them forward. This makes them easier to visualize and makes their consequences easier to assess.