Getting Started with Prediction

From historians to financial analysts, researchers of all stripes are interested in prediction. Prediction asks the question, “given what I know so far, what do I expect will come next?” In the current political season, presidential election forecasts abound. Such forecasting dates back to the work of Ray Fair, whose book is ridiculously cheap on Amazon. In today’s post, I will give an example of a much more basic–and, hopefully, more relatable–question: given the height of a father, how do we predict the height of his son?

To see how common predictions about children’s traits are, just Google “predict child appearance” and you will be treated to a plethora of websites and iPhone apps with photo uploads. Today’s example is more basic and will follow three questions that we should ask ourselves before making any prediction:

1. How different is the predictor from its baseline?
It’s not enough to just have a single bit of information from which to predict–we need to know something about the baseline of the information we are interested in (often the average value) and how different the predictor we are using is from that baseline. The “predictor” in this case will refer to the height of the father, which we will call U. The “outcome” in this case will be the height of the son, which we will call V.

To keep this example simple, let us assume that U and V are normally distributed–in other words, their distributions look like the familiar “bell curve” when they are plotted. To see how different our given observations of U or V are from their baseline, we “standardize” them into X and Y:

X = {U - \mu_U \over \sigma_U}

Y = {V - \mu_V \over \sigma_V},

where \mu is the mean and \sigma is the standard deviation. In our example, let \mu_U = 69, \mu_V = 70, and \sigma_U = \sigma_V = 2.
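If you prefer to see that step in code, here is a minimal sketch of standardization in Python (one of the languages mentioned at the end of this post); the helper function standardize and the variable names are just for illustration:

```python
def standardize(value, mean, sd):
    """Convert a raw observation into standard units (a z-score)."""
    return (value - mean) / sd

# Example values from the post: fathers average 69" and sons average 70",
# both with a standard deviation of 2".
mu_U, sigma_U = 69, 2   # fathers
mu_V, sigma_V = 70, 2   # sons

x = standardize(72, mu_U, sigma_U)
print(x)  # 1.5 -- a 72" father is one and a half standard deviations above average
```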

2. How much variance in the outcome does the predictor explain?
In a simple one-predictor, one-outcome (“bivariate”) example like this, we can answer question #2 by knowing the correlation between X and Y, which we will call \rho (and which is equal to the correlation between U and V in this case). For simplicity’s sake, let’s assume \rho = {1 \over 2}. In real life we would probably estimate \rho using regression, which is really just the reverse of predicting. We should also keep in mind that correlation only describes the linear relationship between X and Y, but that’s not something to worry about in this example. Using \rho, we can set up the following prediction model for Y:

Y= \rho X + \sqrt{1-\rho^2} Z.

Plugging in the values above we get:

Y= {1 \over 2} X + \sqrt{3 \over 4} Z.

Z is explained in the next paragraph.

3. What margin of error will we accept?
No matter what we are predicting, we have to accept that our estimates are imperfect. We hope that on average we are correct, but that just means that all of our over- and under-estimates cancel out. In the above equation, Z represents our errors. For our prediction to be unbiased there has to be zero correlation between X and Z. You might think that is unrealistic and you are probably right, even for our simple example. In fact, you can build a decent career by pestering other researchers with this question every chance you get. But just go with me for now. The amount of incorrect prediction we are willing to accept determines the width of the “confidence interval.” We will ignore confidence intervals in this post, focusing instead on point estimates but recognizing that our predictions are unlikely to be exactly correct.
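To make the roles of \rho and Z concrete, here is a small simulation sketch in Python (assuming numpy is available; nothing about it is specific to heights). It draws independent standard-normal X and Z, builds Y from the prediction model above with \rho = 1/2, and then checks that the correlation between X and Y comes out near 1/2 and that the prediction errors average out to roughly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
rho = 0.5

x = rng.standard_normal(n)               # standardized predictor (father)
z = rng.standard_normal(n)               # errors, drawn independently of x
y = rho * x + np.sqrt(1 - rho**2) * z    # standardized outcome (son)

print(np.corrcoef(x, y)[0, 1])   # roughly 0.5, the assumed correlation
print((y - rho * x).mean())      # roughly 0: over- and under-estimates cancel out
```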

The Prediction

Now that we have set up our prediction model and nailed down all of our assumptions, we are ready to make a prediction. Let’s predict the height of the son of a man who is 72″ tall. In probability notation, we want

\mathbb{E}(V|U=72),

which is the expected son’s height given a father with a height of 72″.

Following the steps above, we first need to know how different 72″ is from the average height of fathers. Looking at the standardizations above, we get

X = {U-69 \over 2}, and

Y = {V - 70 \over 2}, so

\mathbb{E}(V|U=72) = \mathbb{E}(2Y+70|X=1.5) = \mathbb{E}(2({1 \over 2}X + \sqrt{3 \over 4}Z)+70|X=1.5),

which reduces to 1.5 + \sqrt{3}\,\mathbb{E}(Z|X=1.5) + 70. As long as we were correct earlier about Z not depending on X and having an average of zero, the error term drops out and we get a predicted son’s height of 71.5 inches, or slightly shorter than his dad, but still above average.
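In code, the whole point prediction comes down to standardizing, scaling by \rho, and un-standardizing. A short sketch, using the same assumed values as above:

```python
mu_U, sigma_U = 69, 2   # father mean and standard deviation, in inches
mu_V, sigma_V = 70, 2   # son mean and standard deviation, in inches
rho = 0.5

father = 72
x = (father - mu_U) / sigma_U      # 1.5: how unusual this father is
expected_y = rho * x               # E(Y | X) = rho * X, since E(Z | X) = 0
predicted_son = mu_V + sigma_V * expected_y
print(predicted_son)               # 71.5 -- closer to the average than his father
```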

This phenomenon of the outcome (son’s height) being closer to the average than the predictor (father’s height) is known as regression to the mean, and it is the source of the term “regression” that is used widely today in statistical analysis. The idea dates back to one of the earliest large-scale statistical studies, Sir Francis Galton’s 1886 paper “Regression towards Mediocrity in Hereditary Stature” (pdf), which fits perfectly with today’s example.

Further reading: If you are already comfortable with the basics of prediction, and know a bit of Ruby or Python, check out Prior Knowledge.