# Statistical Thinking and the Birth of Modern Computing

John von Neumann and the IAS computer, 1945

What do fighter pilots, casinos, and streetlights all have in common? These three disparate topics are all the subject of statistical thinking that led to (and benefitted from) the development of modern computing. This process is described in Turing’s Cathedral by George Dyson, from which most of the quotes below are drawn. Dyson’s book focuses on Alan Turing far less than the title would suggest, in favor of John von Neumann’s work at the Institute for Advanced Study (IAS). Von Neumann and the IAS computing team are well-known for building the foundation of the digital world, but before Turing’s Cathedral I was unaware of the deep connection with statistics.

Statistical thinking first pops up in the book with Julian Bigelow’s list of fourteen “Maxims for Ideal Prognosticators” for predicting aircraft flight paths on December 2, 1941. Here is a subset (p. 112):

7. Never estimate what may be accurately computed.

8. Never guess what may be estimated.

9. Never guess blindly.

This early focus on estimation will reappear in a moment, but for now let’s focus on the aircraft prediction problem. With the advent of radar it became possible to fly sorties at night or in weather with poor visibility. In a dark French sky or over a foggy Belgian city it could be tough to tell who was who until,

otherwise adversarial forces agreed on a system of coded signals identifying their aircraft as friend or foe. In contrast to the work of wartime cryptographers, whose job was to design codes that were as difficult to understand as possible, the goal of IFF [Identification Friend or Foe] was to develop codes that were as difficult to misunderstand as possible…. We owe the existence of high-speed digital computers to pilots who preferred to be shot down intentionally by their enemies rather than accidentally by their friends. (p. 116)

In statistics this is known as the distinction between Type I and Type II errors, which we have discussed before. Pilots flying near their own lines likely figured there was a greater probability that their own forces would make a mistake than that the enemy would detect them–and going down as a result of friendly fire is no one’s idea of fun. This emergence of a cooperative norm in the midst of combat is consistent with stories from other conflicts in which the idea of fairness is used to compensate for the rapid progress of weapons technology.

Chapter 10 of the book (one of my two favorites along with Chapter 9, Cyclogenesis) is entitled Monte Carlo. Statistical practitioners today use this method to simulate statistical distributions that are analytically intractable. Dyson weaves the development of Monte Carlo in with a recounting of how von Neumann and his second wife Klari fell in love in the city of the same name. A full description of this method is beyond the scope of this post, but here is a useful bit:

Monte Carlo originated as a form of emergency first aid, in answer to the question: What to do until the mathematician arrives? “The idea was to try out thousands of such possibilities and, at each stage, to select by chance, by means of a ‘random number’ with suitable probability, the fate or kind of event, to follow it in a line, so to speak, instead of considering all branches,” [Stan] Ulam explained. “After examining the possible histories of only a few thousand, one will have a good sample and an approximate answer to the problem.”

For a more comprehensive overview of this development in the context of Bayesian statistics, check out The Theory That Would Not Die.
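Ulam’s description translates almost directly into code. As a toy illustration (my own, not one of the IAS applications), here is a Monte Carlo estimate of $\pi$: sample random points in the unit square and count the fraction that land inside a quarter circle.

```python
import random

def estimate_pi(n_samples, seed=0):
    """Monte Carlo estimate of pi: throw random points at the unit
    square and count how many fall inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4 * inside / n_samples

print(estimate_pi(100_000))  # roughly 3.14
```

Just as Ulam says, a few thousand sampled “histories” already give an approximate answer; the error shrinks roughly as $1/\sqrt{n}$.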

The third and final piece of the puzzle for our post today is the well-known but not sufficiently appreciated distinction between correlation and causation. Philip Thompson, a meteorologist who joined the IAS group in 1946, learned this lesson at the age of 4 and counted it as the beginning of his “scientific education”:

[H]is father, a geneticist at the University of Illinois, sent him to post a letter in a mailbox down the street. “It was dark, and the streetlights were just turning on,” he remembers. “I tried to put the letter in the slot, and it wouldn’t go in. I noticed simultaneously that there was a streetlight that was flickering in a very peculiar, rather scary, way.” He ran home and announced that he had been unable to mail the letter “because the streetlight was making funny lights.”

Thompson’s father seized upon this teachable moment, walked his son back to the mailbox and “pointed out in no uncertain terms that because two unusual events occurred at the same time and at the same place it did not mean that there was any real connection between them.” Thus the four-year-old learned a lesson that many practicing scientists still have not. This is also the topic of Chapter 8 of How to Lie with Statistics and a recent graph shared by Cory Doctorow.

The fact that these three lessons on statistical thinking coincided with the advent of digital computing, along with a number of other anecdotes in the book, impressed upon me the deep connection between these two fields of thought. Most contemporary Bayesian work would be impossible without computers. It is also possible that digital computing would have come about much differently without an understanding of probability and the scientific method.

# Modeling Third-Party Intervention in Civil Wars

Inspired by the 2011 coalition action in Libya, Shahryar Minhas and I recently developed a new agent-based model of third-party intervention in civil wars. If you are attending the ISSS/ISAC conference in Chapel Hill, you will get a chance to hear about the project this afternoon. If not, our slides are here.

We model civil war as a gambler’s ruin problem. Rebels start out with a fixed proportion of territory, always less than or equal to fifty percent. Depending on their relative strength, the long run expectation is that they will lose, just like you will lose to the “house” if you go for broke at a Vegas casino. Intervention can change the game, though: it would be like winning on black and half of the red spaces on a roulette wheel. Our model is a bit more detailed than that (particularly with how we model states’ decisions to intervene) but that’s the gist.

Here are the long-run probabilities of beating the government, given a certain percentage of territory occupied and a certain strength ratio, based on the gambler’s ruin:
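The basic calculation behind those probabilities can be sketched in a few lines. This treats territory as the gambler’s stake (in whole percentage points out of 100) and relative strength as a per-round win probability; it is a simplified parameterization for illustration, and our full model is more detailed.

```python
def rebel_victory_prob(territory, p_win):
    """Gambler's-ruin probability that the rebels reach 100% of
    territory before falling to 0%, moving one point per round.

    territory: starting rebel share, in whole percentage points (1-99)
    p_win: probability the rebels win any given round
    """
    n = 100                       # total points of territory
    i = territory
    if p_win == 0.5:
        return i / n              # fair game: success probability is i/N
    r = (1 - p_win) / p_win       # ratio of loss to win probability
    return (1 - r ** i) / (1 - r ** n)

# A rebellion holding 30% of territory, at three strength levels:
for p in (0.45, 0.50, 0.55):
    print(p, rebel_victory_prob(30, p))
```

Even a small per-round disadvantage drives the long-run victory probability toward zero, which is exactly the “house always wins” logic, and why outside intervention that tilts the per-round odds can change the game entirely.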

And after 1,000,000 simulations, here is how intervention affects conflict duration:

So the takeaway is that we predict intervention will increase conflict duration; but if the resulting outcome is more desirable for the intervener, the added length may be worth it.

# Voter Loyalty in Two Countries

Preliminary graphs from an ongoing project with Pablo Beramendi (apologies for the very plain presentation):

For both plots, the loyalty rate is calculated as the probability that an individual votes for party x in election t given that they voted for party x in election t-1. These probabilities are frequencies taken from electoral surveys in each country.
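Concretely, the loyalty rate is just a conditional frequency. A minimal sketch, using made-up survey rows (the party names and data are hypothetical):

```python
# Hypothetical two-wave survey: each row is one respondent's vote
# at election t-1 and at election t.
votes = [
    ("Labour", "Labour"),
    ("Labour", "Conservative"),
    ("Labour", "Labour"),
    ("Conservative", "Conservative"),
    ("Conservative", "Labour"),
]

def loyalty_rate(votes, party):
    """P(vote for party at t | voted for party at t-1), as a frequency."""
    prior = [(prev, cur) for prev, cur in votes if prev == party]
    if not prior:
        return None               # no one voted for the party at t-1
    stayed = sum(1 for _, cur in prior if cur == party)
    return stayed / len(prior)

print(loyalty_rate(votes, "Labour"))  # 2 of 3 previous Labour voters stayed
```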

In the UK there is no clear trend, but in Portugal it appears that loyalty is declining. However, notice that the two x-axes cover different time spans (due to data availability), so it is difficult to say whether the trend in Portugal would hold up over a longer period.

# Getting Started with Prediction

From historians to financial analysts, researchers of all stripes are interested in prediction. Prediction asks the question, “given what I know so far, what do I expect will come next?” In the current political season, presidential election forecasts abound. This dates back to the work of Ray Fair, whose book is ridiculously cheap on Amazon. In today’s post, I will give an example of a much more basic–and hopefully, relatable–question: given the height of a father, how do we predict the height of his son?

To see how common predictions about children’s traits are, just Google “predict child appearance” and you will be treated to a plethora of websites and iPhone apps with photo uploads. Today’s example is more basic and will follow three questions that we should ask ourselves for making any prediction:

1. How different is the predictor from its baseline?
It’s not enough to just have a single bit of information from which to predict–we need to know something about the baseline of the information we are interested in (often the average value) and how different the predictor we are using is. The “predictor” in this case will refer to the height of the father, which we will call $U$. The “outcome” in this case will be the height of the son, which we will call $V$.

To keep this example simple let us assume that $U$ and $V$ are normally distributed–in other words their distributions look like the familiar “bell curve” when they are plotted. To see how different our given observations of $U$ or $V$ are from their baseline, we “standardize” them into $X$ and $Y$

$X = {{u - \mu_u} \over \sigma_u }$

$Y = {{v - \mu_v} \over \sigma_v }$,

where $\mu$ is the mean and $\sigma$ is the standard deviation. In our example, let $\mu_u = 69$, $\mu_v=70$, and $\sigma_v = \sigma_u = 2$.

2. How much variance in the outcome does the predictor explain?
In a simple one-predictor, one-outcome (“bivariate”) example like this, we can answer question #2 by knowing the correlation between  $X$ and $Y$, which we will call $\rho$ (and which is equal to the correlation between $U$ and $V$ in this case). For simplicity’s sake let’s assume $\rho={1 \over 2}$. In real life we would probably estimate $\rho$ using regression, which is really just the reverse of predicting. We should also keep in mind that correlation is only useful for describing the linear relationship between $X$ and $Y$, but that’s not something to worry about in this example. Using $\rho$, we can set up the following prediction model for $Y$:

$Y= \rho X + \sqrt{1-\rho^2} Z$.

Plugging in the values above we get:

$Y= {1 \over 2} X + \sqrt{3 \over 4} Z$.

$Z$ is explained in the next paragraph.

3. What margin of error will we accept?
No matter what we are predicting, we have to accept that our estimates are imperfect. We hope that on average we are correct, but that just means that all of our over- and under-estimates cancel out. In the above equation, $Z$ represents our errors. For our prediction to be unbiased there has to be zero correlation between $X$ and $Z$. You might think that is unrealistic and you are probably right, even for our simple example. In fact, you can build a good career by pestering other researchers with this question every chance you get. But just go with me for now. The level of incorrect prediction that we are able to accept affects the “confidence interval.” We will ignore confidence intervals in this post, focusing instead on point estimates but recognizing that our predictions are unlikely to be exactly correct.

The Prediction

Now that we have set up our prediction model and nailed down all of our assumptions, we are ready to make a prediction. Let’s predict the height of the son of a man who is 72″ tall. In probability notation, we want

$\mathbb{E}(V|U=72)$,

which is the expected son’s height given a father with a height of 72”.

Following the steps above we first need to know how different 72″ is from the average height of fathers.  Looking at the standardizations above, we get

$X = {U-69 \over 2}$, and

$Y = {V - 70 \over 2}$, so

$\mathbb{E}(V|U=72) = \mathbb{E}(2Y+70|X=1.5) = \mathbb{E}(2({1 \over 2}X + \sqrt{3 \over 4}Z)+70|X=1.5)$,

which reduces to $1.5 + \sqrt{3}\,\mathbb{E}(Z|X=1.5) + 70$ (the $\sqrt{3}$ is just $2\sqrt{3 \over 4}$). As long as we were correct earlier about $Z$ not depending on $X$ and having an average of zero, the error term drops out and we get a predicted son’s height of 71.5 inches, or slightly shorter than his dad, but still above average.
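The whole calculation can be checked numerically. Here is a minimal sketch that computes the point prediction and then verifies it by simulating the model directly, using the parameters assumed above ($\mu_u = 69$, $\mu_v = 70$, $\sigma = 2$, $\rho = 1/2$):

```python
import random

MU_U, MU_V, SIGMA, RHO = 69.0, 70.0, 2.0, 0.5

def predict_son_height(father):
    """Point prediction E(V | U = father) from the model above."""
    x = (father - MU_U) / SIGMA           # standardize the father's height
    y = RHO * x                           # E(Y | X = x), since E(Z) = 0
    return MU_V + SIGMA * y               # un-standardize back to inches

print(predict_son_height(72))             # 71.5

# Sanity check by simulation: draw X and Z independently, form
# Y = rho*X + sqrt(1 - rho^2)*Z, and average V over draws where U is near 72.
rng = random.Random(1)
heights = []
for _ in range(200_000):
    x = rng.gauss(0, 1)
    z = rng.gauss(0, 1)
    y = RHO * x + (1 - RHO ** 2) ** 0.5 * z
    u = MU_U + SIGMA * x
    if abs(u - 72) < 0.1:
        heights.append(MU_V + SIGMA * y)
print(sum(heights) / len(heights))        # close to 71.5
```

The simulated average lands close to 71.5 inches, in between the father’s 72 inches and the population mean of 70, which is regression to the mean in action.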

This phenomenon of the outcome (son’s height) being closer to the average than the predictor (father’s height) is known as regression to the mean and it is the source of the term “regression” that is used widely today in statistical analysis. This dates back to one of the earliest large-scale statistical studies by Sir Francis Galton in 1886, entitled, “Regression towards Mediocrity in Hereditary Stature,” (pdf) which fits perfectly with today’s example.

Further reading: If you are already comfortable with the basics of prediction, and know a bit of Ruby or Python, check out Prior Knowledge.

# How Do We Define Risk?

Statistics are famous for their pliability, as anyone who has heard of Mark Twain will attest. Proponents on either side of a policy position often have “hard numbers” to support their view. When costs or benefits are uncertain, there must be some way to measure risk. Attitudes toward risk can have a powerful influence on which policy positions one finds acceptable.

One way that risk calculations become difficult is when there is a very, very small chance of a very, very bad outcome. Conservatives make use of this logic when they suggest that potential murderers facing a small chance of the death penalty will be dissuaded from their crime. Or, conversely, that a small chance of the criminal killing again is bad enough to justify execution. Liberals employ this same type of reasoning when they talk about potentially extreme consequences of global warming: even a small risk of the Eastern seaboard disappearing would be disastrous.

Humans are very bad at estimating probabilities, particularly when they are below 1 percent. Daniel Kahneman discusses this in his book Thinking, Fast and Slow, where he gives an example of the difference between a 0.001% risk of death and a 0.00001% chance. To the human eye/brain the difference seems minuscule, but in practical terms it means 3,000 Americans dying or 30. Combine this with the fact that it is easier to recall vivid memories like September 11, and you have foolish policies that require Americans to remove their shoes and stand in a full-body scanner to reduce their risk of dying in a plane hijacked by suicide bombers from practically zero to slightly-less-than-practically zero.
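The arithmetic behind that example is worth making explicit, assuming a round US population of 300 million:

```python
US_POPULATION = 300_000_000   # rough round figure for illustration

def expected_deaths(risk_percent):
    """Translate a percentage risk of death into an expected body count."""
    return US_POPULATION * risk_percent / 100

print(round(expected_deaths(0.001)))    # 3000
print(round(expected_deaths(0.00001)))  # 30
```

Two probabilities that both read as “basically zero” differ by a factor of one hundred, and by thousands of lives, once multiplied across a whole population.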

Kahneman offers a quote from Paul Slovic that summarizes the discussion well:

Whoever controls the definition of risk controls the rational solution to the problem at hand. If you define risk one way, then one option will rise to the top as the most cost-effective or the safest or the best. If you define it another way, perhaps incorporating qualitative characteristics and other contextual factors, you will likely get a different ordering of your action solutions. Defining risk is thus an exercise in power. (original article here)

And until we get our philosopher-king, the ones who understand risk will likely not be the ones holding the keys to the kingdom. But we can at least be aware of when the small-chance/very-bad-outcome argument is being used to frighten us into irrationality.