Simulating the NLDS: Can the Giants Win?

In his new book, Think Bayes, Allen Downey presents the “Boston Bruins” problem: estimating the Bruins’ probability of winning the 2010-2011 NHL championship series after two wins and two losses. I will briefly describe Downey’s approach and then apply it to the current situation of the San Francisco Giants.

One (naive) approach would be to model this as a gambler’s ruin problem. That model has two drawbacks here: the stopping rule is different (the championship is a best-of-n series, not play until one side is totally defeated), and it throws away important information about the scores of the games.

Instead, we make two assumptions. First, we model scoring in baseball as a Poisson process, in which a run is equally likely to be scored at any time during the game. This is still something of an oversimplification (the odds are better when you have runners on base, for example), but it gets us closer to the “true” model. Second, we assume that games between the Reds and Giants in this year’s National League Division Series are similar enough that each team’s runs per game can be treated as draws from a Poisson distribution whose parameter λ, the team’s average runs per game, stays the same from game to game. Different pitchers could throw this assumption off (no pun intended), but we will again accept it as a not-entirely-implausible simplification.
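To make the Poisson assumption concrete, here is a minimal sketch of the distribution of runs in a single game for a given λ. The value of λ below is a made-up placeholder, not a figure from the ESPN stats, and `poisson_pmf` is just a helper name I am using for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k runs in a game when the team averages lam runs per game."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 4.5  # hypothetical scoring rate, for illustration only

# Probability of scoring 0, 1, 2, ... 29 runs in one game.
probs = [poisson_pmf(k, lam) for k in range(30)]

for k in range(8):
    print(f"P({k} runs) = {probs[k]:.3f}")
```

Under this model, λ is the only thing that distinguishes one team from another, which is why the rest of the analysis is all about estimating it.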

Having made our assumptions, we now use a four-step process proposed by Downey:

1. Use statistics from previous games to choose a prior distribution for λ.
2. Use the scores from the first four games to update the estimate of λ for each team.
3. Use the posterior distributions of λ to compute the distribution of runs for each team, the distribution of the run differential, and the probability that each team wins.
4. Simulate the rest of the series to estimate the probability of each possible outcome.
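The four steps above can be sketched in plain Python. This is not Downey's thinkbayes code, just a rough stand-in using only the standard library. Everything numeric here is an assumption for illustration: the Gaussian prior's mean and spread, the grid of λ values, and the two lists of scores are placeholders, not the real regular-season or NLDS numbers. The sketch also throws out simulated ties, which is a shortcut of mine (real games go to extra innings).

```python
import math
import random


def poisson_pmf(k, lam):
    """Likelihood of scoring exactly k runs given a rate of lam runs per game."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def gaussian_prior(grid, mean, sd):
    """Step 1: a discretized, normalized Gaussian prior over the grid of lam values."""
    weights = [math.exp(-0.5 * ((lam - mean) / sd) ** 2) for lam in grid]
    total = sum(weights)
    return [w / total for w in weights]


def update(grid, prior, scores):
    """Step 2: multiply the prior by the Poisson likelihood of each score, then normalize."""
    post = list(prior)
    for runs in scores:
        post = [p * poisson_pmf(runs, lam) for p, lam in zip(post, grid)]
    total = sum(post)
    return [p / total for p in post]


def sample_poisson(lam, rng):
    """Draw one Poisson variate by multiplying uniforms (Knuth's method)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def win_probability(grid, post_a, post_b, n_games=20000, seed=0):
    """Steps 3 and 4: simulate single games; return team A's share of the non-tied games."""
    rng = random.Random(seed)
    lams_a = rng.choices(grid, weights=post_a, k=n_games)
    lams_b = rng.choices(grid, weights=post_b, k=n_games)
    wins_a = decided = 0
    for lam_a, lam_b in zip(lams_a, lams_b):
        runs_a = sample_poisson(lam_a, rng)
        runs_b = sample_poisson(lam_b, rng)
        if runs_a != runs_b:  # assumption: ignore ties rather than model extra innings
            decided += 1
            wins_a += runs_a > runs_b
    return wins_a / decided


grid = [0.1 * i for i in range(1, 151)]          # candidate lam values, 0.1 to 15.0
prior = gaussian_prior(grid, mean=4.4, sd=0.6)   # placeholder prior, not real stats
team_a_scores = [2, 1, 3, 7]                     # placeholder scores, not the real games
team_b_scores = [4, 6, 2, 3]                     # placeholder scores, not the real games

post_a = update(grid, prior, team_a_scores)
post_b = update(grid, prior, team_b_scores)
p_a = win_probability(grid, post_a, post_b)
print(f"Estimated probability that team A wins the next game: {p_a:.3f}")
```

Because only one game remains in a series tied after four games, simulating the rest of the series reduces to simulating that single game. For a longer series you would repeat the game simulation until one team reaches the required number of wins.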

To calculate λ, we will use the team batting stats from ESPN and the thinkbayes Python package from Downey’s site.

Here is the distribution of λ, using regular season scoring as the prior and updating with the results of the first four games of the division series:

And here are the predicted runs-per-game by team, using simulations:

According to the model’s predictions, the probability that the Giants win today’s game (and the division series) is 0.387. I would have preferred to use a Gamma prior for λ and run some more simulations in R, but I wanted to use Downey’s example and get this up before the game started… which was a few minutes ago (although as I post, the score is still 0-0). Either way, enjoy the game!
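For the curious, here is roughly what the Gamma prior would buy us. The Gamma distribution is conjugate to the Poisson likelihood, so if the prior on λ is Gamma with shape α and rate β, then after n games with S total runs the posterior is Gamma with shape α + S and rate β + n, and no grid is needed. The prior parameters and scores below are placeholders I picked for illustration, not values fitted to the real stats.

```python
def gamma_poisson_update(alpha, beta, scores):
    """Conjugate update: Gamma(alpha, beta) prior on lam plus Poisson-distributed scores.

    alpha is the shape and beta is the rate, so the prior mean is alpha / beta.
    """
    return alpha + sum(scores), beta + len(scores)


alpha0, beta0 = 44.0, 10.0   # hypothetical prior with mean 4.4 runs per game
scores = [3, 1, 4, 6]        # placeholder scores, not the real games

alpha1, beta1 = gamma_poisson_update(alpha0, beta0, scores)
print(f"Posterior: Gamma(shape={alpha1}, rate={beta1}), mean = {alpha1 / beta1:.2f}")
```

One way to read the prior is as β imaginary games in which the team scored α runs in total, so a larger β means the four playoff games move the estimate less.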