3-D Baseball: Math Behind Projecting the Division Winner (THT Article)

Note: this article uses examples written in R, a free statistical software package.

In my Hardball Times article about projecting the number of wins we expect from the division winner, I included the following example:

Instead of having five baseball teams, let's say we have five coins. All we are going to do is flip each coin 162 times. Each time a coin lands on heads, it gets a win, and each time it lands on tails, it gets a loss. The coin with the most wins after 162 flips wins the division.

How many wins would you project for the coin that ends up winning the division, whichever coin that might be?

No coin by itself is going to have an expected value of more than 81 wins, but it is extremely likely that at least one out of the five coins will end up with more than 81 wins just by chance. It turns out that if you repeat this experiment a bunch of times, the coin that wins the division will end up with about 88 wins, on average.

Hopefully this makes sense conceptually, but how do I get 88 wins (or, more precisely, 88.3943...)?

One way, of course, is to actually do what I said, and flip a bunch of coins over and over and over and record the results. Let's say I repeat this experiment 10 times, and I get the following results for the "division winners":

94, 85, 89, 87, 89, 90, 82, 86, 85, 86

That is an average of 87.3, which is pretty good, but obviously not the most precise estimate. We need to repeat the experiment more than 10 times to make sure we get something closer to the true mean. Rather than spend hours upon hours flipping coins, we can cheat and get a computer to pretend to do it for us. This is called simulation, and it can be a very powerful statistical tool for determining probabilities, averages, distributions, etc. that are not computationally obvious (full disclosure: I actually cheated and simulated the 10 seasons rather than record and tally 8000+ coin flips).

Now, let's simulate 1000 seasons: this time, we get 88.5940 wins leading the division, on average. Much better, but still a couple tenths off. Bumping the number of seasons up to 10,000, this time we get 88.4296. And if we keep simulating more and more seasons, we are going to start seeing the results stay clustered more and more closely around 88.3943.
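For anyone who wants to try the simulation themselves, here is a minimal sketch in R. It is not necessarily the exact code behind the numbers above: the variable names, the seed, and the choice to draw each coin's season total directly with rbinom (rather than flipping 162 individual coins) are my own assumptions for illustration. Your average will differ a little from run to run and from the figures quoted in this post.

```r
# Simulate many five-coin "divisions" and average the winner's win total.
set.seed(1)        # arbitrary seed, only so that a run can be repeated
n <- 162           # flips (games) per season
p <- .5            # probability of heads (a win) on each flip
teams <- 5         # coins in the division
seasons <- 10000   # number of simulated seasons

# one row per season, one column per coin; each entry is a season win total
wins <- matrix(rbinom(seasons * teams, n, p), nrow = seasons, ncol = teams)

winners <- apply(wins, 1, max)  # best win total in each simulated season
sim.mean <- mean(winners)       # average wins by the division winner
sim.mean
```

Raising seasons makes the average settle down more tightly, at the cost of a longer run.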

So that's one way to estimate the expected win total for our division winner. How do I know that the results should cluster around 88.3943 specifically, though, other than simulating millions and millions of seasons?

We can get the answer without simulation by starting with a simpler question. What is the probability that none of the teams wins more than, for example, 81 games? The probability that one team wins no more than 81 games is a simple binomial distribution problem: pbinom(81,162,.5) ~ .5313. The probability that all five are at 81 or lower then becomes .5313^5 ~ .04233.

There is about a 4% chance that the division winner will have 81 or fewer wins. We can repeat that calculation for 80 wins, and we see that there is about a .02262 probability of the division winner having 80 or fewer wins. That means the probability of the division winner having exactly 81 wins is .04233 - .02262 = .01971.
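The 81-win example can be checked with a few lines of R. This is just a sketch of the arithmetic described above; the names p81, p80, and exact81 are mine.

```r
# Probability that the division winner finishes with exactly 81 wins.
n <- 162; p <- .5; teams <- 5

p81 <- pbinom(81, n, p)^teams  # all five teams at 81 wins or fewer
p80 <- pbinom(80, n, p)^teams  # all five teams at 80 wins or fewer
exact81 <- p81 - p80           # winner at 81 or fewer, but not 80 or fewer

round(c(p81, p80, exact81), 5)
```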

Then, we repeat that process for every number from 0 to 162, and we end up with a table of probabilities of the division winner ending up on each possible number of wins. (If you were to do this by hand, you could shortcut a bit by only going from something like 70 to 115 since the probabilities outside that range are all virtually zero anyway.)

Finally, we multiply each possible win total by the probability of the division winner finishing with that number of wins, and we add up the results to get a mean for the distribution. And doing that gives us 88.3943.

R CODE:

#calculate expected mean value of division winner
p <- .5 #probability of each team winning each game
n <- 162 #number of games per season
teams <- 5 #number of teams in the division

games <- 0:n # list of possible win totals (0:162)
p.list <- pbinom(games,n,p)^teams # p of div winner winning X games or fewer
wts <- c(p.list[1],diff(p.list)) # p of div winner winning exactly X games
sum(games*wts) # average wins by division winner

#RESULT
[1] 88.39431

As we can see, it is possible to calculate the mean of this distribution exactly, but it is still pretty cumbersome to do so without a computer. As such, let's discuss one final way to estimate this mean using simpler calculations.

First, we will need a continuous distribution, so we use a normal approximation for the binomial distribution. The mean of the normal distribution will just be 81 (the average number of wins we expect from a team in our example), and the standard deviation will be sqrt(npq) = sqrt(162*.5*.5) ~ 6.36.

All we need to do now is find the point where there is a 50% chance that five numbers randomly sampled from this distribution will all fall below that number. Start by finding the percentile of the distribution that fulfils this condition:

p^5 = .5
p = .5^(1/5) ~ 0.8706

This means we want the point at the 87.06th percentile (the 0.8706 quantile) of our normal distribution, which is easy to look up using an online tool or basic statistical software:

qnorm(0.8706,81,6.36) ~ 88.1849

That is our estimate for the expected number of wins from the division winner. This is slightly off because we are actually calculating the median and not the mean (and because we used a normal approximation, but that makes less difference), but it is still a pretty good estimate given the amount of calculation we simplified.
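To see the median-versus-mean point in numbers, the sketch below puts the normal-approximation steps together and also finds the median of the exact (binomial) distribution of the division winner's wins, for comparison. The variable names are mine, and it uses unrounded values for the standard deviation and the percentile, so the approximation may differ slightly in the last decimal places from the figure above.

```r
# Normal-approximation shortcut vs. the exact median of the winner's wins.
n <- 162; p <- .5; teams <- 5

mu <- n * p                       # mean wins for one team (81)
sigma <- sqrt(n * p * (1 - p))    # sqrt(npq)
q <- .5^(1 / teams)               # percentile where all five fall below half the time
approx.median <- qnorm(q, mu, sigma)

# exact version: smallest win total where P(winner <= k) reaches one half
games <- 0:n
cdf.max <- pbinom(games, n, p)^teams
exact.median <- games[which(cdf.max >= .5)[1]]

c(approx.median, exact.median)
```

Because the exact distribution only takes whole-number values, its median is a whole number, while the shortcut gives a point on a continuous curve. Both sit a little below the mean computed earlier.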
