3-D Baseball: Poisson Processes in Sports

In sports, the problem of relating a team's offensive and defensive production to its W-L record is closely related to the distribution of scoring events in the sport. For example, say you want to know how often a team that scores, on average, 4 times per game and allows 3 scores per game is expected to win. It is not enough to simply know that the team averages 4 scores and 3 scores allowed; you also have to have an idea of how likely the team is to score (or allow) 0 times, 1 time, 2 times, etc. If the nature of the sport provides for a very tight range of scores for each team (i.e. the 4-score team is very unlikely to score 0 or 1 time, or 7 or 8 times), then the team will win more often than if the sport sees a wider distribution of observed scores for each team.

Let's say, for example, that the team in this example scores and allows scores in the following distribution:

scores	P(score)	P(allow)
0	0.06	0.14
1	0.10	0.15
2	0.13	0.17
3	0.15	0.17
4	0.17	0.12
5	0.13	0.10
6	0.10	0.07
7	0.07	0.04
8	0.05	0.03
9	0.03	0.01
10	0.01	0.00

In the above table, the team would score 0 times 6% of the time and allow 0 scores 14% of the time.

To find the chances of the team winning, start with its chances of outscoring its opponent when it scores once (it cannot win a game in which it scores 0 times). Since this team allows fewer than one score 14% of the time, it would be expected to win 14% of the games in which it scores once. The team scores once 10% of the time, so one-score victories should account for .10*.14 = 1.4% of its games. Continuing with 2-score victories, the team allows fewer than 2 scores 29% of the time (14%+15%), so 2-score victories account for .13*.29 = 3.8% of the team's total games.

Doing this for each possible number of scores, the team will win a total of 56% of its games. Repeating the same process for losses, it will lose 32% of the time (the other 12% of games will end tied).

As long as we know the probability of each possible number of scores and scores allowed, the expected W-L performance can be found in this way. In terms of summation notation, it looks something like this:

$$\sum_{i=0}^{\infty} p_s(i) \sum_{j=0}^{i-1} p_a(j)$$

where ps(i) is the probability of scoring i number of times and pa(j) is the probability of allowing j number of scores.
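As a rough check on the arithmetic, the double sum can be evaluated directly from the table. The sketch below is illustrative only: the function name `win_loss_tie` is made up for this example, and it assumes (as the calculation above implicitly does) that scores and scores allowed are independent of each other.

```python
# Probabilities of scoring / allowing 0..10 times, copied from the table above.
p_score = [0.06, 0.10, 0.13, 0.15, 0.17, 0.13, 0.10, 0.07, 0.05, 0.03, 0.01]
p_allow = [0.14, 0.15, 0.17, 0.17, 0.12, 0.10, 0.07, 0.04, 0.03, 0.01, 0.00]


def win_loss_tie(ps, pa):
    """Return (P(win), P(loss), P(tie)), assuming independent distributions."""
    # Win: score i times while allowing fewer than i (the double sum above).
    win = sum(ps[i] * sum(pa[:i]) for i in range(len(ps)))
    # Loss: the same sum with the two roles swapped.
    loss = sum(pa[i] * sum(ps[:i]) for i in range(len(pa)))
    # Tie: score and allow the same number.
    tie = sum(s * a for s, a in zip(ps, pa))
    return win, loss, tie


win, loss, tie = win_loss_tie(p_score, p_allow)
print(f"W {win:.3f}  L {loss:.3f}  T {tie:.3f}")
print(f"W% ignoring ties: {win / (win + loss):.3f}")
```

The three probabilities should sum to 1, which is a quick sanity check whenever the table is edited.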

This is only useful if you have a reasonable model for finding these probabilities, however, which requires some model for the distribution of possible scores around the team average. In baseball, no such distribution is obvious, so instead of the above process, we use shortcuts like PythagenPat to approximate the result of translating the underlying distribution of possible run totals into an expected win percentage. (By the way, the above example roughly resembles the actual distribution for a baseball team; traditional pythag would give you 4^2/(4^2+3^2) = 16/25 = .640 W%, while the example, ignoring ties, shows .56/(.56+.32) = .636.) Steven Miller showed that a Weibull distribution of runs gives a Pythagorean estimate of W%, and that the Weibull distribution is a reasonable assumption for his sample data (the 2004 American League), but that is just working backwards from the model in place.

Some sports, however, do present an obvious choice of model, namely the Poisson distribution. Both soccer and hockey are decent examples of Poisson processes because

-play happens over a predetermined length, measured in a continuous fashion (i.e. time, as opposed to something like baseball, which is measured in discrete units of outs or innings)
-goals can only come one at a time (as opposed to something like basketball, where points can come in groups of 1, 2, 3, or 4)
-the number of goals scored over a given period of the game is largely independent of the number of goals scored over a separate period of the game (the fluid nature of possession is a key attribute here; for a sport like American football where a score dictates who has possession for a significant chunk of time, a team's score over one five-minute span will be somewhat dependent on whether it scored in the previous five-minute span, for example)
-the expectation for the number of goals over a period of time (once you know who is playing) depends mostly on the length of time

Hockey has at least one exception to the requirements of a Poisson process: because of empty-net goals, the number of goals scored at the end of the game is not always independent of the number of goals scored earlier in the game. I don't know how much of an issue this presents. Soccer is a more straightforward example, and a more homogeneous one, since it largely lacks the substitutions and penalties that continually affect the scoring rate in hockey. Both, however, generally fit the mould for a Poisson process.

Using a Poisson distribution to fill out a table as in the above example (if you have Excel or a similar spreadsheet program, it should have a Poisson distribution function built in), we can then calculate expected W-L performances for a team. The first and second columns use the average number of goals for and against, respectively, as λ (in Excel, POISSON.DIST(x, avg goals for/against, FALSE), where x is 0, 1, 2, 3, ...). Say we do this for a soccer team that we expect to score an average of 2 goals per game and allow an average of 1 goal per game against its opponent. We get the following probabilities:

W: .606
L: .183
D: .212

Using the traditional soccer point-format (3 points for a win, 1 for a draw), this team would average about 2.03 points per game against its opponent.
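For anyone without a spreadsheet handy, the same table can be built in a few lines of Python. This is only a sketch: `poisson_pmf` and `match_odds` are names invented here, the two teams' goal counts are assumed to be independent, and the infinite sums are cut off at `max_goals`, which is an arbitrary choice that is far more than enough for rates this small.

```python
import math


def poisson_pmf(k, lam):
    """P(exactly k goals) when the expected number of goals is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def match_odds(lam_for, lam_against, max_goals=25):
    """Return (P(win), P(loss), P(draw)), truncating each tail at max_goals."""
    ps = [poisson_pmf(k, lam_for) for k in range(max_goals + 1)]
    pa = [poisson_pmf(k, lam_against) for k in range(max_goals + 1)]
    win = sum(ps[i] * sum(pa[:i]) for i in range(max_goals + 1))
    loss = sum(pa[i] * sum(ps[:i]) for i in range(max_goals + 1))
    draw = sum(s * a for s, a in zip(ps, pa))
    return win, loss, draw


win, loss, draw = match_odds(2.0, 1.0)
points = 3 * win + 1 * draw  # 3 points for a win, 1 for a draw
print(f"W {win:.3f}  L {loss:.3f}  D {draw:.3f}  points/game {points:.2f}")
```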

We can also use the Poisson distribution to figure out what to expect if the game goes to overtime. Elimination soccer matches typically have a 30-minute OT (two 15-minute periods), so the λ values (which, recall, are the average goals for and against, 2 and 1 in this example) for the OT will be 1/3 of their regulation-match values, since 30 minutes is 1/3 of a 90-minute match (note that finding λ for regular-season hockey OTs will be more complicated because the 4v4 format will affect the scoring rate). Reconstructing the table with λ values of 2/3 and 1/3, we get the following results for games that go to OT:

OTW: .384
OTL: .161
OTD: .454

If overtime ends in a draw, the game is usually decided on penalty kicks (PKs). If we assume that each team is 50/50 to win in PKs (which is not necessarily the case, but shootout odds should be closer to 50/50 than the rest of the match, and the odds in a shootout aren't necessarily based on expected goals for and against for the match), then our team's expected win% once a game goes to OT is .384 + .5*.454 = .611. Remember that the team wins 60.6% of the time in regulation, and the game goes to OT 21.2% of the time, so the team's total expected win probability is .606 + .611*.212 = .735.
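The regulation, overtime, and shootout pieces can be chained together in code. As before, this is a sketch under stated assumptions: `match_odds` is a made-up helper, goal counts are treated as independent Poisson variables, and `P_SHOOTOUT = 0.5` is the coin-flip assumption from the paragraph above rather than a measured value.

```python
import math


def match_odds(lam_for, lam_against, max_goals=25):
    """Return (P(win), P(loss), P(draw)) for independent Poisson goal counts."""
    def pmf(k, lam):
        return math.exp(-lam) * lam**k / math.factorial(k)

    ps = [pmf(k, lam_for) for k in range(max_goals + 1)]
    pa = [pmf(k, lam_against) for k in range(max_goals + 1)]
    win = sum(ps[i] * sum(pa[:i]) for i in range(max_goals + 1))
    loss = sum(pa[i] * sum(ps[:i]) for i in range(max_goals + 1))
    draw = sum(s * a for s, a in zip(ps, pa))
    return win, loss, draw


P_SHOOTOUT = 0.5  # assumed chance of winning on PKs

reg_win, reg_loss, reg_draw = match_odds(2.0, 1.0)
# 30 minutes of OT is 1/3 of a 90-minute match, so both rates shrink by 1/3.
ot_win, ot_loss, ot_draw = match_odds(2.0 / 3, 1.0 / 3)

ot_win_pct = ot_win + P_SHOOTOUT * ot_draw
total_win = reg_win + reg_draw * ot_win_pct
print(f"OT: W {ot_win:.3f}  L {ot_loss:.3f}  D {ot_draw:.3f}")
print(f"win% once in OT {ot_win_pct:.3f}, overall {total_win:.3f}")
```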

If we want to model a sudden death OT, such as in the Stanley Cup playoffs, the odds of winning in regulation remain unchanged, but we have to use a different formula to determine the chances of winning once the game goes to overtime. The Poisson distribution works for estimating the probability of scoring a certain number of goals in a pre-determined amount of time (such as a 20-minute period or a 60-minute game), but not for estimating the time until the next goal. For that, we instead need the exponential distribution, which models the amount of time until the next goal.

We want to know the probability that our team's time until its next goal is less than its opponent's time to its next goal. Recall the above formula we used to determine the odds of our team's goals scored being higher than its opponent's:

$$\sum_{i=0}^{\infty} p_s(i) \sum_{j=0}^{i-1} p_a(j)$$

Here, we use something similar, except that we want to know the chances that our team's value (time to the next goal) is less than that of its opponent:

$$\sum_{i=0}^{\infty} p_a(i) \sum_{j=0}^{i-1} p_s(j)$$

where ps(j) is the probability of our team's next goal coming after j amount of elapsed time, and pa(i) is the probability of its opponent's next goal coming after i amount of elapsed time.

Additionally, we are now dealing with a continuous variable (time elapsed) rather than a discrete variable (number of goals scored), so we need to integrate instead of sum:

$$\int_0^{\infty} f(x) \left( \int_0^{x} g(t)\,dt \right) dx$$

where f(x) models the amount of time until the opponent's next goal, and g(t) models the amount of time until our team's next goal. In this formula, f(x) is an exponential probability density function with λ=expected goals allowed (Ga), and the inner integral of g from 0 to x is an exponential cumulative distribution function with λ=expected goals scored (Gs):

$$\int_0^{\infty} G_a e^{-G_a x} \left(1 - e^{-G_s x}\right) dx$$

This might look a bit ugly (or maybe not since e^x is such a simple integration), but it simplifies to just:

$$\frac{G_s}{G_s + G_a}$$

This makes perfect sense if we think about the next goal being a goal randomly selected from the distribution of possible goals in the game: the odds that the randomly selected goal comes from our team equal the percentage of total goals we expect to come from our team, and the odds that the randomly selected goal comes from our opponent equal the percentage of total goals we expect to come from them.
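For readers who want the intermediate steps, the simplification can be worked out by splitting the integrand into two exponential terms. The first term is just the total area under an exponential density, which is 1; the second is a plain exponential integral with the combined rate Ga + Gs:

```latex
\begin{aligned}
\int_0^{\infty} G_a e^{-G_a x}\left(1 - e^{-G_s x}\right) dx
  &= \int_0^{\infty} G_a e^{-G_a x}\,dx \;-\; G_a \int_0^{\infty} e^{-(G_a + G_s)x}\,dx \\
  &= 1 - \frac{G_a}{G_a + G_s} \\
  &= \frac{G_s}{G_s + G_a}
\end{aligned}
```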

Now that we have a model for sudden-death OT, we can estimate a team's chances of winning a game with sudden death OT. For example, say we have a hockey game where we expect our team to score 3 goals and allow 2 goals on average. This team would be expected to win in regulation about 58.5% of the time, lose in regulation about 24.7% of the time, and go to OT 16.8% of the time. Once in OT, it will win 3/(3+2)=60% of the time, so its total expected wins is .585 + .6*.168 = .686.
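The hockey example combines a Poisson model for regulation with the Gs/(Gs+Ga) result for overtime. A minimal sketch, where `match_odds` and `sudden_death_win` are names chosen for this example and the two teams' scoring is assumed independent:

```python
import math


def match_odds(lam_for, lam_against, max_goals=30):
    """Return (P(win), P(loss), P(draw)) for independent Poisson goal counts."""
    def pmf(k, lam):
        return math.exp(-lam) * lam**k / math.factorial(k)

    ps = [pmf(k, lam_for) for k in range(max_goals + 1)]
    pa = [pmf(k, lam_against) for k in range(max_goals + 1)]
    win = sum(ps[i] * sum(pa[:i]) for i in range(max_goals + 1))
    loss = sum(pa[i] * sum(ps[:i]) for i in range(max_goals + 1))
    draw = sum(s * a for s, a in zip(ps, pa))
    return win, loss, draw


def sudden_death_win(lam_for, lam_against):
    """P(our team scores first) when both waits for a goal are exponential."""
    return lam_for / (lam_for + lam_against)


reg_win, reg_loss, reg_draw = match_odds(3.0, 2.0)
ot_win = sudden_death_win(3.0, 2.0)
total_win = reg_win + reg_draw * ot_win
print(f"regulation: W {reg_win:.3f}  L {reg_loss:.3f}  OT {reg_draw:.3f}")
print(f"sudden-death win {ot_win:.3f}, overall {total_win:.3f}")
```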

Another interesting use of these distributions is to evaluate different strategies or lineups for a team (given that you can estimate the expected goals scored and allowed for varying lineups/strategies). Returning to the soccer team example where we have a team that we expect to score two goals and allow one, let's say that they are capable of making adjustments that make them stronger defensively, but at the cost of a significant portion of their offense. Say that they can play a defensive game and allow just .38 goals per game, but that doing so reduces their expected offensive output to 1.2 goals per game. In regular league play, the new defensive alignment will still average 2.03 points per game, so there is no benefit to this change.

In a tournament elimination game, however, their win expectancy rises from .735 to .761, because the increase in regulation draws will still lead to a lot of wins (~61% of OT games) instead of just 1-point outcomes. What's more, if they switch back to the more aggressive game in OT (their 2 goals for, 1 goal against form), they can slightly improve their OT win odds (from .608 to .611) by avoiding more shootouts.
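The comparison between the two styles of play can be reproduced with the same machinery. The sketch below is hypothetical scaffolding rather than a definitive model: `match_odds` and `elimination_win` are invented names, shootouts are again assumed to be 50/50, and the OT rates are assumed to be exactly 1/3 of the 90-minute rates.

```python
import math


def match_odds(lam_for, lam_against, max_goals=25):
    """Return (P(win), P(loss), P(draw)) for independent Poisson goal counts."""
    def pmf(k, lam):
        return math.exp(-lam) * lam**k / math.factorial(k)

    ps = [pmf(k, lam_for) for k in range(max_goals + 1)]
    pa = [pmf(k, lam_against) for k in range(max_goals + 1)]
    win = sum(ps[i] * sum(pa[:i]) for i in range(max_goals + 1))
    loss = sum(pa[i] * sum(ps[:i]) for i in range(max_goals + 1))
    draw = sum(s * a for s, a in zip(ps, pa))
    return win, loss, draw


def elimination_win(reg_rates, ot_rates, p_shootout=0.5):
    """P(advance): win in regulation, or draw and then win the OT or the PKs."""
    reg_win, _, reg_draw = match_odds(*reg_rates)
    # OT is 30 of 90 minutes, so the per-match rates are scaled by 1/3.
    ot_win, _, ot_draw = match_odds(ot_rates[0] / 3, ot_rates[1] / 3)
    return reg_win + reg_draw * (ot_win + p_shootout * ot_draw)


AGGRESSIVE = (2.0, 1.0)   # (goals for, goals against) per 90 minutes
DEFENSIVE = (1.2, 0.38)

for name, rates in (("aggressive", AGGRESSIVE), ("defensive", DEFENSIVE)):
    win, _, draw = match_odds(*rates)
    print(f"{name}: league points/game {3 * win + draw:.2f}, "
          f"elimination win {elimination_win(rates, rates):.3f}")

# Defensive in regulation, then back to the aggressive setup for OT.
mixed = elimination_win(DEFENSIVE, AGGRESSIVE)
print(f"defensive, then aggressive in OT: {mixed:.3f}")
```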

Similarly, a sudden death format, where only the ratio of goals scored to goals allowed matters, can also produce different ideal strategies. Doubling both expected goals scored and allowed, for example, would have a significant effect on a team's odds of winning in regulation, but would have no effect on sudden death because it preserves the ratio of offense to defense. Conversely, changes that have no impact on regulation (like going from 2 goals for/1 goal against to 1.2 goals for/.38 goals against in a regular-season-format soccer match) could have a significant impact on sudden death chances (.667 to .759 once you get to sudden death). Of course, any changes in strategy called for by different formats would depend on the team's ability to adapt to a different style of play and on how such changes affect its expected offensive and defensive production. Still, it is possible for an ideal lineup or strategy in one format to not be ideal in another, and using Poisson distributions to find the connection between offensive and defensive production and expected W-L performance is helpful in evaluating potential changes.
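To see how regulation and sudden death respond differently to the same change, the two quantities can be printed side by side. This is a sketch with made-up helper names (`regulation_win`, `sudden_death_win`); the (4, 2) row is simply the (2, 1) team with both rates doubled, and goal counts are assumed to be independent Poisson variables.

```python
import math


def regulation_win(lam_for, lam_against, max_goals=30):
    """P(win in regulation) for independent Poisson goal counts."""
    def pmf(k, lam):
        return math.exp(-lam) * lam**k / math.factorial(k)

    ps = [pmf(k, lam_for) for k in range(max_goals + 1)]
    pa = [pmf(k, lam_against) for k in range(max_goals + 1)]
    return sum(ps[i] * sum(pa[:i]) for i in range(max_goals + 1))


def sudden_death_win(lam_for, lam_against):
    """P(scoring first) in sudden death: depends only on the ratio of rates."""
    return lam_for / (lam_for + lam_against)


for lam_for, lam_against in ((2.0, 1.0), (4.0, 2.0), (1.2, 0.38)):
    print(f"for {lam_for:>4} / against {lam_against:>4}: "
          f"regulation win {regulation_win(lam_for, lam_against):.3f}, "
          f"sudden death {sudden_death_win(lam_for, lam_against):.3f}")
```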
