Regression to the Mean and Beta Distributions

This morning, a discussion of regression to the mean popped up on Phil's and Tango's blogs. This discussion touches upon some of the recent work I've been doing with Beta distributions, so I figured I'd go ahead and lay out the math linking regression to the mean to Bayesian updating with a Beta prior.

Many of the events we measure in baseball are Bernoulli trials, meaning we simply record whether they happen or not for each opportunity. For example, whether or not a team wins a game, or whether or not a batter gets on base are Bernoulli trials. When we observe these events over a period of time, the results follow a binomial distribution.

When we observe these binomial events, each team or player has a certain amount of skill in producing successes. Said skill level will vary from team to team or player to player, and, as a result, we will observe different results from different teams or players. Albert Pujols, for example, has a high degree of skill at getting on base compared to the whole population of MLB hitters, and we would expect to observe him getting on base more often than, say, Emilio Bonifacio.

The variance in talent levels is not the only thing driving the variance in observed results, however. As with any binomial process (excepting those with 0% or 100% probabilities, anyway), there is also random variance as described by the binomial distribution. Even if Albert's on-base skill is roughly 40%, and Bonifacio's is roughly 33%, it is still possible that you will occasionally observe Emilio to have a higher OBP than Albert over a given period of time.
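To make that concrete, here's a quick simulation (a minimal sketch in Python; the .400 and .330 talent levels are the rough figures from above, treated as known):

import numpy as np

# Simulate 500-PA samples for two hitters with assumed true OBP talents
# of .400 (Pujols) and .330 (Bonifacio)
rng = np.random.default_rng(0)
n_pa, n_seasons = 500, 100_000
pujols = rng.binomial(n_pa, 0.400, n_seasons) / n_pa
bonifacio = rng.binomial(n_pa, 0.330, n_seasons) / n_pa
# Fraction of samples where the lower-talent hitter posts the higher OBP
print((bonifacio > pujols).mean())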

A practical problem in baseball is that we do not know the true probability linked to each team's or player's skill, only their observed rate of success. Thus, if we want to know the true talent probability, we have to estimate it from the observed results.

One way to do this is with regression to the mean. Say that we have a player with a .400 observed OBP over 500 PAs, and we want to estimate his true talent OBP. Regression to the mean says we need to find out how much, on average, our observed sample will reflect the hitter's true talent OBP, and how much it will reflect random binomial variation. Then, that will tell us how many PAs of the league average we need to add to the observed performance to estimate the hitter's true talent.

For example, say we decide that the number of league average PAs we need to add to regress a 500 PA sample of OBP is 250. We would take the observed performance (200 times on base in 500 PAs), and add 82.5 times on base in 250 PAs (i.e. the league average performance, assuming league average is about .330) to that.

(200 + 82.5) / (500 + 250) = 282.5/750 ≈ .377

Therefore, regression to the mean would estimate the hitter's true OBP talent at .377.

As Phil demonstrated, once you decide that you have to add 250 PAs of league average performance to your sample to regress, you would use that same 250 PA figure to regress any OBP performance, regardless of how many PAs are in the observed sample. Whether you have 10 observed PAs or 1000 observed PAs, the amount of average performance you have to add to regress does not change.
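In code, that regression calculation looks like this (a minimal sketch; the .330 league average and the 250 PA figure are the values from the example above):

def regress(times_on_base, pa, league_obp=0.330, k=250):
    # Add k PAs of league-average performance to the observed sample
    return (times_on_base + league_obp * k) / (pa + k)

print(regress(200, 500))  # the example above: ~.377
print(regress(4, 10))     # a .400 OBP over only 10 PAs regresses nearly to the mean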

Now, how would one go about finding that 250 PA figure? One way is to figure out the number of PAs at which the random binomial variance is equal to the variance of true talent in the population.

Start by taking the observed variance in the population. You would look at all hitters over a certain number of PAs (say, 500, for example), and you might observe that the variance in their observed OBPs is about .00132, with the average about .330. The observed variance is equal to the sum of the random binomial variance and the variance of true OBP talent across the population of hitters. We don't know the variance of true talent, but we can calculate the random binomial variance as p(1-p)/n, where p is the probability of getting on base (.330 for our observed population) and n is the observed number of PAs (500 in this case). For this example, that would be about .00044. Therefore, the variance of true talent in the population is approximately:

.00132 - .00044 = .00088

Next, we find the number of PAs where the random binomial variance will equal the variance of true talent:

p*(1-p)/n = true_var

.330*(1-.330)/n = .00088

n = .330*(1-.330)/.00088 ≈ 250
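The same arithmetic in Python (a sketch using the rounded figures from this example; with unrounded intermediate values n comes out slightly above 250):

league_obp = 0.330
observed_var = 0.00132  # variance of observed OBP across hitters at 500 PA
binom_var = league_obp * (1 - league_obp) / 500  # random binomial variance: ~.00044
true_var = observed_var - binom_var              # true-talent variance: ~.00088
# Number of PAs at which random binomial variance equals true-talent variance
n = league_obp * (1 - league_obp) / true_var
print(n)  # roughly 250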


We can also approach the problem of estimating true talent from observed performance using Bayesian probability. In order to use Bayes, we need to make an assumption about the distribution of true talent in the population the hitter is being drawn from (i.e. the prior distribution). We will assume that true talent follows a Beta distribution.

Return now to our .400 observed OBP example. Bayes says the posterior distribution (i.e. the distribution of possible true talents for a hitter drawn from the prior distribution after observing his performance) is proportional to the product of the prior distribution and the likelihood function (i.e. the binomial distribution, which gives the likelihood of the observed performance for each possible true OBP).

The prior Beta distribution is:

x^(α-1) * (1-x)^(β-1) / B(α,β)

where B(α,β) is a constant equal to the Beta function with parameters α and β.

The binomial likelihood for observing s successes in n trials (i.e. the observed on-base performance) is:

n!/(s!(n-s)!) * x^s * (1-x)^(n-s)

where x is the true probability of a success.

Next, we multiply the prior distribution by the likelihood function:

x^(α-1) * (1-x)^(β-1) / B(α,β) * n!/(s!(n-s)!) * x^s * (1-x)^(n-s)


Combining the exponents for the x and (1-x) factors:


x^(α+s-1) * (1-x)^(β+n-s-1) / B(α,β) * n!/(s!(n-s)!)


Separating the constant factors from the variables:

n!/(s!(n-s)! * B(α,β)) * x^(α+s-1) * (1-x)^(β+n-s-1)


This product is proportional to the posterior distribution, so the posterior distribution will be the above multiplied by some constant in order to scale it so that the cumulative probability equals one. Since the left portion of the above expression is already a constant, we can simply absorb that into the scaling constant, and the final posterior distribution then becomes:


C * x^(α + s - 1) * (1-x)^(β + n - s - 1)


Notice that the above distribution conforms to a new Beta distribution with parameters α+s and β+n-s, and with a constant C = 1/B(α+s,β+n-s). When the prior distribution is a Beta distribution with parameters α and β and the likelihood function is binomial, then the posterior distribution will also be a Beta distribution, and it will have the parameters α+s and β+n-s.
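This conjugacy is easy to check numerically. A minimal sketch with scipy (the prior parameters 10 and 20 are arbitrary, chosen just for the check):

import numpy as np
from scipy.stats import beta, binom

a, b = 10, 20     # arbitrary prior parameters for the check
s, n = 200, 500   # the observed performance from the example
x = np.linspace(0.25, 0.55, 31)  # grid where neither density underflows
unnormalized = beta.pdf(x, a, b) * binom.pmf(s, n, x)
posterior = beta.pdf(x, a + s, b + n - s)
# The ratio is constant in x, so prior * likelihood is proportional
# to a Beta(a+s, b+n-s) distribution
ratio = unnormalized / posterior
print(np.allclose(ratio, ratio[0]))  # True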

We still need to choose values for the parameters α and β for the prior distribution. Recall from the regression example that we found a mean of .330 and a variance of .00088 for the true talent in the population (i.e. the prior distribution), so we will choose values for α and β that give us those values. For a Beta distribution, the mean is equal to:

α/(α+β)

and the variance is equal to:

αβ / ((α+β)^2 * (α+β+1))


Substituting m = α/(α+β) for the mean reduces the variance formula to m(1-m)/(α+β+1), so setting m = .330 and the variance to .00088 gives α+β ≈ 250, and values for α and β of approximately 82.5 and 167.5 respectively. That means the posterior distribution will have as parameters:

α+s = 82.5 + 200 = 282.5
β+n-s = 167.5 + 500 - 200 = 467.5

and a mean of

282.5 / (282.5 + 467.5) = 282.5/750 ≈ .377

As you can see, this is identical to the regression estimate. This will always be the case as long as the prior distribution is Beta and the likelihood is binomial. We can see why if we derive the regression constant (the number of PAs of league average we need to add to the observed performance in order to regress) from the prior distribution.
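First, though, here's a quick numeric check of the whole chain (a sketch; the exact method-of-moments values differ slightly from the rounded 82.5 and 167.5):

m, v = 0.330, 0.00088     # prior mean and variance from the regression example
ab = m * (1 - m) / v - 1  # alpha + beta, since the variance is m(1-m)/(alpha+beta+1)
alpha, beta_ = m * ab, (1 - m) * ab
print(alpha, beta_)       # ~82.6 and ~167.7
s, n = 200, 500
print((alpha + s) / (alpha + beta_ + n))  # ~.377, matching the regression estimate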

Recall that the regression constant can be found by finding the point where random binomial variance equals prior distribution variance. Therefore:

p(1-p)/k ≈ prior variance

where k is the regression constant and p is the population mean.

p(1-p)/k ≈ αβ / ((α+β)^2 * (α+β+1)), with p ≈ α/(α+β)

α/(α+β) * (1 - α/(α+β)) / k ≈ αβ / ((α+β)^2 * (α+β+1))
α/(α+β) - α^2/(α+β)^2 ≈ k * αβ / ((α+β)^2 * (α+β+1))
(α(α+β) - α^2) / (α+β)^2 ≈ k * αβ / ((α+β)^2 * (α+β+1))
α(α+β) - α^2 ≈ k * αβ / (α+β+1)
α^2 + αβ - α^2 ≈ k * αβ / (α+β+1)
αβ ≈ k * αβ / (α+β+1)
1 ≈ k / (α+β+1)
k ≈ α+β+1


Since α and β for the prior in our example are 82.5 and 167.5, k would be 82.5 + 167.5 + 1 = 251.

This estimate of k is actually biased, because it assumes a random binomial variance based only on the population mean, whereas the actual random binomial variance for the prior distribution will be the average binomial variance over the entire distribution. In other words, not all of the population will have a .330 OBP skill; some hitters will have a .300 skill, while others will have a .400 skill, and they will all have different binomial variances associated with them. More precisely, the random binomial variation for the prior distribution will be the following definite integral taken from 0 to 1:


∫ x(1-x)/k * B(x;α,β) dx


which, conceptually, is the weighted average of the binomial variances for each possible value from the prior distribution, where each binomial variance is weighted by the probability density function of the prior. Expanding the Beta density and pulling the constants outside the integral gives:

1/(k * B(α,β)) * ∫ x(1-x) * x^(α-1) * (1-x)^(β-1) dx


1/(k * B(α,β)) * ∫ x^α * (1-x)^β dx



The definite integral is in the form of the Beta function B(α+1,β+1), so we can rewrite this as


B(α+1,β+1) / (k * B(α,β))


The Beta function can be written in terms of the Gamma function in the following manner:

B(α,β) = Γ(α)*Γ(β) / Γ(α+β)

Replacing the two Beta functions with their Gamma equivalents:


Γ(α+1) * Γ(β+1) * Γ(α+β) / (k * Γ(α) * Γ(β) * Γ(α+β+2))


This form is useful because the Gamma function has the property that Γ(x+1)/Γ(x) = x, so the above reduces to:


αβ * Γ(α+β) / (k * Γ(α+β+2))


Furthermore, since Γ(x+1)/Γ(x) = x, it follows that Γ(x+2)/Γ(x+1) = x+1. If we multiply those two equations together, we find that

Γ(x+1)/Γ(x) * Γ(x+2)/Γ(x+1) = x(x+1)

Γ(x+2)/Γ(x) = x(x+1)

Γ(x)/Γ(x+2) = 1/(x(x+1))


Therefore


αβ * Γ(α+β) / (k * Γ(α+β+2)) = αβ / (k * (α+β) * (α+β+1))



Now that we have a manageable expression for the random binomial variance of the prior distribution, we return to the requirement that random binomial variance equals the variance of the prior distribution:


αβ / (k * (α+β) * (α+β+1)) = αβ / ((α+β)^2 * (α+β+1))


k * (α+β) * (α+β+1) = (α+β)^2 * (α+β+1)


k = α+β


Using a more precise calculation for the random binomial variance of the prior, we find that k = α+β rather than α+β+1. Note that when we estimate k by assuming a constant binomial variance of p(1-p)/k, we get a value of k exactly 1 higher than when we run the full calculation for the binomial variance. This is useful because the former calculation is much simpler than the latter, so we can calculate k by using the former method and then subtracting 1. Also note that the 250 value we got in the initial regression to the mean example would also be 1 too high if we were using more precise figures; I've just been rounding them off for cleanliness' sake.
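That full calculation is easy to verify numerically (a sketch with scipy, using the prior from the example):

from scipy.integrate import quad
from scipy.stats import beta

a, b = 82.5, 167.5
# Average per-trial binomial variance x(1-x) over the prior, two ways
numeric, _ = quad(lambda x: x * (1 - x) * beta.pdf(x, a, b), 0, 1)
closed_form = a * b / ((a + b) * (a + b + 1))
print(numeric, closed_form)  # agree
# Setting the average binomial variance over k trials equal to the
# prior variance gives k = a + b
prior_var = a * b / ((a + b) ** 2 * (a + b + 1))
print(numeric / prior_var)   # 250.0 = a + b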

Let's look now at the calculation for regression to the mean:

true talent estimate = (s+pk)/(n+k)

where s is the observed successes, n is the observed trials, p is the population mean, and k is the regression constant.

We know from our prior that p=α/(α+β) and k=α+β, so

(s + pk)/(n + k)

= (s + (α+β) * α/(α+β)) / (n + α + β)

= (α + s) / (α + β + n)


And what does Bayes say? Our posterior is a Beta with parameters α+s and β+n-s, which has a mean

(α + s) / ((α + s) + (β + n - s)) = (α + s) / (α + β + n)


So Bayes and regression to the mean produce identical talent estimates under these conditions (a binomial process where true talent follows a Beta distribution).

k is far easier to estimate directly (such as by using the method in the initial regression to the mean example) than α and β, so we would typically calculate α and β from k. To do that, we use the fact that p = α/(α+β) and that k = α+β, so by substitution we can easily find that:

α=kp
β=k(1-p)

where k is the regression constant and p is the population mean.
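In code (a trivial sketch using the figures from this post):

k, p = 250, 0.330
alpha, beta_ = k * p, k * (1 - p)
print(alpha, beta_)  # 82.5 and 167.5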

We can also see that the regression amount will be constant regardless of the number of observed PAs, because when we take our Bayesian talent estimate:

(α + s) / (α + β + n)

we see that we are always adding the quantity kp (as substituted for α) to the observed successes (s), and always adding the quantity k (as substituted for (α+β)) to the observed trials (n), no matter what observed values we have for s and n. The amounts we add to the observed successes and trials depend only on the parameters of the prior, which do not change.
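A final sketch makes the point: whatever the observed sample, the same kp = 82.5 successes and k = 250 trials get added.

alpha, beta_ = 82.5, 167.5  # kp and k(1-p) from above
for s, n in [(4, 10), (200, 500), (400, 1000)]:
    print(s, n, (alpha + s) / (alpha + beta_ + n))
# Each estimate is (s + 82.5) / (n + 250): the added amounts never change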

