On Correlation, Regression, and Bo Hart

July 4, 2003. The Cardinals were starting a weekend series against the rival Cubs. They were looking to pull ahead of Houston for the division lead and to hold off Chicago, who stood just a game back.

The Cards had come into the year with high expectations, having averaged 95 wins over the past 3 seasons, and were hoping to capture their first pennant in over 15 years. An early-season hamstring injury to second baseman and lead-off hitter Fernando Vina, however, had seemed a costly blow; doubly so when back-up Miguel Cairo also went down with injury. And yet, the Cards were still hanging tough, thanks in large part to the stellar contributions of Vina's and Cairo's replacement, Bo Hart.

Hart had come out of nowhere. A 33rd round draft pick out of Gonzaga, he had just been promoted to AAA that year at age 26 and was in his 5th year with the organization. He had put up an OPS under .700 in his one year at AA in 2002. No one wanted to have to rely on this guy for production, but he was all the Cards had at this point. Now, a couple weeks into his Major League career, he was flirting with .400.

Hart doubled in the second inning against the Cubs to push the lead to 4-0. He drew a walk and scored in the 4th, singled and scored in the 8th, and then singled in the team's final run of the game in the 9th to aid in an 11-8 victory. Now with 75 PAs, he was hitting .412 with an OBP over .450 and a SLG close to .600.

In case the four-paragraph intro about Bo-freaking-Hart playing the part of Ted Williams (or the title of the post) was too vague a tip-off, this post is about regression to the mean.

Regression to the mean is an important concept in understanding baseball statistics. It is key to answering the question of how much an observed performance (i.e. a player's stats) tells us about his underlying talent. Every player has a true talent for a given ability such as getting hits or getting on base. This true talent represents the probability of him succeeding in a given plate appearance. For example, if a player has a 33% chance of getting on base, .330 represents his true talent OBP skill.

The problem is that we don't know what the true talent probabilities are for any player. All we can do is try to infer what those probabilities might be based on each player's observed performance. In Bo Hart's case, his true talent OBP ability may well have been .300, but it is still possible to observe him performing at a .450 level, especially over small samples. If you give Bo Hart (or anyone else) one million PAs, then his observed performance will almost always be very close to his underlying ability, and those one million PAs will allow you to infer with a high level of certainty what his underlying ability is.

Over 75 PAs, however, that is not the case. His observed performance tells you something about his underlying ability, but whatever inferences you make from those 75 PAs will have much more uncertainty.
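
To make the effect of sample size concrete, here is a minimal simulation sketch; the .300 true-talent figure and everything else in it is assumed purely for illustration, not estimated from real data:

```python
import numpy as np

rng = np.random.default_rng(2003)
TRUE_OBP = 0.300  # assumed true-talent OBP, purely for illustration

for pa in (75, 1_000, 1_000_000):
    # simulate 10,000 identical-talent players over `pa` PAs each
    obs = rng.binomial(pa, TRUE_OBP, size=10_000) / pa
    print(f"{pa:>9} PA: observed OBP ranges from {obs.min():.3f} to {obs.max():.3f}")
```

At 75 PAs, a true .300 hitter can easily show up anywhere from the low .100s to the high .400s; at one million PAs, virtually everyone lands within a couple points of .300.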

At what point can we be relatively certain about our inferences of true talent based on observed performance? 75 PAs is not enough, and one million is plenty, but what about 1000? The answer is that there is no single point where our inferences suddenly become meaningful; they always carry some degree of meaning and some degree of uncertainty (whether at 75 PAs or at one million), and the more PAs we observe, the more that balance slides toward meaning. What is important is not simply whether a stat is meaningful or not, but how much meaning an observed performance carries given its sample size.

We can do this with regression to the mean. What regression to the mean does is basically look at an observed performance (for example, a .450 OBP over 75 PAs) and give us an estimate of the hitter's most likely underlying talent. For example, let's say that when we regress our .450 OBP performance, we get an estimate of true talent of .360. That means that if we take a bunch of hitters who all performed at a .450 level over 75 observed PAs and keep observing them, we will likely see roughly a .360 OBP for the group as a whole outside of the initial 75 PA sample. That performance outside the initial sample is going to be pretty close to the true talent of the group (as long as you observe enough players to get a large sample, at least).
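
To see this selection effect in action, here is a simulation sketch; the .330 league mean and .030 talent spread are assumed numbers for illustration (the same hypothetical spread used later in this post):

```python
import numpy as np

rng = np.random.default_rng(7)
N, PA = 100_000, 75
LEAGUE_MEAN, TALENT_SD = 0.330, 0.030   # assumed talent distribution

talent = rng.normal(LEAGUE_MEAN, TALENT_SD, N)
first_75 = rng.binomial(PA, talent) / PA   # each hitter's first 75 PAs
next_75 = rng.binomial(PA, talent) / PA    # same hitters, another 75 PAs

hot = first_75 >= 0.450                    # the "Bo Hart" group
print(f"hot group, first 75 PA: {first_75[hot].mean():.3f}")  # ~.47
print(f"hot group, next 75 PA:  {next_75[hot].mean():.3f}")   # ~.36
print(f"hot group, true talent: {talent[hot].mean():.3f}")    # ~.36
```

The hitters who hit .450 or better over their first 75 PAs were not .450 hitters; they were mostly ordinary hitters who got lucky, and their performance outside the lucky sample settles near their true talent.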

A few things to keep in mind about regression to the mean:

-The regressed estimate of true talent is always going to be closer to the league average than the observed performance is (hence the name).
-Regression toward the mean works as a percentage of the difference between the observed performance and the league average. For example, if you would regress a .450 observed OBP to .360, you might only regress a .400 observed OBP to .350; what is important is that you are regressing 75% (or however much) toward the mean, not that you are regressing .090 points toward the mean.
-While the regressed estimate is our best inference about a player's underlying talent, and while this estimate works well for groups of players, sometimes this estimate will be wrong for individual players. If we have a group of players who have a regressed estimate of .360 for their true OBP talent, then some of them will actually be .300 OBP hitters, and some of them will actually be .400 OBP hitters. However, there is no way to tell which hitters are which by looking at their observed performances; all we know is that they are likely around .360 as a whole, and that is our best guess (with some degree of uncertainty) for each hitter in that group.
-The amount of regression toward the mean depends both on the nature of the stat and on the size of the observed sample.

The final point is particularly important. For a given stat, we will always observe a variety of performances across the league. If we are looking at OBPs for players league-wide, we will see some guys around .300, and some guys around .400, and all sorts of values in between, plus a handful even more extreme than that. There are two main reasons we observe varying performances: players have varying underlying talents, and the performances we observe have random variation around the players' underlying talents.

These two sources of variation in the observed performances across the league are the key to regression to the mean. Specifically, how much of the variance of observed performances comes from the variance of players' underlying talents and how much comes from random variation determines how much we need to regress a particular stat. If 50% of the observed variance is due to the variance of underlying talent and 50% is due to random variation, then you regress 50% toward the mean. If 70% of the observed variance is from the spread of underlying talent and 30% is from random variation, then you regress 30% toward the mean.

The balance between these two variances (the spread of underlying talent and random variation) will depend on both the nature of the stat you are looking at and the sample size you are looking at. Specifically, the variance in the spread of talent depends on the nature of the stat, while the variance due to random variation depends on the sample size. If the spread of true OBP talent across the league has a standard deviation of .030 points, then the spread of underlying talent follows the same distribution whether you look at 1 PA or 1000 PAs (assuming you are looking at the same group of hitters, at least). This variance does not depend on the sample size. The amount of random variation around each player's true talent, on the other hand, is purely (for the most part, anyway) a function of sample size.

Because the random variation continually drops as the sample size rises while the spread of underlying talent remains the same, it is easy to see how increasing sample sizes will affect regression to the mean: the percentage of total variation from the static distribution of underlying talent will constantly rise as the random variation continues to drop, and you will have to regress by smaller amounts. Furthermore, this illustrates how regression to the mean is dependent on the spread of talent across the league. The larger the variance in the spread of talent, the less you need to regress for a given sample size, because the random variation associated with that sample size will account for a smaller portion of the overall variance.
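
If we model each PA as an independent success with probability p, the random variance of an observed OBP over n PAs is roughly p(1-p)/n, which makes this balance easy to compute directly. A sketch, again assuming a .330 league OBP and a .030 standard deviation spread of talent:

```python
LEAGUE_MEAN = 0.330        # assumed league OBP
TALENT_VAR = 0.030 ** 2    # assumed variance of the spread of true talent

def regression_pct(pa):
    """Share of observed variance that is random, i.e. how far to regress."""
    random_var = LEAGUE_MEAN * (1 - LEAGUE_MEAN) / pa
    return random_var / (random_var + TALENT_VAR)

for pa in (75, 300, 1_000, 10_000):
    print(f"{pa:>6} PA: regress {regression_pct(pa):.0%} toward the mean")
```

The random share (and with it the regression amount) falls steadily as the PAs pile up, while the talent variance stays put.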

The trick to regression to the mean is figuring out the balance of these two variances for a given stat and sample size. This can be done with correlations. Say that you have observed several hitters over 75 PAs each and recorded each hitter's OBP over those 75 PAs. Next, you repeat the test; that is, you observe the same hitters over another 75 PAs. The correlation between the results of the initial test (the first 75 PAs for each hitter) and the results of the second test tells you how much you have to regress OBP at 75 PAs. If the correlation is .25, that means the variance from the spread of talent accounts for 25% of the variance in the sample, and so you need to regress the observed results 75% toward the mean.
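
Here is that procedure as a simulation sketch, reusing the assumed talent distribution from above:

```python
import numpy as np

rng = np.random.default_rng(42)
talent = rng.normal(0.330, 0.030, 100_000)   # assumed talent distribution
sample1 = rng.binomial(75, talent) / 75      # first 75 PAs per hitter
sample2 = rng.binomial(75, talent) / 75      # another 75 PAs, same hitters

r = np.corrcoef(sample1, sample2)[0, 1]
print(f"r = {r:.2f}, so regress {1 - r:.0%} toward the mean at 75 PA")
```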

If you have taken statistics courses, this last part may seem off to you. There is a number that represents the percentage of variance in one variable that is explained by another variable, and it is not the correlation coefficient (which is also called "r", by the way). It is instead the coefficient of determination, which is equal to the square of the correlation coefficient (also called "r squared", oddly enough). Why, then, am I saying that the correlation coefficient tells us how much of the variance comes from the spread of talent as opposed to random variation?

The two variables we are correlating in this exercise are both observed samples, each of which contains random variation in each player's observed OBP around that player's true talent OBP. What we care about, however, is not the relationship between two separate observed samples, but the relationship between each observed sample and the distribution of underlying talent for the players in the sample. If we get a value for r between the two observed samples of .25, then that means that the variance of one sample only explains 6.25% of the variance of the other sample, but that doesn't tell us how much of the variance of each sample is explained by the variance of true talent.

Remember that each observed sample contains two sources of variance: the spread of true talent, and random variation. When we look at two identical samples, the true talent of each player is the same in both samples*. Therefore, the variance due to the spread of talent in one sample explains 100% of the variance due to the spread of talent in the other sample. Since the rest of the variance in each sample is random, it is going to be completely uncorrelated, and the random variation of one sample explains 0% of the random variation of the other sample.

We know that 6.25% of the variance of one sample is explained by the variance of the other sample, and that all of that 6.25% is from the amount of variance explained in each sample by the spread of talent. Given that, how much of the variance of each individual sample is explained by the spread of talent? The answer is the square root of .0625, which is .25. The reason is as follows.

Assume that 25% of the variance of each sample is due to the spread of talent and 75% is due to random variation. Now, how much of the variation in one sample will be explained by the variation of the other sample? First, 75% of the variation we need to explain is random, per the assumption above, which leaves 25% of the variation that can relate to the other sample. But only 25% of the variation of the other sample is non-random, so the relationship between the variances of the two samples is:

25% of the variance from one sample explains 25% of the variance of the second sample, which is equivalent to .25*.25=.0625.

This matches what we see in reality. When 6.25% of the variance of one observed sample is explained by the variance of a separate observed sample of the same players over the same number of PAs, then 25% of the variance of each sample is explained by the spread of talent across the sample, and 75% is explained by random variation. Therefore, when r is .25 between two observed samples, then the r^2 between each observed sample and the true talent of each player is also .25 (since it is just the square root of the r^2 between the observed samples), which is equal to an r of .5.

That is why it is the correlation coefficient between observed samples that we care about and not the coefficient of determination; the former, and not the latter, is what represents r^2 between each observed sample and each player's true talent, and therefore tells us how much of the variation of the observed sample is explained by non-random variation. This is what tells us how much to regress an observed performance.
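
In a simulation we can check this directly, because, unlike in real data, the simulated true talent is known. Continuing the sketch from above (same assumed talent distribution):

```python
import numpy as np

rng = np.random.default_rng(42)
talent = rng.normal(0.330, 0.030, 100_000)   # assumed talent distribution
sample1 = rng.binomial(75, talent) / 75
sample2 = rng.binomial(75, talent) / 75

r_samples = np.corrcoef(sample1, sample2)[0, 1]
r_truth = np.corrcoef(sample1, talent)[0, 1]
print(f"r between the two observed samples: {r_samples:.3f}")
print(f"sqrt of that r:                     {np.sqrt(r_samples):.3f}")
print(f"r between a sample and true talent: {r_truth:.3f}")  # ~ the sqrt above
```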

Once you know how much to regress a stat for a given sample size, you can estimate the amount of regression for any sample size. You do this by figuring out how many PAs are represented by the amount of regression you need to do. If you have to regress 75% for a 75 PA sample, then that means that the 75 observed PAs need to account for 25% of the final regressed estimate. 75 is 25% of 300, which means you have to add 225 PAs of the league average to the observed performance to regress toward the mean. If you do not know the value of r for different sample sizes, then you can still use this 225 PA figure. Simply add 225 PAs of the league average for the stat you are regressing to the observed performance, no matter how many PAs are in the observed sample.
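
As a quick sketch, regressing by adding league-average PAs might look like this; the 225 PA constant is the one derived above from the 75 PA / 75% example, and the .330 league average is assumed:

```python
def regress_to_mean(observed, pa, league_avg, regression_pa=225):
    """Blend the observed rate with `regression_pa` PAs of league average."""
    return (observed * pa + league_avg * regression_pa) / (pa + regression_pa)

# Bo Hart's .450 OBP over 75 PAs, assuming a .330 league OBP:
print(round(regress_to_mean(0.450, 75, 0.330), 3))   # 0.360
```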

This is simple enough if you have the correlation between two observed samples of the same size. However, what if your two samples are not the same size? For example, say you have one sample where each hitter has 75 PAs, and another sample where each hitter has 150 PAs, and the correlation between the two samples is .25. How much do you regress? If you regress the 75 PAs 75%, then you can't also regress the 150 PAs 75%, because we know the amount of regression has to diminish as the sample size increases. We can't just choose either the 75 PA or the 150 PA sample to regress 75% since there is no reason to prefer one choice over the other. Rather, the 75 PA sample will be regressed something more than 75%, and the 150 PA sample will be regressed something less.

To figure out how much to regress in this situation, we should keep in mind the relationship of the correlation between the two observed samples to the correlation of each sample to true talent. When the two observed samples have the same number of PAs, the r between the two samples equals the r^2 between each sample and the true talent of the sample. When the samples are different sizes, similar logic leads us to the conclusion that the r between the two samples is equal to the product of the r's between each sample and the true talent of the sample (note that this simplifies to r^2 when both samples have the same number of PAs). For example, let's say that we already figured out the proper regressions for both 75 PA and 150 PA, and we found that at 75 PA, we regress 81%, and at 150 PA, we regress 68%. Those regression percentages represent one minus the r^2 between each sample and true talent, so to find each r, we take the square root of one minus each figure:

sqrt(1-.81)=.44
sqrt(1-.68)=.57

Those are the r figures between each sample and true talent, so to find the expected r between observed samples of 75 and 150 PA, multiply those two values:

.44*.57=.25

So, if you have a correlation of .25 between observed samples of 75 and 150 PA, you would regress the 75 PA sample 81% toward the mean and the 150 PA sample 68% toward the mean. Both of these correspond to adding 313 PA of average production to the observed performance (we use those figures because that is the point where the regression component has the same weight in both regressions). Additionally, once we know that we have to add 313 PA to each sample to properly regress, we can figure out what size sample we would need to regress 75% if both samples had the same number of PAs. This ends up being about 104 PA, which means that having two samples of 75 and 150 PA that correlate at r=.25 is the same as having 104 PA in each sample. You may notice that 104 is fairly close to the harmonic mean of 75 and 150 (100), but that doesn't have to be the case if the two sample sizes are particularly far apart (for example, if the two samples were 75 PA and 1000 PA). When the correlation is close to one or when the numbers of PA in the two samples aren't that different, the harmonic mean should work pretty well, but as a general rule, it doesn't have to.
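
The arithmetic from this paragraph, as a short sketch:

```python
X = 313   # PAs of league average to add (derived from the quadratic below)

print(f"75 PA:  regress {X / (75 + X):.0%}")    # 81%
print(f"150 PA: regress {X / (150 + X):.0%}")   # 68%

# equal-sample size that regresses 75%:  n / (n + 313) = .25  ->  n = 313 / 3
print(f"equivalent equal samples: {X / 3:.0f} PA each")             # ~104
print(f"harmonic mean of 75 and 150: {2 / (1/75 + 1/150):.0f} PA")  # 100
```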

If you don't know the proper regression amounts for each sample size ahead of time (which you don't if you are using this method; otherwise you wouldn't need to do this), then it is a bit trickier to figure out how much to regress each sample. You have to use the quadratic formula to find the 313 PA figure, and then you can use that to figure out how much to regress. The quadratic formula is:

x = (-b + sqrt(b^2 - 4ac))/(2a)

x in this case will tell you how many PA of average to add to an observed performance to regress. To solve for x, you need to know the number of PAs in each sample, as well as the r^2 between the two samples (not r, but if you have r, just square it). Set a, b, and c as follows, and then plug them into the quadratic equation above:

a = 1
b = PA1 + PA2
c = PA1 * PA2 * (r^2-1) / r^2

Plugging in PA1=75, PA2=150, and r^2=.0625, that gives a value of x=313. Since 75 is 19% of 75+313, the observed performance of the 75 PA sample makes up 19% of the regressed value, which means you regress 81%. Same thing with the 150 PA sample to find that you regress 68%, just like we saw above.
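
Putting the whole procedure into code, a minimal sketch:

```python
from math import sqrt

def regression_constant(pa1, pa2, r):
    """PAs of league average to add, given the r between two observed samples."""
    r2 = r ** 2
    a, b, c = 1, pa1 + pa2, pa1 * pa2 * (r2 - 1) / r2
    return (-b + sqrt(b ** 2 - 4 * a * c)) / (2 * a)

x = regression_constant(75, 150, 0.25)
print(f"add {x:.0f} PA of league average")     # 313
print(f"75 PA:  regress {x / (75 + x):.0%}")   # 81%
print(f"150 PA: regress {x / (150 + x):.0%}")  # 68%
```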

If you have read this far, I should probably apologize for suckering you in with a nice story about Bo Hart and then turning this into a math post. I could have gone on with Bo Hart's story following that game, but honestly, it's not any prettier than a few thousand words about math, so let's just leave Bo Hart be. The important thing here is that any time you want to make conclusions about a player or team based on their observed performance, understanding how much your observed sample tells you about the underlying ability is key to understanding what to expect going forward, and regression to the mean is central to that understanding. Having a general idea of which stats regress how much over a given sample is an important part of figuring out how much the stats you are looking at tell you about a player's ability. Pizza Cutter published some figures that are particularly useful here, which are summarized on FanGraphs (he presents the PA totals at which you would regress 30% toward the mean for several common offensive stats).

The relationship between r and regression to the mean is particularly important. While r^2 tells us how much of the variance of one variable is explained by another, we cannot directly compare our observed samples to the distribution of underlying talent; we can only infer that relationship by comparing two observed samples. In that case, the observed r, and not r^2, represents the amount of variance explained by underlying talent rather than random variation. If we could measure true talent directly, then we would want to use r^2. But since the two variables we are correlating each include random variation around each player's true talent, knowing how much one variable explains the variation in the other doesn't directly tell us how the samples relate to true talent, which is what we care about for projecting performance. This is why we use r and not r^2 in determining how much to regress an observed performance toward the mean.

Further reading:
-Tangotiger's response to Pizza Cutter's work
-Phil Birnbaum on r vs. r^2
-Tangotiger/TheBookBlog community on r vs r^2 (this thread started just as I began writing this post, so I am not sure where the discussion will end up)


*This article considers the simplified scenario in which each player's true talent is identical in both samples. In reality, each player's true talent changes over time and is also affected by things like park, opposing pitchers and defenses, etc. One way to minimize this problem is to split a single time period into odd and even PAs and use those as your two samples, rather than taking two samples from different time periods, so that it is virtually impossible for a player's talent to change much between your samples (though it is still possible, if less likely, for the samples to be affected by things like park and opposition). This is what Pizza Cutter did in the linked study.

1 comment:

Martin Monkman said...

Bo Hart is an interesting case, so I've used your blog post as a jumping off point for one of my own.
http://bayesball.blogspot.com/2010/11/bo-knows-probability.html
