3-D Baseball: Math Behind Regression with Changing Talent Levels (THT Article)

In my article "Regression with Changing Talent Levels: the Effects of Variance" at the Hardball Times, I discuss how day-to-day changes in players' true talent levels reduce the variance of talent across the population when talent is averaged over a longer sample. In other words, the spread in players' average talent over a 100-game sample will be smaller than the spread in talent over a one-game sample. In the article, I gave the following formula to calculate how much the spread in talent is reduced, which I will explain further here:
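For reference, one way to write the formula is the form that the derivation in this post arrives at. The THT article may arrange the terms differently; here r is the day-to-day correlation of talent and n is the number of days:

```latex
\frac{\operatorname{var}(\text{talent over } n \text{ days})}{\operatorname{var}(\text{talent over one day})}
  = \frac{n\,(1 - r^{2}) - 2r\,(1 - r^{n})}{n^{2}\,(1 - r)^{2}}
```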

*Note: in the THT article, I used d for the number of days instead of n to avoid confusion with another formula that was referenced from a previous article, which used n for something else. For this article, I'm just going to use n for the number of days.

The value given by the formula is the ratio of talent variance over n days to the talent variance for a single day. In other words, the variance in talent drops by a multiplicative factor that is dependent on the length of the sample and the correlation of talent from day to day.

Now, how do we get that formula?

If we only have two days in our sample, it is not too difficult to calculate the drop in talent variance. Let t₀ be a variable representing player talent levels on Day 1, and t₁ be a variable representing player talent levels on Day 2. We want to find the variance of the average talent levels over both days, or (t₀+t₁)/2.

The following formula gives us the variance of the sum of two variables:
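In standard notation, this is the usual identity for the variance of a sum of two variables:

```latex
\operatorname{var}(t_0 + t_1) = \operatorname{var}(t_0) + \operatorname{var}(t_1) + 2\,\operatorname{cov}(t_0, t_1)
```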

The covariance is directly proportional to the correlation between the two variables and is defined as follows:
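Written out, with r as the correlation between the two days and sd as the standard deviation, the standard definition is:

```latex
\operatorname{cov}(t_0, t_1) = r \cdot \mathrm{sd}_{t_0} \cdot \mathrm{sd}_{t_1}
```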

(Note that sd_t₀ · sd_t₁ = var_t₀ = var_t₁, because both days have the same standard deviation and therefore the same variance.)
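Putting these pieces together for the two-day case, and writing var_t for the shared single-day variance (a shorthand assumed here), the substitution should work out as:

```latex
\operatorname{var}(t_0 + t_1)
  = \operatorname{var}_t + \operatorname{var}_t + 2r \operatorname{var}_t
  = 2 \operatorname{var}_t \,(1 + r)
```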

Before we continue, there is an important thing to note. Because we are trying to derive a formula for a ratio (variance in talent over n days divided by variance in talent over one day), we don't necessarily need to calculate the numerator and denominator of that ratio exactly. As long as we can calculate values that are proportional to those values by the same factor, the ratio will be preserved.

Technically, we want the variance of the value (t₀+t₁)/2 and not just t₀+t₁, which would be var_t(1+r)/2 instead of 2var_t(1+r), since dividing a variable by 2 divides its variance by 4. However, those two values are proportional, so it doesn't really matter for now which we calculate as long as we can also calculate a value for the denominator that is proportional by the same factor.

For two days, the above calculations are simple enough. Once you start adding more days, however, it starts to get more complicated. Fortunately, the above math can also be expressed with a covariance matrix:

	t₀	t₁
t₀	var₀	cov_0,1
t₁	cov_0,1	var₁

The variance of the sum t₀+t₁ is equal to the sum of the terms in the covariance matrix, which you can see just gives us the formula: var_t₀+t₁ = var_t₀ + var_t₁ + 2cov_t₀,t₁. The covariance matrix is convenient because it can be expanded for any number of days:

Covariance matrix between talent n days apart

	t₀	t₁	t₂	t₃	...	t_n-1
t₀	var₀	cov_0,1	cov_0,2	cov_0,3	...	cov_0,n-1
t₁	cov_0,1	var₁	cov_1,2	cov_1,3	...	cov_1,n-1
t₂	cov_0,2	cov_1,2	var₂	cov_2,3	...	cov_2,n-1
t₃	cov_0,3	cov_1,3	cov_2,3	var₃	...	cov_3,n-1
⋮	⋮	⋮	⋮	⋮	⋱	⋮
t_n-1	cov_0,n-1	cov_1,n-1	cov_2,n-1	cov_3,n-1	...	var_n-1
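As a quick numerical check of the two-day case, here is a minimal Python sketch that sums a 2 × 2 covariance matrix and compares the result with 2var_t(1+r). The values of var_t and r are illustrative assumptions, not figures from the article, and matrix_sum is a helper name introduced only for this example.

```python
def matrix_sum(matrix):
    """Add up every entry of a matrix given as a list of rows."""
    return sum(sum(row) for row in matrix)


# Illustrative values only (assumed, not taken from the article).
var_t = 4.0  # single-day talent variance
r = 0.9      # correlation of talent from one day to the next

cov_01 = r * var_t  # covariance when both days share the same variance
cov_matrix = [
    [var_t, cov_01],
    [cov_01, var_t],
]

var_of_sum = matrix_sum(cov_matrix)  # variance of t0 + t1
var_of_mean = var_of_sum / 2 ** 2    # variance of (t0 + t1) / 2

print(var_of_sum, 2 * var_t * (1 + r))
print(var_of_mean, var_t * (1 + r) / 2)
```

Each printed pair should agree up to floating-point rounding.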

We can also construct a correlation matrix. Given that we know the correlation of talent from one day to the next, this isn't that difficult. If the correlation between talent levels on Day 1 and Day 2 is r, and the correlation between talent levels on Day 2 and Day 3 is also r, we can chain those two facts together to find that the correlation between talent levels on Day 1 and Day 3 is r².

The same logic can be extended for any number of days, so that the correlation between talent levels n days apart is rⁿ:

Correlation matrix between talent n days apart

	t₀	t₁	t₂	t₃	...	t_n-1
t₀	r⁰	r¹	r²	r³	...	r^n-1
t₁	r¹	r⁰	r¹	r²	...	r^n-2
t₂	r²	r¹	r⁰	r¹	...	r^n-3
t₃	r³	r²	r¹	r⁰	...	r^n-4
⋮	⋮	⋮	⋮	⋮	⋱	⋮
t_n-1	r^n-1	r^n-2	r^n-3	r^n-4	...	r⁰

This matrix is more useful than the covariance matrix, because all we need to know to fill in the entire correlation matrix is the value of r. And because correlation is proportional to covariance (cov_t₀,t₁ = r · var_t₀), the sum of the correlation matrix is proportional to the sum of the covariance matrix.
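To make the proportionality concrete, the following Python sketch builds the correlation matrix from r alone and checks that scaling it by the single-day variance reproduces the sum of the covariance matrix. The function names and the values of n, r, and var_t are assumptions chosen for illustration.

```python
def correlation_matrix(n, r):
    """Entry (i, j) is r raised to the number of days separating day i and day j."""
    return [[r ** abs(i - j) for j in range(n)] for i in range(n)]


def matrix_sum(matrix):
    """Add up every entry of a matrix given as a list of rows."""
    return sum(sum(row) for row in matrix)


# Illustrative values only (assumed, not taken from the article).
n = 4
r = 0.9
var_t = 4.0  # single-day talent variance, the same on every day

corr = correlation_matrix(n, r)
cov = [[var_t * value for value in row] for row in corr]

corr_total = matrix_sum(corr)
cov_total = matrix_sum(cov)

# The two totals differ only by the constant factor var_t.
print(cov_total, var_t * corr_total)
```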

Our next step, then, is to calculate the sum of the correlation matrix. Notice that the terms on each diagonal going from the top left to bottom right are identical:

We can use this pattern to simplify the sum. Since the matrix is symmetrical, we can ignore the terms below the long diagonal and calculate the sum for just the top half of the matrix, and then double it later:

r⁰	r¹	r²	r³	...	r^n-1	→	r^n-1
	r⁰	r¹	r²	⋱	⋮		⋮
		r⁰	r¹	⋱	r³	→	(n-3)r³
			r⁰	⋱	r²	→	(n-2)r²
				⋱	r¹	→	(n-1)r¹
					r⁰	→	nr⁰

There is one r⁰ term in each column of the matrix, so there are n r⁰ terms in the sum. Likewise, there are (n-1) r¹ terms, (n-2) r² terms, etc. If we group each diagonal into its own distinct term, we get a sum whose terms follow the pattern (n-i)·rⁱ:
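Written as a single sum over the diagonals of the upper half of the matrix (long diagonal included), with i counting how far each diagonal is from the long diagonal, this should be:

```latex
\sum_{i=0}^{n-1} (n - i)\, r^{i}
```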

Applying the distributive property and separating the terms of the sum, we get the following:
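In symbols, the split into two separate sums looks like this:

```latex
\sum_{i=0}^{n-1} (n - i)\, r^{i} = n \sum_{i=0}^{n-1} r^{i} \;-\; \sum_{i=0}^{n-1} i\, r^{i}
```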

The first sum is a simple geometric series, which we can calculate using the formula for geometric series:
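Using the standard closed form for a finite geometric series (valid when r is not equal to 1), the first sum works out to:

```latex
n \sum_{i=0}^{n-1} r^{i} = n \cdot \frac{1 - r^{n}}{1 - r}
```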

The second sum is similar, but the additional i factor makes it a bit trickier since it is no longer a geometric series. We can, however, transform it into a geometric series using a trick where we convert this from a single sum to a double sum, where we replace the expression inside the sum with another sum.

The idea is that each term of the series is itself a separate sum which has i terms of rⁱ. This sum can be written as follows:
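With h as the index of the new inner sum, a form consistent with the description here is:

```latex
i\, r^{i} = \sum_{h=0}^{i-1} r^{i}
\qquad\Longrightarrow\qquad
\sum_{i=0}^{n-1} i\, r^{i} = \sum_{i=0}^{n-1} \sum_{h=0}^{i-1} r^{i}
```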

Notice that we switched to using the index h rather than i. This means there is nothing inside the sum that increments on each successive term, and the i acts as a static value. In other words, this is just adding up the value rⁱ i times, which is of course equal to irⁱ.

In order to visualize how this double sum works, we can write down the terms of the sum in an array with i rows and h columns, where the value corresponding to each pair of (i,h) values is rⁱ. For example, here is what the array would look like with n=4:

	h=0	h=1	h=2	h=3
i=0	(r⁰)	(r⁰)	(r⁰)	(r⁰)
i=1	r¹	(r¹)	(r¹)	(r¹)
i=2	r²	r²	(r²)	(r²)
i=3	r³	r³	r³	(r³)

The values in parentheses (greyed out in the original table) are included to complete the array, but are not actually part of the sum; only the entries with h < i are. If we go through the sum iteratively, we start at i=0, and take the sum of rⁱ from h=0 to h=-1. Since you can't count up from 0 to -1, there are no values to count in this row, which reflects the fact that irⁱ = 0 when i=0.

Next, we go to i=1, and fill in the values r¹ for h=0 to h=0. The next row, when i=2, we go from h=0 to h=1. And so on.

We are currently taking the sum of each row and then adding those individual sums together. However, we could also start by taking the sum of each column, which would be equivalent to reversing the order of the two sums in our double series:
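With the order of summation reversed, the outer index h runs over the columns that contain at least one term of the sum, and the inner index i runs down each column:

```latex
\sum_{i=0}^{n-1} \sum_{h=0}^{i-1} r^{i} = \sum_{h=0}^{n-2} \sum_{i=h+1}^{n-1} r^{i}
```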

Note that the inner sum now goes from i=h+1 to i=n-1, which you can see in the columns of the array of terms above.
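The claim that summing by rows and summing by columns give the same total can be checked numerically. This is a small Python sketch; the function names and the test values n=4 and r=0.9 are assumptions made for illustration.

```python
def sum_by_rows(n, r):
    """Row order: for each i, add r**i once for every h from 0 to i-1."""
    return sum(r ** i for i in range(n) for h in range(i))


def sum_by_columns(n, r):
    """Column order: for each h from 0 to n-2, add r**i for i from h+1 to n-1."""
    return sum(r ** i for h in range(n - 1) for i in range(h + 1, n))


def direct_sum(n, r):
    """The original single sum of i * r**i for i from 0 to n-1."""
    return sum(i * r ** i for i in range(n))


# Illustrative values only (assumed, not taken from the article).
n, r = 4, 0.9
print(sum_by_rows(n, r), sum_by_columns(n, r), direct_sum(n, r))
```

All three printed values should match up to floating-point rounding.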

This is useful because each column of the array is a geometric series, meaning it will be easy to compute. The sum of each column is just the geometric series from i=0 to i=n-1. Then, to eliminate the greyed-out values from the sum, we subtract the geometric series from i=0 to i=h.
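Carrying out that subtraction with the geometric series formula (again assuming r is not equal to 1), the inner sum becomes:

```latex
\sum_{i=h+1}^{n-1} r^{i}
  = \sum_{i=0}^{n-1} r^{i} - \sum_{i=0}^{h} r^{i}
  = \frac{1 - r^{n}}{1 - r} - \frac{1 - r^{h+1}}{1 - r}
  = \frac{r^{h+1} - r^{n}}{1 - r}
```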

This is the value for our inner sum, so we plug that back into the outer sum:
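Worked through step by step, the outer sum should come out as follows. The middle step uses the geometric series once more for the r^{h+1} terms, and the r^n term is simply added n-1 times:

```latex
\sum_{i=0}^{n-1} i\, r^{i}
  = \sum_{h=0}^{n-2} \frac{r^{h+1} - r^{n}}{1 - r}
  = \frac{1}{1 - r} \left[ \frac{r\,(1 - r^{n-1})}{1 - r} - (n - 1)\, r^{n} \right]
  = \frac{r - n\, r^{n} + (n - 1)\, r^{n+1}}{(1 - r)^{2}}
```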

We now have values for both halves of our original sum, so next we combine them to get the full value:
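Subtracting the second sum from the first and collecting terms over a common denominator, the sum of the upper half of the matrix should be:

```latex
\sum_{i=0}^{n-1} (n - i)\, r^{i}
  = \frac{n\,(1 - r^{n})}{1 - r} - \frac{r - n\, r^{n} + (n - 1)\, r^{n+1}}{(1 - r)^{2}}
  = \frac{n - (n + 1)\, r + r^{n+1}}{(1 - r)^{2}}
```

As a spot check, n=2 gives (2 - 3r + r³)/(1-r)² = 2 + r, which matches the two r⁰ terms and one r¹ term in the upper half of a 2 × 2 matrix.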

We still have one more step to go to calculate the full sum of the correlation matrix. Recall that when we started, we were working with a symmetrical correlation matrix, and because the matrix was symmetrical along the long diagonal, we set out to find the sum for only the upper half of the matrix. In order to get the sum of the full matrix, we have to double this value:
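In symbols, doubling the upper-half sum gives:

```latex
2 \sum_{i=0}^{n-1} (n - i)\, r^{i} = \frac{2 \left[ n - (n + 1)\, r + r^{n+1} \right]}{(1 - r)^{2}}
```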

Finally, note that the long diagonal of the correlation matrix only occurs once in the matrix, so by doubling our initial sum, we are double-counting that diagonal. In order to correct for this, we need to subtract the sum of that diagonal, which is just n*1 (since each element in that diagonal equals 1):
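After subtracting n and simplifying, the sum of the full correlation matrix should be:

```latex
2 \sum_{i=0}^{n-1} (n - i)\, r^{i} - n
  = \frac{2 \left[ n - (n + 1)\, r + r^{n+1} \right] - n\,(1 - r)^{2}}{(1 - r)^{2}}
  = \frac{n\,(1 - r^{2}) - 2r\,(1 - r^{n})}{(1 - r)^{2}}
```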

This value is proportional to the sum of the covariance matrix, which is proportional to the variance of talent in the population over n days.

Next, we need to come up with a corresponding value to represent the variance of talent over a single day. To do this, we can rely on the fact that as long as talent never changes, the variance in talent over any number of days is the same as the variance in talent over a single day. Instead of comparing to the variance in talent over a single day, we can instead compare to the variance in talent over n days when talent is constant from day to day.

This allows us to construct a similar correlation matrix to represent the constant-talent scenario. Compared to the correlation matrix for changing talent, this is trivially simple: since talent levels are the same throughout the sample, the correlation between talent from one day to the next will always be one.

In other words, the correlation matrix will just be an n × n array of 1s. And the sum of an n × n array of 1s is just n².

The ratio of these two values will give us the ratio of talent variance after n days of talent changes to the talent variance when talent is constant:
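Before any simplification, the ratio is the sum of the changing-talent correlation matrix (twice the upper-half sum, minus n) divided by n²:

```latex
\frac{2 \left[ \dfrac{n\,(1 - r^{n})}{1 - r} - \dfrac{r - n\, r^{n} + (n - 1)\, r^{n+1}}{(1 - r)^{2}} \right] - n}{n^{2}}
```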

And that is our formula for finding the ratio of variance in true talent over n days to the variance in true talent on a single day, given the value r for the correlation of true talent from one day to the next. With some simplification, the above formula is equivalent to what was posted in the THT article:
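One simplified form that follows from the steps above is [n(1 - r²) - 2r(1 - rⁿ)] / [n²(1 - r)²]; the THT article may arrange the terms differently. As a sanity check, the Python sketch below compares this closed form with a brute-force sum of the correlation matrix. The function names and test values are assumptions for illustration, and the r = 1 branch is added only to avoid dividing by zero in the constant-talent case.

```python
def variance_ratio(n, r):
    """Closed-form ratio of n-day talent variance to single-day talent variance."""
    if r == 1:
        return 1.0  # constant talent: no reduction in variance
    numerator = n * (1 - r ** 2) - 2 * r * (1 - r ** n)
    return numerator / (n ** 2 * (1 - r) ** 2)


def variance_ratio_brute_force(n, r):
    """Sum every entry r**|i-j| of the correlation matrix, then divide by n**2."""
    total = sum(r ** abs(i - j) for i in range(n) for j in range(n))
    return total / n ** 2


# Illustrative values only (assumed, not taken from the article).
for n, r in [(1, 0.9), (2, 0.9), (100, 0.99)]:
    print(n, r, variance_ratio(n, r), variance_ratio_brute_force(n, r))
```

For n=2 both functions should return (1+r)/2, which matches the two-day result derived near the start of this post.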
