3-D Baseball

Weighting Calculator for Past Data

2017-08-09T04:29:00.000-07:00

Calculate Weighting for Past Data

decay factor:

Estimate the regression constant (for binomial stats only)

regression constant:

Math Behind Weighting Past Results (THT Article)

2017-08-09T01:30:00.000-07:00

In my article "The Math of Weighting Past Results" on the Hardball Times, I gave a formula for finding the proper weighting for past data given certain inputs from the dataset. This formula defined the relationship between weighted results and talent, and the proper weighting was the value that maximized that relationship.

I started with a formula for a sample with exactly two days and then generalized that to cover any length of sample. I more or less explained where the two-day version came from in the article, but not the full version, which was as follows:

This supplement will go through the calculations of generalizing the simpler two-day formula to what we see above. It will rely heavily on the use of geometric series, so I would recommend having some familiarity with those before attempting to follow these calculations.

In the article, we treated each day's results as a separate variable and the overall sample as a sum of these individual daily variables. When we had two days in our sample, the combined variance was defined by the following formula:

This formula can be expanded to include more than two variables, but it starts to get messy really quickly. To make expanding it simpler, the formula can be rewritten as a covariance matrix. If you have n variables, then the covariance matrix will be an n x n array, where each entry is the covariance between the variables representing that row and column. For two days, we would fill in the covariance matrix as follows:

	x₁	x₂
x₁	Var_x1	Cov_x1,x2
x₂	Cov_x1,x2	Var_x2

The combined variance is equal to the sum of the items in the matrix, which you can see is equivalent to the above formula.

This makes it much simpler to expand the formula for additional variables since you just have to add more rows and columns to the matrix. In the article, we found that the day-to-day correlation of talent (r) and the decay factor used to weight past data (w) can be used to explain changes in the variances and covariances throughout the sample:

Translating this to our covariance matrix gives us (with Var_x1 and Var_true written as v_x and v_t to save space):

	x₁	wx₂
x₁	v_x	rw*v_t
wx₂	rw*v_t	w²v_x

If we expand this to include additional days, every term except those on the diagonal will include a Var_true factor, and those on the diagonal will instead have a Var_x1 factor. (This is because the terms on the diagonal represent the covariance of each variable with itself, which is just the variance of that variable.) Similarly, every term contains an r factor and a w factor, except that the terms on the diagonal have no r (because these are relating the results of one day to themselves, so it is irrelevant how much talent changes from day to day).

For now, let's strip out the variance factors and focus only on what happens to r and w as we expand the matrix to cover more days. We'll look at r and w separately, but keep in mind these are just factors from the same matrix, not two separate matrices. If you placed one on top of the other, so that each r term lines up with the corresponding w term, and then put the variances back in, you'd get the full matrix.

This covariance matrix is essentially the same as what we worked with for the variance article, except now we are introducing weights for past results. As a result, the only real difference here is what happens with the w's, and the r terms follow the same pattern as in the math for the variance article:

	x₁	wx₂	w²x₃	w³x₄	...	w^d-1x_d
x₁	r⁰	r¹	r²	r³	...	r^d-1
wx₂	r¹	r⁰	r¹	r²	...	r^d-2
w²x₃	r²	r¹	r⁰	r¹	...	r^d-3
w³x₄	r³	r²	r¹	r⁰	...	r^d-4
⋮	⋮	⋮	⋮	⋮	⋱	⋮
w^d-1x_d	r^d-1	r^d-2	r^d-3	r^d-4	...	r⁰

The weights also follow a pattern, though not the same one as the r factors. The weight for each term equals the combined weight of the two variables it represents:

	x₁	wx₂	w²x₃	w³x₄	...	w^d-1x_d
x₁	w⁰	w¹	w²	w³	...	w^d-1
wx₂	w¹	w²	w³	w⁴	...	w^d
w²x₃	w²	w³	w⁴	w⁵	...	w^d+1
w³x₄	w³	w⁴	w⁵	w⁶	...	w^d+2
⋮	⋮	⋮	⋮	⋮	⋱	⋮
w^d-1x_d	w^d-1	w^d	w^d+1	w^d+2	...	w^2(d-1)

While the two patterns are different, there are three important things to note that hold for both of them:

1) The terms on the main diagonal form their own distinct pattern.
2) The remaining terms are symmetrical about the diagonal, with the terms above and below the diagonal mirroring each other.
3) The terms on each diagonal parallel to the main diagonal follow a distinct pattern.

We need to find the sum of the matrix to get the variance in the weighted results. Using these three observations, we can simplify the sum by dividing the matrix up into parts.
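As a rough sketch of the matrix described above, the following builds the full weighted covariance matrix by combining the r pattern, the w pattern, and the two variance factors, then sums it. The function names and the example values for r, w, v_x and v_t are assumptions chosen purely for illustration.

```python
def weighted_cov_matrix(d, r, w, v_x, v_t):
    """Covariance matrix of the weighted daily results x1, w*x2, ..., w^(d-1)*x_d."""
    matrix = []
    for i in range(d):
        row = []
        for j in range(d):
            weight = w ** (i + j)  # combined weight of the two variables
            if i == j:
                # Diagonal: variance of one day's results, no r factor.
                row.append(weight * v_x)
            else:
                # Off-diagonal: shared talent variance, damped by r^|i-j|.
                row.append(r ** abs(i - j) * weight * v_t)
        matrix.append(row)
    return matrix


def matrix_sum(matrix):
    """Sum of every entry, i.e. the variance of the weighted results."""
    return sum(sum(row) for row in matrix)


# Illustrative (assumed) inputs for the two-day case.
two_day = weighted_cov_matrix(2, r=0.99, w=0.9, v_x=1.0, v_t=0.1)
two_day_variance = matrix_sum(two_day)
```

For d=2 this reproduces the two-day matrix shown earlier: v_x, rw*v_t, rw*v_t, w²v_x.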

We'll start with the main diagonal of the matrix. The terms on the diagonal follow the form w²ⁱ*Var_x1. The sum of these terms is a geometric series, which makes it simple to evaluate:
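A minimal sketch of that geometric series, assuming the diagonal terms run from i=0 to i=d-1: the term-by-term sum should agree with the standard closed form Var_x1*(1 - w^(2d))/(1 - w²). The function names here are hypothetical.

```python
def diagonal_sum_direct(d, w, v_x):
    """Add up the diagonal terms w^(2i) * v_x one at a time."""
    return sum(w ** (2 * i) * v_x for i in range(d))


def diagonal_sum_closed(d, w, v_x):
    """Closed form of the same geometric series (common ratio w^2)."""
    if w == 1:
        return d * v_x  # every term equals v_x when nothing is down-weighted
    return v_x * (1 - w ** (2 * d)) / (1 - w ** 2)
```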

Next, because the matrix is symmetrical about the diagonal, we can focus on the sum for only the terms above or below the diagonal and then double our result later.

We'll compute this sum by continuing to divide the matrix along its diagonal rows. The r values within a given diagonal are all identical, which we can see in this graphic from the math for the previous article on variance:

The w values within each diagonal also follow a set pattern, though slightly more complex than the one for r's. Rather than r¹+r¹+r¹+..., we get w¹+w³+w⁵+... The basic pattern for the first diagonal is:

That's just for the w component of each term. If we include the r and variance components, we get this for the sum of the terms in the first diagonal adjacent to the main diagonal:

This is still a geometric series, so we can evaluate the sum for this diagonal.

For the second diagonal, the w's go w²+w⁴+w⁶+..., which gives us:

If we keep going, we'll find that for each additional diagonal, the exponent for r will rise by one, the starting value of i in the summation will rise by one (which also means the summation will have one fewer term, which we can see by looking at the matrix), and each diagonal will alternate having an extra w outside the geometric sum due to the diagonals alternating between odd and even exponents.

Fortunately, the alternating w problem disappears when we distribute that w back into the result for the geometric sum of each odd diagonal. We end up with the following pattern for the sum of each diagonal (after factoring out the Var_true component from each term):

This gives us two separate geometric series: the first multiplies by a factor of rw, and the second by a factor of r/w. Simplifying these geometric series gives us:

That gives us the sum of everything above the main diagonal in the covariance matrix. To get the full sum of the matrix, we need to double this (to account for everything below the diagonal, which mirrors this calculation) and add the sum of the main diagonal:

This gives us the full variance of the weighted results. Our formula calls for the standard deviation instead of the variance, so we just take the square root of this.
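The whole calculation can be sanity-checked numerically. The sketch below compares a brute-force sum of the matrix against a closed form assembled from the pieces described above: the diagonal series, plus twice the off-diagonal part, which splits into one geometric series in rw and one in r/w. The closed form is my own reassembly under the assumption 0 < w < 1; it is algebraically in the same spirit as the article's formula but may not be laid out the same way.

```python
import math


def weighted_variance_direct(d, r, w, v_x, v_t):
    """Brute-force sum of every entry in the weighted covariance matrix."""
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                total += w ** (2 * i) * v_x
            else:
                total += r ** abs(i - j) * w ** (i + j) * v_t
    return total


def geometric_sum(ratio, first_power, last_power):
    """Sum of ratio**k for k = first_power..last_power (0 if the range is empty)."""
    if last_power < first_power:
        return 0.0
    if ratio == 1:
        return float(last_power - first_power + 1)
    return (ratio ** first_power - ratio ** (last_power + 1)) / (1 - ratio)


def weighted_variance_closed(d, r, w, v_x, v_t):
    """Closed form, assuming 0 < w < 1."""
    diagonal = v_x * (1 - w ** (2 * d)) / (1 - w ** 2)
    # Everything above the main diagonal: one series in r*w, one in r/w.
    upper = v_t * (geometric_sum(r * w, 1, d - 1)
                   - w ** (2 * d) * geometric_sum(r / w, 1, d - 1)) / (1 - w ** 2)
    return diagonal + 2 * upper


# Standard deviation of the weighted results for some assumed inputs.
sd_weighted = math.sqrt(weighted_variance_closed(50, 0.99, 0.9, 1.0, 0.1))
```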

Next, we need to calculate the covariance between current talent and the weighted observations. We can get this using another covariance matrix based on the idea of "shared" variance mentioned in the Hardball Times article. The covariance between the results and talent for a given day is the same as the variance in talent, since the variance in talent is inherent in the variance of the results (i.e. that variance is shared between the results and the talent levels for that day).

To fill out the rest of the covariance matrix, we use the fact that the covariance between results and current talent drops the further the results are from the present time. The amount the covariance drops is determined by the day-to-day correlation in talent and the weight given to past data:

	x₁	wx₂	w²x₃	w³x₄	...	w^d-1x_d
t₁	(rw)⁰v_t	(rw)¹v_t	(rw)²v_t	(rw)³v_t	...	(rw)^d-1v_t

This is also a geometric series which multiplies by a factor of rw. The sum simplifies to:
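As a quick check, the row of covariances above can be summed directly and compared with the usual geometric-series closed form Var_true*(1 - (rw)^d)/(1 - rw). This is only a sketch; the function names are hypothetical.

```python
def talent_covariance_direct(d, r, w, v_t):
    """Sum the row (rw)^0 v_t, (rw)^1 v_t, ..., (rw)^(d-1) v_t term by term."""
    return sum((r * w) ** i * v_t for i in range(d))


def talent_covariance_closed(d, r, w, v_t):
    """Closed form of the same geometric series (common ratio r*w)."""
    if r * w == 1:
        return d * v_t
    return v_t * (1 - (r * w) ** d) / (1 - r * w)
```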

As long as we know the values for r, w, Var_true and Var_x1, we can work out what the variance will be over any number of days, which means as long as we know r, Var_true and Var_x1, we can find the value of w which maximizes the relationship between weighted results and current talent.

Typically we would find this by taking the derivative of the formula and finding the point where the derivative equals 0, but given that this is a rather unpleasant derivative to calculate (and most likely will have difficult-to-find zeroes), I would strongly recommend just using the optimize function in R or some other statistical program (the calculator on the Hardball Times uses the same method to minimize/maximize a function as the optimize function in R).
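For readers without R handy, here is a minimal sketch of the numerical approach in plain Python. It uses a simple golden-section search, which is a swapped-in stand-in rather than the exact routine behind R's optimize, and it assumes the correlation has a single peak for w between 0 and 1. The inputs (d=30, r=0.99, Var_x1=1.0, Var_true=0.1) are made up for illustration.

```python
import math


def correlation(w, d, r, v_x, v_t):
    """Correlation between the weighted results and current talent."""
    cov = sum((r * w) ** i * v_t for i in range(d))
    var = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                var += w ** (2 * i) * v_x
            else:
                var += r ** abs(i - j) * w ** (i + j) * v_t
    # corr = cov / (sd of talent * sd of weighted results)
    return cov / math.sqrt(v_t * var)


def golden_section_max(f, lo, hi, tol=1e-8):
    """Locate the maximum of a single-peaked function f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    e = a + inv_phi * (b - a)
    fc, fe = f(c), f(e)
    while b - a > tol:
        if fc > fe:
            # The peak lies in [a, e]; reuse c as the new right probe.
            b, e, fe = e, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The peak lies in [c, b]; reuse e as the new left probe.
            a, c, fc = c, e, fe
            e = a + inv_phi * (b - a)
            fe = f(e)
    return (a + b) / 2


# Assumed example inputs, not values taken from the article.
best_w = golden_section_max(
    lambda w: correlation(w, d=30, r=0.99, v_x=1.0, v_t=0.1), 0.0, 1.0)
```

The brute-force correlation is slow for long samples; swapping in the closed-form sums would speed it up without changing the search.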

One final note: this all relies on the assumption of exponential decay weighting. Exponential decay is not necessarily implied by the underlying mathematical processes; it's an assumption we are making to make our lives easier. Theoretically, we could fit the weight for each day individually, but this is far, far more complicated and not really worth the effort.

If you had 100 days in your sample, instead of maximizing the correlation for w, you would have to maximize it for a system of 100 different weight variables. If you would like to attempt this, by all means, have fun, but, while the exponential decay assumption is a simplification, it does work pretty well.

The true weight values do tend to drop slightly faster for the most recent data and then level out more for older data than exponential decay allows for, but on the whole, it doesn't make that much difference to use exponential decay.

Math Behind Regression with Changing Talent Levels (THT Article)

2017-06-04T18:30:00.000-07:00

In my article "Regression with Changing Talent Levels: the Effects of Variance" on the Hardball Times, I talk about how changes in players' true talent levels from day to day reduce the variance of talent in the population overall over time. In other words, the spread in talent over a 100-game sample will be smaller than the spread in talent over a one-game sample. In the article, I gave the following formula to calculate how much the spread in talent is reduced, which I will further explain here:

*Note: in the THT article, I used d for the number of days instead of n to avoid confusion with another formula that was referenced from a previous article, which used n for something else. For this article, I'm just going to use n for the number of days.

The value given by the formula is the ratio of talent variance over n days to the talent variance for a single day. In other words, the variance in talent drops by a multiplicative factor that is dependent on the length of the sample and the correlation of talent from day to day.

Now, how do we get that formula?

If we only have two days in our sample, it is not too difficult to calculate the drop in talent variance. Let t₀ be a variable representing player talent levels on Day 1, and t₁ be a variable representing player talent levels on Day 2. We want to find the variance of the average talent levels over both days, or (t₀+t₁)/2.

The following formula gives us the variance of the sum of two variables:

The covariance is directly proportional to the correlation between the two variables and is defined as follows:

(Note that sd_t₀sd_t₁ = var_t₀ = var_t₁ because the standard deviation and variance for both variables are the same.)
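The variance-of-a-sum identity (the variance of t₀+t₁ equals var_t₀ + var_t₁ plus twice the covariance) is easy to confirm on a small dataset. The numbers below are made up purely for illustration, and pcovariance is a helper defined here, not a library function.

```python
from statistics import mean, pvariance


def pcovariance(xs, ys):
    """Population covariance of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Made-up talent levels for five players on two consecutive days.
day1 = [0.300, 0.250, 0.275, 0.320, 0.260]
day2 = [0.305, 0.248, 0.270, 0.318, 0.266]

total = [a + b for a, b in zip(day1, day2)]
variance_of_sum = pvariance(total)
from_formula = pvariance(day1) + pvariance(day2) + 2 * pcovariance(day1, day2)
```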

Before we continue, there is an important thing to note. Because we are trying to derive a formula for a ratio (variance in talent over n days divided by variance in talent over one day), we don't necessarily need to calculate the numerator and denominator of that ratio exactly. As long as we can calculate values that are proportional to those values by the same factor, the ratio will be preserved.

Technically, we want the variance of the value (t₀+t₁)/2 and not just t₀+t₁, which would be var_t(1+r)/2 instead of 2var_t(1+r). However, those two values are proportional, so it doesn't really matter for now which we calculate as long as we can also calculate a value for the denominator that is proportional by the same factor.

For two days, the above calculations are simple enough. Once you start adding more days, however, it starts to get more complicated. Fortunately, the above math can also be expressed with a covariance matrix:

	t₀	t₁
t₀	var₀	cov_0,1
t₁	cov_0,1	var₁

The variance of the sum t₀+t₁ is equal to the sum of the terms in the covariance matrix, which you can see just gives us the formula: var_t₀+t₁ = var_t₀ + var_t₁ + 2cov_t₀,t₁. The covariance matrix is convenient because it can be expanded for any number of days:

Covariance matrix between talent n days apart

	t₀	t₁	t₂	t₃	...	t_n-1
t₀	var₀	cov_0,1	cov_0,2	cov_0,3	...	cov_0,n-1
t₁	cov_0,1	var₁	cov_1,2	cov_1,3	...	cov_1,n-1
t₂	cov_0,2	cov_1,2	var₂	cov_2,3	...	cov_2,n-1
t₃	cov_0,3	cov_1,3	cov_2,3	var₃	...	cov_3,n-1
⋮	⋮	⋮	⋮	⋮	⋱	⋮
t_n-1	cov_0,n-1	cov_1,n-1	cov_2,n-1	cov_3,n-1	...	var_n-1

We can also construct a correlation matrix. Given that we know the correlation of talent from one day to the next, this isn't that difficult. If the correlation between talent levels on Day 1 and Day 2 is r, and the correlation between talent levels on Day 2 and Day 3 is also r, we can chain those two facts together to find that the correlation between talent levels on Day 1 and Day 3 is r².

The same logic can be extended for any number of days, so that the correlation between talent levels n days apart is rⁿ:

Correlation matrix between talent n days apart

	t₀	t₁	t₂	t₃	...	t_n-1
t₀	r⁰	r¹	r²	r³	...	r^n-1
t₁	r¹	r⁰	r¹	r²	...	r^n-2
t₂	r²	r¹	r⁰	r¹	...	r^n-3
t₃	r³	r²	r¹	r⁰	...	r^n-4
⋮	⋮	⋮	⋮	⋮	⋱	⋮
t_n-1	r^n-1	r^n-2	r^n-3	r^n-4	...	r⁰

This matrix is more useful than the covariance matrix, because all we need to know to fill in the entire correlation matrix is the value of r. And because correlation is proportional to covariance (cov_t₀,t₁ = r · var_t₀), the sum of the correlation matrix is proportional to the sum of the covariance matrix.

Our next step, then, is to calculate the sum of the correlation matrix. Notice that the terms on each diagonal going from the top left to bottom right are identical:

We can use this pattern to simplify the sum. Since the matrix is symmetrical, we can ignore the terms below the long diagonal and calculate the sum for just the top half of the matrix, and then double it later:

r⁰	r¹	r²	r³	...	r^n-1	→	r^n-1
	r⁰	r¹	r²	⋱	⋮		⋮
		r⁰	r¹	⋱	r³	→	(n-3)r³
			r⁰	⋱	r²	→	(n-2)r²
				⋱	r¹	→	(n-1)r¹
					r⁰	→	nr⁰

There is one r⁰ term in each column of the matrix, so there are n r⁰ terms in the sum. Likewise, there are (n-1) r¹ terms, (n-2) r² terms, etc. If we group each diagonal into its own distinct term, we get a sum whose terms follow the pattern (n-i)*rⁱ:
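As a small sketch, the grouping by diagonals can be checked by building the correlation matrix, summing its upper half (main diagonal included), and comparing that with the sum of (n-i)*rⁱ for i from 0 to n-1. The function names are hypothetical.

```python
def correlation_matrix(n, r):
    """n x n matrix whose (i, j) entry is r^|i-j|."""
    return [[r ** abs(i - j) for j in range(n)] for i in range(n)]


def upper_half_sum(matrix):
    """Sum of the main diagonal and everything above it."""
    n = len(matrix)
    return sum(matrix[i][j] for i in range(n) for j in range(i, n))


def grouped_diagonal_sum(n, r):
    """Same sum, grouped by diagonal: (n - i) copies of r^i."""
    return sum((n - i) * r ** i for i in range(n))
```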

Applying the distributive property and separating the terms of the sum, we get the following:

The first sum is a simple geometric series, which we can calculate using the formula for geometric series:

The second sum is similar, but the additional i factor makes it a bit trickier since it is no longer a geometric series. We can, however, transform it into a geometric series using a trick where we convert this from a single sum to a double sum, where we replace the expression inside the sum with another sum.

The idea is that each term of the series is itself a separate sum which has i terms of rⁱ. This sum can be written as follows:

Notice that we switched to using the index h rather than i. This means there is nothing inside the sum that increments on each successive term, and the i acts as a static value. In other words, this is just adding up the value rⁱ i times, which is of course equal to irⁱ.
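In code, the inner sum is just a loop over h that never uses h: it adds the fixed value rⁱ once for each h from 0 to i-1. A minimal sketch (the function name is hypothetical):

```python
def term_as_inner_sum(i, r):
    """Add r^i once for each h = 0..i-1; h itself never appears in the term."""
    return sum(r ** i for h in range(i))
```

When i=0 the range is empty and the sum is 0, matching irⁱ = 0.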

In order to visualize how this double sum works, we can write down the terms of the sum in an array with i rows and h columns, where the value corresponding to each pair of (i,h) values is rⁱ. For example, here is what the array would look like with n=4:

	h=0	h=1	h=2	h=3
i=0	r⁰	r⁰	r⁰	r⁰
i=1	r¹	r¹	r¹	r¹
i=2	r²	r²	r²	r²
i=3	r³	r³	r³	r³

The greyed-out values (those with h ≥ i) are included to complete the array, but are not actually part of the sum. If we go through the sum iteratively, we start at i=0, and take the sum of rⁱ from h=0 to h=-1. Since you can't count up from 0 to -1, there are no values to count in this row, which represents the fact that irⁱ = 0 when i=0.

Next, we go to i=1, and fill in the values r¹ for h=0 to h=0. The next row, when i=2, we go from h=0 to h=1. And so on.

We are currently taking the sum of each row and then adding those individual sums together. However, we could also start by taking the sum of each column, which would be equivalent to reversing the order of the two sums in our double series:

Note that the inner sum now goes from i=h+1 to i=n-1, which you can see in the columns of the array of terms above.

This is useful because each column of the array is a geometric series, meaning it will be easy to compute. The sum of each column is just the geometric series from i=0 to i=n-1. Then, to eliminate the greyed-out values from the sum, we subtract the geometric series from i=0 to i=h.
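The sketch below checks that summing by rows and summing by columns give the same total, with each column computed as the full geometric series minus its greyed-out part. It assumes r is not exactly 1 (otherwise the closed forms divide by zero), and the function names are hypothetical.

```python
def sum_by_rows(n, r):
    """Row i contributes r^i once for each h = 0..i-1."""
    return sum(r ** i for i in range(n) for h in range(i))


def sum_by_columns(n, r):
    """Column h contributes r^i for i = h+1..n-1, via two geometric series."""
    total = 0.0
    for h in range(n - 1):  # the last column (h = n-1) has no terms in the sum
        full_column = (1 - r ** n) / (1 - r)        # i = 0..n-1
        greyed_part = (1 - r ** (h + 1)) / (1 - r)  # i = 0..h
        total += full_column - greyed_part
    return total
```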

This is the value for our inner sum, so we plug that back into the outer sum:

We now have values for both halves of our original sum, so next we combine them to get the full value:

We still have one more step to go to calculate the full sum of the correlation matrix. Recall that when we started, we were working with a symmetrical correlation matrix, and because the matrix was symmetrical about the diagonal, we set out to find the sum for only the upper half of the matrix. In order to get the sum of the full matrix, we have to double this value:

Finally, note that the long diagonal of the correlation matrix only occurs once in the matrix, so by doubling our initial sum, we are double-counting that diagonal. In order to correct for this, we need to subtract the sum of that diagonal, which is just n*1 (since each element in that diagonal equals 1):

This value is proportional to the sum of the covariance matrix, which is proportional to the variance of talent in the population over n days.
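A short numerical check of the doubling step, as a sketch with hypothetical function names: twice the upper-half sum (main diagonal included) minus n should equal the sum of every entry in the correlation matrix.

```python
def full_matrix_sum(n, r):
    """Brute-force sum of every entry r^|i-j| in the n x n correlation matrix."""
    return sum(r ** abs(i - j) for i in range(n) for j in range(n))


def doubled_upper_half(n, r):
    """Twice the upper-half sum, minus n to undo double-counting the diagonal."""
    upper = sum((n - i) * r ** i for i in range(n))
    return 2 * upper - n
```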

Next, we need to come up with a corresponding value to represent the variance of talent over a single day. To do this, we can rely on the fact that as long as talent never changes, the variance in talent over any number of days is the same as the variance in talent over a single day. Instead of comparing to the variance in talent over a single day, we can instead compare to the variance in talent over n days when talent is constant from day to day.

This allows us to construct a similar correlation matrix to represent the constant-talent scenario. Compared to the correlation matrix for changing talent, this is trivially simple: since talent levels are the same throughout the sample, the correlation between talent from one day to the next will always be one.

In other words, the correlation matrix will just be an n x n array of 1s. And the sum of an n x n array of 1s is just n^2.

The ratio of these two values will give us the ratio of talent variance after n days of talent changes to the talent variance when talent is constant:

And that is our formula for finding the ratio of variance in true talent over n days to the variance in true talent on a single day, given the value r for the correlation of true talent from one day to the next. With some simplification, the above formula is equivalent to what was posted in the THT article:
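To make the result concrete, here is a sketch that computes the ratio two ways: by brute force from the correlation matrix, and from one closed form I worked out from the sums above, (n + 2r(n(1-r) - (1-rⁿ))/(1-r)²)/n². That closed form is my own simplification and may be arranged differently from the version in the THT article; the function names are hypothetical.

```python
def variance_ratio_direct(n, r):
    """Sum of the n x n correlation matrix divided by n^2."""
    return sum(r ** abs(i - j) for i in range(n) for j in range(n)) / n ** 2


def variance_ratio_closed(n, r):
    """Closed form of the same ratio (constant talent, r = 1, gives exactly 1)."""
    if r == 1:
        return 1.0
    matrix_sum = n + 2 * r * (n * (1 - r) - (1 - r ** n)) / (1 - r) ** 2
    return matrix_sum / n ** 2
```

For two days both versions reduce to (1+r)/2, matching the two-day calculation earlier.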

The Fight

2016-12-24T17:23:00.002-07:00

It happened on May 15, 1912.

The once-mighty Detroit Tigers were off to a slow start. It was to be a long season, their first losing one in six years. Far from mollifying the pain of defeat, their past success only served to heighten the tension they felt—the old veterans had nearly forgotten what it was to lose, and the youthful among them had not known to begin with. By contrast, their current situation, while not objectively hopeless, only felt that much more dire.

Needless to say, when the Tigers rolled into New York on their steam locomotive from Boston, where they’d just dropped another two out of three, and cozied up to Hilltop Park, they were a cohort on edge.

Hilltop Park, as it happened, seemed at first the perfect destination for such a group of men. The Highlanders, not yet the storied franchise they would later become, were one of the few teams in the American League still worse than they were, and their boys were ripe for the beating. Over the next three days, Detroit began to feel their season reforming beneath their cleats. They took two of the first three and were nearly back to .500. Once-shattered men began again to believe.

And so they took the field for the fourth and final game of the series. Things began inauspiciously, with the teams trading blows for the first two innings and Detroit emerging from the proverbial fracas with a one-run lead. As it were, such acts of violence were not to remain figurative.

Detroit’s star centerfielder, Tyrus Raymond Cobb, was so known for his gentle disposition that his teammates, half-mockingly but not without a hint of affection, referred to him as “the Georgia Peach”. However, as Detroit’s standout performer, it was Cobb who found himself the target of the local malcontents who had made it their duty to suffer Highlander seasons firsthand.

Loudest among these was one Claude Lueker, a man whose brazenness had been honed in the fiery confines of Tammany Hall, and he spoke in ways of which only a man entrenched in politics could even conceive. Such foul narratives poured from his mouth as would turn an oak tree barren just from the stench of their connotations.

For four innings this continued. Cobb tried to escape the abuse by staying in centerfield for both turns at bat, sitting quietly against the outfield scoreboard and only speaking up to help direct the New York outfielders to avoid collisions. However, Cobb was accustomed to reading between innings, and had in fact been looking forward to the New York trip where the country’s leading literary critics resided and published, and had that very day picked up a new analysis of Macbeth from just such a scholar before the game. Only Cobb had left his reading glasses in the dugout, and was unable to study his text from the outfield.

And so, after four innings of careful isolation, Cobb finally felt it safe to brave the trek back to the dugout to retrieve his spectacles. He knew at once he had been mistaken. The heckler was on him again, this time saying things Cobb was certain could turn even the most ardent of free speech advocates into anti-seditionists.

Once in the dugout, Cobb was immediately accosted for his inaction.

“Dammit, Cobb!” cried Sam Crawford. “This has gone on long enough! There are children here, for crying out loud!”

Ed Willett soon chimed in. “You can escape this nonsense out there in centerfield, but I’ve got to stand on the mound and listen to it! You think Donie Bush would let this kind of thing go? Sometimes I wish he were our future Hall of Famer.”

Cobb protested. “Look, I’m sorry you all have to put up with this, but there’s nothing we can do. We’ll be out of New York tomorrow, and we can put the whole thing behind us then.”

Wanting nothing more than to go back to the outfield where the fans were much more docile and many were willing to debate the merits of Mark Twain’s lesser novels (which was one of Cobb’s pet subjects), Cobb hoped he could leave it at that. It was at this moment that an insult so offensive crept over the lip of the dugout and into the ears of the Detroit men that there was no longer anything Cobb could do for the hurler.

Hughie Jennings walked over and put his arm on Cobb’s shoulder. “Look, son, I know you don’t like this any more than the rest of us. Probably less than the rest of us. But you’ve got to do something to shut that man up.” Jennings' eyes glowed with a warm fierceness Cobb knew from experience he could not allay. With a final pat on Cobb's shoulder, Jennings bored into him with those eyes and tried to reassure him: “We’ll have your back.” Cobb turned reluctantly toward the dugout steps.

After a tentative step into the stands, Cobb quickly retreated. Jennings began to protest, but Cobb cut him off. “Look, I know what you’re going to say, but the man is an invalid! He’s got no hands!”

“I don’t care if he doesn’t have any feet!” Jennings bellowed. “What must be done will be done, if not by you then by someone else!”

From the corner of his eye, Cobb saw Bill Burns reaching for his lumber. Burns had long since washed out as an effective pitcher and had never been able to hit a lick, but he remained a towering hulk of a man, and Cobb knew it would not end pleasantly were he commissioned for the task. So, even more reluctantly than before, Cobb slunk back up the dugout steps and into the stands, trailed behind by his fellow Tigers.

“Look,” Cobb said as he approached the man, “I wish you wouldn’t create such a ruckus, but also know that I haven’t any ill intent toward you.” With that, Cobb raised his fist half-heartedly, when suddenly the man heaved his entire weight in the direction of Cobb. Like two anteaters on the savanna they tumbled. Cobb’s teammates jumped at the sight, storming into the stands with bats in hand. Mayhem was upon the lower grandstand like flies on a heap of corpses and was not to be driven away.

At this point, the Highlanders, who had been surveying the local architecture beyond left field using Hal Chase’s new engineering sextant, heard the commotion and were made aware of the delay in the game. They rushed to the aid of their fellow professionals, leaping unaware into the middle of the fray. For the next forty-five minutes, fans and players were at each other in a most uncivilized manner before the umpires managed to get through to the telegraph office in the press box to wire the police.

By the time it was over, more than two dozen fans were injured, and several players received stern warnings for their behavior. Ban Johnson, who happened to be in attendance and witnessed the second half of the brawl after returning from the concession stand, suspended the entire Detroit roster, and they had to play three days later against Philadelphia with a replacement nine.

And that, to this day, remains without a doubt the greatest fight in baseball history.

Baseball is Dying (1892 version)

2015-09-03T10:41:00.000-07:00

At least that seems to be the opinion of Pittsburgh Dispatch sports editor John D. Pringle in his weekly "A Review of Sports" column:

If there were ever any doubts concerning the waning interest in baseball, the meeting of the magnates at Chicago during the past week must have dispelled them. The gathering was more like the meeting together of a lot of men to sing a funeral dirge than anything else. The proceedings were doleful despite the efforts of the magnates to wear smiles. Most certainly this annual meeting was far below par in enthusiasm with those of former years.
...
To be sure, those persons who court notoriety by always wanting rules changed and tinkered were at the meeting. There was no millenium plan this time; it is an exploded bladder now, but there was the new diamond notion and a few other things just as silly and just as characteristic of liquid intellects as the Utopian "plan." Of course all the venders of quack remedies pointed out that "something must be done to revive an interest in baseball." Ah! You see they admit the game's popularity is waning. Happily no changes were decided on.

Even more pessimistic was the Kansas City Times, which apparently wrote:

BASEBALL has apparently served its day and its days seem near an end. Perhaps there may be a renaissance. But the ball players have come to the end of their string; they can play very little better; there is no more progress to be made. The people have seen it all. They are tired of reviewing it.

By the way, this is the "new diamond notion" Pringle refers to:

As you can see, the proposal was to add a fifth base, with the middle bases positioned roughly where the infielders actually play. The basis for the proposal was twofold: One, it would increase the amount of fair territory by widening the angle between the first and third baselines, resulting in more base hits and fewer foul balls. Two, it would shorten the distance between stealable bases to 70 feet (along with the distance the catcher would have to throw the ball), leading to a more active running game.

By keeping the distance to first and to home the same, proponents hoped to minimize the impact on infield hits and scoring plays. By adding an extra base station and increasing the total distance around the bases, the extra action of more base hits and base stealing would not necessarily lead to a huge increase in scoring.

Gender in Chess PART 4: MISREPRESENTING THE DATA

2015-05-29T14:52:00.002-07:00

The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.

Previous:
INTRO
PART 1: MEASURING THE GENDER GAP
PART 2: ELO RATINGS
PART 3: CAUSE AND EFFECT, THE BILALIĆ, SMALLBONE, MCLEOD AND GOBET STUDY

Finally, I think one of the biggest issues is that Howard may have misrepresented his research in the Chessbase.com article. Since the full paper is behind a paywall, I don't know for sure or to what extent, but there are certainly indications that the article overstates Howard's conclusions.

One is the following graph, which is one of the few pieces of data Howard shares from his research:

The graph purportedly refutes the participation hypothesis by showing that the rating gap between males and females increases as the female participation rate increases. This supports Howard's alternative hypothesis that the most talented females are already playing no matter how low the overall female participation rate is, and that increasing the participation rate only adds less talented players and can never catch females up to males.

A few things jump out about this graph, though. First, the data on federations between 5-10% and 15-25% is completely missing from the graph, with the three remaining points forming a neat line with a clear slope. I have no idea if this was deliberate, but it is at least strange.

More importantly, Howard doesn't explain anywhere in his summary how the data is aggregated, how many players are included in each group, what countries are included in each group, how any individual federations rated, or why this particular graph was chosen out of the various studies or various number-of-games controls Howard seems to have run.

Howard singles out only Vietnam and Georgia as countries with high female participation in the text of the article. Except when I downloaded the April, 2015 rating list, the difference between the average male rating and the average female rating in Vietnam (94 points) was significantly lower than the difference worldwide (153 points). And Georgia (35 points) had one of the smallest gender rating gaps in the world. I don't have data on the number of games played to check what happens when you include that control, but as I wrote in the previous post, I am skeptical that that could possibly cause the rating gap for Georgia or Vietnam to suddenly jump above average.

What countries with high (25+%) female participation rate among FIDE-rated players had higher than average gender gaps? Ethiopia had a massive gap, with the average male rated 621 points higher than the average female. But there are only 30 Ethiopian players on the list, with just 9 females. Most of the other countries with a high percentage of females on the rating list that had above-average rating gaps also had very few players.

Now, I don't think it is Ethiopia that is throwing off Howard's chart, because I don't think any of the female players from Ethiopia have played enough FIDE-rated games to qualify for Howard's cutoff, but I wonder if Howard's graph is simply weighting all federations equally when he aggregates the data. If I try to recreate something like Howard's chart with the April, 2015 rating data without any control for games played, then I do get a positive slope if I just take the simple average of each federation's rating gap. If I instead weight each federation's rating gap by the number of female players, so that, for example, Georgia with its hundreds of rated players gets more weight in the aggregate than Ethiopia with its 30, then I get a negative slope:

So it could be that Howard's graph is aggregating the data in a misleading way. I don't know for sure, but his results look a lot more like what I get when I aggregate the data in a misleading way. It is also possible that setting a control for players at 350 rated games played left relatively few players, and that after further splitting up the data into separate federations like this, there are simply not enough data points to get reliable results.
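To make the aggregation issue concrete, here is a minimal Python sketch of how a simple average and a player-weighted average can point in opposite directions. Every federation and number in it is invented for illustration (none of it comes from the FIDE list or from Howard's data), and the function name `aggregate_gaps` is my own.

```python
from collections import defaultdict

# (federation, participation bin, rated female players, male-minus-female gap)
# Every number here is invented for illustration; none comes from FIDE data.
FEDERATIONS = [
    ("A", "low", 500, 160),
    ("B", "low", 300, 150),
    ("C", "high", 600, 40),
    ("D", "high", 9, 620),
    ("E", "high", 12, 300),
]


def aggregate_gaps(federations):
    """Return {bin: (simple_mean, female_weighted_mean)} of the rating gaps."""
    by_bin = defaultdict(list)
    for _name, bin_label, n_female, gap in federations:
        by_bin[bin_label].append((n_female, gap))

    result = {}
    for bin_label, rows in by_bin.items():
        # Simple mean: every federation counts equally, however few players it has.
        simple = sum(gap for _, gap in rows) / len(rows)
        # Weighted mean: each federation counts in proportion to its female players.
        weighted = sum(n * gap for n, gap in rows) / sum(n for n, _ in rows)
        result[bin_label] = (simple, weighted)
    return result


summary = aggregate_gaps(FEDERATIONS)
for bin_label in ("low", "high"):
    simple, weighted = summary[bin_label]
    print(f"{bin_label:>4}: simple mean = {simple:.1f}, weighted mean = {weighted:.1f}")
```

In this made-up data, two tiny federations with huge gaps drag the simple average of the high-participation bin above the low-participation bin, while the weighted average falls well below it because the one large federation dominates.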

It is definitely misleading for Howard to highlight Georgia as his prime example of a federation that encourages female participation while he is showing that these countries have a larger gender gap, because Georgia definitely has a smaller than average gender gap. The following line in particular sounds suspicious:

"I also tackled the participation rate hypothesis by replicating a variety of studies with players from Georgia, where women are strongly encouraged to play chess and the female FIDE participation rate is high at over 30%. The overall results were much the same as with the entire FIDE list, but sometimes not quite as pronounced."

This is right after the graph showing that the gender gap goes up as female participation increases, and right after he singled out only Georgia and Vietnam as examples of countries included in that graph. Howard finds that the gender gap is actually lower in Georgia ("sometimes not quite as pronounced"), but he completely downplays this finding and neglects to report any quantitative representation showing how the results were less pronounced. It is no wonder that readers like Nigel Short got completely the wrong impression of Howard's results, as when Short summarized this graph in the following manner:

"Howard debunks this by showing that in countries like Georgia, where female participation is substantially higher than average, the gender gap actually increases – which is, of course, the exact opposite of what one would expect were the participatory hypothesis true."

I found this review of the full paper written by Australian grandmaster David Smerdon. Smerdon's review gives a very different impression of Howard's work than Howard's own Chessbase summary. For example, in reference to the Georgia data and Short's interpretation:

"I don’t know what Short is referring to here, because there is nothing in the Howard article that suggests this. Figure 1 of the study shows that the gender gap is, and has always been, lower in Georgia than in the rest of the world for the subsamples tested (top 10 and top 50). Short may be referring to Figure 2, which, to be fair, probably shouldn’t have been included in the final paper. It looks at the gender gap as the number of games increases, but on the previous page of the article, Howard himself acknowledges that accounting for number of games played supports the participation hypothesis at all levels except the very extreme."

And later, summarizing Howard's research on the gender gap in Georgia:

"...This supports a nurture argument to the gender gap, but again, the sample size is too small for anything definitive to be concluded."

This sounds like it is describing completely different research from Howard's Chessbase article. While Short definitely did not do himself or the gender discussion any favours with his interpretation, neither does Howard do his research justice with his published summary.

Gender in Chess PART 3: CAUSE AND EFFECT, THE BILALIĆ, SMALLBONE, MCLEOD AND GOBET STUDY

2015-05-29T14:49:00.002-07:00

The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.

Previous:
INTRO
PART 1: MEASURING THE GENDER GAP
PART 2: ELO RATINGS

Howard criticizes a 2009 study published by the British Royal Society that found support for the participation hypothesis--that there are fewer elite female chess players simply because there are fewer female chess players overall. The Bilalić, et al study looked at the top 100 rated male and female players in the German federation and compared the distribution of their ratings to an expected distribution based on the overall participation rates by gender. The observed gender gap was close to what was expected from the overall participation rates.
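The intuition behind the participation hypothesis can be sketched with a small simulation: draw two groups of very different sizes from the same rating distribution and compare the averages of each group's top 100. This is only a Monte Carlo stand-in, not the method Bilalić, et al actually used, and the group sizes, mean, and standard deviation are all assumptions chosen for illustration.

```python
import heapq
import random


def top_n_mean(values, n):
    """Average of the n largest values."""
    return sum(heapq.nlargest(n, values)) / n


def expected_top_gap(n_large, n_small, top_n=100, trials=20,
                     mean=1500.0, sd=300.0, seed=0):
    """Average top-n gap between two groups drawn from the SAME distribution.

    The group sizes, mean, and sd are hypothetical inputs, not values
    taken from the German federation data.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        large = [rng.gauss(mean, sd) for _ in range(n_large)]
        small = [rng.gauss(mean, sd) for _ in range(n_small)]
        total += top_n_mean(large, top_n) - top_n_mean(small, top_n)
    return total / trials


# Hypothetical 16:1 participation ratio with identical underlying ability.
gap = expected_top_gap(n_large=16000, n_small=1000)
print(f"Expected top-100 gap from participation alone: {gap:.0f} points")
```

Even though both groups have identical ability by construction, the larger group's top 100 comes out well ahead, simply because more draws reach further into the tail.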

Howard issues three main criticisms of the study:

1) It is too difficult to determine cause and effect from their data.
2) They didn't control for the number of rated games played.
3) The study relies on data from only the German Federation and thus could simply be a sample size fluke.

CAUSE/EFFECT

Howard argues that showing that the gender gap is in line with what we would expect from participation rates is not enough to establish participation rates as the cause of the gender gap. However, Howard himself does no better in establishing a cause/effect relationship between the gender gap and his hypothesis that men are innately more talented at chess.

Howard supports his claim with data showing that the rating difference between the top male and female players has remained relatively constant over the years, which he assumes means the gender gap has not closed (which is probably incorrect). He then assumes that if there were non-biological causes behind the gender gap, the gap must have diminished over the past several decades as feminism has advanced in many developed countries, and if it hasn't then that means there is likely a biological cause.

But he doesn't provide any more support for this assumption than Bilalić, et al do for theirs. Several areas in sports that should be unaffected by the physical differences between males and females, such as coaching, general management, and officiating positions, have seen little to no progress in gender disparity over that same time span in spite of any general advances in society. It is not a given that a lack of significant progress means that gender disparity is due to natural talent.

I think Howard overestimates his evidence of a causal relationship in part because he underestimates the "gatekeeper" effect in chess. In his 2005 paper, he gives this as an important factor in testing his hypothesis:

"Adequately testing the evolutionary psychology view, that the achievement differences at least partly are due to ability differences, requires a domain with very special characteristics. First, it should be a complete meritocracy with no influence of gatekeepers, in which talent of either gender can rise readily."

Howard relies on the assumption that chess is close to a complete meritocracy because most tournaments are open* and results are based on your performance. Howard contrasts this to fields like science, where decision-makers control access to resources and could be susceptible to bias:

"In most domains, gatekeepers control resources needed for high achievement and may run an ‘old boy’s network’ favouring males. In science, for instance, gatekeepers distribute graduate school places, jobs, research grants, and journal and laboratory space."

*Most major tournaments involving the top players are actually not open, but invitational. Most tournaments below the elite level are open, however.

The absence of decision-makers with the ability to deny players access to tournaments does not mean there are no gatekeeper forces at work, however. There are other forces that can have just as strong an effect. WIM Sabrina Chevannes gives some examples of social pressures (under the section "My thoughts on sexism in chess") that commonly make women feel unwelcome or uncomfortable at predominantly male tournaments, ranging from belittling remarks to flat-out harassment.

These problems are driving established female players away from the game, but they can also be important for young players getting into the game. Most grandmasters start chess at a young age, and research backs the idea that starting age is an important factor in chess mastery (full paper), both because starting earlier allows for greater total accumulation of practice, and because chess likely has a "critical period" effect for learning (the same effect that makes it much easier for a young child to learn a language than an adult).

This means that even subtle effects, such as a parent being more likely to teach the game to male children at a young age, or young males being more attracted to the social environment of a predominantly male local club, can have a significant gatekeeper effect. Things like age of exposure to chess, access to high-level coaching and competition, and social compatibility with existing chess culture are all important factors in developing a player's ability.

This is probably why we see strong chess countries like Russia or other former Soviet nations consistently dominating chess, even though they probably don't have any biological ability advantage. The more children who are exposed to favourable learning criteria, the more high-level chess players a population will produce. Just like these factors help keep the strongest federations on top, they could conceivably favour male players over female players.

Chevannes also points out more explicit gatekeeper behaviour, such as limited access to funding and coaching for England's women's Olympiad team ("Effects of sexism in English Chess"). Several countries provide state-funding or private grants for chess development, similar to the type of gatekeeper influences Howard describes in science. For example, the USCF has the Samford Chess Fellowship, a private grant currently for $42,000, which has been awarded annually since 1987. Thirty of the 32 recipients (three years the grant was split between two recipients) have been male.

And, as mentioned earlier, most of the top tournaments are actually invitational, which also fits Howard's criteria for gatekeeper influence. The potential gatekeeper effect of invitational tournaments preserving rating gaps is even something players have complained about: when the top tournaments only hand out invitations to the same group of top-rated players, those players just end up trading rating points among themselves, which leaves little opportunity for them to give rating points back to the rest of the field.

These factors are incredibly difficult to measure and separate out from your data, which is why Howard considers the absence of such factors essential to test his hypothesis. By ignoring these factors, Howard strongly inflates his evidence in support of a biological cause. In fact, this is a common criticism of the entire field of evolutionary psychology which Howard uses to approach this question: its hypotheses about cause and effect are so difficult to properly test, it is debatable whether it actually qualifies as science.

NUMBER OF GAMES CONTROL

As discussed in the previous post, I don't think controlling for the number of rated games played adequately separates out the effect of practice and development from that of natural talent. More importantly, though, Howard's criticism here is confusing because he only describes the importance of controlling for number of games as something that could avoid a potential bias against female players. Females tend to play far fewer rated games on average, and a player's rating tends to increase the more games they play.

In order for this criticism to be relevant to the Bilalić study, omitting this control would have to bias the results in favour of female players. Howard offers no reasoning as to why this would be the case, and it is not at all obvious how it could be. Howard's own data appears to show a decreased gender gap after controlling for number of games.

NOT ENOUGH DATA POINTS

While Bilalić, et al did only look at players from the German federation, they compared ratings for the top 100 players of each gender. In Howard's original study, he included players from all federations, but still only compared the ratings of the top 10, 50, and 100 players of each gender, so he was not actually using any more data points than the study he is criticizing.

Just as importantly, Bilalić, et al actually had a reason for using data from just one federation rather than FIDE data, as outlined in a later paper by Bilalić, Nemanja Vaci, and Bartosz Gula. FIDE rating data is limited to only above-average players and omits a lot of data from developing or below-average players. Rating data from individual federations can allow for a more comprehensive view of the population, such as a better estimation of overall participation rates, which was necessary for their study.

In Howard's summary article, he refutes the Bilalić study by showing data from more federations, but he doesn't actually repeat their study to create a comparison to their work. Instead, he just shows aggregated data with no indication of how many players were included or how the data was aggregated. It is not clear that he actually used more data points to draw his conclusion than Bilalić, et al used, only that he looked at players from multiple federations.

Howard's most recent study is behind a paywall, so unfortunately all I have to go by is his summary published on Chessbase.com. I assume there are more details in the full study, but it is impossible to tell how his data really compares with the data from the Bilalić study from what he published in the summary, which is largely written as a refutation to the Bilalić study.

NEXT:
PART 4: MISREPRESENTING THE DATA

Gender in Chess PART 2: ELO RATINGS

2015-05-29T14:48:00.003-07:00

The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.

Previous:
Intro
PART 1: MEASURING THE GENDER GAP

We saw previously that the lack of significant change in the rating gap between the top male and female players can actually be evidence that the gender gap in chess has diminished over the past few decades. That is not the only potential interpretation problem with Howard's conclusion, though. Even if control groups hadn't indicated that the Elo gap should be increasing absent any closing of the gender gap, it could still be conceivable that the gender gap has in fact diminished.

This is because Elo ratings are not indicators of absolute playing strength, only of strength relative to the field of rated players. In other words, a 2500 Elo rating among one group of players is not necessarily equivalent to a 2500 rating among another group of players. For example, Hikaru Nakamura's FIDE rating for April, 2015 is 2798. His USCF rating is 2881. Both are Elo ratings, but because they are tracked among different pools of players, they don't have to match up even though they are both describing the strength of the exact same player.

Howard is looking at the same FIDE rating for both male and female players, though, so this shouldn't be a problem, right? Possibly, but we don't know for sure.

Elo ratings work by taking points away from one player and giving them to the other each time a game is played. If a player has a true playing strength of 2500 but is rated at 2400, then they would be expected to take points from their opponents until their rating matches their playing strength. Likewise, a player who is overrated will give points back to the field until their rating returns to their ability.
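The point-transfer mechanism can be sketched in a few lines of Python using the standard logistic Elo expectation. The K-factor of 10 is an assumption for illustration; FIDE's actual K-factors vary by player, so real transfers are not always exactly zero-sum. The helper `expected_gain` is a hypothetical name of my own.

```python
def expected_score(rating, opponent_rating):
    """Standard logistic Elo expectation (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update(rating_a, rating_b, score_a, k=10):
    """Return both players' new ratings after one game.

    With the same k for both players (an assumption here), whatever
    A gains is exactly what B loses.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def expected_gain(rating, true_strength, opponent_rating, k=10):
    """Average rating change per game for a player whose rating differs
    from their true strength, against a correctly rated opponent."""
    actual = expected_score(true_strength, opponent_rating)
    predicted = expected_score(rating, opponent_rating)
    return k * (actual - predicted)


# Two equally rated players: the winner takes 5 points from the loser.
print(update(2400, 2400, 1.0))  # (2405.0, 2395.0)

# A 2500-strength player rated 2400 gains points on average...
print(expected_gain(2400, 2500, 2400))
# ...and stops gaining once the rating matches the playing strength.
print(expected_gain(2500, 2500, 2400))  # 0.0
```

The last two calls show the self-correcting behaviour described above: the expected gain is positive while the player is underrated and drops to zero once rating and strength agree.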

Many top female players play predominantly or exclusively in women's events. And some of the top female players who play in open events, such as Judit and Susan Polgar when they were still active, rarely play women's events at all, and as a result rarely play against other women. Because of this, if males or females are over- or underrated as a group, there might not be enough games between the two groups to transfer the necessary rating points to bring them back in line. It is possible that female players and male players form two sufficiently isolated player pools that their ratings are not necessarily comparable.
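A rough way to see why isolation matters is a mean-field sketch: two pools of equal true strength, one rated some number of points too low, with only a fraction of games crossing between the pools. This is a deliberately simplified model of my own, not anything from FIDE or Howard; the K-factor, the 100-point offset, the tolerance, and the cross-play fractions are all assumptions.

```python
def expected_score(rating_diff):
    """Rating-implied score for a player rated `rating_diff` above the opponent."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))


def games_until_synced(initial_offset, cross_fraction, k=10, tolerance=10.0):
    """Average games per player before an underrated pool catches up.

    Simplified expected-value model (an assumption, not a real rating system):
    both pools have equal true strength, pool B starts `initial_offset`
    points too low, and only `cross_fraction` of games are between pools.
    """
    if not 0 < cross_fraction <= 1:
        raise ValueError("cross_fraction must be in (0, 1]")
    offset = initial_offset
    games = 0
    while offset > tolerance:
        games += 1
        # Pool B really scores 0.5 in cross-pool games, but its lower
        # ratings predict less, so it gains points on average.
        gain_per_cross_game = k * (0.5 - expected_score(-offset))
        # Pool B gains what pool A loses, so the offset shrinks by twice
        # the gain, scaled by how often a game crosses between pools.
        offset -= 2 * gain_per_cross_game * cross_fraction
    return games


for fraction in (0.5, 0.05):
    n = games_until_synced(100, fraction)
    print(f"cross-pool share {fraction:.0%}: about {n} games per player")
```

In this model, within-pool games move no points between the pools on average, so the less the two pools mix, the longer a group-wide offset survives.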

This might sound far-fetched, but it is actually a known problem and has occurred before. In 1987, FIDE commissioned a study comparing the performance of top female players against men to their performance against other women because of this exact issue. The six women who had played a sufficient number of games against both genders over the mid-80s to qualify for the study all held significantly higher performance ratings against male opponents than female opponents--on average more than 100 points higher.

This suggested that, for example, a 2400 rated female player was likely stronger than a 2400 male player. To compensate, FIDE added 100 points to all rated female players (except Susan Polgar*) in order to bring their ratings in line with the male ratings. It is possible that the two pools of players have remained isolated enough to drift out of sync again over the last few decades, however.

*The reasoning was that Polgar already played mostly within the male pool of players and didn't need the adjustment. However, the decision to give the full 100 points to her top rivals, who also played a significant number of games against men, and 0 points to Polgar was nonsensical and controversial, and there were accusations that FIDE was deliberately manipulating the ratings to place Maya Chiburdanidze in the #1 spot ahead of Polgar.

Most people who follow chess believe that some form of inflation exists in the ratings. In other words, they believe a 2800 rating is not as strong now, when there are a handful of players hovering around that level, as it was when Garry Kasparov first achieved it back in 1990 and Anatoly Karpov was the only other player over 2700, or in 1972 when Bobby Fischer topped the ratings list by over 100 points at 2785.

The mechanism of inflation is not well understood, however, and it is not clear that it would necessarily have had the same effect on a fairly isolated pool of female players as on the population as a whole. It could be that after nearly 30 years, male ratings have inflated faster than female ratings, and we have once again reached a point where female players as a whole are underrated.

Howard himself notes another potential interpretation problem with using FIDE ratings to measure the gender gap in chess:

"I found that women typically play many fewer FIDE-rated games than males, only about one third of the number on average. Now, the usual learning curve for chess players is a progressive ascent to a peak at around 750 FIDE-rated games. ... Comparing modestly- and highly-practiced individuals can be misleading. Studies should control for differences in number of games played, either by equating males and females on this or by examining differences at the typical rating peak at around 750 games."

Howard then dismisses this explanation because even after controlling for the number of rated games played, males still had higher ratings.

The number of FIDE-rated games played itself isn't really what we care about, though. It's just a proxy for "modestly-practiced", "highly-practiced", etc. Players who have played more games should, in general, have more experience and further development. Games played are far from a perfect indicator of a player's level of development or experience, however.

Most obviously, not all games are FIDE-rated. While top-rated players do for the most part compete exclusively in FIDE-rated events, that is not true for developing players. For example, U.S. prodigy Sam Sevian has played 539 FIDE-rated games as of April, 2015. He's played 922 USCF-rated games. Even ignoring casual and club games, that is hundreds of competitive games that are not in Howard's data (and it is more than the difference between 922 and 539, because not all FIDE-rated games are USCF-rated).

The amount of study devoted to chess outside of rated games is also a huge factor in development. Someone who is devoted to studying chess full-time will develop much more than someone who competes as a casual hobby, even if you control for the number of rated games played. Likewise, someone who competes fairly regularly and reaches 750 games in their 20s is different from someone who competes less frequently and reaches 750 games in their 40s or 50s (or even later). The former is probably much more likely to still be ascending and hitting their peak at that point, while the latter likely peaked or plateaued at a much lower number of games, and would probably have begun declining with age by the time they reached 750 games.

It's easy to see how two players can be at vastly different stages of development even after the same number of games played. Howard isn't comparing two individual players, though--he is comparing two groups of players (male and female). As long as you look at enough players in each group, shouldn't those other factors start to even out?

Ideally, they should. If there is a bias that applies to the group as a whole, though, that won't happen. For example, if female players tend to begin playing FIDE-events at an earlier stage in their development, or if they tend to compete less frequently than male competitors, that would introduce a bias that won't even out.

Many female players compete predominantly in female-only events, which are less frequent than open-gender events. And because these female-only events draw from a much smaller segment of the chess-playing population than the open-gender events, they also tend to be less intimidating for less-experienced players to enter. So there is a good chance this bias does exist.

In fact, Howard's data supports this. His 2005 paper includes a table summarizing males and females who entered the rating list between 1985 and 1989, and shows that the median age at which females first appeared on the list was about five years younger than the median age for males (though for top 100 females, it was only about 6 months younger than for top 100 males). And, in spite of entering the rating list at a younger age, the females on average still played significantly fewer games in their competitive careers.

Howard hits on an important idea about a player's rating being reflective of both their innate abilities and their level of practice and development. In order to test for the effects of innate abilities alone, as Howard sets out to do, he realizes that he needs to strip out the effects of development. However, this is a much more complicated issue than Howard acknowledges, and simply controlling for the number of rated games played is not adequate to make the assumption that any remaining differences must reflect natural ability.

NEXT:
PART 3: CAUSE AND EFFECT, THE BILALIĆ, SMALLBONE, MCLEOD AND GOBET STUDY
PART 4: MISREPRESENTING THE DATA

Gender in Chess PART 1: MEASURING THE GENDER GAP

2015-05-29T14:42:00.000-07:00

The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.

Previous:
INTRO

Howard begins by revisiting a 2005 paper he published on the same topic showing the gap between the average Elo rating of the top 50 male players and the top 50 female players:

Howard then argues that because the Elo gap has remained relatively constant in spite of societal changes over that time period, the difference between the male and female ratings is not due to societal factors and is at least partially biologically-based.

This finding is likely surprising to most people in chess. For example, the legendary Garry Kasparov, who early in his career expressed a somewhat Fischer-esque dismissal of female chess talent, grew to greatly respect the Polgar sisters (one of whom has defeated Kasparov himself) and felt they broke new ground for female players. In a recent interview at an exhibition match with Short in St. Louis, Kasparov rejected the claim that the gender gap has not closed. Even Short himself wrote that he had assumed the gap had closed somewhat before reading Howard's article.

Howard acknowledges this prior expectation in his 2005 paper:

"Anecdotally at least, there has been some convergence in chess at top levels. For example, there are more female grandmasters. Judit Polgar, born in 1976 and the strongest-ever female player, regularly wins tournaments against top male competition and several times has made the top ten players list. She once held the record for youngest-ever grandmaster. But, the extent of gender differences and their trends over time have never been quantified."

After quantifying the Elo difference, though, Howard simply assumes that the difference remaining flat means there has been no closing of the gender gap. This might seem like a reasonable assumption, but it lacks an important step: he has no control group to help interpret his results.

***

Computers have revolutionized how chess is played and studied at the top level. With the help of computer engines that are much stronger than any human grandmaster, known opening lines are constantly being analyzed more and more thoroughly. The more thoroughly these lines are known, the more important it is for players to memorize them, and the deeper they have to look for new ideas that could lead to a winning position. Strong grandmasters spend most of their time studying and developing these lines.

Former World Champion Vladimir Kramnik (born 1975, reached grandmaster 1991) said at his most recent tournament that top players have to work much harder now than when his career was starting. However, only the very top players can support themselves studying chess and competing full time. Most grandmasters, let alone lower titled or untitled players, don't have the time to keep up with all of these advances.

It is possible that this has led to the top players distancing themselves from the field. If that is the case, then, absent any closing of the gender gap, we would expect the Elo gap between the top 50 males and the top 50 females to have grown over time, just because there are more males in the group at the very top that is pulling away from everyone else. We need some kind of control group to compare to in order to help us interpret Howard's graph before we conclude the gender gap has not closed.

One way to do this is to compare breakdowns other than the top 50 females vs. the top 50 males. For example, what if we take the top 50 Russian players, and compare them to the top 50 non-Russian players?

The top 50 Russians in the April 2015 FIDE rating list have an average Elo rating of 2659. The top 50 players from outside Russia are at 2726. So the top 50 Russian players are 67 points below the non-Russians.

If we go back to 1991 (the first year the Soviet federations were listed separately--it would be impossible to make comparisons before that because the USSR included many strong players from outside Russia), the top 50 Russians were 54 points behind the top non-Russians. So the gap has grown a bit in the last couple decades, in spite of the fact that Russia remains by far the top federation.
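For anyone who wants to repeat this kind of comparison, the calculation is just the difference between two top-N averages. Below is a minimal Python sketch; the function names and the tiny player list are hypothetical, and a real run would load a full FIDE rating list and use n=50.

```python
def top_n_mean(ratings, n):
    """Average of the n highest ratings."""
    top = sorted(ratings, reverse=True)[:n]
    return sum(top) / len(top)


def gap_behind_rest(players, federation, n=50):
    """How many points the federation's top n trail the top n from everywhere else.

    `players` is a list of (federation_code, rating) pairs. A positive
    result means the federation is behind the rest of the world.
    """
    inside = [rating for fed, rating in players if fed == federation]
    outside = [rating for fed, rating in players if fed != federation]
    if len(inside) < n or len(outside) < n:
        raise ValueError("not enough rated players for a top-n comparison")
    return top_n_mean(outside, n) - top_n_mean(inside, n)


# Invented ratings, small enough to check by hand (n=2 instead of 50).
PLAYERS = [
    ("RUS", 2700), ("RUS", 2650), ("RUS", 2600),
    ("USA", 2750), ("NOR", 2800), ("CHN", 2690),
]
print(gap_behind_rest(PLAYERS, "RUS", n=2))  # 100.0
```

With the April 2015 averages quoted above, the same subtraction is 2726 - 2659 = 67 points.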

Of course, you might be able to make a case that Russia is a bit weaker than it was in the early 90s when Kasparov and Karpov were still dominating chess. Except here's the thing: when we compare Russia to the rest of the world, Russia has lost ground. But if we instead compare Russia to each individual federation, they have actually gained ground over most of them. This seems paradoxical, but it makes sense if the top end of the spectrum is stretching itself out.
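A toy example may help show that this is not actually a contradiction. In the Python sketch below, every rating is invented: the best player in each federation pulls away from the field between the two snapshots, and the home federation "R" has a little more depth at the top than the others. The federation labels and the function `gaps` are hypothetical, and n=3 stands in for the top 50.

```python
def top_n_mean(ratings, n):
    """Average of the n highest ratings."""
    top = sorted(ratings, reverse=True)[:n]
    return sum(top) / len(top)


def gaps(federations, home, n):
    """Return (points `home` trails the rest of the world,
    {federation: points `home` leads that federation})."""
    home_mean = top_n_mean(federations[home], n)
    rest = [r for fed, rs in federations.items() if fed != home for r in rs]
    behind_world = top_n_mean(rest, n) - home_mean
    leads = {
        fed: home_mean - top_n_mean(rs, n)
        for fed, rs in federations.items()
        if fed != home
    }
    return behind_world, leads


# Invented ratings. Between snapshots, only the players at the very top gain.
EARLY = {
    "R": [2700, 2650, 2600],
    "X": [2720, 2500, 2450],
    "Y": [2710, 2520, 2440],
    "Z": [2705, 2510, 2460],
}
LATE = {
    "R": [2780, 2690, 2600],
    "X": [2800, 2500, 2450],
    "Y": [2790, 2520, 2440],
    "Z": [2785, 2510, 2460],
}

for label, snapshot in (("early", EARLY), ("late", LATE)):
    behind, leads = gaps(snapshot, "R", n=3)
    lead_text = ", ".join(f"{fed}: {lead:.1f}" for fed, lead in leads.items())
    print(f"{label}: behind world by {behind:.1f}; leads over {lead_text}")
```

The rest-of-the-world group is made up entirely of each federation's single best player, so it captures all of the stretching at the top, while each individual federation's average is diluted by its weaker players. That lets "R" fall further behind the world while extending its lead over every individual federation.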

Let's take a look at some of these other countries.

The U.S. is experiencing something of a golden age for chess right now. They currently have two of the top ten players in the world. Hikaru Nakamura, the best American player since Bobby Fischer*, has been as high as #2 in the world in the live rankings this year, and recently became the first American to hit 2800 Elo. Increased funding and efforts in development programs have produced some remarkable young talent, including Sam Sevian, who in 2014 became the sixth-youngest grandmaster ever at 13 years old.

*at least not counting Fabiano Caruana, who has spent most of the last year as the #2 player in the world--Caruana was born in the U.S. but moved to Europe at age 13 and has represented Italy for his professional career

The emergence of serious collegiate chess teams has also attracted strong talent from around the world to the U.S. For example, five of the twelve competitors in the open-gender division of the 2015 U.S. National Championship (and at least that many from the Women's division) had originally competed under a different national federation before transferring to the USCF, including world #7 Wesley So. Likely influenced by the emergence of American chess, the aforementioned Caruana recently announced that he is transferring back to the USCF.

You would be hard pressed to argue that the U.S. federation is weaker now than in 1991, and certainly not much weaker. Yet in 1991, the top 50 American players were 105 points behind the top 50 non-Americans. Now, they're 185 points back.

What about Norway, the home of current World Champion and clear #1 Magnus Carlsen? Carlsen has sparked a chess craze in Norway, where tournaments now get national TV coverage. Norway hosts one of the top chess tournaments in the world (Norway Chess) and last year hosted the Chess Olympiad. The number of Norwegians in FIDE's published rating list grew from 92 in 1991 to 1306 this year.

The gap between the top 50 Norwegian players and the rest of the world grew from 289 points in 1991 to 337 points in 2015.

Not all federations saw their gap increase. China, for example, has without a doubt become much stronger in chess since 1991. Chess has had difficulty catching on in China due to the prevalence of xiangqi, China's native chess variant, and go, another popular strategy game. Chess was even outlawed for a period in the 1960s and '70s as part of Chairman Mao's Cultural Revolution. Starting in the 1970s, however, China began pouring an increasing amount of funding and effort into growing its chess program.

This has ramped up in recent years, and China has finally emerged as a world chess power. Their women's team has won gold in four of the nine Chess Olympiads held since 1998 and three of the five World Team Championships since a women's division was created in 2007. The open-gender team won gold in the 2014 Olympiad and the 2015 World Team Championships. Their top 50 went from 329 points back of the world in 1991 to 207 points back in 2015.

Still, the vast majority of federations saw increases. Here are the Elo gaps for each of the 38 federations that had at least 50 FIDE-rated players in both 1991 and 2015:

Only 5 of the 38 federations closed the Elo gap at all, and on average the gap grew by 54 points.

When we look at the individual federations as control groups, we see evidence that the top really is separating itself further away from the field as time goes on. In spite of that, Howard's graph shows that women actually closed the Elo gap by a small amount. This can be interpreted as evidence that the gender gap is in fact closing, because it is offsetting the effect we are seeing with the national federations.

It is tempting to see evidence that supports your hypothesis in a vacuum, such as the relatively constant Elo gap between male and female players over the years, and to stop there. It is also tempting to believe a variable you believe to be objective and unbiased, such as Elo ratings, is self-explanatory and needs no control group to interpret. However, this is a dangerous practice. Especially when your results run counter to what subject matter experts would expect, as this finding did, it is important to make sure you have the proper context to interpret your results before jumping to conclusions.

NEXT:
PART 2: ELO RATINGS
PART 3: CAUSE AND EFFECT, THE BILALIĆ, SMALLBONE, MCLEOD AND GOBET STUDY
PART 4: MISREPRESENTING THE DATA

THE GENDER GAP IN CHESS: A CASE STUDY IN STATISTICAL RESEARCH

2015-05-29T14:41:00.003-07:00

The following is an introduction to a series of posts about some of the difficulties with conducting and interpreting statistical research, with links to the rest of the series at the end of this post.

Bobby Fischer once said he could beat any woman in the world giving them knight odds* (the full quote, in true Fischer fashion, is worse). Mikhail Tal famously responded, "Fischer is Fischer, but a knight is a knight!"

*Knight odds means the player giving odds starts the game with one knight already off the board.

Tal was correct, of course. In 2008, a master player named John Meyer, rated 2284 (grandmasters are rated 2500+, with the top GMs well over 2700 or even 2800), played a match against the computer program Rybka with knight odds. By that time, computers had far surpassed humans in chess. Rybka could have easily defeated the world champion in a non-handicapped match. With knight odds, Meyer won the match 4-0. There were women in Fischer's generation much stronger than Meyer who would have had no problem beating Fischer given such a handicap.

Still, chess remains a largely male-dominated profession. Currently, there are just two women in the top 100 rated players in the world, and one (Judit Polgar) is retired and will fall out of the active rankings later this year. Theoretically, chess should be among the most gender-neutral competitive disciplines, but the overwhelming majority of players are male. In fact, the predominance of male players is so strong that the URL for FIDE's top 100 overall list actually ends with "...?list=men", even though there are women on the list.

The question of why this is and what can (or should) be done about it has long been a point of discussion in the game, but this discussion reached the mainstream media last month due to controversy over an article written by British Grandmaster Nigel Short in the magazine New in Chess.

If you don't know Short (and you probably don't unless you particularly follow chess or remember his highly publicized World Championship match with Garry Kasparov in 1993), he's...well he's not really the best representative to speak about anything, really. When asked to write an obituary in his newspaper chess column for fellow British Grandmaster Tony Miles, he pretended to write a proper obituary for a few paragraphs before descending into a long-winded rant about why he didn't like Tony Miles, culminating with the line "I obtained a measure of revenge not only by eclipsing Tony in terms of chess performance, but also by sleeping with his girlfriend, which was definitely satisfying but perhaps not entirely gentlemanly." Nigel Short, everyone.

So it's no surprise that Short set off some fuses when asked to write about this topic (by the time he gets to the part about how he has to "manoeuvre the car out of our narrow garage" for his wife, you kind of get the sense that he's just doing this on purpose--which, in a media environment where controversy equals views equals money, he may well be).

In the midst of his rambling, though, Short actually does cite an academic paper by Robert Howard (actually, a synopsis of the study posted by Howard to the chess website Chessbase.com):

"Nevertheless, my gut feeling was that female chess players are both stronger and more numerous than they were when I first began competing. The latter is certainly true, but an excellent article by the Australian Robert Howard on the chessbase.com website last year demonstrated that, despite the enormous societal changes over 40 years, the gap between the leading males and females has remained fairly constant at nearly 250 Elo points – a yawning chasm in ability. That women seem stronger has more to do with universally higher standards, due to the ubiquity of computers, than any closing of the gender gap."

Unfortunately, Short's citation comes with a clear agenda, as is evident in how he presents a second academic study which reached different conclusions:

"Howard also subtly critiques the most absurd theory to gain prominence in recent years, by Bilalić, Smallbone, McLeod and Gobet (which was submitted to the prestigious Royal Society, no less), that the rating sex difference is almost entirely attributable to participatory numbers (they comprise just 1% of the readership of this magazine). With the aid of a couple of bell curves this foursome neatly solve the eternal chess conundrum of why women lag behind their male counterparts, while simultaneously satisfying that irritating modern psychological urge to prove all of us, everywhere, are equal. Only a bunch of academics could come up with such a preposterous conclusion which flies in the face of observation, common sense and an enormous amount of empirical evidence too. Howard debunks this by showing that in countries like Georgia, where female participation is substantially higher than average, the gender gap actually increases – which is, of course, the exact opposite of what one would expect were the participatory hypothesis true."

The problem is partially that Short probably has no idea what the studies are doing (for example, Short seems unaware that Howard found the gender gap did decrease in Georgia compared to the rest of the world, makes up the term "enormous amount of empirical evidence" without justification, and I don't get the impression he's even read the Bilalić et al. study), but in this case, the blame doesn't lie entirely with Short. Howard's synopsis itself is largely responsible. It appears to misrepresent Howard's own work, as well as point to some potential critical issues with the study.

That being the case, I'd like to use this as an opportunity to cover some of the potential pitfalls in running this type of statistical analysis.

PART 1: MEASURING THE GENDER GAP
PART 2: ELO RATINGS
PART 3: CAUSE AND EFFECT, THE BILALIĆ, SMALLBONE, MCLEOD AND GOBET STUDY
PART 4: MISREPRESENTING THE DATA

Math Behind Projecting the Division Winner (THT Article)

2015-03-16T12:06:00.002-07:00

Note: this article uses examples from the free statistical software R

In my Hardball Times article about projecting the number of wins we expect from the division winner, I included the following example:

Instead of having five baseball teams, let's say we have five coins. All we are going to do is flip each coin 162 times. Each time a coin lands on heads, it gets a win, and each time it lands on tails, it gets a loss. The coin with the most wins after 162 flips wins the division.

How many wins would you project for the coin that ends up winning the division, whichever coin that might be?

No coin by itself is going to have an expected value of more than 81 wins, but it is extremely likely that at least one out of the five coins will end up with more than 81 wins just by chance. It turns out that if you repeat this experiment a bunch of times, the coin that wins the division will end up with about 88 wins, on average.

Hopefully this makes sense conceptually, but how do I get 88 wins (or, more precisely, 88.3943...)?

One way, of course, is to actually do what I said, and flip a bunch of coins over and over and over and record the results. Let's say I repeat this experiment 10 times, and I get the following results for the "division winners":

94, 85, 89, 87, 89, 90, 82, 86, 85, 86

That is an average of 87.3--pretty good, but obviously not the most precise estimate. We need to repeat the experiment more than ten times to make sure we get something closer to the true mean. Rather than spend hours upon hours flipping coins, we can actually cheat and get a computer to pretend to do it for us. This is called simulation, and it can be a very powerful statistical tool for determining probabilities, averages, distributions, etc that are not computationally obvious (full disclosure: I actually cheated and simulated the 10 seasons rather than record and tally 8000+ coin flips).

Now, let's simulate 1000 seasons: this time, we get 88.5940 wins leading the division, on average. Much better, but still a couple tenths off. Bumping the number of seasons up to 10,000, this time we get 88.4296. And if we keep simulating more and more seasons, we are going to start seeing the results stay clustered more and more closely around 88.3943.
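A minimal sketch of this kind of simulation in R might look like the following. The function name simulate_division_winner and the seed are arbitrary choices for illustration, and the averages it produces will not match the figures quoted above exactly, since the random draws are different.

```r
# Simulate the coin-flip division (a sketch; simulate_division_winner is a
# hypothetical helper name, not code from the original article)
simulate_division_winner <- function(seasons, teams = 5, n = 162, p = 0.5) {
  # one row per simulated season, one column per coin;
  # each entry is that coin's win total over n flips
  wins <- matrix(rbinom(seasons * teams, n, p), nrow = seasons, ncol = teams)
  # the division winner's total is the largest value in each row
  apply(wins, 1, max)
}

set.seed(1) # arbitrary seed, only so the run is repeatable
winners <- simulate_division_winner(10000)
mean(winners) # average wins for the division winner across simulated seasons
```

Raising the seasons argument should pull the average closer to the true mean, at the cost of a longer run time.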

So that's one way to estimate the expected win total for our division winner. How do I know that the results should cluster around 88.3943 specifically, though, other than simulating millions and millions of seasons?

We can get the answer without simulation by starting with a simpler question. What is the probability that none of the teams wins more than, for example, 81 games? The probability that one team wins no more than 81 games is a simple binomial distribution problem: pbinom(81,162,.5) ~ .5313. The probability that all five are at 81 or lower then becomes .5313^5 ~ .04233.

There is about a 4% chance that the division winner will have 81 or fewer wins. We can repeat that calculation for 80 wins, and we see that there is about a .02262 probability of the division winner having 80 or fewer wins. That means the probability of the division winner having exactly 81 wins is .04233 - .02262 = .01971.
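The arithmetic for a single win total can be checked directly in R. This is only a sketch: the helper name p_max_at_most is an illustrative label, and it assumes five independent teams with a .5 chance in every game, as in the coin example.

```r
# probability that the best of `teams` coins finishes at or below k wins
# (p_max_at_most is a hypothetical helper name)
p_max_at_most <- function(k, n = 162, p = 0.5, teams = 5) {
  pbinom(k, n, p)^teams
}

p81 <- p_max_at_most(81) # about .04233
p80 <- p_max_at_most(80) # about .02262
p_exactly_81 <- p81 - p80 # about .01971, winner finishes on exactly 81
p_exactly_81
```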

Then, we repeat that process for every number from 0 to 162, and we end up with a table of probabilities of the division winner ending up on each possible number of wins. (If you were to do this by hand, you could shortcut a bit by only going from something like 70 to 115 since the probabilities outside that range are all virtually zero anyway.)

Finally, we multiply each possible win total by the probability of the division winner finishing with that number of wins, and we add up the results to get a mean for the distribution. And doing that gives us 88.3943.

R CODE:

#calculate expected mean value of division winner
p <- .5 #probability of each team winning each game
n <- 162 #number of games per season
teams <- 5 #number of teams in the division

games <- 0:n # list of possible win totals (0:162)
p.list <- pbinom(games,n,p)^teams # p of div winner winning X games or fewer
wts <- c(p.list[1],diff(p.list)) # p of div winner winning exactly X games
sum(games*wts) # average wins by division winner

#RESULT
[1] 88.39431

As we can see, it is possible to calculate the mean of this distribution exactly, but it is still pretty cumbersome to do so without a computer. As such, let's discuss one final way to estimate this mean using simpler calculations.

First, we will need a continuous distribution, so we use a normal approximation for the binomial distribution. The mean of the normal distribution will just be 81 (the average number of wins we expect from a team in our example), and the standard deviation will be sqrt(npq) = sqrt(162*.5*.5) ~ 6.36.

All we need to do now is find the point where there is a 50% chance that five numbers randomly sampled from this distribution will all fall below that number. Start by finding the percentile of the distribution that fulfils this condition:

p^5 = .5
p = .5^(1/5) ~ 0.8706

This means we want the point at the 87.06th percentile (the 0.8706 quantile) of our normal distribution, which is simple to look up using an online tool or simple statistical software:

qnorm(0.8706,81,6.36) ~ 88.1849

That is our estimate for the expected number of wins from the division winner. This is slightly off because we are actually calculating the median and not the mean (and because we used a normal approximation, but that makes less difference), but it is still a pretty good estimate given the amount of calculation we simplified.
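For reference, the whole shortcut fits in a few lines of R. This sketch uses unrounded intermediate values, and the variable names are just illustrative.

```r
# normal-approximation shortcut for the division winner's median win total
n <- 162     # games per season
p <- 0.5     # chance of winning each game
teams <- 5   # teams in the division

mu <- n * p                     # mean wins for one team: 81
sigma <- sqrt(n * p * (1 - p))  # standard deviation: about 6.36
q <- 0.5^(1 / teams)            # quantile each team must fall below: about 0.8706

median_est <- qnorm(q, mu, sigma) # about 88.18
median_est
```

Changing teams or p shows how the estimate moves for bigger divisions or unevenly matched teams, with the same caveat that this is a median under a normal approximation, not the exact mean.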

More From Jesse Burkett (Hit Batsmen)

2015-02-25T05:00:00.001-07:00

From the same news archive binge as the previous article, we get more from Jesse Burkett on the NL's rule changes, and...holy crap. Apparently there was a (thankfully short) period in the NL where hitting a batter only awarded the batter a ball, not a base:

"That rule penalizing the pitcher with only a ball for hitting a batter is a very bad one," said the great hitter. "My word on it, some of those pitchers will be bounding fast ones off the batter's ribs this season."*

Well said, Mr. Burkett. I can see why that change didn't stick.

The article also insinuates that part of the reason Cy Young jumped to the AL (which did not lessen the penalty for hitting batters) was that he thought the new HBP rule was dumb.

*The St. Louis Republic. (St. Louis, Mo.), 29 March 1901. Chronicling America: Historic American Newspapers. Lib. of Congress.

NL Institutes Pitch Clock...in 1901

2015-02-25T04:23:00.000-07:00

With MLB's newfound interest in speeding up the pace of play, it's easy to forget that MLB rules actually had a pitch clock in place before this year. Granted, it was virtually never enforced (I think I saw an automatic ball called for a clock violation once, and no one knew what was going on when it happened), but the rule was technically there.

I had no idea just how far back that rule went, though, until I saw this quote from Hall of Famer Jesse Burkett while browsing through some old sports pages (scroll/zoom to the highlighted word at the bottom right corner of the page):

I have been reading how the rule limiting the pitcher to twenty seconds on the slab before throwing will handicap "Cup". That is only a National League rule, and "Cup" is in the American, where the rule is not in force.*

"Cup" here is George Cuppy, a longtime teammate of Burkett's who had just signed with the newly formed American League. I don't know if the rule was on the books continuously from 1901-present time, but apparently the idea of a 20-second limit on pitchers dates back at least that far.

*The St. Louis Republic. (St. Louis, Mo.), 27 March 1901. Chronicling America: Historic American Newspapers. Lib. of Congress.

Jeff Manship and the Denny Bautista-line

2015-02-12T11:21:00.001-07:00

Jeff Manship signed a minor league deal with Cleveland this past December. These are the sorts of deals the Jeff Manships of the world get. Manship has made two Opening Day rosters in his career--in 2011 and again in 2014--and he had to fight it out in Spring Training for both. In 2011, he made it to April 17, just 3.1 IP over 5 games, before getting sent down. Last year, he stayed up until July 23, but over a month of that time was spent on the DL.

No, there's nothing remarkable at all about Manship's contract with Cleveland. He's the type of player who is only even a free agent at all because no one wants to hand him an MLB roster spot, and he's out of options. What is remarkable is that Manship keeps making the Majors anyway, every single year. Since his first call-up in 2009, he has now spent time in the Majors for six years running. And in every single one, he's had an ERA above 5.00.

Denny Bautista was, in some ways, a rather un-Manship-like prospect. Manship was drafted in the 50th round out of high school, went to Notre Dame, and then signed as a 14th round pick three years later. In 2008, he climbed to #9 on Baseball America's list of the Twins' top ten prospects, only to drop back out of the list the following year. John Sickels, who also had Manship as the Twins' #9 prospect in 2008, had him at the back end of their top 20 each of the next two years.

Bautista, meanwhile, was signed as a 17-year-old out of the Dominican Republic. He was (and still is, presumably) the cousin of Ramon and Pedro Martinez. He twice (in 2002 and 2004) cracked Baseball America's top 100 prospect list, peaking at #59. At 21, an age where Manship was still finishing up his career at Notre Dame and just starting off in the Gulf Coast and Florida State Leagues, Bautista was already in the Majors.

In spite of their different pedigrees, Jeff Manship and Denny Bautista ended up as very similar pitchers: failed starters, journeyman relievers, shuttling up and down between cities like Minneapolis and Denver and Kansas City and cities like Rochester and Colorado Springs and Omaha. They are so similar, in fact, that Denny Bautista is the only other pitcher in Major League history to keep succeeding in precisely the same sub-par way that Manship has.

There have been other pitchers who kept putting up ERAs above 5.00 and kept getting shots in the Majors. A handful of them, including future Cy Young winner R.A. Dickey in his pre-knuckleball days, have even had ERAs above 5.00 in each of their first six seasons. None of them, other than Manship and Bautista, have kept getting back to the Majors every single year, though. There has always been a year or two in between somewhere where they languished in the minors without getting the call.

Of course, Dickey aside, there isn't much hope for success for this kind of pitcher. Kevin Jarvis somehow managed to stick around another six seasons and pitch past his 37th birthday after putting up 5+ ERAs in each of his first six years, but he was just as ineffective in those final six years as in the first six (140 ERA-/124 FIP- in his first six seasons, 135 ERA-/125 FIP- in his final six). Everyone else disappeared pretty quickly.

As for Bautista, he did get a seventh year. It was actually his best, at least by ERA. He finally broke the 5.00 barrier and posted a 3.27 ERA in 2010, good for right about average for a reliever. However, in a rather cruel statement about age and the ticking clock on failed prospects, this of all years was finally the year that failed to earn him another shot in the Majors. The following June, he was released from Seattle's system and wound up pitching in Korea. He's still around--he pitched in the Mexican league last year--but he hasn't been back in affiliated pro ball since.

It's not like you need any careful analysis to know that the outlook is not good for Manship's career, though. I mean, he's a guy who has thrown 139.1 innings over the past six years with a 6.46 ERA and just got released by his third team in three years. It's interesting, though, don't you think? That he keeps finding his way back, year after year? That even when you split his career ERA into 20-30 inning chunks (or 3.1 inning chunks, as was the case in 2011), they all still come in over 5.00? Even the progression is interesting: his 6.65 ERA was actually the third straight season his ERA dropped from the year before (Manship's ERAs from 2011-2014: 8.10, 7.89, 7.04, 6.65). And he could go right on dropping the ERA again and again for years to come, and still not be any good. That's amazing, in its own way.

If he ends up above 5.00 again this year, he would be the first pitcher ever to pitch in seven different MLB seasons and post an ERA that high in every one. Here's something else interesting, though: Steamer actually projects him for a 4.38 ERA this year. That's...that's less than 5.00! By a pretty fair amount!

When you think about it, the projection actually makes sense. Even with ERAs consistently north of replacement level, teams have to be projecting him for something below 5.00, or they wouldn't bother calling him up. And his fielding-independent numbers are actually...well, they're not good, but they're a lot better than his ERAs. So there is a pretty good reason to believe he can break the Bautista-barrier if he finds his way back to the Majors this year.

Even so, every year that passes, Manship's career is on thinner and thinner ice. It has to be--look what happened to Bautista, whose 3.27 ERA in year seven couldn't even save his career. In all likelihood, if he gets another shot, he probably will have the best ERA of his career, but there is a very real chance that it would still be his last year anyway. Heck, there is a very real chance we've seen the last of Jeff Manship in the Majors already. That is, of course, unless he starts working on his knuckleball.

Edit: Manship did make it back to the Majors in 2015, and posted an 0.92 ERA in 39.1 IP with Cleveland. Well done, Jeff!

(Possibly) the First Baseball Article I Ever Published

2015-02-10T11:46:00.000-07:00

I was digging through some stuff the other day and came across an old newspaper from college that had what might be the first baseball article I ever published. It's not really anything in-depth or analytical--just a short opinion piece on the Bagwell contract situation that was in the news at the time. I think my writing style has definitely evolved since then, but it was interesting to see something I wrote so early in my development. Anyway, here's the article:

Bagwell article.PDF

Effects of Playing the Sun Field on OF Putouts per BIP

2013-11-01T02:49:00.001-07:00

I recently looked into how playing the sun field affects an outfielder's defensive performance. I was inspired by Craig Wright's discovery that Babe Ruth regularly switched between left and right field throughout his career to avoid playing the sun field, as I wanted to know what kind of effect this knowledge would have on his defensive value.

As far as I can tell, there isn't much of an effect. You can stop reading here if you care whether reading material is interesting, but I'll detail my methodology below so that those interested can know what I mean when I say I didn't find an effect.

METHODOLOGY

First, I compiled as best I could a list of sun fields for all open air stadiums in the Retrosheet era (1950-2012 for my current database). Sun fields were estimated based on diagrams and images from Seamheads ballpark database, Ballparks.com, and AndrewClem.com. I was able to corroborate or correct some parks by looking around the web for written mentions of sun fields or photographs showing shadows during a game.

This is actually trickier than it sounds--you can get a decent idea of where the sun should set from maps or diagrams that include stadium orientation, but the sun's position also depends on the stadium's latitude and changes based on the time of day and time of year (which also means you can get conflicting results from photographs depending on when they were taken--see the shadows pointing to CF vs the shadows pointing to RF in Busch Stadium). Still, I did the best I could to identify a primary sun field, and while I doubt I came up with a perfect list, it should be good enough to detect an effect if there is one.

Once I had a list of sun fields for each stadium, I looked at putouts per ball in play for corner outfielders in each stadium. I then divided these into day and night games, so that I had average number of putouts per ball in play for left and right fielders in day and night games for each stadium. Using these figures, I checked the difference in PO/BIP between day and night games. Parks with roofs (retractable or not) and parks with CF sun fields were ignored.

From there, I checked how much the average PO/BIP went up or down for fielders playing the sun field. If playing the sun field impairs the fielder, then they should see a drop in performance from night to day games. However, it is also possible that playing the outfield is generally easier or harder in day games, so I also checked the change in putout rate for the opposite corner outfield position to use as a control group. Rather than compare the sun field's day game performance to its night game performance, I compared the change from night to day for the sun field to the change for the non-sun-field.

For example, in 2012 Busch Stadium, left fielders recorded putouts on 6.13% of balls in play during night games, and 5.94% during day games. Right fielders were 6.62% for night games and 7.49% for day games. That means that left fielders dropped their PO/BIP by .0019 in day games, while right fielders raised their PO/BIP by .0087. I have right field as Busch's primary sun field, so the sun field was associated with a gain of .0106 putouts per ball in play over the control group in day games.
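The Busch Stadium comparison above can be written out as a short R sketch. The variable names are illustrative, and the rates are the rounded percentages quoted in the text, so the result is only as precise as those.

```r
# 2012 Busch Stadium putouts per ball in play, from the figures above
night <- c(LF = 0.0613, RF = 0.0662)
day   <- c(LF = 0.0594, RF = 0.0749)

# change from night to day games at each corner outfield position
change <- day - night # LF about -0.0019, RF about +0.0087

sun_field <- "RF" # primary sun field assigned to Busch in this study
control   <- "LF" # opposite corner, used as the control group

# sun field's night-to-day change relative to the control's change
effect <- change[[sun_field]] - change[[control]] # about 0.0106
effect
```

A negative value would be the sign of a sun field hurting the fielder; here it comes out positive.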

Doing this for every season in every stadium included in the study, I got an average* of 0.0002 gain in PO/BIP for the sun field over the non-sun-field, which is practically zero and very slightly in the wrong direction to indicate an effect.

*The average was a weighted mean, with the weight given to each stadium-season being the harmonic mean of day BIP and night BIP. For example, 2012 Busch Stadium had 2872 night BIP and 1549 day BIP, which is a harmonic mean of 2012.5.
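As a rough sketch of that weighting in R: the Busch Stadium numbers below come from the text, while the second stadium-season is made up purely to show the mechanics and is not real data. The helper name harmonic_mean is also just an illustrative choice.

```r
# harmonic mean of two sample sizes (hypothetical helper name)
harmonic_mean <- function(a, b) 2 * a * b / (a + b)

busch_weight <- harmonic_mean(2872, 1549) # about 2012.5
busch_weight

# weighted mean across stadium-seasons; the second entry in each vector
# is INVENTED for illustration only
effects <- c(0.0106, -0.0040)
weights <- c(busch_weight, harmonic_mean(3000, 1000))
weighted.mean(effects, weights)
```

The harmonic mean leans toward the smaller of the two samples, so a stadium-season with very few day games gets little weight no matter how many night games it had.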

Since I was concerned that poor data on which field was the sun field may have masked any potential effect, or that only some parks might have a bad sun field, I checked to see if individual stadiums displayed any effect. If that were the case, it should still show up in the overall data as a diminished but still visible effect, but it was worth checking. Individual stadiums did vary from zero effect, but not any more than they would by random chance. When splitting stadium-seasons into even and odd numbered years, there was no correlation between the observed effect for a stadium in even years versus the same stadium in odd years.

Finally, I checked the same thing for individual fielders, to see if there was any evidence that particular fielders had notable trouble with the sun field that would show up in PO/BIP. The result was the same as the test for individual parks--fielders varied from zero effect but no more than expected by chance, and the even-odd season correlation for fielders was 0.

This does not necessarily mean that the sun does not affect fielders--I assume that when the ball is actually in the sun, it adds a great deal of difficulty. It is likely that this does not happen often enough to significantly alter a fielder's defensive numbers, though. At the very least, it appears that finding an effect would require much more precise data. For example, you could probably find something by using the sun's position at the time of the play and the trajectory of the batted ball to identify specific plays that are likely affected. Even if this data were available, however, it would be impossible to use it to evaluate Ruth specifically, and the overall effect I saw indicates that there is likely no need to adjust his defense valuation down simply because he rarely played the sun field.

Did Adrian Peterson really Outgain Eric Dickerson?

2013-01-24T16:11:00.001-07:00

A couple years ago, I wrote about how rounding errors affect yardage gains in football. The general rule was that, assuming the rounding error on each play is independent, the total rounding error follows a normal distribution with parameters mean = 0 and SD = sqrt(number of plays/12).

I began thinking about this again for two reasons. One, Adrian Peterson just came within 9 yards of Eric Dickerson's season rushing record. With 348 rushes for Peterson and 379 for Dickerson, that comes out to a standard deviation for the combined rounding errors of 7.8 yards, and about a 12% chance that the 9 yard difference is entirely due to rounding errors.
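Those two figures can be reproduced with a couple of lines of R. This is a sketch under the independence assumption described above; the variable names are illustrative, and the probability is taken as the upper tail of the combined error distribution (the error being large enough in one direction to cover the whole 9 yards).

```r
# combined rounding error for Peterson's and Dickerson's carries,
# assuming each play's rounding error is independent
plays <- 348 + 379
sd_total <- sqrt(plays / 12) # about 7.8 yards

# chance the combined error is at least 9 yards in one direction
p_cover <- pnorm(9, mean = 0, sd = sd_total, lower.tail = FALSE) # about 0.12
c(sd_total = sd_total, p_cover = p_cover)
```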

The other reason is that Brian Burke pointed out in the comments of the original article that the rounding errors of plays in the NFL are not independent. The total yardage gain for each drive has to round off to the correct figure. From Brian's comment:

"One other way to state this is that if a team has 2 plays in a row, and one goes for 4.5 yards but is scored as 4, and the next goes for 5.5 yds, it can't be scored as 5. It must be scored as a 6 yd gain because the ball is very clearly 10 yds further down field, not 9."

I wanted to try to account for this constraint and see how much difference it would make.

Note: the following is mostly dry and math-related, so if you want to skip it, I estimate the chance of rounding errors covering the 9 yard difference between Dickerson and Peterson at about 14%.


THE EMPTY SET: Reflecting on Cooperstown’s Lost Year

2013-01-15T03:10:00.001-07:00

A sea of people stretched across the field and masked the green grass with Cardinal red. There was Bob Feller mingling across the fence beside the stage. There was Frank Robinson. There was Stan Musial. Somewhere, on our side of the fence, was Tug McGraw.

We were all there for Ozzie. There were a few scattered Phillie fans there for Harry Kalas, that year’s Frick Award recipient, if you looked carefully for the different insignias on their caps. Every here and there you'd see a maroon Mike Schmidt throwback. Other than that, it was just thousands of red-clad fans fixated on the wizard of a shortstop standing at the podium before us.

"This is awesome." It was the first my dad, uncle, brother, and I had seen of Induction Weekend. "We've got to come back in five years."

Five years is, of course, the waiting period for retired players before they become eligible for the Hall of Fame. Three of my generation's great players had just retired. And one was another beloved Cardinal.

**********

The BBWAA announced the results of their Hall of Fame balloting last Wednesday. No one got in. Barry Bonds didn't get in. Roger Clemens didn't get in. Not Biggio, not Bagwell. Not Jack Morris. Not Piazza, Trammell, Raines, Schilling, Martinez, Walker (either one), or Lofton. Not McGwire or Sosa or Palmeiro. Not even Shawn Green.

Someone will get in. In 1996, the last year no one met the 75% threshold, there were six players on the ballot (Niekro, Perez, Sutton, Santo, Rice, and Sutter) who would get in eventually. That's how it always is; every ballot has several candidates who will get in someday.

Biggio will get in. Every player who has ever gotten Biggio's level of support early in his candidacy has had no trouble getting elected sooner rather than later. Bagwell is at that high early level of support where almost everyone gets in eventually. Piazza even more so.

Jack Morris will probably get in as a Veterans Committee selection someday. Schilling will probably get in someday. Eventually, as the electorate gets a bit younger, Tim Raines will probably find the remaining votes he needs to get in, barring a complete disaster with the current and upcoming logjam that might never clear up before he falls off the ballot.

Maybe they won't all get in. But some of them will, and maybe some of the others as well. Trammell is the type of guy who could finally get his due when the Hall puts together a VC for his era. Edgar Martinez could pick up some support as the voters begin to accept that the DH is now part of the game. The voters, or the Hall, might someday come around on Bonds and Clemens.

Someone is going to get in. Definitely Biggio. Very likely Jack Morris. They're just going to have to wait. So too will Cooperstown, which swells up with tens of thousands of tourists (and their wallets) every July except this one.


On Miguel Cabrera, Value, and the Triple Crown

2012-12-16T13:57:00.000-07:00

“In ’67, the triple crown was never even mentioned once. We were so involved in the pennant race, I didn’t know I won the triple crown until the next day, when I read it in the paper.”

-Carl Yastrzemski to the Boston Herald, published September 26, 2012

“Is it too early to say that [Cabrera] has a legitimate shot at a Triple Crown this season hitting in front of Fielder? I don't think so.”

-Fox News sports article, published April 13, 2012

The Triple Crown has grown in stature over the years. That’s not to say it wasn’t a big deal before, but reporters now are asking Carl Yastrzemski about someone else winning it faster than they ever asked him about winning it himself. In 1942, when Ted Williams won it, no one even had a list of previous winners compiled. An AP reporter had to research it for his story on Williams’ feat, and he still missed the most recent occurrence (Joe Medwick, whose Triple Crown just five years earlier escaped detection).

Back then, it was a cool thing. It wasn’t necessarily the historic thing it’s become. It didn’t yet carry the mythical ethos of the pantheon-dwellers -- Williams, Mantle, Yaz, Frank Robinson, etc -- who could once do what for so long escaped their modern counterparts. When someone won it, it didn’t carry the weight of a whole generation of fans who grew up hearing about it and never seeing it. It was just a cool thing.

I can see getting excited about it. It’s an impressive feat. It’s something we’ve waited for for a long time. It's something only a handful of the greats have even done.

And yet, I have a hard time getting excited. It was a great season, sure. A wonderful season at the plate. But the best season I’ve ever seen? Not close. Which means I’ve seen a lot of non-Triple-Crown seasons that were better, because this is the first Triple Crown of my lifetime. You don’t even have to look that hard to find a better season. There’s another one right in front of our noses.

I’m talking, of course, about Miguel Cabrera’s 2011 season.

I know that seems, at least on the surface, like a bit of a contrarian statement. How could he have been better when he hit 14 fewer home runs and drove in 36 fewer runs and didn’t, I don’t know, win the first Triple Crown in four and a half decades? I don’t mean it as a contrarian viewpoint, though. I just think Cabrera hit better in 2011 than in 2012.

Let me explain myself. First, we need to establish what we mean by “better”.

I grew up with a fairly traditional baseball upbringing. I was the son of a catcher who was the son of a catcher, saved only from the tools of ignorance myself by a bad case of sinistrality (a condition my dad only fully forgave me for when my younger sister took up softball and inherited his old gear). I learned the game from proud field generals who would rather hold their ground to a hard-charging runner than hit a home run, even if they dropped the ball in the process.

That’s not a bad way to learn the game. It was a great way to learn it. But part of that upbringing was growing up thinking that Rickey Henderson was Lou Brock-Lite, and that Ted Sizemore was the ideal #2 hitter, and that Tony Gwynn was the best hitter in the game. Part of that was drafting Ozzie Smith for my first fantasy league in a three-team-deep league.

It’s not that those things are necessarily wrong. I don’t remember or care what happened in that fantasy league, other than that I remember drafting my favourite player. I don’t remember or care how many runs the Padres scored with Tony Gwynn anchoring their lineup, or how many games they won. I remember that watching Tony Gwynn was unlike watching anyone else in baseball, because you felt like you knew you were going to see something happen. He was going to put the ball in play, and the defense was going to scramble to field it. When Tony won, it felt like he won because he could almost place the ball at the spot where it landed. When the defense won, it felt like they got away with one. It was exciting to someone who learned the game the way I did.

As far as baseball is a game of entertainment, maybe Tony Gwynn was the best hitter in the game. Arguing for Tony Gwynn over Frank Thomas, or Barry Bonds, or Fred McGriff, or a handful of other guys as a hitter, though, isn’t really an argument of value or production. It’s an argument of what “best” means to begin with. He was better at some things, yeah. Maybe better at the things that are most important to you. At some point, though, it started to hit me that, whatever abstract ideals I might hold about what a hitter should be, the very concrete objective of all hitters is the same. They hit as best they can to win games, and they do so by helping to score runs.

That’s something that’s hard to measure when your statistical upbringing comes mostly from Topps and Donruss. How many runs is Gwynn’s AVG worth? How many runs are Thomas’ walks and extra base hits worth? I don’t know. It doesn’t say on the back of the card. We all know when we watch a game that getting on base is important, that making outs is bad, and that getting to second or third is better than getting to first. How much better? I don’t know. And so the argument becomes about what best actually means, because the units of measurement are not helpful.

Clutch, WPA/LI, and the Home Run Bias

2012-06-19T09:39:00.000-07:00

The stat Clutch, as published on FanGraphs and Baseball-Reference, is designed to quantify how much better or worse a hitter has produced in situations based on how critical those situations are in the immediate context of a game. Players who perform better in more critical situations (for example, late in a close game) than they normally do will have a positive Clutch rating, and players who perform worse in such situations will have a negative Clutch. It does this by comparing two values for a hitter: his WPA and his WPA.LI. I will assume you are familiar enough with these two stats (not necessarily their inner workings, but at least what they are) as a prerequisite for this piece; if not, you can catch up on B-R's or FanGraphs' explanation pages.

WPA.LI follows two key constraints. The first is that, for a given game state (i.e. the inning, the score, the number of outs, and the placement of any runners on base), the relative value of a play is determined by how much that play affects the team's chances of winning. If the bases are empty, a walk is credited the same as a single. If the bases are loaded with the winning run on third, a walk is credited the same as a home run. This constraint works exactly like WPA (as one might expect from a WPA-based metric).

The second constraint differentiates WPA.LI from WPA. One of the properties of WPA is that some situations are inherently weighted more strongly than others. A key at bat late in a close game can swing a team's chances of winning by several times as much as the same result in a blowout, and it is credited accordingly. WPA.LI, on the other hand, ensures that the average play in every situation gets the same weight.

So, on the one hand, you have WPA, which weights PAs according to their immediate impact on the game. One clutch PA might be worth as much as 4 or 5 normal PAs, and one mop-up PA might be worth practically nothing. On the other hand, you have WPA.LI, which weights every PA equally, just like most other stats do. Basically, it is linear weights, but with the ability to tailor the value of each event to the specific situation rather than sticking to a blanket value for each event across all situations. While WPA tells the story of clutch hitting (who got the big hit when the team most needed production), WPA.LI tells the story of situational hitting (who got on base when the team needed baserunners, put the ball in play when the strikeout was most costly, or hit for power when advancing runners quickly was more important than getting another guy on first).

There is a third important constraint which WPA.LI does not adhere to, however. Ideally, the average value of each event would match its linear weights value. If a home run is worth 1.4 runs above average across all situations, then you would like the average WPA.LI value of a HR to be 1.4 runs (or rather, the equivalent value on the wins scale). That is not the case, however.

The following linear weights values represent the average change in run and win expectancy for that event across all situations, along with the average WPA.LI value of each event. All three versions have been placed on the runs scale by setting the value of the out at -.27 in order to make them easier to compare directly:

	RE	WPA	WPA.LI
1B	0.47	0.47	0.44
2B	0.77	0.75	0.75
3B	1.05	1.06	1.04
HR	1.41	1.42	1.58
BB	0.31	0.30	0.31
K	-0.29	-0.30	-0.29
out	-0.27	-0.27	-0.27

As you can see, WPA.LI does fine at assigning the correct value to most events, but the value of the HR is way off. This may seem counterintuitive; if WPA.LI just creates custom linear weights for each situation based on the WPA values, why would the average WPA.LI value be different from the average WPA value? We can look at the mathematical relationship between WPA and WPA.LI to see why this is.

The Pujols Decision: One Fan's Reflections

2012-06-09T03:20:00.001-07:00

Stan Musial is the man in St. Louis. Nearly 50 years after Musial last played for the Cardinals, he remains the undisputed king of Cardinal baseball. His statue alone stands tall outside the main entrance to Busch Stadium, a few hundred feet south of the plaza where all the lesser (albeit much more attractive) statues of other Cardinal greats sit. For decades, no one in St. Louis thought they would ever see a player rival Musial.

And then Albert came along. Just one year and one Bobby Bonilla injury removed from his professional debut as a 13th round draft pick, Pujols was in the starting lineup and lighting up the National League. He hit for average. He hit for power. He got on base. He eventually learned to play a very good first base. For the first time, St. Louis fans saw a player and thought, "this could be the guy who tops Musial."

The accolades came. The MVPs (three of them, same as Musial), the All Star appearances, the Silver Sluggers and Gold Gloves, the home runs and hits and RBIs; all of them flocked to Pujols' Baseball-Reference page like moths to Matt Holliday's ear.

The wins followed. Led by Pujols' success, the team made the playoffs 7 out of 11 seasons, winning 3 pennants and 2 World Series along the way. From 2001-2011, only the high-spending Yankees and Red Sox won more games than did Pujols' Cardinals. Pujols was the best player in the game, a superstar of an order the franchise had not seen in decades. Fans watched in awe and wondered how high his career would stack by the time it ended.

Pujols was, over his 11 years with St. Louis, remarkably similar to Musial when Musial was at his best. Compare Pujols’ career in St. Louis to Musial’s best 11 year stretch (1943-54):

	PA	H	RBI	HR	BB	R
Musial (1943-54)	7564	2251	1174	281	990	1301
Pujols (2001-11)	7433	2073	1329	445	975	1291

	AVG	OBP	SLG	wRC+	brWAR	fWAR
Musial (1943-54)	.346	.434	.591	171	88	98
Pujols (2001-11)	.328	.420	.617	167	84	88

In both traditional counting totals and more sabermetric evaluations, the two come up as near equals. Musial got on base a bit better (in an environment where hitters got on base more than they do today) while Pujols hit for more power (in an environment where hitters hit for more power than they did in Musial’s day). The two were comparable fielders, good for their position, but at the weak end of the fielding spectrum.

Musial rates slightly better in both Baseball-Reference’s and FanGraphs’ implementations of WAR, but they are close enough that which one you would pick will largely depend on how you approach the different eras (i.e. how you want to adjust for things like integration, expansion, population growth, international development, improved scouting, the war years, etc). They’re close enough that it would be reasonable to take the position that no Cardinal fan has ever seen one of their own play at a higher level than Pujols has over his 11 years with the team, not even Musial. It’s not a slam-dunk position; maybe you still take Musial. But, for the first time since Musial retired, you’d probably at least have to think about it.

Watching Pujols play ignited Cardinal fans like watching Musial did, and we loved every minute of it. Naturally, we wanted that to continue. We wanted another all-time great to stay a career Cardinal. Then, out of nowhere, the report swept in from the winter meetings that Pujols had signed with the Angels. No build up, nothing. No one had even talked about the Angels in the weeks of negotiating that led off the offseason. Just like that, he was gone.

Win Expectancy and Leverage Index tables, R Code

2012-03-15T02:42:00.003-07:00

This post is just a quick dump of some code you can use to create win-expectancy and leverage index tables like what I used for my recent Baseball PreGUESTus article. It is written for the free statistical program R, and it builds upon the excellent work on run-expectancy and run distribution tables done by Sobchak at ChancesIs.com.

In order to run this code, you will need R with the package plyr installed. You will also need the file bo_transitions.csv from ChancesIs (either the CSV file hosted on that site, or one created using a similar query to the one Sobchak published) and the file game_state_frequency.csv, which you can copy from this table. Sobchak's data and the game_state_frequency table are from the years 1993-2010. You can collect the data for other years by altering Sobchak's SQL query and this game_state_frequency query.

*note-you only need game_state_frequency.csv for calculating LI. You don't need it if all you want is a WE table.

Once you have those files on your computer, you can construct a win-expectancy table with the following R code:

Win Expectancy Table, R code

You will have to change the line

setwd("/Users/Seshoumaru/Desktop/untitled folder/baseball/run-win expectancy")

to the folder path where you saved the necessary CSV files.

The win expectancy values are generated based on Sobchak's simulated run distributions. It is currently set to run 100,000 simulated innings from each state to estimate the distributions. You can raise the number of simulations to increase the precision, but it will take longer to process. On my computer, 100,000 simulations took about 4 minutes to run. 1,000,000 simulations took about an hour. The win expectancies themselves are not simulated, however.

The code limits run scoring to 16 runs for the remainder of the inning you are in, plus 16 runs total for the rest of the game. This is done to greatly reduce processing time. The generated tables cover scores from the home team being down 16 to up 16 (all score differentials are from the perspective of the home team).

The above code assumes equal run distributions for both teams. With a few changes, you can alter the code to include home-field advantage by using separate distributions for the home and away teams. To do this, you will need to alter Sobchak's query to create additional bo_transition files for just the home team and just the away team (called bo_transitions_home.csv and bo_transitions_away.csv). Once you have added those files, you can run the following code:

Win Expectancy Table, HFA version, R code

Regression to the Mean and Beta Distributions

2011-08-17T08:00:00.009-07:00

This morning, a discussion of regression to the mean popped up on Phil's and Tango's blogs. This discussion touches upon some of the recent work I've been doing with Beta distributions, so I figured I'd go ahead and lay out the math linking regression to the mean with Bayesian probability with a Beta prior.

Many of the events we measure in baseball are Bernoulli trials, meaning we simply record whether they happen or not for each opportunity. For example, whether or not a team wins a game, or whether or not a batter gets on base are Bernoulli trials. When we observe these events over a period of time, the results follow a binomial distribution.

When we observe these binomial events, each team or player has a certain amount of skill in producing successes. Said skill level will vary from team to team or player to player, and, as a result, we will observe different results from different teams or players. Albert Pujols, for example, has a high degree of skill at getting on base compared to the whole population of MLB hitters, and we would expect to observe him getting on base more often than, say, Emilio Bonifacio.

The variance in talent levels is not the only thing driving the variance in observed results, however. As with any binomial process (excepting those with 0% or 100% probabilities, anyway), there is also random variance as described by the binomial distribution. Even if Albert's on-base skill is roughly 40%, and Bonifacio's is roughly 33%, it is still possible that you will occasionally observe Emilio to have a higher OBP than Albert over a given period of time.

In baseball, it is a practical problem that we do not know the true probability linked to each team's or player's skill, only their observed rate of success. Thus, if we want to know the true talent probability, we have to estimate it from the observed.

One way to do this is with regression to the mean. Say that we have a player with a .400 observed OBP over 500 PAs, and we want to estimate his true talent OBP. Regression to the mean says we need to find out how much, on average, our observed sample will reflect the hitter's true talent OBP, and how much it will reflect random binomial variation. Then, that will tell us how many PAs of the league average we need to add to the observed performance to estimate the hitter's true talent.

For example, say we decide that the number of league average PAs we need to add to regress a 500 PA sample of OBP is 250. We would take the observed performance (200 times on base in 500 PAs), and add 82.5 times on base in 250 PAs (i.e. the league average performance, assuming league average is about .330) to that.

$$\frac{200 + 82.5}{500 + 250} = \frac{282.5}{750} = .377$$

Therefore, regression to the mean would estimate the hitter's true OBP talent at .377.
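
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not code from the original post; the function name `regress_to_mean` and its parameter names are my own placeholders, and the 250 PA constant and .330 league average are the assumed values from the example.

```python
def regress_to_mean(successes, trials, league_mean, k):
    """Add k trials of league-average performance to the observed sample.

    k is the regression constant (assumed here to be 250 PAs for OBP).
    """
    return (successes + league_mean * k) / (trials + k)


# The .400 OBP hitter from the example: 200 times on base in 500 PAs.
estimate = regress_to_mean(successes=200, trials=500, league_mean=0.330, k=250)
print(round(estimate, 3))
```

The same call works for any sample size, since only `successes` and `trials` change while `k` stays fixed.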

As Phil demonstrated, once you decide that you have to add 250 PAs of league average performance to your sample to regress, you would use that same 250 PA figure to regress any OBP performance, regardless of how many PAs are in the observed sample. Whether you have 10 observed PAs or 1000 observed PAs, the amount of average performance you have to add to regress does not change.

Now, how would one go about finding that 250 PA figure? One way is to figure out the number of PAs at which the random binomial variance is equal to the variance of true talent in the population.

Start by taking the observed variance in the population. You would look at all hitters over a certain number of PAs (say, 500, for example), and you might observe that the variance in their observed OBPs is about .00132, with the average about .330. The observed variance is equal to the sum of the random binomial variance and the variance of true OBP talent across the population of hitters. We don't know the variance of true talent, but we can calculate the random binomial variance as p(1-p)/n, where p is the probability of getting on base (.330 for our observed population) and n is the observed number of PAs (500 in this case). For this example, that would be about .00044. Therefore, the variance of true talent in the population is approximately:

.00132 - .00044 = .00088

Next, we find the number of PAs where the random binomial variance will equal the variance of true talent:

p*(1-p)/n = true_var

.330*(1-.330)/n = .00088

n = .330*(1-.330)/.00088 ≈ 250
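
As a rough check on these steps, here is a minimal Python sketch. The function name `regression_constant` is hypothetical, and the inputs are the example figures quoted above (observed variance .00132, mean .330, 500 PAs). Because it does not round the intermediate variances, it lands slightly above the rounded 250 figure.

```python
def regression_constant(observed_var, mean, trials):
    """Estimate the number of trials at which binomial variance equals talent variance."""
    # Random binomial variance for a sample of this size at the population mean.
    binomial_var = mean * (1 - mean) / trials
    # Whatever observed variance is left over is attributed to true talent.
    true_var = observed_var - binomial_var
    # Solve mean * (1 - mean) / n = true_var for n.
    return mean * (1 - mean) / true_var


k = regression_constant(observed_var=0.00132, mean=0.330, trials=500)
print(round(k, 1))  # close to the ~250 PAs used in the example
```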

We can also approach the problem of estimating true talent from observed performance using Bayesian probability. In order to use Bayes, we need to make an assumption about the distribution of true talent in the population the hitter is being drawn from (i.e. the prior distribution). We will assume that true talent follows a Beta distribution.

Return now to our .400 observed OBP example. Bayes says the posterior distribution (i.e. the distribution of possible true talents for a hitter drawn from the prior distribution after observing his performance) is proportional to the product of the prior distribution and the likelihood function (i.e. the binomial distribution, which gives the likelihood of the observed performance for each possible true-talent OBP).

The prior Beta distribution is:

$$\frac{x^{\alpha-1}\,(1-x)^{\beta-1}}{B(\alpha,\beta)}$$

where B(α,β) is a constant equal to the Beta function with parameters α and β.

The binomial likelihood for observing s successes in n trials (i.e. the observed on-base performance) is:

$$\frac{n!}{s!\,(n-s)!}\; x^{s}\,(1-x)^{n-s}$$

where x is the true probability of a success.

Next, we multiply the prior distribution by the likelihood distribution:

$$\frac{x^{\alpha-1}\,(1-x)^{\beta-1}}{B(\alpha,\beta)} \cdot \frac{n!}{s!\,(n-s)!}\; x^{s}\,(1-x)^{n-s}$$

combine the exponents for the x and (1-x) factors:

$$\frac{x^{\alpha+s-1}\,(1-x)^{\beta+n-s-1}}{B(\alpha,\beta)} \cdot \frac{n!}{s!\,(n-s)!}$$

Separating the constant factors from the variables:

$$\frac{n!}{s!\,(n-s)!\,B(\alpha,\beta)}\; x^{\alpha+s-1}\,(1-x)^{\beta+n-s-1}$$

This product is proportional to the posterior distribution, so the posterior distribution will be the above multiplied by some constant in order to scale it so that the cumulative probability equals one. Since the left portion of the above expression is already a constant, we can simply absorb that into the scaling constant, and the final posterior distribution then becomes:

C * x^(α + s - 1) * (1-x)^(β + n - s - 1)

Notice that the above distribution conforms to a new Beta distribution with parameters α+s and β+n-s, and with a constant C = 1/B(α+s,β+n-s). When the prior distribution is a Beta distribution with parameters α and β and the likelihood function is binomial, then the posterior distribution will also be a Beta distribution, and it will have the parameters α+s and β+n-s.
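
One way to convince yourself of this without the algebra is to multiply the prior by the likelihood on a grid, rescale numerically, and compare against the closed-form Beta. The following is a rough Python sketch using only the standard library. All function names are my own, the grid size is arbitrary, and the prior parameters (82.5 and 167.5) are simply illustrative inputs.

```python
import math


def beta_log_pdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x."""
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta_fn


def binomial_log_likelihood(x, s, n):
    """Log probability of s successes in n trials when the true rate is x."""
    log_choose = math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
    return log_choose + s * math.log(x) + (n - s) * math.log(1 - x)


def numeric_posterior(xs, step, a, b, s, n):
    """Prior times likelihood, rescaled numerically so it integrates to one."""
    logs = [beta_log_pdf(x, a, b) + binomial_log_likelihood(x, s, n) for x in xs]
    peak = max(logs)  # subtract the peak before exponentiating to avoid underflow
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights) * step
    return [w / total for w in weights]


cells = 20000
step = 1.0 / cells
xs = [(i + 0.5) * step for i in range(cells)]  # midpoints of each grid cell

alpha, beta, s, n = 82.5, 167.5, 200, 500
numeric = numeric_posterior(xs, step, alpha, beta, s, n)
closed_form = [math.exp(beta_log_pdf(x, alpha + s, beta + n - s)) for x in xs]

# Largest pointwise gap between the numeric posterior and Beta(alpha+s, beta+n-s).
max_gap = max(abs(u - v) for u, v in zip(numeric, closed_form))
print(max_gap)
```

The gap should be negligible, which is just the numerical face of the statement that the constant factors get absorbed into the scaling constant.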

We still need to choose values for the parameters α and β for the prior distribution. Recall from the regression example that we found a mean of .330 and a variance of .00088 for the true talent in the population (i.e. the prior distribution), so we will choose values for α and β that give us those values. For a Beta distribution, the mean is equal to:

α/(α+β)

and the variance is equal to:

$$\frac{\alpha\beta}{(\alpha+\beta)^2\,(\alpha+\beta+1)}$$

A bit of algebra gives us values for α and β of approximately 82.5 and 167.5 respectively. That means the posterior distribution will have as parameters:

α+s = 82.5 + 200 = 282.5
β+n-s = 167.5 + 500 - 200 = 467.5

and a mean of

$$\frac{282.5}{282.5 + 467.5} = \frac{282.5}{750} = .377$$

As you can see, this is identical to the regression estimate. This will always be the case as long as the prior distribution is Beta and the likelihood is binomial. We can see why if we derive the regression constant (the number of PAs of league average we need to add to the observed performance in order to regress) from the prior distribution.
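
The "bit of algebra" for α and β can be sketched as follows. Substituting p = α/(α+β) into the variance formula gives variance = p(1-p)/(α+β+1), which can be solved for α+β. This is an illustrative Python sketch with a made-up function name, using the .330 mean and .00088 variance from the example; the unrounded parameters come out very slightly different from 82.5 and 167.5.

```python
def beta_params_from_moments(mean, var):
    """Solve for Beta parameters that match a given mean and variance."""
    # var = mean * (1 - mean) / (alpha + beta + 1)  =>  solve for alpha + beta
    total = mean * (1 - mean) / var - 1
    return mean * total, (1 - mean) * total


alpha, beta = beta_params_from_moments(mean=0.330, var=0.00088)

# Conjugate update with the observed 200 times on base in 500 PAs.
s, n = 200, 500
posterior_mean = (alpha + s) / (alpha + beta + n)

print(round(alpha, 1), round(beta, 1), round(posterior_mean, 3))
```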

Recall that the regression constant can be found by finding the point where random binomial variance equals prior distribution variance. Therefore:

p(1-p)/k ≈ prior variance

where k is the regression constant and p is the population mean.

p(1-p)/k ≈ αβ / ( (α + β)^2(α + β + 1) ) ; p ≈ α/(α+β)

$$\begin{aligned}
\frac{1}{k}\cdot\frac{\alpha}{\alpha+\beta}\left(1-\frac{\alpha}{\alpha+\beta}\right) &\approx \frac{\alpha\beta}{(\alpha+\beta)^2\,(\alpha+\beta+1)} \\
\frac{\alpha}{\alpha+\beta}-\frac{\alpha^2}{(\alpha+\beta)^2} &\approx k\cdot\frac{\alpha\beta}{(\alpha+\beta)^2\,(\alpha+\beta+1)} \\
\frac{\alpha(\alpha+\beta)-\alpha^2}{(\alpha+\beta)^2} &\approx k\cdot\frac{\alpha\beta}{(\alpha+\beta)^2\,(\alpha+\beta+1)} \\
\alpha(\alpha+\beta)-\alpha^2 &\approx k\cdot\frac{\alpha\beta}{\alpha+\beta+1} \\
\alpha^2+\alpha\beta-\alpha^2 &\approx k\cdot\frac{\alpha\beta}{\alpha+\beta+1} \\
\alpha\beta &\approx k\cdot\frac{\alpha\beta}{\alpha+\beta+1} \\
1 &\approx \frac{k}{\alpha+\beta+1} \\
k &\approx \alpha+\beta+1
\end{aligned}$$

Since α and β for the prior in our example are 82.5 and 167.5, k would be 82.5 + 167.5 + 1 = 251.

This estimate of k is actually biased, because it assumes a random binomial variance based only on the population mean, whereas the actual random binomial variance for the prior distribution will be the average binomial variance over the entire distribution. In other words, not all of the population will have a .330 OBP skill; some hitters will have a .300 skill, while others will have a .400 skill, and they will all have different binomial variances associated with them. More precisely, the random binomial variation for the prior distribution will be the following definite integral taken from 0 to 1:

$$\int_0^1 \frac{x(1-x)}{k}\, B(x;\alpha,\beta)\, dx$$

which, conceptually, is the weighted sum of the binomial variances for each possible value from the prior distribution, where each binomial variance is weighted by the probability density function of the prior.

$$\frac{1}{k\,B(\alpha,\beta)} \int_0^1 x(1-x)\, x^{\alpha-1}\,(1-x)^{\beta-1}\, dx$$

$$\frac{1}{k\,B(\alpha,\beta)} \int_0^1 x^{\alpha}\,(1-x)^{\beta}\, dx$$

The definite integral is in the form of the Beta function B(α+1,β+1), so we can rewrite this as

$$\frac{B(\alpha+1,\beta+1)}{k\,B(\alpha,\beta)}$$

The Beta function is interchangeable with the Gamma Function in the following manner:

B(α,β) = Γ(α)*Γ(β) / Γ(α+β)

replacing the two Beta functions with their Gamma equivalencies:

$$\frac{\Gamma(\alpha+1)\,\Gamma(\beta+1)\,\Gamma(\alpha+\beta)}{k\,\Gamma(\alpha)\,\Gamma(\beta)\,\Gamma(\alpha+\beta+2)}$$

This revision is useful because the Gamma function has a property where Γ(x+1)/Γ(x) = x, so the above reduces to:

$$\frac{\alpha\beta\,\Gamma(\alpha+\beta)}{k\,\Gamma(\alpha+\beta+2)}$$

Furthermore, since Γ(x+1)/Γ(x) = x, it follows that Γ(x+2)/Γ(x+1) = x+1. If we multiply those two equations together, we find that

$$\frac{\Gamma(x+1)}{\Gamma(x)} \cdot \frac{\Gamma(x+2)}{\Gamma(x+1)} = x(x+1)$$

Γ(x+2)/Γ(x) = x(x+1)

Γ(x)/Γ(x+2) = 1/(x(x+1))

Therefore

$$\frac{\alpha\beta\,\Gamma(\alpha+\beta)}{k\,\Gamma(\alpha+\beta+2)} = \frac{\alpha\beta}{k\,(\alpha+\beta)\,(\alpha+\beta+1)}$$

Now that we have a manageable expression for the random binomial variance of the prior distribution, we return to the requirement that random binomial variance equals the variance of the prior distribution:

$$\frac{\alpha\beta}{k\,(\alpha+\beta)\,(\alpha+\beta+1)} = \frac{\alpha\beta}{(\alpha+\beta)^2\,(\alpha+\beta+1)}$$

k * (α+β) * (α+β+1) = (α+β)^2 * (α+β+1)

k = α+β

Using a more precise calculation for the random binomial variance of the prior, we find that k = α+β rather than α+β+1. Note that when we estimate k by assuming a constant binomial variance of p(1-p)/k, we get a value of k exactly 1 higher than when we run the full calculation for the binomial variance. This is useful because the former calculation is much simpler than the latter, so we can calculate k by using the former method and then subtracting 1. Also note that the 250 value we got in the initial regression to the mean example would also be 1 too high if we were using more precise figures; I've just been rounding them off for cleanliness' sake.
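
The off-by-one relationship can also be checked numerically. The sketch below is a rough Python illustration, not anything from the original analysis: it integrates x(1-x) against the Beta density with a simple midpoint rule and compares the resulting k to the shortcut value. The function names and grid size are my own choices, and the prior parameters are the rounded 82.5 and 167.5 from the example.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, computed via log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def average_x_times_one_minus_x(a, b, cells=20000):
    """Midpoint-rule estimate of the integral of x(1-x) * pdf(x) over (0, 1)."""
    step = 1.0 / cells
    total = 0.0
    for i in range(cells):
        x = (i + 0.5) * step
        total += x * (1 - x) * beta_pdf(x, a, b)
    return total * step


alpha, beta = 82.5, 167.5
prior_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

# Full calculation: average binomial variance over the whole prior.
k_exact = average_x_times_one_minus_x(alpha, beta) / prior_var

# Shortcut: binomial variance evaluated only at the population mean.
p = alpha / (alpha + beta)
k_shortcut = p * (1 - p) / prior_var

print(round(k_exact, 3), round(k_shortcut, 3))
```

The first value should come out at α+β and the second at α+β+1, matching the algebra above.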

Let's look now at the calculation for regression to the mean:

true talent estimate = (s+pk)/(n+k)

where s is the observed successes, n is the observed trials, p is the population mean, and k is the regression constant.

We know from our prior that p=α/(α+β) and k=α+β, so

$$\frac{s+pk}{n+k} = \frac{s + (\alpha+\beta)\cdot\dfrac{\alpha}{\alpha+\beta}}{n+\alpha+\beta} = \frac{\alpha+s}{\alpha+\beta+n}$$

And what does Bayes say? Our posterior is a Beta with parameters α+s and β+n-s, which has a mean

$$\frac{\alpha+s}{(\alpha+s)+(\beta+n-s)} = \frac{\alpha+s}{\alpha+\beta+n}$$

So Bayes and regression to the mean produce identical talent estimates under these conditions (a binomial process where true talent follows a Beta distribution).

k is far easier to estimate directly (such as by using the method in the initial regression to the mean example) than α and β, so we would typically calculate α and β from k. To do that, we use the fact that p = α/(α+β), and that k=α+β, so by substitution we can easily find that:

α=kp
β=k(1-p)

where k is the regression constant and p is the population mean.

We can also see that the regression amount will be constant regardless of the number of observed PAs, because when we take our Bayesian talent estimate:

$$\frac{\alpha+s}{\alpha+\beta+n}$$

we see that we are always adding the quantity kp (as substituted for α) to the observed successes (s), and always adding the quantity k (as substituted for (α+β)) to the observed trials (n), no matter what observed values we have for s and n. The amounts we add to the observed successes and trials depend only on the parameters of the prior, which do not change.
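
To make the equivalence concrete, here is a small Python sketch that builds α and β from k and p and compares the two estimates for a few sample sizes. The function names and the extra sample sizes are hypothetical, chosen only for illustration; k = 250 and p = .330 are the example values from above.

```python
def regression_estimate(s, n, p, k):
    """Regression to the mean: add k trials of average performance."""
    return (s + p * k) / (n + k)


def bayes_posterior_mean(s, n, alpha, beta):
    """Mean of the Beta(alpha + s, beta + n - s) posterior."""
    return (alpha + s) / (alpha + beta + n)


p, k = 0.330, 250
alpha, beta = k * p, k * (1 - p)  # alpha = kp, beta = k(1 - p)

# (times on base, PAs) for a few made-up .400 OBP samples of different lengths.
samples = [(4, 10), (200, 500), (400, 1000)]
for s, n in samples:
    reg = regression_estimate(s, n, p, k)
    bayes = bayes_posterior_mean(s, n, alpha, beta)
    print(n, round(reg, 3), round(bayes, 3))
```

The two columns agree at every sample size, and the shorter the sample, the closer both sit to the .330 population mean.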

Poisson Processes in Sports

2011-07-20T17:44:00.003-07:00

In sports, the problem of relating a team's offensive and defensive production to its W-L record is closely related to the distribution of scoring events in the sport. For example, say you want to know how often a team that scores, on average, 4 times per game and allows 3 scores per game is expected to win. It is not enough to simply know that the team averages 4 scores and 3 scores allowed; you also have to have an idea of how likely the team is to score (or allow) 0 times, 1 time, 2 times, etc. If the nature of the sport provides for a very tight range of scores for each team (i.e. the 4-score team is very unlikely to score 0 or 1 time, or 7 or 8 times), then the team will win more often than if the sport sees a wider distribution of observed scores for each team.

Let's say, for example, that the team in this example scores and allows scores in the following distribution:

	score	allow
0	0.06	0.14
1	0.1	0.15
2	0.13	0.17
3	0.15	0.17
4	0.17	0.12
5	0.13	0.1
6	0.1	0.07
7	0.07	0.04
8	0.05	0.03
9	0.03	0.01
10	0.01	0

In the above table, the team would score 0 times 6% of the time and allow 0 scores 14% of the time.

To find the chances of the team winning, you first figure its chances of outscoring its opponents when it scores once. Since this team will allow fewer than one score 14% of the time, it would be expected to win 14% of the time it scores once. The team scores once 10% of the time, so one-score victories should account for .1*.14=1.4% of its games. Continuing for 2-score victories, the team allows fewer than 2 scores 29% of the time (14%+15%), so 2-score victories account for .13 2-score games * .29 wins per 2-score game = 3.8% of the team's total games.

Doing this for each possible number of scores, the team will win a total of 56% of its games. Repeating the same process for losses, it will lose 32% of the time (the other 12% of games will end tied).

As long as we know the probability of each possible number of scores and scores allowed, the expected W-L performance can be found in this way. In terms of summation notation, it looks something like this:

$$\sum_{i=0}^{\infty} p_s(i) \sum_{j=0}^{i-1} p_a(j)$$

where p_s(i) is the probability of scoring i number of times and p_a(j) is the probability of allowing j number of scores.
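
Applied to the example table above, the double sum can be sketched in Python like this. The function name `outcome_probabilities` is my own; the two lists are the score and allow columns copied from the table.

```python
# Probability of scoring / allowing 0, 1, 2, ..., 10 times, from the table above.
score = [0.06, 0.10, 0.13, 0.15, 0.17, 0.13, 0.10, 0.07, 0.05, 0.03, 0.01]
allow = [0.14, 0.15, 0.17, 0.17, 0.12, 0.10, 0.07, 0.04, 0.03, 0.01, 0.00]


def outcome_probabilities(p_score, p_allow):
    """Return (win, loss, tie) probabilities from two score distributions."""
    # Win: score i times while allowing fewer than i.
    win = sum(ps * sum(p_allow[:i]) for i, ps in enumerate(p_score))
    # Loss: allow j scores while scoring fewer than j.
    loss = sum(pa * sum(p_score[:j]) for j, pa in enumerate(p_allow))
    # Tie: score and allow the same number.
    tie = sum(ps * pa for ps, pa in zip(p_score, p_allow))
    return win, loss, tie


win, loss, tie = outcome_probabilities(score, allow)
print(round(win, 2), round(loss, 2), round(tie, 2))
print(round(win / (win + loss), 3))  # win% ignoring ties
```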

This is only useful if you have a reasonable model for finding these probabilities, however, which requires you to have some model for the distribution of possible scores around the team average. In baseball, no such distribution is obvious, so instead of the above process, we use shortcuts like PythagenPat to model the results of translating the underlying distribution of possible run-totals to an expected win percentage (by the way, the above example roughly resembles the actual distribution for a baseball team; traditional pythag would give you 4^2/(4^2+3^2) = 16/25 = .640 w%, while the example (ignoring ties) shows .56W/(.56W+.32L) = .636). Steven Miller showed that a Weibull distribution of runs gives a Pythagorean estimate of W%, and that the Weibull distribution is a reasonable assumption for his sample data (the 2004 American League), but that is just working backwards from the model in place.

Some sports, however, do present an obvious choice of model, namely the Poisson distribution. Both soccer and hockey are decent examples of Poisson processes because

-play happens over a predetermined length, measured in a continuous fashion (i.e. time, as opposed to something like baseball which is measured in discrete units of outs or innings)
-goals can only come one at a time (as opposed to something like basketball, where points can come in groups of 1, 2, 3, or 4)
-the number of goals scored over a given period of the game is largely independent of the number of goals scored over a separate period of the game (the fluid nature of possession is a key attribute here; for a sport like American football where a score dictates who has possession for a significant chunk of time, a team's score over one five-minute span will be somewhat dependent on whether it scored in the previous five-minute span, for example)
-the expectation for the number of goals over a period of time (once you know who is playing) depends mostly on the length of time

Hockey has at least one exception to the requirements of a Poisson process: because of empty-net goals, the number of goals scored at the end of the game is not always independent of the number scored earlier in the game, though I don't know how much of an issue this presents. Soccer is a more straightforward example (as well as a more homogeneous one, due to the relative lack of the substitutions and penalties that continually affect the scoring rate in hockey). Both, however, generally fit the mould for a Poisson process.

Using a Poisson distribution to fill out a table as in the above example (if you have Excel or a similar spreadsheet program, it should have a Poisson distribution function built in), we can then calculate expected W-L performances for a team. The first and second columns use the average number of goals for and against, respectively, as λ (in Excel, POISSON.DIST(x, avg goals for/against, FALSE), where x is 0, 1, 2, 3, ...). Say we do this for a soccer team that we expect to score an average of 2 goals per game and allow an average of 1 goal per game against its opponent. We get the following probabilities:

W: .606
L: .183
D: .212

Using the traditional soccer point-format (3 points for a win, 1 for a draw), this team would average about 2.03 points per game against its opponent.
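For anyone who would rather script this than build the spreadsheet, here is a minimal sketch of the same table in Python. The function names and the 25-goal cutoff are my own choices for illustration, and the model assumes the two teams' scores are independent Poisson variables, as described above.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return exp(-lam) * lam**k / factorial(k)

def regulation_outcomes(goals_for, goals_against, max_goals=25):
    """Return (win, draw, loss) probabilities for independent Poisson scores.

    max_goals truncates the infinite table; 25 is an arbitrary cutoff that
    leaves a negligible tail for soccer- or hockey-sized averages.
    """
    win = draw = loss = 0.0
    for i in range(max_goals + 1):          # goals we score
        p_for = poisson_pmf(i, goals_for)
        for j in range(max_goals + 1):      # goals we allow
            p = p_for * poisson_pmf(j, goals_against)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss

win, draw, loss = regulation_outcomes(2, 1)
points_per_game = 3 * win + 1 * draw        # 3 points for a win, 1 for a draw

# Should land on the .606 / .183 / .212 and 2.03 figures quoted above.
print(f"W: {win:.3f}  L: {loss:.3f}  D: {draw:.3f}")
print(f"points per game: {points_per_game:.2f}")
```

Each cell of the table is just the product of one entry from the goals-for column and one from the goals-against column; the loops add up the cells above, on, and below the diagonal.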

We can also use the Poisson distribution to figure out what to expect if the game goes to overtime. Elimination soccer matches typically have a 30-minute OT (two 15-minute periods), so the λ values for the OT (recall that these are the average goals for and against, 2 and 1 in this example) will be 1/3 of their regulation-match values (note that finding λ for regular-season hockey OTs will be more complicated because the 4v4 format will affect the scoring rate). Reconstructing the table with λ values of 2/3 and 1/3, we get the following results for games that go to OT:

OTW: .384
OTL: .161
OTD: .454

If overtime ends in a draw, the game is usually decided on PKs. If we assume that each team is 50/50 to win in PKs (which is not necessarily the case, but shootout odds should be closer to 50/50 than the rest of the match, and the odds in a shootout aren't necessarily based on expected goals for and against for the match), then our team's expected win% once a game goes to OT is .384 + .5*.454 = .611. Remember that the team wins 60.6% of the time in regulation, and the game goes to OT 21.2% of the time, so the team's total expected wins is .606 + .611*.212 = .735.
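The overtime arithmetic can be sketched the same way. This is only an illustration under the assumptions already stated in the text: scoring rates scale with the length of the period, and the shootout is a 50/50 proposition. The helper names and the 25-goal cutoff are mine.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return exp(-lam) * lam**k / factorial(k)

def outcomes(goals_for, goals_against, max_goals=25):
    """(win, draw, loss) probabilities for independent Poisson scores."""
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, goals_for) * poisson_pmf(j, goals_against)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss

# Regulation: 2 goals for, 1 against over 90 minutes.
reg_win, reg_draw, reg_loss = outcomes(2, 1)

# A 30-minute OT is 1/3 the length of regulation, so each lambda is scaled by 1/3.
ot_fraction = 30 / 90
ot_win, ot_draw, ot_loss = outcomes(2 * ot_fraction, 1 * ot_fraction)

shootout_win = 0.5                          # assumed coin flip on PKs
win_given_ot = ot_win + shootout_win * ot_draw
total_win = reg_win + reg_draw * win_given_ot

print(f"OTW: {ot_win:.3f}  OTL: {ot_loss:.3f}  OTD: {ot_draw:.3f}")
print(f"win% once in OT: {win_given_ot:.3f}")
print(f"total win expectancy: {total_win:.3f}")
```

Changing `shootout_win` is an easy way to see how much the final number depends on the 50/50 assumption.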

If we want to model a sudden death OT, such as in the Stanley Cup playoffs, the odds of winning in regulation remain unchanged, but we have to use a different formula to determine the chances of winning once the game goes to overtime. The Poisson distribution works for estimating the probability of scoring a certain number of goals in a pre-determined amount of time (such as a 20-minute period or a 60-minute game), but not for estimating the time until the next goal. For that, we instead need the exponential distribution, which models the amount of time until the next goal.

We want to know the probability that our team's time until its next goal is less than its opponent's time to its next goal. Recall the above formula we used to determine the odds of our team's goals scored being higher than its opponent's:

$$\sum_{i=0}^{\infty} p_s(i) \sum_{j=0}^{i-1} p_a(j)$$

Here, we use something similar, except that we want to know the chance that our team's value (time to the next goal) is less than that of its opponent:

$$\sum_{i=0}^{\infty} p_a(i) \sum_{j=0}^{i-1} p_s(j)$$

where ps(j) is the probability of our team's next goal coming after j amount of elapsed time, and pa(i) is the probability of its opponent's next goal coming after i amount of elapsed time.

Additionally, we are now dealing with a continuous variable (time elapsed) rather than a discrete variable (number of goals scored), so we need to integrate instead of sum:

$$\int_0^{\infty} f(x) \left( \int_0^{x} g(t)\,dt \right) dx$$

where f(x) is the probability density for the amount of time until the opponent's next goal, and g(t) is the probability density for the amount of time until our team's next goal. In this formula, f(x) is an exponential probability density function with λ=expected goals allowed (Ga), and the inner integral of g is an exponential cumulative distribution function with λ=expected goals scored (Gs):

$$\int_0^{\infty} G_a e^{-G_a x} \left(1 - e^{-G_s x}\right) dx$$

This might look a bit ugly (or maybe not since e^x is such a simple integration), but it simplifies to just:

$$\frac{G_s}{G_s + G_a}$$

This makes perfect sense if we think about the next goal being a goal randomly selected from the distribution of possible goals in the game: the odds that the randomly selected goal comes from our team equal the percentage of total goals we expect to come from our team, and the odds that the randomly selected goal comes from our opponent equal the percentage of total goals we expect to come from them.
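For readers who want to see the intermediate steps, the simplification can be worked out by splitting the integrand into two exponential terms. The first term is just the total area under the opponent's exponential density, which is 1:

```latex
\begin{aligned}
\int_0^{\infty} G_a e^{-G_a x}\left(1 - e^{-G_s x}\right) dx
  &= \int_0^{\infty} G_a e^{-G_a x}\,dx \;-\; G_a \int_0^{\infty} e^{-(G_a + G_s)x}\,dx \\
  &= 1 - \frac{G_a}{G_a + G_s} \\
  &= \frac{G_s}{G_s + G_a}
\end{aligned}
```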

Now that we have a model for sudden-death OT, we can estimate a team's chances of winning a game with sudden death OT. For example, say we have a hockey game where we expect our team to score 3 goals and allow 2 goals on average. This team would be expected to win in regulation about 58.5% of the time, lose in regulation about 24.7% of the time, and go to OT 16.8% of the time. Once in OT, it will win 3/(3+2)=60% of the time, so its total expected wins is .585 + .6*.168 = .686.
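The hockey example combines the two pieces: a Poisson table for regulation and the Gs/(Gs+Ga) ratio for sudden death. A rough sketch, again with helper names and a table cutoff that are my own choices:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return exp(-lam) * lam**k / factorial(k)

def outcomes(goals_for, goals_against, max_goals=30):
    """(win, draw, loss) probabilities for independent Poisson scores."""
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, goals_for) * poisson_pmf(j, goals_against)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss

def sudden_death_win(goals_for, goals_against):
    """Chance that our next goal comes before the opponent's: Gs / (Gs + Ga)."""
    return goals_for / (goals_for + goals_against)

gs, ga = 3, 2
reg_win, reg_tie, reg_loss = outcomes(gs, ga)
ot_win = sudden_death_win(gs, ga)
total_win = reg_win + reg_tie * ot_win

# Should come out near .585 / .247 / .168 in regulation and .686 overall.
print(f"regulation W: {reg_win:.3f}  L: {reg_loss:.3f}  OT: {reg_tie:.3f}")
print(f"sudden-death win%: {ot_win:.3f}")
print(f"total win expectancy: {total_win:.3f}")
```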

Another interesting use of these distributions is to evaluate different strategies or lineups for a team (given that you can estimate the expected goals scored and allowed for varying lineups/strategies). Returning to the soccer team example where we have a team that we expect to score two goals and allow one, let's say that they are capable of making adjustments that make them stronger defensively, but at the cost of a significant portion of their offense. Say that they can play a defensive game and allow just .38 goals per game, but that doing so reduces their expected offensive output to 1.2 goals per game. In regular league play, the new defensive alignment will still average 2.03 points per game, so there is no benefit to this change.

In a tournament elimination game, however, their win expectancy rises from .735 to .761, because the increase in regulation draws will still lead to a lot of wins (~61% of OT games) instead of just 1-point outcomes. What's more, if they switch back to the more aggressive game in OT (their 2 goals for, 1 goal against form), they can slightly improve their OT win odds (from .608 to .611) by avoiding more shootouts.
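The comparison between the two styles of play can be sketched as a pair of small functions, one for league points and one for an elimination match. The tuple layout, function names, and default arguments below are illustrative assumptions; the 30-minute OT and 50/50 shootout are the same assumptions used earlier.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return exp(-lam) * lam**k / factorial(k)

def outcomes(goals_for, goals_against, max_goals=25):
    """(win, draw, loss) probabilities for independent Poisson scores."""
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, goals_for) * poisson_pmf(j, goals_against)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss

def league_points(style):
    """Expected points per game with 3 for a win and 1 for a draw."""
    win, draw, _ = outcomes(*style)
    return 3 * win + draw

def elimination_win(reg_style, ot_style, ot_fraction=30 / 90, shootout_win=0.5):
    """Win expectancy when a regulation draw goes to a timed OT, then PKs.

    Each style is (goals for, goals against) per full-length match; the OT
    rates are scaled down by ot_fraction.
    """
    reg_win, reg_draw, _ = outcomes(*reg_style)
    ot_win, ot_draw, _ = outcomes(ot_style[0] * ot_fraction,
                                  ot_style[1] * ot_fraction)
    return reg_win + reg_draw * (ot_win + shootout_win * ot_draw)

aggressive = (2.0, 1.0)
defensive = (1.2, 0.38)

print(f"league points: {league_points(aggressive):.2f} vs {league_points(defensive):.2f}")
print(f"elimination, aggressive throughout: {elimination_win(aggressive, aggressive):.3f}")
print(f"elimination, defensive throughout:  {elimination_win(defensive, defensive):.3f}")
print(f"elimination, defensive then aggressive OT: {elimination_win(defensive, aggressive):.3f}")
```

Keeping the regulation and OT styles as separate arguments makes it easy to test mixed approaches like the one described above.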

Similarly, a sudden death format, where only the ratio of goals scored to goals allowed matters, can also produce different ideal strategies. Doubling both expected goals scored and allowed, for example, would have a significant effect on a team's odds of winning in regulation, but would have no effect on sudden death because it preserves the ratio of offense to defense. Conversely, changes that have no impact on regulation (like going from 2 goals for/1 goal against to 1.2 goals for/.38 goals against in a regular-season-format soccer match) could have a significant impact on sudden death chances (.667 to .759 once you get to sudden death). Of course, any changes in strategy called for by different formats would depend on the team's ability to adapt to a different style of play and on how such changes affect its expected offensive and defensive production. Still, it is possible for an ideal lineup or strategy in one format to not be ideal in another, and using Poisson distributions to find the connection between offensive and defensive production and expected W-L performance is helpful in evaluating potential changes.
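To make the contrast concrete, here is a small sketch that puts the regulation win probability next to the sudden-death win probability for a few scoring profiles. The function names and the 40-goal cutoff are assumptions for illustration only.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return exp(-lam) * lam**k / factorial(k)

def regulation_win(goals_for, goals_against, max_goals=40):
    """Chance of outscoring the opponent outright under independent Poisson scores."""
    return sum(poisson_pmf(i, goals_for) * poisson_pmf(j, goals_against)
               for i in range(max_goals + 1)
               for j in range(i))            # j < i means we scored more

def sudden_death_win(goals_for, goals_against):
    """Chance of scoring the next goal: depends only on the ratio of the rates."""
    return goals_for / (goals_for + goals_against)

profiles = [(2, 1), (4, 2), (1.2, 0.38)]
for gf, ga in profiles:
    print(f"{gf} for / {ga} against: "
          f"regulation win {regulation_win(gf, ga):.3f}, "
          f"sudden death {sudden_death_win(gf, ga):.3f}")
```

The (2, 1) and (4, 2) rows share the same sudden-death figure but not the same regulation figure, while the (1.2, 0.38) row moves the sudden-death figure from .667 to .759.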

Luis Gonzalez painting

2011-05-15T21:27:00.006-07:00

My most recent painting, for a charity auction in Phoenix, AZ. It's supposed to be Game 7 of the 2001 World Series, but, coincidentally, it also happens to feature the only two numbers retired by the Diamondbacks. Click the image for a larger size.