Weighting Calculator for Past Data


Math Behind Weighting Past Results (THT Article)

In my article "The Math of Weighting Past Results" on the Hardball Times, I gave a formula for finding the proper weighting for past data given certain inputs from the dataset. This formula defined the relationship between weighted results and talent, and the proper weighting was the value that maximized that relationship.

I started with a formula for a sample with exactly two days and then generalized that to cover any length of sample. I more or less explained where the two-day version came from in the article, but not the full version, which was as follows:

This supplement will go through the calculations of generalizing the simpler two-day formula to what we see above. It will rely heavily on the use of geometric series, so I would recommend having some familiarity with those before attempting to follow these calculations.

In the article, we treated each day's results as a separate variable and the overall sample as a sum of these individual daily variables. When we had two days in our sample, the combined variance was defined by the following formula:
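For reference, the standard identity for the variance of a sum of two variables, written with the same Var and Cov labels used in the matrix that follows, is:

$$
\mathrm{Var}_{x_1+x_2} = \mathrm{Var}_{x_1} + \mathrm{Var}_{x_2} + 2\,\mathrm{Cov}_{x_1,x_2}
$$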

This formula can be expanded to include more than two variables, but it starts to get messy really quickly. To make expanding it simpler, the formula can be rewritten as a covariance matrix. If you have n variables, then the covariance matrix will be an n x n array, where each entry is the covariance between the variables representing that row and column. For two days, we would fill in the covariance matrix as follows:

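$$
\begin{array}{c|cc}
 & x_1 & x_2 \\ \hline
x_1 & \mathrm{Var}_{x_1} & \mathrm{Cov}_{x_1,x_2} \\
x_2 & \mathrm{Cov}_{x_1,x_2} & \mathrm{Var}_{x_2}
\end{array}
$$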

The combined variance is equal to the sum of the items in the matrix, which you can see is equivalent to the above formula.

This makes it much simpler to expand the formula for additional variables since you just have to add more rows and columns to the matrix. In the article, we found that the day-to-day correlation of talent (r) and the decay factor used to weight past data (w) can be used to explain changes in the variances and covariances throughout the sample:

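Translating this to our covariance matrix gives us (with $\mathrm{Var}_{x_1}$ and $\mathrm{Var}_{true}$ written as $v_x$ and $v_t$ to save space):

$$
\begin{array}{c|cc}
 & x_1 & w x_2 \\ \hline
x_1 & v_x & r w \, v_t \\
w x_2 & r w \, v_t & w^2 v_x
\end{array}
$$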

If we expand this to include additional days, every term except those on the diagonal will include a $\mathrm{Var}_{true}$ factor, and those on the diagonal will instead have a $\mathrm{Var}_{x_1}$ factor. (This is because the terms on the diagonal represent the covariance of each variable with itself, which is just the variance of that variable.) Similarly, every term contains an r factor and a w factor, except that the terms on the diagonal have no r (because these are relating the results of one day to themselves, so it is irrelevant how much talent changes from day to day).

For now, let's strip out the variance factors and focus only on what happens to r and w as we expand the matrix to cover more days. We'll look at r and w separately, but keep in mind these are just factors from the same matrix, not two separate matrices. If you placed one on top of the other, so that each r term lines up with the corresponding w term, and then put the variances back in, you'd get the full matrix.

This covariance matrix is essentially the same as what we worked with for the variance article, except now we are introducing weights for past results. As a result, the only real difference here is what happens with the w's, and the r terms follow the same pattern as in the math for the variance article:

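$$
\begin{array}{c|cccccc}
 & x_1 & w x_2 & w^2 x_3 & w^3 x_4 & \cdots & w^{d-1} x_d \\ \hline
x_1 & r^0 & r^1 & r^2 & r^3 & \cdots & r^{d-1} \\
w x_2 & r^1 & r^0 & r^1 & r^2 & \cdots & r^{d-2} \\
w^2 x_3 & r^2 & r^1 & r^0 & r^1 & \cdots & r^{d-3} \\
w^3 x_4 & r^3 & r^2 & r^1 & r^0 & \cdots & r^{d-4} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
w^{d-1} x_d & r^{d-1} & r^{d-2} & r^{d-3} & r^{d-4} & \cdots & r^0
\end{array}
$$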

The weights also follow a pattern, though not the same one as the r factors. The weight for each term equals the combined weight of the two variables it represents:

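$$
\begin{array}{c|cccccc}
 & x_1 & w x_2 & w^2 x_3 & w^3 x_4 & \cdots & w^{d-1} x_d \\ \hline
x_1 & w^0 & w^1 & w^2 & w^3 & \cdots & w^{d-1} \\
w x_2 & w^1 & w^2 & w^3 & w^4 & \cdots & w^{d} \\
w^2 x_3 & w^2 & w^3 & w^4 & w^5 & \cdots & w^{d+1} \\
w^3 x_4 & w^3 & w^4 & w^5 & w^6 & \cdots & w^{d+2} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
w^{d-1} x_d & w^{d-1} & w^{d} & w^{d+1} & w^{d+2} & \cdots & w^{2(d-1)}
\end{array}
$$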
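To see how the r and w patterns combine once the variance factors are put back in, here is a minimal Python sketch that builds the full matrix for any number of days. The function names are hypothetical and the days are indexed from zero (so day i carries weight w^i); this is an illustration of the layout described here, not code from the article.

```python
def weighted_cov_matrix(d, r, w, var_x, var_true):
    """Covariance matrix of the weighted daily results x1, w*x2, ..., w^(d-1)*xd."""
    matrix = []
    for i in range(d):
        row = []
        for j in range(d):
            weight = w ** (i + j)  # combined weight of the two days
            if i == j:
                # Diagonal: variance of one day's results, no r factor.
                row.append(weight * var_x)
            else:
                # Off-diagonal: shared talent variance, discounted by r per day apart.
                row.append(weight * r ** abs(i - j) * var_true)
        matrix.append(row)
    return matrix


def matrix_sum(matrix):
    """Sum of all entries, i.e. the variance of the weighted total."""
    return sum(sum(row) for row in matrix)
```

With d=2 this reduces to the two-day matrix shown earlier: vx and w^2 vx on the diagonal, and rw*vt in the two off-diagonal cells.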

While the two patterns are different, there are three important things to note that hold for both of them:

1) The terms on the main diagonal form their own distinct pattern.
2) The remaining terms are symmetrical about the diagonal, with the terms above and below the diagonal mirroring each other.
3) The terms on each diagonal parallel to the main diagonal follow a distinct pattern.

We need to find the sum of the matrix to get the variance in the weighted results. Using these three observations, we can simplify the sum by dividing the matrix up into parts.

We'll start with the main diagonal of the matrix. The terms on the diagonal follow the form $w^{2i} \cdot \mathrm{Var}_{x_1}$. The sum of these terms is a geometric series, which makes it simple to evaluate:
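Assuming the days are indexed i = 0 to d-1, as in the matrices above, the standard geometric series formula gives the following for the diagonal (valid when w is not exactly 1):

$$
\sum_{i=0}^{d-1} w^{2i}\,\mathrm{Var}_{x_1} = \mathrm{Var}_{x_1}\,\frac{1-w^{2d}}{1-w^{2}}
$$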

Next, because the matrix is symmetrical about the diagonal, we can focus on the sum for only the terms above or below the diagonal and then double our result later.

We'll compute this sum by continuing to divide the matrix along its diagonal rows. The r values within a given diagonal are all identical, which we can see in this graphic from the math for the previous article on variance:

The w values within each diagonal also follow a set pattern, though slightly more complex than the one for r's. Rather than $r^1+r^1+r^1+\dots$, we get $w^1+w^3+w^5+\dots$ The basic pattern for the first diagonal is:

That's just for the w component of each term. If we include the r and variance components, we get this for the sum of the terms in the first diagonal adjacent to the main diagonal:

This is still a geometric series, so we can evaluate the sum for this diagonal.

For the second diagonal, the w's go $w^2+w^4+w^6+\dots$, which gives us:

If we keep going, we'll find that for each additional diagonal, the exponent for r will rise by one, the starting value of i in the summation will rise by one (which also means the summation will have one fewer term, which we can see by looking at the matrix), and each diagonal will alternate having an extra w outside the geometric sum due to the diagonals alternating between odd and even exponents.

Fortunately, the alternating w problem disappears when we distribute that w back into the result for the geometric sum of each odd diagonal. We end up with the following pattern for the sum of each diagonal (after factoring out the $\mathrm{Var}_{true}$ component from each term):

This gives us two separate geometric series: the first multiplies by a factor of rw, and the second by a factor of r/w. Simplifying these geometric series gives us:

That gives us the sum of everything above the main diagonal in the covariance matrix. To get the full sum of the matrix, we need to double this (to account for everything below the diagonal, which mirrors this calculation) and add the sum of the main diagonal:

This gives us the full variance of the weighted results. Our formula calls for the standard deviation instead of the variance, so we just take the square root of this.
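As a sanity check on the steps above, the sketch below compares a brute-force sum of the matrix against a closed form assembled from the same pieces: the diagonal geometric series, plus twice the difference of the rw and r/w series. The closed form here is reconstructed from the description in this post and may be arranged differently from the formula in the article; the function names are hypothetical, and it assumes w is not 1, rw is not 1, and r is not equal to w.

```python
def brute_force_variance(d, r, w, var_x, var_true):
    """Add up every cell of the weighted covariance matrix directly."""
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                total += w ** (2 * i) * var_x
            else:
                total += w ** (i + j) * r ** abs(i - j) * var_true
    return total


def closed_form_variance(d, r, w, var_x, var_true):
    """Same sum via geometric series (assumes w != 1, r*w != 1, r != w)."""
    # Main diagonal: var_x * (1 + w^2 + w^4 + ...)
    diagonal = var_x * (1 - w ** (2 * d)) / (1 - w ** 2)
    # Everything above the diagonal: one series in r*w, one in r/w.
    rw = r * w
    r_over_w = r / w
    first = rw * (1 - rw ** (d - 1)) / (1 - rw)
    second = w ** (2 * d) * r_over_w * (1 - r_over_w ** (d - 1)) / (1 - r_over_w)
    upper = var_true * (first - second) / (1 - w ** 2)
    # Double the upper part to cover the mirrored lower part.
    return diagonal + 2 * upper
```

Taking the square root of either result gives the standard deviation of the weighted results.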

Next, we need to calculate the covariance between current talent and the weighted observations. We can get this using another covariance matrix based on the idea of "shared" variance mentioned in the Hardball Times article. The covariance between the results and talent for a given day is the same as the variance in talent, since the variance in talent is inherent in the variance of the results (i.e. that variance is shared between the results and the talent levels for that day).

To fill out the rest of the covariance matrix, we use the fact that the covariance between results and current talent drops the further the results are from the present time. The amount the covariance drops is determined by the day-to-day correlation in talent and the weight given to past data:

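$$
\begin{array}{c|cccccc}
 & x_1 & w x_2 & w^2 x_3 & w^3 x_4 & \cdots & w^{d-1} x_d \\ \hline
t_1 & (rw)^0 v_t & (rw)^1 v_t & (rw)^2 v_t & (rw)^3 v_t & \cdots & (rw)^{d-1} v_t
\end{array}
$$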

This is also a geometric series which multiplies by a factor of rw. The sum simplifies to:
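Assuming the sum runs over days i = 0 to d-1, as in the row above, the standard geometric series formula gives (for rw not equal to 1):

$$
\sum_{i=0}^{d-1} (rw)^{i}\, v_t = v_t\,\frac{1-(rw)^{d}}{1-rw}
$$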

As long as we know the values for r, w, $\mathrm{Var}_{true}$ and $\mathrm{Var}_{x_1}$, we can work out what the variance will be over any number of days, which means as long as we know r, $\mathrm{Var}_{true}$ and $\mathrm{Var}_{x_1}$, we can find the value of w which maximizes the relationship between weighted results and current talent.

Typically we would find this by taking the derivative of the formula and finding the point where the derivative equals 0, but given that this is a rather unpleasant derivative to calculate (and most likely will have difficult-to-find zeroes), I would strongly recommend just using the optimize function in R or some other statistical program (the calculator on the Hardball Times uses the same method to minimize/maximize a function as the optimize function in R).
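For readers without R handy, here is a rough Python sketch of the numerical approach. R's documentation describes optimize as combining golden section search with successive parabolic interpolation; the sketch uses only a plain golden-section search, which is simpler and slower. It assumes the correlation has a single peak in w on the search interval. The function names and the input values in the usage line are made up for illustration.

```python
import math


def correlation(w, d, r, var_x, var_true):
    """Correlation between the weighted results and current talent."""
    # Covariance with current talent: vt * sum of (rw)^i.
    cov = sum((r * w) ** i * var_true for i in range(d))
    # Variance of the weighted results: sum of the full covariance matrix.
    var_weighted = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                var_weighted += w ** (2 * i) * var_x
            else:
                var_weighted += w ** (i + j) * r ** abs(i - j) * var_true
    return cov / math.sqrt(var_weighted * var_true)


def golden_section_max(f, lo, hi, tol=1e-8):
    """Locate the maximum of a single-peaked function f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        e = a + inv_phi * (b - a)
        if f(c) > f(e):
            b = e  # the peak lies in [a, e]
        else:
            a = c  # the peak lies in [c, b]
    return (a + b) / 2


# Hypothetical inputs: 50 days, day-to-day talent correlation of 0.99.
best_w = golden_section_max(
    lambda w: correlation(w, d=50, r=0.99, var_x=1.0, var_true=0.1), 0.01, 1.0
)
```

Summing the matrix directly rather than using the closed form keeps the function well-defined at awkward points such as w = r or w = 1, at the cost of some speed.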

One final note: this all relies on the assumption of exponential decay weighting. Exponential decay is not necessarily implied by the underlying mathematical processes; it's an assumption we are making to make our lives easier. Theoretically, we could fit the weight for each day individually, but this is far, far more complicated and not really worth the effort.

If you had 100 days in your sample, instead of maximizing the correlation for w, you would have to maximize it for a system of 100 different weight variables. If you would like to attempt this, by all means, have fun, but, while the exponential decay assumption is a simplification, it does work pretty well.

The true weight values do tend to drop slightly faster for the most recent data and then level out more for older data than exponential decay allows for, but on the whole, it doesn't make that much difference to use exponential decay.

Math Behind Regression with Changing Talent Levels (THT Article)

In my article "Regression with Changing Talent Levels: the Effects of Variance" on the Hardball Times, I talk about how changes in players' true talent levels from day to day reduce the variance of talent in the population overall over time. In other words, the spread in talent over a 100-game sample will be smaller than the spread in talent over a one-game sample. In the article, I gave the following formula to calculate how much the spread in talent is reduced, which I will further explain here:

*Note: in the THT article, I used d for the number of days instead of n to avoid confusion with another formula that was referenced from a previous article, which used n for something else. For this article, I'm just going to use n for the number of days.

The value given by the formula is the ratio of talent variance over n days to the talent variance for a single day. In other words, the variance in talent drops by a multiplicative factor that is dependent on the length of the sample and the correlation of talent from day to day.

Now, how do we get that formula?

If we only have two days in our sample, it is not too difficult to calculate the drop in talent variance. Let t0 be a variable representing player talent levels on Day 1, and t1 be a variable representing player talent levels on Day 2. We want to find the variance of the average talent levels over both days, or (t0+t1)/2.

The following formula gives us the variance of the sum of two variables:

The covariance is directly proportional to the correlation between the two variables and is defined as follows:

(Note that $sd_{t_0} \cdot sd_{t_1} = \mathrm{var}_{t_0} = \mathrm{var}_{t_1}$ because both variables have the same standard deviation and variance.)
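Combining the two standard formulas above, and writing $\mathrm{var}_t$ for the common single-day variance, the two-day case works out as:

$$
\begin{aligned}
\mathrm{cov}_{t_0,t_1} &= r \cdot sd_{t_0} \cdot sd_{t_1} = r \cdot \mathrm{var}_t \\
\mathrm{var}_{t_0+t_1} &= \mathrm{var}_{t_0} + \mathrm{var}_{t_1} + 2\,\mathrm{cov}_{t_0,t_1} = 2\,\mathrm{var}_t\,(1+r)
\end{aligned}
$$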

Before we continue, there is an important thing to note. Because we are trying to derive a formula for a ratio (variance in talent over n days divided by variance in talent over one day), we don't necessarily need to calculate the numerator and denominator of that ratio exactly. As long as we can calculate values that are proportional to those values by the same factor, the ratio will be preserved.

Technically, we want the variance of the value (t0+t1)/2 and not just t0+t1, which would be $\mathrm{var}_t(1+r)/2$ instead of $2\,\mathrm{var}_t(1+r)$. However, those two values are proportional, so it doesn't really matter for now which we calculate as long as we can also calculate a value for the denominator that is proportional by the same factor.

For two days, the above calculations are simple enough. Once you start adding more days, however, it starts to get more complicated. Fortunately, the above math can also be expressed with a covariance matrix:

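$$
\begin{array}{c|cc}
 & t_0 & t_1 \\ \hline
t_0 & \mathrm{var}_0 & \mathrm{cov}_{0,1} \\
t_1 & \mathrm{cov}_{0,1} & \mathrm{var}_1
\end{array}
$$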

The variance of the sum t0+t1 is equal to the sum of the terms in the covariance matrix, which you can see just gives us the formula: $\mathrm{var}_{t_0+t_1} = \mathrm{var}_{t_0} + \mathrm{var}_{t_1} + 2\,\mathrm{cov}_{t_0,t_1}$. The covariance matrix is convenient because it can be expanded for any number of days:

Covariance matrix between talent n days apart

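$$
\begin{array}{c|cccccc}
 & t_0 & t_1 & t_2 & t_3 & \cdots & t_{n-1} \\ \hline
t_0 & \mathrm{var}_0 & \mathrm{cov}_{0,1} & \mathrm{cov}_{0,2} & \mathrm{cov}_{0,3} & \cdots & \mathrm{cov}_{0,n-1} \\
t_1 & \mathrm{cov}_{0,1} & \mathrm{var}_1 & \mathrm{cov}_{1,2} & \mathrm{cov}_{1,3} & \cdots & \mathrm{cov}_{1,n-1} \\
t_2 & \mathrm{cov}_{0,2} & \mathrm{cov}_{1,2} & \mathrm{var}_2 & \mathrm{cov}_{2,3} & \cdots & \mathrm{cov}_{2,n-1} \\
t_3 & \mathrm{cov}_{0,3} & \mathrm{cov}_{1,3} & \mathrm{cov}_{2,3} & \mathrm{var}_3 & \cdots & \mathrm{cov}_{3,n-1} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
t_{n-1} & \mathrm{cov}_{0,n-1} & \mathrm{cov}_{1,n-1} & \mathrm{cov}_{2,n-1} & \mathrm{cov}_{3,n-1} & \cdots & \mathrm{var}_{n-1}
\end{array}
$$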

We can also construct a correlation matrix. Given that we know the correlation of talent from one day to the next, this isn't that difficult. If the correlation between talent levels on Day 1 and Day 2 is r, and the correlation between talent levels on Day 2 and Day 3 is also r, we can chain those two facts together to find that the correlation between talent levels on Day 1 and Day 3 is $r^2$.

The same logic can be extended for any number of days, so that the correlation between talent levels n days apart is $r^n$:

Correlation matrix between talent n days apart

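$$
\begin{array}{c|cccccc}
 & t_0 & t_1 & t_2 & t_3 & \cdots & t_{n-1} \\ \hline
t_0 & r^0 & r^1 & r^2 & r^3 & \cdots & r^{n-1} \\
t_1 & r^1 & r^0 & r^1 & r^2 & \cdots & r^{n-2} \\
t_2 & r^2 & r^1 & r^0 & r^1 & \cdots & r^{n-3} \\
t_3 & r^3 & r^2 & r^1 & r^0 & \cdots & r^{n-4} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
t_{n-1} & r^{n-1} & r^{n-2} & r^{n-3} & r^{n-4} & \cdots & r^0
\end{array}
$$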

This matrix is more useful than the covariance matrix, because all we need to know to fill in the entire correlation matrix is the value of r. And because correlation is proportional to covariance ($\mathrm{cov}_{t_0,t_1} = r \cdot \mathrm{var}_{t_0}$), the sum of the correlation matrix is proportional to the sum of the covariance matrix.

Our next step, then, is to calculate the sum of the correlation matrix. Notice that the terms on each diagonal going from the top left to bottom right are identical:

We can use this pattern to simplify the sum. Since the matrix is symmetrical, we can ignore the terms below the long diagonal and calculate the sum for just the top half of the matrix, and then double it later:

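$$
\begin{array}{cccccc}
r^0 & r^1 & r^2 & r^3 & \cdots & r^{n-1} \\
 & r^0 & r^1 & r^2 & \cdots & r^{n-2} \\
 & & r^0 & r^1 & \cdots & r^{n-3} \\
 & & & r^0 & \cdots & r^{n-4} \\
 & & & & \ddots & \vdots \\
 & & & & & r^0
\end{array}
$$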

There is one $r^0$ term in each column of the matrix, so there are n $r^0$ terms in the sum. Likewise, there are (n-1) $r^1$ terms, (n-2) $r^2$ terms, etc. If we group each diagonal into its own distinct term, we get a sum whose terms follow the pattern $(n-i) \cdot r^i$:
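The diagonal counting can be checked numerically. The sketch below (function names are hypothetical) builds the correlation matrix directly, sums the upper half by diagonals, and confirms that doubling the upper half and then subtracting the main diagonal once recovers the full matrix sum, which is the correction the derivation applies near its end.

```python
def correlation_matrix(n, r):
    """n x n matrix whose (i, j) entry is r^|i-j|."""
    return [[r ** abs(i - j) for j in range(n)] for i in range(n)]


def upper_half_sum(n, r):
    """Upper half including the main diagonal: n terms of r^0, n-1 of r^1, ..."""
    return sum((n - i) * r ** i for i in range(n))


n, r = 5, 0.9
full_sum = sum(sum(row) for row in correlation_matrix(n, r))
# Doubling the upper half counts the main diagonal twice, so remove it once.
rebuilt = 2 * upper_half_sum(n, r) - n
```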

Applying the distributive property and separating the terms of the sum, we get the following:

The first sum is a simple geometric series, which we can calculate using the formula for geometric series:

The second sum is similar, but the additional i factor makes it a bit trickier since it is no longer a geometric series. We can, however, transform it into a geometric series using a trick where we convert this from a single sum to a double sum, where we replace the expression inside the sum with another sum.

The idea is that each term of the series is itself a separate sum which has i terms of $r^i$. This sum can be written as follows:

Notice that we switched to using the index h rather than i. This means there is nothing inside the sum that increments on each successive term, and the i acts as a static value. In other words, this is just adding up the value $r^i$ i times, which is of course equal to $i r^i$.

In order to visualize how this double sum works, we can write down the terms of the sum in an array with i rows and h columns, where the value corresponding to each pair of (i,h) values is $r^i$. For example, here is what the array would look like with n=4:

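$$
\begin{array}{c|cccc}
 & h=0 & h=1 & h=2 & h=3 \\ \hline
i=0 & \textcolor{gray}{r^0} & \textcolor{gray}{r^0} & \textcolor{gray}{r^0} & \textcolor{gray}{r^0} \\
i=1 & r^1 & \textcolor{gray}{r^1} & \textcolor{gray}{r^1} & \textcolor{gray}{r^1} \\
i=2 & r^2 & r^2 & \textcolor{gray}{r^2} & \textcolor{gray}{r^2} \\
i=3 & r^3 & r^3 & r^3 & \textcolor{gray}{r^3}
\end{array}
$$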

The greyed-out values are included to complete the array, but are not actually part of the sum. If we go through the sum iteratively, we start at i=0, and take the sum of $r^i$ from h=0 to h=-1. Since you can't count up from 0 to -1, there are no values to count in this row, which represents the fact that $i r^i = 0$ when i=0.

Next, we go to i=1, and fill in the values $r^1$ for h=0 to h=0. In the next row, when i=2, we go from h=0 to h=1. And so on.

We are currently taking the sum of each row and then adding those individual sums together. However, we could also start by taking the sum of each column, which would be equivalent to reversing the order of the two sums in our double series:

Note that the inner sum now goes from i=h+1 to i=n-1, which you can see in the columns of the array of terms above.

This is useful because each column of the array is a geometric series, meaning it will be easy to compute. The sum of each column is just the geometric series from i=0 to i=n-1. Then, to eliminate the greyed-out values from the sum, we subtract the geometric series from i=0 to i=h.
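Here is a small Python sketch of the row-versus-column idea, with hypothetical function names and assuming r is not 1. The first function adds the array row by row, which is the original sum of i times r^i. The second adds it column by column, using the geometric series formula for the full column and subtracting the greyed-out part.

```python
def sum_by_rows(n, r):
    """Row i contributes i copies of r^i."""
    return sum(i * r ** i for i in range(n))


def sum_by_columns(n, r):
    """Column h contributes r^(h+1) + ... + r^(n-1), via geometric series."""
    total = 0.0
    for h in range(n - 1):
        full_column = (1 - r ** n) / (1 - r)        # i = 0 .. n-1
        greyed_out = (1 - r ** (h + 1)) / (1 - r)   # i = 0 .. h
        total += full_column - greyed_out
    return total
```

For n=4 and r=0.5, both orders give 0.5 + 2(0.25) + 3(0.125) = 1.375.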

This is the value for our inner sum, so we plug that back into the outer sum:

We now have values for both halves of our original sum, so next we combine them to get the full value:

We still have one more step to go to calculate the full sum of the correlation matrix. Recall that when we started, we were working with a symmetrical correlation matrix, and because the matrix was symmetrical along the diagonal, we set out to find the sum for only the upper half of the matrix. In order to get the sum of the full matrix, we have to double this value:

Finally, note that the long diagonal of the correlation matrix only occurs once in the matrix, so by doubling our initial sum, we are double-counting that diagonal. In order to correct for this, we need to subtract the sum of that diagonal, which is just n*1 (since each element in that diagonal equals 1):

This value is proportional to the sum of the covariance matrix, which is proportional to the variance of talent in the population over n days.

Next, we need to come up with a corresponding value to represent the variance of talent over a single day. To do this, we can rely on the fact that as long as talent never changes, the variance in talent over any number of days is the same as the variance in talent over a single day. Instead of comparing to the variance in talent over a single day, we can instead compare to the variance in talent over n days when talent is constant from day to day.

This allows us to construct a similar correlation matrix to represent the constant-talent scenario. Compared to the correlation matrix for changing talent, this is trivially simple: since talent levels are the same throughout the sample, the correlation between talent from one day to the next will always be one.

In other words, the correlation matrix will just be an n x n array of 1s. And the sum of an n x n array of 1s is just n^2.

The ratio of these two values will give us the ratio of talent variance after n days of talent changes to the talent variance when talent is constant:

And that is our formula for finding the ratio of variance in true talent over n days to the variance in true talent on a single day, given the value r for the correlation of true talent from one day to the next. With some simplification, the above formula is equivalent to what was posted in the THT article:
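The whole derivation can be checked numerically with a short Python sketch. The closed form below is one algebraically equivalent way of writing the result, worked out from the steps in this post; it may be arranged differently from the version printed in the THT article. Function names are hypothetical, and the closed form assumes r is not 1.

```python
def variance_ratio_brute(n, r):
    """Sum of the n x n correlation matrix divided by n^2."""
    total = sum(r ** abs(i - j) for i in range(n) for j in range(n))
    return total / n ** 2


def variance_ratio_closed(n, r):
    """Same ratio from the geometric-series result (assumes r != 1)."""
    # Sum of (n - k) * r^k for k = 1 .. n-1, i.e. everything above the diagonal.
    above_diagonal = r * (n * (1 - r) - (1 - r ** n)) / (1 - r) ** 2
    return (n + 2 * above_diagonal) / n ** 2
```

For two days this reduces to (1+r)/2, matching the two-day calculation earlier in the post.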


The Fight

It happened on May 15, 1912.

The once-mighty Detroit Tigers were off to a slow start.  It was to be a long season, their first losing one in six years.  Far from mollifying the pain of defeat, their past success only served to heighten the tension they felt—the old veterans had nearly forgotten what it was to lose, and the youthful among them had not known to begin with.  By contrast, their current situation, while not objectively hopeless, only felt that much more dire.

Needless to say, when the Tigers rolled into New York on their steam locomotive from Boston, where they’d just dropped another two out of three, and cozied up to Hilltop Park, they were a cohort on edge.

Hilltop Park, as it happened, seemed at first the perfect destination for such a group of men.  The Highlanders, not yet the storied franchise they would later become, were one of the few teams in the American League still worse than they were, and their boys were ripe for the beating.  Over the next three days, Detroit began to feel their season reforming beneath their cleats.  They took two of the first three and were nearly back to .500.  Once-shattered men began again to believe.

And so they took the field for the fourth and final game of the series.  Things began inauspiciously, with the teams trading blows for the first two innings and Detroit emerging from the proverbial fracas with a one-run lead.  As it were, such acts of violence were not to remain figurative.

Detroit’s star centerfielder, Tyrus Raymond Cobb, was so known for his gentle disposition that his teammates, half-mockingly but not without a hint of affection, referred to him as “the Georgia Peach”.  However, as Detroit’s standout performer, it was Cobb who found himself the target of the local malcontents who had made it their duty to suffer Highlander seasons firsthand.

Loudest among these was one Claude Lueker, a man whose brazenness had been honed in the fiery confines of Tammany Hall, and he spoke in ways of which only a man entrenched in politics could even conceive.  Such foul narratives poured from his mouth as would turn an oak tree barren just from the stench of their connotations.

For four innings this continued.  Cobb tried to escape the abuse by staying in centerfield for both turns at bat, sitting quietly against the outfield scoreboard and only speaking up to help direct the New York outfielders to avoid collisions.  However, Cobb was accustomed to reading between innings, and had in fact been looking forward to the New York trip where the country’s leading literary critics resided and published, and had that very day picked up a new analysis of Macbeth from just such a scholar before the game.  Only Cobb had left his reading glasses in the dugout, and was unable to study his text from the outfield.

And so, after four innings of careful isolation, Cobb finally felt it safe to brave the trek back to the dugout to retrieve his spectacles.  He knew at once he had been mistaken.  The heckler was on him again, this time saying things Cobb was certain could turn even the most ardent of free speech advocates into anti-seditionists.

Once in the dugout, Cobb was immediately accosted for his inaction.

“Dammit, Cobb!” cried Sam Crawford.  “This has gone on long enough!  There are children here, for crying out loud!”

Ed Willett soon chimed in.  “You can escape this nonsense out there in centerfield, but I’ve got to stand on the mound and listen to it!  You think Donie Bush would let this kind of thing go?  Sometimes I wish he were our future Hall of Famer.”

Cobb protested.  “Look, I’m sorry you all have to put up with this, but there’s nothing we can do.  We’ll be out of New York tomorrow, and we can put the whole thing behind us then.”

Wanting nothing more than to go back to the outfield where the fans were much more docile and many were willing to debate the merits of Mark Twain’s lesser novels (which was one of Cobb’s pet subjects), Cobb hoped he could leave it at that.  It was at this moment that an insult so offensive crept over the lip of the dugout and into the ears of the Detroit men that there was no longer anything Cobb could do for the hurler.

Hughie Jennings walked over and put his arm on Cobb’s shoulder.  “Look, son, I know you don’t like this any more than the rest of us.  Probably less than the rest of us.  But you’ve got to do something to shut that man up.”  Jennings' eyes glowed with a warm fierceness Cobb knew from experience he could not allay.  With a final pat on Cobb's shoulder, Jennings bored into him with those eyes and tried to reassure him:  “We’ll have your back.”  Cobb turned reluctantly toward the dugout steps.

After a tentative step into the stands, Cobb quickly retreated.  Jennings began to protest, but Cobb cut him off.  “Look, I know what you’re going to say, but the man is an invalid!  He’s got no hands!”

“I don’t care if he doesn’t have any feet!” Jennings bellowed.  “What must be done will be done, if not by you then by someone else!”

From the corner of his eye, Cobb saw Bill Burns reaching for his lumber.  Burns had long since washed out as an effective pitcher and had never been able to hit a lick, but he remained a towering hulk of a man, and Cobb knew it would not end pleasantly were he commissioned for the task.  So, even more reluctantly than before, Cobb slunk back up the dugout steps and into the stands, trailed behind by his fellow Tigers.

“Look,” Cobb said as he approached the man, “I wish you wouldn’t create such a ruckus, but also know that I haven’t any ill intent toward you.”  With that, Cobb raised his fist half-heartedly, when suddenly the man heaved his entire weight in the direction of Cobb.  Like two anteaters on the savanna they tumbled.  Cobb’s teammates jumped at the sight, storming into the stands with bats in hand.  Mayhem was upon the lower grandstand like flies on a heap of corpses and was not to be driven away.

At this point, the Highlanders, who had been surveying the local architecture beyond left field using Hal Chase’s new engineering sextant, heard the commotion and were made aware of the delay in the game.  They rushed to the aid of their fellow professionals, leaping unaware into the middle of the fray.  For the next forty-five minutes, fans and players were at each other in a most uncivilized manner before the umpires managed to get through to the telegraph office in the press box to wire the police.

By the time it was over, more than two dozen fans were injured, and several players received stern warnings for their behavior.  Ban Johnson, who happened to be in attendance and witnessed the second half of the brawl after returning from the concession stand, suspended the entire Detroit roster, and they had to play three days later against Philadelphia with a replacement nine.

And that, to this day, remains without a doubt the greatest fight in baseball history.

Baseball is Dying (1892 version)

At least that seems to be the opinion of Pittsburgh Dispatch sports editor John D. Pringle in his weekly "A Review of Sports" column:
If there were ever any doubts concerning the waning interest in baseball, the meeting of the magnates at Chicago during the past week must have dispelled them. The gathering was more like the meeting together of a lot of men to sing a funeral dirge than anything else. The proceedings were doleful despite the efforts of the magnates to wear smiles. Most certainly this annual meeting was far below par in enthusiasm with those of former years.
To be sure, those persons who court notoriety by always wanting rules changed and tinkered were at the meeting. There was no millenium plan this time; it is an exploded bladder now, but there was the new diamond notion and a few other things just as silly and just as characteristic of liquid intellects as the Utopian "plan." Of course all the venders of quack remedies pointed out that "something must be done to revive an interest in baseball." Ah! You see they admit the game's popularity is waning. Happily no changes were decided on.

Even more pessimistic was the Kansas City Times, which apparently wrote:
BASEBALL has apparently served its day and its days seem near an end. Perhaps there may be a renaissance. But the ball players have come to the end of their string; they can play very little better; there is no more progress to be made. The people have seen it all. They are tired of reviewing it.

By the way, this is the "new diamond notion" Pringle refers to:

As you can see, the proposal was to add a fifth base, with the middle bases positioned roughly where the infielders actually play. The basis for the proposal was twofold: One, it would increase the amount of fair territory by widening the angle between the first and third baselines, resulting in more base hits and fewer foul balls. Two, it would shorten the distance between stealable bases to 70 feet (along with the distance the catcher would have to throw the ball), leading to a more active running game.

By keeping the distance to first and to home the same, proponents hoped to minimize the impact on infield hits and scoring plays. By adding an extra base station and increasing the total distance around the bases, the extra action of more base hits and base stealing would not necessarily lead to a huge increase in scoring.



The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.


Finally, I think one of the biggest issues is that Howard may have misrepresented his research in the article. Since the full paper is behind a paywall, I don't know for sure or to what extent, but there are certainly indications that the article overstates Howard's conclusions.

One is the following graph, which is one of the few pieces of data Howard shares from his research:

The graph purportedly refutes the participation hypothesis by showing that the rating gap between males and females increases as the female participation rate increases. This supports Howard's alternative hypothesis that the most talented females are already playing no matter how low the overall female participation rate is, and that increasing the participation rate only adds less talented players and can never catch females up to males.

A few things jump out about this graph, though. First, the data on federations between 5-10% and 15-25% is completely missing from the graph, with the three remaining points forming a neat line with a clear slope. I have no idea if this was deliberate, but it is at least strange.

More importantly, Howard doesn't explain anywhere in his summary how the data is aggregated, how many players are included in each group, what countries are included in each group, how any individual federations rated, or why this particular graph was chosen out of the various studies or various number-of-games controls Howard seems to have run.

Howard singles out only Vietnam and Georgia as countries with high female participation in the text of the article. However, when I downloaded the April 2015 rating list, the difference between the average male rating and the average female rating in Vietnam (94 points) was significantly lower than the difference worldwide (153 points). And Georgia (35 points) had one of the smallest gender rating gaps in the world. I don't have data on the number of games played to check what happens when you include that control, but as I wrote in the previous post, I am skeptical that it could cause the rating gap for Georgia or Vietnam to suddenly jump above average.

What countries with high (25+%) female participation rate among FIDE-rated players had higher than average gender gaps? Ethiopia had a massive gap, with the average male rated 621 points higher than the average female. But there are only 30 Ethiopian players on the list, with just 9 females. Most of the other countries with a high percentage of females on the rating list that had above-average rating gaps also had very few players.

Now, I don't think it is Ethiopia that is throwing off Howard's chart, because I don't think any of the female players from Ethiopia have played enough FIDE-rated games to qualify for Howard's cutoff, but I wonder if Howard's graph is simply weighting all federations equally when he aggregates the data. If I try to recreate something like Howard's chart with the April 2015 rating data without any control for games played, then I do get a positive slope if I just take the simple average of each federation's rating gap. If I instead weight each federation's rating gap by the number of female players, so that, for example, Georgia with its hundreds of rated players gets more weight in the aggregate than Ethiopia with its 30, then I get a negative slope:

So it could be that Howard's graph is aggregating the data in a misleading way. I don't know for sure, but his results look a lot more like what I get when I aggregate the data in a misleading way. It is also possible that setting a control for players at 350 rated games played left relatively few players, and that after further splitting up the data into separate federations like this, there are simply not enough data points to get reliable results.
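To make the difference between the two aggregation schemes concrete, here is a minimal Python sketch. The federation figures are made up for illustration (they are not the April 2015 FIDE numbers), and the helper name `slope` is hypothetical. It fits a least-squares line of rating gap against female participation rate, once treating every federation equally and once weighting each federation by its number of rated female players.

```python
def slope(rates, gaps, weights=None):
    """Least-squares slope of rating gap vs. participation rate, optionally weighted."""
    if weights is None:
        weights = [1.0] * len(rates)  # every federation counts equally
    total = sum(weights)
    mean_rate = sum(w * r for w, r in zip(weights, rates)) / total
    mean_gap = sum(w * g for w, g in zip(weights, gaps)) / total
    cov = sum(w * (r - mean_rate) * (g - mean_gap)
              for w, r, g in zip(weights, rates, gaps))
    var = sum(w * (r - mean_rate) ** 2 for w, r in zip(weights, rates))
    return cov / var

# Hypothetical federations (NOT real FIDE data):
# (female participation rate, male-minus-female rating gap, rated female players)
federations = [
    (0.05, 150, 500),
    (0.10, 140, 800),
    (0.30, 40, 600),   # large federation, high participation, small gap
    (0.30, 600, 9),    # tiny federation, high participation, huge gap
    (0.35, 400, 5),    # another tiny federation
]
rates = [f[0] for f in federations]
gaps = [f[1] for f in federations]
n_female = [f[2] for f in federations]

unweighted = slope(rates, gaps)
weighted = slope(rates, gaps, weights=n_female)
print(f"unweighted slope: {unweighted:.0f}")
print(f"weighted slope:   {weighted:.0f}")
```

With these invented numbers, the two tiny federations pull the unweighted line upward, while weighting by player count lets the large high-participation federation dominate and the sign of the slope flips.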

It is definitely misleading for Howard to highlight Georgia as his prime example of a federation that encourages female participation while he is showing that these countries have a larger gender gap, because Georgia definitely has a smaller than average gender gap. The following line in particular sounds suspicious:

"I also tackled the participation rate hypothesis by replicating a variety of studies with players from Georgia, where women are strongly encouraged to play chess and the female FIDE participation rate is high at over 30%. The overall results were much the same as with the entire FIDE list, but sometimes not quite as pronounced."

This is right after the graph showing that the gender gap goes up as female participation increases, and right after he singled out only Georgia and Vietnam as examples of countries included in that graph. Howard finds that the gender gap is actually lower in Georgia ("sometimes not quite as pronounced"), but he completely downplays this finding and neglects to report any quantitative representation showing how the results were less pronounced. It is no wonder that readers like Nigel Short got completely the wrong impression of Howard's results, as when Short summarized this graph in the following manner:

"Howard debunks this by showing that in countries like Georgia, where female participation is substantially higher than average, the gender gap actually increases – which is, of course, the exact opposite of what one would expect were the participatory hypothesis true."

I found this review of the full paper written by Australian grandmaster David Smerdon. Smerdon's review gives a very different impression of Howard's work than Howard's own Chessbase summary. For example, in reference to the Georgia data and Short's interpretation:

"I don’t know what Short is referring to here, because there is nothing in the Howard article that suggests this. Figure 1 of the study shows that the gender gap is, and has always been, lower in Georgia than in the rest of the world for the subsamples tested (top 10 and top 50). Short may be referring to Figure 2, which, to be fair, probably shouldn’t have been included in the final paper. It looks at the gender gap as the number of games increases, but on the previous page of the article, Howard himself acknowledges that accounting for number of games played supports the participation hypothesis at all levels except the very extreme."

And later, summarizing Howard's research on the gender gap in Georgia:

"...This supports a nurture argument to the gender gap, but again, the sample size is too small for anything definitive to be concluded."

This sounds like it is describing completely different research from Howard's Chessbase article. While Short definitely did not do himself or the gender discussion any favours with his interpretation, neither does Howard do his research justice with his published summary.


The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.


Howard criticizes a 2009 study published by the British Royal Society that found support for the participation hypothesis--that there are fewer elite female chess players simply because there are fewer female chess players overall. The Bilalić, et al study looked at the top 100 rated male and female players in the German federation and compared the distribution of their ratings to an expected distribution based on the overall participation rates by gender. The observed gender gap was close to what was expected from the overall participation rates.
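The participation hypothesis is easy to illustrate with a simulation. The sketch below is not the method Bilalić, et al used; it simply draws two groups of different sizes from the same rating distribution and compares their best players. The group sizes, rating mean, spread, and the function name `average_top_gap` are all arbitrary assumptions chosen for illustration.

```python
import random

def average_top_gap(n_large, n_small, trials=200, mean=1500.0, sd=300.0, seed=1):
    """Average (best of large group - best of small group) when both groups
    are drawn from the same normal rating distribution."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        best_large = max(rng.gauss(mean, sd) for _ in range(n_large))
        best_small = max(rng.gauss(mean, sd) for _ in range(n_small))
        total += best_large - best_small
    return total / trials

# Hypothetical population: 16 players in the large group for each one in the small group.
gap = average_top_gap(n_large=1600, n_small=100)
print(f"average gap between the two top-rated players: {gap:.0f} points")
```

Both groups have identical underlying ability here, yet the larger group's best player comes out ahead on average simply because it has more draws from the tail of the distribution.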

Howard issues three main criticisms of the study:

1) It is too difficult to determine cause and effect from their data.
2) They didn't control for the number of rated games played.
3) The study relies on data from only the German Federation and thus could simply be a sample size fluke.


Howard argues that showing that the gender gap is in line with what we would expect from participation rates is not enough to establish participation rates as the cause of the gender gap. However, Howard himself does no better in establishing a cause/effect relationship between the gender gap and his hypothesis that men are innately more talented at chess.

Howard supports his claim with data showing that the rating difference between the top male and female players has remained relatively constant over the years, which he assumes means the gender gap has not closed (which is probably incorrect). He then assumes that if there were non-biological causes behind the gender gap, the gap must have diminished over the past several decades as feminism has advanced in many developed countries, and if it hasn't then that means there is likely a biological cause.

But he doesn't provide any more support for this assumption than Bilalić, et al do for theirs. Several areas in sports that should be unaffected by the physical differences between males and females, such as coaching, general management, and officiating positions, have seen little to no progress in gender disparity over that same time span in spite of any general advances in society. It is not a given that a lack of significant progress means that gender disparity is due to natural talent.

I think Howard overestimates his evidence of a causal relationship in part because he underestimates the "gatekeeper" effect in chess. In his 2005 paper, he gives this as an important factor in testing his hypothesis:

"Adequately testing the evolutionary psychology view, that the achievement differences at least partly are due to ability differences, requires a domain with very special characteristics. First, it should be a complete meritocracy with no influence of gatekeepers, in which talent of either gender can rise readily."

Howard relies on the assumption that chess is close to a complete meritocracy because most tournaments are open* and results are based on your performance. Howard contrasts this to fields like science, where decision-makers control access to resources and could be susceptible to bias:

"In most domains, gatekeepers control resources needed for high achievement and may run an ‘old boy’s network’ favouring males. In science, for instance, gatekeepers distribute graduate school places, jobs, research grants, and journal and laboratory space."

*Most major tournaments involving the top players are actually not open, but invitational. Most tournaments below the elite level are open, however.

The absence of decision-makers with the ability to deny players access to tournaments does not mean there are no gatekeeper forces at work, however. There are other forces that can have just as strong an effect. WIM Sabrina Chevannes gives some examples of social pressures (under the section "My thoughts on sexism in chess") that commonly make women feel unwelcome or uncomfortable at predominantly male tournaments, ranging from belittling remarks to flat-out harassment.

These problems are driving established female players away from the game, but they can also be important for young players getting into the game. Most grandmasters start chess at a young age, and research backs the idea that starting age is an important factor in chess mastery (full paper), both because starting earlier allows for greater total accumulation of practice, and because chess likely has a "critical period" effect for learning (the same effect that makes it much easier for a young child to learn a language than an adult).

This means that even subtle effects, such as a parent being more likely to teach the game to male children at a young age, or young males being more attracted to the social environment of a predominantly male local club, can have a significant gatekeeper effect. Things like age of exposure to chess, access to high-level coaching and competition, and social compatibility with existing chess culture are all important factors in developing a player's ability.

This is probably why we see strong chess countries like Russia or other former Soviet nations consistently dominating chess, even though they probably don't have any biological ability advantage. The more children who are exposed to favourable learning conditions, the more high-level chess players a population will produce. Just as these factors help keep the strongest federations on top, they could conceivably favour male players over female players.

Chevannes also points out more explicit gatekeeper behaviour, such as limited access to funding and coaching for England's women's Olympiad team ("Effects of sexism in English Chess"). Several countries provide state funding or private grants for chess development, similar to the type of gatekeeper influences Howard describes in science. For example, the USCF has the Samford Chess Fellowship, a private grant currently for $42,000, which has been awarded annually since 1987. Thirty of the 32 recipients (in three years the grant was split between two recipients) have been male.

And, as mentioned earlier, most of the top tournaments are actually invitational, which also fits Howard's criteria for gatekeeper influence. The potential gatekeeper effect of invitational tournaments preserving rating gaps is even something players have complained about: when the top tournaments only hand out invitations to the same group of top-rated players, those players just end up trading rating points among themselves, which leaves little opportunity for them to give rating points back to the rest of the field.
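The reason a closed pool can only trade points follows from the standard Elo update, sketched below. The two ratings and the K-factor of 10 are assumptions for illustration, and the sketch ignores the rounding used in official rating calculations. When both players use the same K, whatever one player gains the other loses, so the pool's total never changes.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def play(rating_a, rating_b, score_a, k=10):
    """Return both players' new ratings after one game.

    score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    Assumes both players share the same K-factor.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Hypothetical invitational pairing: the lower-rated player wins.
before = (2800.0, 2750.0)
after = play(before[0], before[1], score_a=0.0)
print("before:", before, "total:", sum(before))
print("after: ", after, "total:", sum(after))
```

Points only leave the elite pool when one of its members plays, and loses points to, someone outside it.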

These factors are incredibly difficult to measure and separate out from your data, which is why Howard considers the absence of such factors essential to test his hypothesis. By ignoring these factors, Howard strongly inflates his evidence in support of a biological cause. In fact, this is a common criticism of the entire field of evolutionary psychology which Howard uses to approach this question: its hypotheses about cause and effect are so difficult to properly test, it is debatable whether it actually qualifies as science.


As discussed in the previous post, I don't think controlling for the number of rated games played adequately separates out the effect of practice and development from that of natural talent. More importantly, though, Howard's criticism here is confusing because he only describes the importance of controlling for number of games as something that could avoid a potential bias against female players. Females tend to play far fewer rated games on average, and a player's rating tends to increase the more games they play.

In order for this criticism to be relevant to the Bilalić study, omitting this control would have to bias the results in favour of female players. Howard offers no reasoning as to why this would be the case, and it is not at all obvious how it could be. Howard's own data appears to show a decreased gender gap after controlling for number of games.


While Bilalić, et al did only look at players from the German federation, they compared ratings for the top 100 players of each gender. In Howard's original study, he included players from all federations, but still only compared the ratings of the top 10, 50, and 100 players of each gender, so he was not actually using any more data points than the study he is criticizing.

Just as importantly, Bilalić, et al actually had a reason for using data from just one federation rather than FIDE data, as outlined in a later paper by Bilalić, Nemanja Vaci, and Bartosz Gula. FIDE rating data is limited to only above-average players and omits a lot of data from developing or below-average players. Rating data from individual federations can allow for a more comprehensive view of the population, such as a better estimation of overall participation rates, which was necessary for their study.

In Howard's summary article, he refutes the Bilalić study by showing data from more federations, but he doesn't actually repeat their study to create a comparison to their work. Instead, he just shows aggregated data with no indication of how many players were included or how the data was aggregated. It is not clear that he actually used more data points to draw his conclusion than Bilalić, et al used, only that he looked at players from multiple federations.

Howard's most recent study is behind a paywall, so unfortunately all I have to go by is his summary published on Chessbase. I assume there are more details in the full study, but it is impossible to tell how his data really compares with the data from the Bilalić study from what he published in the summary, which is largely written as a refutation of the Bilalić study.
