Math Behind Regression with Changing Talent Levels (THT Article)

In my article "Regression with Changing Talent Levels: the Effects of Variance" on the Hardball Times, I talk about how changes in players' true talent levels from day to day reduce the variance of talent in the population overall over time. In other words, the spread in talent over a 100-game sample will be smaller than the spread in talent over a one-game sample. In the article, I gave the following formula to calculate how much the spread in talent is reduced, which I will further explain here:

*Note: in the THT article, I used d for the number of days instead of n to avoid confusion with another formula that was referenced from a previous article, which used n for something else. For this article, I'm just going to use n for the number of days.

The value given by the formula is the ratio of talent variance over n days to the talent variance for a single day. In other words, the variance in talent drops by a multiplicative factor that is dependent on the length of the sample and the correlation of talent from day to day.

Now, how do we get that formula?

If we only have two days in our sample, it is not too difficult to calculate the drop in talent variance. Let t0 be a variable representing player talent levels on Day 1, and t1 be a variable representing player talent levels on Day 2. We want to find the variance of the average talent levels over both days, or (t0+t1)/2.

The following formula gives us the variance of the sum of two variables:

The covariance is directly proportional to the correlation between the two variables and is defined as follows:

(Note that sdt0sdt1 = vart0 = vart1 because the standard deviation and variance for both variables are the same.)

Before we continue, there is an important thing to note. Because we are trying to derive a formula for a ratio (variance in talent over n days divided by variance in talent over one day), we don't necessarily need to calculate the numerator and denominator of that ratio exactly. As long as we can calculate values that are proportional to those values by the same factor, the ratio will be preserved.

Technically, we want the variance of the value (t0+t1)/2 and not just t0+t1, which would be vart(1+r)/2 instead of 2vart(1+r). However, those two values are proportional, so it doesn't really matter for now which we calculate as long as we can also calculate a value for the denominator that is proportional by the same factor.

For two days, the above calculations are simple enough. Once you start adding more days, however, it starts to get more complicated. Fortunately, the above math can also be expressed with a covariance matrix:

t0 t1
t0 var0 cov0,1
t1 cov0,1 var1

The variance of the sum t0+t1 is equal to the sum of the terms in the covariance matrix, which you can see just gives us the formula: vart0+t1 = vart0 + vart1 + 2covt0,t1. The covariance matrix is convenient because it can be expanded for any number of days:

Covariance matrix between talent n days apart

t0 t1 t2 t3 ... tn-1
t0 var0 cov0,1 cov0,2 cov0,3 ... cov0,n-1
t1 cov0,1 var1 cov1,2 cov1,3 ... cov1,n-1
t2 cov0,2 cov1,2 var2 cov2,3 ... cov2,n-1
t3 cov0,3 cov1,3 cov2,3 var3 ... cov3,n-1
tn-1 cov0,n-1 cov1,n-1 cov2,n-1 cov3,n-1 ... varn-1

We can also construct a correlation matrix. Given that we know the correlation of talent from one day to the next, this isn't that difficult. If the correlation between talent levels on Day 1 and Day 2 is r, and the correlation between talent levels on Day 2 and Day 3 is also r, we can chain those two facts together to find that the correlation between talent levels on Day 1 and Day 3 is r2.

The same logic can be extended for any number of days, so that the correlation between talent levels n days apart is rn:

Correlation matrix between talent n days apart

t0 t1 t2 t3 ... tn-1
t0 r0 r1 r2 r3 ... rn-1
t1 r1 r0 r1 r2 ... rn-2
t2 r2 r1 r0 r1 ... rn-3
t3 r3 r2 r1 r0 ... rn-4
tn-1 rn-1 rn-2 rn-3 rn-4 ... r0

This matrix is more useful than the covariance matrix, because all we need to know to fill in the entire correlation matrix is the value of r. And because correlation is proportional to covariance (covt0,t1 = r · vart0), the sum of the correlation matrix is proportional to the sum of the covariance matrix.

Our next step, then, is to calculate the sum of the correlation matrix. Notice that the terms on each diagonal going from the top left to bottom right are identical:

We can use this pattern to simplify the sum. Since the matrix is symmetrical, we can ignore the terms below the long diagonal and calculate the sum for just the top half of the matrix, and then double it later:

... rn-1 rn-1

There is one r0 term in each column of the matrix, so there are n r0 terms in the sum. Likewise, there are (n-1) r1 terms, (n-2) r2 terms, etc. If we group each diagonal into its own distinct term, we get a sum whose terms follow the pattern (n-1)*ri:

Applying the distributive property and separating the terms of the sum, we get the following:

The first sum is a simple geometric series, which we can calculate using the formula for geometric series:

The second sum is similar, but the additional i factor makes it a bit trickier since it is no longer a geometric series. We can, however, transform it into a geometric series using a trick where we convert this from a single sum to a double sum, where we replace the expression inside the sum with another sum.

The idea is that each term of the series is itself a separate sum which has i terms of ri. This sum can be written as follows:

Notice that we switched to using the index h rather than i. This means there is nothing inside the sum that increments on each successive term, and the i acts as a static value. In other words, this is just adding up the value ri i times, which is of course equal to iri.

In order to visualize how this double sum works, we can write down the terms of the sum in an array with i rows and h columns, where the value corresponding to each pair of (i,h) values is ri. For example, here is what the array would look like with n=4:

h=0 h=1 h=2 h=3
i=1 r1
i=2 r2 r2
i=3 r3 r3 r3

The greyed-out values are included to complete the array, but are not actually part of the sum. If we go through the sum iteratively, we start at i=0, and take the sum of ri from h=0 to h=-1. Since you can't count up from 0 to -1, there are no values to count in this row, which represents the fact that iri = 0 when i=0.

Next, we go to i=1, and fill in the values r1 for k=0 to k=0. The next row, when i=2, we go from h=0 to h=1. And so on.

We are currently taking the sum of each row and then adding those individual sums together. However, we could also start by taking the sum of each column, which would be equivalent to reversing the order of the two sums in our double series:

Note that the inner sum now goes from i=h+1 to i=n-1, which you can see in the columns of the array of terms above.

This is useful because each column of the array is a geometric series, meaning it will be easy to compute. The sum of each column is just the geometric series from i=0 to i=n-1. Then, to eliminate the greyed-out values from the sum, we subtract the geometric series from i=0 to i=h.

This is the value for our inner sum, so we plug that back into the outer sum:

We now have values for both halves of our original sum, so next we combine them to get the full value:

We still have one more step to go to calculate the full sum of the correlation matrix. Recall that when we started, we were working with a symmetrical correlation matrix, and because the matrix was symmetrical along the diameter, we set out to find the sum for only the upper half of the matrix. In order to get the sum of the full matrix, we have to double this value:

Finally, note that the long diagonal of the correlation matrix only occurs once in the matrix, so by doubling our initial sum, we are double-counting that diagonal. In order to correct for this, we need to subtract the sum of that diagonal, which is just n*1 (since each element in that diagonal equals 1):

This value is proportional to the sum of the covariance matrix, which is proportional to the variance of talent in the population over n days.

Next, we need to come up with a corresponding value to represent the variance of talent over a single day. To do this, we can rely on the fact that as long as talent never changes, the variance in talent over any number of days is the same as the variance in talent over a single day. Instead of comparing to the variance in talent over a single day, we can instead compare to the variance in talent over n days when talent is constant from day to day.

This allows us to construct a similar correlation matrix to represent the constant-talent scenario. Compared to the correlation matrix for changing talent, this is trivially simple: since talent levels are the same throughout the sample, the correlation between talent from one day to the next will always be one.

In other words, the correlation matrix will just be an n x n array of 1s. And the sum of an n x n array of 1s is just n^2.

The ratio of these two values will give us the ratio of talent variance after n days of talent changes to the talent variance when talent is constant:

And that is our formula for finding the ratio of variance in true talent over n days to the variance in true talent on a single day, given the value r for the correlation of true talent from one day to the next. With some simplification, the above formula is equivalent to what was posted in the THT article:

Continue Reading...

The Fight

It happened on May 15, 1912.

The once-mighty Detroit Tigers were off to a slow start.  It was to be a long season, their first losing one in six years.  Far from mollifying the pain of defeat, their past success only served to heighten the tension they felt—the old veterans had nearly forgotten what it was to lose, and the youthful among them had not known to begin with.  By contrast, their current situation, while not objectively hopeless, only felt that much more dire.

Needless to say, when the Tigers rolled into New York on their steam locomotive from Boston, where they’d just dropped another two out of three, and cozied up to Hilltop Park, they were a cohort on edge.

Hilltop Park, as it happened, seemed at first the perfect destination for such a group of men.  The Highlanders, not yet the storied franchise they would later become, were one of the few teams in the American League still worse than they were, and their boys were ripe for the beating.  Over the next three days, Detroit began to feel their season reforming beneath their cleats.  They took two of the first three and were nearly back to .500.  Once-shattered men began again to believe.

And so they took the field for the fourth and final game of the series.  Things began inauspiciously, with the teams trading blows for the first two innings and Detroit emerging from the proverbial fracas with a one-run lead.  As it were, such acts of violence were not to remain figurative.

Detroit’s star centerfielder, Tyrus Raymond Cobb, was so known for his gentle disposition that his teammates, half-mockingly but not without a hint of affection, referred to him as “the Georgia Peach”.  However, as Detroit’s standout performer, it was Cobb who found himself the target of the local malcontents who had made it their duty to suffer Highlander seasons firsthand.

Loudest among these was one Claude Lueker, a man whose brazenness had been honed in the fiery confines of Tammany Hall, and he spoke in ways of which only a man entrenched in politics could even conceive.  Such foul narratives poured from his mouth as would turn an oak tree barren just from the stench of their connotations.

For four innings this continued.  Cobb tried to escape the abuse by staying in centerfield for both turns at bat, sitting quietly against the outfield scoreboard and only speaking up to help direct the New York outfielders to avoid collisions.  However, Cobb was accustomed to reading between innings, and had in fact been looking forward to the New York trip where the country’s leading literary critics resided and published, and had that very day picked up a new analysis of MacBeth from just such a scholar before the game.  Only Cobb had left his reading glasses in the dugout, and was unable to study his text from the outfield.

And so, after four innings of careful isolation, Cobb finally felt it safe to brave the trek back to the dugout to retrieve his spectacles.  He knew at once he had been mistaken.  The heckler was on him again, this time saying things Cobb was certain could turn even the most ardent of free speech advocates into anti-seditionists.

Once in the dugout, Cobb was immediately accosted for his inaction.

“Dammit, Cobb!” cried Sam Crawford.  “This has gone on long enough!  There are children here, for crying out loud!”

Ed Willett soon chimed in.  “You can escape this nonsense out there in centerfield, but I’ve got to stand on the mound and listen to it!  You think Donie Bush would let this kind of thing go?  Sometimes I wish he were our future Hall of Famer.”

Cobb protested.  “Look, I’m sorry you all have to put up with this, but there’s nothing we can do.  We’ll be out of New York tomorrow, and we can put the whole thing behind us then.”

Wanting nothing more than to go back to the outfield where the fans were much more docile and many were willing to debate the merits of Mark Twain’s lesser novels (which was one of Cobb’s pet subjects), Cobb hoped he could leave it at that.  It was at this moment that an insult so offensive crept over the lip of the dugout and into the ears of the Detroit men that there was no longer anything Cobb could do for the hurler.

Hughie Jennings walked over and put his arm on Cobb’s shoulder.  “Look, son, I know you don’t like this any more than the rest of us.  Probably less than the rest of us.  But you’ve got to do something to shut that man up.”  Jennings' eyes glowed with a warm fierceness Cobb knew from experience he could not allay.  With a final pat on Cobb's shoulder, Jennings bored into him with those eyes and tried to reassure him:  “We’ll have your back.”  Cobb turned reluctantly toward the dugout steps.

After a tentative step into the stands, Cobb quickly retreated.  Jennings began to protest, but Cobb cut him off.  “Look, I know what you’re going to say, but the man is an invalid!  He’s got no hands!”

“I don’t care if he doesn’t have any feet!” Jennings bellowed.  “What must be done will be done, if not by you then by someone else!”

From the corner of his eye, Cobb saw Bill Burns reaching for his lumber.  Burns had long since washed out as an effective pitcher and had never been able to hit a lick, but he remained a towering hulk of a man, and Cobb knew it would not end pleasantly were he commissioned for the task.  So, even more reluctantly than before, Cobb slunk back up the dugout steps and into the stands, trailed behind by his fellow Tigers.

“Look,” Cobb said as he approached the man, “I wish you wouldn’t create such a ruckus, but also know that I haven’t any ill intent toward you.”  With that, Cobb raised his fist half-heartedly, when suddenly the man heaved his entire weight in the direction of Cobb.  Like two anteaters on the savanna they tumbled.  Cobb’s teammates jumped at the sight, storming into the stands with bats in hand.  Mayhem was upon the lower grandstand like flies on a heap of corpses and was not to be driven away.

At this point, the Highlanders, who had been surveying the local architecture beyond left field using Hal Chase’s new engineering sextant, heard the commotion and were made aware of the delay in the game.  They rushed to the aid of their fellow professionals, leaping unaware into the middle of the fray.  For the next forty-five minutes, fans and players were at each other in a most uncivilized manner before the umpires managed to get through to the telegraph office in the press box to wire the police.

By the time it was over, more than two dozen fans were injured, and several players received stern warnings for their behavior.  Ban Johnson, who happened to be in attendance and witnessed the second half of the brawl after returning from the concession stand, suspended the entire Detroit roster, and they had to play three days later against Philadelphia with a replacement nine.

And that, to this day, remains without a doubt the greatest fight in baseball history.
Continue Reading...

Baseball is Dying (1892 version)

At least that seems to be the opinion of Pittsburgh Dispatch sports editor John D. Pringle in his weekly "A Review of Sports" column:
If there were ever any doubts concerning the waning interest in baseball, the meeting of the magnates at Chicago during the past week must have dispelled them. The gathering was more like the meeting together of a lot of men to sing a funeral dirge than anything else. The proceedings were doleful despite the efforts of the magnates to wear smiles. Most certainly this annual meeting was far below par in enthusiasm with those of former years.
To be sure, those persons who court notoriety by always wanting rules changed and tinkered were at the meeting. There was no millenium plan this time; it is an exploded bladder now, but there was the new diamond notion and a few other things just as silly and just as characteristic of liquid intellects as the Utopian "plan." Of course all the venders of quack remedies pointed out that "something must be done to revive an interest in baseball." Ah! You see they admit the game's popularity is waning. Happily no changes were decided on.

Even more pessimistic was the Kansas City Times, which apparently wrote:
BASEBALL has apparently served its day and its days seem near an end. Perhaps there may be a renaissance. But the ball players have come to the end of their string; they can play very little better; there is no more progress to be made. The people have seen it all. They are tired of reviewing it.

By the way, this is the "new diamond notion" Pringle refers to:

As you can see, the proposal was to add a fifth base, with the middle bases positioned roughly where the infielders actually play. The basis for the proposal was twofold: One, it would increase the amount of fair territory by widening the angle between the first and third baselines, resulting in more base hits and fewer foul balls. Two, it would shorten the distance between stealable bases to 70 feet (along with the distance the catcher would have to throw the ball), leading to a more active running game.

By keeping the distance to first and to home the same, proponents hoped to minimize the impact on infield hits and scoring plays. By adding an extra base station and increasing the total distance around the bases, the extra action of more base hits and base stealing would not necessarily lead to a huge increase in scoring.

Continue Reading...


The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.


Finally, I think one of the biggest issues is that Howard may have misrepresented his research in the article. Since the full paper is behind a paywall, I don't know for sure or to what extent, but there are certainly indications that the article overstates Howard's conclusions.

One is the following graph, which is one of the few pieces of data Howard shares from his research:

The graph purportedly refutes the participation hypothesis by showing that the rating gap between males and females increases as the female participation rate increases. This supports Howard's alternative hypothesis that the most talented females are already playing no matter how low the overall female participation rate is, and that increasing the participation rate only adds less talented players and can never catch females up to males.

A few things jump out about this graph, though. First, the data on federations between 5-10% and 15-25% is completely missing from the graph, with the three remaining points forming a neat line with a clear slope. I have no idea if this was deliberate, but it is at least strange.

More importantly, Howard doesn't explain anywhere in his summary how the data is aggregated, how many players are included in each group, what countries are included in each group, how any individual federations rated, or why this particular graph was chosen out of the various studies or various number-of-games controls Howard seems to have run.

Howard singles out only Vietnam and Georgia as countries with high female participation in the text of the article. Except when I downloaded the April, 2015 rating list, the difference between the average male rating and the average female rating in Vietnam (94 points) was significantly lower than the difference worldwide (153 points). And Georgia (35 points) had one of the smallest gender rating gaps in the world. I don't have data on the number of games played to check what happens when you include that control, but as I wrote in the previous post, I am skeptical that that could possibly cause the rating gap for Georgia or Vietnam to suddenly jump above average.

What countries with high (25+%) female participation rate among FIDE-rated players had higher than average gender gaps? Ethiopia had a massive gap, with the average male rated 621 points higher than the average female. But there are only 30 Ethiopian players on the list, with just 9 females. Most of the other countries with a high percentage of females on the rating list that had above-average rating gaps also had very few players.

Now, I don't think it is Ethiopia that is throwing off Howard's chart, because I don't think any of the female players from Ethiopa have played enough FIDE-rated games to qualify for Howard's cutoff, but I wonder if Howard's graph is simply weighting all federations equally when he aggregates the data. If I try to recreate something like Howard's chart with the April, 2015 rating data without any control for games played, then I do get a positive slope if I just take the simple average of each federation's rating gap. If I instead weight each federation's rating gap by the number of female players, so that, for example, Georgia with its hundreds of rated players gets more weight in the aggregate than Ethiopia with its 30, then I get a negative slope:

So it could be that Howard's graph is aggregating the data in a misleading way. I don't know for sure, but his results look a lot more like what I get when I aggregate the data in a misleading way. It is also possible that setting a control for players at 350 rated games played left relatively few players, and that after further splitting up the data into separate federations like this, there are simply not enough data points to get reliable results.

It is definitely misleading for Howard to highlight Georgia as his prime example of a federation that encourages female participation while he is showing that these countries have a larger gender gap, because Georgia definitely has a smaller than average gender gap. The following line in particular sounds suspicious:

"I also tackled the participation rate hypothesis by replicating a variety of studies with players from Georgia, where women are strongly encouraged to play chess and the female FIDE participation rate is high at over 30%. The overall results were much the same as with the entire FIDE list, but sometimes not quite as pronounced."

This is right after the graph showing that the gender gap goes up as female participation increases, and right after he singled out only Georgia and Vietnam as examples of countries included in that graph. Howard finds that the gender gap is actually lower in Georgia ("sometimes not quite as pronounced"), but he completely downplays this finding and neglects to report any quantitative representation showing how the results were less pronounced. It is no wonder that readers like Nigel Short got completely the wrong impression of Howard's results, as when Short summarized this graph in the following manner:

"Howard debunks this by showing that in countries like Georgia, where female participation is substantially higher than average, the gender gap actually increases – which is, of course, the exact opposite of what one would expect were the participatory hypothesis true."

I found this review of the full paper written by Australian grandmaster David Smerdon. Smerdon's review gives a very different impression of Howard's work than Howard's own Chessbase summary. For example, in reference to the Georgia data and Short's interpretation:

"I don’t know what Short is referring to here, because there is nothing in the Howard article that suggests this. Figure 1 of the study shows that the gender gap is, and has always been, lower in Georgia than in the rest of the world for the subsamples tested (top 10 and top 50). Short may be referring to Figure 2, which, to be fair, probably shouldn’t have been included in the final paper. It looks at the gender gap as the number of games increases, but on the previous page of the article, Howard himself acknowledges that accounting for number of games played supports the participation hypothesis at all levels except the very extreme."

And later, summarizing Howard's research on the gender gap in Georgia:

"...This supports a nurture argument to the gender gap, but again, the sample size is too small for anything definitive to be concluded."

This sounds like it is describing completely different research from Howard's Chessbase article. While Short definitely did not do himself or the gender discussion any favours with his interpretation, neither does Howard do his research justice with his published summary.
Continue Reading...


The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.


Howard criticizes a 2009 study published by the British Royal Society that found support for the participation hypothesis--that there are fewer elite female chess players simply because there are fewer female chess players overall. The Bilalić, et al study looked at the top 100 rated male and female players in the German federation and compared the distribution of their ratings to an expected distribution based on the overall participation rates by gender. The observed gender gap was close to what was expected from the overall participation rates.

Howard issues three main criticisms of the study:

1) It is too difficult to determine cause and effect from their data.
2) They didn't control for the number of rated games played.
3) The study relies on data from only the German Federation and thus could simply be a sample size fluke.


Howard argues that showing that the gender gap is in line with what we would expect from participation rates is not enough to establish participation rates as the cause of the gender gap. However, Howard himself does no better in establishing a cause/effect relationship between the gender gap and his hypothesis that men are innately more talented at chess.

Howard supports his claim with data showing that the rating difference between the top male and female players has remained relatively constant over the years, which he assumes means the gender gap has not closed (which is probably incorrect). He then assumes that if there were non-biological causes behind the gender gap, the gap must have diminished over the past several decades as feminism has advanced in many developed countries, and if it hasn't then that means there is likely a biological cause.

But he doesn't provide any more support for this assumption than Bilalić, et al do for theirs. Several areas in sports that should be unaffected by the physical differences between males and females, such as coaching, general management, and officiating positions, have seen little to no progress in gender disparity over that same time span in spite of any general advances in society. It is not a given that a lack of significant progress means that gender disparity is due to natural talent.

I think Howard overestimates his evidence of a causal relationship in part because underestimates the "gatekeeper" effect in chess. In his 2005 paper, he gives this as an important factor in testing his hypothesis:

"Adequately testing the evolutionary psychology view, that the achievement differences at least partly are due to ability differences, requires a domain with very special characteristics. First, it should be a complete meritocracy with no influence of gatekeepers, in which talent of either gender can rise readily."

Howard relies on the assumption that chess is close to a complete meritocracy because most tournaments are open* and results are based on your performance. Howard contrasts this to fields like science, where decision-makers control access to resources and could be susceptible to bias:

"In most domains, gatekeepers control resources needed for high achievement and may run an ‘old boy’s network’ favouring males. In science, for instance, gatekeepers distribute graduate school places, jobs, research grants, and journal and laboratory space."

*Most major tournaments involving the top players are actually not open, but invitational. Most tournaments below the elite level are open, however.

The absence of decision-makers with the ability to deny players access to tournaments does not mean there are no gatekeeper forces at work, however. There are other forces that can have just as strong an effect. WIM Sabrina Chevannes gives some examples of social pressures (under the section "My thoughts on sexism in chess") that commonly make women feel unwelcome or uncomfortable at predominantly male tournaments, ranging from belittling remarks to flat-out harassment.

These problems are driving established female players away from the game, but they can also be important for young players getting into the game. Most grandmasters start chess at a young age, and research backs the idea that starting age is an important factor in chess mastery (full paper)), both because starting earlier allows for greater total accumulation of practice, and because chess likely has a "critical period" effect for learning (the same effect that makes it much easier for a young child to learn a language than an adult).

This means that even subtle effects, such as a parent being more likely to teach the game to male children at a young age, or young males being more attracted to the social environment of a predominantly male local club, can have a significant gatekeeper effect. Things like age of exposure to chess, access to high-level coaching and competition, and social compatibility with existing chess culture are all important factors in developing a player's ability.

This is probably why we see strong chess countries like Russia or other former Soviet nations consistently dominating chess, even though they probably don't have any biological ability advantage. The more children who are exposed to favourable learning criteria, the more high-level chess players a population will produce. Just like these factors help keep the strongest federations on top, they could conceivably favour male players over female players.

Chevannes also points out more explicit gatekeeper behaviour, such as limited access to funding and coaching for England's womens Olympiad team ("Effects of sexism in English Chess"). Several countries provide state-funding or private grants for chess development, similar to the type of gatekeeper influences Howard describes in science. For example, the USCF has the Samford Chess Fellowship, a private grant currently for $42,000, which has been awarded annually since 1987. Thirty of the 32 recipients (three years the grant was split between two recipients) have been male.

And, as mentioned earlier, most of the top tournaments are actually invitational, which also fits Howard's criteria for gatekeeper influence. The potential gatekeeper effect of invitational tournaments preserving rating gaps is even something players have complained about: when the top tournaments only hand out invitations to the same group of top-rated players, those players just end up trade rating points among themselves, which leaves little opportunity for them to give rating points back to the rest of the field.

These factors are incredibly difficult to measure and separate out from your data, which is why Howard considers the absence of such factors essential to test his hypothesis. By ignoring these factors, Howard strongly inflates his evidence in support of a biological cause. In fact, this is a common criticism of the entire field of evolutionary psychology which Howard uses to approach this question: its hypotheses about cause and effect are so difficult to properly test, it is debatable whether it actually qualifies as science.


As discussed in the previous post, I don't think controlling for the number of rated games played adequately separates out the effect of practice and development from that of natural talent. More importantly, though, Howard's criticism here is confusing because he only describes the importance of controlling for number games as something that could avoid a potential bias against female players. Females tend to play far fewer rated games on average, and a player's rating tends to increase the more games they play.

In order for this criticism to be relevant to the Bilalić study, omitting this control would have to bias the results in favour of female players. Howard offers no reasoning as to why this would be the case, and it is not at all obvious how it could be. Howard's own data appears to show a decreased gender gap after controlling for number of games.


While Bilalić, et al did only look at players from the German federation, they compared ratings for the top 100 players of each gender. In Howard's original study, he included players from all federations, but still only compared the ratings of the top 10, 50, and 100 players of each gender, so he was not actually using any more data points than the study he is criticizing.

Just as importantly, Bilalić, et al actually had a reason for using data from just one federation rather than FIDE data, as outlined in a later paper by Bilalić, Nemanja Vaci, and Bartosz Gula. FIDE rating data is limited to only above-average players and omits a lot of data from developing or below-average players. Rating data from individual federations can allow for a more comprehensive view of the population, such as a better estimation of overall participation rates, which was necessary for their study.

In Howard's summary article, he refutes the Bilalić study by showing data from more federations, but he doesn't actually repeat their study to create a comparison to their work. Instead, he just shows aggregated data with no indication of how many players were included or how the data was aggregated. It is not clear that he actually used more data points to draw his conclusion than Bilalić, et al used, only that he looked at players from multiple federations.

Howard's most recent study is behind a paywall, so unfortunately all I have to go by is his summary published on I assume there are more details in the full study, but it is impossible to tell how his data really compares with the data from the Bilalić study from what he published in the summary, which is largely written as a refutation to the Bilalić study.

Continue Reading...

Gender in Chess PART 2: ELO RATINGS

The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.



We saw previously that the lack of significant change in the rating gap between the top male and female players can actually be evidence that the gender gap in chess has diminished over the past few decades. That is not the only potential interpretation problem with Howard's conclusion, though. Even if control groups hadn't indicated that the Elo gap should be increasing absent any closing of the gender gap, it could still be conceivable that the gender gap has in fact diminished.

This is because Elo ratings are not indicators of absolute playing strength, only of strength relative to the field of rated players. In other words, a 2500 Elo rating among one group of players is not necessarily equivalent to a 2500 rating among another group of players. For example, Hikaru Nakamura's FIDE rating for April, 2015 is 2798. His USCF rating is 2881. Both are Elo ratings, but because they are tracked among different pools of players, they don't have to match up even though they are both describing the strength of the exact same player.

Howard is looking at the same FIDE rating for both male and female players, though, so this shouldn't be a problem, right? Possibly, but we don't know for sure.

Elo ratings work by taking points away from one player and giving them to the other each time a game is played. If a player has a true playing strength of 2500 but is rated at 2400, then they would be expected to take points from their opponents until their rating matches their playing strength. Likewise, a player who is overrated will give points back to the field until their rating returns to their ability.

Many top female players play predominantly or exclusively in womens events. And some of the top female players who play in open events, such as Judit and Susan Polgar when they were still active, rarely play womens events at all, and as a result rarely play against other women. Because of this, if males or females are over- or underrated as a group, there might not be enough games between the two groups to transfer the necessary rating points to bring them back in line. It is possible that female players and male players form two sufficiently isolated player pools that their ratings are not necessarily comparable.

This might sound far-fetched, but it is actually a known problem and has occurred before. In 1987, FIDE commissioned a study comparing the performance of top female players against men to their performance against other women because of this exact issue. The six women who had played a sufficient number of games against both genders over the mid-80s to qualify for the study all held significantly higher performance ratings against male opponents than female opponents--on average more than 100 points higher.

This suggested that, for example, a 2400 rated female player was likely stronger than a 2400 male player. To compensate, FIDE added 100 points to all rated female players (except Susan Polgar*) in order to bring their ratings in line with the male ratings. It is possible that the two pools of players have remained isolated enough to drift out of sync again over the last few decades, however.

*The reasoning was that Polgar already played mostly within the male pool of players and didn't need the adjustment. However, the decision to give the full 100 points to her top rivals, who also played a significant number of games against men, and 0 points to Polgar was nonsensical and controversial, and there were accusations that FIDE was deliberately manipulating the ratings to place Maya Chiburdanidze in the #1 spot ahead of Polgar.

Most people who follow chess believe that some form of inflation exists in the ratings . In other words, they believe a 2800 rating is not as strong now, when there are a handful of players hovering around that level, as it was when Garry Kasparov first achieved it back in 1990 and Anatoly Karpov was the only other player over 2700, or in 1972 when Bobby Fischer topped the ratings list by over 100 points at 2785.

The mechanism of inflation is not well understood, however, and it is not clear that it would necessarily have had the same effect on a fairly isolated pool of female players as on the population as a whole. It could be that after nearly 30 years, male ratings have inflated faster than female ratings, and we have once again reached a point where female players as a whole are underrated.

Howard himself notes another potential interpretation problem with using FIDE ratings to measure the gender gap in chess:

"I found that women typically play many fewer FIDE-rated games than males, only about one third of the number on average. Now, the usual learning curve for chess players is a progressive ascent to a peak at around 750 FIDE-rated games. ... Comparing modestly- and highly-practiced individuals can be misleading. Studies should control for differences in number of games played, either by equating males and females on this or by examining differences at the typical rating peak at around 750 games."

Howard then dismisses this explanation because even after controlling for the number of rated games played, males still had higher ratings.

The number of FIDE-rated games played itself isn't really what we care about, though. It's just a proxy for "modestly-practiced", "highly-practiced", etc. Players who have played more games should, in general, have more experience and further development. Games played are far from a perfect indicator of a player's level of development or experience, however.

Most obviously, not all games are FIDE-rated. While top-rated players do for the most part compete exclusively in FIDE-rated events, that is not true for developing players. For example, U.S. prodigy Sam Sevian has played 539 FIDE-rated games as of April, 2015. He's played 922 USCF-rated games. Even ignoring casual and club games, that is hundreds of competitive games that are not in Howard's data (and it is more than the difference between 922 and 539, because not all FIDE-rated games are USCF-rated).

The amount of study devoted to chess outside of rated games is also a huge factor in development. Someone who is devoted to studying chess full-time will develop much more than someone who competes as a casual hobby, even if you control for the number of rated games played. Likewise, someone who competes fairly regularly and reaches 750 games in their 20s is different from someone who competes less frequently and reaches 750 games in their 40s or 50s (or even later). The former is probably much more likely to still be ascending and hitting their peak at that point, while the latter likely peaked or plateaued at a much lower number of games, and would probably have begun declining with age by the time they reached 750 games.

It's easy to see how two players can be at vastly different stages of development even after the same number of games played. Howard isn't comparing two individual players, though--he is comparing two groups of players (male and female). As long as you look at enough players in each group, shouldn't those other factors start to even out?

Ideally, they should. If there is a bias that applies to the group as a whole, though, that won't happen. For example, if female players tend to begin playing FIDE-events at an earlier stage in their development, or if they tend to compete less frequently than male competitors, that would introduce a bias that won't even out.

Many female players compete predominantly in female-only events, which are less frequent than open-gender events. And because these female-only events draw from a much smaller segment of the chess-playing population than the open-gender events, they also tend to be less intimidating for less-experienced players to enter. So there is a good chance this bias does exist.

In fact, Howard's data supports this. His 2005 paper includes a table summarizing males and females who entered the rating list between 1985 and 1989, and shows that the median age at which females first appeared on the list was about five years younger than the median age for males (though for top 100 females, it was only about 6 months younger than for top 100 males). And, in spite of entering the rating list at a younger age, the females on average still played significantly fewer games in their competitive careers.

Howard hits on an important idea about a player's rating being reflective of both their innate abilities and their level of practice and development. In order to test for the effects of innate abilities alone, as Howard sets out to do, he realizes that he needs to strip out the effects of development. However, this is a much more complicated issue than Howard acknowledges, and simply controlling for the number of rated games played is not adequate to make the assumption that any remaining differences must reflect natural ability.

Continue Reading...


The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.


Howard begins by revisiting a 2005 paper he published on the same topic showing the gap between the average Elo rating of the top 50 male players and the top 50 female players:

Howard then argues that because the Elo gap has remained relatively constant in spite of societal changes over that time period, the difference between the male and female ratings is not due to societal factors and is at least partially biologically-based.

This finding is likely surprising to most people in chess. For example, the legendary Garry Kasparov, who early in his career expressed a somewhat Fischer-esque dismissal of female chess talent, grew to greatly respect the Polgar sisters (one of whom has defeated Kasparov himself) and felt they broke new ground for female players. In a recent interview at an exhibition match with Short in St. Louis, Kasparov rejected the claim that the gender gap has not closed. Even Short himself wrote that he had assumed the gap had closed somewhat before reading Howard's article.

Howard acknowledges this prior expectation in his 2005 paper:

"Anecdotally at least, there has been some convergence in chess at top levels. For example, there are more female grandmasters. Judit Polgar, born in 1976 and the strongest-ever female player, regularly wins tournaments against top male competition and several times has made the top ten players list. She once held the record for youngest-ever grandmaster. But, the extent of gender differences and their trends over time have never been quantified."

After quantifying the Elo difference, though, Howard simply assumes that the difference remaining flat means there has been no closing of the gender gap. This might seem like an reasonable assumption, but it lacks an important step: he has no control group to help interpret his results.


Computers have revolutionized how chess is played and studied at the top level. With the help of computer engines that are much stronger than any human grandmaster, known opening lines are constantly being analyzed more and more thoroughly. The more thoroughly these lines are known, the more important it is for players to memorize them, and the deeper they have to look for new ideas that could lead to a winning position. Strong grandmasters spend most of their time studying and developing these lines.

Former World Champion Vladimir Kramnik (born 1975, reached grandmaster 1991) said at his most recent tournament that top players have to work much harder now than when his career was starting. However, only the very top players can support themselves studying chess and competing full time. Most grandmasters, let alone lower titled or untitled players, don't have the time to keep up with all of these advances.

It is possible that this has led to the top players distancing themselves from the field. If that is the case, then, absent any closing of the gender gap, we would expect the Elo gap between the top 50 males and the top 50 females to have grown over time, just because there are more males in the group at the very top that is pulling away from everyone else. We need some kind of control group to compare to in order to help us interpret Howard's graph before we conclude the gender gap has not closed.

One way to do this is to compare breakdowns other than the top 50 females vs. the top 50 males. For example, what if we take the top 50 Russian players, and compare them to the top 50 non-Russian players?

The top 50 Russians in the April 2015 FIDE rating list have an average Elo rating of 2659. The top 50 players from outside Russia are at 2726. So the top 50 Russian players are 67 points below the non-Russians.

If we go back to 1991 (the first year the Soviet federations were listed separately--it would be impossible to make comparisons before that because the USSR included many strong players from outside Russia), the top 50 Russians were 54 points behind the top non-Russians. So the gap has grown a bit in the last couple decades, in spite of the fact that Russia remains by far the top federation.

Of course, you might be able to make a case that Russia is a bit weaker than it was in the early 90s when Kasparov and Karpov were still dominating chess. Except here's the thing: when we compare Russia to the rest of the world, Russia has lost ground. But if we instead compare Russia to each individual federation, they have actually gained ground over most of them. This seems paradoxical, but it makes sense if the top end of the spectrum is stretching itself out.

Let's take a look at some of these other countries.

The U.S. is experiencing something of a golden age for chess right now. They currently have two of the top ten players in the world. Hikaru Nakamura, the best American player since Bobby Fischer*, has been as high as #2 in the world in the live rankings this year, and recently became the first American to hit 2800 Elo. Increased funding and efforts in development programs have produced some remarkable young talent, including Sam Sevian, who in 2014 became the sixth-youngest grandmaster ever at 13 years old.

*at least not counting Fabiano Caruana, who has spent most of the last year as the #2 player in the world--Caruana was born in the U.S. but moved to Europe at age 13 and has represented Italy for his professional career

The emergence of serious collegiate chess teams has also attracted strong talent from around the world to the U.S. For example, five of the twelve competitors in the open-gender division of the 2015 U.S. National Championship (and at least that many from the Womens division) had originally competed under a different national federation before transferring to the USCF, including world #7 Wesley So. Likely influenced by the emergence of American chess, the aforementioned Caruana recently announced that he is transferring back to the USCF.

You would be hard pressed to argue that the U.S. federation is weaker now than in 1991, and certainly not much weaker. Yet in 1991, the top 50 American players were 105 points behind the top 50 non-Americans. Now, they're 185 points back.

What about Norway, the home of current World Champion and clear #1 Magnus Carlsen? Carlsen has sparked a chess craze in Norway, where tournaments now get national TV coverage. Norway hosts one of the top chess tournaments in the world (Norway Chess) and last year hosted the Chess Olympiad. The number of Norwegians in FIDE's published rating list grew from 92 in 1991 to 1306 this year.

The gap between the top 50 Norwegian players and the rest of the world grew from 289 points in 1991 to 337 points in 2015.

Not all federations saw their gap increase. China, for example, has without a doubt become much stronger in chess since 1991. Chess has had difficulty catching on in China due to the prevalence of xianqi, China's native chess variant, and go, another popular strategy game. Chess was even outlawed for a period in the 1960s and '70s as part of Chairman Mao's Cultural Revolution. Starting in the 1970s, however, China began pouring an increasing amount of funding and effort into growing its chess program.

This has ramped up in recent years, and China has finally emerged as a world chess power. Their women's team has won gold in four of the nine Chess Olympiads held since 1998 and three of the five World Team Championships since a women's division was created in 2007. The open-gender team won gold in the 2014 Olympiad and the 2015 World Team Championships. Their top 50 went from 329 points back of the world in 1991 to 207 points back in 2015.

Still, the vast majority of federations saw increases. Here are the Elo gaps for each of the 38 federations that had at least 50 FIDE-rated players in both 1991 and 2015:

Only 5 of the 38 federations closed the Elo gap at all, and on average the gap grew by 54 points.

When we look at the individual federations as control groups, we see evidence that the top really is separating itself further away from the field as time goes on. In spite of that, Howard's graph shows that women actually closed the Elo gap by a small amount. This can be interpreted as evidence that the gender gap is in fact closing, because it is offsetting the effect we are seeing with the national federations.

It is tempting to see evidence that supports your hypothesis in a vacuum, such as the relatively constant Elo gap between male and female players over the years, and to stop there. It is also tempting to believe a variable you believe to be objective and unbiased, such as Elo ratings, is self-explanatory and needs no control group to interpret. However, this is a dangerous practice. Especially when your results run counter to what subject matter experts would expect, as this finding did, it is important to make sure you have the proper context to interpret your results before jumping to conclusions.

Continue Reading...