Baseball is Dying (1892 version)

At least that seems to be the opinion of Pittsburgh Dispatch sports editor John D. Pringle in his weekly "A Review of Sports" column:
If there were ever any doubts concerning the waning interest in baseball, the meeting of the magnates at Chicago during the past week must have dispelled them. The gathering was more like the meeting together of a lot of men to sing a funeral dirge than anything else. The proceedings were doleful despite the efforts of the magnates to wear smiles. Most certainly this annual meeting was far below par in enthusiasm with those of former years.
To be sure, those persons who court notoriety by always wanting rules changed and tinkered were at the meeting. There was no millenium plan this time; it is an exploded bladder now, but there was the new diamond notion and a few other things just as silly and just as characteristic of liquid intellects as the Utopian "plan." Of course all the venders of quack remedies pointed out that "something must be done to revive an interest in baseball." Ah! You see they admit the game's popularity is waning. Happily no changes were decided on.

Even more pessimistic was the Kansas City Times, which apparently wrote:
BASEBALL has apparently served its day and its days seem near an end. Perhaps there may be a renaissance. But the ball players have come to the end of their string; they can play very little better; there is no more progress to be made. The people have seen it all. They are tired of reviewing it.

By the way, this is the "new diamond notion" Pringle refers to:

As you can see, the proposal was to add a fifth base, with the middle bases positioned roughly where the infielders actually play. The basis for the proposal was twofold: One, it would increase the amount of fair territory by widening the angle between the first and third baselines, resulting in more base hits and fewer foul balls. Two, it would shorten the distance between stealable bases to 70 feet (along with the distance the catcher would have to throw the ball), leading to a more active running game.

By keeping the distance to first and to home the same, proponents hoped to minimize the impact on infield hits and scoring plays. By adding an extra base station and increasing the total distance around the bases, the extra action of more base hits and base stealing would not necessarily lead to a huge increase in scoring.



The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.


Finally, I think one of the biggest issues is that Howard may have misrepresented his research in the article. Since the full paper is behind a paywall, I don't know for sure or to what extent, but there are certainly indications that the article overstates Howard's conclusions.

One is the following graph, which is one of the few pieces of data Howard shares from his research:

The graph purportedly refutes the participation hypothesis by showing that the rating gap between males and females increases as the female participation rate increases. This supports Howard's alternative hypothesis that the most talented females are already playing no matter how low the overall female participation rate is, and that increasing the participation rate only adds less talented players and can never catch females up to males.

A few things jump out about this graph, though. First, the data on federations in the 5-10% and 15-25% participation ranges is completely missing from the graph, with the three remaining points forming a neat line with a clear slope. I have no idea if this was deliberate, but it is at least strange.

More importantly, Howard doesn't explain anywhere in his summary how the data is aggregated, how many players are included in each group, what countries are included in each group, how any individual federations rated, or why this particular graph was chosen out of the various studies or various number-of-games controls Howard seems to have run.

Howard singles out only Vietnam and Georgia as countries with high female participation in the text of the article. Except when I downloaded the April, 2015 rating list, the difference between the average male rating and the average female rating in Vietnam (94 points) was significantly lower than the difference worldwide (153 points). And Georgia (35 points) had one of the smallest gender rating gaps in the world. I don't have data on the number of games played to check what happens when you include that control, but as I wrote in the previous post, I am skeptical that that could possibly cause the rating gap for Georgia or Vietnam to suddenly jump above average.

What countries with high (25+%) female participation rate among FIDE-rated players had higher than average gender gaps? Ethiopia had a massive gap, with the average male rated 621 points higher than the average female. But there are only 30 Ethiopian players on the list, with just 9 females. Most of the other countries with a high percentage of females on the rating list that had above-average rating gaps also had very few players.

Now, I don't think it is Ethiopia that is throwing off Howard's chart, because I don't think any of the female players from Ethiopia have played enough FIDE-rated games to qualify for Howard's cutoff, but I wonder if Howard's graph is simply weighting all federations equally when he aggregates the data. If I try to recreate something like Howard's chart with the April, 2015 rating data without any control for games played, then I do get a positive slope if I just take the simple average of each federation's rating gap. If I instead weight each federation's rating gap by the number of female players, so that, for example, Georgia with its hundreds of rated players gets more weight in the aggregate than Ethiopia with its 30, then I get a negative slope:

So it could be that Howard's graph is aggregating the data in a misleading way. I don't know for sure, but his results look a lot more like what I get when I aggregate the data in a misleading way. It is also possible that setting a control for players at 350 rated games played left relatively few players, and that after further splitting up the data into separate federations like this, there are simply not enough data points to get reliable results.
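To make the difference between the two aggregation methods concrete, here is a minimal Python sketch for a single participation bracket. The federation names, player counts, and gaps are invented for illustration only; they are not taken from the FIDE list or from Howard's data, and I don't know which method Howard actually used.

```python
# Hypothetical per-federation figures: (name, female players, rating gap in points).
# These numbers are invented for illustration and are NOT from the FIDE rating list.
federations = [
    ("Large A", 400, 40),
    ("Large B", 300, 90),
    ("Small C", 9, 600),
    ("Small D", 5, 450),
]


def simple_mean_gap(feds):
    """Average the gaps with every federation counting equally."""
    return sum(gap for _, _, gap in feds) / len(feds)


def weighted_mean_gap(feds):
    """Average the gaps, weighting each federation by its number of female players."""
    total_weight = sum(n for _, n, _ in feds)
    return sum(n * gap for _, n, gap in feds) / total_weight


simple = simple_mean_gap(federations)
weighted = weighted_mean_gap(federations)
print(f"simple mean gap:   {simple:.1f}")
print(f"weighted mean gap: {weighted:.1f}")
```

With these made-up numbers, two tiny federations with huge gaps pull the simple average far above the weighted one, which is the kind of distortion that could flip the slope of an aggregated chart.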

It is definitely misleading for Howard to highlight Georgia as his prime example of a federation that encourages female participation while he is showing that these countries have a larger gender gap, because Georgia definitely has a smaller than average gender gap. The following line in particular sounds suspicious:

"I also tackled the participation rate hypothesis by replicating a variety of studies with players from Georgia, where women are strongly encouraged to play chess and the female FIDE participation rate is high at over 30%. The overall results were much the same as with the entire FIDE list, but sometimes not quite as pronounced."

This is right after the graph showing that the gender gap goes up as female participation increases, and right after he singled out only Georgia and Vietnam as examples of countries included in that graph. Howard finds that the gender gap is actually lower in Georgia ("sometimes not quite as pronounced"), but he completely downplays this finding and neglects to report any quantitative representation showing how the results were less pronounced. It is no wonder that readers like Nigel Short got completely the wrong impression of Howard's results, as when Short summarized this graph in the following manner:

"Howard debunks this by showing that in countries like Georgia, where female participation is substantially higher than average, the gender gap actually increases – which is, of course, the exact opposite of what one would expect were the participatory hypothesis true."

I found this review of the full paper written by Australian grandmaster David Smerdon. Smerdon's review gives a very different impression of Howard's work than Howard's own Chessbase summary. For example, in reference to the Georgia data and Short's interpretation:

"I don’t know what Short is referring to here, because there is nothing in the Howard article that suggests this. Figure 1 of the study shows that the gender gap is, and has always been, lower in Georgia than in the rest of the world for the subsamples tested (top 10 and top 50). Short may be referring to Figure 2, which, to be fair, probably shouldn’t have been included in the final paper. It looks at the gender gap as the number of games increases, but on the previous page of the article, Howard himself acknowledges that accounting for number of games played supports the participation hypothesis at all levels except the very extreme."

And later, summarizing Howard's research on the gender gap in Georgia:

"...This supports a nurture argument to the gender gap, but again, the sample size is too small for anything definitive to be concluded."

This sounds like it is describing completely different research from Howard's Chessbase article. While Short definitely did not do himself or the gender discussion any favours with his interpretation, neither does Howard do his research justice with his published summary.


The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.


Howard criticizes a 2009 study published by the British Royal Society that found support for the participation hypothesis--that there are fewer elite female chess players simply because there are fewer female chess players overall. The Bilalić, et al study looked at the top 100 rated male and female players in the German federation and compared the distribution of their ratings to an expected distribution based on the overall participation rates by gender. The observed gender gap was close to what was expected from the overall participation rates.
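The intuition behind the participation hypothesis can be sketched with a quick simulation. This is not the method Bilalić, et al used (they worked from the actual German rating distribution); it is only an illustration under simple assumptions: both groups draw ratings from the same normal distribution, and the group sizes, mean, and standard deviation below are hypothetical.

```python
import random


def top_n_mean(values, n):
    """Mean of the n highest values."""
    return sum(sorted(values, reverse=True)[:n]) / n


def simulated_top_gap(n_large, n_small, top_n=100, mean=1500.0, sd=200.0, seed=0):
    """Gap between the top-n averages of two groups with identical ability distributions.

    The two groups differ only in size; mean and sd are assumed values.
    """
    rng = random.Random(seed)
    large = [rng.gauss(mean, sd) for _ in range(n_large)]
    small = [rng.gauss(mean, sd) for _ in range(n_small)]
    return top_n_mean(large, top_n) - top_n_mean(small, top_n)


# Hypothetical group sizes: one group is 20 times larger than the other.
gap = simulated_top_gap(n_large=20000, n_small=1000)
print(f"top-100 gap from group size alone: {gap:.0f} points")
```

Even though both groups have exactly the same underlying distribution, the larger group's top 100 comes out well ahead, simply because more draws produce more extreme values.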

Howard issues three main criticisms of the study:

1) It is too difficult to determine cause and effect from their data.
2) They didn't control for the number of rated games played.
3) The study relies on data from only the German Federation and thus could simply be a sample size fluke.


Howard argues that showing that the gender gap is in line with what we would expect from participation rates is not enough to establish participation rates as the cause of the gender gap. However, Howard himself does no better in establishing a cause/effect relationship between the gender gap and his hypothesis that men are innately more talented at chess.

Howard supports his claim with data showing that the rating difference between the top male and female players has remained relatively constant over the years, which he assumes means the gender gap has not closed (which is probably incorrect). He then assumes that if there were non-biological causes behind the gender gap, the gap must have diminished over the past several decades as feminism has advanced in many developed countries, and if it hasn't then that means there is likely a biological cause.

But he doesn't provide any more support for this assumption than Bilalić, et al do for theirs. Several areas in sports that should be unaffected by the physical differences between males and females, such as coaching, general management, and officiating positions, have seen little to no progress in gender disparity over that same time span in spite of any general advances in society. It is not a given that a lack of significant progress means that gender disparity is due to natural talent.

I think Howard overestimates his evidence of a causal relationship in part because he underestimates the "gatekeeper" effect in chess. In his 2005 paper, he gives this as an important factor in testing his hypothesis:

"Adequately testing the evolutionary psychology view, that the achievement differences at least partly are due to ability differences, requires a domain with very special characteristics. First, it should be a complete meritocracy with no influence of gatekeepers, in which talent of either gender can rise readily."

Howard relies on the assumption that chess is close to a complete meritocracy because most tournaments are open* and results are based on your performance. Howard contrasts this to fields like science, where decision-makers control access to resources and could be susceptible to bias:

"In most domains, gatekeepers control resources needed for high achievement and may run an ‘old boy’s network’ favouring males. In science, for instance, gatekeepers distribute graduate school places, jobs, research grants, and journal and laboratory space."

*Most major tournaments involving the top players are actually not open, but invitational. Most tournaments below the elite level are open, however.

The absence of decision-makers with the ability to deny players access to tournaments does not mean there are no gatekeeper forces at work, however. There are other forces that can have just as strong an effect. WIM Sabrina Chevannes gives some examples of social pressures (under the section "My thoughts on sexism in chess") that commonly make women feel unwelcome or uncomfortable at predominantly male tournaments, ranging from belittling remarks to flat-out harassment.

These problems are driving established female players away from the game, but they can also be important for young players getting into the game. Most grandmasters start chess at a young age, and research backs the idea that starting age is an important factor in chess mastery (full paper), both because starting earlier allows for greater total accumulation of practice, and because chess likely has a "critical period" effect for learning (the same effect that makes it much easier for a young child to learn a language than for an adult).

This means that even subtle effects, such as a parent being more likely to teach the game to male children at a young age, or young males being more attracted to the social environment of a predominantly male local club, can have a significant gatekeeper effect. Things like age of exposure to chess, access to high-level coaching and competition, and social compatibility with existing chess culture are all important factors in developing a player's ability.

This is probably why we see strong chess countries like Russia or other former Soviet nations consistently dominating chess, even though they probably don't have any biological ability advantage. The more children who are exposed to favourable learning criteria, the more high-level chess players a population will produce. Just as these factors help keep the strongest federations on top, they could conceivably favour male players over female players.

Chevannes also points out more explicit gatekeeper behaviour, such as limited access to funding and coaching for England's women's Olympiad team ("Effects of sexism in English Chess"). Several countries provide state funding or private grants for chess development, similar to the type of gatekeeper influences Howard describes in science. For example, the USCF has the Samford Chess Fellowship, a private grant currently for $42,000, which has been awarded annually since 1987. Thirty of the 32 recipients (in three years the grant was split between two recipients) have been male.

And, as mentioned earlier, most of the top tournaments are actually invitational, which also fits Howard's criteria for gatekeeper influence. The potential gatekeeper effect of invitational tournaments preserving rating gaps is even something players have complained about: when the top tournaments only hand out invitations to the same group of top-rated players, those players just end up trading rating points among themselves, which leaves little opportunity for them to give rating points back to the rest of the field.

These factors are incredibly difficult to measure and separate out from your data, which is why Howard considers the absence of such factors essential to test his hypothesis. By ignoring these factors, Howard strongly inflates his evidence in support of a biological cause. In fact, this is a common criticism of the entire field of evolutionary psychology which Howard uses to approach this question: its hypotheses about cause and effect are so difficult to properly test that it is debatable whether it actually qualifies as science.


As discussed in the previous post, I don't think controlling for the number of rated games played adequately separates out the effect of practice and development from that of natural talent. More importantly, though, Howard's criticism here is confusing because he only describes the importance of controlling for number of games as something that could avoid a potential bias against female players. Females tend to play far fewer rated games on average, and a player's rating tends to increase the more games they play.

In order for this criticism to be relevant to the Bilalić study, omitting this control would have to bias the results in favour of female players. Howard offers no reasoning as to why this would be the case, and it is not at all obvious how it could be. Howard's own data appears to show a decreased gender gap after controlling for number of games.


While Bilalić, et al did only look at players from the German federation, they compared ratings for the top 100 players of each gender. In Howard's original study, he included players from all federations, but still only compared the ratings of the top 10, 50, and 100 players of each gender, so he was not actually using any more data points than the study he is criticizing.

Just as importantly, Bilalić, et al actually had a reason for using data from just one federation rather than FIDE data, as outlined in a later paper by Bilalić, Nemanja Vaci, and Bartosz Gula. FIDE rating data is limited to only above-average players and omits a lot of data from developing or below-average players. Rating data from individual federations can allow for a more comprehensive view of the population, such as a better estimation of overall participation rates, which was necessary for their study.

In Howard's summary article, he refutes the Bilalić study by showing data from more federations, but he doesn't actually repeat their study to create a comparison to their work. Instead, he just shows aggregated data with no indication of how many players were included or how the data was aggregated. It is not clear that he actually used more data points to draw his conclusion than Bilalić, et al used, only that he looked at players from multiple federations.

Howard's most recent study is behind a paywall, so unfortunately all I have to go by is his summary published on Chessbase. I assume there are more details in the full study, but it is impossible to tell how his data really compares with the data from the Bilalić study from what he published in the summary, which is largely written as a refutation of the Bilalić study.


Gender in Chess PART 2: ELO RATINGS

The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.



We saw previously that the lack of significant change in the rating gap between the top male and female players can actually be evidence that the gender gap in chess has diminished over the past few decades. That is not the only potential interpretation problem with Howard's conclusion, though. Even if control groups hadn't indicated that the Elo gap should be increasing absent any closing of the gender gap, it could still be conceivable that the gender gap has in fact diminished.

This is because Elo ratings are not indicators of absolute playing strength, only of strength relative to the field of rated players. In other words, a 2500 Elo rating among one group of players is not necessarily equivalent to a 2500 rating among another group of players. For example, Hikaru Nakamura's FIDE rating for April, 2015 is 2798. His USCF rating is 2881. Both are Elo ratings, but because they are tracked among different pools of players, they don't have to match up even though they are both describing the strength of the exact same player.

Howard is looking at the same FIDE rating for both male and female players, though, so this shouldn't be a problem, right? Possibly, but we don't know for sure.

Elo ratings work by taking points away from one player and giving them to the other each time a game is played. If a player has a true playing strength of 2500 but is rated at 2400, then they would be expected to take points from their opponents until their rating matches their playing strength. Likewise, a player who is overrated will give points back to the field until their rating returns to their ability.
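As a rough sketch of that mechanism, here is the standard textbook form of the Elo update in Python. The function names and the K-factor of 10 are my own choices for illustration; FIDE's actual regulations differ in details, and the exchange is only exactly zero-sum when both players share the same K-factor, as assumed here.

```python
def expected_score(rating_a, rating_b):
    """Expected score (between 0 and 1) for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=10):
    """Return both players' new ratings after one game.

    score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    k is an assumed K-factor shared by both players, so whatever
    A gains is exactly what B loses.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two players rated 2400: the winner takes 5 points from the loser.
print(update(2400, 2400, 1.0))

# A player of true strength 2500 who is rated 2400 scores like a 2500 player
# against 2400-rated opposition, i.e. above the 0.5 their rating predicts,
# so on average they keep collecting points until the rating catches up.
print(round(expected_score(2500, 2400), 2))
```

Because points only move between the two players at the board, a group of players that rarely meets the rest of the pool has no channel through which a group-wide rating error can be corrected.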

Many top female players play predominantly or exclusively in women's events. And some of the top female players who play in open events, such as Judit and Susan Polgar when they were still active, rarely play women's events at all, and as a result rarely play against other women. Because of this, if males or females are over- or underrated as a group, there might not be enough games between the two groups to transfer the necessary rating points to bring them back in line. It is possible that female players and male players form two sufficiently isolated player pools that their ratings are not necessarily comparable.

This might sound far-fetched, but it is actually a known problem and has occurred before. In 1987, FIDE commissioned a study comparing the performance of top female players against men to their performance against other women because of this exact issue. The six women who had played a sufficient number of games against both genders over the mid-80s to qualify for the study all held significantly higher performance ratings against male opponents than female opponents--on average more than 100 points higher.

This suggested that, for example, a 2400 rated female player was likely stronger than a 2400 male player. To compensate, FIDE added 100 points to all rated female players (except Susan Polgar*) in order to bring their ratings in line with the male ratings. It is possible that the two pools of players have remained isolated enough to drift out of sync again over the last few decades, however.

*The reasoning was that Polgar already played mostly within the male pool of players and didn't need the adjustment. However, the decision to give the full 100 points to her top rivals, who also played a significant number of games against men, and 0 points to Polgar was nonsensical and controversial, and there were accusations that FIDE was deliberately manipulating the ratings to place Maya Chiburdanidze in the #1 spot ahead of Polgar.

Most people who follow chess believe that some form of inflation exists in the ratings. In other words, they believe a 2800 rating is not as strong now, when there are a handful of players hovering around that level, as it was when Garry Kasparov first achieved it back in 1990 and Anatoly Karpov was the only other player over 2700, or in 1972 when Bobby Fischer topped the ratings list by over 100 points at 2785.

The mechanism of inflation is not well understood, however, and it is not clear that it would necessarily have had the same effect on a fairly isolated pool of female players as on the population as a whole. It could be that after nearly 30 years, male ratings have inflated faster than female ratings, and we have once again reached a point where female players as a whole are underrated.

Howard himself notes another potential interpretation problem with using FIDE ratings to measure the gender gap in chess:

"I found that women typically play many fewer FIDE-rated games than males, only about one third of the number on average. Now, the usual learning curve for chess players is a progressive ascent to a peak at around 750 FIDE-rated games. ... Comparing modestly- and highly-practiced individuals can be misleading. Studies should control for differences in number of games played, either by equating males and females on this or by examining differences at the typical rating peak at around 750 games."

Howard then dismisses this explanation because even after controlling for the number of rated games played, males still had higher ratings.

The number of FIDE-rated games played itself isn't really what we care about, though. It's just a proxy for "modestly-practiced", "highly-practiced", etc. Players who have played more games should, in general, have more experience and further development. Games played are far from a perfect indicator of a player's level of development or experience, however.

Most obviously, not all games are FIDE-rated. While top-rated players do for the most part compete exclusively in FIDE-rated events, that is not true for developing players. For example, U.S. prodigy Sam Sevian has played 539 FIDE-rated games as of April, 2015. He's played 922 USCF-rated games. Even ignoring casual and club games, that is hundreds of competitive games that are not in Howard's data (and it is more than the difference between 922 and 539, because not all FIDE-rated games are USCF-rated).

The amount of study devoted to chess outside of rated games is also a huge factor in development. Someone who is devoted to studying chess full-time will develop much more than someone who competes as a casual hobby, even if you control for the number of rated games played. Likewise, someone who competes fairly regularly and reaches 750 games in their 20s is different from someone who competes less frequently and reaches 750 games in their 40s or 50s (or even later). The former is probably much more likely to still be ascending and hitting their peak at that point, while the latter likely peaked or plateaued at a much lower number of games, and would probably have begun declining with age by the time they reached 750 games.

It's easy to see how two players can be at vastly different stages of development even after the same number of games played. Howard isn't comparing two individual players, though--he is comparing two groups of players (male and female). As long as you look at enough players in each group, shouldn't those other factors start to even out?

Ideally, they should. If there is a bias that applies to the group as a whole, though, that won't happen. For example, if female players tend to begin playing FIDE-events at an earlier stage in their development, or if they tend to compete less frequently than male competitors, that would introduce a bias that won't even out.

Many female players compete predominantly in female-only events, which are less frequent than open-gender events. And because these female-only events draw from a much smaller segment of the chess-playing population than the open-gender events, they also tend to be less intimidating for less-experienced players to enter. So there is a good chance this bias does exist.

In fact, Howard's data supports this. His 2005 paper includes a table summarizing males and females who entered the rating list between 1985 and 1989, and shows that the median age at which females first appeared on the list was about five years younger than the median age for males (though for top 100 females, it was only about 6 months younger than for top 100 males). And, in spite of entering the rating list at a younger age, the females on average still played significantly fewer games in their competitive careers.

Howard hits on an important idea about a player's rating being reflective of both their innate abilities and their level of practice and development. In order to test for the effects of innate abilities alone, as Howard sets out to do, he realizes that he needs to strip out the effects of development. However, this is a much more complicated issue than Howard acknowledges, and simply controlling for the number of rated games played is not adequate to make the assumption that any remaining differences must reflect natural ability.



The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.


Howard begins by revisiting a 2005 paper he published on the same topic showing the gap between the average Elo rating of the top 50 male players and the top 50 female players:

Howard then argues that because the Elo gap has remained relatively constant in spite of societal changes over that time period, the difference between the male and female ratings is not due to societal factors and is at least partially biologically based.

This finding is likely surprising to most people in chess. For example, the legendary Garry Kasparov, who early in his career expressed a somewhat Fischer-esque dismissal of female chess talent, grew to greatly respect the Polgar sisters (one of whom has defeated Kasparov himself) and felt they broke new ground for female players. In a recent interview at an exhibition match with Short in St. Louis, Kasparov rejected the claim that the gender gap has not closed. Even Short himself wrote that he had assumed the gap had closed somewhat before reading Howard's article.

Howard acknowledges this prior expectation in his 2005 paper:

"Anecdotally at least, there has been some convergence in chess at top levels. For example, there are more female grandmasters. Judit Polgar, born in 1976 and the strongest-ever female player, regularly wins tournaments against top male competition and several times has made the top ten players list. She once held the record for youngest-ever grandmaster. But, the extent of gender differences and their trends over time have never been quantified."

After quantifying the Elo difference, though, Howard simply assumes that the difference remaining flat means there has been no closing of the gender gap. This might seem like a reasonable assumption, but it lacks an important step: he has no control group to help interpret his results.


Computers have revolutionized how chess is played and studied at the top level. With the help of computer engines that are much stronger than any human grandmaster, known opening lines are constantly being analyzed more and more thoroughly. The more thoroughly these lines are known, the more important it is for players to memorize them, and the deeper they have to look for new ideas that could lead to a winning position. Strong grandmasters spend most of their time studying and developing these lines.

Former World Champion Vladimir Kramnik (born 1975, reached grandmaster 1991) said at his most recent tournament that top players have to work much harder now than when his career was starting. However, only the very top players can support themselves studying chess and competing full time. Most grandmasters, let alone lower titled or untitled players, don't have the time to keep up with all of these advances.

It is possible that this has led to the top players distancing themselves from the field. If that is the case, then, absent any closing of the gender gap, we would expect the Elo gap between the top 50 males and the top 50 females to have grown over time, just because there are more males in the group at the very top that is pulling away from everyone else. We need some kind of control group to compare to in order to help us interpret Howard's graph before we conclude the gender gap has not closed.

One way to do this is to compare breakdowns other than the top 50 females vs. the top 50 males. For example, what if we take the top 50 Russian players, and compare them to the top 50 non-Russian players?

The top 50 Russians in the April 2015 FIDE rating list have an average Elo rating of 2659. The top 50 players from outside Russia are at 2726. So the top 50 Russian players are 67 points below the non-Russians.

If we go back to 1991 (the first year the Soviet federations were listed separately--it would be impossible to make comparisons before that because the USSR included many strong players from outside Russia), the top 50 Russians were 54 points behind the top non-Russians. So the gap has grown a bit in the last couple decades, in spite of the fact that Russia remains by far the top federation.

Of course, you might be able to make a case that Russia is a bit weaker than it was in the early 90s when Kasparov and Karpov were still dominating chess. Except here's the thing: when we compare Russia to the rest of the world, Russia has lost ground. But if we instead compare Russia to each individual federation, they have actually gained ground over most of them. This seems paradoxical, but it makes sense if the top end of the spectrum is stretching itself out.

Let's take a look at some of these other countries.

The U.S. is experiencing something of a golden age for chess right now. They currently have two of the top ten players in the world. Hikaru Nakamura, the best American player since Bobby Fischer*, has been as high as #2 in the world in the live rankings this year, and recently became the first American to hit 2800 Elo. Increased funding and efforts in development programs have produced some remarkable young talent, including Sam Sevian, who in 2014 became the sixth-youngest grandmaster ever at 13 years old.

*at least not counting Fabiano Caruana, who has spent most of the last year as the #2 player in the world--Caruana was born in the U.S. but moved to Europe at age 13 and has represented Italy for his professional career

The emergence of serious collegiate chess teams has also attracted strong talent from around the world to the U.S. For example, five of the twelve competitors in the open-gender division of the 2015 U.S. National Championship (and at least that many from the Women's division) had originally competed under a different national federation before transferring to the USCF, including world #7 Wesley So. Likely influenced by the emergence of American chess, the aforementioned Caruana recently announced that he is transferring back to the USCF.

It would be hard to argue that the U.S. federation is weaker now than it was in 1991, let alone much weaker. Yet in 1991, the top 50 American players were 105 points behind the top 50 non-Americans. Now, they're 185 points back.

What about Norway, the home of current World Champion and clear #1 Magnus Carlsen? Carlsen has sparked a chess craze in Norway, where tournaments now get national TV coverage. Norway hosts one of the top chess tournaments in the world (Norway Chess) and last year hosted the Chess Olympiad. The number of Norwegians in FIDE's published rating list grew from 92 in 1991 to 1306 this year.

The gap between the top 50 Norwegian players and the rest of the world grew from 289 points in 1991 to 337 points in 2015.

Not all federations saw their gap increase. China, for example, has without a doubt become much stronger in chess since 1991. Chess has had difficulty catching on in China due to the prevalence of xiangqi, China's native chess variant, and go, another popular strategy game. Chess was even outlawed for a period in the 1960s and '70s as part of Chairman Mao's Cultural Revolution. Starting in the 1970s, however, China began pouring an increasing amount of funding and effort into growing its chess program.

This has ramped up in recent years, and China has finally emerged as a world chess power. Their women's team has won gold in four of the nine Chess Olympiads held since 1998 and three of the five World Team Championships since a women's division was created in 2007. The open-gender team won gold in the 2014 Olympiad and the 2015 World Team Championships. Their top 50 went from 329 points back of the world in 1991 to 207 points back in 2015.

Still, the vast majority of federations saw increases. Here are the Elo gaps for each of the 38 federations that had at least 50 FIDE-rated players in both 1991 and 2015:

Only 5 of the 38 federations closed the Elo gap at all, and on average the gap grew by 54 points.

When we look at the individual federations as control groups, we see evidence that the top really is separating itself further away from the field as time goes on. In spite of that, Howard's graph shows that women actually closed the Elo gap by a small amount. This can be interpreted as evidence that the gender gap is in fact closing, because it is offsetting the effect we are seeing with the national federations.

It is tempting to see evidence that supports your hypothesis in a vacuum, such as the relatively constant Elo gap between male and female players over the years, and to stop there. It is also tempting to believe a variable you believe to be objective and unbiased, such as Elo ratings, is self-explanatory and needs no control group to interpret. However, this is a dangerous practice. Especially when your results run counter to what subject matter experts would expect, as this finding did, it is important to make sure you have the proper context to interpret your results before jumping to conclusions.



The following is an introduction to a series of posts about some of the difficulties with conducting and interpreting statistical research, with links to the rest of the series at the end of this post.

Bobby Fischer once said he could beat any woman in the world giving them knight odds* (the full quote, in true Fischer fashion, is worse). Mikhail Tal famously responded, "Fischer is Fischer, but a knight is a knight!"

*Knight odds means the player giving odds starts the game with one knight already off the board.

Tal was correct, of course. In 2008, a master player named John Meyer, rated 2284 (grandmasters are rated 2500+, with the top GMs well over 2700 or even 2800), played a match against the computer program Rybka with knight odds. By that time, computers had far surpassed humans in chess. Rybka could have easily defeated the world champion in a non-handicapped match. With knight odds, Meyer won the match 4-0. There were women in Fischer's generation much stronger than Meyer who would have had no problem beating Fischer given such a handicap.

Still, chess remains a largely male-dominated profession. Currently, there are just two women in the top 100 rated players in the world, and one (Judit Polgar) is retired and will fall out of the active rankings later this year. Theoretically, chess should be among the most gender-neutral competitive disciplines, but the overwhelming majority of players are male. In fact, the predominance of male players is so strong that the URL for FIDE's top 100 overall list actually ends with "...?list=men", even though there are women on the list.

The question of why this is and what can (or should) be done about it has long been a point of discussion in the game, but this discussion reached the mainstream media last month due to controversy over an article written by British Grandmaster Nigel Short in the magazine New in Chess.

If you don't know Short (and you probably don't unless you particularly follow chess or remember his highly publicized World Championship match with Garry Kasparov in 1993), he's...well he's not really the best representative to speak about anything, really. When asked to write an obituary in his newspaper chess column for fellow British Grandmaster Tony Miles, he pretended to write a proper obituary for a few paragraphs before descending into a long-winded rant about why he didn't like Tony Miles, culminating with the line "I obtained a measure of revenge not only by eclipsing Tony in terms of chess performance, but also by sleeping with his girlfriend, which was definitely satisfying but perhaps not entirely gentlemanly." Nigel Short, everyone.

So it's no surprise that Short set off some fuses when asked to write about this topic (by the time he gets to the part about how he has to "manoeuvre the car out of our narrow garage" for his wife, you kind of get the sense that he's just doing this on purpose--which, in a media environment where controversy equals views equals money, he may well be.)

In the midst of his rambling, though, Short actually does cite an academic paper by Robert Howard (actually, a synopsis of the study posted by Howard to the chess website

"Nevertheless, my gut feeling was that female chess players are both stronger and more numerous than they were when I first began competing. The latter is certainly true, but an excellent article by the Australian Robert Howard on the website last year demonstrated that, despite the enormous societal changes over 40 years, the gap between the leading males and females has remained fairly constant at nearly 250 Elo points – a yawning chasm in ability. That women seem stronger has more to do with universally higher standards, due to the ubiquity of computers, than any closing of the gender gap."

Unfortunately, Short's citation comes with a clear agenda, as is evident in how he presents a second academic study which reached different conclusions:

"Howard also subtly critiques the most absurd theory to gain prominence in recent years, by Bilalić, Smallbone, McLeod and Gobet (which was submitted to the prestigious Royal Society, no less), that the rating sex difference is almost entirely attributable to participatory numbers (they comprise just 1% of the readership of this magazine). With the aid of a couple of bell curves this foursome neatly solve the eternal chess conundrum of why women lag behind their male counterparts, while simultaneously satisfying that irritating modern psychological urge to prove all of us, everywhere, are equal. Only a bunch of academics could come up with such a preposterous conclusion which flies in the face of observation, common sense and an enormous amount of empirical evidence too. Howard debunks this by showing that in countries like Georgia, where female participation is substantially higher than average, the gender gap actually increases – which is, of course, the exact opposite of what one would expect were the participatory hypothesis true."

The problem is partially that Short probably has no idea what the studies are doing (for example, Short seems unaware that Howard found the gender gap did decrease in Georgia compared to the rest of the world, makes up the term "enormous amount of empirical evidence" without justification, and I don't get the impression he's even read the Bilalić et al. study), but in this case, the blame doesn't lie entirely with Short. Howard's synopsis itself is largely responsible. It appears to misrepresent Howard's own work, as well as point to some potential critical issues with the study.

That being the case, I'd like to use this as an opportunity to cover some of the potential pitfalls in running this type of statistical analysis.


Math Behind Projecting the Division Winner (THT Article)

Note: this article uses examples from the free statistical software R

In my Hardball Times article about projecting the number of wins we expect from the division winner, I included the following example:

Instead of having five baseball teams, let's say we have five coins. All we are going to do is flip each coin 162 times. Each time a coin lands on heads, it gets a win, and each time it lands on tails, it gets a loss. The coin with the most wins after 162 flips wins the division.

How many wins would you project for the coin that ends up winning the division, whichever coin that might be?

No coin by itself is going to have an expected value of more than 81 wins, but it is extremely likely that at least one out of the five coins will end up with more than 81 wins just by chance. It turns out that if you repeat this experiment a bunch of times, the coin that wins the division will end up with about 88 wins, on average.

Hopefully this makes sense conceptually, but how do I get 88 wins (or, more precisely, 88.3943...)?

One way, of course, is to actually do what I said, and flip a bunch of coins over and over and over and record the results. Let's say I repeat this experiment 10 times, and I get the following results for the "division winners":

94, 85, 89, 87, 89, 90, 82, 86, 85, 86

That is an average of 87.3--pretty good, but obviously not the most precise estimate. We need to repeat the experiment more than ten times to make sure we get something closer to the true mean. Rather than spend hours upon hours flipping coins, we can actually cheat and get a computer to pretend to do it for us. This is called simulation, and it can be a very powerful statistical tool for determining probabilities, averages, distributions, etc. that are not computationally obvious (full disclosure: I actually cheated and simulated the 10 seasons rather than record and tally 8000+ coin flips).

Now, let's simulate 1000 seasons: this time, we get 88.5940 wins leading the division, on average. Much better, but still a couple tenths off. Bumping the number of seasons up to 10,000, this time we get 88.4296. And if we keep simulating more and more seasons, we are going to start seeing the results stay clustered more and more closely around 88.3943.
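A simulation like this only takes a few lines in R. The following is a minimal sketch of one way to do it, not necessarily the exact code behind the numbers above; the seed and the number of seasons are arbitrary choices, and a different seed will give a slightly different average.

```r
# simulate many seasons of the five-coin division and average the winners' totals
set.seed(1)      # arbitrary seed, only so the run is repeatable
seasons <- 10000 # number of simulated seasons
n <- 162         # flips (games) per coin per season
p <- .5          # probability of heads (a win) on each flip
teams <- 5       # coins in the division

# one row per season, one column per coin; each entry is that coin's win total
wins <- matrix(rbinom(seasons * teams, n, p), nrow = seasons)
winners <- apply(wins, 1, max) # best win total in each season
mean(winners)                  # average wins for the division winner
```

Increasing `seasons` tightens the estimate, at the cost of a longer run.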

So that's one way to estimate the expected win total for our division winner. How do I know that the results should cluster around 88.3943 specifically, though, other than simulating millions and millions of seasons?

We can get the answer without simulation by starting with a simpler question. What is the probability that none of the teams wins more than, for example, 81 games? The probability that one team wins no more than 81 games is a simple binomial distribution problem: pbinom(81,162,.5) ~ .5313. The probability that all five are at 81 or lower then becomes .5313^5 ~ .04233.

There is about a 4% chance that the division winner will have 81 or fewer wins. We can repeat that calculation for 80 wins, and we see that there is about a .02262 probability of the division winner having 80 or fewer wins. That means the probability of the division winner having exactly 81 wins is .04233 - .02262 = .01971.
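In R, the arithmetic for this one win total looks like the following (the variable names are just labels I picked for this sketch):

```r
# probability that all five teams finish at or below a given win total
p81 <- pbinom(81, 162, .5)^5 # division winner has 81 or fewer wins
p80 <- pbinom(80, 162, .5)^5 # division winner has 80 or fewer wins
p_exactly_81 <- p81 - p80    # division winner has exactly 81 wins
c(p81, p80, p_exactly_81)
```

The subtraction works because "81 or fewer" splits into "80 or fewer" and "exactly 81", and those two cases cannot both happen.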

Then, we repeat that process for every number from 0 to 162, and we end up with a table of probabilities of the division winner ending up on each possible number of wins. (If you were to do this by hand, you could shortcut a bit by only going from something like 70 to 115 since the probabilities outside that range are all virtually zero anyway.)

Finally, we multiply each possible win total by the probability of the division winner finishing with that number of wins, and we add up the results to get a mean for the distribution. And doing that gives us 88.3943.


#calculate expected mean value of division winner
p <- .5 #probability of each team winning each game
n <- 162 #number of games per season
teams <- 5 #number of teams in the division

games <- 0:n # list of possible win totals (0:162)
p.list <- pbinom(games,n,p)^teams # p of div winner winning X games or fewer
wts <- c(p.list[1],diff(p.list)) # p of div winner winning exactly X games
sum(games*wts) # average wins by division winner

[1] 88.39431

As we can see, it is possible to calculate the mean of this distribution exactly, but it is still pretty cumbersome to do so without a computer. As such, let's discuss one final way to estimate this mean using simpler calculations.

First, we will need a continuous distribution, so we use a normal approximation for the binomial distribution. The mean of the normal distribution will just be 81 (the average number of wins we expect from a team in our example), and the standard deviation will be sqrt(npq) = sqrt(162*.5*.5) ~ 6.36.

All we need to do now is find the point where there is a 50% chance that five numbers randomly sampled from this distribution will all fall below that number. Start by finding the percentile of the distribution that fulfils this condition:

p^5 = .5
p = .5^(1/5) ~ 0.8706

This means we want the point at the 87.06th percentile (the 0.8706 quantile) of our normal distribution, which is simple to look up using an online tool or simple statistical software:

qnorm(0.8706,81,6.36) ~ 88.1849

That is our estimate for the expected number of wins from the division winner. This is slightly off because we are actually calculating the median and not the mean (and because we used a normal approximation, but that makes less difference), but it is still a pretty good estimate given the amount of calculation we simplified.
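To see the median-versus-mean point more directly, here is a short sketch that recomputes the normal-approximation estimate from the example's parameters and sets it next to the median of the exact distribution. The variable names are my own, and the exact median comes out as a whole number because win totals are discrete.

```r
p <- .5     # probability of each team winning each game
n <- 162    # number of games per season
teams <- 5  # number of teams in the division

# normal approximation: per-team quantile q such that q^teams = .5
target <- .5^(1/teams)
approx_median <- qnorm(target, n * p, sqrt(n * p * (1 - p)))

# exact distribution of the division winner's win total
games <- 0:n
cdf_max <- pbinom(games, n, p)^teams           # P(division winner has X wins or fewer)
exact_median <- games[which(cdf_max >= .5)[1]] # smallest total with at least a 50% chance

approx_median
exact_median
```

Both values sit a little below the exact mean of 88.3943, which fits a distribution with a longer tail on the high side.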