3-D Baseball: Gender in Chess PART 2: ELO RATINGS

The following is part of a series of posts about some of the difficulties with conducting and interpreting statistical research.

Previous:
Intro
PART 1: MEASURING THE GENDER GAP

We saw previously that the lack of significant change in the rating gap between the top male and female players can actually be evidence that the gender gap in chess has diminished over the past few decades. That is not the only potential interpretation problem with Howard's conclusion, though. Even if control groups hadn't indicated that the Elo gap should be increasing absent any closing of the gender gap, it could still be conceivable that the gender gap has in fact diminished.

This is because Elo ratings are not indicators of absolute playing strength, only of strength relative to the field of rated players. In other words, a 2500 Elo rating among one group of players is not necessarily equivalent to a 2500 rating among another group of players. For example, Hikaru Nakamura's FIDE rating for April, 2015 is 2798. His USCF rating is 2881. Both are Elo ratings, but because they are tracked among different pools of players, they don't have to match up even though they are both describing the strength of the exact same player.

Howard is looking at the same FIDE rating for both male and female players, though, so this shouldn't be a problem, right? Possibly, but we don't know for sure.

Elo ratings work by taking points away from one player and giving them to the other each time a game is played. If a player has a true playing strength of 2500 but is rated at 2400, then they would be expected to take points from their opponents until their rating matches their playing strength. Likewise, a player who is overrated will give points back to the field until their rating returns to their ability.
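To make the point-transfer idea concrete, here is a minimal sketch of an Elo update in Python. It assumes the standard logistic expected-score formula and a K-factor of 10; the function names and the choice of opponents (a steady supply of accurately rated 2400 players) are mine, purely for illustration, and real federations layer extra rules on top of this.

```python
def expected_score(rating_a, rating_b):
    """Expected score for A against B under the logistic Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def play_game(rating_a, rating_b, score_a, k=10):
    """Return both new ratings after one game.

    score_a is 1 for a win, 0.5 for a draw, 0 for a loss, from A's side.
    Whatever A gains, B loses (assumes both players share the same K).
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def drift_toward_strength(rating, true_strength, opponent_rating=2400,
                          games=500, k=10):
    """Follow an under- or overrated player who scores exactly what their
    true strength predicts against a fresh, accurately rated opponent
    in every game (an idealization; real results are noisy)."""
    for _ in range(games):
        average_result = expected_score(true_strength, opponent_rating)
        rating, _opponent_after = play_game(rating, opponent_rating,
                                            average_result, k)
    return rating


# A 2500-strength player who starts out rated 2400 climbs toward 2500.
print(round(drift_toward_strength(2400, 2500)))
```

The important detail for what follows is that the points have to come from somewhere: the underrated player only climbs by taking points from the opponents they actually face.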

Many top female players play predominantly or exclusively in women's events. And some of the top female players who play in open events, such as Judit and Susan Polgar when they were still active, rarely play women's events at all, and as a result rarely play against other women. Because of this, if males or females are over- or underrated as a group, there might not be enough games between the two groups to transfer the necessary rating points to bring them back in line. It is possible that female players and male players form two sufficiently isolated player pools that their ratings are not necessarily comparable.
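A toy simulation shows how this can happen. Everything in it is an assumption made for illustration: two pools of 20 players with identical true strength, one pool starting out rated 100 points too low, a K-factor of 10, no draws, and a made-up fraction of games played across pools. It is a sketch of the mechanism, not a model of the actual FIDE pool.

```python
import random


def expected_score(rating_a, rating_b):
    """Expected score for A against B under the logistic Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rating_gap_after(cross_fraction, games=5000, pool_size=20, k=10, seed=1):
    """Return mean(pool A rating) - mean(pool B rating) after `games` games.

    Every player has the same true strength, so each game is a coin flip.
    Pool B simply starts out underrated by 100 points.
    """
    rng = random.Random(seed)
    pool_a = [2400.0] * pool_size
    pool_b = [2300.0] * pool_size  # same strength, rated 100 points too low

    for _ in range(games):
        if rng.random() < cross_fraction:
            # Cross-pool game: one player from each pool.
            first, second = pool_a, pool_b
            i, j = rng.randrange(pool_size), rng.randrange(pool_size)
        else:
            # Within-pool game: two different players from the same pool.
            first = second = rng.choice([pool_a, pool_b])
            i, j = rng.sample(range(pool_size), 2)

        score = 1.0 if rng.random() < 0.5 else 0.0  # equal strength
        delta = k * (score - expected_score(first[i], second[j]))
        first[i] += delta
        second[j] -= delta

    return sum(pool_a) / pool_size - sum(pool_b) / pool_size


for fraction in (0.0, 0.02, 0.5):
    print(fraction, round(rating_gap_after(fraction), 1))
```

With no cross-pool games at all, the 100-point gap never moves, because points exchanged inside a pool cannot change that pool's average. The more the pools mix, the faster the gap closes.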

This might sound far-fetched, but it is actually a known problem and has occurred before. In 1987, FIDE commissioned a study comparing the performance of top female players against men to their performance against other women because of this exact issue. The six women who had played a sufficient number of games against both genders over the mid-80s to qualify for the study all held significantly higher performance ratings against male opponents than female opponents--on average more than 100 points higher.
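For readers unfamiliar with performance ratings, here is a rough sketch of the idea in Python. It inverts the logistic Elo curve, which is a common approximation and an assumption on my part; FIDE's official calculation uses a lookup table instead. The results fed into it below are invented for illustration and are not the study's data.

```python
import math


def performance_rating(avg_opponent_rating, points, games):
    """Rating at which the observed score would be exactly the expected score.

    Approximation: inverts the logistic Elo curve around the average
    opponent rating. Undefined for 0% or 100% scores.
    """
    p = points / games
    if not 0 < p < 1:
        raise ValueError("performance rating is undefined for 0% or 100% scores")
    return avg_opponent_rating + 400 * math.log10(p / (1 - p))


# Hypothetical player with made-up results: an even score against 2450-rated
# men and an even score against 2350-rated women.
vs_men = performance_rating(2450, 5, 10)
vs_women = performance_rating(2350, 5, 10)
print(round(vs_men - vs_women))
```

If ratings in the two pools meant the same thing, a player's performance rating should come out about the same against either group. A consistent gap like the one in this made-up example is the kind of signal the FIDE study was looking for.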

This suggested that, for example, a 2400 rated female player was likely stronger than a 2400 male player. To compensate, FIDE added 100 points to all rated female players (except Susan Polgar*) in order to bring their ratings in line with the male ratings. It is possible that the two pools of players have remained isolated enough to drift out of sync again over the last few decades, however.

*The reasoning was that Polgar already played mostly within the male pool of players and didn't need the adjustment. However, the decision to give the full 100 points to her top rivals, who also played a significant number of games against men, and 0 points to Polgar was nonsensical and controversial, and there were accusations that FIDE was deliberately manipulating the ratings to place Maya Chiburdanidze in the #1 spot ahead of Polgar.

Most people who follow chess believe that some form of inflation exists in the ratings. In other words, they believe a 2800 rating is not as strong now, when there are a handful of players hovering around that level, as it was when Garry Kasparov first achieved it back in 1990 and Anatoly Karpov was the only other player over 2700, or in 1972 when Bobby Fischer topped the ratings list by over 100 points at 2785.

The mechanism of inflation is not well understood, however, and it is not clear that it would necessarily have had the same effect on a fairly isolated pool of female players as on the population as a whole. It could be that after nearly 30 years, male ratings have inflated faster than female ratings, and we have once again reached a point where female players as a whole are underrated.

Howard himself notes another potential interpretation problem with using FIDE ratings to measure the gender gap in chess:

"I found that women typically play many fewer FIDE-rated games than males, only about one third of the number on average. Now, the usual learning curve for chess players is a progressive ascent to a peak at around 750 FIDE-rated games. ... Comparing modestly- and highly-practiced individuals can be misleading. Studies should control for differences in number of games played, either by equating males and females on this or by examining differences at the typical rating peak at around 750 games."

Howard then dismisses this explanation because even after controlling for the number of rated games played, males still had higher ratings.

The number of FIDE-rated games played itself isn't really what we care about, though. It's just a proxy for "modestly-practiced", "highly-practiced", etc. Players who have played more games should, in general, have more experience and further development. Games played are far from a perfect indicator of a player's level of development or experience, however.

Most obviously, not all games are FIDE-rated. While top-rated players do for the most part compete exclusively in FIDE-rated events, that is not true for developing players. For example, U.S. prodigy Sam Sevian has played 539 FIDE-rated games as of April, 2015. He's played 922 USCF-rated games. Even ignoring casual and club games, that is hundreds of competitive games that are not in Howard's data (and it is more than the 383-game difference between 922 and 539, because not all FIDE-rated games are USCF-rated).

The amount of study devoted to chess outside of rated games is also a huge factor in development. Someone who is devoted to studying chess full-time will develop much more than someone who competes as a casual hobby, even if you control for the number of rated games played. Likewise, someone who competes fairly regularly and reaches 750 games in their 20s is different from someone who competes less frequently and reaches 750 games in their 40s or 50s (or even later). The former is probably much more likely to still be ascending and hitting their peak at that point, while the latter likely peaked or plateaued at a much lower number of games, and would probably have begun declining with age by the time they reached 750 games.

It's easy to see how two players can be at vastly different stages of development even after the same number of games played. Howard isn't comparing two individual players, though--he is comparing two groups of players (male and female). As long as you look at enough players in each group, shouldn't those other factors start to even out?

Ideally, they should. If there is a bias that applies to a group as a whole, though, those factors won't even out. For example, if female players tend to begin playing FIDE-rated events at an earlier stage in their development, or if they tend to compete less frequently than male competitors, that would introduce a bias that persists no matter how many players you look at.
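Here is a small sketch of how such a bias survives controlling for rated games. The learning curve, its parameters, and the counts of prior unrated games are all hypothetical, chosen only to show the mechanism: two players with identical ability, compared at the same number of FIDE-rated games, can still differ in strength if one had more experience before their FIDE games started counting.

```python
import math


def strength(ability_ceiling, total_games, start=2000.0, scale=400.0):
    """Hypothetical learning curve: strength rises from `start` toward the
    player's ceiling as total experience (rated or not) accumulates."""
    progress = 1 - math.exp(-total_games / scale)
    return start + (ability_ceiling - start) * progress


def strength_at(fide_games, prior_unrated_games, ability_ceiling=2500.0):
    """Strength after `fide_games` FIDE-rated games, given how much
    non-FIDE experience the player had before entering the list."""
    return strength(ability_ceiling, fide_games + prior_unrated_games)


# Same ceiling, same 750 FIDE-rated games, different experience before
# entering FIDE events (both counts are made up).
late_entrant = strength_at(750, prior_unrated_games=400)
early_entrant = strength_at(750, prior_unrated_games=100)
print(round(late_entrant - early_entrant, 1))
```

Averaging over more players does nothing to this gap if one whole group tends to be the early entrants. It only averages away differences that vary randomly from player to player.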

Many female players compete predominantly in female-only events, which are less frequent than open-gender events. And because these female-only events draw from a much smaller segment of the chess-playing population than the open-gender events, they also tend to be less intimidating for less-experienced players to enter. So there is a good chance this bias does exist.

In fact, Howard's data supports this. His 2005 paper includes a table summarizing males and females who entered the rating list between 1985 and 1989, and shows that the median age at which females first appeared on the list was about five years younger than the median age for males (though for top 100 females, it was only about 6 months younger than for top 100 males). And, in spite of entering the rating list at a younger age, the females on average still played significantly fewer games in their competitive careers.

Howard hits on an important idea about a player's rating being reflective of both their innate abilities and their level of practice and development. In order to test for the effects of innate abilities alone, as Howard sets out to do, he realizes that he needs to strip out the effects of development. However, this is a much more complicated issue than Howard acknowledges, and simply controlling for the number of rated games played is not adequate to make the assumption that any remaining differences must reflect natural ability.

NEXT:
PART 3: CAUSE AND EFFECT, THE BILALIĆ, SMALLBONE, MCLEOD AND GOBET STUDY
PART 4: MISREPRESENTING THE DATA
