Note: This post is the first in a three-part series. It describes the motivation for this project and the data used. When parts two and three are posted I will link to them here.
Can you predict which sport or event an Olympian competes in based solely on her height, weight, age and sex? If so, that would suggest that physical features strongly drive athletes' relative abilities across sports, and that they pick sports that best leverage their physical predisposition. If not, we might infer that athleticism is a latent trait (like "grit") that can be applied to the sport of one's choice.
David Epstein argues that sporting success is largely based on heredity in his book, The Sports Gene. To support his argument, he describes how elite athletes' physical features have become more specialized to their sport over time (think Michael Phelps). At a basic level Epstein is correct: males and females differ at both a genetic level and in their physical features, generally speaking.
However, Epstein advanced a stronger claim in an interview (at 29:46) with Russ Roberts:
Roberts: [You argue that] if you simply had the height and weight of an Olympic roster, you could do a pretty good job of guessing what their events are. Is that correct?
Epstein: That’s definitely correct. I don’t think you would get every person accurately, but... I think you would get the vast majority of them correctly. And frankly, you could definitely do it easily if you had them charted on a height-and-weight graph, and I think you could do it for most positions in something like football as well.
I chose to assess Epstein's claim in a project for a machine learning course at Duke this semester. The data was collected by The Guardian, and includes all participants for the 2012 London Summer Olympics. There was complete data on age, sex, height, and weight for 8,856 participants, excluding dressage (an oddity of the data is that every horse-rider pair was treated as the sole participant in a unique event described by the horse's name). Olympians participate in one or more events (fairly specific competitions, like a 100m race), which are nested in sports (broader categories such as "Swimming" or "Athletics").
Athletics is by far the largest sport category (around 20 percent of athletes), so when it was included it dominated the predictions. To get more accurate classifications, I excluded Athletics participants from the sport classification task. This left 6,956 participants in 27 sports, split into a training set of size 3,520 and a test set of size 3,436. The 1,900 Athletics participants were classified into 48 different events, and also split into training (907 observations) and test sets (993 observations). For athletes participating in more than one event, only their first event was used.
What does an initial look at the data tell us? The features of athletes in some sports (Basketball, Rowing, Weightlifting, and Wrestling) and events (100m hurdles, Hammer throw, High jump, and Javelin) exhibit strong clustering patters. This makes it relatively easy to guess a participant's sport or event based on her features. In other sports (Archery, Swimming, Handball, Triathlon) and events (100m race, 400m hurdles, 400m race, and Marathon) there are many overlapping clusters making classification more difficult.
The next post, scheduled for Wednesday, will describe the machine learning methods I applied to this problem. The results will be presented on Friday.