Is being an Olympic champion determined by your genes? Longtime readers will remember the three-part series from 2014 exploring this question. That project was prompted by David Epstein’s claim, based on his book The Sports Gene, that given a roster of Olympic athletes and their weights and heights, he could predict their events with high accuracy. Using a variety of machine learning models, I found that they could achieve only about 30 percent accuracy, which made me dubious of Epstein’s claim.

That project was made possible because The Guardian published data on all 9,000-plus participants in the 2012 Olympic Games. With the 2016 Rio Olympics just around the corner, I recently began looking for similar data on this year’s participants but was unable to find it (if you know of such a data set, please let me know). The Guardian’s interest in 2012 was likely because the last Summer Games were held in London.

Instead, I found that the U.S. Olympic Committee made similar data available on its website. To put this into a useful format, I scraped the site for each athlete’s name, sport, height, weight, date of birth, bio link, and bio image. The code and data are both available on GitHub. Here’s what the data looks like for the 518 (of 554 total) athletes for whom complete data was available:

[Figure: Data for female members of Team USA 2016]

[Figure: Data for male members of Team USA 2016]

Based upon a visual inspection, I am skeptical that machine learning models could do better on this data set than they did in 2012. There are some apparent clusters by sport (basketball players and wrestlers, for example), but the individual sports do not appear to be linearly separable. Moreover, there is a large number of classes (30 different sports) and major class imbalance (only two synchronized swimmers but 124 track-and-field participants), which would make it difficult to ensure that there are examples of every sport in both the training and test sets.
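The class-imbalance concern can be made concrete with a small sketch. The Python below is an illustration, not the code used in the 2014 series: `stratified_split` is a hypothetical helper, and the label list is toy data built only from the two counts quoted above (124 track-and-field athletes, two synchronized swimmers). It splits each sport separately so that every sport with at least two athletes appears in both the training and test sets.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so each class with 2+ examples is in both.

    Hypothetical helper for illustration. A class with a single example
    cannot be split, so it is placed in the training set only.
    """
    rng = random.Random(seed)

    # Group row indices by class label (here, by sport).
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        if len(indices) < 2:
            train_idx.extend(indices)
            continue
        # Take at least one test example, but always leave one for training.
        n_test = max(1, round(len(indices) * test_fraction))
        n_test = min(n_test, len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: only these two counts come from the post; the other 28 sports
# are omitted here.
labels = ["track and field"] * 124 + ["synchronized swimming"] * 2
train_idx, test_idx = stratified_split(labels)
```

With only two synchronized swimmers, one lands in each set, so a model would see a single training example for that sport. A stratified split guarantees coverage but does nothing to fix how little there is to learn from.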

Instead of predicting which sport each athlete participates in, it may be more informative to look at a summary of the data. Here are the athletes representing the minimum and maximum values for each variable.

Team USA by the numbers:

  • The shortest member of Team USA is Simone Biles, a gymnast (4’8”)
  • DeMarcus Cousins and DeAndre Jordan (both basketball players) are tied for tallest at 6’11”
  • Discus thrower Mason Finley is the heaviest participant (348 lbs)
  • The lightest-weight American athlete is Jiaqi Zheng, a table-tennis player (94 lbs)
  • Equestrian rider Phillip Dutton is the oldest member of the delegation at nearly 53 years of age
  • Team USA 2016’s youngest member is table-tennis player Kanak Jha (just over 16 years old)
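A summary like the one above can be produced with a few lines of Python. This is a minimal sketch, not the code from the repository: the `extremes` helper and the field names `height_in` and `weight_lb` are assumptions, and the records contain only the values quoted in the list (heights converted to inches), with unknown values left as `None`. Ties, such as the two tallest players, are all returned.

```python
def extremes(records, field):
    """Return (names with the minimum value, names with the maximum value).

    Records whose value for `field` is missing (None) are ignored, and all
    tied athletes are returned in their original order.
    """
    known = [r for r in records if r.get(field) is not None]
    lowest = min(r[field] for r in known)
    highest = max(r[field] for r in known)
    return (
        [r["name"] for r in known if r[field] == lowest],
        [r["name"] for r in known if r[field] == highest],
    )


# Only values stated in the list above are filled in; the rest are unknown.
athletes = [
    {"name": "Simone Biles", "height_in": 56, "weight_lb": None},      # 4'8"
    {"name": "DeMarcus Cousins", "height_in": 83, "weight_lb": None},  # 6'11"
    {"name": "DeAndre Jordan", "height_in": 83, "weight_lb": None},    # 6'11"
    {"name": "Mason Finley", "height_in": None, "weight_lb": 348},
    {"name": "Jiaqi Zheng", "height_in": None, "weight_lb": 94},
]

shortest, tallest = extremes(athletes, "height_in")
lightest, heaviest = extremes(athletes, "weight_lb")
```

Run against the full data set, the same function would work for age as well, given a date-of-birth field converted to a number.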