# Classifying Olympic Athletes by Sport and Event (Part 2)

This is the second post in a three-part series. The first post, giving some background and describing the data, is here. In that post I pointed out David Epstein’s claim that he could identify an Olympian’s event knowing only her height and weight. The sheer number of Olympians–about 10,000–makes me skeptical, but I decided to see whether machine learning could accurately produce the predictions Mr. Epstein claims he could.

To do this, I tried four different machine learning methods. These are all well-documented methods implemented in existing R packages. Code and data are here (for sports) and here (for events).

The first two methods, conditional inference trees (using the party package) and evolutionary trees (using evtree), are both decision tree-based approaches. That means that they sequentially split the data based on binary decisions. If the data falls on one side of the split (say, height above 1.8 meters) you continue down one fork of the tree, and if not you go down the other fork. The difference between these two methods is how the tree is formed: the first recursively partitions the data based on conditional probability, while the second method (as the name suggests) uses an evolutionary algorithm. To get a feel for how this actually divides the data, see the figure below and this post.
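To make the "sequence of binary decisions" concrete, here is a minimal Python sketch of how a fitted tree routes one athlete to a prediction. It is not output from party or evtree: the `Split` class, the thresholds, and the sport labels are all invented for illustration (only the 1.8 meter example comes from the text above).

```python
# Toy decision tree: each node is either a leaf label (a string) or a binary split.
# Thresholds and labels are hypothetical, not taken from any fitted model.
from dataclasses import dataclass
from typing import Union


@dataclass
class Split:
    feature: str       # which measurement to test, e.g. "height_m"
    threshold: float   # go right if value > threshold, otherwise go left
    left: Union["Split", str]
    right: Union["Split", str]


def predict(node, athlete):
    """Walk down the tree until a leaf (a plain string) is reached."""
    while isinstance(node, Split):
        if athlete[node.feature] > node.threshold:
            node = node.right
        else:
            node = node.left
    return node


toy_tree = Split(
    "height_m", 1.8,
    left=Split("weight_kg", 60.0, left="Gymnastics", right="Wrestling"),
    right=Split("weight_kg", 95.0, left="Basketball", right="Rowing"),
)

tall_athlete = {"height_m": 1.95, "weight_kg": 90.0}
print(predict(toy_tree, tall_athlete))
```

The two tree methods in this post differ in how they choose the splits, not in how a finished tree is read, so this walk-through applies to both.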

If a single tree is good, a whole forest must be better–or at least that’s the thinking behind random forests, the third method I used. This method generates a large number of trees (500 in this case), each of which has access to only some of the features in the data. Once we have a whole forest of trees, we combine their predictions (usually through a voting process). The combination looks a little bit like the figure below, and a good explanation is here.
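The voting step can be sketched in a few lines of Python. This is only an illustration of majority voting, not the internals of any R package; the function name and the five example predictions are made up.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most trees.

    Ties go to whichever tied label appeared first in the list.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Hypothetical predictions from five trees for a single athlete.
tree_predictions = ["Rowing", "Basketball", "Rowing", "Swimming", "Rowing"]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```

With 500 trees the idea is the same: individual trees can be wrong in different ways, and the vote tends to average those errors out.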

The fourth and final method used–artificial neural networks–is a bit harder to visualize. Neural networks are sort of a black box, making them difficult to interpret and explain. At a coarse level they are intended to work like neurons in the brain: take some input, and produce output based on whether the input crosses a certain threshold. The neural networks I used have a single hidden layer with 30 (for sports classification) or 50 hidden nodes (for event classification). To get a better feel for how neural networks work, see this three-part series.
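For readers who prefer code to metaphor, below is a rough Python sketch of a forward pass through a network with one hidden layer. The layer sizes match the sports task (four features, 30 hidden nodes, 27 sports), but everything else is an assumption: the sigmoid and softmax functions, the random untrained weights, and the example input are mine, not necessarily what the R package uses.

```python
import math
import random


def sigmoid(x):
    # Smooth version of a threshold: near 0 for very negative x, near 1 for very positive x.
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Convert raw scores to probabilities that sum to 1 (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    # One row of weights per output unit, plus one bias per unit. Untrained, random values.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def layer_output(weights, biases, inputs):
    # Weighted sum of the inputs for each unit in the layer.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, hidden, output):
    hidden_activations = [sigmoid(z) for z in layer_output(*hidden, inputs)]
    return softmax(layer_output(*output, hidden_activations))


rng = random.Random(0)
hidden = init_layer(4, 30, rng)    # 4 features -> 30 hidden nodes
output = init_layer(30, 27, rng)   # 30 hidden nodes -> 27 sports

# Hypothetical athlete: standardized height, weight, and age, plus sex coded 0/1.
athlete = [0.3, -0.1, 0.5, 1.0]
probs = forward(athlete, hidden, output)
print(len(probs), round(sum(probs), 6))
```

Training consists of adjusting the weights so that the highest probability lands on the correct sport more often; that part is omitted here.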

That’s a very quick overview of the four machine learning methods that I applied to classifying Olympians by sport and event. The data and R code are available at the link above. In the next post, scheduled for Friday, I’ll share the results.

# Classifying Olympic Athletes by Sport and Event (Part 1)

Note: This post is the first in a three-part series. It describes the motivation for this project and the data used. When parts two and three are posted I will link to them here.

Can you predict which sport or event an Olympian competes in based solely on her height, weight, age and sex? If so, that would suggest that physical features strongly drive athletes’ relative abilities across sports, and that they pick sports that best leverage their physical predisposition. If not, we might infer that athleticism is a latent trait (like “grit”) that can be applied to the sport of one’s choice.

David Epstein argues that sporting success is largely based on heredity in his book, The Sports Gene. To support his argument, he describes how elite athletes’ physical features have become more specialized to their sport over time (think Michael Phelps). At a basic level Epstein is correct: males and females differ at both a genetic level and in their physical features, generally speaking.

However, Epstein advanced a stronger claim in an interview (at 29:46) with Russ Roberts:

Roberts: [You argue that] if you simply had the height and weight of an Olympic roster, you could do a pretty good job of guessing what their events are. Is that correct?

Epstein: That’s definitely correct. I don’t think you would get every person accurately, but… I think you would get the vast majority of them correctly. And frankly, you could definitely do it easily if you had them charted on a height-and-weight graph, and I think you could do it for most positions in something like football as well.

I chose to assess Epstein’s claim in a project for a machine learning course at Duke this semester. The data was collected by The Guardian, and includes all participants for the 2012 London Summer Olympics. There was complete data on age, sex, height, and weight for 8,856 participants, excluding dressage (an oddity of the data is that every horse-rider pair was treated as the sole participant in a unique event described by the horse’s name). Olympians participate in one or more events (fairly specific competitions, like a 100m race), which are nested in sports (broader categories such as “Swimming” or “Athletics”).

Athletics is by far the largest sport category (around 20 percent of athletes), so when it was included it dominated the predictions. To get more accurate classifications, I excluded Athletics participants from the sport classification task. This left 6,956 participants in 27 sports, split into a training set of size 3,520 and a test set of size 3,436. The 1,900 Athletics participants were classified into 48 different events, and also split into training (907 observations) and test sets (993 observations). For athletes participating in more than one event, only their first event was used.

What does an initial look at the data tell us? The features of athletes in some sports (Basketball, Rowing, Weightlifting, and Wrestling) and events (100m hurdles, Hammer throw, High jump, and Javelin) exhibit strong clustering patterns. This makes it relatively easy to guess a participant’s sport or event based on her features. In other sports (Archery, Swimming, Handball, Triathlon) and events (100m race, 400m hurdles, 400m race, and Marathon) there are many overlapping clusters, making classification more difficult.

Well-defined (left) and poorly-defined clusters of height and weight by sport.

Well-defined (left) and poorly-defined clusters of height and weight by event.

The next post, scheduled for Wednesday, will describe the machine learning methods I applied to this problem. The results will be presented on Friday.

# What Really Happened to Nigeria’s Economy?

You may have heard the news that the size of Nigeria’s economy now stands at nearly \$500 billion. Taken at face value (as many commenters have seemed all too happy to do) this means that the West African state “overtook” South Africa’s economy, which was roughly \$384 billion in 2012. Nigeria’s reported GDP for that year was \$262 billion, meaning it nearly doubled in a year.

How did this “growth” happen? As Bloomberg reported:

On paper, the size of the economy expanded by more than three-quarters to an estimated 80 trillion naira (\$488 billion) for 2013, Yemi Kale, head of the National Bureau of Statistics, said at a news conference yesterday to release the data in the capital, Abuja….

The NBS recalculated the value of GDP based on production patterns in 2010, increasing the number of industries it measures to 46 from 33 and giving greater weighting to sectors such as telecommunications and financial services.

The actual change appears to be due almost entirely to Nigeria including figures in GDP calculation that had been excluded previously. There is nothing wrong with this, per se, but it makes comparisons completely unrealistic. This would be like measuring your height in bare feet for years, then doing it while wearing platform shoes. Your reported height would look quite different, without any real growth taking place. Similar complications arise when comparing Nigeria’s new figures to other countries’, when the others have not changed their methodology.

Nigeria’s recalculation adds another layer of complexity to the problems plaguing African development statistics. Lack of transparency (not to mention accuracy) in reporting economic activity makes decisions about foreign aid and favorable loans more difficult. For more information on these problems, see this post discussing Morten Jerven’s book Poor Numbers. If you would like to know more about GDP and other economic summaries, and how they shape our world, I would recommend Macroeconomic Patterns and Stories (somewhat technical), The Leading Indicators, and GDP: A Brief but Affectionate History.

# “The Impact of Leadership Removal on Mexican Drug Trafficking Organizations”

That’s the title of a new article, now online at the Journal of Quantitative Criminology. Thanks to fellow grad students Cassy Dorff and Shahryar Minhas for their feedback. Thanks also to mentors at the University of Houston (Jim Granato, Ryan Kennedy) and Duke University (Michael D. Ward, Scott de Marchi, Guillermo Trejo) for thoughtful comments. The anonymous reviewers at JQC and elsewhere were also a big help.

Here is the abstract:

### Objectives

Has the Mexican government’s policy of removing drug-trafficking organization (DTO) leaders reduced or increased violence? In the first 4 years of the Calderón administration, over 34,000 drug-related murders were committed. In response, the Mexican government captured or killed 25 DTO leaders. This study analyzes changes in violence (drug-related murders) that followed those leadership removals.

### Methods

The analysis consists of cross-sectional time-series negative binomial modeling of 49 months of murder counts in 32 Mexican states (including the federal district).

### Results

Leadership removals are generally followed by increases in drug-related murders. A DTO’s home state experiences more subsequent violence than the state where the leader was removed. Killing leaders is associated with more violence than capturing them. However, removing leaders for whom a \$30m peso bounty was offered is associated with a smaller increase than other removals.

### Conclusion

DTO leadership removals in Mexico were associated with an estimated 415 additional deaths during the first 4 years of the Calderón administration. Reforming Mexican law enforcement and improving career prospects for young men are more promising counter-narcotics strategies. Further research is needed to analyze how the rank of leaders mediates the effect of their removal.

I didn’t shell out \$3,000 for open access, so the article is behind a paywall. If you’d like a draft of the manuscript just email me.

# Mexico Update Following Joaquin Guzmán’s Capture

As you probably know by now, the Sinaloa cartel’s leader Joaquin Guzmán was captured in Mexico last Saturday. How will violence in Mexico shift following Guzmán’s removal?

(Alfredo Estrella/AFP/Getty Images)

I take up this question in an article forthcoming in the Journal of Quantitative Criminology. According to that research (which used negative binomial modeling on a cross-sectional time series of Mexican states from 2006 to 2010), DTO leadership removals in Mexico are generally followed by increased violence. However, capturing leaders is associated with less violence than killing them. The removal of leaders for whom a 30 million peso bounty (the highest in my dataset, which generally identified high-level leaders) had been offered is also associated with less violence. The reward for Guzmán’s capture was higher than that for any other contemporary DTO leader: 87 million pesos. Given that Guzmán was a top-level leader and was arrested rather than killed, I would not expect a significant uptick in violence (in the next 6 months) due to his removal. That outcome would be in line with President Peña Nieto’s goal of reducing DTO violence.

My paper was in progress for a while, so the data is a few years old. Fortunately Brian Phillips has also taken up this question using additional data and similar methods, and his results largely corroborate mine:

Many governments kill or capture leaders of violent groups, but research on consequences of this strategy shows mixed results. Additionally, most studies have focused on political groups such as terrorists, ignoring criminal organizations – even though they can represent serious threats to security. This paper presents an argument for how criminal groups differ from political groups, and uses the framework to explain how decapitation should affect criminal groups in particular. Decapitation should weaken organizations, producing a short-term decrease in violence in the target’s territory. However, as groups fragment and newer groups emerge to address market demands, violence is likely to increase in the longer term. Hypotheses are tested with original data on Mexican drug-trafficking organizations (DTOs), 2006-2012, and results generally support the argument. The kingpin strategy is associated with a reduction of violence in the short term, but an increase in violence in the longer term. The reduction in violence is only associated with leaders arrested, not those killed.

A draft of the full paper is here.

# Visualizing the Indian Buffet Process with Shiny

(This is a somewhat more technical post than usual. If you just want the gist, skip to the visualization.)

N customers enter an Indian buffet restaurant, one after another. It has a seemingly endless array of dishes. The first customer fills her plate with a Poisson(α) number of dishes. Each successive customer i tastes the previously sampled dishes in proportion to their popularity (the number of previous customers who have sampled the kth dish, m_k, divided by i). The ith customer then samples a Poisson(α/i) number of new dishes.

That’s the basic idea behind the Indian Buffet Process (IBP). On Monday Eli Bingham and I gave a presentation on the IBP in our machine learning seminar at Duke, taught by Katherine Heller. The IBP is used in Bayesian non-parametrics to put a prior on (exchangeability classes of) binary matrices. The matrices usually represent the presence of features (“dishes” above, or the columns of the matrix) in objects (“customers,” or the rows of the matrix). The culinary metaphor is used by analogy to the Chinese Restaurant Process.
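The buffet story translates almost line for line into a simulation. The Python sketch below is not the code behind the Shiny app; it is a small standalone version, assuming the standard formulation in which customer i takes existing dish k with probability m_k / i and then adds Poisson(α/i) new dishes. The function names and the hand-rolled Poisson sampler are my own choices to keep it free of dependencies.

```python
import math
import random


def sample_poisson(rate, rng):
    # Knuth's multiplication method; adequate for the small rates used here.
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(n_customers, alpha, rng):
    """Return one binary matrix (rows = customers, columns = dishes) drawn from the IBP."""
    dish_counts = []  # dish_counts[k] = m_k, the number of customers so far who took dish k
    rows = []         # rows[i] = set of dish indices chosen by customer i + 1
    for i in range(1, n_customers + 1):
        chosen = set()
        # Existing dishes: take dish k with probability m_k / i.
        for k, m_k in enumerate(dish_counts):
            if rng.random() < m_k / i:
                chosen.add(k)
        # New dishes: a Poisson(alpha / i) number of them, appended as new columns.
        n_new = sample_poisson(alpha / i, rng)
        first_new = len(dish_counts)
        chosen.update(range(first_new, first_new + n_new))
        dish_counts.extend([0] * n_new)
        for k in chosen:
            dish_counts[k] += 1
        rows.append(chosen)
    n_dishes = len(dish_counts)
    return [[1 if k in row else 0 for k in range(n_dishes)] for row in rows]


matrix = sample_ibp(n_customers=10, alpha=10.0, rng=random.Random(1))
for row in matrix:
    print("".join("#" if cell else "." for cell in row))
```

Larger α produces more dishes per customer and a wider matrix; larger N adds rows, while new columns arrive more and more slowly because the rate α/i shrinks.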

Although the visualizations in the main paper summarizing the IBP are good, I thought it would be helpful to have an interactive visualization where you could change α and N to see what a random matrix with those parameters looks like. For this I used Shiny, although it would also be fun to do in d3.

One realization of the IBP, with α=10.

In the example above, the first customer (top row) sampled seven dishes. The second customer sampled four of those seven dishes, and then four more dishes that the first customer did not try. The process continues for all 10 customers. (Note that this matrix is not sorted into its left-ordered form. It also sometimes gives an error if α << N, but I wanted users to be able to choose arbitrary values of N so I have not changed this yet.) You can play with the visualization yourself here.

Interactive online visualizations like this can be a helpful teaching tool, and the process of making them can also improve your own understanding of the process. If you would like to make another visualization of the IBP (or another machine learning tool that lends itself to graphical representation) I would be happy to share it here. I plan to add the Chinese restaurant process and a Dirichlet process mixture of Gaussians soon. You can find more about creating Shiny apps here.

# Constitutional Forks Revisited

Around this time last year, we discussed the idea of a constitutional “fork” that occurred with the founding of the Confederate States of America. That post briefly explains how forks work in open source software and how the Confederates used the US Constitution as the basis for their own, with deliberate and meaningful differences. Putting the two documents on Github allowed us to compare their differences visually and confirm our suspicions that many of them were related to issues of states’ rights and slavery.

Caleb McDaniel, a historian at Rice who undoubtedly has a much deeper and more thorough knowledge of the period, conducted a similar exercise and also posted his results on Github. He was faced with similar decisions of where to obtain the source text and which differences to retain as meaningful (for example, he left in section numbers where I did not). My method identifies 130 additions and 119 deletions when transitioning between the USA and CSA constitutions, whereas the stats for Caleb’s repo show 382 additions and 370 deletions.
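Counts like these depend on how the comparison is done. As a rough illustration (not the method used in either repository, which rely on Git's own diff), here is a Python sketch that counts added and deleted lines with the standard library's `difflib`. The two short "documents" are placeholders, not constitutional text.

```python
import difflib


def count_changes(old_lines, new_lines):
    """Count lines that a line-based diff marks as added or deleted."""
    additions = deletions = 0
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            additions += 1
        elif line.startswith("- "):
            deletions += 1
    return additions, deletions


# Placeholder text standing in for two versions of a document.
old = ["Section one", "Section two", "Section three"]
new = ["Section one", "Section 2", "Section three", "Section four"]

print(count_changes(old, new))
```

A changed line counts as one deletion plus one addition, so choices such as where line breaks fall or whether section numbers are kept can move the totals considerably.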

What should we draw from these projects? In Caleb’s words:

My decisions make this project an interpretive act. You are welcome to inspect the changes more closely by looking at the commit histories for the individual Constitution files, which show the initial text as I got it from Avalon as well as the changes that I made.

You can take a look at both projects and conduct a difference-in-differences exploration of your own. More generally, these projects show the need for tools to visualize textual analyses, as well as the power of technology to enhance understanding of historical and political acts. Caleb’s readme file has great resources for learning more about this topic including the conversation that led him to this project, a New York Times interactive feature on the topic, and more.

# The Economy That Is Stanford

Five of the six most-visited websites in the world are here, in ranked order: Facebook, Google, YouTube (which Google owns), Yahoo! and Wikipedia. (Number five is a Chinese-language site.) If corporations founded by Stanford alumni were to form an independent nation, it would be the tenth largest economy in the world, with an annual revenue of \$2.7 trillion, as some professors at that university recently calculated. Another new report says: ‘If the internet was a country, its gross domestic product would eclipse all others but four within four years.’

That’s from this London Review of Books piece by Rebecca Solnit. The October, 2012, research report on which the claim is based is here, based on survey data. Solnit’s piece is interesting throughout, including a discussion of parallels and differences between the tech boom and the Gold Rush.

# Political Forecasting and the Use of Baseline Rates

As Joe Blitzstein likes to say, “Thinking conditionally is a condition for thinking.” Humans are not naturally good at this skill. Consider the following example: Kelly is interested in books and keeping things organized. She loves telling stories and attending book clubs. Is it more likely that Kelly is a bestselling novelist or an accountant?

Many of the “facts” about Kelly in that story might lead you to answer that she is a novelist. Only one–her sense of organization–might have pointed you toward an accountant. But think about the overall probability of each career. Very few bookworms become successful novelists, and there are many more accountants than (successful) authors in the modern workforce. Conditioning on the baseline rate helps make a more accurate decision.

I make a similar point–this time applied to political forecasting–in a recent post for the blog of Mike Ward’s lab (of which I am a member):

One piece of advice that Good Judgment forecasters are often reminded of is to use the baseline rate of an event as a starting point for their forecast. For example, insurgencies are a very rare event on the whole. For the period January, 2001 to August, 2013, insurgencies occurred in less than 10 percent of country-months in the ICEWS data set.

From this baseline, we can then incorporate information about the specific countries at hand and their recent history… Mozambique has not experienced an insurgency for the entire period of the ICEWS dataset. On the other hand, Chad had an insurgency that ended in December, 2003, and another that extended from November, 2005, to April, 2010. For the duration of the ICEWS data set, Chad has experienced an insurgency 59 percent of the time. This suggests that our predicted probability of insurgency in Chad should be higher than for Mozambique.
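The 59 percent figure for Chad can be checked with a few lines of arithmetic. The Python sketch below assumes that months are counted inclusively and that the first insurgency was already under way when the data begin in January 2001; the quoted passage gives only its end date, so that start is my assumption.

```python
def months_inclusive(start, end):
    """Number of calendar months from start to end, counting both endpoints."""
    (start_year, start_month), (end_year, end_month) = start, end
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


# Full span of the ICEWS data described in the post: January 2001 to August 2013.
period = months_inclusive((2001, 1), (2013, 8))

# Chad's two insurgency spells (first start date assumed, see above).
spells = [
    ((2001, 1), (2003, 12)),
    ((2005, 11), (2010, 4)),
]
insurgency_months = sum(months_inclusive(start, end) for start, end in spells)

share = insurgency_months / period
print(period, insurgency_months, round(100 * share))
```

Under those assumptions the spells cover 90 of 152 months, which rounds to the 59 percent quoted above. The same calculation for Mozambique gives zero, which is why the two baselines differ so sharply.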

I started writing that post before rebels in Mozambique broke their treaty with the government. Maybe I spoke too soon, but the larger point is that baselines are the starting point–not the final product–of any successful forecast.

Having more data is useful, as long as it contributes more signal than noise. That’s what ICEWS aims to do, and I consider it a useful addition to the toolbox of forecasters participating in the Good Judgment Project. For more on this collaboration, as well as a map of insurgency rates around the globe as measured by ICEWS, see the aforementioned post here.

# Visualizing Political Unrest in Egypt, Syria, and Turkey

The lab of Michael D. Ward et al. now has a blog. The inaugural post describes some of the lab’s ongoing projects that may come up in future entries, including modeling of protests, insurgencies, and rebellions, event prediction (such as IED explosions), and machine learning techniques.

The second post compares two event data sets–GDELT and ICEWS–using recent political unrest in the Middle East as a focal point (more here):

We looked at protest events in Egypt and Turkey in 2011 and 2012 for both data sets, and we also looked at fighting in Syria over the same period…. What did we learn from these, limited comparisons?  First, we found out first hand what the GDELT community has been saying: the GDELT data are in BETA and currently have a lot of false positives. This is not optimal for a decision making aid such as ICEWS, in which drill-down to the specific events resulting in new predictions is a requirement. Second, no one has a good ground truth for event data — though we have some ideas on this and are working on a study to implement them. Third, geolocation is a boon. GDELT seems especially good a this, even with a lot of false positives.

The visualization, which I worked on as part of the lab, can be found here. It relies on CartoDB to serve data from GDELT and ICEWS, with some preprocessing done using MySQL and R. The front-end is JavaScript using a combination of d3 for timelines and Torque for maps.

GDELT (green) and ICEWS (blue) records of protests in Egypt and Turkey and conflict in Syria

If you have questions about the visualizations or the technology behind them, feel free to mention them here or on the lab blog.