Now in Print: “The Impact of Leadership Removal on Mexican Drug Trafficking Organizations”

My Journal of Quantitative Criminology article “The Impact of Leadership Removal on Mexican Drug Trafficking Organizations” is now in print. For the abstract and other discussion of the research, see here, as well as the posts tagged “Mexico,” “drug trafficking,” and “leadership removal.”

Here is a timeline of the research and publication process:

  • Read an article in the Economist about DTO leadership removal, December 2010
  • Preliminary research for a graduate seminar in time series analysis at the University of Houston, Spring 2011
  • Draft of paper incorporating other research on organized crime and political violence for a seminar at Duke University, Fall 2011
  • Revised manuscript rejected from a security studies journal after R&R, Spring 2012
  • Revised manuscript rejected from a political violence journal after R&R, Late summer 2012
  • R&R from JQC, Summer 2013
  • Accepted for publication in JQC, December 2013
  • Published online, March 2014
  • Published in print, December 2014

All in all, a four-year project, with no significant changes to the manuscript in the roughly 18 months prior to print publication. The paper absolutely improved thanks to quality feedback from reviewers, but I think you will agree that this is a very long feedback cycle.

Academia to Industry

Last week, Brian Keegan had a great post on moving from doctoral studies to industrial data science. If you have not yet read it, go read the whole thing. In this post I will share a couple of my favorite parts of the post, as well as one area where I strongly disagreed with Brian.

The first key point of the post is to obtain relevant, marketable skills while you are in grad school. There’s just no excuse not to, regardless of your field of study–taking classes and working with scholars in other departments is almost always allowed and frequently encouraged. As Brian puts it:

[I]f you spend 4+ years in graduate school without ever taking classes that demand general programming and/or data analysis skills, I unapologetically believe that your very real illiteracy has held you back from your potential as a scholar and citizen.

Another great nugget in the post is in the context of recruiters, but it is also very descriptive of a prevailing attitude in academia:

This [realizing recruiters’ self-interested motivations] is often hard for academics who have come up through a system that demands deference to others’ agendas under the assumption they have your interests at heart as future advocates.

The final point from the post that I want to discuss may be very attractive and comforting to graduate students doing industry interviews for the first time:

After 4+ years in a PhD program, you’ve earned the privilege to be treated better than the humiliation exercises 20-year old computer science majors are subjected to for software engineering internships.

My response to this is, “no, you haven’t.” This is for exactly the reasons mentioned above–that many graduate students can go through an entire curriculum without being able to code up FizzBuzz. A coding interview is standard for junior and midlevel engineers, even if they have a PhD. Frankly, there are a lot of people trying to pass themselves off as data scientists who can’t code their way out of a paper bag, and a coding interview is a necessary screen. Think of it as a relatively low threshold that greatly enhances the signal-to-noise ratio for the interviewer. If you’re uncomfortable coding in front of another person, spend a few hours pairing with a friend and getting their feedback on your code. Interviewers know that coding on a whiteboard or in a Google Doc is not the most natural environment, and should be able to calibrate for this.
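For anyone who has not run into it, FizzBuzz is the canonical screening exercise: print the numbers 1 through 100, but print "Fizz" for multiples of 3, "Buzz" for multiples of 5, and "FizzBuzz" for multiples of both. A minimal version in R (any language would do) looks like this:

```r
# FizzBuzz: print 1 to 100, replacing multiples of 3 with "Fizz",
# multiples of 5 with "Buzz", and multiples of both with "FizzBuzz"
for (i in 1:100) {
  if (i %% 15 == 0) {
    cat("FizzBuzz\n")
  } else if (i %% 3 == 0) {
    cat("Fizz\n")
  } else if (i %% 5 == 0) {
    cat("Buzz\n")
  } else {
    cat(i, "\n")
  }
}
```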

With this one caveat, I heartily recommend the remainder of the original post. This is an interesting topic, and you can expect to hear more about it here in the future.

Tirole on Open Source

Jean Tirole is the latest recipient of the Nobel prize in economics, as was announced Monday. For more background on his work, see NPR and the New Yorker. My favorite portion of Tirole’s work (and, admittedly, pretty much the only part I’ve read) is his work on open source software communities. Much of this is joint work with Josh Lerner. Below I share a few selections from his work that indicate the general theme.

There are two main economic puzzles to open source software. First, why would highly skilled workers who earn a substantial hourly wage contribute their time to developing a product they won’t directly sell (and how do they convince their employers, in some cases, to support this)? Second, given the scale of these projects, how do they self-govern to set priorities and direct effort?

The answer to the first question is a combination of personal reputation and the ability to develop complementary software (Lerner and Tirole, 2002, pp. 215-217). Most software work is “closed source,” meaning others can see the finished product but not the underlying code. For software developers, having your code out in the open gives others (especially potential collaborators or employers) the chance to assess your abilities. This is important for career mobility. Open source software is also a complement to personal or professional projects. When there are components that are common across many projects, such as an operating system (Linux) or web framework (Rails), it makes sense for many programmers to contribute their effort to build a better mousetrap. This shared component can then improve everyone’s future projects by saving them time or effort. The collaboration of many developers also helps to identify bugs that may not have been caught by any single individual. Some of Tirole’s earlier work on collective reputations is closely related, as there appears to be an “alumni effect” for developers who participated in successful projects.

Tirole and Lerner’s answer to the second question revolves around leadership. Leaders are often the founders of, or early participants in, an open source project. Their skills and early membership status instill trust. As the authors put it, other programmers “must believe that the leader’s objectives are sufficiently congruent with theirs and not polluted by ego-driven, commercial, or political biases. In the end, the leader’s recommendations are only meant to convey her information to the community of participants” (Lerner and Tirole, 2002, p. 222). This relates to some of Tirole’s other work, with Roland Benabou, on informal laws and social norms.

Again, this is only a small portion of Tirole’s work, but I find it fascinating. There’s more on open source governance in the archives. This post on reputation in hacker culture or this one on the Ruby community are good places to start.

Epstein on Athletes

As a follow-up to the most recent series of posts, you may enjoy this TED talk by David Epstein. Epstein is the author of The Sports Gene and offered the claim that kicked off those earlier posts–that he could accurately guess an Olympian’s sport knowing only her height and weight.

The talk offers some additional context for Epstein’s claim. Specifically, Epstein describes how the average heights and weights of athletes in a set of 24 sports have grown more different over time:

In the early half of the 20th century, physical education instructors and coaches had the idea that the average body type was the best for all athletic endeavors: medium height, medium weight, no matter the sport. And this showed in athletes’ bodies. In the 1920s, the average elite high-jumper and average elite shot-putter were the same exact size. But as that idea started to fade away, as sports scientists and coaches realized that rather than the average body type, you want highly specialized bodies that fit into certain athletic niches, a form of artificial selection took place, a self-sorting for bodies that fit certain sports, and athletes’ bodies became more different from one another. Today, rather than the same size as the average elite high jumper, the average elite shot-putter is two and a half inches taller and 130 pounds heavier. And this happened throughout the sports world.

Here’s the chart used to support that point, with data points from the early twentieth century in yellow and more recent data points in blue:

Average height and mass for athletes in 24 sports in the early twentieth century (yellow) and today (blue)


This suggests that it has become easier over time to guess individuals’ sports based on physical characteristics, but as we saw, it is still difficult to do with a high degree of accuracy.

Another interesting change highlighted in the talk is the role of technology:

In 1936, Jesse Owens held the world record in the 100 meters. Had Jesse Owens been racing last year in the world championships of the 100 meters, when Jamaican sprinter Usain Bolt finished, Owens would have still had 14 feet to go…. [C]onsider that Usain Bolt started by propelling himself out of blocks down a specially fabricated carpet designed to allow him to travel as fast as humanly possible. Jesse Owens, on the other hand, ran on cinders, the ash from burnt wood, and that soft surface stole far more energy from his legs as he ran. Rather than blocks, Jesse Owens had a gardening trowel that he had to use to dig holes in the cinders to start from. Biomechanical analysis of the speed of Owens’ joints shows that had [he] been running on the same surface as Bolt, he wouldn’t have been 14 feet behind, he would have been within one stride.

The third change Epstein discusses is more dubious: a “changing mindset” among athletes giving them a “can do” attitude. In particular he mentions Roger Bannister’s four-minute mile as a major psychological breakthrough in sporting. As this interview makes clear, however, Bannister attributes the lack of progress on the mile record between 1945 and 1954 to the destruction, rationing, and general disruption of WWII. It’s possible that a four-minute mile was run as early as 1770. I wonder what Epstein’s claims would look like on that time scale?

Classifying Olympic Athletes by Sport and Event (Part 3)

This is the last post in a three-part series. Part one, describing the data, is here. Part two gives an overview of the machine learning methods and can be found here. This post presents the results.

To present the results I will use classification matrices, transformed into heatmaps. The rows indicate Olympians’ actual sports, and the columns are their predicted sports. A dark value on the diagonal indicates accurate predictions (the athlete is predicted to be in their actual sport) while light values on the diagonal suggest that Olympians in a certain sport are misclassified by the algorithms used. In each case results for the training set are in the left column and results for the test set are on the right. For a higher resolution version, see this pdf.
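As a rough illustration of how such a heatmap can be produced (a generic sketch, not the code behind the figures below; actual and predicted are assumed to be factor vectors of true and predicted sports):

```r
library(ggplot2)

# Cross-tabulate actual vs. predicted sports, then plot the counts as a heatmap
confusion <- as.data.frame(table(Actual = actual, Predicted = predicted))

ggplot(confusion, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```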

Classifying Athletes by Sport

Classification matrices by sport: training set results (left) and test set results (right).

For most rows, swimming is the most common predicted sport. That’s partially because there are so many swimmers in the data and partially because swimmers have a fairly generic body type as measured by height and weight (see the first post). With more features, such as arm length and torso length, we could better distinguish between swimmers and non-swimmers.

Three out of the four methods perform similarly. The real oddball here is random forest: it classifies the training data very well, but does about as well on the test data as the other methods. This suggests that random forest is overfitting the data, and won’t give us great predictions on new data.

Classifying Athletes by Event

Classification matrices by event: training set results (left) and test set results (right).

The results here are similar to the ones above: all four methods do about equally well on the test data, while random forest overfits the training data. The two squares in each figure correspond to men’s and women’s events. This is a good sanity check–at least our methods aren’t misclassifying men into women’s events or vice versa (recall that sex is one of the four features used for classification).

Accuracy

Visualizations are more helpful than looking at a large table of predicted probabilities, but what are the actual numbers? How accurate are the predictions from these methods? The table below presents accuracy for both tasks, for training and test sets.

Classification accuracy for the sport and event tasks, on the training and test sets.
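For reference, accuracy here is simply the share of athletes whose predicted label matches their actual one; with the assumed actual and predicted vectors from the sketch above it is a one-liner:

```r
# Proportion of Olympians assigned to their actual sport (or event)
accuracy <- mean(predicted == actual)
```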

The various methods classify Olympians into sports and events with about 25-30 percent accuracy. This isn’t great performance. Keep in mind that we only had four features to go on, though–with additional data about the participants we could probably do better.

After seeing these results I am deeply skeptical that David Epstein could classify Olympians by event using only their height and weight. Giving him the benefit of the doubt, he probably had in mind the kinds of sports and events that we saw were easy to classify: basketball, weightlifting, and high jump, for example. These are the types of competitions that The Sports Gene focuses on. As we have seen, though, there is a wide range of sporting events and a corresponding diversity of body types. Being naturally tall or strong doesn’t hurt, but it also doesn’t automatically qualify you for the Olympics. Training and hard work play an important role, and Olympic athletes exhibit a wide range of physical characteristics.

Classifying Olympic Athletes by Sport and Event (Part 2)

This is the second post in a three-part series. The first post, giving some background and describing the data, is here. In that post I pointed out David Epstein’s claim that he could identify an Olympian’s event knowing only her height and weight. The sheer number of Olympians–about 10,000–makes me skeptical, but I decided to see whether machine learning methods could produce the accurate predictions Mr. Epstein claims he could make.

To do this, I tried four different machine learning methods. These are all well-documented methods implemented in existing R packages. Code and data are here (for sports) and here (for events).

The first two methods, conditional inference trees (using the party package) and evolutionary trees (using evtree), are both decision tree-based approaches. That means they sequentially split the data based on binary decisions. If the data falls on one side of the split (say, height above 1.8 meters) you continue down one fork of the tree, and if not you go down the other fork. The difference between the two methods is how the tree is formed: the first recursively partitions the data using conditional inference (statistical significance) tests, while the second (as the name suggests) uses an evolutionary algorithm. To get a feel for how this actually divides the data, see the figure below and this post.
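As a minimal sketch of how these two models might be fit (assuming a data frame train with columns sport, height, weight, age, and sex, where sport is a factor, plus a held-out data frame test with the same columns; the names are illustrative, not necessarily those in the linked code):

```r
library(party)   # conditional inference trees
library(evtree)  # evolutionary trees

# Conditional inference tree: splits are chosen with significance tests
ct <- ctree(sport ~ height + weight + age + sex, data = train)

# Evolutionary tree: the tree structure is searched with an evolutionary algorithm
et <- evtree(sport ~ height + weight + age + sex, data = train)

# Predicted sports for the held-out athletes
ct_pred <- predict(ct, newdata = test)
et_pred <- predict(et, newdata = test)
```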

 

If a single tree is good, a whole forest must be better–or at least that’s the thinking behind random forests, the third method I used. This method generates a large number of trees (500 in this case), each of which has access to only some of the features in the data. Once we have a whole forest of trees, we combine their predictions (usually through a voting process). The combination looks a little bit like the figure below, and a good explanation is here.
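Continuing the same illustrative sketch (with the assumed train and test data frames from above):

```r
library(randomForest)

# Random forest: 500 trees; each split considers a random subset of the features,
# and the trees' votes are combined into a single prediction
rf <- randomForest(sport ~ height + weight + age + sex, data = train, ntree = 500)

rf_pred <- predict(rf, newdata = test)
```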

 

The fourth and final method used–artificial neural networks–is a bit harder to visualize. Neural networks are sort of a black box, making them difficult to interpret and explain. At a coarse level they are intended to work like neurons in the brain: take some input, and produce output based on whether the input crosses a certain threshold. The neural networks I used have a single hidden layer with 30 (for sports classification) or 50 hidden nodes (for event classification). To get a better feel for how neural networks work, see this three part series.
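A comparable sketch for the neural network, again using the assumed train and test data frames (size sets the number of hidden nodes; MaxNWts is raised so a 30-node network with many output classes stays within nnet's default limit on the number of weights):

```r
library(nnet)

# Single hidden layer with 30 nodes (the sport classification setting)
nn <- nnet(sport ~ height + weight + age + sex, data = train,
           size = 30, maxit = 500, MaxNWts = 2000)

nn_pred <- predict(nn, newdata = test, type = "class")
```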

That’s a very quick overview of the four machine learning methods that I applied to classifying Olympians by sport and event. The data and R code are available at the link above. In the next post, scheduled for Friday, I’ll share the results.

Classifying Olympic Athletes by Sport and Event (Part 1)

Note: This post is the first in a three-part series. It describes the motivation for this project and the data used. When parts two and three are posted I will link to them here.

Can you predict which sport or event an Olympian competes in based solely on her height, weight, age and sex? If so, that would suggest that physical features strongly drive athletes’ relative abilities across sports, and that they pick sports that best leverage their physical predisposition. If not, we might infer that athleticism is a latent trait (like “grit“) that can be applied to the sport of one’s choice.

David Epstein argues that sporting success is largely based on heredity in his book, The Sports Gene. To support his argument, he describes how elite athletes’ physical features have become more specialized to their sport over time (think Michael Phelps). At a basic level Epstein is correct: generally speaking, males and females differ both at a genetic level and in their physical features.

However, Epstein advanced a stronger claim in an interview (at 29:46) with Russ Roberts:

Roberts: [You argue that] if you simply had the height and weight of an Olympic roster, you could do a pretty good job of guessing what their events are. Is that correct?

Epstein: That’s definitely correct. I don’t think you would get every person accurately, but… I think you would get the vast majority of them correctly. And frankly, you could definitely do it easily if you had them charted on a height-and-weight graph, and I think you could do it for most positions in something like football as well.

I chose to assess Epstein’s claim in a project for a machine learning course at Duke this semester. The data was collected by The Guardian, and includes all participants for the 2012 London Summer Olympics. There was complete data on age, sex, height, and weight for 8,856 participants, excluding dressage (an oddity of the data is that every horse-rider pair was treated as the sole participant in a unique event described by the horse’s name). Olympians participate in one or more events (fairly specific competitions, like a 100m race), which are nested in sports (broader categories such as “Swimming” or “Athletics”).

Athletics is by far the largest sport category (around 20 percent of athletes), so when it was included it dominated the predictions. To get more accurate classifications, I excluded Athletics participants from the sport classification task. This left 6,956 participants in 27 sports, split into a training set of size 3,520 and a test set of size 3,436. The 1,900 Athletics participants were classified into 48 different events, and also split into training (907 observations) and test sets (993 observations). For athletes participating in more than one event, only their first event was used.

What does an initial look at the data tell us? The features of athletes in some sports (Basketball, Rowing, Weightlifting, and Wrestling) and events (100m hurdles, Hammer throw, High jump, and Javelin) exhibit strong clustering patterns. This makes it relatively easy to guess a participant’s sport or event based on her features. In other sports (Archery, Swimming, Handball, Triathlon) and events (100m race, 400m hurdles, 400m race, and Marathon) there are many overlapping clusters, making classification more difficult.

Well-defined (left) and poorly-defined clusters of height and weight by sport.

Well-defined (left) and poorly-defined clusters of height and weight by event.

The next post, scheduled for Wednesday, will describe the machine learning methods I applied to this problem. The results will be presented on Friday.

Two Unusual Papers on Monte Carlo Simulation

For Bayesian inference, Markov chain Monte Carlo (MCMC) methods were a huge breakthrough. These methods provide a principled way to simulate from a posterior probability distribution, and they are useful for approximating integrals that are analytically intractable. Usually MCMC methods are performed with computers, but I recently read two papers that apply Monte Carlo simulation in interesting ways.

The first is Markov Chain Monte Carlo with People. MCMC with people is somewhat similar to playing the game of telephone–there is input “data” (think of the starting word in the telephone game) that is transmitted across stages where it can be modified and then output at the end. In the paper the authors construct a task so that human learners approximately follow an MCMC acceptance rule. I have summarized the paper in slightly more detail here.
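For readers who have not seen an MCMC acceptance rule written out, here is a minimal Metropolis sketch in R. It is a generic illustration with a stand-in target density, not the authors' experimental task:

```r
# Minimal Metropolis sampler for an arbitrary (unnormalized) target density.
# The "MCMC with people" paper builds a choice task whose behavior
# approximates this accept/reject step.
target <- function(x) dnorm(x, mean = 0, sd = 1)  # stand-in target

metropolis <- function(n_iter, start = 0, proposal_sd = 1) {
  samples <- numeric(n_iter)
  current <- start
  for (i in seq_len(n_iter)) {
    proposal <- rnorm(1, mean = current, sd = proposal_sd)
    # Accept with probability min(1, target(proposal) / target(current))
    if (runif(1) < target(proposal) / target(current)) {
      current <- proposal
    }
    samples[i] <- current
  }
  samples
}

draws <- metropolis(10000)
```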

The second paper is even less conventional: the authors approximate the value of π using a “Mossberg 500 pump-action shotgun as the proposal distribution.” Their simulated value is 3.131, within 0.33% of the true value. As the authors state, “this represents the first attempt at estimating π using such method, thus opening up new perspectives towards computing mathematical constants using everyday tools.” Who said statistics has to be boring?
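The authors' proposal mechanism is, of course, the shotgun itself. The same constant is usually estimated on a computer with the classic dartboard approach; a minimal sketch (not the authors' procedure):

```r
# Classic Monte Carlo estimate of pi: draw points uniformly in the unit square
# and count the share that land inside the quarter circle of radius 1.
set.seed(42)
n <- 1e6
x <- runif(n)
y <- runif(n)
pi_hat <- 4 * mean(x^2 + y^2 <= 1)
pi_hat  # roughly 3.14
```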

 

What Really Happened to Nigeria’s Economy?

You may have heard the news that the size of Nigeria’s economy now stands at nearly $500 billion. Taken at face value (as many commenters have seemed all too happy to do), this means that the West African state “overtook” South Africa’s economy, which was roughly $384 billion in 2012. Nigeria’s reported GDP for that year was $262 billion, meaning it roughly doubled in a year.

How did this “growth” happen? As Bloomberg reported:

On paper, the size of the economy expanded by more than three-quarters to an estimated 80 trillion naira ($488 billion) for 2013, Yemi Kale, head of the National Bureau of Statistics, said at a news conference yesterday to release the data in the capital, Abuja….

The NBS recalculated the value of GDP based on production patterns in 2010, increasing the number of industries it measures to 46 from 33 and giving greater weighting to sectors such as telecommunications and financial services.

The actual change appears to be due almost entirely to Nigeria including figures in its GDP calculation that had been excluded previously. There is nothing wrong with this, per se, but it makes comparisons with earlier figures misleading. This would be like measuring your height in bare feet for years, then doing it while wearing platform shoes. Your reported height would look quite different, without any real growth taking place. Similar complications arise when comparing Nigeria’s new figures to other countries’, when the others have not changed their methodology.

Nigeria’s recalculation adds another layer of complexity to the problems plaguing African development statistics. Lack of transparency (not to mention accuracy) in reporting economic activity makes decisions about foreign aid and favorable loans more difficult. For more information on these problems, see this post discussing Morten Jerven’s book Poor Numbers. If you would like to know more about GDP and other economic summaries, and how they shape our world, I would recommend Macroeconomic Patterns and Stories (somewhat technical), The Leading Indicators, and GDP: A Brief but Affectionate History.

Schneier on Data and Power

Data and Power is the tentative title of a new book, forthcoming from Bruce Schneier. Here’s more from the post describing the topic of the book:

Corporations are collecting vast dossiers on our activities on- and off-line — initially to personalize marketing efforts, but increasingly to control their customer relationships. Governments are using surveillance, censorship, and propaganda — both to protect us from harm and to protect their own power. Distributed groups — socially motivated hackers, political dissidents, criminals, communities of interest — are using the Internet to both organize and effect change. And we as individuals are becoming both more powerful and less powerful. We can’t evade surveillance, but we can post videos of police atrocities online, bypassing censors and informing the world. How long we’ll still have those capabilities is unclear….

There’s a fundamental trade-off we need to make as society. Our data is enormously valuable in aggregate, yet it’s incredibly personal. The powerful will continue to demand aggregate data, yet we have to protect its intimate details. Balancing those two conflicting values is difficult, whether it’s medical data, location data, Internet search data, or telephone metadata. But balancing them is what society needs to do, and is almost certainly the fundamental issue of the Information Age.

There’s more at the link, including several other potential titles. The topic will likely interest many readers of this blog, and the book will presumably build on his ideas about inequality and online feudalism, discussed here.